Spatial-SSRL
[CVPR 2026] Official release of "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning"
Stars: 112
Spatial-SSRL is a lightweight self-supervised RL framework designed to enhance LVLM spatial understanding. It optimizes Qwen2.5-VL-7B model, exhibiting stronger spatial intelligence across seven benchmarks. It is highly scalable, cost-effective, lightweight, and naturally verifiable. The framework supports five tasks using ordinary RGB and RGB-D images, inviting users to contribute with effective pretext tasks to enhance LVLM capabilities.
README:
Yuhong Liu · Beichen Zhang · Yuhang Zang† · Yuhang Cao · Long Xing Xiaoyi Dong · Haodong Duan · Dahua Lin · Jiaqi Wang†
†Corresponding authors.
📖Paper | 🏠Homepage | 🤗Spatial-SSRL-7B Model | 🤗Spatial-SSRL-3B Model | 🤗Spatial-SSRL-Qwen3VL-4B Model | 🤗Spatial-SSRL-81k Dataset | 📰Daily Paper
- 🚀 [2026/02/25] We have released the 🤗Spatial-SSRL-3B Model, initialized from Qwen2.5-VL-3B-Instruct.
- 🚀 [2026/02/21] Our work has been accepted by CVPR 2026.
- 🚀 [2025/11/24] We have released the 🤗Spatial-SSRL-Qwen3VL-4B Model, initialized from Qwen3-VL-4B-Instruct.
- 🚀 [2025/11/03] Now you can try out Spatial-SSRL-7B on 🤗Spatial-SSRL Space.
- 🚀 [2025/11/03] We have released the Spatial-SSRL-7B Model, and Spatial-SSRL-81k Dataset.
- 🚀 [2025/11/02] We have released the Spatial-SSRL repository.
We are thrilled to introduce Spatial-SSRL, a novel self-supervised RL paradigm aimed at enhancing LVLM spatial understanding. By optimizing Qwen2.5-VL-7B with Spatial-SSRL, the model exhibits stronger spatial intelligence across seven spatial understanding benchmarks in both image and video settings.
Spatial-SSRL is a lightweight tool-free framework that is natually compatible with the RLVR training paradigm and easy to extend to a multitude of pretext tasks. Five tasks are currently formulated in the framework, requiring only ordinary RGB and RGB-D images. And we welcome you to join Spatial-SSRL with effective pretext tasks to further strengthen the capabilities of LVLMs!- 🔥 Highly Scalable: Spatial-SSRL uses ordinary raw RGB and RGB-D images instead of richly-annotated public datasets or manual labels for data curation, making it highly scalable.
- 🔥 Cost-effective: Avoiding the need for human labels or API calls for general LVLMs throughout the entire pipeline endows Spatial-SSRL with cost-effectiveness.
- 🔥 Lightweight: Prior approaches for spatial understanding heavily rely on annotation of external tools, incurring inherent errors in training data and additional cost. In constrast, Spatial-SSRL is completely tool-free and can easily be extended to more self-supervised tasks.
- 🔥 Naturally Verifiable: Intrinsic supervisory signals determined by pretext objectives are naturally verifiable, aligning Spatial-SSRL well with the RLVR paradigm.
We train Qwen2.5-VL-3B and Qwen2.5-VL-7B with our Spatial-SSRL paradigm and the experimental results across seven spatial understanding benchmarks are shown below.
To directly experience Spatial-SSRL-7B, you can try it out on 🤗Spatial-SSRL Space!
Here we provide a code snippet for you to start a simple trial of Spatial-SSRL-7B on your own device. You can download the model from 🤗Spatial-SSRL-7B Model before your trial!
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path = "internlm/Spatial-SSRL-7B" #You can change it to your own local path if deployed already
img_path = "examples/eg1.jpg"
question = "Consider the real-world 3D locations of the objects. Which object has a higher location? A. yellow bear kite B. building"
#We recommend using the format prompt to make the inference consistent with training
format_prompt = "\n You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}."
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": img_path,
},
{"type": "text", "text": question + format_prompt},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Model Response:", output_text)Here we provide a code snippet for you to start a simple trial of Spatial-SSRL-Qwen3VL-4B on your own device. You can download the model from 🤗Spatial-SSRL-Qwen3VL-4B Model before your trial!
from transformers import AutoProcessor, AutoModelForImageTextToText #transformers==4.57.1
from qwen_vl_utils import process_vision_info #0.0.14
import torch
model_path = "internlm/Spatial-SSRL-Qwen3VL-4B" #You can change it to your own local path if deployed already
#Change the path of the input image
img_path = "examples/eg_qwen3vl.jpg"
#Change your question here
question = "Question: Consider the real-world 3D locations and orientations of the objects. If I stand at the man's position facing where it is facing, is the menu on the left or right of me?\nOptions:\nA. on the left\nB. on the right\n"
question += "Please select the correct answer from the options above. \n"
#We recommend using the format prompt to make the inference consistent with training
format_prompt = "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}."
model = AutoModelForImageTextToText.from_pretrained(
model_path, torch_dtype=torch.float16, device_map='auto', attn_implementation='flash_attention_2'
)
processor = AutoProcessor.from_pretrained(model_path)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": img_path,
},
{"type": "text", "text": question + format_prompt},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Model Response:", output_text[0])Prepare your environment:
git clone https://github.com/InternLM/Spatial-SSRL.git
conda create -n spatialssrl python==3.10
conda activate spatialssrl
cd Spatial-SSRL/evaluation
pip install -r requirements.txt
# Recommended
pip install flash-attn --no-build-isolationStart your evaluation by referring to the tutorials in Eval.md
- [ ] Release the training code.
If you find this project useful, please kindly cite:
@article{liu2025spatial,
title={Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning},
author={Liu, Yuhong and Zhang, Beichen and Zang, Yuhang and Cao, Yuhang and Xing, Long and Dong, Xiaoyi and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
journal={arXiv preprint arXiv:2510.27606},
year={2025}
}
Usage and License Notices: The data and code are intended and licensed for research use only.
We extend our sincere gratitude to VLMEvalkit, the powerful toolkit to evaluate a vast range of LMMs!
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for Spatial-SSRL
Similar Open Source Tools
Spatial-SSRL
Spatial-SSRL is a lightweight self-supervised RL framework designed to enhance LVLM spatial understanding. It optimizes Qwen2.5-VL-7B model, exhibiting stronger spatial intelligence across seven benchmarks. It is highly scalable, cost-effective, lightweight, and naturally verifiable. The framework supports five tasks using ordinary RGB and RGB-D images, inviting users to contribute with effective pretext tasks to enhance LVLM capabilities.
LightAgent
LightAgent is a lightweight, open-source Agentic AI development framework with memory, tools, and a tree of thought. It supports multi-agent collaboration, autonomous learning, tool integration, complex task handling, and multi-model support. It also features a streaming API, tool generator, agent self-learning, adaptive tool mechanism, and more. LightAgent is designed for intelligent customer service, data analysis, automated tools, and educational assistance.
LightAgent
LightAgent is a lightweight, open-source active Agentic AI development framework with memory, tools, and a tree of thought. It supports multi-agent collaboration, autonomous learning, tool integration, complex goals, and multi-model support. It enables simpler self-learning agents, seamless integration with major chat frameworks, and quick tool generation. LightAgent also supports memory modules, tool integration, tree of thought planning, multi-agent collaboration, streaming API, agent self-learning, Langfuse log tracking, and agent assessment. It is compatible with various large models and offers features like intelligent customer service, data analysis, automated tools, and educational assistance.
MInference
MInference is a tool designed to accelerate pre-filling for long-context Language Models (LLMs) by leveraging dynamic sparse attention. It achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy. The tool supports various decoding LLMs, including LLaMA-style models and Phi models, and provides custom kernels for attention computation. MInference is useful for researchers and developers working with large-scale language models who aim to improve efficiency without compromising accuracy.
sdialog
SDialog is an MIT-licensed open-source toolkit for building, simulating, and evaluating LLM-based conversational agents end-to-end. It aims to bridge agent construction, user simulation, dialog generation, and evaluation in a single reproducible workflow, enabling the generation of reliable, controllable dialog systems or data at scale. The toolkit standardizes a Dialog schema, offers persona-driven multi-agent simulation with LLMs, provides composable orchestration for precise control over behavior and flow, includes built-in evaluation metrics, and offers mechanistic interpretability. It allows for easy creation of user-defined components and interoperability across various AI platforms.
MathVerse
MathVerse is an all-around visual math benchmark designed to evaluate the capabilities of Multi-modal Large Language Models (MLLMs) in visual math problem-solving. It collects high-quality math problems with diagrams to assess how well MLLMs can understand visual diagrams for mathematical reasoning. The benchmark includes 2,612 problems transformed into six versions each, contributing to 15K test samples. It also introduces a Chain-of-Thought (CoT) Evaluation strategy for fine-grained assessment of output answers.
FlashRank
FlashRank is an ultra-lite and super-fast Python library designed to add re-ranking capabilities to existing search and retrieval pipelines. It is based on state-of-the-art Language Models (LLMs) and cross-encoders, offering support for pairwise/pointwise rerankers and listwise LLM-based rerankers. The library boasts the tiniest reranking model in the world (~4MB) and runs on CPU without the need for Torch or Transformers. FlashRank is cost-conscious, with a focus on low cost per invocation and smaller package size for efficient serverless deployments. It supports various models like ms-marco-TinyBERT, ms-marco-MiniLM, rank-T5-flan, ms-marco-MultiBERT, and more, with plans for future model additions. The tool is ideal for enhancing search precision and speed in scenarios where lightweight models with competitive performance are preferred.
OpenMusic
OpenMusic is a repository providing an implementation of QA-MDT, a Quality-Aware Masked Diffusion Transformer for music generation. The code integrates state-of-the-art models and offers training strategies for music generation. The repository includes implementations of AudioLDM, PixArt-alpha, MDT, AudioMAE, and Open-Sora. Users can train or fine-tune the model using different strategies and datasets. The model is well-pretrained and can be used for music generation tasks. The repository also includes instructions for preparing datasets, training the model, and performing inference. Contact information is provided for any questions or suggestions regarding the project.
EasySteer
EasySteer is a unified framework built on vLLM for high-performance LLM steering. It offers fast, flexible, and easy-to-use steering capabilities with features like high performance, modular design, fine-grained control, pre-computed steering vectors, and an interactive demo. Users can interactively configure models, adjust steering parameters, and test interventions without writing code. The tool supports OpenAI-compatible APIs and provides modules for hidden states extraction, analysis-based steering, learning-based steering, and a frontend web interface for interactive steering and ReFT interventions.
qa-mdt
This repository provides an implementation of QA-MDT, integrating state-of-the-art models for music generation. It offers a Quality-Aware Masked Diffusion Transformer for enhanced music generation. The code is based on various repositories like AudioLDM, PixArt-alpha, MDT, AudioMAE, and Open-Sora. The implementation allows for training and fine-tuning the model with different strategies and datasets. The repository also includes instructions for preparing datasets in LMDB format and provides a script for creating a toy LMDB dataset. The model can be used for music generation tasks, with a focus on quality injection to enhance the musicality of generated music.
AgentFly
AgentFly is an extensible framework for building LLM agents with reinforcement learning. It supports multi-turn training by adapting traditional RL methods with token-level masking. It features a decorator-based interface for defining tools and reward functions, enabling seamless extension and ease of use. To support high-throughput training, it implemented asynchronous execution of tool calls and reward computations, and designed a centralized resource management system for scalable environment coordination. A suite of prebuilt tools and environments are provided.
lionagi
LionAGI is a powerful intelligent workflow automation framework that introduces advanced ML models into any existing workflows and data infrastructure. It can interact with almost any model, run interactions in parallel for most models, produce structured pydantic outputs with flexible usage, automate workflow via graph based agents, use advanced prompting techniques, and more. LionAGI aims to provide a centralized agent-managed framework for "ML-powered tools coordination" and to dramatically lower the barrier of entries for creating use-case/domain specific tools. It is designed to be asynchronous only and requires Python 3.10 or higher.
curator
Bespoke Curator is an open-source tool for data curation and structured data extraction. It provides a Python library for generating synthetic data at scale, with features like programmability, performance optimization, caching, and integration with HuggingFace Datasets. The tool includes a Curator Viewer for dataset visualization and offers a rich set of functionalities for creating and refining data generation strategies.
edsl
The Expected Parrot Domain-Specific Language (EDSL) package enables users to conduct computational social science and market research with AI. It facilitates designing surveys and experiments, simulating responses using large language models, and performing data labeling and other research tasks. EDSL includes built-in methods for analyzing, visualizing, and sharing research results. It is compatible with Python 3.9 - 3.11 and requires API keys for LLMs stored in a `.env` file.
MathCoder
MathCoder is a repository focused on enhancing mathematical reasoning by fine-tuning open-source language models to use code for modeling and deriving math equations. It introduces MathCodeInstruct dataset with solutions interleaving natural language, code, and execution results. The repository provides MathCoder models capable of generating code-based solutions for challenging math problems, achieving state-of-the-art scores on MATH and GSM8K datasets. It offers tools for model deployment, inference, and evaluation, along with a citation for referencing the work.
ChatRex
ChatRex is a Multimodal Large Language Model (MLLM) designed to seamlessly integrate fine-grained object perception and robust language understanding. By adopting a decoupled architecture with a retrieval-based approach for object detection and leveraging high-resolution visual inputs, ChatRex addresses key challenges in perception tasks. It is powered by the Rexverse-2M dataset with diverse image-region-text annotations. ChatRex can be applied to various scenarios requiring fine-grained perception, such as object detection, grounded conversation, grounded image captioning, and region understanding.
For similar tasks
Spatial-SSRL
Spatial-SSRL is a lightweight self-supervised RL framework designed to enhance LVLM spatial understanding. It optimizes Qwen2.5-VL-7B model, exhibiting stronger spatial intelligence across seven benchmarks. It is highly scalable, cost-effective, lightweight, and naturally verifiable. The framework supports five tasks using ordinary RGB and RGB-D images, inviting users to contribute with effective pretext tasks to enhance LVLM capabilities.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.





