CVSBench | Project Page

Abstract

CVSBench provides a unified benchmark for testing cross-view spatial reasoning under the challenging satellite-street setting. Rather than only measuring what a model recognizes in one image, it evaluates whether the model can preserve spatial consistency across viewpoints, infer hidden structure, and answer questions whose evidence is distributed across very different visual perspectives.

What makes CVSBench different

It combines cross-view VQA, grounding, and viewpoint localization in a single benchmark.
It moves beyond indoor and low-variation multi-view settings to realistic urban-scale overhead-to-ground scenes.
It directly probes spatial imagination, not just object recognition or single-view reasoning.
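To make the three-task composition concrete, here is a minimal sketch of how a single benchmark item could be represented. Every field name and the `is_valid` helper are hypothetical; the released annotation schema may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# All names below are hypothetical; the released annotation schema may differ.
Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels


@dataclass
class CrossViewSample:
    group_id: str                      # one satellite-street image group
    satellite_image: str               # path to the overhead image
    street_image: str                  # path to the ground-level image
    task: str                          # "vqa", "grounding", or "viewpoint"
    question: str
    answer: Optional[str] = None       # text answer for VQA / viewpoint items
    target_box: Optional[Box] = None   # box answer for grounding items

    def is_valid(self) -> bool:
        """Grounding items need a box; the other tasks need a text answer."""
        if self.task == "grounding":
            return self.target_box is not None
        if self.task in ("vqa", "viewpoint"):
            return self.answer is not None
        return False
```

Keeping all three tasks in one record type is only one possible design; it makes it easy to iterate over a mixed evaluation set with a single loader.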

3,297 image groups | 9,468 object annotations | 40,679 QA pairs

Figure 1. CVSBench expands spatial reasoning evaluation into large-viewpoint satellite-street scenarios and multiple cross-view tasks.

Benchmark Comparison and Composition

Following the paper order, we first show where CVSBench sits relative to existing datasets, then summarize the internal task composition of the benchmark.

Table 1. Comparison with Existing Datasets. Task coverage, scale, annotation format, and multi-view support.

Presented as an image version for cleaner reading on the project page while preserving the original Table 1 content.


Figure 2. Benchmark composition summary showing task families, subset structure, and answer distributions across CVSBench.

Figure 3. Dataset annotation and question generation pipeline for VQA, viewpoint localization, and cross-view grounding.

3,297 image groups | 9,468 bounding boxes | 40,679 question-answer pairs
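The headline counts imply rough per-group densities. The short calculation below uses only the three totals stated on this page; the averages say nothing about how annotations are actually distributed across groups.

```python
# Totals as reported on this page.
image_groups = 3_297
bounding_boxes = 9_468
qa_pairs = 40_679

# Simple averages; the real per-group distribution may be uneven.
boxes_per_group = bounding_boxes / image_groups
qa_per_group = qa_pairs / image_groups

print(f"{boxes_per_group:.2f} bounding boxes per image group")  # 2.87
print(f"{qa_per_group:.2f} QA pairs per image group")           # 12.34
```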

Method and Spatial Imagination Pipeline

The paper studies both benchmark construction and inference-time improvement strategies. Language-only reasoning gives limited gains, while explicit visual imagination provides stronger spatial signals.

Figure 4. Structured Scene CoT, Spatial Imagination CoT, depth augmentation, and 3D miniature view generation.

1. Structured Scene CoT: Force explicit object-level scene decomposition before answering, so the model states what it sees before guessing the cross-view relation.

2. Spatial Imagination CoT: Encourage viewpoint projection from a visible source view to an unseen target view instead of relying on shallow appearance matching.

3. Visual Imagination Inputs: Add depth maps or synthesized 3D miniature views so reasoning is grounded in a stronger spatial intermediate.
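The three strategies above can be pictured as different ways of assembling the model prompt. The sketch below is illustrative only: the template wording, the `build_prompt` function, and its parameters are assumptions and are not taken from the paper.

```python
# Hypothetical prompt templates; the wording is illustrative, not the paper's.
STRUCTURED_SCENE_COT = (
    "Step 1: List the salient objects in each view with their rough positions.\n"
    "Step 2: Match objects that appear in both views.\n"
    "Step 3: Using only the matched objects, answer the question.\n"
)

SPATIAL_IMAGINATION_COT = (
    "Step 1: Describe the layout of the {source} view.\n"
    "Step 2: Imagine how this layout would look from the {target} view.\n"
    "Step 3: Compare the imagined view with the question and answer it.\n"
)


def build_prompt(question: str, strategy: str, source: str = "satellite",
                 target: str = "street", aux_inputs: tuple = ()) -> str:
    """Assemble a text prompt for one of the two CoT styles (assumed names)."""
    if strategy == "structured":
        body = STRUCTURED_SCENE_COT
    elif strategy == "imagination":
        body = SPATIAL_IMAGINATION_COT.format(source=source, target=target)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Strategy 3: mention any auxiliary visual inputs (e.g. a depth map)
    # that are attached alongside the original images.
    aux = ""
    if aux_inputs:
        aux = "Auxiliary inputs provided: " + ", ".join(aux_inputs) + ".\n"
    return f"{aux}{body}Question: {question}"
```

For example, `build_prompt("Which side is the road on?", "imagination", aux_inputs=("depth map",))` would yield an imagination-style prompt that first announces the attached depth map.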

Why this section matters

The main pattern is straightforward: text-only prompting gives limited gains, while explicit visual-spatial intermediates make the cross-view transfer much more stable.

SFT + RL | Depth Augmentation | 3D Miniature View | Cross-view Alignment

This page keeps the method summary compact; low-level training details are left to the appendix PDF.

Main Results

The results show that current VLMs can partially solve coarse cross-view VQA, but still struggle with precise entity alignment, grounding, and stable viewpoint transfer.

Table 2. Overall Comparison on CVSBench. Main VQA/viewpoint accuracy and grounding mIoU across model families.

Presented as an image version for cleaner reading on the project page while preserving the original Table 2 content.

Takeaways from the main table

Closed-source models lead overall, but even they remain weak on grounding.
Open-source systems can approach strong VQA performance, yet still fail to maintain cross-view correspondence.
The FOV subset is consistently harder than CVUSA, confirming that limited visible cues sharply increase the difficulty of spatial transfer.
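For readers unfamiliar with the two metric families in Table 2, the sketch below shows one common way to compute them. It assumes axis-aligned boxes in (x1, y1, x2, y2) format, mIoU taken as the plain average of per-sample IoU, and exact-match accuracy after lowercasing; the benchmark's official scoring script may differ on any of these points.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average per-sample IoU (one predicted box per ground-truth box)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def accuracy(preds: Sequence[str], gts: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, gts))
    return hits / len(preds)
```

Because IoU penalizes both misplaced and mis-sized boxes, a model can answer coarse VQA questions correctly and still score low on grounding, which is consistent with the pattern described above.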

Reasoning Strategy Ablations

The paper next examines whether modifying textual reasoning or adding auxiliary visual views can improve cross-view understanding.

Table 3. Inference-time CoT. Qwen3-VL-4B with and without reasoning traces.


Presented as an image version for cleaner reading on the project page while preserving the original Table 3 content.

Table 6. Auxiliary Views on FOV. Depth versus 3D-view imagination inputs.


Presented as an image version for cleaner reading on the project page while preserving the original Table 6 content.

Table 4. CVUSA Reasoning Comparison. Training strategies and CoT designs on CVUSA G2S/S2G.

Presented as an image version for cleaner reading on the project page while preserving the original Table 4 content.

Table 5. FOV Reasoning Comparison. Training strategies and CoT designs on FOV G2S/S2G.

Presented as an image version for cleaner reading on the project page while preserving the original Table 5 content.

Qualitative Examples and Appendix Cases

Main-paper examples are grouped together here for quick comparison.

Main Paper Case 1. Cross-view imagination versus structured scene reasoning on a footprint-style example.