🌍SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model


Kaiyu Li*1
Zepeng Xin*1
Li Pang1
Chao Pang2
Yupeng Deng3
Jing Yao3
Guisong Xia2
Deyu Meng1
Zhi Wang1
Xiangyong Cao✉ 1

Xi'an Jiaotong University1
Wuhan University2
Chinese Academy of Sciences3

Dataset [Coming soon]
Code [GitHub]
Paper [arXiv]
Demo [Coming soon]


TL;DR We introduce the geospatial pixel reasoning task, construct the first benchmark dataset (EarthReason),
and propose a simple yet effective baseline (SegEarth-R1).

Comparison of semantic segmentation, referring segmentation and geospatial pixel reasoning. (Left) Samples from the LoveDA and RRSIS-D datasets. (Right) Samples from the EarthReason dataset. Previous tasks are limited to fixed taxonomies and explicit instructions, while geospatial pixel reasoning supports complex implicit instructions and requires reasoning capability from the model.



Abstract

Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset, called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing imagery, a description projection module that fuses language and multi-scale visual features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
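
To make the aggressive token compression concrete, the sketch below shows one plausible pooling-based connector in PyTorch. The `TokenCompressor` name, the 8×8 target grid, and the feature dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Hypothetical pooling-based connector: shrinks a large visual token
    grid before it enters the LLM. All sizes here are assumptions."""
    def __init__(self, vis_dim=1024, llm_dim=4096, grid=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(grid)    # e.g. 64x64 -> 8x8 tokens
        self.proj = nn.Linear(vis_dim, llm_dim)   # map into the LLM space

    def forward(self, feat):                      # feat: (B, C, H, W)
        x = self.pool(feat)                       # (B, C, grid, grid)
        x = x.flatten(2).transpose(1, 2)          # (B, grid*grid, C)
        return self.proj(x)                       # (B, grid*grid, llm_dim)

# A 4,096-token grid (64x64) shrinks to 64 tokens, keeping an
# ultra-high-resolution image within the LLM's context budget.
feat = torch.randn(1, 1024, 64, 64)
tokens = TokenCompressor()(feat)
print(tokens.shape)  # torch.Size([1, 64, 4096])
```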



Dataset


Comparison between EarthReason and other related datasets. Rows rendered in gray denote natural image datasets. "Seg", "Det", "VG", "Cls" denote segmentation, detection, visual grounding and classification datasets, respectively.



Method


Overview of the proposed SegEarth-R1 architecture. Given an image and a text description, a hierarchical visual encoder and the proposed connector extract and compress visual tokens. The visual tokens and description embeddings are then fed into an LLM for instruction interpretation and semantic correlation. Finally, the description embeddings are directly mapped to a query vector, which performs spatial correlation and generates the segmentation mask.
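
As a rough illustration of the final step, the sketch below projects LLM-refined description states to a query vector and correlates it with a high-resolution feature map to produce mask logits. The module name, the mean pooling over description states, and the dot-product correlation are assumptions standing in for the actual description projection module and mask generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskFromDescription(nn.Module):
    """Hypothetical mask head: a description query scores every pixel."""
    def __init__(self, llm_dim=4096, vis_dim=256):
        super().__init__()
        # Description projection: LLM space -> visual feature space.
        self.desc_proj = nn.Linear(llm_dim, vis_dim)

    def forward(self, desc_states, feat, out_size):
        # desc_states: (B, T, llm_dim) LLM hidden states of the description.
        # feat:        (B, vis_dim, h, w) high-resolution visual features.
        query = self.desc_proj(desc_states.mean(dim=1))     # (B, vis_dim)
        # Spatial correlation: dot product of the query with each pixel.
        logits = torch.einsum("bc,bchw->bhw", query, feat).unsqueeze(1)
        return F.interpolate(logits, size=out_size, mode="bilinear")

desc = torch.randn(2, 16, 4096)        # LLM states of the instruction
feat = torch.randn(2, 256, 64, 64)     # encoder feature map
mask_logits = MaskFromDescription()(desc, feat, out_size=(512, 512))
print(mask_logits.shape)  # torch.Size([2, 1, 512, 512])
```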



Quantitative Results

Geospatial pixel reasoning results of SegEarth-R1 (ours) and previous related works.

Referring segmentation results of SegEarth-R1 and previous related works on the RRSIS-D dataset.

Referring segmentation results of SegEarth-R1 and previous related works on the RefSegRS dataset.



Visualizations

Qualitative results of SegEarth-R1 on EarthReason.

Comparison with other models on EarthReason.

Comparison with PSALM on RRSIS-D.



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.