ScreenParse

771K Screenshots
21M UI Annotations
55 Semantic Classes
316M Model Parameters

Abstract

Modern computer-use agents must perceive a screen as a structured state, including what elements are visible, where they are, and what text they contain, before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen.

We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering.

Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model that decodes a structured ScreenTag representation with a structure-aware loss. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.606 vs. 0.294 PageIoU) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance.

Dataset Samples

Sample ground-truth annotations from the ScreenParse dataset. Each screenshot shows dense UI element annotations with bounding boxes and semantic labels. The dataset covers 55 semantic classes across 771K screenshots.

The Problem: Sparse Supervision Limits Screen Understanding

Computer-use agents must perceive screens as structured states before they can reliably ground instructions and act. Yet most grounding datasets provide sparse supervision, annotating only the single element relevant to each task step. This leaves the majority of on-screen elements unlabeled and the full screen structure implicit.

As a result, models can learn shortcuts sufficient for supervised steps while failing to form a complete screen state. When perception fails, errors cascade throughout the agent pipeline, hurting robustness and generalization to new layouts and applications.

We argue that complete screen parsing, recovering all visible UI elements, their bounding boxes, semantic types, and text, is a core training objective for building reliable screen agents.

Sparse Supervision

Only task-relevant elements annotated
Majority of UI elements unlabeled
Models learn shortcuts
Poor generalization

Dense Supervision

All visible elements annotated
Complete screen structure
Holistic understanding
Robust generalization

Key Contributions

ScreenParse Dataset

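A large-scale dataset with 21M dense UI annotations across 771K screenshots (roughly 27 annotated elements per screenshot on average) and 55 semantic classes, with more annotated elements and more element types than any of the existing grounding datasets it is compared against.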

Webshot Pipeline

An automated, scalable pipeline for generating dense screen annotations, combining web rendering, accessibility tree extraction, and VLM-based quality filtering.

ScreenVLM

A compact 316M-parameter VLM that outperforms much larger models on dense parsing while running 4x faster than 2B VLMs, which makes practical deployment feasible.

ScreenParse Dataset

ScreenParse provides complete screen parsing supervision, with dense annotations covering all visible UI elements, not just sparse task-relevant subsets.

Existing approaches suffer from sparse annotations that leave the majority of UI elements unlabeled, limiting both the diversity and coverage of training data. For example, GroundCUA provides only 55K samples with 8 element types. ScreenParse addresses this with dense annotations spanning 771K samples and 55 semantic types, enabling models to develop comprehensive screen understanding rather than learning task-specific shortcuts.

Dataset	# Types	# Elements	# Samples
UGround	1	9M	773K
JEDI	4	4M	575K
AGUVIS-G	1	3.8M	452K
OS-ATLAS	1*	14.5M	1.85M
RICOSCA	1*	170K	18K
UIBert	32	166K	57K
Widget Caption	1	101K	14K
AMEX	2	1.2M	101K
ScreenSpot	2	3M	270K
GroundCUA	8	3.56M	55K
ScreenParse (Ours)	55	21M	771K

* These datasets do not define a well-specified set of UI element types.

Class Distribution (Top 20)

Distribution of the 20 most frequent UI element types in ScreenParse. The dataset covers 55 semantic classes total.

Webshot Pipeline

Webshot is an automated, scalable pipeline for generating dense screen annotations from web pages.

1. Web Crawling: Collect diverse URLs as the starting point for large-scale screen capture.
2. Rendering: Render each page with Playwright to obtain screenshots and page metadata.
3. Bounding Box Extraction + Filtering: Extract visible UI boxes and remove noisy or invalid candidates.
4. Refining Class Labels: Use page, crop, and HTML context to refine UI element classes.
5. VLM-as-a-judge Filtering: Score annotation quality and curate accepted screen samples.
6. Hierarchical Grouping and Serialization: Group elements into ScreenTag structure for complete screen parsing.
7. Train: Train ScreenVLM on the curated ScreenTag supervision.
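
As a rough illustration of step 6, the sketch below nests flat boxes by containment and serializes the result in a ScreenTag-like layout. It is not the Webshot implementation: the `Element` class, the smallest-enclosing-box rule, the top-to-bottom child ordering, and the 0 to 500 location grid (inferred only from the `<Window>` example spanning `<loc_0>` to `<loc_500>`) are all assumptions made for this example.

```python
from dataclasses import dataclass, field

GRID = 500  # assumed grid size, inferred from the <Window> example


@dataclass
class Element:
    """Hypothetical container for one annotated UI element."""
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels
    text: str = ""
    children: list = field(default_factory=list)


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def contains(outer, inner):
    """True if box `outer` fully encloses box `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def group(elements):
    """Attach each element to its smallest enclosing element; return the roots."""
    ordered = sorted(elements, key=lambda e: area(e.box), reverse=True)
    roots = []
    for i, elem in enumerate(ordered):
        # Only larger (earlier) elements are candidate parents.
        parents = [p for p in ordered[:i] if contains(p.box, elem.box)]
        if parents:
            min(parents, key=lambda p: area(p.box)).children.append(elem)
        else:
            roots.append(elem)
    return roots


def loc_tokens(box, width, height):
    """Quantize a pixel box to four <loc_N> tokens on the assumed grid."""
    sizes = (width, height, width, height)
    return "".join(f"<loc_{round(v * GRID / s)}>" for v, s in zip(box, sizes))


def serialize(elem, width, height, indent=0):
    """Emit one element and its descendants as indented ScreenTag-like text."""
    pad = "  " * indent
    head = f"{pad}<{elem.label}> {loc_tokens(elem.box, width, height)}"
    if not elem.children:
        body = f" {elem.text}" if elem.text else ""
        return f"{head}{body} </{elem.label}>"
    lines = [head]
    # Assumed ordering: top-to-bottom, then left-to-right.
    for child in sorted(elem.children, key=lambda e: (e.box[1], e.box[0])):
        lines.append(serialize(child, width, height, indent + 1))
    lines.append(f"{pad}</{elem.label}>")
    return "\n".join(lines)


if __name__ == "__main__":
    flat = [
        Element("Button", (170, 10, 220, 70), "Azure"),
        Element("Window", (0, 0, 1000, 1000)),
        Element("Navigation Bar", (170, 10, 840, 70)),
    ]
    for root in group(flat):
        print(serialize(root, width=1000, height=1000))
```

Choosing the smallest enclosing box as the parent keeps the tree as deep as the layout allows, so a button lands under its navigation bar rather than directly under the window.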

ScreenVLM

ScreenVLM is a compact 316M-parameter vision-language model that decodes structured ScreenTag representations.

Foundation VLMs (8B+ parameters) are too large for low-latency, on-device deployment, while traditional detector-based parsers lack language-grounded understanding. ScreenVLM bridges this gap: it is compact enough for practical deployment (316M parameters, 4x faster than 2B VLMs) yet retains language-aligned representations that enable structured understanding of UI semantics.

ScreenTag Output Format

<Screentag>
<Window> <loc_0><loc_0><loc_500><loc_500>
  <Logo> <loc_59><loc_84><loc_102><loc_90> Microsoft </Logo>
  <Navigation Bar> <loc_85><loc_5><loc_420><loc_35>
    <Button> <loc_85><loc_5><loc_110><loc_35> Azure </Button>
    <Button> <loc_110><loc_5><loc_145><loc_35> Explore </Button>
    …
    <Button> <loc_385><loc_5><loc_420><loc_35> Sign in </Button>
  </Navigation Bar>
  …
</Window>
</Screentag>
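
To make the format above concrete, here is a minimal parsing sketch that turns a ScreenTag string into a nested structure. It is an illustration only, not the authors' decoder: the function names `parse_screentag` and `to_pixels` are hypothetical, and both the `(x1, y1, x2, y2)` token order and the 0 to 500 coordinate grid are assumptions read off the example rather than a documented specification.

```python
import re

# Order matters: location tokens must be tried before generic opening tags.
TOKEN = re.compile(
    r"<loc_(\d+)>"          # location token
    r"|</([^<>]+)>"         # closing tag
    r"|<([^<>/][^<>]*)>"    # opening tag (labels may contain spaces)
    r"|([^<>]+)"            # free text between tags
)


def parse_screentag(source):
    """Parse ScreenTag-like text into nested dicts; returns top-level nodes."""
    root = {"label": "ROOT", "box": (), "text": "", "children": []}
    stack = [root]
    for match in TOKEN.finditer(source):
        loc, close, open_, text = match.groups()
        node = stack[-1]
        if loc is not None:
            # Location tokens belong to the most recently opened element.
            node["box"] = node["box"] + (int(loc),)
        elif close is not None:
            if close.strip() != node["label"]:
                raise ValueError(f"unexpected </{close}> inside <{node['label']}>")
            stack.pop()
        elif open_ is not None:
            child = {"label": open_.strip(), "box": (), "text": "", "children": []}
            node["children"].append(child)
            stack.append(child)
        else:
            stripped = text.strip()
            if stripped:
                node["text"] = (node["text"] + " " + stripped).strip()
    if len(stack) != 1:
        raise ValueError(f"unclosed element <{stack[-1]['label']}>")
    return root["children"]


def to_pixels(box, width, height, grid=500):
    """Map grid coordinates back to pixels (grid size is an assumption)."""
    x1, y1, x2, y2 = box
    return (x1 * width / grid, y1 * height / grid,
            x2 * width / grid, y2 * height / grid)


if __name__ == "__main__":
    sample = (
        "<Screentag><Window> <loc_0><loc_0><loc_500><loc_500>"
        "<Logo> <loc_59><loc_84><loc_102><loc_90> Microsoft </Logo>"
        "</Window></Screentag>"
    )
    window = parse_screentag(sample)[0]["children"][0]
    print(window["label"], to_pixels(window["box"], 1280, 720))
```

Because every element carries its own closing tag, a stack is enough to recover the hierarchy, and a mismatched or missing close can be reported instead of silently producing a malformed tree.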

316M Parameters
4x Faster than 2B VLMs
6x Smaller memory

Results

ScreenParse Test Set

Model	Size	PageIoU	Label PageIoU	mAP@50
Vision-Language Models
Qwen3-VL-8B-Instruct	8B	0.294	-	-
Qwen3-VL-2B-Instruct	2B	0.228	0.051	0.023
Qwen3-VL-2B + ScreenParse	2B	0.585 (+0.357)	0.166 (+0.115)	0.152 (+0.129)
InternVL3-2B	2B	0.111	0.030	0.000
InternVL3-2B + ScreenParse	2B	0.509 (+0.398)	0.174 (+0.144)	0.072 (+0.072)
ScreenVLM (Ours)	316M	0.606	0.197	0.303
Detectors / Parsers
OmniParser V2	20M	0.270	-	-
OmniParser V2 + ScreenParse	20M	0.503	0.141	0.251
YOLO + ScreenParse	25M	0.533	0.133	0.299
RT-DETRv2 + ScreenParse	43M	0.600	0.172	0.362

GroundCUA Transfer (Out-of-Distribution)

Model	Size	PageIoU	Label PageIoU
Vision-Language Models
Qwen3-VL-8B-Instruct	8B	0.060	0.010
Qwen3-VL-2B-Instruct	2B	0.030	0.005
Qwen3-VL-2B + ScreenParse	2B	0.090 (+0.060)	0.019 (+0.014)
InternVL3-2B	2B	0.025	0.006
InternVL3-2B + ScreenParse	2B	0.203 (+0.178)	0.036 (+0.030)
ScreenVLM (Ours)	316M	0.251	0.043
Detectors / Parsers
OmniParser V2	20M	0.361	0.049
OmniParser V2 + ScreenParse	20M	0.398	0.061

ScreenSpot Benchmark

Model	Size	Web Recall	Web PixCov	PC Recall	PC PixCov	Mobile Recall	Mobile PixCov
Vision-Language Models
Qwen3-VL-8B-Instruct	8B	0.229	0.346	0.201	0.300	0.193	0.311
Qwen3-VL-2B + ScreenParse	2B	0.477	0.720	0.201	0.443	0.108	0.477
ScreenVLM (Ours)	316M	0.557	0.746	0.222	0.839	0.066	0.847
Detectors / Parsers
OmniParser V2	20M	0.541	0.629	0.483	0.557	0.489	0.521
RT-DETRv2 + ScreenParse	43M	0.768	0.857	0.590	0.699	0.584	0.736

Downstream Action Tasks

We use a fixed Qwen3-VL-8B-Instruct Set-of-Mark pipeline and swap only the detector: OmniParser v2 versus a YOLO detector trained on ScreenParse. Holding OCR, icon captioning, rendering, prompting, and decoding fixed isolates the effect of ScreenParse-trained grounding.

Benchmark / Split	Metric	OmniParser v2	ScreenParse (YOLO)	Delta
Primary benchmark metrics
ScreenSpot	Action Acc.	77.8	80.1	+2.3
ScreenSpot-Pro	Action Acc.	28.5	30.5	+2.0
Mind2Web (website)	Step Success	24.1	27.1	+3.0
Mind2Web (task)	Step Success	26.8	27.0	+0.2
Mind2Web (domain)	Step Success	28.7	29.6	+0.9
OSWorld-G	Overall Acc.	46.67	48.82	+2.15
ScreenSpot domain breakdown
ScreenSpot (Web)	Action Acc.	77.1	81.9	+4.8
ScreenSpot (Mobile)	Action Acc.	83.5	84.9	+1.4
ScreenSpot (Desktop)	Action Acc.	70.4	71.2	+0.8
OSWorld-G capability breakdown
OSWorld-G (Text Matching)	Acc.	59.00	61.69	+2.69
OSWorld-G (Element Recognition)	Acc.	50.30	50.91	+0.61
OSWorld-G (Layout Understanding)	Acc.	53.36	56.13	+2.77
OSWorld-G (Fine-grained Manipulation)	Acc.	25.50	29.53	+4.03

Values are percentages and deltas are percentage points. These controlled evaluations measure step-level grounding and action selection rather than full multi-step agent success.
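
The Set-of-Mark setup described above can be sketched in a few lines to show where the detector enters. This is a simplified illustration, not the evaluation code: the helper names, the reading-order numbering, the click-at-box-center rule, and the point-in-box hit test are assumptions for this example.

```python
def assign_marks(boxes):
    """Number detected boxes in reading order; returns {mark_id: box}."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))  # top-to-bottom, left-to-right
    return {i + 1: box for i, box in enumerate(ordered)}


def click_point(marks, mark_id):
    """Turn the mark chosen by the VLM into a click at the box center."""
    x1, y1, x2, y2 = marks[mark_id]
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def is_hit(point, target_box):
    """Assumed success criterion: the click lands inside the target box."""
    x, y = point
    x1, y1, x2, y2 = target_box
    return x1 <= x <= x2 and y1 <= y <= y2


if __name__ == "__main__":
    detected = [(100, 50, 200, 80), (10, 10, 60, 30)]  # boxes from any detector
    marks = assign_marks(detected)
    chosen = 2  # stand-in for the mark ID the VLM would output
    print(is_hit(click_point(marks, chosen), target_box=(90, 40, 210, 90)))
```

In a setup like this the model can only select among the marks it is shown, so a target element that the detector misses cannot be clicked correctly regardless of the prompt or decoding, which is why swapping only the detector isolates its contribution.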

Structure-Aware Loss Ablation

Upweighting tag and location tokens improves dense parsing and out-of-distribution transfer compared with standard cross-entropy, especially on ScreenSpot PC.

Model	ScreenParse PageIoU	ScreenParse Label PageIoU	GroundCUA PageIoU	GroundCUA Label PageIoU	ScreenSpot Web Recall	ScreenSpot PC Recall	ScreenSpot Mobile Recall
ScreenVLM (CE)	0.592	0.192	0.226	0.039	0.541	0.129	0.052
ScreenVLM (Structure-aware)	0.606 (+2.4%)	0.197 (+2.6%)	0.251 (+11.1%)	0.043 (+10.3%)	0.557 (+3.0%)	0.222 (+72.1%)	0.066 (+26.9%)
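
One way to read "upweighting tag and location tokens" is as a per-token weighted cross-entropy. The sketch below is a minimal version under that reading. The weight values, the normalization by the sum of weights, and the token classification by string prefix are assumptions for illustration, not the configuration used to train ScreenVLM.

```python
import math


def token_weight(token, tag_w=2.0, loc_w=2.0):
    """Hypothetical weights: structural tokens count more than text tokens."""
    if token.startswith("<loc_"):
        return loc_w
    if token.startswith("<") and token.endswith(">"):
        return tag_w
    return 1.0


def structure_aware_ce(tokens, probs, tag_w=2.0, loc_w=2.0):
    """Weighted mean of -log p(target token) over the target sequence.

    `probs[i]` is the probability the model assigned to target `tokens[i]`.
    With tag_w == loc_w == 1.0 this reduces to standard cross-entropy.
    """
    weights = [token_weight(t, tag_w, loc_w) for t in tokens]
    losses = [-math.log(p) for p in probs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


if __name__ == "__main__":
    targets = ["<Button>", "<loc_85>", "Azure", "</Button>"]
    target_probs = [0.5, 0.25, 0.9, 0.5]
    print("standard CE:    ", structure_aware_ce(targets, target_probs, 1.0, 1.0))
    print("structure-aware:", structure_aware_ce(targets, target_probs))
```

With weights like these, mistakes on tags and coordinates contribute more to the loss than mistakes on free text, which pushes training toward getting the element structure and boxes right.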

Qualitative Examples

Comparison of ground truth annotations vs. model predictions on diverse web screenshots.

Panels (each paired with its ground truth): ScreenVLM Prediction (two examples), Qwen3-VL Finetuned, OmniParser Finetuned.

How Do Other Datasets Compare?

Most existing datasets annotate only one or a few elements per screen. ScreenParse provides dense annotations covering every visible UI element.

Sparse: ScreenSpot (Web), ScreenSpot (PC), ScreenSpot (Mobile), ScreenSpot-Pro, MMBench-GUI, UI-Vision

Dense: ScreenParse (Ours)

Limitations & Future Work

Current Limitations

ScreenParse is predominantly web-centric, leaving a domain gap to native desktop and mobile applications with different UI toolkits and interaction patterns. DOM-driven extraction can retain residual noise from dynamic content, canvas-heavy interfaces, ads, overlays, or rendering artifacts even after filtering. ScreenParse also uses viewport-level captures rather than scroll-complete page states, which keeps each sample tied to one visible screen but underrepresents below-the-fold UI contexts.

Future Directions

A natural next step is expanding dense parsing supervision to cover native desktop and mobile UIs. Another promising direction is leveraging screen-parsing-pretrained models as visual backbones for vision-language-action agents, capitalizing on holistic screen understanding for improved grounding and decision making.

Citation

Please cite the arXiv version until the official ICML proceedings entry is available.

@misc{gurbuz2026movingsparsegroundingcomplete,
      title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
      author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
      year={2026},
      eprint={2602.14276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.14276},
}
