ScreenParse

Moving Beyond Sparse Grounding with Complete Screen Parsing

¹IBM Research Zurich   ²ETH Zurich

* Corresponding author

771K Screenshots · 21M UI Annotations · 55 Semantic Classes · 316M Model Parameters

Abstract

Modern computer-use agents must perceive a screen as a structured state—what elements are visible, where they are, and what text they contain—before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision: low-diversity labels that annotate only a small subset of task-relevant elements per screen.

We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering.

Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision-language model that decodes a structured ScreenTag representation with a structure-aware loss. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.606 vs. 0.294 PageIoU) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance.

The Problem: Sparse Supervision Limits Screen Understanding

Computer-use agents must perceive screens as structured states before they can reliably ground instructions and act. Yet most grounding datasets provide sparse supervision—annotating only the single element relevant to each task step. This leaves the majority of on-screen elements unlabeled and the full screen structure implicit.

As a result, models can learn shortcuts sufficient for supervised steps while failing to form a complete screen state. When perception fails, errors cascade throughout the agent pipeline, hurting robustness and generalization to new layouts and applications.

We argue that complete screen parsing—recovering all visible UI elements, their bounding boxes, semantic types, and text—is a core training objective for building reliable screen agents.

Sparse Supervision

  • Only task-relevant elements annotated
  • Majority of UI elements unlabeled
  • Models learn shortcuts
  • Poor generalization

Dense Supervision

  • All visible elements annotated
  • Complete screen structure
  • Holistic understanding
  • Robust generalization

Key Contributions

ScreenParse Dataset

A large-scale dataset with 21M dense UI annotations across 771K screenshots, featuring 55 semantic classes—significantly more comprehensive than existing datasets.

Webshot Pipeline

An automated, scalable pipeline for generating dense screen annotations, combining web rendering, accessibility tree extraction, and VLM-based quality filtering.

ScreenVLM

A compact 316M-parameter VLM that outperforms much larger models on dense parsing while running about 4x faster than 2B-parameter VLMs, enabling practical deployment.

ScreenParse Dataset

ScreenParse provides complete screen parsing supervision—dense annotations covering all visible UI elements, not just sparse task-relevant subsets.

Existing approaches suffer from sparse annotations that leave the majority of UI elements unlabeled, limiting both the diversity and coverage of training data. For example, GroundCUA provides only 55K samples with 8 element types. ScreenParse addresses this with dense annotations spanning 771K samples and 55 semantic types, enabling models to develop comprehensive screen understanding rather than learning task-specific shortcuts.

Dataset | Complete | # Types | # Elements | # Samples
UGround | ✗ | 1 | 9M | 773K
JEDI | ✗ | 4 | 4M | 575K
AGUVIS-G | ✗ | 1 | 3.8M | 452K
OS-ATLAS | ✗ | 1* | 14.5M | 1.85M
RICOSCA | ✗ | 1* | 170K | 18K
UIBert | ✗ | 32 | 166K | 57K
Widget Caption | ✗ | 1 | 101K | 14K
AMEX | ✗ | 2 | 1.2M | 101K
ScreenSpot | ✗ | 2 | 3M | 270K
GroundCUA | ✗ | 8 | 3.56M | 55K
ScreenParse (Ours) | ✓ | 55 | 21M | 771K

* These datasets do not define a well-specified set of UI element types.
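To make the dense-annotation format concrete, the sketch below shows what a single ScreenParse record could look like, using element values taken from the ScreenTag example later on this page. The field names and pixel-coordinate convention are illustrative assumptions, not the released schema.

```python
# Hypothetical per-screenshot record (field names are assumptions,
# not the released ScreenParse format).
example_record = {
    "image": "screenshots/000123.png",
    "viewport": {"width": 1280, "height": 800},
    "elements": [
        # one entry per visible UI element: semantic type, bounding box, text
        {"type": "Logo",   "box": [59, 84, 102, 90],  "text": "Microsoft"},
        {"type": "Button", "box": [85, 5, 110, 35],   "text": "Azure"},
        {"type": "Button", "box": [385, 5, 420, 35],  "text": "Sign in"},
    ],
}
```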

Class Distribution (Top 15)

[Figure] Distribution of the 15 most frequent UI element types in ScreenParse; the dataset covers 55 semantic classes in total.

Webshot Pipeline

Webshot is an automated, scalable pipeline for generating dense screen annotations from web pages.

[Figure] Webshot pipeline overview.

1. Web Rendering: render diverse URLs with Playwright, capturing screenshots and accessibility trees.
2. Annotation Extraction: extract bounding boxes, semantic types, and text from the DOM/accessibility tree.
3. VLM Refinement: apply VLM-based relabeling to improve annotation accuracy.
4. Quality Filtering: filter low-quality samples using VLM scoring.
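The page names Playwright for rendering but does not show extraction code. The snippet below is a minimal sketch of steps 1-2 under that assumption; the element selector, output fields, and wait strategy are illustrative and not the actual Webshot implementation.

```python
from playwright.sync_api import sync_playwright

def render_and_extract(url: str, out_png: str = "shot.png"):
    """Sketch of Webshot steps 1-2: render a URL, save a screenshot,
    and collect rough per-element annotations (box, role, text)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_png)

        elements = []
        # Illustrative selector; the real pipeline's element set and
        # semantic-type mapping are not specified on this page.
        for handle in page.query_selector_all("a, button, input, img, [role]"):
            box = handle.bounding_box()  # None for hidden/detached elements
            if box is None:
                continue
            elements.append({
                "role": handle.get_attribute("role") or handle.evaluate("e => e.tagName"),
                "box": [box["x"], box["y"],
                        box["x"] + box["width"], box["y"] + box["height"]],
                "text": handle.inner_text().strip(),
            })
        browser.close()
        return elements
```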

ScreenVLM

ScreenVLM is a compact 316M-parameter vision-language model that decodes structured ScreenTag representations.

Foundation VLMs (8B+ parameters) are too large for low-latency, on-device deployment, while traditional detector-based parsers lack language-grounded understanding. ScreenVLM bridges this gap: it is compact enough for practical deployment (316M parameters, 4x faster than 2B VLMs) yet retains language-aligned representations that enable structured understanding of UI semantics.
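The abstract mentions training with a structure-aware loss but does not define it here. Below is a minimal sketch of one plausible form, assuming a token-level cross-entropy in which structural tokens (ScreenTag tags and <loc_*> positions) are up-weighted relative to free text; this is an assumption, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def structure_aware_loss(logits, targets, struct_mask, struct_weight=2.0):
    """Weighted next-token cross-entropy over a ScreenTag sequence (sketch).

    logits:        (B, T, V) decoder outputs
    targets:       (B, T)    target token ids
    struct_mask:   (B, T)    True where the target is a tag or <loc_*> token
    struct_weight: extra weight on structural tokens (assumed value)
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    weights = 1.0 + (struct_weight - 1.0) * struct_mask.float()
    return (weights * ce).sum() / weights.sum()
```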

[Figure] ScreenVLM architecture overview.

ScreenTag Output Format

<Screentag>
<Window> <loc_0><loc_0><loc_500><loc_500>
  <Logo> <loc_59><loc_84><loc_102><loc_90> Microsoft </Logo>
  <Navigation Bar> <loc_85><loc_5><loc_420><loc_35>
    <Button> <loc_85><loc_5><loc_110><loc_35> Azure </Button>
    <Button> <loc_110><loc_5><loc_145><loc_35> Explore </Button>
    
    <Button> <loc_385><loc_5><loc_420><loc_35> Sign in </Button>
  </Navigation Bar>
  
</Window>
</Screentag>
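For readers who want to work with ScreenTag output programmatically, here is a small parser sketch that flattens the example above into (type, box, text) records. The regex and the assumption that every element carries exactly four <loc_*> tokens are inferred from the example, not from a published spec.

```python
import re

# One element: "<Type> <loc_x1><loc_y1><loc_x2><loc_y2> optional text"
ELEMENT_RE = re.compile(
    r"<(?P<tag>[^/<>]+)>\s*"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"\s*(?P<text>[^<]*)"
)

def parse_screentag(screentag: str):
    """Flatten a ScreenTag string into a list of element dicts."""
    elements = []
    for m in ELEMENT_RE.finditer(screentag):
        elements.append({
            "type": m.group("tag").strip(),
            "box": tuple(int(m.group(k)) for k in ("x1", "y1", "x2", "y2")),
            "text": m.group("text").strip(),
        })
    return elements
```

Run on the example above, this yields entries such as {"type": "Button", "box": (85, 5, 110, 35), "text": "Azure"}.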

316M parameters · 4x faster than 2B VLMs · 6x smaller memory footprint

Results

ScreenParse Test Set

Model | Size | PageIoU | Label PageIoU | mAP@50
Vision-Language Models
Qwen3-VL-8B-Instruct | 8B | 0.294 | - | -
Qwen3-VL-2B-Instruct | 2B | 0.228 | 0.051 | 0.023
Qwen3-VL-2B + ScreenParse | 2B | 0.585 (+0.357) | 0.166 (+0.115) | 0.152 (+0.129)
InternVL3-2B | 2B | 0.111 | 0.030 | 0.000
InternVL3-2B + ScreenParse | 2B | 0.509 (+0.398) | 0.174 (+0.144) | 0.072 (+0.072)
ScreenVLM (Ours) | 316M | 0.606 | 0.197 | 0.303
Detectors / Parsers
OmniParser V2 | 20M | 0.270 | - | -
OmniParser V2 + ScreenParse | 20M | 0.503 | 0.141 | 0.251
YOLO + ScreenParse | 25M | 0.533 | 0.133 | 0.299
RT-DETRv2 + ScreenParse | 43M | 0.600 | 0.172 | 0.362
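PageIoU and Label PageIoU are reported here without definitions (mAP@50 is the standard detection metric). Purely for intuition, the sketch below implements one plausible reading of PageIoU: the IoU between the rasterized unions of predicted and ground-truth boxes over the full page. This is an assumption, not the paper's exact definition.

```python
import numpy as np

def page_iou(pred_boxes, gt_boxes, width, height):
    """One plausible reading of PageIoU (an assumption): IoU between the
    union masks of predicted and ground-truth boxes rasterized onto the
    page. Boxes are (x1, y1, x2, y2) in pixels."""
    def union_mask(boxes):
        mask = np.zeros((height, width), dtype=bool)
        for x1, y1, x2, y2 in boxes:
            mask[max(0, int(y1)):min(height, int(y2)),
                 max(0, int(x1)):min(width, int(x2))] = True
        return mask

    pred, gt = union_mask(pred_boxes), union_mask(gt_boxes)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0
```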

GroundCUA Transfer (Out-of-Distribution)

Model | Size | PageIoU | Label PageIoU
Vision-Language Models
Qwen3-VL-8B-Instruct | 8B | 0.060 | 0.010
Qwen3-VL-2B-Instruct | 2B | 0.030 | 0.005
Qwen3-VL-2B + ScreenParse | 2B | 0.090 (+0.060) | 0.019 (+0.014)
InternVL3-2B | 2B | 0.025 | 0.006
InternVL3-2B + ScreenParse | 2B | 0.203 (+0.178) | 0.036 (+0.030)
ScreenVLM (Ours) | 316M | 0.251 | 0.043
Detectors / Parsers
OmniParser V2 | 20M | 0.361 | 0.049
OmniParser V2 + ScreenParse | 20M | 0.398 | 0.061

ScreenSpot Benchmark

Model | Size | Web Recall | Web PixCov | PC Recall | PC PixCov | Mobile Recall | Mobile PixCov
Vision-Language Models
Qwen3-VL-8B-Instruct | 8B | 0.229 | 0.346 | 0.201 | 0.300 | 0.193 | 0.311
Qwen3-VL-2B + ScreenParse | 2B | 0.477 | 0.720 | 0.201 | 0.443 | 0.108 | 0.477
ScreenVLM (Ours) | 316M | 0.557 | 0.746 | 0.222 | 0.839 | 0.066 | 0.847
Detectors / Parsers
OmniParser V2 | 20M | 0.541 | 0.629 | 0.483 | 0.557 | 0.489 | 0.521
RT-DETRv2 + ScreenParse | 43M | 0.768 | 0.857 | 0.590 | 0.699 | 0.584 | 0.736

Qualitative Examples

Comparison of ground truth annotations vs. model predictions on diverse web screenshots.

[Figures] Ground truth vs. predicted parses, side by side:

  • Example 1: ground truth vs. ScreenVLM prediction
  • Example 2: ground truth vs. ScreenVLM prediction
  • Example 3: ground truth vs. finetuned Qwen3-VL prediction
  • Example 4: ground truth vs. finetuned OmniParser prediction

How Do Other Datasets Compare?

Most existing datasets annotate only one or a few elements per screen. ScreenParse provides dense annotations covering every visible UI element.

[Figures] Annotation density comparison:

  • Sparse: ScreenSpot (Web), ScreenSpot (PC), ScreenSpot (Mobile), ScreenSpot-Pro, MMBench-GUI, UI-Vision
  • Dense: ScreenParse (Ours)

Limitations & Future Work

Current Limitations

ScreenParse is predominantly web-centric, leaving a domain gap to native desktop and mobile applications with different UI toolkits and interaction patterns. While we demonstrate strong transfer to the GroundCUA benchmark (which includes desktop screenshots), there remains room for improvement on native application interfaces.

Future Directions

A natural next step is expanding dense parsing supervision to cover native desktop and mobile UIs. Another promising direction is leveraging screen-parsing-pretrained models as visual backbones for vision-language-action agents, capitalizing on holistic screen understanding for improved grounding and decision making.

Citation

TBD