ScreenParse

Moving Beyond Sparse Grounding with Complete Screen Parsing

¹IBM Research Zurich   ²ETH Zurich

* Corresponding author

771K Screenshots · 21M UI Annotations · 55 Semantic Classes · 316M Model Parameters

Abstract

Modern computer-use agents must perceive a screen as a structured state—what elements are visible, where they are, and what text they contain—before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision: low-diversity labels that annotate only a small subset of task-relevant elements per screen.

We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering.

Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision-language model that decodes a structured ScreenTag representation with a structure-aware loss. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.606 vs. 0.294 PageIoU) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance.

The Problem: Sparse Supervision Limits Screen Understanding

Computer-use agents must perceive screens as structured states before they can reliably ground instructions and act. Yet most grounding datasets provide sparse supervision—annotating only the single element relevant to each task step. This leaves the majority of on-screen elements unlabeled and the full screen structure implicit.

As a result, models can learn shortcuts sufficient for supervised steps while failing to form a complete screen state. When perception fails, errors cascade throughout the agent pipeline, hurting robustness and generalization to new layouts and applications.

We argue that complete screen parsing—recovering all visible UI elements, their bounding boxes, semantic types, and text—is a core training objective for building reliable screen agents.

Sparse Supervision

  • Only task-relevant elements annotated
  • Majority of UI elements unlabeled
  • Models learn shortcuts
  • Poor generalization

Dense Supervision

  • All visible elements annotated
  • Complete screen structure
  • Holistic understanding
  • Robust generalization

Key Contributions

ScreenParse Dataset

A large-scale dataset with 21M dense UI annotations across 771K screenshots, featuring 55 semantic classes—significantly more comprehensive than existing datasets.

Webshot Pipeline

An automated, scalable pipeline for generating dense screen annotations, combining web rendering, accessibility tree extraction, and VLM-based quality filtering.

ScreenVLM

A compact 316M-parameter VLM that outperforms much larger models on dense parsing while running about 4x faster than 2B-parameter VLMs, enabling practical deployment.

ScreenParse Dataset

ScreenParse provides complete screen parsing supervision—dense annotations covering all visible UI elements, not just sparse task-relevant subsets.

Existing approaches suffer from sparse annotations that leave the majority of UI elements unlabeled, limiting both the diversity and coverage of training data. For example, GroundCUA provides only 55K samples with 8 element types. ScreenParse addresses this with dense annotations spanning 771K samples and 55 semantic types, enabling models to develop comprehensive screen understanding rather than learning task-specific shortcuts.

Dataset | Complete | # Types | # Elements | # Samples
UGround | ✗ | 1 | 9M | 773K
JEDI | ✗ | 4 | 4M | 575K
AGUVIS-G | ✗ | 1 | 3.8M | 452K
OS-ATLAS | ✗ | 1* | 14.5M | 1.85M
RICOSCA | ✗ | 1* | 170K | 18K
UIBert | ✗ | 32 | 166K | 57K
Widget Caption | ✗ | 1 | 101K | 14K
AMEX | ✗ | 2 | 1.2M | 101K
ScreenSpot | ✗ | 2 | 3M | 270K
GroundCUA | ✗ | 8 | 3.56M | 55K
ScreenParse (Ours) | ✓ | 55 | 21M | 771K

* These datasets do not define a well-specified set of UI element types.
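To make the dense-annotation format concrete, the sketch below shows what a single ScreenParse record could look like, using element values taken from the ScreenTag example later on this page. The field names and pixel-coordinate convention are illustrative assumptions, not the released schema.

```python
# Hypothetical per-screenshot record (field names are assumptions,
# not the released ScreenParse format).
example_record = {
    "image": "screenshots/000123.png",
    "viewport": {"width": 1280, "height": 800},
    "elements": [
        # one entry per visible UI element: semantic type, bounding box, text
        {"type": "Logo",   "box": [59, 84, 102, 90],  "text": "Microsoft"},
        {"type": "Button", "box": [85, 5, 110, 35],   "text": "Azure"},
        {"type": "Button", "box": [385, 5, 420, 35],  "text": "Sign in"},
    ],
}
```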

Class Distribution (Top 15)

[Figure] Distribution of the 15 most frequent UI element types in ScreenParse; the dataset covers 55 semantic classes in total.

Webshot Pipeline

Webshot is an automated, scalable pipeline for generating dense screen annotations from web pages.

[Figure] Webshot pipeline overview.

1. Web Rendering: render diverse URLs with Playwright, capturing screenshots and accessibility trees.
2. Annotation Extraction: extract bounding boxes, semantic types, and text from the DOM/accessibility tree.
3. VLM Refinement: apply VLM-based relabeling to improve annotation accuracy.
4. Quality Filtering: filter low-quality samples using VLM scoring.
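The page names Playwright for rendering but does not show extraction code. The snippet below is a minimal sketch of steps 1-2 under that assumption; the element selector, output fields, and wait strategy are illustrative and not the actual Webshot implementation.

```python
from playwright.sync_api import sync_playwright

def render_and_extract(url: str, out_png: str = "shot.png"):
    """Sketch of Webshot steps 1-2: render a URL, save a screenshot,
    and collect rough per-element annotations (box, role, text)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_png)

        elements = []
        # Illustrative selector; the real pipeline's element set and
        # semantic-type mapping are not specified on this page.
        for handle in page.query_selector_all("a, button, input, img, [role]"):
            box = handle.bounding_box()  # None for hidden/detached elements
            if box is None:
                continue
            elements.append({
                "role": handle.get_attribute("role") or handle.evaluate("e => e.tagName"),
                "box": [box["x"], box["y"],
                        box["x"] + box["width"], box["y"] + box["height"]],
                "text": handle.inner_text().strip(),
            })
        browser.close()
        return elements
```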

ScreenVLM

ScreenVLM is a compact 316M-parameter vision-language model that decodes structured ScreenTag representations.

Foundation VLMs (8B+ parameters) are too large for low-latency, on-device deployment, while traditional detector-based parsers lack language-grounded understanding. ScreenVLM bridges this gap: it is compact enough for practical deployment (316M parameters, 4x faster than 2B VLMs) yet retains language-aligned representations that enable structured understanding of UI semantics.
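The abstract mentions training with a structure-aware loss but does not define it here. Below is a minimal sketch of one plausible form, assuming a token-level cross-entropy in which structural tokens (ScreenTag tags and <loc_*> positions) are up-weighted relative to free text; this is an assumption, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def structure_aware_loss(logits, targets, struct_mask, struct_weight=2.0):
    """Weighted next-token cross-entropy over a ScreenTag sequence (sketch).

    logits:        (B, T, V) decoder outputs
    targets:       (B, T)    target token ids
    struct_mask:   (B, T)    True where the target is a tag or <loc_*> token
    struct_weight: extra weight on structural tokens (assumed value)
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    weights = 1.0 + (struct_weight - 1.0) * struct_mask.float()
    return (weights * ce).sum() / weights.sum()
```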

[Figure] ScreenVLM architecture overview.

ScreenTag Output Format

<Screentag>
<Window> <loc_0><loc_0><loc_500><loc_500>
  <Logo> <loc_59><loc_84><loc_102><loc_90> Microsoft </Logo>
  <Navigation Bar> <loc_85><loc_5><loc_420><loc_35>
    <Button> <loc_85><loc_5><loc_110><loc_35> Azure </Button>
    <Button> <loc_110><loc_5><loc_145><loc_35> Explore </Button>
    
    <Button> <loc_385><loc_5><loc_420><loc_35> Sign in </Button>
  </Navigation Bar>
  
</Window>
</Screentag>
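For readers who want to work with ScreenTag output programmatically, here is a small parser sketch that flattens the example above into (type, box, text) records. The regex and the assumption that every element carries exactly four <loc_*> tokens are inferred from the example, not from a published spec.

```python
import re

# One element: "<Type> <loc_x1><loc_y1><loc_x2><loc_y2> optional text"
ELEMENT_RE = re.compile(
    r"<(?P<tag>[^/<>]+)>\s*"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"\s*(?P<text>[^<]*)"
)

def parse_screentag(screentag: str):
    """Flatten a ScreenTag string into a list of element dicts."""
    elements = []
    for m in ELEMENT_RE.finditer(screentag):
        elements.append({
            "type": m.group("tag").strip(),
            "box": tuple(int(m.group(k)) for k in ("x1", "y1", "x2", "y2")),
            "text": m.group("text").strip(),
        })
    return elements
```

Run on the example above, this yields entries such as {"type": "Button", "box": (85, 5, 110, 35), "text": "Azure"}.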

316M parameters · 4x faster than 2B VLMs · 6x smaller memory footprint

Results

ScreenParse Test Set

Model | Size | PageIoU | Label PageIoU | mAP@50
Vision-Language Models
Qwen3-VL-8B-Instruct | 8B | 0.294 | - | -
Qwen3-VL-2B-Instruct | 2B | 0.228 | 0.051 | 0.023
Qwen3-VL-2B + ScreenParse | 2B | 0.585 (+0.357) | 0.166 (+0.115) | 0.152 (+0.129)
InternVL3-2B | 2B | 0.111 | 0.030 | 0.000
InternVL3-2B + ScreenParse | 2B | 0.509 (+0.398) | 0.174 (+0.144) | 0.072 (+0.072)
ScreenVLM (Ours) | 316M | 0.606 | 0.197 | 0.303
Detectors / Parsers
OmniParser V2 | 20M | 0.270 | - | -
OmniParser V2 + ScreenParse | 20M | 0.503 | 0.141 | 0.251
YOLO + ScreenParse | 25M | 0.533 | 0.133 | 0.299
RT-DETRv2 + ScreenParse | 43M | 0.600 | 0.172 | 0.362
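PageIoU and Label PageIoU are reported here without definitions (mAP@50 is the standard detection metric). Purely for intuition, the sketch below implements one plausible reading of PageIoU: the IoU between the rasterized unions of predicted and ground-truth boxes over the full page. This is an assumption, not the paper's exact definition.

```python
import numpy as np

def page_iou(pred_boxes, gt_boxes, width, height):
    """One plausible reading of PageIoU (an assumption): IoU between the
    union masks of predicted and ground-truth boxes rasterized onto the
    page. Boxes are (x1, y1, x2, y2) in pixels."""
    def union_mask(boxes):
        mask = np.zeros((height, width), dtype=bool)
        for x1, y1, x2, y2 in boxes:
            mask[max(0, int(y1)):min(height, int(y2)),
                 max(0, int(x1)):min(width, int(x2))] = True
        return mask

    pred, gt = union_mask(pred_boxes), union_mask(gt_boxes)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0
```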

GroundCUA Transfer (Out-of-Distribution)

Model | Size | PageIoU | Label PageIoU
Vision-Language Models
Qwen3-VL-8B-Instruct | 8B | 0.060 | 0.010
Qwen3-VL-2B-Instruct | 2B | 0.030 | 0.005
Qwen3-VL-2B + ScreenParse | 2B | 0.090 (+0.060) | 0.019 (+0.014)
InternVL3-2B | 2B | 0.025 | 0.006
InternVL3-2B + ScreenParse | 2B | 0.203 (+0.178) | 0.036 (+0.030)
ScreenVLM (Ours) | 316M | 0.251 | 0.043
Detectors / Parsers
OmniParser V2 | 20M | 0.361 | 0.049
OmniParser V2 + ScreenParse | 20M | 0.398 | 0.061

ScreenSpot Benchmark

Model | Size | Web Recall | Web PixCov | PC Recall | PC PixCov | Mobile Recall | Mobile PixCov
Vision-Language Models
Qwen3-VL-8B-Instruct | 8B | 0.229 | 0.346 | 0.201 | 0.300 | 0.193 | 0.311
Qwen3-VL-2B + ScreenParse | 2B | 0.477 | 0.720 | 0.201 | 0.443 | 0.108 | 0.477
ScreenVLM (Ours) | 316M | 0.557 | 0.746 | 0.222 | 0.839 | 0.066 | 0.847
Detectors / Parsers
OmniParser V2 | 20M | 0.541 | 0.629 | 0.483 | 0.557 | 0.489 | 0.521
RT-DETRv2 + ScreenParse | 43M | 0.768 | 0.857 | 0.590 | 0.699 | 0.584 | 0.736

Qualitative Examples

Comparison of ground truth annotations vs. model predictions on diverse web screenshots.

[Figures] Ground truth vs. predicted parses, side by side:

  • Example 1: ground truth vs. ScreenVLM prediction
  • Example 2: ground truth vs. ScreenVLM prediction
  • Example 3: ground truth vs. finetuned Qwen3-VL prediction
  • Example 4: ground truth vs. finetuned OmniParser prediction

How Do Other Datasets Compare?

Most existing datasets annotate only one or a few elements per screen. ScreenParse provides dense annotations covering every visible UI element.

[Figures] Annotation density comparison:

  • Sparse: ScreenSpot (Web), ScreenSpot (PC), ScreenSpot (Mobile), ScreenSpot-Pro, MMBench-GUI, UI-Vision
  • Dense: ScreenParse (Ours)

Limitations & Future Work

Current Limitations

ScreenParse is predominantly web-centric, leaving a domain gap to native desktop and mobile applications with different UI toolkits and interaction patterns. While we demonstrate strong transfer to the GroundCUA benchmark (which includes desktop screenshots), there remains room for improvement on native application interfaces.

Future Directions

A natural next step is expanding dense parsing supervision to cover native desktop and mobile UIs. Another promising direction is leveraging screen-parsing-pretrained models as visual backbones for vision-language-action agents, capitalizing on holistic screen understanding for improved grounding and decision making.

Citation

TBD