MolmoWeb: The Open-Source Web Agent That Navigates by Screenshots Alone
Most browser automation tools cheat. They read the DOM, parse accessibility trees, or inspect page structure to figure out what's on screen. It's fast and reliable — but it's not how a human would do it, and it breaks the moment a site renders content in a canvas or uses a non-standard component.
MolmoWeb, released by the Allen Institute for AI (AI2) on March 24, 2026, takes a different approach: it looks at a screenshot of your browser, figures out what to do next, and clicks. That's it. No DOM access, no HTML parsing, no site-specific customization.
The result is a fully open-weight web agent — 4B and 8B sizes, Apache 2.0 license — that sets a new open-weight state of the art on four major web benchmarks and matches OpenAI's o3 on WebVoyager.
How It Works
MolmoWeb operates in a perception-action loop. At each step it receives three inputs:
- The task instruction ("Find the cheapest nonstop flight from Seattle to Tokyo")
- A screenshot of the current browser view
- A history of recent actions
It produces a short natural-language thought about what to do next, then executes one browser action. The available actions are: navigate to a URL, click at screen coordinates, type text, scroll, open/switch tabs, or send a message back to the user.
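The loop described above can be sketched in a few lines. This is an illustrative skeleton, not MolmoWeb's actual API — the `Action` vocabulary and the `step`/`policy` names are assumptions that mirror the inputs and action set listed in the article:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    # kind is one of: "navigate", "click", "type", "scroll", "tab", "message"
    kind: str
    args: dict = field(default_factory=dict)

@dataclass
class AgentState:
    task: str            # the task instruction
    screenshot: bytes    # current browser view, as pixels
    history: list[Action] = field(default_factory=list)

def step(state: AgentState, policy) -> Action:
    """One iteration of the perception-action loop: the policy sees
    (task, screenshot, history), emits a thought plus one action,
    and the action is appended to the running history."""
    thought, action = policy(state.task, state.screenshot, state.history)
    state.history.append(action)
    return action

# Stand-in for the model: always clicks a fixed coordinate.
def dummy_policy(task, screenshot, history):
    return "click the search box", Action("click", {"x": 412, "y": 88})
```

Note that the policy outputs screen coordinates, not element selectors — the browser harness only needs to replay raw clicks and keystrokes.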
This is architecturally simple — and that simplicity is the point. Because the model works on visual coordinates rather than DOM nodes, it handles any web interface the same way a human would. Canvas elements, custom components, PDFs rendered in-browser — all treated identically.
Architecture: MolmoWeb is built on Molmo 2, AI2's multimodal model family. The language backbone is Qwen3 and the vision encoder is SigLIP2. Available in 4B and 8B parameter sizes.
The Training Data
One of the most significant things about MolmoWeb isn't the model — it's the dataset. MolmoWebMix combines three sources:
- 36,000 human-completed web task demonstrations across 1,100+ websites (623K subtask examples) — the largest publicly released collection of human web task data
- Synthetic trajectories generated by text-based accessibility-tree agents and multi-agent pipelines
- 2.2M+ screenshot question-answer pairs from ~400 websites for GUI perception
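As a rough mental model, the mixture can be written down as a config. The counts come from the article; the uniform-over-examples sampling weights below are an assumption for illustration, not AI2's published recipe:

```python
# Sketch of the MolmoWebMix composition. Example counts are from the
# article; the synthetic-trajectory count is not stated, so it is left None.
MOLMOWEBMIX = {
    "human_demos":     {"examples": 623_000,   "note": "36K tasks, 1,100+ sites"},
    "synthetic_trajs": {"examples": None,      "note": "a11y-tree + multi-agent pipelines"},
    "screenshot_qa":   {"examples": 2_200_000, "note": "~400 websites, GUI perception"},
}

def mixture_weights(mix):
    """Uniform-over-examples weights for the subsets with known counts
    (an assumed default -- the real sampling ratios are not published)."""
    known = {k: v["examples"] for k, v in mix.items() if v["examples"]}
    total = sum(known.values())
    return {k: n / total for k, n in known.items()}
```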
Notably, AI2 explicitly avoided distilling from proprietary vision-based agents like GPT-4o or Claude. The training signal comes from humans and open-source systems.
Both the models and the full MolmoWebMix dataset are available on Hugging Face under Apache 2.0.
Is It Actually Better?
Yes — with important caveats.
Where MolmoWeb 8B leads (open-weight models):
- WebVoyager (15 popular sites including GitHub, Google Flights): 78.2% — beats all prior open-weight models
- DeepShop (online shopping tasks): 42.3% at 30 steps, where competitors needed 100 steps
- WebTailBench (long-tail web tasks): 49.5% — outperforms GPT-4o-based agents
- ScreenSpot (UI element localization): beats Claude 3.7 and OpenAI's CUA
Versus the best proprietary systems:
- vs. OpenAI o3 on WebVoyager: MolmoWeb 8B scores 78.2%, o3 scores 79.3%. Essentially tied.
- vs. GPT-5 on Online-Mind2Web: MolmoWeb scores 35.3%, GPT-5 scores 57.7%. Clear gap.
- vs. text-based teacher model: MolmoWeb trails by ~5 points versus a Gemini-based system that has page structure access — showing that the screenshot-only constraint does cost something.
The honest summary: MolmoWeb is the best open-weight web agent and is competitive with closed models on many tasks. It's not universally better than GPT-5. But it's fully open, runs locally, and for tasks involving visual interfaces it holds its own.
There's also a test-time scaling result worth noting: with 4 independent rollouts (Pass@4), WebVoyager accuracy reaches 94.7%. If your application can afford multiple attempts, the ceiling is much higher.
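It's worth checking what independence would predict here. If each rollout succeeded independently at the single-attempt rate of 78.2%, four attempts would succeed about 99.8% of the time — well above the observed 94.7%, which suggests failures are correlated: some tasks fail on every rollout. A quick sketch of the arithmetic:

```python
def pass_at_k_independent(p: float, k: int) -> float:
    """Probability that at least one of k independent rollouts succeeds,
    given per-rollout success probability p. An upper-bound intuition:
    real rollouts share a model, so failures correlate and the true
    Pass@k lands below this number."""
    return 1.0 - (1.0 - p) ** k
```

With `p = 0.782` and `k = 4`, this gives roughly 0.998 — the gap down to the reported 0.947 is the fraction of tasks the model essentially cannot solve on any attempt.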
System Requirements
MolmoWeb is available in two sizes designed for different hardware:
- MolmoWeb-4B: Fits on consumer GPUs. A 4B-parameter model with the Qwen3/SigLIP2 architecture typically needs 8-12GB of VRAM in fp16, or less with quantization.
- MolmoWeb-8B: Needs roughly 18-24GB of VRAM in fp16. Fits on a single A100 40GB, or on a consumer GPU with quantization.
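The VRAM figures follow from simple arithmetic: fp16 stores 2 bytes per parameter, so the weights alone cost 2 GB per billion parameters, and activations plus KV cache add headroom on top. The ~30% overhead factor below is a rule-of-thumb assumption, not a measured number:

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Weight memory only: 2 bytes per parameter in fp16."""
    return params_billion * 2.0

def vram_estimate_gb(params_billion: float, overhead: float = 1.3) -> float:
    """Rough total with ~30% headroom for activations and KV cache
    (an assumed rule of thumb; actual usage depends on context length)."""
    return fp16_weight_gb(params_billion) * overhead
```

This lands at ~10GB for the 4B model and ~21GB for the 8B model, consistent with the ranges above.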
The models are on Hugging Face and can be downloaded for local use.
A hosted demo is available at molmoweb.allen.ai with safety guardrails (whitelisted sites, no login/payment fields).
What It Can't Do
MolmoWeb has real limitations worth understanding before committing to it:
- No login-required tasks. The training data excludes login flows, so the model isn't equipped for authenticated sessions.
- No financial transactions. Payments, checkout flows, and credit card fields are explicitly blocked.
- Screenshot text errors. Reading small text, CAPTCHAs, or dense tables from screenshots introduces errors that DOM-based approaches avoid.
- Sensitive to prior mistakes. A wrong click early in a sequence degrades subsequent steps disproportionately.
- Vague instructions degrade performance. Multiple constraints in a single task or ambiguous instructions cause accuracy drops.
- No drag-and-drop. Complex pointer interactions aren't supported.
Why This Matters
Browser agents built on closed models create a dependency problem. Your automation pipeline runs on OpenAI or Anthropic infrastructure — you're subject to their pricing, their API terms, and their availability. When they change something, your agent breaks.
MolmoWeb changes the calculus. At 8B parameters, Apache 2.0 licensed, trained without proprietary model distillation, it's a web agent you can actually own. Self-host it, fine-tune it on your own task data, inspect the training set. None of that is possible with o3 or Claude.
The 78.2% on WebVoyager vs. o3's 79.3% isn't just a benchmark number — it's evidence that you can get near-frontier web agent performance from hardware you control.
---
Resources: