
Browser Agents Don't Need Smarter Models. They Need a Runbook.

Browser agents fail because they improvise. A browser agent runbook stops the improvising. Record your workflow once with ReplayDoc and hand it to any agent.

April 28, 2026 · 7 min read

A browser agent fails the same way every time. It opens the wrong tab. It clicks the wrong button. It picks last week's template instead of this week's. Three minutes in, it tells you it finished the task. It didn't.

The reason is not the model. Claude Sonnet 5 hits 88% on the OSWorld benchmark, ahead of human experts. Browser Use scores 89% on WebVoyager. The technology works in 2026. Your agent fails because you handed it a vague task and expected it to figure out the rest. It improvises. Improvisation is what makes it fail.

The fix is a runbook.

Why Your Browser Agent Keeps Improvising

A vague task description gets a vague execution. "Send the invoice to Acme" makes sense to you because you've sent that invoice forty times. The agent has never seen it. It has to guess which app, which template, which folder, which subject line. It guesses wrong somewhere around step four and the whole run goes sideways.

Browser Use, the open-source library, reports something specific about this. When you switch the agent from "go figure it out" to "here are the steps," success rates jump from 30% to 80%. Same model. Same browser. Same task. The only thing that changed is whether the agent has a runbook.

Browser agents in 2026 are not unreliable because the AI is dumb. They are unreliable because nobody handed them an SOP.

A Browser Agent Runbook Is the SOP You'd Give a New Hire

A runbook is just a standard operating procedure. The same document you would hand someone on their first day. The agent reads it for the same reason a junior reads it: to skip the discovery phase and get to the work.

A good browser agent runbook has four things in it.

  • The exact apps to use. Not "the spreadsheet." Google Sheets. Not "your email." Gmail. The agent needs to know which app to open, by name.
  • Precise actions per step. Not "create the invoice." "Click the New Invoice button. Select the Standard-2024 template. Enter the client name in the Company field."
  • What success looks like. What the screen should show after each step works. The "Invoice #4521 created" banner. The PDF lands in Downloads.
  • Where the workflow branches. If the amount is over $10,000, route to manager approval. Otherwise, send it directly.
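
Put together, those four elements read something like this. A minimal sketch for the invoicing example; the app names, template name, and field labels are illustrative, not prescribed:

```
Workflow: Send the monthly invoice to Acme
Apps: QuickBooks Online (web), Gmail

Step 1: Open QuickBooks Online. Click "New Invoice".
  Success: a blank invoice form appears with today's date prefilled.
Step 2: Select the "Standard-2024" template from the Template dropdown.
  Success: the template's line items populate the form.
Step 3: Enter "Acme" in the Company field. Click "Save and Send".
  Success: a banner reads "Invoice #<number> created" and the PDF
  lands in Downloads.
Branch: if the invoice total is over $10,000, do NOT send. Click
  "Request Approval" and route to the manager instead.
```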

Try writing all four from memory. You will forget half. The default values you never change. The keyboard shortcut you use without thinking. The modal that pops up every time and you dismiss without reading. New juniors miss the same details. So do agents.

Why a Recorded Runbook Beats a Written One

A typed SOP is what you remember about the workflow. A recorded runbook is what you actually do. The gap between those two is where browser agents fail.

The screen recording captures context that writing from memory always misses. The clicks you make on autopilot. The default values. The tab you keep open in the background. The agent needs every one of those details to make it through the workflow on the first try.

Open your screen recorder. Do the workflow exactly as you normally do it. Do not optimize. Do not skip steps. Just do the work.

ReplayDoc watches the recording frame by frame. It identifies each app by name. It captures every click, every typed value, every selection. The output is a structured runbook with a screenshot at every step. You can read it. Your agent can read it. Both get the same picture of what to do.

This is the same idea as onboarding documentation, with one difference. A written SOP usually shows up after someone has done the work for a year. ReplayDoc writes the runbook the first time you record. No memory loss. No skipped steps. No "oh, I forgot to mention that part."

Three Ways to Run Your Runbook

Once the runbook exists, you pick how to run it. Agents that drive a browser come in different shapes, and the same export feeds all three.

Hand the runbook to a browser agent. Paste it into Browser Use, ChatGPT Agent, or Hermes. The agent reads the steps and clicks through your apps live. You watch it work. This is the fastest way to test whether the runbook is good. If the agent gets stuck, the runbook needs another sentence.
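
For a sense of what that looks like in code, here is a minimal sketch using Browser Use's Python library. The filename, model choice, and prompt wording are assumptions, and the import paths vary slightly between library versions:

```python
import asyncio
from pathlib import Path

from browser_use import Agent
from langchain_openai import ChatOpenAI

# Load the exported runbook and make it the task itself.
runbook = Path("runbook.md").read_text()

async def main():
    agent = Agent(
        task=f"Follow this runbook exactly, step by step:\n\n{runbook}",
        llm=ChatOpenAI(model="gpt-4o"),  # any model the library supports
    )
    await agent.run()

asyncio.run(main())
```

The point is that the runbook becomes the prompt. The agent's job shrinks from planning the workflow to executing it.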

Pair the runbook with a coding agent that drives a browser. Claude Code with the Playwright MCP. Cursor with the Chrome DevTools MCP. Codex with a browser tool. The runbook guides the coding agent step by step in a real session. Slower than a browser agent, more controllable, easier to debug when something breaks.

Convert the runbook into a Playwright script. Drop it into Claude Code, Codex, or Cursor and ask for the script. The output is durable. It lives in your repo. You can run it on a cron, in CI, or on demand. The recording becomes a test you own.
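
Here is a hedged sketch of what that conversion might produce, using Playwright's sync API for Python. The URL, button names, and field labels are placeholders drawn from the invoicing example; your recording supplies the real ones:

```python
from playwright.sync_api import sync_playwright, expect

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Step 1: open the invoicing app (placeholder URL).
    page.goto("https://invoicing.example.com")
    page.get_by_role("button", name="New Invoice").click()

    # Step 2: pick the exact template the recording showed.
    page.get_by_label("Template").select_option(label="Standard-2024")
    page.get_by_label("Company").fill("Acme")
    page.get_by_role("button", name="Save and Send").click()

    # The runbook's success check becomes an assertion the script
    # can fail on, instead of failing silently.
    expect(page.get_by_text("Invoice #")).to_be_visible()

    browser.close()
```

Because each success check turns into an assertion, a broken run points at the exact step that failed, which is what makes this mode worth owning in a repo.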

All three modes read the same export. One recording. Three execution surfaces. The choice depends on which agent you already have and how often you plan to run the workflow.

Where Browser Agents Shine, Where They Stumble

Browser agents in 2026 are reliable enough to use, but not reliable enough to run unsupervised. Practical success on well-defined UI tasks lands around 75%. On flaky apps with modal popups and slow loading states it drops to 50% or 60%.

A few patterns hold up across our testing and the broader benchmarks:

  • They handle stable apps with clear text labels well. Gmail, Notion, most SaaS dashboards.
  • They are good at form filling and data entry across multiple systems.
  • They handle research that spans tabs and sites.
  • They struggle with drag and drop in design tools, rich text editors with custom shortcuts, and apps that need pixel-perfect targeting.
  • They struggle when the runbook leaves judgment calls undocumented.

Same pattern as a new junior. Hand them the runbook. Watch the first few runs. Fix the gaps you missed. After a few iterations the agent runs solo on the workflows it can handle. The ones it cannot handle stay with you.

Pick One Workflow and Try It

Pick the task you do most often. The one that is repetitive, predictable, and takes 5-15 minutes. The one that goes through three or four apps. Maybe it is your weekly invoicing flow. Maybe it is a triage routine. Maybe it is the Friday status report you assemble from three dashboards.

Record yourself doing it once. Upload to ReplayDoc. Export the runbook.

Paste it into ChatGPT Agent and ask it to run. Or into Browser Use. Or into a Hermes session triggered from Slack. Or hand it to Claude Code with a Playwright MCP and watch it execute step by step.

If the agent makes it through, you just got back the next ten times you would have done that task by hand. If it does not, the runbook tells you exactly where the agent got lost. Tweak the recording. Try again. The same recording also tells you where the workflow itself is broken, which is a different kind of value.

The model is not what is holding your agent back. The runbook is. Record one this week and find out.