Building Safer AI Browsers with BrowseSafe

Today, we are releasing BrowseSafe, an open research benchmark and content detection model aimed at keeping users safe as they navigate the agentic web.

As AI assistants move from search boxes into the browser itself, we expect the next generation of the web to shift from pages to agents: less about where information lives, and more about who retrieves and acts on it. Comet turns the browser into a place where an assistant can accomplish tasks, not just answer questions, so one principle is non‑negotiable: it must stay on the user’s side.

BrowseSafe: protecting agents and users through realtime content scanning

BrowseSafe is a detection model fine‑tuned to answer a single focused question: given a page’s HTML, does it contain malicious instructions aimed at the agent? Large general‑purpose models can reason well about these cases, but they are often too slow and expensive to run on every page. BrowseSafe scans full web pages in real time without slowing the browser. We're also releasing the BrowseSafe‑Bench evaluation suite as a resource for evaluating and improving defense effectiveness.

Trust boundaries and layered defenses

However, a new generation of AI browsing also means a new generation of cybersecurity threats that require novel approaches to keeping users safe. In an earlier post, we walked through how Comet uses multiple layers of protection to keep the assistant doing what the user asked, even when a website tries to hijack it with prompt injection. Today we’re zooming in on how we tackle that problem: how those threats are defined, tested against real-world attacks, and used to train specialized models that can spot and stop bad instructions fast enough to run safely in the browser.

How browser prompt injection works

Prompt injection is malicious language embedded in the text that AI reads that is designed to override its original intent. In the browser, agents read whole pages, so attacks can hide in places like comments, templates, or long footers.

Attackers use those spots to slip in instructions that quietly redirect the agent. Because it reads everything, including content most people never notice, those messages can hijack behavior without strong safeguards.

These attacks often avoid obvious phrases and can be written in polished or multilingual text, or placed in HTML elements that never appear on screen, such as data attributes or form fields that browsers don’t visibly render but agents still parse.

BrowseSafe-Bench: advancing agent security in real-world environments

To study these attacks in a setting that looks like the real web, we built BrowseSafe, a detection model that we trained and open-sourced, and BrowseSafe‑Bench, a public benchmark of 14,719 examples that mimic production pages. It includes complex HTML, noisy content, and a mix of malicious and harmless samples that vary along three axes: what the attacker tries to do, where the instruction sits in the page, and how the language is written.

The benchmark includes 11 attack types, nine injection strategies spanning hidden fields to visible paragraphs and footers, and three linguistic styles, from explicit commands to indirect, camouflaged text.

A defense in depth approach

In our threat model, the assistant itself lives in a trusted environment, but anything that comes from the web is untrusted. Attackers may control entire sites or just inject content, such as product descriptions, comments, and posts into otherwise benign pages the assistant visits. To manage that risk, tools that can return untrusted content, such as web pages, emails, or files, are flagged, and their raw outputs are always scanned by BrowseSafe before the agent can read or act on them.

BrowseSafe is one layer in a broader defense approach. Raw content is scanned before use, tool permissions are limited by default, and sensitive actions can require explicit user confirmation, all on top of existing browser security features. Defense in depth enables users to adopt powerful browser assistants without trading safety for capability.

What influences attack effectiveness?

Evaluation results on BrowseSafe‑Bench show clear patterns. Direct attacks, like asking the agent to reveal its system prompt or exfiltrate information via URL segments, are among the easiest for models to catch. In contrast, multilingual attacks and those written as indirect or hypothetical instructions are significantly harder, because they avoid the obvious keywords many detectors implicitly rely on.

Placement matters too. Attacks hidden in comments are detected relatively well, while versions rewritten into visible footers, table cells, or inline paragraphs prove much more difficult, revealing a structural bias toward “hidden” injections. Careful training on well-designed examples can significantly improve models’ ability to detect these patterns.

Build safer agents with BrowseSafe

BrowseSafe and BrowseSafe-Bench are fully open-source. Any developer building autonomous agents can immediately harden their systems against prompt injection—no need to build safety rails from scratch. The open-weight detection model runs locally and flags malicious instructions before they reach your agent's core logic, fast enough to scan every page without slowing users down.

Use BrowseSafe-Bench's 14,000+ real-world attack scenarios to stress-test your own models against the messy HTML traps that break standard LLMs. Our chunking and parallel scanning techniques let agents process massive, untrusted pages efficiently—powerful browsing capabilities without exposing users to danger.

To learn more about how we built BrowseSafe and BrowseSafe-Bench, check out the Perplexity Research blog.

Share this article