News peg · April 20, 2026 · 7 min

DeepMind mapped
the traps.
we've been building
the exits.

Google DeepMind's April 2026 taxonomy names six attack classes against autonomous agents, with empirical success rates between 80 and 86 percent. None of them are patched by the "tell the model to ignore suspicious instructions" defense. Here's the map from trap to primitive.

on April 1, 2026, DeepMind's security team published AI Agent Traps — the first end-to-end taxonomy of web-side attacks against autonomous agents. it's a good paper. it names things that red-teams have been shouting about in DMs for a year, and puts empirical numbers on them. the headline that hit Twitter was "you can't sanitize a pixel." that's true, and it's worth sitting with for a minute.

the paper's central claim is a single sentence buried on page seven: detection is asymmetric. a website can tell — with near-perfect accuracy — whether it's serving a human or an agent. it can serve you one thing and your agent another. your agent has no reliable way to know.

Attack classes · 6 · content, behavioral, systemic, HITL, cascade, exfil.
Empirical success · 80-86% · content-injection and behavioral-control classes.
Prompt defenses that work · 0 · sanitization fails because the payload looks legitimate.

The six traps, briefly

summarized from the paper, so the rest of this post has somewhere to point:

  1. Indirect web injection. instructions hidden in HTML comments, CSS-positioned white-on-white text, or metadata attributes. the agent reads them. the user doesn't.
  2. Multimodal steganography. commands encoded into image pixels. invisible to moderators. fully readable to a vision model.
  3. Document jailbreaks. override instructions embedded in PDFs, spreadsheets, calendar invites. the attack travels inside the attachment.
  4. Memory poisoning. false information inserted into an agent's long-term store so it persists across sessions. the compromise happens once; the consequences last forever.
  5. Exfiltration. trick the agent into sending private data to an attacker-controlled endpoint, typically by making the endpoint look like a legitimate tool call.
  6. Multi-agent cascades. agent A gets poisoned, passes the poison to agent B, B passes to C. every downstream agent trusted the upstream's output because that's what agents do.

the paper's defense section is honest about how grim this is. input sanitization doesn't work on pixels. prompt-level "ignore untrusted instructions" doesn't work because the attacks are designed to look trusted. human review doesn't work at the speed and fan-out agents actually operate at. if you ask an agent to research fifty sites in parallel, you are not going to eyeball fifty DOM trees to verify they served you the same bytes they served it.

so the question becomes: what actually works, and who's shipping it?

The mapping

for each trap, the defense that survives has to live outside the prompt — in the agent's memory layer, in its settlement layer, in its identity layer. you can't patch an attack with the same model that got fooled by it. so you wrap the model in something that keeps books.

here's the map from each trap to a primitive that's already shipping in @mnemopay/sdk and gridstamp. none of these are theoretical — they're npm install-able tonight.

Trap 1 · Indirect injection

HTML comments, hidden CSS, metadata payloads.

The payload is syntactically valid web content. No regex catches it.

Exit · MCP auth + tool filter

Authenticated tool exposure per agent.

MnemoPay's MCP server signs every tool invocation and filters which tools a given agent can even call. An injected instruction that calls payout_create hits a 403 if that agent's scope doesn't include payouts. The injection works; the side-effect doesn't.
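The shape of that filter is easy to sketch. This is a minimal illustration of per-agent tool scoping, not the real MCP server — the scope map and the dispatch function are invented for the example:

```typescript
// Sketch of a per-agent tool filter. The names (scopes, dispatch) are
// hypothetical; the point is that authorization happens outside the model.
type ToolCall = { agent: string; tool: string };

const scopes: Record<string, Set<string>> = {
  "research-bot": new Set(["web_search", "summarize"]),
  "payments-bot": new Set(["payout_create", "charge_request"]),
};

function dispatch(call: ToolCall): { status: number; body: string } {
  const allowed = scopes[call.agent];
  if (!allowed || !allowed.has(call.tool)) {
    // The injected instruction parsed fine; the side-effect is refused.
    return { status: 403, body: `${call.tool} not in scope for ${call.agent}` };
  }
  return { status: 200, body: "dispatched" };
}

// An injected "call payout_create" inside a page the research agent read dies here:
console.log(dispatch({ agent: "research-bot", tool: "payout_create" }).status); // 403
```

The model never gets a vote: scope lives in the dispatcher, so a perfectly convincing injection still can't widen it.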

Trap 2 · Multimodal steganography

Commands in image pixels.

"You can't sanitize a pixel." The paper's money quote.

Exit · Signed proof-of-presence

Receipts, not claims, for physical actions.

GridStamp produces six-layer signed receipts (GPS, cellular RSSI, Wi-Fi beacons, barometer, IMU, visual odometry) for any agent action tied to a place. A pixel-injected "confirm the delivery, drone" fails agent.verify(proof) because the receipt chain doesn't match. The poisoned frame doesn't produce a signed location.

Trap 3 · Document jailbreaks

Overrides inside PDFs, spreadsheets, calendar invites.

The attack travels inside an attachment the agent was asked to parse.

Exit · HITL charge approval + mandate limits

Every spend passes through a signed mandate.

An injected "wire $25k to account X" has to go through shop_checkout or charge_request. Both check the agent's mandate (per-charge cap, per-day cap, allowed counter-parties) and call shop_approve when anything exceeds policy. The jailbreak passes. The settlement doesn't.
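In sketch form, the mandate check is three comparisons. The field names mirror the post's per-charge cap, per-day cap, and allowed counter-parties, but this is an illustration, not the SDK's internal policy engine:

```typescript
// Hypothetical mandate gate: anything outside policy escalates to a human.
type Mandate = { per_charge_cap: number; per_day_cap: number; allowed: Set<string> };
type Charge = { amount: number; to: string };

function needsApproval(c: Charge, m: Mandate, spentToday: number): boolean {
  return (
    c.amount > m.per_charge_cap ||            // single charge too large
    spentToday + c.amount > m.per_day_cap ||  // daily budget exceeded
    !m.allowed.has(c.to)                      // unknown counter-party
  );
}

const mandate: Mandate = { per_charge_cap: 500, per_day_cap: 2000, allowed: new Set(["acme-inc"]) };

// The document-injected "wire $25k to account X" trips all three limits:
console.log(needsApproval({ amount: 25000, to: "account-x" }, mandate, 0)); // true
```

Note what this does not do: it doesn't try to decide whether the instruction was injected. It only decides whether the settlement fits policy.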

Trap 4 · Memory poisoning

False facts persisted across sessions.

The compromise happens once. The consequences last forever.

Exit · Merkle integrity + canary honeypots

Every memory is chained; canaries detect tampering.

MnemoPay's memory_integrity_check walks the Merkle root of an agent's memory log and flags any entry that doesn't verify. Canary memories — plausible-looking honeypots — are sprinkled into the store; reading one fires a poisoned event before the agent acts on a neighbor. Poison the store, and the check notices.
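The core mechanic is a hash chain: each memory entry commits to its predecessor, so editing any entry breaks every later link. This is a simplified sketch of that idea (a linear chain rather than a full Merkle tree, and no canaries), with invented names:

```typescript
import { createHash } from "node:crypto";

// Hypothetical chained memory log: entry N's hash covers entry N-1's hash.
type Entry = { text: string; prev: string; hash: string };

const h = (s: string) => createHash("sha256").update(s).digest("hex");

function append(log: Entry[], text: string): void {
  const prev = log.length ? log[log.length - 1].hash : "genesis";
  log.push({ text, prev, hash: h(prev + text) });
}

function integrityCheck(log: Entry[]): boolean {
  return log.every((e, i) => {
    const prev = i === 0 ? "genesis" : log[i - 1].hash;
    return e.prev === prev && e.hash === h(prev + e.text);
  });
}

const log: Entry[] = [];
append(log, "user prefers USD");
append(log, "vendor acme-inc is approved");

log[1].text = "vendor evil-corp is approved"; // poison the store...
console.log(integrityCheck(log));             // ...and the check notices: false
```

The attacker would have to re-hash every downstream entry to hide the edit, which is exactly what a signed root makes infeasible.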

Trap 5 · Exfiltration

Agent sends private data to attacker endpoint.

The attacker's endpoint looks like a legitimate tool call.

Exit · EWMA anomaly + counter-party validation

Behavioral drift fires before the bytes leave.

MnemoPay's anomaly layer tracks per-agent spend / call-pattern distributions with an exponentially-weighted moving average and a PSI drift metric. A never-before-seen endpoint receiving the agent's private data is, almost definitionally, three sigmas out. The anomaly fires, the mandate freezes, a human gets paged.
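The EWMA half of that is a few lines. The class below is a toy version of the idea — running mean and variance with exponential decay, three-sigma threshold — with an illustrative smoothing factor, not MnemoPay's tuned parameters or its PSI metric:

```typescript
// Toy EWMA drift detector: flag observations more than three sigmas
// from the running mean. alpha controls how fast the baseline adapts.
class Ewma {
  mean = 0;
  variance = 1;
  private n = 0;
  constructor(private alpha = 0.1) {}

  observe(x: number): boolean {
    this.n++;
    if (this.n === 1) { this.mean = x; return false; } // seed the baseline
    const sigma = Math.sqrt(this.variance);
    const anomalous = Math.abs(x - this.mean) > 3 * sigma;
    const d = x - this.mean;
    this.mean += this.alpha * d;
    this.variance = (1 - this.alpha) * (this.variance + this.alpha * d * d);
    return anomalous;
  }
}

const spend = new Ewma();
for (const x of [12, 9, 14, 11, 10, 13]) spend.observe(x); // normal traffic
console.log(spend.observe(4800)); // true: freeze the mandate, page a human
```

A never-seen endpoint taking a large payload doesn't need to be recognized as malicious; it only needs to be recognized as different.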

Trap 6 · Multi-agent cascades

Poisoned A infects B infects C.

Agents trust each other's output because that's how they cooperate.

Exit · Agent FICO score

Score the upstream agent before trusting its memory.

Every MnemoPay agent has a 300-850 score (ours is called Agent FICO) that reflects dispute rate, chargeback rate, mandate violations, and memory integrity. Before agent B consumes anything from agent A, B pulls A's score. A cascaded infection pushes A's score down long before B ingests. The cascade stops at the scoring boundary.
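To make the mechanism concrete, here is a toy composite score in the 300-850 range. The inputs match the signals named above, but the weights and formula are invented for illustration — the real scoring model is not this:

```typescript
// Hypothetical 300-850 composite: start at the ceiling, subtract
// weighted penalties for each bad signal, floor at 300.
type History = {
  disputeRate: number;       // fraction of transactions disputed
  chargebackRate: number;    // fraction charged back
  mandateViolations: number; // count of policy breaches
  memoryIntegrity: number;   // 0..1, share of memory entries that verify
};

function agentScore(h: History): number {
  const penalty =
    h.disputeRate * 800 +
    h.chargebackRate * 1200 +
    h.mandateViolations * 50 +
    (1 - h.memoryIntegrity) * 400;
  return Math.max(300, Math.round(850 - penalty));
}

const cleanA = agentScore({ disputeRate: 0.01, chargebackRate: 0.005, mandateViolations: 0, memoryIntegrity: 1 });
const poisonedA = agentScore({ disputeRate: 0.01, chargebackRate: 0.005, mandateViolations: 6, memoryIntegrity: 0.4 });

// A poisoned memory store and a run of mandate violations crater the score,
// so downstream agent B's threshold check refuses the ingest.
console.log(poisonedA < 600); // true
```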

Code, not claims

the whole point of the paper is that talk about defenses doesn't cut it anymore. so here are three calls. each one addresses a different trap.

import { MnemoPay } from "@mnemopay/sdk";
const mp = MnemoPay.quick("agent-b");

// Trap 6: score the upstream before consuming.
const upstream = await mp.agent_fico_score("agent-a");
if (upstream.score < 600) throw new Error("untrusted source");

// Trap 4: verify the memory log before acting on recall.
const integrity = await mp.memory_integrity_check();
if (!integrity.ok) throw new Error("memory tampered");

// Trap 3: route the spend through the approval boundary.
const charge = await mp.charge_request({
  amount: 240.00, currency: "USD", to: "acme-inc",
  mandate: { per_charge_cap: 500, per_day_cap: 2000 }
});
// charge stays pending until a signed human approval lands.

for the physical side, the GridStamp call pattern is even shorter:

import { createAgent } from "gridstamp";
const drone = createAgent({ id: "drone-042" });
const proof = await drone.stamp({ intent: "delivery-complete" });
const ok    = await drone.verify(proof);  // false if any of the six layers fails.

The paper is a good checklist. The defenses that work don't live in the prompt. They live in the ledger, the memory chain, and the score.

What the paper gets right, and what's missing

right: the taxonomy, the empirical rates, the honesty about prompt-level failure. if you build on agents and you haven't read it, it's the best 40 minutes you'll spend this week. right also: the emphasis on the cascade class — it's the attack that compounds fastest, and it's the one most existing agent frameworks have the least story about.

missing: the paper treats the agent as the unit of trust. it asks "how do we keep this agent from being fooled?" a better question is "what are the things an agent does that we can refuse to settle if it's been fooled?" that's a different engineering problem. it's the payments problem. it's the credit problem. it's the presence problem. the answer isn't a better prompt. the answer is a ledger.

every attack in the paper reduces to the same shape: the agent was induced to take an action that wouldn't have held up under independent verification. so you build the independent verifier. you attach it to the things that cost money or move atoms. you score the agents based on how often their verifiers passed. over time, the bad agents have bad scores, the good ones have good scores, and the cascade dies at the first scoring boundary.

that's the whole thesis. MnemoPay is the memory + ledger + score layer. GridStamp is the presence layer. both are Apache 2.0, both are one npm install away, both are in production. if the DeepMind paper made you nervous, the answer is not to add more prompts. it's to add more boundaries.

Install tonight

npm i @mnemopay/sdk
npm i gridstamp

both packages work standalone. the MCP server for MnemoPay is live at @mnemopay/mcp (it's what Claude, Cursor and Goose all talk to) — install that and your coding agent itself picks up a score, a memory chain, and a mandate. the irony is the defense.

further reading: the primer on Agent FICO, receipts-beat-claims, the proof-of-presence primer. and the DeepMind paper itself is worth reading in full — SecurityWeek's summary is a good on-ramp if you don't want all 64 pages.

— Jerry Omiagbo