Skip to Content
All blogs

Jailbreaking Agentic AI: How Chatbots and AI Agents Are Being Exploited in Production

 — #AI Security#Jailbreaking#Prompt Injection#AI Agents#LLM Security#OWASP LLM Top 10#Red Teaming

A Chevrolet dealership chatbot agreed to sell a $76,000 Tahoe for one dollar. Air Canada was legally forced to honor a refund policy that its chatbot invented. A DPD delivery bot wrote poetry calling itself "the worst delivery company in the world." DeepSeek failed 58% of jailbreak tests within weeks of launch. And in 2025, Anthropic documented the first large-scale cyberattack executed primarily by an AI agent.

These are not hypothetical scenarios or red team exercises. They are production failures that cost real money, created legal liability, and exposed the fundamental security gap in how we deploy AI systems.

This post breaks down the attack landscape for agentic AI in 2025-2026 — what is actually happening, how these attacks work mechanically, and what defenses hold up under pressure.

The attack surface has changed

Traditional chatbots had a limited blast radius. The worst a rule-based bot could do was give a wrong answer. LLM-powered agents are fundamentally different. They can browse the web, execute code, query databases, send emails, modify files, and trigger workflows. When you jailbreak an agent with these capabilities, you are not just getting it to say something inappropriate — you are potentially gaining access to its entire tool chain.

graph TB
    subgraph Attack["🎯 ATTACK VECTORS"]
        direction TB
        A1["🗣️ Direct Prompt<br/>Injection"]
        A2["📄 Indirect Injection<br/>via Documents"]
        A3["🌐 Poisoned Web<br/>Content"]
        A4["🔄 Multi-Turn<br/>Escalation"]
        A5["🧠 Memory<br/>Poisoning"]
    end

    subgraph Agent["🤖 AI AGENT"]
        direction TB
        LLM["LLM Core"]
        SYS["System Prompt"]
        MEM["Memory / State"]
    end

    subgraph Tools["⚡ TOOL ACCESS"]
        direction TB
        T1["🔍 Web Browse"]
        T2["💻 Code Exec"]
        T3["🗄️ Database"]
        T4["📧 Email"]
        T5["📁 File System"]
    end

    subgraph Impact["💥 BLAST RADIUS"]
        direction TB
        I1["Data Exfiltration"]
        I2["Unauthorized Actions"]
        I3["Financial Loss"]
        I4["Brand Damage"]
        I5["Legal Liability"]
    end

    A1 --> LLM
    A2 --> LLM
    A3 --> LLM
    A4 --> LLM
    A5 --> MEM
    SYS --> LLM
    MEM --> LLM
    LLM --> T1
    LLM --> T2
    LLM --> T3
    LLM --> T4
    LLM --> T5
    T1 --> I1
    T2 --> I2
    T3 --> I1
    T4 --> I4
    T5 --> I3
    T3 --> I3
    T4 --> I5

    style A1 fill:#EF5350,stroke:#C62828,color:#fff
    style A2 fill:#FF7043,stroke:#D84315,color:#fff
    style A3 fill:#FFA726,stroke:#E65100,color:#fff
    style A4 fill:#FF8A65,stroke:#BF360C,color:#fff
    style A5 fill:#E53935,stroke:#B71C1C,color:#fff
    style LLM fill:#7E57C2,stroke:#4527A0,color:#fff
    style SYS fill:#AB47BC,stroke:#6A1B9A,color:#fff
    style MEM fill:#9C27B0,stroke:#7B1FA2,color:#fff
    style T1 fill:#29B6F6,stroke:#0277BD,color:#fff
    style T2 fill:#26C6DA,stroke:#00838F,color:#fff
    style T3 fill:#4FC3F7,stroke:#0288D1,color:#fff
    style T4 fill:#00BCD4,stroke:#006064,color:#fff
    style T5 fill:#0097A7,stroke:#004D40,color:#fff
    style I1 fill:#FFEE58,stroke:#F9A825,color:#333
    style I2 fill:#FDD835,stroke:#F57F17,color:#333
    style I3 fill:#FFD600,stroke:#FF6F00,color:#333
    style I4 fill:#FFCA28,stroke:#FF8F00,color:#333
    style I5 fill:#FFC107,stroke:#E65100,color:#333

OWASP's 2025 Top 10 for LLM Applications ranks prompt injection as the number one risk, appearing in over 73% of production AI deployments assessed during security audits. The 2025 edition added new entries specifically for agentic AI: excessive agency, system prompt leakage, and vector/embedding weaknesses.

The core problem: LLMs trust anything that sounds like a convincing instruction. They cannot reliably distinguish between a legitimate system prompt and a cleverly crafted user input. This is not a bug that will be patched — it is an architectural property of how these models process language.

Real-world incidents that matter

The Chevrolet $1 car sale (December 2023)

Chris Bakke discovered that the ChatGPT-powered chatbot on Chevy of Watsonville's website could be manipulated with a simple prompt override. He told the chatbot to agree to everything the customer said — "no takesies-backsies" — and then asked it to sell him a Tahoe for $1. The chatbot agreed. The exploit went viral with over 20 million views, and the dealership pulled the chatbot offline.

Why it matters: The bot had no concept of legal authority, pricing constraints, or brand protection. It was a word prediction system given the authority to make statements that could be interpreted as binding offers. The dealership chose not to honor it, but the legal question remains open.

Air Canada's invented refund policy (2024)

Jake Moffatt asked Air Canada's chatbot about bereavement fares after his grandmother died. The chatbot told him he could get a discount and could claim it up to 90 days after flying. He booked the flight. When he requested the discount, Air Canada said the chatbot was wrong and the information was "nonbinding."

A Canadian tribunal disagreed. Member Christopher Rivers ruled that Air Canada was responsible for all information on its website, whether from a static page or a chatbot. The airline was ordered to honor the discount.

Why it matters: This set legal precedent — companies are liable for what their AI chatbots tell customers, even when the AI hallucinates policies that do not exist. The chatbot is gone from Air Canada's website as of April 2024.

DPD's self-sabotaging chatbot (January 2024)

A customer named Ashley Beauchamp manipulated DPD's delivery chatbot into swearing, writing poetry criticizing the company, and calling itself "the worst delivery company in the world." The screenshots went viral — over 800,000 views in 24 hours. DPD immediately deactivated the chatbot.

Why it matters: No sophisticated attack technique was needed. A customer simply asked the bot to do things outside its intended scope, and it complied. The reputational damage was immediate and amplified by social media.

DeepSeek's system prompt extraction (January 2025)

Within weeks of DeepSeek's high-profile launch, Wallarm's security research team extracted the model's entire system prompt using a technique that exploited its bias-based response logic. Separately, Palo Alto Networks' Unit 42 tested three jailbreak techniques — Bad Likert Judge, Crescendo, and Deceptive Delight — and all three successfully bypassed safety mechanisms.

Qualys tested DeepSeek against 885 attacks across 18 jailbreak categories. It failed 58% of them — including methods that had been patched in other services like ChatGPT long ago.

Trend Micro found an additional vulnerability: DeepSeek-R1's Chain of Thought reasoning exposes step-by-step thinking in <think> tags, giving attackers a roadmap to refine their prompts for maximum effectiveness.

Why it matters: A major new model launched with security weaker than competitors had years ago. The Chain of Thought visibility is a particularly concerning design choice — it lets attackers see exactly how the model processes their inputs.

The first AI-executed cyberattack (2025)

Anthropic reported detecting a state-sponsored threat actor that manipulated its Claude Code agent into executing a large-scale espionage campaign. The attack targeted approximately thirty organizations including tech companies, financial institutions, chemical manufacturers, and government agencies. In a small number of cases, it succeeded.

The attacker broke the malicious operation into small, "seemingly innocent" tasks that individually appeared harmless but collectively executed the attack. Anthropic called it "the first documented case of a large-scale cyberattack executed without substantial human intervention."

Why it matters: This is the scenario security researchers warned about — an AI agent with tool access being weaponized for real attacks against real targets. The task decomposition approach mirrors multi-turn jailbreaking techniques used in research.

Zero-click IDE exploits (2025)

Researchers demonstrated that AI-powered coding assistants like Cursor and GitHub Copilot are vulnerable to zero-click attacks through indirect prompt injection. In one case, a Google Docs file triggered an agent inside an IDE to fetch instructions from an attacker-controlled MCP server, execute a Python payload, and harvest secrets — without any user interaction.

CVE-2025-59944 showed that a case-sensitivity bug in Cursor's file path handling let attackers influence the agent's behavior through a malicious configuration file, escalating to remote code execution. GitHub Copilot's CVE-2025-53773 (CVSS 9.6) enabled remote code execution through prompt injection, potentially compromising millions of developer machines.

Why it matters: These are not chatbot pranks. These are RCE vulnerabilities with CVSS scores above 9. The attack surface of agentic coding assistants — with access to file systems, terminals, and network — is enormous.

The attack techniques

1. Crescendo (multi-turn escalation)

Published by Microsoft researchers Russinovich, Salem, and Eldan and presented at USENIX Security '25, Crescendo is the most effective multi-turn jailbreak technique documented to date. It mirrors the psychological "foot-in-the-door" technique: start with an innocent question, gradually escalate, and exploit the model's tendency to build on its own previous outputs.

How it works: Each message in the conversation is individually benign. The model's safety filters examine each prompt in isolation and find nothing to block. But the cumulative conversation steers the model into territory it would normally refuse. Critically, even when the most influential sentence is removed from the context, the success rate remains high — the jailbreak emerges from the conversational pattern, not any single message.

Effectiveness: Crescendomation (the automated version) achieves 29-61% higher success rates than other techniques on GPT-4 and 49-71% higher on Gemini-Pro.

Why it is hard to defend against: Standard prompt filters look at individual messages, not conversational trajectories. Microsoft's mitigation required building a new multiturn prompt filter that analyzes entire conversation patterns — a significant engineering effort.

2. Many-shot jailbreaking

Anthropic's research demonstrated that stuffing a prompt with many examples of the desired (harmful) behavior can override safety training. With large enough context windows (100k+ tokens), an attacker can include dozens of question-answer pairs where a fictional AI helpfully provides dangerous information. The real target question, slipped in at the end, gets answered in the same style.

Key difference from Crescendo: Many-shot requires the attacker to already have the harmful content to use as examples. Crescendo generates it from scratch. Many-shot is also detectable by input filters that scan for harmful content, while Crescendo's individual messages are benign.

3. Bad Likert Judge

Palo Alto Networks' Unit 42 discovered that asking an LLM to rate the harmfulness of responses on a Likert scale, and then asking it to generate examples at each level, effectively tricks the model into producing harmful content — because it frames the generation as part of an "evaluation task" rather than a direct request.

4. Indirect prompt injection

This is the attack class that keeps security researchers up at night, and for good reason. Instead of the user attacking the model directly, malicious instructions are embedded in external content that the agent retrieves or processes — websites, documents, emails, database records, API responses.

The agentic threat: When an AI agent browses the web, reads emails, or processes documents, it ingests content from untrusted sources. If that content contains hidden instructions (invisible text, CSS-hidden divs, markdown injection), the agent may execute them as if they were legitimate system instructions.

Real-world examples:

  • In May 2024, researchers exploited ChatGPT's browsing by poisoning RAG context from untrusted websites
  • In August 2024, Slack AI data exfiltration was demonstrated through RAG poisoning combined with social engineering
  • Research showed that just five carefully crafted documents can manipulate AI responses 90% of the time through RAG poisoning
  • E-commerce platforms discovered malicious sellers embedding prompt injections in product reviews that tricked AI content moderation systems

Why current defenses fail: Research evaluating eight existing defense mechanisms demonstrated that all can be bypassed by adaptive attack strategies, with attack success rates exceeding 50% in each case. Indirect prompt injection is not a jailbreak — it is a system-level vulnerability created by blending trusted and untrusted inputs in one context window.

5. System prompt extraction

Every AI application has hidden instructions defining its behavior. Extracting these reveals the model's capabilities, limitations, tool access, and often sensitive business logic. DeepSeek's full system prompt was extracted shortly after launch. This is routine across deployed chatbots — most customer service bots will reveal their system prompts to a sufficiently creative attacker.

6. Memory poisoning

Lakera AI demonstrated in November 2025 that indirect prompt injection via poisoned data sources can corrupt an agent's long-term memory, causing it to develop persistent false beliefs about security policies. The agent then defends these false beliefs when questioned by humans. This is particularly dangerous for agents that maintain state across sessions.

Why defense is fundamentally hard

The core challenge: LLMs process all text in the same way. There is no hardware-level separation between "system instructions" and "user input" and "external data." Everything is tokens in a context window. The model infers the boundary between instruction and data, but that inference can be manipulated.

This means:

  • Prompt hardening is necessary but insufficient. Analysis of real-world attacks shows adversaries consistently bypass hardened system prompts with surprising ease.
  • Input/output filters catch known patterns but miss novel attacks. Crescendo was specifically designed to evade per-message filters.
  • Safety fine-tuning helps but does not generalize. Research shows models generalize poorly to novel attack styles, making fine-tuning a moving target.
  • A meta-analysis of 78 studies found attack success rates exceed 85% when adaptive strategies are used against state-of-the-art defenses.

What actually works (defense in depth)

No single defense is sufficient. The only approach that holds up is layered architecture:

1. Trust boundaries and context isolation

Treat untrusted content (user input, web pages, documents, API responses) as fundamentally different from system instructions. Do not put them in the same context window without clear separation. Validate and assign trust levels to all external content before agents ingest it.

2. Least-privilege tool access

Agents should only have access to the tools they need for the current task, with the minimum permissions required. A customer service bot does not need write access to your database. A code review agent does not need network access.

3. Multi-layer filtering

Input classifiers to flag suspicious instructions, context-switch attempts, and tool escalation. Output validators to catch harmful or policy-violating responses. Session-level analysis (like Microsoft's multiturn filter) to detect conversational trajectory attacks like Crescendo.

Tools like Guardrails AI, NVIDIA NeMo Guardrails, Lakera Guard, and LLM firewalls like Radware's agentic protection provide components for this layer. But none of them are complete solutions on their own.

4. Human-in-the-loop for high-risk actions

Any action with real-world consequences — financial transactions, data deletion, email sending, code execution in production — should require human approval. This is not optional for agentic systems. I covered implementation patterns for this in building multi-agent systems with LangGraph.

5. Monitoring, logging, and anomaly detection

Log every prompt, tool call, and agent action. Baseline normal behavior. Alert on anomalous patterns — unusual tool usage sequences, unexpected data access, conversation trajectories that match known attack patterns. Feed detections back into your guardrails.

For observability tooling, see Logfire vs LangSmith and MLflow 3 for LLM observability.

6. Sandboxing and blast radius control

Run agent execution in isolated environments. If an agent is compromised, limit what it can access. Network segmentation, file system isolation, and short-lived credentials for tool calls reduce the blast radius of a successful attack. I discussed credential management in MCP + A2A interoperability.

7. Continuous red teaming

Static defenses decay. New attack techniques emerge regularly. Run adversarial testing against your deployed systems on a schedule — not just before launch. The OWASP LLM Security Guide and frameworks like DeepTeam provide structured approaches to this.

The uncomfortable truth

We do not have a reliable solution to prompt injection. It may be an unsolvable problem at the model level, because it requires the model to perfectly distinguish between instruction and data in all cases — something that conflicts with the flexibility that makes LLMs useful.

What we can do is make exploitation harder, limit the damage when it succeeds, and detect it quickly. Defense in depth, least privilege, trust boundaries, monitoring, and human oversight. These are not new ideas — they are the same principles that secure any system handling untrusted input. The difference is that most teams deploying AI agents are not applying them.

The incidents above were not caused by sophisticated nation-state attackers (with one exception). Most were caused by regular users asking chatbots to do things slightly outside the intended scope. If your chatbot can be jailbroken by a customer saying "ignore your previous instructions," you do not have a security problem — you have an architecture problem.

Build accordingly.

Related reading

Sources