AI red teaming is adversarial testing of AI systems to find security vulnerabilities, safety failures, and compliance gaps before attackers do. 73% of organisations that have deployed AI have at least one critical exploitable vulnerability (OWASP State of AI Security, 2025). The AI red teaming market hit USD 1.3 billion in 2025 and is projected to reach USD 18.6 billion by 2035 at 30.5% CAGR (Market.us, 2025). The McKinsey Lilli breach in February 2026 exposed 46.5 million internal messages in under two hours through basic AI-specific flaws that adversarial testing would have caught.

Traditional red teaming targets networks, applications, and people. AI red teaming targets the attack surface unique to machine learning: training data, model weights, inference APIs, system prompts, retrieval pipelines, tool-use capabilities, and emergent behaviours.

What Is AI Red Teaming?

AI red teaming is a structured assessment where security professionals attack AI systems to find weaknesses that automated scanning and standard reviews miss. It covers technical security (prompt injection, model extraction, data poisoning), safety evaluation (harmful outputs, bias, hallucinations), and compliance verification (EU AI Act, NIST AI RMF).

The term went mainstream after the October 2023 White House Executive Order on AI Safety mandated red teaming for frontier models. Since then, every major AI lab has established formal red teaming programmes. Regulatory bodies worldwide have followed.

How AI Red Teaming Differs from Traditional Red Teaming

DimensionTraditional Red TeamingAI Red Teaming
TargetsNetworks, applications, peopleModels, prompts, training data, inference APIs, RAG pipelines
Attack vectorsExploitation, social engineering, physicalPrompt injection, data poisoning, model manipulation, jailbreaking
Vulnerability typesCVEs, misconfigurations, access controlEmergent behaviours, alignment failures, data leakage, hallucination
Testing approachDeterministic exploitationProbabilistic probing (outputs vary per run)
Skill requirementsOffensive security, networking, OSML/AI expertise, linguistics, security, domain knowledge
ToolingCobalt Strike, Metasploit, Burp SuiteGarak, PyRIT, custom harnesses, adversarial ML libraries
ReportingMITRE ATT&CK mappingOWASP Top 10 for LLMs, MITRE ATLAS mapping

Why AI Red Teaming Is Urgent Now

Three forces are converging.

Regulatory mandates are imminent. The EU AI Act’s high-risk AI requirements take full effect on August 2, 2026. Article 9 requires adversarial testing. Non-compliance carries penalties up to EUR 35 million or 7% of global annual turnover.

AI attack surfaces are expanding fast. Enterprise AI has moved from simple chatbots to autonomous agents with tool use, database access, code execution, and internet connectivity. The average enterprise AI deployment now has 14.3 distinct attack surface components, up from 3.2 in 2023 (Gartner, 2025).

AI breaches are happening now. The McKinsey Lilli breach (February 2026), EchoLeak in Microsoft Copilot (2025), and CVE-2025-59536 in Claude Code prove AI vulnerabilities are being actively exploited, not just theorised about.

OWASP Top 10 for LLM Applications

The OWASP Top 10 for LLMs is the standard taxonomy for AI-specific vulnerabilities. Red teamers use it to structure assessments and communicate findings.

LLM01: Prompt Injection

Attackers manipulate an LLM’s behaviour by injecting malicious instructions through user input (direct) or external data sources (indirect). Prompt injection is exploitable in some form in virtually every LLM deployment that accepts user input (OWASP, 2025).

Attacks: Direct override of system prompts. Indirect injection via documents the AI processes. Multi-turn manipulation across a conversation.

Impact: Unauthorised access, data exfiltration, system prompt leakage, arbitrary action execution through tool use.

See Prompt Injection Attacks for the full technical breakdown.

LLM02: Sensitive Information Disclosure

LLMs leak sensitive data through responses: training data, system prompts, PII, business logic. Dangerous in RAG systems where the model accesses internal knowledge bases.

Attacks: Training data extraction. System prompt extraction through role-playing. Cross-tenant data leakage in multi-tenant RAG.

LLM03: Supply Chain Vulnerabilities

AI supply chains involve pre-trained models, fine-tuning datasets, embedding models, vector databases, plugins, and third-party APIs. Any compromised component propagates through the system.

Attacks: Poisoned models from public repositories. Malicious plugins. Compromised training datasets with backdoor triggers.

LLM04: Data and Model Poisoning

Attackers manipulate training, fine-tuning, or augmentation data to corrupt model behaviour.

Attacks: Injecting manipulated documents into RAG knowledge bases. Clean-label poisoning of fine-tuning data. Gradient attacks on weights during distributed training.

See Model Poisoning and Training Data Attacks for details.

LLM05: Improper Output Handling

LLM outputs passed to downstream systems without validation trigger vulnerabilities in those systems: SQL injection, XSS, SSRF, command injection.

Attacks: LLM-generated SQL executed without parameterisation. Output rendered as HTML without sanitisation. Generated code executed without sandboxing.

LLM06: Excessive Agency

LLMs with too many permissions become weapons when their behaviour is compromised. This is the critical risk in AI agent architectures.

Attacks: Agents with unrestricted database write access (McKinsey Lilli). Coding assistants with unrestricted file system access. AI systems that send emails or execute transactions without human approval.

LLM07: System Prompt Leakage

System prompts contain sensitive details: behaviour rules, guardrail logic, internal APIs, database schemas, business logic. Extraction gives attackers a blueprint.

Attacks: Direct extraction via conversational manipulation. Indirect extraction through multi-turn interactions. Side-channel extraction from output behaviour analysis.

LLM08: Vector and Embedding Weaknesses

Vector databases and embedding systems in RAG architectures can be manipulated to alter retrieval, extract stored documents, or inject malicious content.

Attacks: Cross-tenant queries in shared vector databases. Adversarial documents crafted to rank highly. Embedding inversion to reconstruct originals. In the Lilli breach, 266,000+ vector store entries were accessible.

LLM09: Misinformation

LLMs generate plausible but false information. They can be manipulated to produce targeted misinformation that is trusted and acted upon.

Attacks: Deliberately induced hallucinations. RAG source poisoning. Exploiting confidence calibration.

LLM10: Unbounded Consumption

Inputs designed to exhaust computational resources, cause denial of service, or generate disproportionate API costs.

Attacks: Maximum token generation triggers. Recursive tool-calling loops. Inference endpoint flooding.

AI Red Teaming Methodology: 7 Steps

Based on NIST AI RMF, MITRE ATLAS, and field experience from AI security assessments.

Step 1: Scoping and Threat Modelling

Inventory all AI system components: models, APIs, data pipelines, RAG systems, agent tools, vector stores. Classify the system under the EU AI Act. Map data flows from input through inference to output and downstream actions. Identify threat actors and define success criteria.

Output: Architecture diagram, threat model, scope document, rules of engagement.

Step 2: Authentication and Authorisation Review

Test every API endpoint for authentication requirements. Verify authorisation on tool use and data access. Check for privilege escalation through agent permissions. Test multi-tenant isolation. Review API key management.

41% of enterprise AI deployments have at least one unauthenticated API endpoint exposing sensitive functionality (CodeWall, 2026).

Step 3: Data Exposure Assessment

Attempt system prompt extraction. Test for training data leakage through membership inference. Probe RAG systems for cross-tenant leakage. Test vector store access controls. Evaluate PII handling.

Step 4: Prompt Injection Testing

Direct prompt injection with escalating payloads. Indirect injection through documents and data sources the AI processes. Multi-turn manipulation. Jailbreak testing. Tool-use exploitation.

Tools: Garak (NVIDIA), PyRIT (Microsoft), custom adversarial libraries, manual expert testing.

Step 5: Model Poisoning and Integrity Assessment

Review training data provenance. Test RAG knowledge bases for unauthorised document injection. Evaluate model update pipeline security. Test for backdoor triggers. Assess embedding model integrity.

Step 6: Output Safety and Downstream Impact

Test for injection attacks through AI output. Evaluate content safety filter bypass. Test AI-generated code for security flaws. Assess hallucination rates in safety-critical contexts. Test output validation controls.

In the Lilli breach, the system prompt had write-level SQL access. Prompt injection led directly to data manipulation, not just exfiltration.

Step 7: Reporting

Map findings to OWASP Top 10 for LLMs and MITRE ATLAS. Classify by severity (likelihood x impact). Provide specific remediation for each finding. Include evidence. Document EU AI Act compliance status.

Output: Executive summary, technical findings, OWASP/ATLAS mapping, remediation roadmap, compliance gap analysis.

Case Study: The McKinsey Lilli Breach (February 2026)

The largest AI security incident to date. A definitive argument for AI red teaming.

What Was Exposed

MetricValue
Chat messages46.5 million
Files728,000
User accounts57,000
Vector store entries266,000+
RAG knowledge chunks3.68 million
Time to full accessApproximately 2 hours
Attack complexityLow (SQL injection in unauthenticated endpoints)

Source: Chris Olsen (xyzeva) technical analysis, February 28, 2026.

What Went Wrong

Unauthenticated API endpoints. Multiple endpoints provided direct access to backend databases without authentication.

SQL injection in the AI backend. Unauthenticated endpoints were vulnerable to SQL injection, enabling full data enumeration and extraction.

System prompt with excessive privileges. The Lilli system prompt granted write-level SQL access to production databases. The AI had far more permissions than its function required.

Vector store exposure. 266,000+ OpenAI vector store entries were accessible, containing embedded internal McKinsey documents and client materials.

RAG knowledge base exposure. 3.68 million chunks representing Lilli’s entire internal knowledge base, including confidential client engagement materials.

What Red Teaming Would Have Found

  • Step 2 (auth review) catches unauthenticated endpoints immediately
  • Step 3 (data exposure) discovers vector store and RAG leakage
  • Step 4 (prompt injection) reveals SQL injection through the AI interface with write access
  • Step 6 (output safety) flags excessive database privileges in the system prompt

Every one of these vulnerabilities maps to the OWASP Top 10 for LLMs. Every one was preventable.

EU AI Act Requirements for Adversarial Testing

Compliance Deadlines

DateRequirement
February 2, 2025Prohibitions on unacceptable AI practices
August 2, 2025GPAI model requirements
August 2, 2026Full enforcement of high-risk AI system requirements

What Article 9 Requires

Providers of high-risk AI systems must implement risk management that includes:

  • Testing against reasonably foreseeable misuse, including adversarial attacks
  • Performance evaluation under stress conditions, including adversarial inputs
  • Bias testing across protected characteristics
  • Robustness testing against perturbation attacks

What Article 55 Requires (Systemic Risk GPAI)

Models trained with more than 10^25 FLOPs must undergo:

  • Adversarial testing to identify systemic risks
  • Red teaming to evaluate capabilities and limitations
  • Documentation of testing methods and results

Penalties

ViolationMaximum Penalty
Prohibited AI practicesEUR 35 million or 7% of global turnover
High-risk AI obligationsEUR 15 million or 3% of global turnover
Incorrect informationEUR 7.5 million or 1% of global turnover

See EU AI Act Security Requirements for the full regulatory breakdown.

Market Growth

YearSize (USD)
20230.5 billion
20251.3 billion
2028 (projected)5.2 billion
2030 (projected)9.8 billion
2035 (projected)18.6 billion

CAGR 2025-2035: 30.5%. Source: Market.us, 2025.

Key Numbers

  • 73% of AI deployments have exploitable vulnerabilities (OWASP, 2025)
  • Only 12% of organisations have formal AI red teaming programmes (Gartner, 2025)
  • 67% of CISOs rank AI security as their top 2026 concern, up from 23% in 2024 (SANS, 2025)
  • 89% plan to increase AI security spending in 2026 (IBM, 2025)

What Is Changing

Continuous AI red teaming (CART-AI). Moving from periodic assessments to continuous testing in CI/CD pipelines. AI behaviour changes with every model update, RAG addition, and prompt modification.

AI-augmented red teaming. Using AI to red team AI. PyRIT and Garak automate adversarial prompt generation at scale beyond human capacity.

Multi-modal testing. AI systems processing images, audio, video, and code require red teaming across all input types.

Agent-specific red teaming. AI agents with code execution, web browsing, and API access need testing that covers the full range of agent actions, not just conversational outputs.

Regulatory-driven demand. EU AI Act (August 2026), proposed US legislation, and emerging frameworks in Singapore, Japan, and the UK are creating global mandates for adversarial AI testing.

Tools and Frameworks

Open-Source Tools

ToolMaintainerFocus
GarakNVIDIALLM vulnerability scanning. Automated prompt injection, configurable probes, multi-model support
PyRITMicrosoftRed teaming orchestration. Multi-turn attacks, scoring, Azure AI integration
CounterfitMicrosoftAdversarial ML. Evasion, extraction, inversion
ARTIBMML robustness. Evasion, poisoning, extraction, inference
TextAttackQData LabNLP adversarial. Text perturbation and augmentation

Garak is the most widely adopted open-source scanner at 7,300+ GitHub stars and 34% adoption among organisations testing AI security (GitHub, 2026; Gartner, 2025). PyRIT leads multi-turn orchestration at 28% adoption (Gartner, 2025).

Commercial Platforms

PlatformFocusDifferentiator
HiddenLayerML security monitoringRuntime protection and model scanning
Robust Intelligence (NVIDIA)AI validationContinuous testing and monitoring
LakeraLLM securityReal-time prompt injection detection
Protect AIML supply chainModel scanning and pipeline security
CalypsoAIAI governancePolicy enforcement and testing

Framework Mapping

Map findings to established frameworks for consistent reporting:

  • OWASP Top 10 for LLMs for vulnerability classification
  • MITRE ATLAS for adversarial technique IDs (ATT&CK-style)
  • NIST AI RMF for risk management (Govern, Map, Measure, Manage)
  • ISO/IEC 42001 for organisational AI governance

Building an AI Red Team Programme

Team Composition

  • AI/ML engineers who understand model architectures, training pipelines, inference
  • Security engineers with offensive experience (pen testing, exploit development)
  • Prompt engineers with deep LLM manipulation knowledge
  • Domain experts who understand business context and failure impact
  • Ethics/safety specialists for bias, fairness, and safety evaluation

Maturity Levels

LevelDescriptionCapabilities
1: Ad hocNo formal programmeSporadic manual prompt testing
2: DevelopingBasic AI security testingAutomated scanning with Garak/PyRIT, manual injection testing
3: DefinedStructured programmeFull OWASP Top 10 coverage, threat-informed testing, regulatory mapping
4: ManagedContinuous red teamingCART-AI in CI/CD, metrics-driven improvement, purple team exercises
5: OptimisingAdvanced adversarial programmeNovel attack research, framework contributions, multi-modal testing

Fewer than 5% of organisations have reached Level 3 (Gartner, 2025).

Costs

EngagementCost (USD)Duration
Automated LLM scanning10,000 to 25,0001 to 2 weeks
Focused prompt injection assessment25,000 to 50,0002 to 3 weeks
Full AI red team assessment75,000 to 200,0004 to 8 weeks
Continuous AI red teaming (annual)150,000 to 500,000Ongoing
EU AI Act compliance assessment50,000 to 150,0003 to 6 weeks

Frequently Asked Questions

How does AI red teaming differ from traditional pen testing?

AI red teaming targets vulnerabilities unique to AI: prompt injection, model poisoning, data leakage through AI responses, jailbreaking, excessive agency. Traditional pen testing targets infrastructure and application flaws. A complete AI security assessment includes both.

How often should AI systems be tested?

Quarterly assessments at minimum, with continuous automated testing between engagements. AI behaviour shifts with every model update, RAG document addition, and prompt change. Point-in-time annual testing is not sufficient.

Can AI red teaming be fully automated?

No. Automated tools find 40 to 60% of what combined human-automated testing discovers (NVIDIA AI Red Team Research, 2025). Use automated scanning for known patterns. Use expert humans for novel attacks.

What qualifications should AI red teamers have?

Cybersecurity credentials (OSCP, OSCE, CRTO) combined with AI/ML knowledge (transformer architectures, training pipelines, embedding systems). Specialised certifications are emerging: OWASP AI Security, SANS AI Red Team (expected 2026).

How does AI red teaming satisfy EU AI Act compliance?

Article 9 requires adversarial testing for high-risk AI. Article 55 requires red teaming for systemic risk GPAI models. AI red teaming generates the documented evidence of testing, vulnerabilities, and mitigations that regulators expect.

Summary

  1. AI red teaming requires different skills, tools, and methods than traditional red teaming.
  2. 73% of AI deployments have exploitable vulnerabilities (OWASP, 2025).
  3. The OWASP Top 10 for LLMs is the standard vulnerability taxonomy.
  4. The 7-step methodology (scoping, auth, data exposure, prompt injection, poisoning, output safety, reporting) covers the AI attack surface.
  5. The McKinsey Lilli breach (46.5M messages in 2 hours) proves what happens without adversarial testing.
  6. EU AI Act enforcement begins August 2, 2026. Penalties reach EUR 35 million or 7% of global turnover.
  7. The market grows from USD 1.3 billion to USD 18.6 billion by 2035.
  8. Every organisation deploying AI needs an adversarial testing programme.

Sources

  • OWASP. “OWASP Top 10 for Large Language Model Applications, v2.0.” 2025.
  • OWASP. “State of AI Security Report.” 2025.
  • Market.us. “AI Security Market Report.” 2025.
  • IBM. “AI Security Report.” 2025.
  • Gartner. “AI Security Survey: State of Enterprise AI Protection.” 2025.
  • CodeWall. “AI Security Report 2026.” 2026.
  • NIST. “AI Risk Management Framework (AI RMF 1.0).” 2023.
  • MITRE. “ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems.” 2025.
  • European Commission. “Regulation (EU) 2024/1689 (AI Act).” 2024.
  • Olsen, Chris (xyzeva). “McKinsey Lilli: 46.5M Messages, 728K Files Exposed.” February 28, 2026.
  • SANS Institute. “CISO Survey 2025.” 2025.
  • Mandiant. “M-Trends 2026.” 2026.
  • NVIDIA/garak on GitHub — confirms ~7,300 GitHub stars as of March 2026