arXiv 2026  ·  arXiv:2602.11510

AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems

Faouzi El Yagoubi¹  ·  Godwin Badu-Marfo²  ·  Ranwa Al Mallah³
¹ ² ³ Polytechnique Montreal
Paper (arXiv)  ·  Code (GitHub)  ·  Awesome List
If this work helps you, please give us a ⭐ on GitHub and cite our paper!
1,000 Evaluation Scenarios  ·  68.8% Inter-Agent Leakage Rate  ·  41.7% Violations Missed by Output-Only Audits  ·  5 LLM Models Evaluated  ·  4 Real-World Domains

Abstract

Multi-agent LLM systems are increasingly deployed in sensitive domains, yet their inter-agent privacy leakage remains critically underexplored. We introduce AgentLeak, the first full-stack benchmark for systematically evaluating privacy leakage across the entire communication pipeline of a multi-agent LLM system, not just at its final output. Our evaluation of 1,000 scenarios across the healthcare, finance, legal, and corporate domains reveals 68.8% inter-agent leakage versus only 27.2% at the output layer, demonstrating that output-only monitoring misses 41.7% of violations. We evaluate five representative LLMs and show that all of them exhibit significant privacy vulnerabilities once intermediate agent communications are examined. AgentLeak establishes a rigorous, reproducible evaluation standard and calls for full-stack auditing in multi-agent deployments.

Key Findings

🔍 Output-Only Monitoring is Insufficient

Standard output-layer audits miss 41.7% of privacy violations: sensitive data propagates through intermediate agent communications before being filtered at the output layer.

📡 High Inter-Agent Leakage

Across all tested models, inter-agent channels show 68.8% leakage, 2.5× higher than the 27.2% observed at outputs alone.

🏥 Cross-Domain Vulnerability

Privacy risks are pervasive across all four tested domains: healthcare, finance, legal, and corporate.

🤖 All Models Affected

Every evaluated LLM exhibits significant inter-agent privacy leakage, a systemic architectural challenge rather than a model-specific bug.

Benchmark Framework

AgentLeak Benchmark Framework — three-phase pipeline: Benchmark Authoring, Execution & Instrumentation, Evaluation & Reporting

Figure 3 — AgentLeak's three-phase pipeline: (1) Benchmark Authoring (scenario specs, private vault, task library, experiment matrix), (2) Execution & Instrumentation (framework adapters for LangChain/CrewAI/AutoGPT/MetaGPT, unified trace store capturing all 7 channels C1–C7), (3) Evaluation & Reporting (leakage analyzer, utility scorer, metrics engine, leaderboard).
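The unified trace store in phase (2) can be pictured as an append-only log that each framework adapter writes every message into, tagged with the channel it traveled on, so the leakage analyzer can audit all channels rather than only the final output. The sketch below is a minimal illustration under that reading, not the benchmark's actual code: the `TraceEvent` schema, the channel semantics, and the agent names are assumptions.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class TraceEvent:
    """One captured message on a communication channel (hypothetical schema)."""
    channel: str   # e.g. "C2"; channel IDs C1-C7 follow the paper, semantics assumed
    sender: str
    receiver: str
    content: str
    ts: float

class TraceStore:
    """Append-only store a framework adapter writes every message to."""
    def __init__(self):
        self.events = []

    def record(self, channel, sender, receiver, content):
        self.events.append(TraceEvent(channel, sender, receiver, content, time.time()))

    def dump(self):
        # Serialize the full trace for the downstream leakage analyzer.
        return json.dumps([asdict(e) for e in self.events], indent=2)

# Hypothetical usage: an inter-agent message and a final user-facing message.
store = TraceStore()
store.record("C2", "planner", "researcher", "Patient SSN is ...")
store.record("C7", "assistant", "user", "Here is the summary.")
```

With every channel captured in one place, an auditor can scan the inter-agent events (here, the `C2` record) for sensitive values that never reach the final `C7` output.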

Main Results

| Model          | Inter-Agent Leakage | Output Leakage | Missed by Output-Only |
|----------------|---------------------|----------------|-----------------------|
| GPT-4o         | 71.2%               | 29.4%          | 41.8%                 |
| Claude 3 Opus  | 65.4%               | 24.1%          | 41.3%                 |
| Gemini 1.5 Pro | 70.3%               | 28.7%          | 41.6%                 |
| LLaMA-3 70B    | 66.8%               | 25.8%          | 41.0%                 |
| Mistral Large  | 70.1%               | 27.8%          | 42.3%                 |
| Average        | 68.8%               | 27.2%          | 41.7%                 |

Full results and ablations in the paper.
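The three columns above can be reproduced from per-scenario leakage flags. The sketch below assumes a simple boolean schema (one inter-agent flag and one output flag per scenario) and reads "missed by output-only" as the fraction of scenarios that leak between agents but not in the final output; the paper's exact metric definitions may differ, and the data here is a toy example, not the benchmark's.

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    """Per-scenario leakage flags (hypothetical schema)."""
    inter_agent_leak: bool  # sensitive data appeared on any inter-agent channel
    output_leak: bool       # sensitive data appeared in the final output

def aggregate(results):
    """Compute the three table columns as fractions of all scenarios."""
    n = len(results)
    inter = sum(r.inter_agent_leak for r in results) / n
    output = sum(r.output_leak for r in results) / n
    # Violations invisible to an output-only audit: leaked between agents
    # but filtered out (or absent) by the time the final output is produced.
    missed = sum(r.inter_agent_leak and not r.output_leak for r in results) / n
    return inter, output, missed

# Toy illustration with 10 scenarios (not the paper's data):
results = (
    [ScenarioResult(True, True)] * 3     # leaks everywhere
    + [ScenarioResult(True, False)] * 4  # leaks only between agents
    + [ScenarioResult(False, False)] * 3 # no leak
)
inter, output, missed = aggregate(results)
print(f"inter-agent {inter:.0%}, output {output:.0%}, missed {missed:.0%}")
# → inter-agent 70%, output 30%, missed 40%
```

Note that under this reading, whenever output leaks are a subset of inter-agent leaks, the missed rate equals the gap between the two columns, which matches the table's 68.8% − 27.2% ≈ 41.7%.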

BibTeX

@article{elyagoubi2026agentleak,
  title   = {AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems},
  author  = {El Yagoubi, Faouzi and Badu-Marfo, Godwin and Al Mallah, Ranwa},
  journal = {arXiv preprint arXiv:2602.11510},
  year    = {2026}
}