CalypsoAI Model Security Leaderboards

Find the Right Model

Compare Security, Cost & Capabilities

The world’s major AI models and systems are vulnerable—we’ve proven it. The CalypsoAI Security Leaderboards rank top GenAI models based on real-world security testing, exposing critical risks overlooked by performance benchmarks. Powered by Inference Red-Team, these leaderboards are the only tools that help you find the safest model and stress test your AI system before you deploy.

The CASI Leaderboard

The top 10 performing models at resisting direct attacks such as prompt injection and jailbreaks.

CASI (CalypsoAI Security Index) is our benchmark score for measuring how vulnerable a model is to common prompt injection and jailbreak attacks. It evaluates how easily an LLM can be manipulated into producing harmful or policy-violating outputs.

A higher CASI score means a model is more secure against standard attack techniques.

The Agentic Leaderboard

The top 10 models at maintaining safe behavior during autonomous, real-world attacks.

AWR (Agentic Warfare Resistance) captures how well a model holds up under real-world, multi-step, and autonomous agent scenarios where simple safety checks often break down. It reflects a model’s ability to stay aligned and secure during complex workflows.

A higher AWR score signals lower risk and better performance under agentic pressure.

June 2025 Edition

June 2025 Edition
May 2025 Edition
April 2025 Edition
March 2025 Edition
Feb 2025 Edition

Updated 16th June, 2025

CASI Leaderboard

Model Name	CASI	Avg. Performance	RTP	CoS
Claude 4 Sonnet	95.12	60.78%	0.8	18.92
Claude 3.5 Sonnet	93.27	44.44%	0.69	19.3
Claude 3.7 Sonnet	87.24	57.39%	0.74	20.63
Claude 3.5 Haiku	85.69	34.74%	0.6	5.6
Phi4	81.44	40.22%	0.61	0.77
DeepSeek-R1-Distill-Llama-70B	73.96	48.24%	0.62	1.24
GPT-4o	68.13	41.46%	0.56	18.35
Llama 3.1 405b	64.65	40.49%	0.54	1.24
Qwen3-14B	60.82	55.72%	0.59	0.51
Qwen3-30B-A3B	58.61	55.60%	0.57	0.63

Agentic Leaderboard

Model Name	AWR	Avg. Performance	RTP	CoS
Claude 3.5 Sonnet	95.85	44.44%	0.78	18.78
Phi4	90.63	40.22%	0.65	0.69
Claude 3.5 Haiku	90.32	34.74%	0.62	5.31
Claude 4 Sonnet	86.73	60.78%	0.75	20.75
Claude 3.7 Sonnet	80.31	57.39%	0.7	22.41
GPT-4o	80.28	41.46%	0.62	15.57
Llama-4 Maverick	76.3	50.53%	0.65	0.52
Llama-4 Scout	70.51	42.99%	0.58	0.54
Grok 3 Mini Beta	69.8	66.67%	0.69	1.15
Gemini 2.0 Flash	69.75	48.09%	0.6	0.95

CASI Leaderboard

Model Name	CASI	Avg. Performance	RTP	CoS
Claude 3.5 Sonnet	94.88	44.44%	0.7	18.7
Claude 3.7 Sonnet	88.11	57.39%	0.74	20.22
Claude 3.5 Haiku	87.47	34.74%	0.6	5.14
Phi4-14B	82.47	40.22%	0.62	0.66
DeepSeek-R1-Distill-Llama-70B	69.84	48.24%	0.6	1.24
GPT-4o	67.85	41.46%	0.56	16.65
Llama 3.1 405b	65.06	40.49%	0.54	2.05
Gemini 2.5 Pro	57.08	67.84%	0.61	17.5
GPT 4.1-nano	54.05	41.01%	0.48	0.93
Llama 4 Maverick-17B-128E	52.45	50.53%	0.52	0.77

Agentic Leaderboard

Model Name	AWR	Avg. Performance	A_RTP	A_CoS
Claude 3.5 Sonnet	96.67	44.44%	0.71	18.7
Phi4-14B	92.28	40.22%	0.76	0.66
Claude 3.5 Haiku	91.79	34.74%	0.62	5.14
GPT-4o	81.12	41.46%	0.62	16.65
Grok 3	77.75	50.63%	0.65	18
Claude 3.7 Sonnet	76.83	57.39%	0.68	20.22
Grok 3-mini	72.04	66.76%	0.7	0.8
Gemma 3 27b	72.03	37.62%	0.56	1.8
Llama 4 Maverick-17B-128E	71.71	50.53%	0.62	0.77
GPT 4.1	68.77	52.63%	0.62	10

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	94.3	84.50%	0.9	18.7
Anthropic	Claude 3.7 Sonnet	88.52	86.30%	0.88	20.22
Anthropic	Claude 3.5 Haiku	87.56	68.28%	0.79	5.14
Microsoft	Phi4-14B	82.77	75.90%	0.8	0.66
DeepSeek	DeepSeek-R1-Distill-Llama-70B	71.46	72.67%	0.72	1.24
OpenAI	GPT-4o	68.65	80.50%	0.73	16.65
Google	Gemini 2.0 Pro (experimental)	63.89	79.10%	0.7	NA
Meta	Llama 3.1 405b	60.73	79.80%	0.68	2.05
DeepSeek	DeepSeek-R1	52.91	86.53%	0.64	4.24
Google	Gemma 3 27b	55.25	78.60%	0.64	1.8

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	94.94	84.50%	0.93	18.7
Anthropic	Claude 3.7 Sonnet	89.54	86.30%	0.89	20.22
Anthropic	Claude 3.5 Haiku	88.84	68.28%	0.57	5.14
Microsoft	Phi4-14B	86.04	75.90%	0.68	0.66
DeepSeek	DeepSeek-R1-Distill-Llama-70B	71.7	72.67%	0.74	1.24
OpenAI	GPT-4o	68.44	80.50%	0.52	16.65
Meta	Llama 3.1 405b	61.86	79.80%	0.77	2.05
Meta	Llama 3.3 70b	55.57	74.50%	0.69	1.85
DeepSeek	DeepSeek-R1	52.91	86.53%	0.58	4.24
Google	Gemini 1.5 Flash	29.79	66.70%	0.92	0.51
Google	Gemini 2.0 Flash	29.18	77.20%	0.66	0.66
Google	Gemini 1.5 Pro	27.38	74.10%	0.63	8.58
OpenAI	GPT-4o-mini	24.25	71.78%	0.73	1.03
OpenAI	GPT-3.5 Turbo	18.73	59.20%	0.82	2.75

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS	Source
Anthropic	Claude 3.5 Sonnet	96.25	84.50%	0.93	18.7	Anthropic
Anthropic	Phi4-14B	94.25	75.90%	0.68	0.66	Azure
Anthropic	Claude 3.5 Haiku	93.45	68.28%	0.57	5.14	Anthropic
OpenAI	GPT-4o	75.06	80.50%	0.52	16.65	OpenAI
Meta	Llama 3.3 70b	74.79	74.50%	0.69	1.85	Hugging Face
DeepSeek	DeepSeek-R1-Distill-Llama-70B	74.42	72.67%	0.74	1.24	Hugging Face
DeepSeek	DeepSeek-R1	74.26	86.53%	0.58	4.24	Hugging Face
OpenAI	GPT-4o-mini	73.08	71.78%	0.73	1.03	OpenAI
Google	Gemini 1.5 Flash	73.06	66.70%	0.92	0.51	Google
Google	Gemini 1.5 Pro	72.85	74.10%	0.63	8.58	Google
OpenAI	GPT-3.5 Turbo	72.76	59.20%	0.82	2.75	OpenAI
Alibaba Cloud	Qwen QwQ-32B-preview	67.77	68.87%	0.65	2.14	Hugging Face

CASI version 1.0

Threat Insights - June 2025

Welcome to our June insight notes. This section is our commentary on the ever-shifting landscape of AI model security, where we highlight key data points, discuss emerging trends, and offer context to help you navigate your AI journey securely.

Testing Spotlight: Scenario Nesting

This month’s leaderboard incorporates a wider range of tests, including the strategy known as Scenario Nesting. The technique embeds a harmful instruction within a benign-looking task — like code completion or table generation — to bypass safety filters. By forcing the model to focus on the structure of the benign request, attackers can sneak malicious payloads past its defenses. More details on which models are vulnerable to it are available in our Inference Red-Team product.

New Notable Models Tested

Anthropic Claude 4 Sonnet: Anthropic’s new Claude 4 Sonnet enters the leaderboard directly at #1 with a CASI score of 95.12. It’s encouraging to see a new “hybrid reasoning” model debut with such a strong security posture, reflecting a continued commitment to security from their team.
Qwen3’s Open-Source Ascent: Two new open-source models from Alibaba, Qwen3-14B and Qwen3-30B-A3B, have earned their spots in our Top 10 for CASI. While their initial safety scores are competitive, it’s worth noting they did not place in the top 10 on our Agentic Warfare leaderboard, reinforcing the need to evaluate models against multiple security dimensions.
DeepSeek’s Impressive Patch: DeepSeek R1 saw its CASI score jump by over 4 points following its latest update. The data suggests this patch successfully addressed several key vulnerabilities, which is a positive development for model security maintenance and a welcome signal for enterprise users who value post-release support.

Wider Security Trends

A Welcome Reversal: The average CASI score across our tracked models increased by approximately 7% this month, a significant reversal of the recent downward trend. We’re optimistic this indicates providers are placing a greater emphasis on security and will be watching to see if this trend holds.
Two Sides of Security: Static vs. Agentic: This month highlighted the growing divergence between standard safety (CASI) and resistance to complex attacks (AWR). For instance, Claude 3.7 Sonnet’s AWR score improved while its CASI score dipped slightly. For production use, this means the “best” model truly depends on the job: a conversational bot has different security needs than a complex autonomous agent.
Balancing Security and Budget: As a reminder, our “Cost of Security” (CoS) metric helps quantify the trade-off between a model’s security and its operational cost. This month’s data shows this clearly: while Anthropic’s models hold the top spots for security, a model like Microsoft’s Phi4 offers a strong CASI score of 81.44 for a fraction of the cost.

Threat Insights - May 2025

Welcome to our insight notes. This section serves as our commentary space, where we highlight interesting data points from our research, discuss trends in AI model security behavior, and explain changes to our methodology. Our goal here is to provide transparency into the work happening behind the scenes at CalypsoAI’s research lab.

Agentic Warfare Resistance (AWR) Score:

This month we debut the scoring for our Agentic Warfare™ testing in its own leaderboard. While we have already raised the bar with our Signature attacks, moving away from basic attack success rates by incorporating severity and complexity of attack, with AWR we are taking another leap forward and evaluating how your choice of model can compromise your entire AI system. The AWR score is calculated based on depth and complexity of the attacks our agents need to use to achieve the desired goals.

Agentic Warfare deploys a team of autonomous attack agents trained to attack your model, extract information and compromise your infrastructure. In this way it can extract sensitive PII from vector stores, understand your system architecture and test your model’s alignment to your explicit instructions.

Updated Performance Benchmarks:

We now use seven different benchmarks in our performance metric: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, MATH-500. As benchmarks continue to evolve and improve we’ll keep evaluating what should be used in our leaderboard.

LOTS of New Models:

This last month has been the busiest for model release and testing we’ve seen in a long time. Llama 4 Maverick and Scout, Gemini 2.5 Pro and Flash, GPT4.1, 4.1-mini and 4.1-nano—and we finally get API access to test Grok 3 and Grok 3-mini.

Two New Agentic Attacks:

Our Agentic Warfare simulations now incorporate two new conversational attack methods, FRAME and Trolley, developed by the CalypsoAI AI Threat Research team. These techniques target known LLM architectural vulnerabilities and demonstrate the effectiveness of sustained, cohesive attacks during extended interactions, replicating tactics used by real-world adversaries.

Wider Security Trends:

Decreasing average scores: The average CASI score across the tracked models decreased by approximately 6% in this leaderboard iteration. We noted this last month and as the trend continues it’s becoming more obvious that foundational models are favouring performance over security.
Upgrade with caution: We are seeing a consistent trend where new releases, even minor ones, have lower CASI scores than their predecessors. With the upgrade path for these models being relatively easy, it’s important for companies to rigorously re-test their models and AI systems if they choose to upgrade. Notable examples:
- Claude: 3.5 sonnet = 94 vs 3.7 sonnet = 88
- OpenAI: GPT4o = 67 vs GPT4.1 = 51
- Llama: 3.1 405B = 65 vs 4 Maverick = 52

AI security means testing AI systems: Our research using Agentic Warfare demonstrates that even if a model appears secure when tested in isolation, integrating it into a wider system can expose a new array of vulnerabilities. For every model we tested using this approach within a system context, we were able to:
- Extract user-provided system prompts.
- Break the model’s alignment based on those system prompts.
- Extract sensitive personally identifiable information (PII) when the model was integrated into a retrieval-augmented generation (RAG) system.

Threat Insights - April 2025

These notes share key insights from CalypsoAI’s research team on AI model security trends, updates to our leaderboard, and changes in testing methodology.

Design & Functionality Updates

We’ve refreshed the leaderboard’s design (we hope you like the changes!), but the updates aren’t just cosmetic. We’ve also enhanced functionality: users can now review previous leaderboard iterations by clicking on the specific version number. We believe this is important for users who need to reference past data used in their decision-making processes.

Note: Please ensure you note the version number when recording or citing metrics.

Transitioning to a Top 10

We’ve decided to focus the leaderboard on the Top 10 models for several reasons. Primarily, as a leaderboard, its purpose is to spotlight the leading models in terms of security at a specific point in time, rather than listing every model ever published. While we continue to test a wide range of models, only those achieving the Top 10 CASI scores will be featured here. All models and additional data is available in our Inference Red-Team product where users can explore what attack types each model is vulnerable to.

New Notable Models Tested

Gemma 3 27B (Google): Google’s new open-source model enters the leaderboard in 9th place with a CASI score of 55.25. This pushes DeepSeek R1 into the final spot, while Llama 3.3 70B (previously in the Top 10) is now displaced with a score of 50.86.
Gemini 2.0 Pro (Experimental): Google’s recent Gemini release pattern presented challenges. While Gemini 2.0 Pro entered our Top 10 with a security score more than double that of its predecessor (1.5 Pro), Google released the beta of its newer model, 2.5 Pro, during our testing window and appears to have deprecated 2.0 Pro. Due to API rate limits (2 requests per minute), we couldn’t adequately test 2.5 Pro for this release, but intend to add it as soon as limits are relaxed. However, the significant security improvement observed from 1.5 to 2.0 makes us hopeful for continued progress in 2.5.
Mistral Small & Qwen QwQ: The recent emergence of capable sub-70B parameter models is exciting, particularly for performance in local deployments. Unfortunately, this excitement didn’t extend to their security evaluations in our tests. Neither Mistral Small nor Qwen came close to the Top 10, scoring 28.86 and 22.76 CASI respectively. This leaves Phi-4 as the leading Small Language Model (SLM) in terms of security for another release cycle.

Wider Security Trends

Decreasing Average Scores: The average CASI score across the tracked models decreased by approximately 4% in this leaderboard iteration. This could partially be attributed to our team improving our attack generation processes and incorporating new attack vectors. Nonetheless, it’s a developing trend and moving in the wrong direction.
Anthropic Remains Strong: Anthropic models continue to top our security rankings, although interestingly, their newest model, Claude 3.7 Sonnet, isn’t their highest-scoring one on our board. This observation aligns with Anthropic’s discussion around “Appropriate Harmlessness” for Sonnet, aiming to reduce refusals for benign prompts. Our tests suggest this tuning might have introduced slight vulnerabilities in the pursuit of improved helpfulness.
Older Models Receiving Patches: Several older models, including GPT4o-mini and Gemini 1.5 Pro received revisions since our last tests. These seemed to add some additional safeguards. The data suggests these patches incorporate learnings from newer models to address common jailbreaks, which is a positive development for model security maintenance, however we still would recommend additional safeguards if using these models. With scores of 41 and 27 respectively they still score well below our acceptable threshold.
Shift Towards Reasoning? With Anthropic releasing models like Claude 3.7 Sonnet, their first “hybrid reasoning model”, and Google quickly iterating from Gemini 2.0 Pro to the more advanced “thinking” version 2.5 Pro, we’re observing a potential trend. Are major providers shifting focus from releasing general base models towards models specifically enhanced for reasoning capabilities? If this trend holds, it could have significant implications for the attack surface of future models, as we’ve seen enhanced reasoning capabilities introduce new vulnerabilities.

Stay Updated

What Are the CalypsoAI Model Security Leaderboards?

The CalypsoAI Leaderboards are a holistic assessment of base model and AI system security, focusing on the most popular models and models deployed by our customers. We developed these tools to align with the business needs of selecting a production-ready model, helping CISOs and developers build with security at the forefront.

These leaderboards cut through the noise in the AI space, distilling complex model security questions into a few key metrics:

CalypsoAI Security Index (CASI)

A metric designed to measure the overall security of a model (explained in detail below).

Agentic Warfare Resistance (AWR) Score

AWR evaluates how a model can compromise an entire AI system. We do this by unleashing our team of autonomous attack agents on the system, which are trained to attack the model, extract information and compromise infrastructure. In this way these agents can extract sensitive PII from vector stores, understand system architecture, and test model alignment with explicit instructions.

Performance

The average performance of the model is based on popular benchmarks (e.g., MMLU, GPQA, MATH, HumanEval).

Risk-to-Performance Ratio (RTP)

Provides insight into the tradeoff between model safety and performance.

Cost of Security (CoS)

Evaluate the current inference cost relative to the model’s CASI, assessing the financial impact of security.

Introducing CASI

What is the CalypsoAI Security Index (CASI),
and Why Do We Need It?

CASI is a metric we developed to answer the complex question: “How secure is my model?” A higher CASI score indicates a more secure model or application.

While many studies on attacking or red-teaming models rely on Attack Success Rate (ASR), this metric often oversimplifies the reality. Traditional ASR treats all attacks as equal, which is misleading. For example, an attack that bypasses a bicycle lock should not be equated to one that compromises nuclear launch codes. Similarly, in AI, a small, unsecured model might be easily compromised with a simple request for sensitive information, while a larger model might require sophisticated techniques like Agentic Warfare™ to break its alignment.

To illustrate this, consider the following hypothetical comparison between a small, unsecured model and a larger, safeguarded model:

Attack	Weak model	Strong Model
Plain text Attack (ASR)	30%	4%
Complex Attack (ASR)	0%	26%
Total ASR	30%	30%
CASI	56	84

In this scenario, both models have the same total ASR. However, the larger model is significantly more secure because it resists simpler attacks and is only vulnerable to more complex ones. CASI captures this nuance, providing a more accurate representation of security.

CASI evaluates several critical factors beyond simple success rates:

By incorporating these factors, CASI offers a holistic and nuanced measure of model and application security.

Severity: The potential impact of a successful attack (e.g., bicycle lock vs. nuclear launch codes).
Complexity: The sophistication of the attack being assessed (e.g. plain text vs. complex encoding).
Defensive Breaking Point (DBP): Identifies the weakest link in the model’s defences, focusing on the path of least resistance and considering factors like computational resources required for a successful attack.

Agentic Warfare Resistance (AWR) Score

Measuring True AI Security with the Agentic Warfare Resistance (AWR) Score

Standard AI vulnerability scans are great to get a baseline view of model security but only scratch the surface in understanding how an AI system might act under real world attacks. This is why we use Agentic Warfare, a sophisticated red-teaming methodology where autonomous AI agents simulate a team of persistent, intelligent threat analysts. These agents probe, learn, and adapt, executing multi-step attacks to uncover critical weaknesses that static tests miss.

This rigorous process produces the Agentic Warfare Resistance (AWR) Score, a quantitative measure of an AI system’s defensive strength, rated on a scale of 0 to 100.

A higher AWR score means the system requires a more sophisticated, persistent, and informed attacker to be compromised. It directly translates a complex attack narrative into a single, benchmarkable number that is calculated across three critical vectors:

Required Sophistication: What is the minimum level of attacker ingenuity required to breach your AI? Does it withstand advanced, tailored strategies, or does it fall to simpler, common attacks?

Defensive Endurance: How long can the AI system hold up under a persistent assault? We measure if its defenses crumble after a few interactions or endure a prolonged, adaptive conversational attack.
Counter-Intelligence: Is AI accidentally training its attackers? This assesses whether a failed attack still leaks critical intelligence, like revealing the nature of its filters, which in turn, would provide a roadmap for the next attack.

The AWR score gives a clear, actionable metric to track, report on, and improve an organization’s AI security posture against the threats of tomorrow.

Experience Proactive AI Vulnerability Discovery

with CalypsoAI Inference Red-Team

Request a Demo Learn More

How Should the Leaderboard Be Used?

The CalypsoAI Leaderboard serves as a starting point for assessing which model to build with. It evaluates the guardrails implemented by model providers and reflects their performance against the latest vulnerabilities in the AI space.

It’s important to note that the leaderboard is a living artefact. At CalypsoAI, we will continue to develop new vulnerabilities and work with model providers to responsibly disclose and resolve these issues. As a result, model scores will evolve, and new models will be added. The leaderboard will be versioned based on updates to our signature attack database and iterations of our security score.

What Does the Leaderboard Not Do?

The leaderboard does not account for specific applications or use cases. It is solely an assessment of foundational models. For a deeper understanding of your application’s vulnerabilities, including targeted concerns like sensitive data disclosure or misalignment from system prompts, our full red-teaming product is available.

Do we supply all of the output and testing data?

Users of our red-teaming product gain access to our comprehensive suite of penetration testing attacks, including:

Signature Attacks:

A vast prompt database of state-of-the-art AI vulnerabilities.

Operational Attacks:

Traditional cybersecurity concerns applied to AI applications (e.g., DDoS, open parameters, PCS).

Agentic Warfare™:

An attack agent capable of discovering general or directed vulnerabilities specific to a customer’s use case. For example, a bank might use Agentic Warfare to determine if the model is susceptible to disclosing customer financial information. The agent designs custom attacks based on the model’s setup and application context.

Product users will be able to see additional data such as where the vulnerabilities of each models are along with solutions to mitigate the risk.

CalypsoAI Model Security Leaderboards

Find the Right Model

Compare Security, Cost & Capabilities

The CASI Leaderboard

The top 10 performing models at resisting direct attacks such as prompt injection and jailbreaks.

The Agentic Leaderboard

The top 10 models at maintaining safe behavior during autonomous, real-world attacks.

CASI Leaderboard

Agentic Leaderboard

CASI Leaderboard

Agentic Leaderboard

Testing Spotlight: Scenario Nesting

New Notable Models Tested

Wider Security Trends

Agentic Warfare Resistance (AWR) Score:

Updated Performance Benchmarks:

LOTS of New Models:

Two New Agentic Attacks:

Wider Security Trends:

Design & Functionality Updates

Transitioning to a Top 10

New Notable Models Tested

Wider Security Trends

Stay Updated

What Are the CalypsoAI Model Security Leaderboards?

CalypsoAI Security Index (CASI)

Agentic Warfare Resistance (AWR) Score

Performance

Risk-to-Performance Ratio (RTP)

Cost of Security (CoS)

Introducing CASI

What is the CalypsoAI Security Index (CASI), and Why Do We Need It?

CASI evaluates several critical factors beyond simple success rates:

Agentic Warfare Resistance (AWR) Score

Measuring True AI Security with the Agentic Warfare Resistance (AWR) Score

Experience Proactive AI Vulnerability Discovery

with CalypsoAI Inference Red-Team

How Should the Leaderboard Be Used?

What Does the Leaderboard Not Do?

Do we supply all of the output and testing data?

Signature Attacks:

Operational Attacks:

Agentic Warfare™:

Sources:

What is the CalypsoAI Security Index (CASI),
and Why Do We Need It?