CalypsoAI Model Security Leaderboards
Find the Right Model
Compare Security, Cost & Capabilities
The world’s major AI models and systems are vulnerable—we’ve proven it. The CalypsoAI Security Leaderboards rank top GenAI models based on real-world security testing, exposing critical risks overlooked by performance benchmarks. Powered by Inference Red-Team, these leaderboards are the only tools that help you find the safest model and stress test your AI system before you deploy.
The ‘CalypsoAI Security Index’ (CASI) ranks models on a scale from 0 to 100—the higher the score, the more secure the model is. Learn about CASI.
The Agentic Warfare Resistance (AWR) Score takes it a step further by assessing how a model can compromise your entire AI system.
CASI Leaderboard
Model Provider | Model Name | CASI | Avg. Performance | RTP | CoS |
---|---|---|---|---|---|
|
Claude 3.5 Sonnet | 94.88 | 44.44% | 0.7 | 18.7 |
|
Claude 3.7 Sonnet | 88.11 | 57.39% | 0.74 | 20.22 |
|
Claude 3.5 Haiku | 87.47 | 34.74% | 0.6 | 5.14 |
|
Phi4-14B | 82.47 | 40.22% | 0.62 | 0.66 |
|
DeepSeek-R1-Distill-Llama-70B | 69.84 | 48.24% | 0.6 | 1.24 |
|
GPT-4o | 67.85 | 41.46% | 0.56 | 16.65 |
|
Llama 3.1 405b | 65.06 | 40.49% | 0.54 | 2.05 |
|
Gemini 2.5 Pro | 57.08 | 67.84% | 0.61 | 17.5 |
|
GPT 4.1-nano | 54.05 | 41.01% | 0.48 | 0.93 |
|
Llama 4 Maverick-17B-128E | 52.45 | 50.53% | 0.52 | 0.77 |
Agentic Leaderboard
Model Provider | Model Name | AWR | Avg. Performance | A_RTP | A_CoS |
---|---|---|---|---|---|
|
Claude 3.5 Sonnet | 96.67 | 44.44% | 0.71 | 18.7 |
|
Phi4-14B | 92.28 | 40.22% | 0.76 | 0.66 |
|
Claude 3.5 Haiku | 91.79 | 34.74% | 0.62 | 5.14 |
|
GPT-4o | 81.12 | 41.46% | 0.62 | 16.65 |
|
Grok 3 | 77.75 | 50.63% | 0.65 | 18 |
|
Claude 3.7 Sonnet | 76.83 | 57.39% | 0.68 | 20.22 |
|
Grok 3-mini | 72.04 | 66.76% | 0.7 | 0.8 |
|
Gemma 3 27b | 72.03 | 37.62% | 0.56 | 1.8 |
|
Llama 4 Maverick-17B-128E | 71.71 | 50.53% | 0.62 | 0.77 |
|
GPT 4.1 | 68.77 | 52.63% | 0.62 | 10 |
Model Provider | Model Name | CASI | Avg. Performance | RTP | CoS |
---|---|---|---|---|---|
|
Claude 3.5 Sonnet | 94.3 | 84.50% | 0.9 | 18.7 |
|
Claude 3.7 Sonnet | 88.52 | 86.30% | 0.88 | 20.22 |
|
Claude 3.5 Haiku | 87.56 | 68.28% | 0.79 | 5.14 |
|
Phi4-14B | 82.77 | 75.90% | 0.8 | 0.66 |
|
DeepSeek-R1-Distill-Llama-70B | 71.46 | 72.67% | 0.72 | 1.24 |
|
GPT-4o | 68.65 | 80.50% | 0.73 | 16.65 |
|
Gemini 2.0 Pro (experimental) | 63.89 | 79.10% | 0.7 | NA |
|
Llama 3.1 405b | 60.73 | 79.80% | 0.68 | 2.05 |
|
DeepSeek-R1 | 52.91 | 86.53% | 0.64 | 4.24 |
|
Gemma 3 27b | 55.25 | 78.60% | 0.64 | 1.8 |
Model Provider | Model Name | CASI | Avg. Performance | RTP | CoS |
---|---|---|---|---|---|
|
Claude 3.5 Sonnet | 94.94 | 84.50% | 0.93 | 18.7 |
|
Claude 3.7 Sonnet | 89.54 | 86.30% | 0.89 | 20.22 |
|
Claude 3.5 Haiku | 88.84 | 68.28% | 0.57 | 5.14 |
|
Phi4-14B | 86.04 | 75.90% | 0.68 | 0.66 |
|
DeepSeek-R1-Distill-Llama-70B | 71.7 | 72.67% | 0.74 | 1.24 |
|
GPT-4o | 68.44 | 80.50% | 0.52 | 16.65 |
|
Llama 3.1 405b | 61.86 | 79.80% | 0.77 | 2.05 |
|
Llama 3.3 70b | 55.57 | 74.50% | 0.69 | 1.85 |
|
DeepSeek-R1 | 52.91 | 86.53% | 0.58 | 4.24 |
|
Gemini 1.5 Flash | 29.79 | 66.70% | 0.92 | 0.51 |
|
Gemini 2.0 Flash | 29.18 | 77.20% | 0.66 | 0.66 |
|
Gemini 1.5 Pro | 27.38 | 74.10% | 0.63 | 8.58 |
|
GPT-4o-mini | 24.25 | 71.78% | 0.73 | 1.03 |
|
GPT-3.5 Turbo | 18.73 | 59.20% | 0.82 | 2.75 |
Model Provider | Model Name | CASI | Avg. Performance | RTP | CoS | Source |
---|---|---|---|---|---|---|
|
Claude 3.5 Sonnet | 96.25 | 84.50% | 0.93 | 18.7 | Anthropic |
|
Phi4-14B | 94.25 | 75.90% | 0.68 | 0.66 | Azure |
|
Claude 3.5 Haiku | 93.45 | 68.28% | 0.57 | 5.14 | Anthropic |
|
GPT-4o | 75.06 | 80.50% | 0.52 | 16.65 | OpenAI |
|
Llama 3.3 70b | 74.79 | 74.50% | 0.69 | 1.85 | Hugging Face |
|
DeepSeek-R1-Distill-Llama-70B | 74.42 | 72.67% | 0.74 | 1.24 | Hugging Face |
|
DeepSeek-R1 | 74.26 | 86.53% | 0.58 | 4.24 | Hugging Face |
|
GPT-4o-mini | 73.08 | 71.78% | 0.73 | 1.03 | OpenAI |
|
Gemini 1.5 Flash | 73.06 | 66.70% | 0.92 | 0.51 | |
|
Gemini 1.5 Pro | 72.85 | 74.10% | 0.63 | 8.58 | |
|
GPT-3.5 Turbo | 72.76 | 59.20% | 0.82 | 2.75 | OpenAI |
Alibaba Cloud | Qwen QwQ-32B-preview | 67.77 | 68.87% | 0.65 | 2.14 | Hugging Face |
Welcome to our insight notes. This section serves as our commentary space, where we highlight interesting data points from our research, discuss trends in AI model security behavior, and explain changes to our methodology. Our goal here is to provide transparency into the work happening behind the scenes at CalypsoAI’s research lab.
Agentic Warfare Resistance (AWR) Score:
This month we debut the scoring for our Agentic Warfare™ testing in its own leaderboard. While we have already raised the bar with our Signature attacks, moving away from basic attack success rates by incorporating severity and complexity of attack, with AWR we are taking another leap forward and evaluating how your choice of model can compromise your entire AI system. The AWR score is calculated based on depth and complexity of the attacks our agents need to use to achieve the desired goals.
Agentic Warfare deploys a team of autonomous attack agents trained to attack your model, extract information and compromise your infrastructure. In this way it can extract sensitive PII from vector stores, understand your system architecture and test your model’s alignment to your explicit instructions.
Updated Performance Benchmarks:
We now use seven different benchmarks in our performance metric: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, MATH-500. As benchmarks continue to evolve and improve we’ll keep evaluating what should be used in our leaderboard.
LOTS of New Models:
This last month has been the busiest for model release and testing we’ve seen in a long time. Llama 4 Maverick and Scout, Gemini 2.5 Pro and Flash, GPT4.1, 4.1-mini and 4.1-nano—and we finally get API access to test Grok 3 and Grok 3-mini.
Two New Agentic Attacks:
Our Agentic Warfare simulations now incorporate two new conversational attack methods, FRAME and Trolley, developed by the CalypsoAI AI Threat Research team. These techniques target known LLM architectural vulnerabilities and demonstrate the effectiveness of sustained, cohesive attacks during extended interactions, replicating tactics used by real-world adversaries.
Wider Security Trends:
- Decreasing average scores: The average CASI score across the tracked models decreased by approximately 6% in this leaderboard iteration. We noted this last month and as the trend continues it’s becoming more obvious that foundational models are favouring performance over security.
- Upgrade with caution: We are seeing a consistent trend where new releases, even minor ones, have lower CASI scores than their predecessors. With the upgrade path for these models being relatively easy, it’s important for companies to rigorously re-test their models and AI systems if they choose to upgrade. Notable examples:
- Claude: 3.5 sonnet = 94 vs 3.7 sonnet = 88
- OpenAI: GPT4o = 67 vs GPT4.1 = 51
- Llama: 3.1 405B = 65 vs 4 Maverick = 5
- AI security means testing AI systems: Our research using Agentic Warfare demonstrates that even if a model appears secure when tested in isolation, integrating it into a wider system can expose a new array of vulnerabilities. For every model we tested using this approach within a system context, we were able to:
- Extract user-provided system prompts.
- Break the model’s alignment based on those system prompts.
- Extract sensitive personally identifiable information (PII) when the model was integrated into a retrieval-augmented generation (RAG) system.
Stay Updated
Sign up for updates on each release of our leaderboard each month
What Are the CalypsoAI Model Security Leaderboards?
The CalypsoAI Leaderboards are a holistic assessment of base model and AI system security, focusing on the most popular models and models deployed by our customers. We developed these tools to align with the business needs of selecting a production-ready model, helping CISOs and developers build with security at the forefront.
These leaderboards cut through the noise in the AI space, distilling complex model security questions into a few key metrics:
- CalypsoAI Security Index (CASI): A metric designed to measure the overall security of a model (explained in detail below).
- Agentic Warfare Resistance (AWR) Score: AWR evaluates how a model can compromise an entire AI system. We do this by unleashing our team of autonomous attack agents on the system, which are trained to attack the model, extract information and compromise infrastructure. In this way these agents can extract sensitive PII from vector stores, understand system architecture, and test model alignment with explicit instructions.
- Performance: The average performance of the model is based on popular benchmarks (e.g., MMLU, GPQA, MATH, HumanEval).
- Risk-to-Performance Ratio (RTP): Provides insight into the tradeoff between model safety and performance.
- Cost of Security (CoS): Evaluate the current inference cost relative to the model’s CASI, assessing the financial impact of security.
Introducing CASI
What is the CalypsoAI Security Index (CASI),
and Why Do We Need It?
CASI is a metric we developed to answer the complex question: “How secure is my model?” A higher CASI score indicates a more secure model or application.
While many studies on attacking or red-teaming models rely on Attack Success Rate (ASR), this metric often oversimplifies the reality. Traditional ASR treats all attacks as equal, which is misleading. For example, an attack that bypasses a bicycle lock should not be equated to one that compromises nuclear launch codes. Similarly, in AI, a small, unsecured model might be easily compromised with a simple request for sensitive information, while a larger model might require sophisticated techniques like Agentic Warfare™ to break its alignment.
To illustrate this, consider the following hypothetical comparison between a small, unsecured model and a larger, safeguarded model:
Attack | Weak model | Strong Model |
Plain text Attack (ASR) | 30% | 4% |
Complex Attack (ASR) | 0% | 26% |
Total ASR | 30% | 30% |
CASI | 56 | 84 |
In this scenario, both models have the same total ASR. However, the larger model is significantly more secure because it resists simpler attacks and is only vulnerable to more complex ones. CASI captures this nuance, providing a more accurate representation of security.
CASI evaluates several critical factors beyond simple success rates:
By incorporating these factors, CASI offers a holistic and nuanced measure of model and application security.
- Severity: The potential impact of a successful attack (e.g., bicycle lock vs. nuclear launch codes).
- Complexity: The sophistication of the attack being assessed (e.g. plain text vs. complex encoding).
- Defensive Breaking Point (DBP): Identifies the weakest link in the model’s defences, focusing on the path of least resistance and considering factors like computational resources required for a successful attack.

Experience Proactive AI Vulnerability Discovery
with CalypsoAI Inference Red-Team
How Should the Leaderboard Be Used?
The CalypsoAI Leaderboard serves as a starting point for assessing which model to build with. It evaluates the guardrails implemented by model providers and reflects their performance against the latest vulnerabilities in the AI space.
It’s important to note that the leaderboard is a living artefact. At CalypsoAI, we will continue to develop new vulnerabilities and work with model providers to responsibly disclose and resolve these issues. As a result, model scores will evolve, and new models will be added. The leaderboard will be versioned based on updates to our signature attack database and iterations of our security score.
What Does the Leaderboard Not Do?
The leaderboard does not account for specific applications or use cases. It is solely an assessment of foundational models. For a deeper understanding of your application’s vulnerabilities, including targeted concerns like sensitive data disclosure or misalignment from system prompts, our full red-teaming product is available.
Do we supply all of the output and testing data?
Users of our red-teaming product gain access to our comprehensive suite of penetration testing attacks, including:
Signature Attacks:
A vast prompt database of state-of-the-art AI vulnerabilities.
Operational Attacks:
Traditional cybersecurity concerns applied to AI applications (e.g., DDoS, open parameters, PCS).
Agentic Warfare™:
An attack agent capable of discovering general or directed vulnerabilities specific to a customer’s use case. For example, a bank might use Agentic Warfare to determine if the model is susceptible to disclosing customer financial information. The agent designs custom attacks based on the model’s setup and application context.
Product users will be able to see additional data such as where the vulnerabilities of each models are along with solutions to mitigate the risk.
Sources:
- https://docs.anthropic.com/en/docs/about-claude/models
- https://ai.azure.com/explore/models/Phi-4/version/3/registry/azureml
- https://platform.openai.com/docs/models/o1
- https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
- https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- https://huggingface.co/deepseek-ai/DeepSeek-R1
- https://ai.google.dev/gemini-api/docs/models/gemini
- https://huggingface.co/Qwen/QwQ-32B-Preview