CalypsoAI Security Leaderboard

Find the Right Model

Compare Security, Cost & Capabilities 

The world’s major AI models are vulnerable—we’ve proven it. The CalypsoAI Security Leaderboard ranks top GenAI models based on real-world security testing, exposing critical risks overlooked by performance benchmarks. Powered by Inference Red-Team, it’s the only tool that helps you find the safest model before you deploy.

The ‘CalypsoAI Security Index’ (CASI) ranks models on a scale from 0 to 100; the higher the score, the more secure the model is. Learn about CASI.

March Edition – Updated 6th March, 2025

What is the CalypsoAI Leaderboard?

The CalypsoAI Leaderboard is a holistic assessment of base model security, focusing on the most popular models and models deployed by our customers. We developed this tool to align with the business needs of selecting a production-ready model, helping CISOs and developers build with security at the forefront.  

The leaderboard cuts through the noise in the AI space, distilling complex model security questions into a few key metrics:  

  • CalypsoAI Security Index (CASI): A metric designed to measure the overall security of a model (explained in detail below).  
  • Performance: The model’s average performance across popular benchmarks (e.g., MMLU, GPQA, MATH, HumanEval).  
  • Risk-to-Performance Ratio (RTP): Provides insight into the tradeoff between model safety and performance.  
  • Cost of Security (CoS): Evaluates the current inference cost relative to the model’s CASI, assessing the financial impact of security (a rough illustration follows this list).  
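
For a concrete sense of how these headline numbers can relate to one another, here is a minimal Python sketch. The exact RTP and CoS formulas are not published on this page, so the simple ratios below (performance over CASI, and inference price per million tokens over CASI) are illustrative assumptions only, and the model numbers are made up:

```python
# Illustrative sketch only: these are NOT the published RTP/CoS formulas.
# Assumption: RTP is treated here as performance relative to CASI, and CoS as
# inference price (per 1M tokens) relative to CASI.

def risk_to_performance(casi: float, performance: float) -> float:
    """Higher values mean more benchmark performance per unit of security."""
    return performance / casi

def cost_of_security(price_per_m_tokens: float, casi: float) -> float:
    """A rough 'dollars per point of security' figure."""
    return price_per_m_tokens / casi

# Hypothetical model entry (numbers invented for illustration).
model = {"casi": 84.0, "performance": 78.5, "price_per_m_tokens": 3.0}

print(f"RTP: {risk_to_performance(model['casi'], model['performance']):.2f}")
print(f"CoS: {cost_of_security(model['price_per_m_tokens'], model['casi']):.3f}")
```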

 

Introducing CASI

What is the CalypsoAI Security Index (CASI), and Why Do We Need It? 

CASI is a metric we developed to answer the complex question: “How secure is my model?” A higher CASI score indicates a more secure model or application.  

While many studies on attacking or red-teaming models rely on Attack Success Rate (ASR), this metric often oversimplifies the reality. Traditional ASR treats all attacks as equal, which is misleading. For example, an attack that bypasses a bicycle lock should not be equated to one that compromises nuclear launch codes. Similarly, in AI, a small, unsecured model might be easily compromised with a simple request for sensitive information, while a larger model might require sophisticated techniques like Agentic Warfare™ to break its alignment. 

To illustrate this, consider the following hypothetical comparison between a small, unsecured model and a larger, safeguarded model:  

Attack                     Weak Model    Strong Model
Plain Text Attack (ASR)    30%           4%
Complex Attack (ASR)       0%            26%
Total ASR                  30%           30%
CASI                       56            84

 

In this scenario, both models have the same total ASR. However, the larger model is significantly more secure because it resists simpler attacks and is only vulnerable to more complex ones. CASI captures this nuance, providing a more accurate representation of security.  
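To make the idea concrete, here is a minimal Python sketch. CalypsoAI does not publish the CASI formula, so the complexity weights below are purely hypothetical (which is why the outputs do not match the 56 and 84 in the table); the point is only that weighting failures by how easy the attack was separates the two models even though their raw ASR is identical:

```python
# Hypothetical illustration of complexity weighting -- NOT the real CASI formula.
# Failing against a trivial attack is penalised more heavily than failing
# against a sophisticated one.

WEIGHTS = {"plain_text": 1.0, "complex": 0.4}  # assumed penalty weights

def weighted_index(asr_by_attack: dict[str, float]) -> float:
    """Return a 0-100 score where simple failures cost more than complex ones."""
    penalty = sum(WEIGHTS[attack] * asr for attack, asr in asr_by_attack.items())
    return max(0.0, 100.0 * (1.0 - penalty))

weak_model   = {"plain_text": 0.30, "complex": 0.00}   # total ASR 30%
strong_model = {"plain_text": 0.04, "complex": 0.26}   # total ASR 30%

print(weighted_index(weak_model))    # 70.0 -- heavily penalised for failing easy attacks
print(weighted_index(strong_model))  # 85.6 -- same total ASR, but only complex attacks succeed
```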


CASI evaluates several critical factors beyond simple success rates:  

  • Severity: The potential impact of a successful attack (e.g., bicycle lock vs. nuclear launch codes).  
  • Complexity: The sophistication of the attack being assessed (e.g., plain text vs. complex encoding).  
  • Defensive Breaking Point (DBP): Identifies the weakest link in the model’s defences, focusing on the path of least resistance and considering factors like the computational resources required for a successful attack (see the sketch below).  

By incorporating these factors, CASI offers a holistic and nuanced measure of model and application security.  
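
As a loose illustration of the "weakest link" idea behind DBP, the sketch below treats each attack category as a path and takes the least-effort successful one as the breaking point. The attack names and effort figures are hypothetical placeholders, not CalypsoAI data:

```python
# Hypothetical sketch of a Defensive Breaking Point: security is bounded by the
# least-effort attack that still succeeds (the path of least resistance).

# Assumed per-attack results: (attack name, succeeded?, relative effort/cost).
attack_results = [
    ("plain_text_request", False, 1),
    ("roleplay_jailbreak",  True,  5),
    ("encoded_payload",     True,  9),
]

successful = [(name, effort) for name, succeeded, effort in attack_results if succeeded]

if successful:
    dbp_attack, dbp_effort = min(successful, key=lambda pair: pair[1])
    print(f"Defensive breaking point: {dbp_attack} (effort {dbp_effort})")
else:
    print("No successful attack found at the tested effort levels.")
```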

How Should the Leaderboard Be Used?  

The CalypsoAI Leaderboard serves as a starting point for assessing which model to build with. It evaluates the guardrails implemented by model providers and reflects their performance against the latest vulnerabilities in the AI space.  

It’s important to note that the leaderboard is a living artefact. At CalypsoAI, we will continue to develop new vulnerabilities and work with model providers to responsibly disclose and resolve these issues. As a result, model scores will evolve, and new models will be added. The leaderboard will be versioned based on updates to our signature attack database and iterations of our security score.  


What Does the Leaderboard Not Do?

The leaderboard does not account for specific applications or use cases. It is solely an assessment of foundational models. For a deeper understanding of your application’s vulnerabilities, including targeted concerns like sensitive data disclosure or misalignment from system prompts, our full red-teaming product is available.


Do we supply all of the output and testing data?

Users of our red-teaming product gain access to our comprehensive suite of penetration testing attacks, including:  

Signature Attacks:

A vast prompt database of state-of-the-art AI vulnerabilities.  

Operational Attacks:

Traditional cybersecurity concerns applied to AI applications (e.g., DDoS, open parameters, PCS).  

Agentic Warfare™:

An attack agent capable of discovering general or directed vulnerabilities specific to a customer’s use case. For example, a bank might use Agentic Warfare to determine if the model is susceptible to disclosing customer financial information. The agent designs custom attacks based on the model’s setup and application context.  

Product users will also be able to see additional data, such as where each model’s vulnerabilities lie, along with solutions to mitigate the risk.