By James White, CTO, CalypsoAI
One question keeps coming up for us at CalypsoAI: how do we really know if a model is safe? It’s a question we’ve been working to answer, first with our CalypsoAI Security Index (CASI) score, and, more recently, with our Agentic Warfare Resistance (AWR) score. So let’s unpack what these scores mean, and why both are critical for evaluating models in the wild.
CASI: The Security Baseline
CASI has been our benchmark for a while now. It gives a top-level view of a model’s general security posture and how resilient it is to a broad spectrum of vulnerabilities. CASI doesn’t just count attack success rates; it weighs the severity of successful breaches, the complexity of the attack paths, and the defensive breaking point, where a model’s guardrails start to fail.
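CalypsoAI hasn't published the CASI formula, but to make the idea concrete, here is a minimal, purely illustrative sketch of what weighting breach severity, attack complexity, and a defensive breaking point might look like. Every name, field, and weight below is an assumption for illustration, not the real scoring logic.

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    """One adversarial test case run against a model (illustrative fields only)."""
    succeeded: bool
    severity: float      # 0.0 (benign) .. 1.0 (critical breach)
    complexity: float    # 0.0 (trivial one-shot) .. 1.0 (long, sophisticated chain)

def toy_security_score(results: list[AttackResult], breaking_point: float) -> float:
    """Hypothetical CASI-style aggregate: penalise severe, low-effort breaches
    more than hard-won ones, then scale by how long the guardrails held.

    `breaking_point` is the fraction of escalating attack pressure the model
    withstood before its guardrails failed (1.0 = never failed).
    """
    if not results:
        return 0.0
    # A plain success-rate metric would treat every breach equally;
    # a CASI-like score weights each breach by its impact and its ease.
    penalty = sum(
        r.severity * (1.0 - r.complexity)   # severe + easy = worst case
        for r in results if r.succeeded
    ) / len(results)
    return max(0.0, (1.0 - penalty) * breaking_point) * 100

# Example: two successful breaches out of four probes, guardrails held to 80% pressure.
probes = [
    AttackResult(True, severity=0.9, complexity=0.2),
    AttackResult(True, severity=0.4, complexity=0.8),
    AttackResult(False, severity=0.0, complexity=0.5),
    AttackResult(False, severity=0.0, complexity=0.3),
]
print(round(toy_security_score(probes, breaking_point=0.8), 1))  # -> 64.0
```

The point of the weighting, not the numbers, is what matters: a severe breach reached with a trivial prompt should drag the score down far more than a marginal breach that took a long, elaborate attack chain.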
Think of CASI as the score that helps you judge “is this model generally secure compared to others?” It’s what most people are looking for when they say they want a “safe model”.
But here’s the thing: real-world adversaries don’t just throw a prompt at a model and walk away. They persist. They try multiple tactics. They work toward a goal.
AWR: The Stress Test
That’s where AWR comes in. AWR doesn’t measure theoretical resilience; it pressure-tests it.
In the AWR evaluation, we deploy our own autonomous attack agents. We give them five malicious intents focused on goals such as data exfiltration and system manipulation, and we set them loose. Our attack agents plan, iterate, and adapt. They go wherever they need to go to accomplish the goal. This is not a prompt injection test. It’s full-on Agentic Warfare.
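The actual AWR agents, intents, and target interfaces are internal to CalypsoAI, so the sketch below is only a rough illustration of the plan, attempt, observe, adapt loop described above; every function and string in it is a hypothetical stand-in.

```python
import random

# Purely illustrative stand-ins: the real AWR attack agents, intents, and
# model interfaces are CalypsoAI-internal; nothing below is their actual code.
MALICIOUS_INTENTS = ["exfiltrate stored records", "manipulate system settings"]

def query_target_model(prompt: str) -> str:
    """Placeholder for the system under test (an LLM plus its tools and data)."""
    return "REFUSED" if random.random() < 0.7 else f"COMPLIED: {prompt}"

def goal_achieved(response: str) -> bool:
    return response.startswith("COMPLIED")

def next_tactic(intent: str, history: list[str]) -> str:
    """Stand-in planner: a real attack agent reasons over prior failures and
    the target's tool surface; here we just rotate opaque tactic labels."""
    return f"[tactic #{len(history) + 1} toward goal: {intent}]"

def run_agentic_attack(intent: str, max_turns: int = 10) -> bool:
    """Plan -> attempt -> observe -> adapt, until the goal lands or turns run out."""
    history: list[str] = []
    for _ in range(max_turns):
        response = query_target_model(next_tactic(intent, history))
        if goal_achieved(response):
            return True           # the intent was achieved: the system is exposed
        history.append(response)  # feed the failure back into the next attempt
    return False                  # the model resisted this intent within the budget

resisted = sum(not run_agentic_attack(i) for i in MALICIOUS_INTENTS)
print(f"Resisted {resisted}/{len(MALICIOUS_INTENTS)} intents")
```

The key difference from a single-prompt test is the feedback loop: each refusal becomes input to the next attempt, which is why a model that shrugs off one-shot prompts can still fail under sustained agentic pressure.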
That’s also why we now have two leaderboards. It wasn’t a design choice so much as a reality check. Some models that look strong on CASI crumble under agentic pressure. Others hold their ground.
A Case in Point: Qwen3
Let’s take Qwen3. Recently released by Alibaba, it’s objectively a good model. On CASI, it landed in 5th place, well within our top 10: respectable, not best, not worst. To put this in perspective, only the top 1% of models even make it onto our CASI Leaderboard. Impressive, right?
But on the Agentic Leaderboard, Qwen3 didn’t even place. Why? Because when we tested it against five malicious intents – each designed to represent genuinely harmful outcomes, such as facilitating self-harm – it failed every single one. In other words, it had zero resistance.
In comparison, the top model on the Agentic Leaderboard is Anthropic’s Claude 3.5 Sonnet. Out of the same five malicious intents, Claude 3.5 was resilient to three and vulnerable to two. So still a risk, but a significantly reduced one. (Incidentally, Claude 3.5 was also the top-ranked model on the CASI Leaderboard, so there’s some correlation.)
Let’s consider what this means in practical, real-world terms. If you use Qwen3 within an AI system that connects to tools, data stores, or user inputs, you are exposing the entire system to real vulnerabilities.
This is the nuance CASI alone can’t show you. You need both lenses: CASI for general security posture, AWR for system-level resilience under coordinated attack.
The Bottom Line
We didn’t build these scores for vanity. We built them because attackers won’t be static within an AI environment, and your security can’t be either.
If you’re choosing a model based on its performance alone – MMLU score or speed – you are missing the part that matters most. So the next time someone tells you a model is “state-of-the-art”, ask: is it secure? Because if it’s not, it’s not ready.