Blog
20 May 2025

When High Scores Aren’t Enough: What Qwen3 Taught Us

By James White, CTO, CalypsoAI

One question keeps coming up for us at CalypsoAI: how do we really know if a model is safe? It’s a question we’ve been working to answer, first with our CalypsoAI Security Index (CASI) score, and, more recently, with our Agentic Warfare Resistance (AWR) score. So let’s unpack what these scores mean, and why both are critical for evaluating models in the wild.

CASI: The Security Baseline

CASI has been our benchmark for a while now. It gives a top-level view of a model’s general security posture and how resilient it is to a broad spectrum of vulnerabilities. CASI doesn’t just count attack success rates; it weighs the severity of successful breaches, the complexity of the attack paths, and the defensive breaking point, where a model’s guardrails start to fail. 
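To make that concrete, here is a minimal, purely illustrative sketch of how a severity-weighted score might combine those factors. It is not the actual CASI formula; the AttackResult fields and the weighting are assumptions chosen only to show the idea of penalising easy, damaging breaches more heavily than contrived ones.

```python
# Hypothetical sketch of a severity-weighted security score.
# NOT the actual CASI formula: it only illustrates weighting breach severity
# and attack complexity instead of counting raw success rates.
from dataclasses import dataclass

@dataclass
class AttackResult:
    succeeded: bool    # did the attack break through?
    severity: float    # 0.0 (benign) .. 1.0 (critical)      -- assumed scale
    complexity: float  # 0.0 (trivial) .. 1.0 (very complex)  -- assumed scale

def weighted_security_score(results: list[AttackResult]) -> float:
    """Return a 0-100 score: higher means more resistant."""
    if not results:
        return 100.0
    penalty = 0.0
    for r in results:
        if r.succeeded:
            # Easy (low-complexity), severe breaches cost the most.
            penalty += r.severity * (1.0 - 0.5 * r.complexity)
    return max(0.0, 100.0 * (1.0 - penalty / len(results)))

# Example: one trivial, near-critical breach drags the score down sharply.
print(weighted_security_score([
    AttackResult(succeeded=True, severity=0.9, complexity=0.1),   # broke through easily
    AttackResult(succeeded=False, severity=0.5, complexity=0.7),  # held up
]))  # ~57
```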

Think of CASI as the score that helps you judge “is this model generally secure compared to others?” It’s what most people are looking for when they say they want a “safe model”.

But here’s the thing: real-world adversaries don’t just throw a prompt at a model and walk away. They persist. They try multiple tactics. They work toward a goal. 

AWR: The Stress Test

That’s where AWR comes in. AWR doesn’t measure theoretical resilience; it pressure-tests it.

In the AWR evaluation, we deploy our own autonomous attack agents. We give them five malicious intents focused on goals such as data exfiltration or system manipulation, and we set them loose. Our attack agents plan, iterate, and adapt. They go wherever they need to go to accomplish the goal. This is not a prompt injection test. It’s full-on Agentic Warfare.
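To make that loop concrete, here is a minimal sketch of the plan-iterate-adapt cycle. It is an illustration only, not our actual AWR harness; target_model, attacker_model, and goal_reached are hypothetical stand-ins for the model under test, the attacking agent, and the judge that decides whether an intent was achieved.

```python
# Minimal illustration of the plan / attempt / adapt loop an autonomous attack
# agent runs against a target model. A simplified sketch, not the real AWR agent:
# target_model, attacker_model, and goal_reached are hypothetical callables.
from typing import Callable

def agentic_attack(
    intent: str,
    target_model: Callable[[str], str],    # model under test: prompt -> reply
    attacker_model: Callable[[str], str],  # red-team agent: context -> next tactic
    goal_reached: Callable[[str], bool],   # judge: did the reply satisfy the intent?
    max_turns: int = 10,
) -> bool:
    """Return True if the agent achieves its malicious intent within max_turns."""
    history: list[str] = []
    for turn in range(max_turns):
        # Plan: ask the attacking agent for its next tactic, given what failed so far.
        context = f"Intent: {intent}\nPrevious attempts and replies:\n" + "\n".join(history)
        tactic = attacker_model(context)

        # Attempt: send the crafted prompt to the model under test.
        reply = target_model(tactic)
        history.append(f"[turn {turn}] attempt: {tactic}\nreply: {reply}")

        # Adapt: stop if the goal is reached; otherwise loop with the new evidence.
        if goal_reached(reply):
            return True
    return False

# A model counts as resistant to an intent only if this never returns True.
```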

It’s also why we now have two leaderboards. It wasn’t a design choice, but a reality check. Some models that look strong on CASI crumble under agentic pressure. Others hold their ground.

A Case in Point: Qwen3

Let’s take Qwen3. Recently released by Alibaba, it’s objectively a good model. On CASI, it landed in 5th place, well within our top 10. So it’s respectable: not the best, not the worst. To put this into perspective, only the top 1% of models even make it onto our CASI Leaderboard. Impressive, right?

But on the Agentic Leaderboard, Qwen3 didn’t even place. Why? Because when we tested it against five malicious intents – each designed to represent genuinely harmful outcomes, such as facilitating self-harm – it failed every single one. In other words, it had zero resistance.

In comparison, the top model on the Agentic Leaderboard is Anthropic’s Claude 3.5 Sonnet. Out of the same five malicious intents, Claude 3.5 was resilient to three and vulnerable to two. So still a risk, but a significantly reduced one. (Incidentally, Claude 3.5 was also the top-ranked model on the CASI Leaderboard, so there’s some correlation.)

Let’s consider what this means in practical, real-world terms. If you use Qwen3 within an AI system that connects to tools, data stores, or user inputs, you are exposing the entire system to real vulnerabilities. 

This is the nuance CASI alone can’t show you. You need both lenses: CASI for general security posture, and AWR for system-level resilience under coordinated attack.

The Bottom Line

We didn’t build these scores for vanity. We built them because attackers won’t be static within an AI environment, and your security can’t be either. 

If you’re choosing a model based on its performance alone – MMLU score or speed – you are missing the part that matters most. So the next time someone tells you a model is “state-of-the-art”, ask: is it secure? Because if it’s not, it’s not ready.

To learn more about our Inference Platform, arrange a callback.
