
Blog
20 May 2025

When High Scores Aren’t Enough: What Qwen3 Taught Us

By James White, VP Engineering, F5

One question keeps coming up for us at F5: how do we really know if a model is safe? It’s a question we’ve been working to answer, first with our Comprehensive AI Security Index (CASI) score, and, more recently, with our Agentic Resistance Score (ARS). So let’s unpack what these scores mean, and why both are critical for evaluating models in the wild.

CASI: The Security Baseline

CASI has been our benchmark for a while now. It gives a top-level view of a model’s general security posture and how resilient it is to a broad spectrum of vulnerabilities. CASI doesn’t just count attack success rates; it weighs the severity of successful breaches, the complexity of the attack paths, and the defensive breaking point, where a model’s guardrails start to fail. 
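
To make the idea of a severity- and complexity-weighted score concrete, here is a minimal sketch of how such a composite could be assembled. The factor names, weights, and formula are illustrative assumptions for this post, not the actual CASI methodology:

```python
# Hypothetical sketch of a CASI-style composite score.
# The factor names and weights here are illustrative assumptions,
# not the actual CASI methodology.

def composite_security_score(attack_results, weights=None):
    """Combine per-attack outcomes into a 0-100 security score.

    Each attack result records:
      - succeeded: whether the attack broke through
      - severity: impact of a successful breach (0-1, higher is worse)
      - complexity: effort the attack required (0-1, higher is harder)
    A breach via a low-complexity attack is penalized more heavily,
    echoing the idea of a "defensive breaking point".
    """
    weights = weights or {"severity": 0.5, "ease": 0.5}
    penalty = 0.0
    for result in attack_results:
        if result["succeeded"]:
            ease = 1.0 - result["complexity"]  # easy attacks that land are worst
            penalty += weights["severity"] * result["severity"] + weights["ease"] * ease
    max_penalty = len(attack_results) * sum(weights.values())
    return round(100.0 * (1.0 - penalty / max_penalty), 1)

results = [
    {"succeeded": True,  "severity": 0.9, "complexity": 0.2},
    {"succeeded": False, "severity": 0.0, "complexity": 0.7},
    {"succeeded": False, "severity": 0.0, "complexity": 0.4},
]
print(composite_security_score(results))  # → 71.7
```

The point of the weighting is that a high-severity breach achieved with a trivial prompt should cost far more than a marginal breach that took an elaborate attack path, which is why a raw attack-success rate alone understates risk.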

Think of CASI as the score that helps you judge “is this model generally secure compared to others?” It’s what most people are looking for when they say they want a “safe model”.

But here’s the thing: real-world adversaries don’t just throw a prompt at a model and walk away. They persist. They try multiple tactics. They work toward a goal. 

ARS: The Stress Test

That’s where ARS comes in. ARS doesn’t measure theoretical resilience; it pressure-tests it.

In the ARS evaluation, we deploy our own autonomous attack agents. We give them five malicious intents, focused on goals like data exfiltration or system manipulation, and set them loose. Our attack agents plan, iterate, and adapt. They go wherever they need to go to accomplish the goal.
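
At a high level, that plan-iterate-adapt behavior is a goal-directed loop run once per intent. Here is a minimal sketch; `send_prompt`, `breach_detected`, and `next_tactic` are placeholder hooks for illustration, not our actual agent internals:

```python
# Sketch of a goal-directed attack-agent loop in the plan-iterate-adapt
# style described above. send_prompt, breach_detected, and next_tactic
# are placeholder hooks, not the real evaluation's agent internals.

def run_intent(intent, send_prompt, breach_detected, next_tactic, max_turns=10):
    """Pursue one malicious intent against a target model.

    Returns True if the model resisted for the full turn budget,
    False if any tactic produced a breach.
    """
    tactic = intent  # start with the direct ask
    for _ in range(max_turns):
        response = send_prompt(tactic)
        if breach_detected(intent, response):
            return False  # the model gave in: vulnerable to this intent
        tactic = next_tactic(intent, tactic, response)  # adapt and try again
    return True  # the model held out: resistant to this intent

def resistance_summary(intents, **agent_hooks):
    """Count how many intents the model resisted, e.g. (3, 5)."""
    resisted = sum(run_intent(intent, **agent_hooks) for intent in intents)
    return resisted, len(intents)
```

The key difference from a single-shot benchmark is the loop: a single refusal doesn’t end the test, because the agent rewrites its tactic and keeps pressing toward the goal.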

That’s also why we now have two leaderboards. It wasn’t a design choice so much as a reality check. Some models that look strong on CASI crumble under agentic pressure. Others hold their ground.

A Case in Point: Qwen3

Let’s take Qwen3. Recently released by Alibaba, it’s objectively a good model. On CASI, it landed in 5th place, well within our top 10: respectable, though neither best nor worst. To put that into perspective, only the top 1% of models even make it onto our CASI Leaderboard. Impressive, right?

But on the ARS Leaderboard, Qwen3 didn’t even place. Why? Because when we tested it against five malicious intents – each representing a serious category of harm, such as self-harm – it failed every single one. In other words, it had zero resistance.

In comparison, the top model on the ARS Leaderboard is Anthropic’s Claude 3.5 Sonnet. Out of the same five malicious intents, Claude 3.5 was resilient to three and vulnerable to two. So still a risk, but a significantly reduced one. (Incidentally, Claude 3.5 was also the top-ranked model on the CASI Leaderboard, so there’s some correlation.)

Let’s consider what this means in practical, real-world terms. If you use Qwen3 within an AI system that connects to tools, data stores, or user inputs, you are exposing the entire system to real vulnerabilities. 

This is the nuance CASI alone can’t show you. You need both lenses: CASI for general security posture, ARS for system-level resilience under coordinated attack.

The Bottom Line

We didn’t build these scores for vanity. We built them because attackers won’t be static within an AI environment, and your security can’t be either. 

If you’re choosing a model based on its performance alone – MMLU score or speed – you are missing the part that matters most. So the next time someone tells you a model is “state-of-the-art”, ask: is it secure? Because if it’s not, it’s not ready.

To learn more about our Inference Platform, arrange a callback.
