Blog
30 Jun 2025

LLM Evaluation: Building Trust with Security Scoring

As enterprise AI adoption accelerates, LLM evaluation is no longer just about performance; it’s about trust. Security teams must validate not only how well a model performs, but how safely it behaves. Yet most evaluation frameworks focus on accuracy benchmarks, leaving security and risk factors vague or unmeasured.

This lack of clarity is dangerous. Without explainable, benchmarkable security metrics, there’s no credible way to compare models, demonstrate resilience, or make risk-adjusted decisions about how and where AI should be deployed.

At CalypsoAI, we’ve spent months working with enterprise security leaders who’ve echoed a consistent concern: “We need to explain to internal stakeholders what a score means, how it was derived, and what we can do to improve it.”

This post explores how to build that trust by understanding the two complementary scores that now form the backbone of secure GenAI evaluation: the CalypsoAI Security Index (CASI) and the Agentic Warfare Resistance (AWR) score.

The Problem with AI Security "Scoring" Today

When enterprises test AI models, they typically get one of two things: a binary outcome (did the model break or not?) or a vague risk rating (low, medium, or high) with little supporting detail.

Neither approach holds up when AI systems enter production and begin interacting with sensitive data, external users, or downstream agents.

Security leaders need more than attack success rates; they need visibility into:

  • Severity: How dangerous is a successful attack?
  • Complexity: How hard is it to exploit the model?
  • Context: Does it break under real-world, multi-turn, or application-level use?
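Put concretely, each red-team finding carries more information than a pass/fail flag. Below is a minimal sketch, in Python, of what such a finding record could look like; the field names are illustrative, not any vendor’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One red-team finding, illustrative only.

    A binary harness reduces all of this to `succeeded`; the extra
    fields are what make a result comparable and actionable.
    """
    attack_id: str
    succeeded: bool          # did the attack land at all?
    severity: float          # 0..1 -- credential disclosure ~1.0, trivia ~0.1
    attack_complexity: int   # refinements/turns the attacker needed
    context: str             # "single-turn", "multi-turn", or "application-level"
```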

That’s why we developed two distinct scoring systems, each addressing a different layer of AI system security.

CASI: A Model's Security DNA

The CalypsoAI Security Index (CASI) helps teams add security rigor to their LLM evaluation process by measuring more than just jailbreak success. CASI scores foundational models on a scale of 0–100, incorporating:

  • Severity of impact: Not all failures are equal—disclosing credentials is worse than answering trivia.
  • Attack complexity: Models that break under simple phrasing aren’t as secure as those that require advanced adversarial strategies.
  • Defensive breaking point: How quickly and under what conditions does a model's alignment collapse?

This lets security teams choose models with high resilience—not just those that pass easy tests.
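This post describes the inputs to CASI but not the exact formula, so the snippet below is only a rough sketch of how severity, attack complexity, and breaking point could roll up into a 0–100 score; the weights and field names are placeholders, not CalypsoAI’s published methodology.

```python
def illustrative_resilience_score(findings: list[dict]) -> float:
    """Rough 0-100 composite in the spirit of CASI (weights are placeholders).

    Each finding is a dict such as:
        {"succeeded": True, "severity": 0.9, "complexity": 2}
    where severity is 0..1 and complexity counts the attacker
    refinements needed before the model's defenses gave way.
    """
    score = 100.0
    for f in findings:
        if not f["succeeded"]:
            continue  # blocked attempts do not reduce the score
        # Severe breaks that required little effort cost the most;
        # breaks that needed heavy adversarial work cost far less.
        ease = 1.0 / max(f["complexity"], 1)
        score -= 25.0 * f["severity"] * ease
    return max(score, 0.0)

# A model that only breaks under complex attacks keeps most of its score:
print(illustrative_resilience_score([{"succeeded": True, "severity": 0.9, "complexity": 6}]))  # 96.25
print(illustrative_resilience_score([{"succeeded": True, "severity": 0.9, "complexity": 1}]))  # 77.5
```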

For example, Qwen3 scored competitively on CASI, suggesting strong model-level defenses. However, when tested using agentic methods, it failed to withstand more sophisticated, persistent attacks. This highlights the limits of model-only evaluation.

These results are published on the CalypsoAI Model Security Leaderboards, a regularly updated, public resource that ranks the most widely used LLMs based on real-world red-teaming. Unlike conventional performance charts, this leaderboard helps enterprises compare models on security, risk, cost, and system-level resilience, making it a critical tool for informed LLM evaluation.

AWR: When Models Become Systems

While CASI measures foundational model resilience, it doesn’t account for system-level behavior. That’s where the Agentic Warfare Resistance (AWR) score comes in.

AWR measures how an AI system—including any agents, retrieval tools, or orchestration layers—holds up under persistent, adaptive attacks. These tests are executed by autonomous adversarial agents that:

  • Learn from failed attempts
  • Chain attacks across multiple turns
  • Target hidden prompts, vector stores, and retrieval logic
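The loop below is a minimal sketch of how such a persistent, adaptive attacker might be structured; `looks_compromised` and `refine_attack` are hypothetical stand-ins for the attacker agent’s own judgment and are not part of any real harness.

```python
def looks_compromised(response: str, objective: str) -> bool:
    # Placeholder success check; a real harness would use a judge model
    # rather than simple string matching.
    return objective.lower() in response.lower()

def refine_attack(objective: str, history: list[tuple[str, str]]) -> str:
    # Placeholder refinement; a real attacker agent reasons over the full
    # transcript, chains context across turns, and pivots toward hidden
    # prompts, vector stores, or retrieval logic.
    prior_refusals = " / ".join(resp[:40] for _, resp in history)
    return f"{objective}\n(Previous refusals: {prior_refusals}. Reframe and retry.)"

def agentic_attack_session(target, objective: str, max_turns: int = 20) -> dict:
    """Minimal sketch of a persistent, multi-turn attack loop (illustrative).

    `target` is any callable that maps a prompt string to a response string.
    """
    history: list[tuple[str, str]] = []
    prompt = objective
    for turn in range(1, max_turns + 1):
        response = target(prompt)
        history.append((prompt, response))
        if looks_compromised(response, objective):
            return {"broken_at_turn": turn, "history": history}
        prompt = refine_attack(objective, history)  # learn from the failed attempt
    return {"broken_at_turn": None, "history": history}
```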

Scored from 0–100, AWR is built around three dimensions:

  1. Required Sophistication: How clever does the attacker need to be?
  2. Defensive Endurance: How long can the system resist?
  3. Counter-Intelligence: Does the system reveal useful attack clues even when it blocks the initial threat?

A higher AWR score means your AI system is not just secure in isolation. It can withstand contextual, agentic attacks that mimic what real threat actors are already testing in the wild.
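The post names the three dimensions but not how they combine, so the function below is only an illustrative 0–100 composite with equal, arbitrary weights; it is not the published AWR methodology.

```python
def illustrative_awr(required_sophistication: float,
                     defensive_endurance: float,
                     counter_intelligence: float) -> float:
    """Illustrative 0-100 composite over the three AWR dimensions.

    All inputs are normalized to 0..1:
      required_sophistication -- how capable the attacker agent had to be
      defensive_endurance     -- how long the system held out across turns
      counter_intelligence    -- how little exploitable detail refusals leaked
    Equal weighting is an arbitrary placeholder, not the real formula.
    """
    return 100.0 * (required_sophistication
                    + defensive_endurance
                    + counter_intelligence) / 3.0
```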

Why Security Leaders Need Both

Security leaders evaluating AI deployments, especially those involving RAG architectures, autonomous agents, or complex orchestration workflows, need a layered view of trust. CASI helps choose a foundation. It tells you whether a model’s built-in defenses are robust enough for enterprise-grade applications. AWR validates your deployment. It shows whether your custom workflows or integrated systems introduce new vulnerabilities, even when the base model scores well.

Recent model testing emphasizes this divergence. In June, Claude 3.7’s AWR score improved while its CASI dipped slightly, signaling that its system-level behavior had been hardened, even as some model-level vulnerabilities remained.
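As a sketch of how the two scores might feed a deployment decision, the toy policy below gates on both; the thresholds are arbitrary placeholders, not recommended values.

```python
def deployment_decision(casi: float, awr: float,
                        min_casi: float = 80.0, min_awr: float = 75.0) -> str:
    """Toy gating policy over both scores; thresholds are placeholders.

    A strong CASI with a weak AWR points at the deployment layer
    (agents, retrieval, orchestration); the reverse points at the
    base model or its guardrails.
    """
    if casi >= min_casi and awr >= min_awr:
        return "approve: model and system both meet the bar"
    if casi >= min_casi:
        return "hold: harden the system layer (agents, RAG, orchestration)"
    if awr >= min_awr:
        return "hold: reconsider the base model or add model-level guardrails"
    return "block: remediate at both the model and system level"
```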

Transparency Drives Actionability

Scoring isn’t useful unless it drives decisions. Security leaders tell us they need:

  • Explainable scoring methodologies that can be shared with risk committees and product teams
  • Clear, numerical indicators that can be benchmarked, tracked, and improved
  • Application-aware red-teaming that exposes system-level weaknesses (not just model flaws)

That’s why we publish scoring methodologies, provide prompt-level logs in red-team reports, and support continuous security testing across both models and AI systems.
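For a sense of what prompt-level logs can make explainable, here is a hypothetical shape for a single report entry; the fields are illustrative, not CalypsoAI’s actual report schema.

```python
# Hypothetical shape of one prompt-level entry in a red-team report.
# Field names are illustrative, not an actual report schema.
example_entry = {
    "attack_id": "rt-0042",                       # placeholder identifier
    "technique": "multi-turn role-play escalation",
    "turns": 4,
    "target_layer": "retrieval prompt",           # model, agent, or retrieval layer
    "outcome": "blocked",
    "severity_if_successful": "critical",
    "transcript": ["<prompt/response pairs omitted>"],
}
```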

Trust in AI doesn’t come from vague assurance. It comes from scoring that reflects real risk, testing that mirrors real threats, and reporting that leads to real improvements.

Final Thought: Trust Is Built, Not Claimed

AI is only as trustworthy as the process you use to evaluate and monitor it. Transparent security scoring—at both the model and system level—gives security teams the language, evidence, and confidence they need to deploy GenAI safely.

And for enterprises working across regulated industries, high-risk domains, or user-facing AI agents, that confidence is mandatory.

To learn more about our Inference Platform, arrange a callback.
