Skip to main content

The companies developing generative AI (GenAI) models, such as OpenAI, Microsoft, Cohere,  and others, have made attempts—some might say “strides”—to build guardrails into their products, and undoubtedly conduct red-teaming engagements to determine their efficacy. On the other side of that exercise are the black hats, who don’t have to work nearly as hard because their chosen profession is breaking things, stealing things, or both. They just need to find one small, forgotten, dismissed, or unseen vulnerability in a model, and they can put up their feet and call it a day. 

When those same bad actors develop a new successful jailbreak, those feet start tap-dancing because they know the press coverage will be heavy and light (hearted) at the same time. Whereas that sort of news makes AI security professionals reach for the antacid tablets, the average user—or non-user—is amused, or at least bemused, by all the fuss. Seriously, what’s the harm? they think. It’s a prank.

Not quite.

The general public’s reaction is understandable: Typical attacks are code-heavy and happen behind the scenes, deep in the heart of IT networks. However, jailbreak attacks (or “prompt injection attacks” for us wonks) are the little black dress of AI threats: Deceptively simple, easily recognizable, and fit for purpose. 

How They Work

Typical jailbreak prompts are written in natural language and are generally creative, if not also clever. They tell a story that frequently begins with assigning the model a role or identity and then either provides it with omniscient powers or suspends established controls on its behavior, or both. 

The story continues with key information presented using persuasive, emotional language, compliments, or questions in languages other than English, as well as additional role-play, predictive text, reverse psychology, and other rhetorical devices. And, too often, these attacks produce the intended result, which is to “trick” a large language model (LLM) into circumventing established use thresholds and then request that it provide information or take actions identified as dangerous, illegal, immoral, or unethical, or otherwise antithetical to standard social norms or organizational acceptable use policies. 

The most dangerous thing about these attacks is something the general public doesn’t see, but that gives practitioners nightmares: After the boundary of the model’s internal controls is breached, the model has no further safeguards in place to stop it from following the threat actor’s instructions. It’s all in, with no turning back. The bad guys will get the goods they’re after. 

How do we know this? Because we tried it and succeeded. One of our team members tried unsuccessfully to get ChatGPT-4 to provide questionable instructions, and then tried again using over-the-top flattery and story-telling. Spoiler alert: The second attempt was successful.   

How They’re Stopped

The motivations behind jailbreak attempts can range from being little more than a playful challenge (see above) to a dire and dangerous threat to an enterprise and beyond. That huge spectrum of intentions is a significant factor in why it is so critically important for safeguards to be built-in and built up stronger than the black hats’ most ambitious efforts. However, one of the problems with executing on that is the complexity of the ask: With so many motivations and use cases in play, a stringent, one-size-fits-all solution will never protect every vulnerability. 

Enter the white-hat creative problem-solvers. 

A recent paper by a team of researchers from three renowned AI organizations describes a novel technique inspired by psychology that has shown good results. Put simply, the model is trained to execute self-reminders on an established cadence, “nudging” itself to refuse to carry out instructions that are immoral, dangerous, toxic, etc. The team reduced the number of successful attacks from nearly 70% to just under 20% using this methodology. 

While promising, this technique and others are still in the testing phase, which does not help organizations that need protections now. And there is a solution available. Our model-agnostic GenAI security and enablement platform is the first of its kind to provide proven protection against jailbreak attacks, adding critical value where other AI security defenses fall short. 

Our solution creates a trust layer that protects the organization from internal users seeking to bypass internal information controls. By reviewing content for syntactic constructions and language patterns typically used in prompt injection attacks, such as hypothetical conversations, world-building, role-playing, reverse psychology, and others, our platform identifies attempts to violate usage rules. 

How do we know it works? Because we’ve tested it: The same team member who successfully completed the jailbreak described above with ChatGPT-4 tries the same approach with CalypsoAI and gets shut down cold. We welcome you to do the same when you sign up for our beta. 

Conclusion

Innovative protective strategies must be developed seemingly on the fly to keep up with new technologies, such as LLMs, and the novel threats they invite. The idea of protecting such systems when they are deployed at scale across an organization is nothing if not daunting. CalypsoAI’s platform is the only solution on the market that provides a secure, resilient environment with customizable layers of protection against a broad range of threats, including jailbreak attempts. Additionally, our platform offers full auditability and activity attribution across models, enabling administrators to see who is doing what, how often, and on which models, providing critical comparative and analytic data about system content, cost, use—and abuse—in real time.

 

Click here to book a demonstration.