Skip to main content

Think you can outsmart AI? Announcing ‘Behind The Mask’ – Our all-new cybercrime role-playing game | Play Now

Jailbreak Prevention

CalypsoAI Thwarts Attempts to Bypass System Safeguards

CalypsoAI is the #1 solution to securing your system from data infiltration via large language models, thus preventing costly data breaches, and protecting your organization’s intellectual property.

The risks an organization faces from internal threat actors using “jailbreak” or prompt injection techniques to “trick” an LLM into providing information your organization has identified as contrary to your values or practices can include unauthorized access to sensitive or confidential data, among other scenarios. CalypsoAI is a proven solution for blocking prompt-driven techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes.

The Problem

An employee wants to bypass LLM rules that prohibit highly inflammatory messages from being sent in a prompt. By creating a virtual environment in which existing rules do not apply, the user is able to get the information past the filters, which releases the information into the LLM’s body of knowledge, and into the chat history it maintains on that user, and the organization.

The Challenge

In direct violation of organization rules, a user has “tricked” the LLM into allowing them to send controversial content that violates social norms and company values, sharing it with an unauthorized third party. The information is, therefore, at risk of further dissemination due to leaks or hacks to the third party, as well as at risk of becoming part of the dataset used to train/retrain subsequent iterations of the LLM. The information could also be included in the LLM’s knowledge base and, therefore, be accessible to all users, damaging the organization’s reputation by association.

The Solution

CalypsoAI scans prompts for patterns and categories of techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes. All details of the interaction are recorded, providing full auditability and attribution.

We Support

Visit Our Blog

Blog October 28, 2024

The Threats We See and How to Address Them

The increasing integration of AI into all aspects of modern life, both personal and professional, is energizing cyber threat actors to find new ways to exploit both the technology and…