Skip to main content

The Problem

An employee wants to bypass LLM rules that prohibit highly inflammatory messages from being sent in a prompt. By creating a virtual environment in which existing rules do not apply, the user is able to get the information past the filters, which releases the information into the LLM’s body of knowledge, and into the chat history it maintains on that user, and the organization.

The Challenge

In direct violation of organization rules, a user has “tricked” the LLM into allowing them to send controversial content that violates social norms and company values, sharing it with an unauthorized third party. The information is, therefore, at risk of further dissemination due to leaks or hacks to the third party, as well as at risk of becoming part of the dataset used to train/retrain subsequent iterations of the LLM. The information could also be included in the LLM’s knowledge base and, therefore, be accessible to all users, damaging the organization’s reputation by association.

The Solution

CalypsoAI Moderator scans prompts for patterns and categories of techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes. All details of the interaction are recorded, providing full auditability and attribution.