Skip to main content

While the term “artificial intelligence” (AI) became part of the everyday vernacular in late 2022, the term “AI Security” is quite new, not well understood, and often ignored or dismissed, even by cybersecurity professionals who ought to know better. This blog post is intended to clear up any ambiguity or misunderstanding about what we mean when we talk about AI Security, the threats it encounters, and approaches to take to create a secure AI environment across your enterprise. 


AI Security refers to the strategic implementation of robust measures and policies to protect your organization’s AI systems, data, and operations from unauthorized access, tampering, malicious attacks, and other digital threats. It is critical for any organization deploying AI tools in or across its enterprise to include security in its AI use strategy because every new component brought into a digital system, including AI-powered applications, increases the number of ways a threat actor can infiltrate or otherwise cause harm to the system; this is called the “attack surface.”

Securing AI-driven additions to an organization’s digital infrastructure, whether those are statistical  or large language models (LLMs), such as ChatGPT and others, is vital for maintaining the integrity, privacy, and reliability of your system, as well as for fostering customer trust and safeguarding your organization’s reputation. And, in a growing number of organizations that use AI-driven tools and solutions for decision-making, operations, and sensitive customer interactions, a strong AI security strategy is crucial to ensure these tools remain trustworthy and are securely protected against external threats or internal vulnerabilities.

This post identifies some of the most significant risks emerging in the AI Security landscape and provides solutions to address them.  


LLMs span the spectrum from being considered powerful tools to clever toys. The reality, however, is that they are the latest addition to an organization’s critical information infrastructure and are an especially vulnerable threat surface. Much of the early discussion of LLM risks centered on identifying and solving for threats based on human behavior, such as data loss and inappropriate use. The more insidious threats, however, are stealth attacks to the models and/or their datasets. These attacks are not only growing in terms of the damage they can inflict, but in scope, scale, and nuance, and are becoming increasingly difficult to detect. In the sections below, we outline the three types of attacks that pose the most significant threats to enterprises deploying LLMs and other generative AI (GenAI) models. 

Jailbreaks/Prompt Injection

“Jailbreak” or prompt injection techniques attempt to “trick” an LLM into providing information identified as dangerous, illegal, immoral, or unethical, or otherwise antithetical to standard social norms. Approximately two hours after ChatGPT-4 was released in March 2023, jailbreak attempts and other hacks on the model proved successful. The jailbreak directed the system to provide instructions for an alarming collection of antisocial activities. In the time since then, numerous threat actors have developed carefully worded, highly detailed role play, predictive text, reverse psychology, and other techniques to get LLMs to bypass internal content filters and controls that regulate the model’s responses. 

The danger of a successful jailbreak attack on a GenAI system, such as an LLM, is that it breaches the safeguards that prevent the model from executing bad commands, such as instructions to ignore protective measures or take destructive actions. Once that boundary between acceptable and unacceptable use disappears, the model has no further safeguards in place to stop it from following the new instructions.   

Data Poisoning

A Poisoning attack can have as its target the model, the model’s training data, or your organization’s unsecured data source. The goal of a Poisoning attack is to skew the results or predictions your model produces, and the outcome is that your organization relies on the flawed results and makes bad, potentially damaging decisions, disseminates faulty or incorrect information, or takes other ill-advised actions based on the model output.

  • Data Poisoning attacks target the model’s training dataset. The threat actor alters or manipulates the data or adds malicious, incorrect, biased, or otherwise inappropriate data that skews the model output. 
  • Model Poisoning attacks target the model itself. The threat actor alters the model or its parameters to ensure faulty output. 
  • “Backdoor” attacks require a two-step approach. The threat actor first manipulates the model’s dataset by adding malicious data to create a hidden vulnerability that does not affect the model in any way until it is triggered. Activating the vulnerability is the second step in this attack; it allows the hacker to cause damage to your organization on their own schedule.

Adversarial AI

Adversarial attacks occur after models have been deployed and are in use. These attacks vary in approach, but are all very clever, very difficult to detect, and very, very dangerous.

  • Model Inversion attacks review a model’s output to uncover sensitive info about the model itself or the dataset it was trained on, which can lead to privacy breaches.
  • Membership Inference attacks involve the threat actor trying to deduce whether specific data points, like information about a particular individual, were part of the training dataset. If successful, these attacks cause a significant invasion of privacy.
  • Model-Stealing attacks involve scrutinizing the output of a trained model to steal or copy its intellectual property with the goal of cloning the original model, typically for commercial gain.
  • Watermarking attacks change the parameters of a trained model to embed a hidden pattern that can be used to “prove” who owns the model. This can lead to significant financial loss, as well as loss of competitive advantage.
  • Model Inference attacks review a model’s output to find sensitive info about the training data or the model’s parameters,  which can lead to privacy breaches.


Innovative protective strategies and deployment tactics must be developed continually to keep up with new AI-powered technologies, such as LLMs, and the thought of doing so is nothing if not daunting. Proactively deploying safeguards that address the threats outlined above and others is the only reasonable approach to creating a secure environment for organizations to implement LLMs at scale and across the enterprise.

Our Moderator solution is the only tool on the market that provides a secure, resilient environment where other AI Security defenses fall short. Our solution protects your organization from users seeking to ignore or override system controls by reviewing language patterns and categories, such as role-playing, hypothetical conversations, world-building, and reverse psychology, to identify prompts seeking to violate acceptable usage rules. It also scans outgoing and incoming content for toxic, biased, or otherwise unacceptable inclusions. Users are alerted that the prompt will not be sent unless changes are made to its content, and a detailed prompt history allows for auditing such prompts and the users creating them. Moderator also enables administrators to require human verification of the information returned by the model to ensure that any content used in organizational documentation is accurate and factual. 


Understanding and implementing AI Security protocols is fast becoming a core business imperative in an increasingly AI-driven marketplace. By integrating AI tools, such as CalypsoAI Moderator, into your security framework, your organization will enable dynamic, real-time adaptation to new threats; ensure robust protection for its AI systems, data, and operations; provide decision-makers with peace of mind; and allow your organization to stay a step ahead of cybercriminals—and your competitors.