Skip to content

Detecting prompt injection

Key idea:

Prompt injection is user input that overrides or bypasses your system prompt. Three classes: direct ("ignore instructions, do X"), indirect (malicious payload in a RAG doc/URL), jailbreak (role-play to bypass policy). Defense in three layers: pre-filter (regex + embedding), post-filter (LLM judge on the response), and monitoring (logs + heartbeat + anomalies).

Below: details, example, related terms, FAQ.

Try it now — free →

Details

  • Pre-filter regex: block "ignore previous", "system prompt", "<|im_start|>" — naive but catches 60% of attempts
  • Embedding filter: cosine similarity of user input to a known-jailbreak corpus; threshold 0.85
  • Post-filter LLM judge: second call with "Is this response unsafe / off-topic?"
  • Output canary token: embed a unique string in the system prompt, check for it in responses (leak = jailbreak)
  • Heartbeat monitor on /chat endpoint: response P95 normal but spike in blocked_count → active attack

Example

# Simple pre-filter in Python
import re

BLOCKED = [
    r'ignore (?:previous|prior|all) (?:instructions|prompts)',
    r'system\s*prompt',
    r'<\|im_start\|>',
    r'\bDAN\b',  # 'Do Anything Now' jailbreak
    r'jailbreak',
]

def is_suspicious(text: str) -> bool:
    t = text.lower()
    return any(re.search(p, t) for p in BLOCKED)

# In production: log_to_enterno_heartbeat('blocked' if is_suspicious(input) else 'ok')
# Alert: > 10 blocked/min over the last 5 min → notify Slack

Related

TL;DR: Detecting Prompt Injection in LLM Apps

To detect prompt injection in a Large Language Model (LLM) application, implement input validation and sanitization, monitor API usage for anomalies, and utilize anomaly detection algorithms. Regularly review user inputs and model outputs for unexpected behavior and consider integrating tools like OpenAI's safety features to flag potential injections.

Understanding Prompt Injection Vulnerabilities

Prompt injection occurs when an attacker manipulates the inputs to an LLM to alter its behavior. This can result in unauthorized data access, incorrect outputs, or triggering unintended actions. Understanding the mechanics of prompt injection is crucial for developing effective detection strategies.

LLMs process inputs as prompts, often without strict validation. An attacker can craft inputs that exploit this openness, leading to unintended consequences. For example, if an LLM is asked to generate a summary of a document, an attacker might include instructions within the input that cause the model to divulge sensitive information instead.

Common techniques for prompt injection include:

  • Command Injection: Inserting commands that the LLM interprets as valid instructions.
  • Context Manipulation: Crafting inputs that alter the context in which the LLM operates.
  • Input Chaining: Combining multiple inputs to create a complex instruction set that the LLM cannot parse correctly.

To mitigate these risks, it's important to implement robust monitoring and validation systems that can identify suspicious patterns in inputs.

Practical Strategies for Detection and Mitigation

To effectively detect prompt injection in your LLM application, consider the following strategies:

1. Input Validation and Sanitization

Implement strict validation rules for user inputs. Use libraries such as validator.js in Node.js applications to sanitize inputs before they reach your LLM.

const validator = require('validator');

function sanitizeInput(input) {
    return validator.escape(input);
}

2. Anomaly Detection

Utilize machine learning models to analyze patterns in API usage. For instance, if you notice a sudden spike in requests with unusual parameters, flag these for review. Tools like TensorFlow or Scikit-Learn can help you build models that detect anomalies based on historical usage.

3. Monitoring and Logging

Implement comprehensive logging of all user inputs and model outputs. This allows you to trace back any anomalous behavior to specific user sessions. Use tools like Loggly or Splunk to manage and analyze your logs effectively.

4. Regular Audits

Conduct regular audits of your LLM application. Review logs for any signs of prompt injection attempts, such as unusual input patterns or outputs that deviate from the expected norms. Create a checklist for these audits that includes:

  • Review of input patterns for anomalies
  • Analysis of model output consistency
  • Assessment of user behavior

By implementing these strategies, you can significantly reduce the risk of prompt injection in your LLM applications, ensuring that both your model and your users remain secure.

HeadersCSP, HSTS, X-Frame-Options, etc.
SSL/TLSEncryption and certificate
ConfigurationServer settings and leaks
Grade A-FOverall security score

Why teams trust us

OWASP
guidelines
15+
security headers
<2s
result
A–F
security grade

How it works

1

Enter site URL

2

Security headers analyzed

3

Get grade A–F

What Does the Security Analysis Check?

The tool checks HTTP security headers, SSL/TLS configuration, server info leaks, and protection against common attacks (XSS, clickjacking, MIME sniffing). A grade fromA to F shows overall security level.

Header Analysis

Checking Content-Security-Policy, HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, and more.

SSL Check

TLS version, certificate expiry, chain of trust, HSTS support.

Leak Detection

Finding exposed server versions, debug modes, open configs, and directories.

Report with Recommendations

Detailed report explaining each issue with specific steps to fix it.

Who uses this

Security teams

HTTP header audit

DevOps

config verification

Developers

CSP & HSTS setup

Auditors

compliance checks

Common Mistakes

Missing Content-Security-PolicyCSP is the primary XSS defense. Without it, script injection is much easier.
Missing HSTS headerWithout HSTS, HTTPS-to-HTTP downgrade attacks are possible. Enable Strict-Transport-Security.
Server header exposes versionServer: Apache/2.4.52 helps attackers find exploits. Hide the version.
X-Frame-Options not setSite can be embedded in iframe for clickjacking. Set DENY or SAMEORIGIN.
Missing X-Content-Type-OptionsWithout nosniff, browsers may misinterpret file types (MIME sniffing).

Best Practices

Start with basic headersMinimum: HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy. Takes 5 minutes.
Implement CSP graduallyStart with Content-Security-Policy-Report-Only, monitor violations, then enforce.
Hide server headersRemove Server, X-Powered-By, X-AspNet-Version from responses.
Configure Permissions-PolicyRestrict camera, microphone, geolocation access — only what is actually used.
Check after every deploySecurity headers can be overwritten during server configuration updates.

Get more with a free account

Security check history and HTTP security header monitoring.

Sign up free

Learn more

Frequently Asked Questions

Why is regex alone not enough?

An attacker will replace "ignore" with "I-G-N-O-R-E" or translate to another language. Regex catches low-hanging fruit; embedding filter + LLM judge close the rest.

Does the embedding filter slow things down?

One embeddings call ≈ 50-100 ms. Cache by input hash for repeats. For high-RPS — pre-compute jailbreak-corpus embeddings offline.

What do I do on detection?

Do not return content → respond with a generic message → increment heartbeat "blocked" counter → log IP/user_id for a downstream ban flow on repeats.

Try the live tool that powered this guide

Free plan — 20 monitors, 5-minute checks, no card required. Upgrade for 1-minute interval and multi-region monitoring.