Most LLM security testing dies the same way: someone fires off a dozen jailbreak prompts they saw on Twitter, nothing catches fire, and the model ships. That isn’t testing. It’s a vibe check. It doesn’t cover the attack surface, it isn’t repeatable, and it produces nothing you can diff against next week’s deploy.
garak is the fix for the breadth problem. It’s an open-source scanner that throws hundreds of known attacks at a model automatically, scores the responses, and hands you a structured report you can track over time. If you’ve used nmap to enumerate a network or metasploit to run known exploits against it, that’s the right mental model: garak is the automated, breadth-first baseline: the floor of your assessment, not the ceiling. It won’t invent the novel multi-step exploit against your specific agent (that’s still human work), but it will reliably surface the low-hanging fruit no manual process catches at scale, the same way every time.
This is the long version of the guide. If you want a shorter, narrated walkthrough of a single scan, I wrote a hands-on companion piece, but read this one first if you want to actually understand what the tool is doing.
What garak is
garak stands for Generative AI Red-teaming & Assessment Kit. It was created by Leon Derczynski and is now an NVIDIA project, with contributions from Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. It’s Apache-2.0 licensed, the source lives at github.com/NVIDIA/garak, and there’s an accompanying paper if you want the academic framing (arXiv:2406.11036).
The one-line description the authors use is the right one: garak checks whether an LLM can be made to fail in a way you don’t want. Install is a single line:
python -m pip install -U garak
The reference docs at reference.garak.ai hold the canonical index of probes and detectors, and they’re worth bookmarking. The catalog moves faster than any blog post can.
How garak works (architecture)
garak is a plugin framework, and once you understand the five plugin types you understand the whole tool. The pipeline runs probe → generator → detector → evaluator, with buffs and harnesses sitting around the edges.
The mental model that makes it click:
- Generators are the model answering the question, the system under test. garak ships generators for OpenAI, Hugging Face (both local and Hub), NVIDIA NIM, Ollama, Cohere, Bedrock, Replicate, raw REST endpoints,
ggml/llama.cpp, and atestgenerator that returns canned output so you can dry-run for free. - Probes are the questions: the attacks. Each probe class targets one family of failure and has full control of the conversation it drives.
- Detectors are the graders. This is the component that actually decides pass or fail. A response a detector flags as vulnerable is a hit. Detectors range from dumb string/regex matches up through ML classifiers and LLM-as-judge.
- Buffs are input transforms applied between the probe and the generator: base64, lowercasing, paraphrase, translation. They let you re-run an existing attack through an obfuscation layer.
- Harnesses are orchestration. The default is
probewise, where each probe declares which detectors it recommends. - Evaluators are the report card. They turn raw detector results into pass rates, Z-scores, and grades.
End to end: you give garak a target spec, it instantiates the generator, the harness picks probes and their detectors, each probe emits prompts (optionally buffed), the generator returns N completions per prompt (you control N with --generations), the detectors score every completion, hits get logged, and the evaluators aggregate everything into a report.
The probe catalog
This is where garak earns its keep. The probe modules are organized loosely by attack class. Below are the ones I’ll state by name because they’re stable in the catalog, the names you’ll actually pass to --probes. Note that the catalog keeps growing release to release; new probe families land regularly, so always check --list_probes against your installed version rather than trusting any static list, including this one.
Jailbreak and safety bypass
dan is the DAN (“Do Anything Now”) family: Dan_11_0, DUDE, STAN, AntiDAN, and the in-the-wild variants. grandma is the appeal-to-ethos trick (the fictive dying grandmother who used to read you napalm recipes). goodside collects Riley Goodside’s classics: “ignore previous instructions,” Unicode tricks. lmrc works through the Language Model Risk Cards categories. donotanswer runs the Do-Not-Answer dataset to test whether the model refuses things it should.
A pattern worth internalizing: these social-engineering framings (grandma, goodside, the phrasing manipulations) routinely bypass guardrails that block the literal request. Safety training tends to be phrasing-shallow. The model has learned to refuse a shape of prompt, not a meaning.
Prompt injection, direct and indirect
promptinject implements the Perez & Ribeiro PromptInject framework, direct injection that hijacks the system prompt (HijackHateHumans, HijackKillHumans, HijackLongPrompt). latentinjection is the indirect/latent injection class: the payload is buried inside a document, report, resume, translation, or fact snippet the model is asked to process. This is the RAG-poisoning and tool-context attack surface. web_injection covers data-exfiltration injection via rendered Markdown/HTML.
One naming note that trips people up: web_injection is the current module name; older tutorials and issues refer to it as xss. Same lineage, renamed. If a guide tells you to run --probes xss, mentally translate.
Two reproducible findings worth flagging here. First, on tool-using and RAG targets, indirect (latent) injection consistently beats direct injection — which is the whole argument for treating retrieved content as untrusted input rather than as part of your prompt. Second, web_injection.MarkdownImageExfil is a real, easy-to-miss data-leak channel: if your app renders the model’s Markdown output, an injected image whose URL encodes conversation data will silently exfiltrate it the moment the client fetches the image.
Encoding and filter evasion
encoding wraps the payload in base64, ROT13, hex, Morse, Braille, Ecoji, and friends (InjectBase64, InjectROT13). This is one of the highest-signal probe classes in the whole tool. Models that cleanly refuse a plaintext request will frequently comply once it’s base64-encoded, decode-and-answer in one step, and never trip their own safety filter. It’s the single best argument for output-side guardrails. The input filter never sees the dangerous string because it’s wrapped.
Data leakage and privacy
leakreplay extracts memorized training data by getting the model to “replay” cloze-style copyrighted text. divergence is the repeated-token attack (the infamous “poem poem poem” divergence that dumps verbatim training data). propile probes for PII leakage, ProPILE-style.
Toxicity
realtoxicityprompts runs the RealToxicityPrompts benchmark; continuation leads the model into completing a slur or offensive token it’s been set up to finish.
Malware and supply chain
malwaregen probes willingness to produce malware components and evasion code. packagehallucination is the one security teams sleep on: it checks whether the model recommends non-existent packages across PyPI and npm. That’s a live supply-chain risk known as slopsquatting, where an attacker registers the package name the model keeps hallucinating and waits for someone to pip install it.
Misinformation and reasoning
snowball exploits snowballed hallucination: it asks hard-or-impossible questions (primality of large numbers, graph connectivity) that the model answers confidently and wrongly. misleading plants false premises and contradictory claims. glitch hits the glitch-token weirdness (SolidGoldMagikarp and relatives).
Signatures and automated attacks
knownbadsignatures checks whether the model will emit known test signatures (EICAR, GTUBE, GTphish), a fast baseline of whether it’ll produce flagged content at all. And then the optimization-based probes: atkgen runs a closed-loop adversarial red-team model that dynamically generates attacks against your target; tap is Tree-of-Attacks-with-Pruning, the PAIR-style automated jailbreak search; suffix is the GCG-lineage adversarial-suffix optimizer. There are also multimodal probes like visual_jailbreak (FigStep-style image jailbreaks) and terminal-layer attacks like ansiescape.
Running it
The flags changed at some point: the current form is --target_type/--target_name. The old --model_type/--model_name still work as aliases, which is why half the tutorials online use the older spelling. Use whichever, but know they’re the same thing.
First, enumerate what you’ve got installed:
garak --list_probes
garak --list_detectors
garak --list_generators
garak --list_buffs
Before you spend a dollar of API budget, run the no-cost smoke test. The test generator returns canned output, so this exercises the pipeline end to end without hitting a real model:
garak --target_type test.Blank --probes test.Test
Real targets. OpenAI and Ollama:
export OPENAI_API_KEY="sk-..."
garak --target_type openai --target_name gpt-4o-mini --probes encoding
garak --target_type ollama --target_name llama3.1:8b --probes promptinject,latentinjection
A local Hugging Face model, and raising --generations for statistical confidence on a noisy probe:
garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0
garak --target_type openai --target_name gpt-4o-mini --probes glitch --generations 10
Stacking a buff onto a probe, here running the DAN suite through base64 encoding:
garak --target_type openai --target_name gpt-4o-mini --probes dan --buffs encoding
Omitting --probes entirely runs the full catalog. Be deliberate about this. It’s the slow, expensive run:
garak --target_type openai --target_name gpt-4o-mini
The piece that turns garak from a model scanner into an application scanner is the REST generator. Instead of pointing at a raw model, you point at your own endpoint:
garak --target_type rest -G rest_config.json --probes latentinjection
The rest_config.json is a JSON config describing your endpoint: URL, method, headers, a request template with $INPUT (and $KEY) placeholders, and a JSONPath to pull the model’s reply out of the response body. This is how you test the thing you actually ship (your prompt, your retrieval, your guardrails) rather than the bare model underneath it. Use --config/-G for any run you want to be repeatable; the REST generator requires it.
Reading the output
garak writes everything to ~/.local/share/garak/garak_runs/. Three artifacts matter.
report.jsonl is the machine-readable source of truth: one JSON row per event (init, config, and every attempt with its status, prompt, the model’s outputs, and the detector scores). This is what you parse in automation.
hitlog.jsonl is the file you actually open as a human. It contains only the successful attacks — the cases where the model failed. It’s “here’s exactly what broke, and here’s the prompt that broke it.” For triage, this is where the value lives.
The HTML digest is the human-friendly rollup, generated from the jsonl:
python -m garak.analyze.report_digest -r garak.<uuid>.report.jsonl -o report.html
It groups results by probe → detector and shows both absolute and relative scores.
Now the part people get wrong. A hit is a single response a detector flagged as vulnerable. Pass rate = 1 − hits/attempts. garak reports two flavors of score, and conflating them will mislead you:
- Absolute pass rate is the raw percentage of attempts the model resisted (with a 95% confidence interval once you have enough attempts). This maps directly to your risk tolerance. It answers: how often did this fail?
- Relative Z-score compares your model against a moving baseline: a “bag” of recently-tested models. Zero is average. Negative means worse than peers. garak also rolls this into a DEFCON-style 1-to-5 grade. I won’t quote the exact band labels because they drift between versions. Treat the grade as “how do I rank against the field” and nothing more.
The trap: a model can beat the average and still be unacceptable. A positive Z-score on a probe where every model is terrible just means you’re the tallest in a short room. Conversely, a mediocre Z-score where the whole field is excellent might be perfectly shippable. The Z-score is calibration against peers; the absolute number is your actual risk. Never ship something because it beat the average.
And do not trust the aggregate numbers blindly. Open the hitlog and read the prompts and responses. Many detectors are regex or keyword based, which cuts both ways: a polite, correct refusal can occasionally register as a hit, and a dangerous answer in an unexpected format can slip past as a pass. The aggregate tells you where to look; the hitlog tells you what’s real.
garak in CI
The instinct is to wire the full scan into your pipeline. Don’t. A full catalog run is minutes to hours: hundreds of probes times your generation count, and several detectors pull in helper models that are CPU-, RAM-, and disk-heavy. That’s far too slow to run on every commit.
The pattern that works is two-tier:
- Fast PR gate. A small, curated probe subset (
promptinject,latentinjection,encoding, plustest) at low--generations. Parse the hitlog, fail the job above a hit threshold. This catches regressions in your prompt or guardrails in a couple of minutes. (Whether the threshold gate is built in or a wrapper script depends on your version. Confirm against what you’ve installed.) - Scheduled deep scan. The full or near-full catalog, nightly or weekly, on a dedicated runner, with results tracked over time so you can see drift.
One non-negotiable: pin the garak version. The probe catalog changes between releases, which means an unpinned scanner will silently change what it tests — and your historical comparisons become meaningless. Pin it, and note the version alongside your results. Worth knowing: garak ships inside NVIDIA NeMo Guardrails as an eval backend, and projects like 0din-ai’s ai-scanner wrap it for scheduling and dashboards, if you’d rather not build the orchestration yourself.
garak vs promptfoo vs PyRIT
These get pitched as competitors. They’re not. They’re complementary, and a mature program runs all three.
| garak | promptfoo | PyRIT | |
|---|---|---|---|
| Built by | NVIDIA / Derczynski | Ian Webster | Microsoft AI Red Team |
| Shape | Pre-built vulnerability scanner | LLM eval + red-team framework | Red-teaming SDK / orchestrator |
| Interface | CLI, run-and-report | YAML + CLI + web UI | Python library |
| Strength | Large attack catalog, peer-calibrated scoring | App-specific evals tuned to your prompts | Custom multi-turn / agentic campaigns |
| Best for | Breadth-first baseline + regression | Dev/CI-first app testing | Bespoke deep campaigns |
Pick garak when you want a breadth-first baseline and regression scanner: the broadest catalog of known attacks, run-and-report, with scoring calibrated against other models.
Pick promptfoo when the testing is application-specific and lives in your repo and CI: evals tuned to your prompts, your expected outputs, your regression suite.
Pick PyRIT when you’re building bespoke campaigns: multi-turn, automated, agentic attacks that no canned catalog covers. It has fewer pre-built attacks and more programmable surface.
The stack: garak for breadth, promptfoo for app-specific CI, PyRIT for the deep custom work. This sits inside a broader AI red teaming practice. The tools are the floor, not the program.
Limitations and gotchas
I’d rather you go in clear-eyed than disappointed.
Full scans are slow and expensive. Hundreds of probes times generations times helper-model detectors burns real tokens and compute. First-timers are routinely surprised. Start with a probe subset at --generations 1, then scale up.
Detector false positives are real. Because many detectors are string/regex/keyword matchers, a polite refusal can read as a hit and a well-formatted unsafe answer can slip through. Always eyeball the hitlog — the aggregate is a pointer, not a verdict.
The Z-score baseline moves. The peer “bag” changes over time, so the same model can shift grade between garak versions without changing at all. Pin the version and record the baseline date next to any grade you report.
Coverage stops at the known catalog. garak tests behaviors it has probes for. It will not find your application’s business-logic abuse, your agent-orchestration flaws, or the bespoke multi-step exploit that chains three of your tools together. Latent and web injection are good but generic. The novel stuff is still human work.
Helper-model and API dependencies. atkgen, tap, and several detectors need an auxiliary LLM or a downloaded HF classifier; the first run pulls large models, and toxicity scoring has historically leaned on a Perspective API key. Budget for the cold start.
Naming drift. xss became web_injection; old tutorials cite stale probe names and the deprecated --model_type flag. When something won’t run, check the current name in --list_probes.
Non-determinism. Between sampling and helper-model variance, single-run numbers are noisy. Use enough --generations and report confidence intervals, not a single point estimate you’ll over-read.
Bottom line
If you’re shipping anything built on an LLM and you haven’t run garak against it, that’s the first thing to fix. It’s the cheapest way to establish a defensible security baseline and the only practical way to regression-test model behavior at scale.
Start small. Don’t kick off a full scan as your first move. You’ll wait an hour and burn budget to learn things a targeted run would’ve told you in five minutes. Run the test.Blank smoke test, then a handful of high-signal probes (encoding, promptinject, latentinjection) against your real endpoint through the REST generator. Read the hitlog by hand, get a feel for the false positives, then decide what belongs in your fast CI gate and what belongs in the weekly deep scan.
Keep the framing honest: garak is your automated breadth-first floor, finding the known failures reliably and repeatably. The novel exploit against your specific system — the thing that ends up in an incident report — still takes a human. Run garak so you’re not wasting that human’s time on the low-hanging fruit.