garak: A Complete Guide to LLM Vulnerability Scanning

Most LLM security testing dies the same way: someone fires off a dozen jailbreak prompts they saw on Twitter, nothing catches fire, and the model ships. That isn’t testing. It’s a vibe check. It doesn’t cover the attack surface, it isn’t repeatable, and it produces nothing you can diff against next week’s deploy.

garak is the fix for the breadth problem. It’s an open-source scanner that throws hundreds of known attacks at a model automatically, scores the responses, and hands you a structured report you can track over time. If you’ve used nmap to enumerate a network or metasploit to run known exploits against it, that’s the right mental model: garak is the automated, breadth-first baseline: the floor of your assessment, not the ceiling. It won’t invent the novel multi-step exploit against your specific agent (that’s still human work), but it will reliably surface the low-hanging fruit no manual process catches at scale, the same way every time.

This is the long version of the guide. If you want a shorter, narrated walkthrough of a single scan, I wrote a hands-on companion piece, but read this one first if you want to actually understand what the tool is doing.

What garak is

garak stands for Generative AI Red-teaming & Assessment Kit. It was created by Leon Derczynski and is now an NVIDIA project, with contributions from Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. It’s Apache-2.0 licensed, the source lives at github.com/NVIDIA/garak, and there’s an accompanying paper if you want the academic framing (arXiv:2406.11036).

The one-line description the authors use is the right one: garak checks whether an LLM can be made to fail in a way you don’t want. Install is a single line:

python -m pip install -U garak

The reference docs at reference.garak.ai hold the canonical index of probes and detectors, and they’re worth bookmarking. The catalog moves faster than any blog post can.

How garak works (architecture)

garak is a plugin framework, and once you understand the five plugin types you understand the whole tool. The pipeline runs probe → generator → detector → evaluator, with buffs and harnesses sitting around the edges.

The mental model that makes it click:

Generators are the model answering the question, the system under test. garak ships generators for OpenAI, Hugging Face (both local and Hub), NVIDIA NIM, Ollama, Cohere, Bedrock, Replicate, raw REST endpoints, ggml/llama.cpp, and a test generator that returns canned output so you can dry-run for free.
Probes are the questions: the attacks. Each probe class targets one family of failure and has full control of the conversation it drives.
Detectors are the graders. This is the component that actually decides pass or fail. A response a detector flags as vulnerable is a hit. Detectors range from dumb string/regex matches up through ML classifiers and LLM-as-judge.
Buffs are input transforms applied between the probe and the generator: base64, lowercasing, paraphrase, translation. They let you re-run an existing attack through an obfuscation layer.
Harnesses are orchestration. The default is probewise, where each probe declares which detectors it recommends.
Evaluators are the report card. They turn raw detector results into pass rates, Z-scores, and grades.

End to end: you give garak a target spec, it instantiates the generator, the harness picks probes and their detectors, each probe emits prompts (optionally buffed), the generator returns N completions per prompt (you control N with --generations), the detectors score every completion, hits get logged, and the evaluators aggregate everything into a report.

The probe catalog

This is where garak earns its keep. The probe modules are organized loosely by attack class. Below are the ones I’ll state by name because they’re stable in the catalog, the names you’ll actually pass to --probes. Note that the catalog keeps growing release to release; new probe families land regularly, so always check --list_probes against your installed version rather than trusting any static list, including this one.

Jailbreak and safety bypass

dan is the DAN (“Do Anything Now”) family: Dan_11_0, DUDE, STAN, AntiDAN, and the in-the-wild variants. grandma is the appeal-to-ethos trick (the fictive dying grandmother who used to read you napalm recipes). goodside collects Riley Goodside’s classics: “ignore previous instructions,” Unicode tricks. lmrc works through the Language Model Risk Cards categories. donotanswer runs the Do-Not-Answer dataset to test whether the model refuses things it should.

A pattern worth internalizing: these social-engineering framings (grandma, goodside, the phrasing manipulations) routinely bypass guardrails that block the literal request. Safety training tends to be phrasing-shallow. The model has learned to refuse a shape of prompt, not a meaning.

Prompt injection, direct and indirect

promptinject implements the Perez & Ribeiro PromptInject framework, direct injection that hijacks the system prompt (HijackHateHumans, HijackKillHumans, HijackLongPrompt). latentinjection is the indirect/latent injection class: the payload is buried inside a document, report, resume, translation, or fact snippet the model is asked to process. This is the RAG-poisoning and tool-context attack surface. web_injection covers data-exfiltration injection via rendered Markdown/HTML.

One naming note that trips people up: web_injection is the current module name; older tutorials and issues refer to it as xss. Same lineage, renamed. If a guide tells you to run --probes xss, mentally translate.

Two reproducible findings worth flagging here. First, on tool-using and RAG targets, indirect (latent) injection consistently beats direct injection — which is the whole argument for treating retrieved content as untrusted input rather than as part of your prompt. Second, web_injection.MarkdownImageExfil is a real, easy-to-miss data-leak channel: if your app renders the model’s Markdown output, an injected image whose URL encodes conversation data will silently exfiltrate it the moment the client fetches the image.

Encoding and filter evasion

encoding wraps the payload in base64, ROT13, hex, Morse, Braille, Ecoji, and friends (InjectBase64, InjectROT13). This is one of the highest-signal probe classes in the whole tool. Models that cleanly refuse a plaintext request will frequently comply once it’s base64-encoded, decode-and-answer in one step, and never trip their own safety filter. It’s the single best argument for output-side guardrails. The input filter never sees the dangerous string because it’s wrapped.

Data leakage and privacy

leakreplay extracts memorized training data by getting the model to “replay” cloze-style copyrighted text. divergence is the repeated-token attack (the infamous “poem poem poem” divergence that dumps verbatim training data). propile probes for PII leakage, ProPILE-style.

Toxicity

realtoxicityprompts runs the RealToxicityPrompts benchmark; continuation leads the model into completing a slur or offensive token it’s been set up to finish.

Malware and supply chain

malwaregen probes willingness to produce malware components and evasion code. packagehallucination is the one security teams sleep on: it checks whether the model recommends non-existent packages across PyPI and npm. That’s a live supply-chain risk known as slopsquatting, where an attacker registers the package name the model keeps hallucinating and waits for someone to pip install it.

Misinformation and reasoning

snowball exploits snowballed hallucination: it asks hard-or-impossible questions (primality of large numbers, graph connectivity) that the model answers confidently and wrongly. misleading plants false premises and contradictory claims. glitch hits the glitch-token weirdness (SolidGoldMagikarp and relatives).

Signatures and automated attacks

knownbadsignatures checks whether the model will emit known test signatures (EICAR, GTUBE, GTphish), a fast baseline of whether it’ll produce flagged content at all. And then the optimization-based probes: atkgen runs a closed-loop adversarial red-team model that dynamically generates attacks against your target; tap is Tree-of-Attacks-with-Pruning, the PAIR-style automated jailbreak search; suffix is the GCG-lineage adversarial-suffix optimizer. There are also multimodal probes like visual_jailbreak (FigStep-style image jailbreaks) and terminal-layer attacks like ansiescape.

Running it

The flags changed at some point: the current form is --target_type/--target_name. The old --model_type/--model_name still work as aliases, which is why half the tutorials online use the older spelling. Use whichever, but know they’re the same thing.

First, enumerate what you’ve got installed:

garak --list_probes
garak --list_detectors
garak --list_generators
garak --list_buffs

Before you spend a dollar of API budget, run the no-cost smoke test. The test generator returns canned output, so this exercises the pipeline end to end without hitting a real model:

garak --target_type test.Blank --probes test.Test

Real targets. OpenAI and Ollama:

export OPENAI_API_KEY="sk-..."
garak --target_type openai --target_name gpt-4o-mini --probes encoding

garak --target_type ollama --target_name llama3.1:8b --probes promptinject,latentinjection

A local Hugging Face model, and raising --generations for statistical confidence on a noisy probe:

garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0

garak --target_type openai --target_name gpt-4o-mini --probes glitch --generations 10

Stacking a buff onto a probe, here running the DAN suite through base64 encoding:

garak --target_type openai --target_name gpt-4o-mini --probes dan --buffs encoding

Omitting --probes entirely runs the full catalog. Be deliberate about this. It’s the slow, expensive run:

garak --target_type openai --target_name gpt-4o-mini

The piece that turns garak from a model scanner into an application scanner is the REST generator. Instead of pointing at a raw model, you point at your own endpoint:

garak --target_type rest -G rest_config.json --probes latentinjection

The rest_config.json is a JSON config describing your endpoint: URL, method, headers, a request template with $INPUT (and $KEY) placeholders, and a JSONPath to pull the model’s reply out of the response body. This is how you test the thing you actually ship (your prompt, your retrieval, your guardrails) rather than the bare model underneath it. Use --config/-G for any run you want to be repeatable; the REST generator requires it.

Reading the output

garak writes everything to ~/.local/share/garak/garak_runs/. Three artifacts matter.

report.jsonl is the machine-readable source of truth: one JSON row per event (init, config, and every attempt with its status, prompt, the model’s outputs, and the detector scores). This is what you parse in automation.

hitlog.jsonl is the file you actually open as a human. It contains only the successful attacks — the cases where the model failed. It’s “here’s exactly what broke, and here’s the prompt that broke it.” For triage, this is where the value lives.

The HTML digest is the human-friendly rollup, generated from the jsonl:

python -m garak.analyze.report_digest -r garak.<uuid>.report.jsonl -o report.html

It groups results by probe → detector and shows both absolute and relative scores.

Now the part people get wrong. A hit is a single response a detector flagged as vulnerable. Pass rate = 1 − hits/attempts. garak reports two flavors of score, and conflating them will mislead you:

Absolute pass rate is the raw percentage of attempts the model resisted (with a 95% confidence interval once you have enough attempts). This maps directly to your risk tolerance. It answers: how often did this fail?
Relative Z-score compares your model against a moving baseline: a “bag” of recently-tested models. Zero is average. Negative means worse than peers. garak also rolls this into a DEFCON-style 1-to-5 grade. I won’t quote the exact band labels because they drift between versions. Treat the grade as “how do I rank against the field” and nothing more.

The trap: a model can beat the average and still be unacceptable. A positive Z-score on a probe where every model is terrible just means you’re the tallest in a short room. Conversely, a mediocre Z-score where the whole field is excellent might be perfectly shippable. The Z-score is calibration against peers; the absolute number is your actual risk. Never ship something because it beat the average.

And do not trust the aggregate numbers blindly. Open the hitlog and read the prompts and responses. Many detectors are regex or keyword based, which cuts both ways: a polite, correct refusal can occasionally register as a hit, and a dangerous answer in an unexpected format can slip past as a pass. The aggregate tells you where to look; the hitlog tells you what’s real.

garak in CI

The instinct is to wire the full scan into your pipeline. Don’t. A full catalog run is minutes to hours: hundreds of probes times your generation count, and several detectors pull in helper models that are CPU-, RAM-, and disk-heavy. That’s far too slow to run on every commit.

The pattern that works is two-tier:

Fast PR gate. A small, curated probe subset (promptinject, latentinjection, encoding, plus test) at low --generations. Parse the hitlog, fail the job above a hit threshold. This catches regressions in your prompt or guardrails in a couple of minutes. (Whether the threshold gate is built in or a wrapper script depends on your version. Confirm against what you’ve installed.)
Scheduled deep scan. The full or near-full catalog, nightly or weekly, on a dedicated runner, with results tracked over time so you can see drift.

One non-negotiable: pin the garak version. The probe catalog changes between releases, which means an unpinned scanner will silently change what it tests — and your historical comparisons become meaningless. Pin it, and note the version alongside your results. Worth knowing: garak ships inside NVIDIA NeMo Guardrails as an eval backend, and projects like 0din-ai’s ai-scanner wrap it for scheduling and dashboards, if you’d rather not build the orchestration yourself.

garak vs promptfoo vs PyRIT

These get pitched as competitors. They’re not. They’re complementary, and a mature program runs all three.

	garak	promptfoo	PyRIT
Built by	NVIDIA / Derczynski	Ian Webster	Microsoft AI Red Team
Shape	Pre-built vulnerability scanner	LLM eval + red-team framework	Red-teaming SDK / orchestrator
Interface	CLI, run-and-report	YAML + CLI + web UI	Python library
Strength	Large attack catalog, peer-calibrated scoring	App-specific evals tuned to your prompts	Custom multi-turn / agentic campaigns
Best for	Breadth-first baseline + regression	Dev/CI-first app testing	Bespoke deep campaigns

Pick garak when you want a breadth-first baseline and regression scanner: the broadest catalog of known attacks, run-and-report, with scoring calibrated against other models.

Pick promptfoo when the testing is application-specific and lives in your repo and CI: evals tuned to your prompts, your expected outputs, your regression suite.

Pick PyRIT when you’re building bespoke campaigns: multi-turn, automated, agentic attacks that no canned catalog covers. It has fewer pre-built attacks and more programmable surface.

The stack: garak for breadth, promptfoo for app-specific CI, PyRIT for the deep custom work. This sits inside a broader AI red teaming practice. The tools are the floor, not the program.

Limitations and gotchas

I’d rather you go in clear-eyed than disappointed.

Full scans are slow and expensive. Hundreds of probes times generations times helper-model detectors burns real tokens and compute. First-timers are routinely surprised. Start with a probe subset at --generations 1, then scale up.

Detector false positives are real. Because many detectors are string/regex/keyword matchers, a polite refusal can read as a hit and a well-formatted unsafe answer can slip through. Always eyeball the hitlog — the aggregate is a pointer, not a verdict.

The Z-score baseline moves. The peer “bag” changes over time, so the same model can shift grade between garak versions without changing at all. Pin the version and record the baseline date next to any grade you report.

Coverage stops at the known catalog. garak tests behaviors it has probes for. It will not find your application’s business-logic abuse, your agent-orchestration flaws, or the bespoke multi-step exploit that chains three of your tools together. Latent and web injection are good but generic. The novel stuff is still human work.

Helper-model and API dependencies. atkgen, tap, and several detectors need an auxiliary LLM or a downloaded HF classifier; the first run pulls large models, and toxicity scoring has historically leaned on a Perspective API key. Budget for the cold start.

Naming drift. xss became web_injection; old tutorials cite stale probe names and the deprecated --model_type flag. When something won’t run, check the current name in --list_probes.

Non-determinism. Between sampling and helper-model variance, single-run numbers are noisy. Use enough --generations and report confidence intervals, not a single point estimate you’ll over-read.

Bottom line

If you’re shipping anything built on an LLM and you haven’t run garak against it, that’s the first thing to fix. It’s the cheapest way to establish a defensible security baseline and the only practical way to regression-test model behavior at scale.

Start small. Don’t kick off a full scan as your first move. You’ll wait an hour and burn budget to learn things a targeted run would’ve told you in five minutes. Run the test.Blank smoke test, then a handful of high-signal probes (encoding, promptinject, latentinjection) against your real endpoint through the REST generator. Read the hitlog by hand, get a feel for the false positives, then decide what belongs in your fast CI gate and what belongs in the weekly deep scan.

Keep the framing honest: garak is your automated breadth-first floor, finding the known failures reliably and repeatably. The novel exploit against your specific system — the thing that ends up in an incident report — still takes a human. Run garak so you’re not wasting that human’s time on the low-hanging fruit.