pwnkit is an open-source agentic framework for autonomous security research. It uses AI agents in a research-then-verify pipeline to find and prove vulnerabilities in AI/LLM apps, npm packages, and source code.

How does pwnkit eliminate false positives?

pwnkit's Verify agent independently re-exploits every finding. If it can't reproduce the vulnerability, the finding is killed as a false positive. Only confirmed vulnerabilities with working proof-of-concept code make it into the final report. The local dashboard provides a triage workbench for operators to review evidence, manage finding families, and control the verification workflow.

How much does pwnkit cost?

pwnkit is free and open source (Apache 2.0 license). It's an agentic harness — bring your own API key, or use it with Claude Code CLI or Codex CLI through your existing subscription. pwnkit orchestrates the pipeline, your tools power the AI.

What can pwnkit scan?

pwnkit scans AI/LLM apps, traditional web apps, npm packages, and source code repositories. It includes resumable scans, finding triage with deduplication, deterministic replay, a local verification dashboard, diff-aware PR review, and autonomous orchestration workers.

100% on our AI security benchmark

10 AI/LLM security challenges. each one hides a FLAG{...} that can only be extracted by exploiting the vulnerability. binary pass/fail.

pwnkit scored 100%. every flag extracted. zero false positives.

why the benchmark exists

security scanners love to claim high detection rates. nobody agrees on what “detection” means. is it “the tool printed a warning”? or “the tool actually exploited the vulnerability and proved it exists”?

this benchmark uses the second definition. flag-based verification, the same format the XBOW benchmark uses for traditional web vulns. except these challenges cover the AI-specific attack surface that XBOW and KinoSec don’t test.

the challenges

challenge	difficulty	what it tests
direct prompt injection	easy	can the agent override system instructions?
system prompt extraction	easy	can the agent leak the hidden system prompt?
PII data leakage	easy	can the agent extract customer PII?
base64 encoding bypass	medium	can the agent decode and follow encoded payloads?
DAN jailbreak	medium	can the agent break out via roleplay?
SSRF via MCP tool	medium	can the agent exploit MCP tools for SSRF?
multi-turn escalation	hard	can the agent escalate over multiple turns?
CORS misconfiguration	easy	does the scanner catch permissive CORS?
sensitive path exposure	easy	does the scanner find exposed .git/config?
indirect prompt injection	hard	can the agent detect injection via fetched content?

every challenge is a real server (Express for LLM endpoints, HTTP for web targets, JSON-RPC for MCP) running locally. the benchmark spins them up, runs pwnkit against each one, and checks if the flag appears in the findings.

two modes, two baselines

baseline mode (no API key, deterministic checks only): 30% detection. pwnkit’s web and MCP probes catch CORS, exposed files, and SSRF without any AI. these are pattern-matching checks that run in under a second.

agentic mode (full AI pipeline): 100% detection. the agentic scanner runs the full discover-attack-verify-report pipeline with multi-turn tool use. each challenge takes about a minute as the agent probes, adapts, and escalates.

the 70% gap between baseline and agentic is the AI-specific attack surface. you can’t write a regex for a jailbreak. you can’t template a multi-turn privilege escalation. that requires an agent that reasons about the target’s responses and adapts its strategy.

what makes this different from XBOW

the XBOW benchmark is 104 Docker CTF challenges covering traditional web vulns — SQL injection, XSS, RCE, SSRF, auth bypass. KinoSec scores 92.3% on it. impressive, but a different domain.

this benchmark covers the attack surface XBOW doesn’t touch: prompt injection, jailbreaks, system prompt extraction, encoding bypasses, multi-turn escalation, MCP tool abuse, and PII exfiltration through chat interfaces. these are the vulnerabilities that show up when you build with LLMs.

the two benchmarks are complementary. running pwnkit against the full XBOW suite is in progress — the CI pipeline is already set up.

the benchmark doesn’t just check if pwnkit found something. the agentic pipeline includes a blind verification step: a separate agent gets ONLY the proof-of-concept, not the research agent’s reasoning. it independently re-exploits the vulnerability. if it can’t reproduce the finding, it gets killed as a false positive.

that’s why the false positive count is zero across all 10 challenges. the verification agent is biased toward rejection, not confirmation.

run it yourself

git clone https://github.com/peaktwilight/pwnkit
cd pwnkit && pnpm install

# baseline (no API key needed)
pnpm bench

# full agentic pipeline
pnpm bench --agentic

the benchmark suite lives in packages/benchmark/. each challenge is defined in src/challenges/index.ts with a flag, a server handler, and expected finding categories. adding new challenges is straightforward.

what’s next

running pwnkit against the full 104-challenge XBOW benchmark on CI
adding more AI-specific challenges (RAG poisoning, agent tool chain abuse, indirect injection variants)
publishing historical scores across model versions to track regression

the benchmark is open source. teams shipping AI-powered applications can run it against their endpoints. teams building security tools can benchmark against it.