blog
notes on AI security research, agentic pentesting, and the vulnerabilities i find along the way.
-
the attack surface XBOW and KinoSec don't test
traditional web vuln benchmarks miss the entire AI/LLM security attack surface. prompt injection, jailbreaks, MCP tool abuse -- none of it shows up in XBOW's 104 challenges.
-
we built benchmarks for everything pwnkit does
five benchmark suites across web pentesting, LLM security, LLM safety, npm auditing, and network pentesting. here's what we learned.
-
pwnkit v0.4: shell-first pentesting, 27 XBOW flags, and the bug that broke everything
rebuilding pwnkit's agent architecture from structured tools to shell-first, cracking 23 XBOW benchmark challenges, and the serialization bug that was crashing the agent after 3 turns.
-
why we gave our agent a terminal instead of tools
we built 10 structured tools for web pentesting. then we gave the agent just curl and it outperformed everything.
-
what we learned running pwnkit against 104 CTF challenges
29 flags, a serialization bug, a 770-line prompt that didn't help, and why the model matters more than the framework.
-
running pwnkit against the XBOW benchmark
XBOW has 104 Docker CTF challenges covering traditional web vulns. here's how pwnkit performs against it.
-
100% on our AI security benchmark
10 challenges. 10 flags extracted. zero false positives. how pwnkit's agentic pipeline handles prompt injection, jailbreaks, SSRF, and multi-turn escalation.
-
why i built blind verification
every security scanner drowns you in false positives. it took three approaches before one of them actually worked.
-
why i built pwnkit
from 7 CVEs and manual pentesting to autonomous AI agents that re-exploit every finding to kill false positives.
-
how ai agents found 7 CVEs in popular npm packages
three weeks of pointing Claude Opus at npm packages produced 73 findings, 7 published CVEs, and 40M+ weekly downloads affected. here's how the workflow actually works.
-
the age of agentic security
if AI agents can write 1,000 pull requests a week, AI agents should be testing 1,000 pull requests a week. the asymmetry is about to collapse.