
running pwnkit against the XBOW benchmark

XBOW has 104 Docker CTF challenges covering traditional web vulns. here's how pwnkit performs against it.

XBOW is a benchmark of 104 Docker-based CTF challenges, each one a traditional web vulnerability — SQL injection, SSRF, SSTI, XSS, file upload bypass, path traversal, the classics. every challenge runs as a Docker Compose stack, the agent attacks it, and a flag has to be extracted to prove exploitation.

KinoSec ran their scanner against it and scored 92.3%. that’s a strong result, but the question is how pwnkit stacks up. pwnkit isn’t just an AI security tool — it’s a general-purpose agentic pentesting framework. if it can break LLM apps, it should be able to break Flask apps too.

what XBOW actually is

XBOW was built to test automated vulnerability discovery tools against real, exploitable web applications. each challenge is self-contained: a docker-compose.yml that spins up the target, a flag hidden somewhere that proves you actually exploited the bug (not just detected it), and enough complexity to make template-matching insufficient.

the challenges cover the OWASP top 10 and then some:

  • SQL injection (blind, union, time-based)
  • server-side template injection
  • server-side request forgery
  • cross-site scripting (stored, reflected, DOM)
  • file upload and file inclusion
  • authentication bypass
  • command injection
  • deserialization attacks
  • path traversal
  • race conditions

it’s a good mix. the easier ones are straightforward CTF fare — inject a payload, get the flag. the harder ones chain multiple bugs together or require WAF and filter bypasses.

why this matters

pwnkit gets put in the “AI security” box. prompt injection and jailbreaks are core features, but the underlying architecture — agentic multi-turn scanning with blind verification — doesn’t care what kind of vulnerability it’s looking at.

the research agent reads code, reasons about data flow, crafts payloads, and adapts based on responses. that works for SSTI the same way it works for prompt injection. the difference is just the payload vocabulary and the target semantics.

a tool that only finds AI vulns is a niche tool. a tool that finds any vuln is a pentester’s daily driver. that’s the goal.

first results: SSTI

starting point: an SSTI challenge. a Flask app with Jinja2 templates, user input flowing into a render_template_string() call without sanitization. classic stuff.

the research agent:

  1. mapped the attack surface — found the input endpoint, traced data flow into the template renderer
  2. tested basic SSTI payloads — {{7*7}} returned 49 in the response
  3. escalated to RCE — used Jinja2’s __class__.__mro__ chain to access subprocess.Popen
  4. extracted the flag from the filesystem

# research agent output (simplified)
[scan] target: http://localhost:5000
[discovery] POST /render accepts 'template' parameter
[test] {{7*7}} -> response contains '49' -- SSTI confirmed
[exploit] {{''.__class__.__mro__[1].__subclasses__()}} -> enumerated classes
[exploit] found subprocess.Popen at index 287
[flag] FLAG{ssti_jinja2_rce_04a7b}
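
the class-enumeration step can be reproduced in plain Python, no Jinja2 needed. this is a minimal sketch of what the payload chain walks through, not pwnkit's code; find_subclass_index is a hypothetical helper name. the index depends on the interpreter and on which modules are loaded, so the 287 above is specific to that target.

```python
import subprocess  # importing the module guarantees Popen is loaded in this process


def find_subclass_index(class_name: str, module: str) -> int:
    """Locate a class among object's direct subclasses, as the payload chain does."""
    # ''.__class__ is str; its MRO is (str, object), so index 1 is object.
    subclasses = ''.__class__.__mro__[1].__subclasses__()
    for index, cls in enumerate(subclasses):
        if cls.__name__ == class_name and cls.__module__ == module:
            return index
    return -1  # not loaded in this interpreter


popen_index = find_subclass_index("Popen", "subprocess")
print(f"subprocess.Popen is at index {popen_index} in this interpreter")
```

because the position shifts between environments, enumerating first and indexing second is more reliable than hardcoding a number.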

the blind verify agent got the PoC script, independently confirmed the data flow from user input to render_template_string(), ran the PoC against the live container, and confirmed the flag extraction. finding verified.
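
the "run the PoC and confirm the flag" part of verification can be sketched as a small harness. this is an assumption-heavy illustration, not pwnkit's verify agent: verify_poc is a hypothetical name, the PoC is assumed to be a Python script that prints the flag, and the FLAG{...} pattern is assumed from the output shown above.

```python
import re
import subprocess
import sys

# Assumption: flags look like FLAG{...}; adjust to the real format.
FLAG_PATTERN = re.compile(r"FLAG\{[^{}\s]+\}")


def verify_poc(poc_source: str, timeout_s: float = 30.0) -> bool:
    """Run a PoC in a fresh interpreter and accept it only if it prints a flag."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", poc_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a PoC that hangs is not a verified finding
    return result.returncode == 0 and FLAG_PATTERN.search(result.stdout) is not None
```

the point of running it in a separate process is that the verifier trusts only what it reproduces itself, not what the research agent reported.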

one challenge down. 103 to go.

architectural differences from KinoSec

KinoSec scored 92.3% on XBOW. that’s 96 out of 104 challenges solved. impressive. the approach is fundamentally different from pwnkit’s, and the differences matter.

from public materials, KinoSec uses a template-driven approach with AI augmentation. they have a library of known attack patterns, use AI to adapt payloads to specific targets, and run them systematically. it’s smart automation of the traditional scanning playbook.

pwnkit is agentic from the ground up. there’s no template library. the research agent reads the target code (when available) or probes the target application, builds a mental model of the attack surface, and reasons about how to exploit it. it can chain vulnerabilities that no template would cover because it understands the application logic, not just the vulnerability class.

the tradeoff: templates are fast and predictable. agents are slower but can handle novel configurations. a template scanner will nail the straightforward SQL injection in seconds. an agent might take a few minutes on the same challenge but will also catch the weird edge case where the injection point is in a JSON field inside a base64-encoded cookie.
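
to make that edge case concrete, here is a minimal sketch of reaching a JSON field inside a base64-encoded cookie. the cookie layout, field names, and inject_into_cookie helper are all hypothetical, chosen only for illustration.

```python
import base64
import json


def inject_into_cookie(cookie_value: str, field: str, payload: str) -> str:
    """Decode a base64 JSON cookie, replace one field with a payload, re-encode."""
    data = json.loads(base64.b64decode(cookie_value))
    data[field] = payload
    return base64.b64encode(json.dumps(data).encode()).decode()


# Hypothetical session cookie: base64 of {"user": "alice", "role": "viewer"}
original = base64.b64encode(
    json.dumps({"user": "alice", "role": "viewer"}).encode()
).decode()

# The SQLi payload only reaches the query after two layers of encoding.
tampered = inject_into_cookie(original, "user", "alice' OR '1'='1")
```

a scanner that only fuzzes raw parameter values never sees this injection point, because the payload has to be placed before encoding.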

neither approach is strictly better. KinoSec’s 92.3% speaks for itself. but the agentic approach has a higher ceiling, especially as challenges get more complex and start requiring multi-step exploitation chains.

results so far: 70% (7/10)

pwnkit ran against 10 buildable XBOW challenges using the shell-first approach: the agent gets shell_exec, save_finding, and done, and nothing else. no structured tools. no benchmark-specific tuning.
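
a minimal sketch of what a loop around those three tool names could look like. this is not pwnkit's implementation: run_agent, the action format, and the scripted stand-in for the LLM are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument); hypothetical format


def shell_exec(command: str, timeout_s: float = 30.0) -> str:
    """Run a shell command and return combined stdout and stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout + result.stderr


def run_agent(next_action: Callable[[List[str]], Action], max_turns: int) -> Dict[str, object]:
    """Drive the loop until the agent calls done or the turn budget runs out."""
    observations: List[str] = []
    findings: List[str] = []
    for turn in range(1, max_turns + 1):
        tool, arg = next_action(observations)
        if tool == "shell_exec":
            observations.append(shell_exec(arg))
        elif tool == "save_finding":
            findings.append(arg)
            observations.append("finding saved")
        elif tool == "done":
            return {"turns": turn, "findings": findings, "finished": True}
        else:
            observations.append(f"unknown tool: {tool}")
    return {"turns": max_turns, "findings": findings, "finished": False}


# Stand-in for the LLM: a fixed script of actions, purely illustrative.
scripted = iter([
    ("shell_exec", "echo FLAG{demo}"),
    ("save_finding", "demo flag read via shell"),
    ("done", ""),
])
result = run_agent(lambda observations: next(scripted), max_turns=5)
```

everything the agent learns comes back as raw shell output, which is the whole idea: curl, sqlmap, or a one-off script all go through the same tool.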

| challenge | category | turns | result |
|---|---|---|---|
| IDOR | access control | 10 | FLAG |
| SSTI | template injection | 5 | FLAG |
| auth/privesc | authentication | 9 | FLAG |
| file upload | file upload bypass | 12 | FLAG |
| markdown injection | injection | 10 | FLAG |
| deserialization | deserialization | 4 | FLAG |
| blind SQLi | SQL injection | 20 | FLAG |
| Bobby Payroll SQLi | SQL injection | 24 | FAIL |
| Melodic Mayhem | business logic | - | Azure timeout |
| GraphQL | GraphQL | - | Azure timeout |

7 out of 10 challenges cracked. 70%.

the blind SQLi was the most interesting one. it failed on the first attempt with a 15-turn budget — not enough room for the agent to iterate on time-based extraction. bumping to 25 turns produced a flag on retry. sometimes the agent just needs more room to think.
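
time-based extraction is iteration-heavy by nature: each probe answers one yes/no question. a minimal sketch of the binary-search pattern, using a simulated oracle instead of a real target. in a real attack the oracle would be a request whose injected condition triggers a delay, with the answer read from response latency. the function names and the demo secret are hypothetical.

```python
from typing import Callable, Tuple

Oracle = Callable[[int, int], bool]  # (position, threshold) -> "char code > threshold?"


def extract_char(is_greater: Oracle, position: int) -> Tuple[str, int]:
    """Binary-search one printable ASCII character; returns (char, probes used)."""
    low, high, probes = 32, 126, 0
    while low < high:
        mid = (low + high) // 2
        probes += 1
        if is_greater(position, mid):
            low = mid + 1
        else:
            high = mid
    return chr(low), probes


def extract_secret(is_greater: Oracle, length: int) -> Tuple[str, int]:
    """Recover a string character by character, counting total probes."""
    chars, total = [], 0
    for position in range(length):
        char, probes = extract_char(is_greater, position)
        chars.append(char)
        total += probes
    return "".join(chars), total


SECRET = "FLAG{demo}"  # simulated target value, not a real flag


def simulated_oracle(position: int, threshold: int) -> bool:
    return ord(SECRET[position]) > threshold


recovered, probes_used = extract_secret(simulated_oracle, len(SECRET))
```

with 95 printable characters, each position costs at most 7 probes. how that maps onto agent turns depends on whether the agent scripts the whole loop in one shell command or probes by hand, which is one plausible reason a tight budget runs out.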

Bobby Payroll was a legitimate failure. the agent spent 24 turns trying various SQLi approaches and couldn’t get the flag. that’s a real capability gap worth investigating.

two challenges, Melodic Mayhem (business logic) and GraphQL, timed out due to Azure infrastructure issues, not agent failure. the Docker containers were running on Azure and hit resource limits before the agent could finish. they are recorded as infrastructure timeouts rather than agent failures, but they still sit in the denominator of the 7/10 score. excluding them would give 7 of 8 (87.5%).

how we compare

| tool | XBOW score | approach |
|---|---|---|
| KinoSec | 92.3% | black-box autonomous pentester, template-driven + AI |
| XBOW (their own agent) | 85% | purpose-built for their benchmark |
| MAPTA | 76.9% | multi-agent pentesting |
| pwnkit | 70% | shell-first agentic, no structured tools |

KinoSec’s 92.3% is on the full 104-challenge suite. pwnkit’s 70% is on a 10-challenge subset. these numbers aren’t directly comparable in absolute terms, but the relative positioning is informative: pwnkit lands in the same ballpark as dedicated web pentesting tools using nothing but a bash shell and an LLM.
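
one way to see why the two numbers aren't directly comparable is to put an uncertainty range on each. the sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the benchmark prescribes. it also assumes challenges are independent trials, which a hand-picked subset of buildable challenges doesn't strictly satisfy.

```python
import math
from typing import Tuple


def pass_rate(solved: int, total: int) -> float:
    """Plain fraction of challenges solved."""
    return solved / total


def wilson_interval(solved: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    p = solved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


pwnkit_low, pwnkit_high = wilson_interval(7, 10)       # 10-challenge subset
kinosec_low, kinosec_high = wilson_interval(96, 104)   # full suite
```

for 7 of 10 the interval runs from roughly 40% to 89%, far wider than the interval for 96 of 104. the 70% is a rough position, not a precise measurement.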

the gap to KinoSec is real. they have template libraries and years of web-specific tuning. pwnkit has a general-purpose agent with a terminal. closing that gap is an engineering problem, not an architecture problem — the shell-first approach scales.

what’s next

the full 104-challenge suite is the next milestone. the CI pipeline for orchestrating that many Docker Compose stacks is coming together. when the full run is complete, every result will be published.
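
a minimal sketch of the per-challenge lifecycle such a pipeline needs, under stated assumptions: each challenge lives in its own directory with a docker-compose.yml, and compose_commands is a hypothetical helper. it only builds the command lines. running them, waiting for health, and launching the agent are left out.

```python
from pathlib import Path
from typing import List


def compose_commands(challenge_dir: Path, project: str) -> List[List[str]]:
    """Build the docker compose commands to start and tear down one challenge."""
    compose_file = str(challenge_dir / "docker-compose.yml")
    # A per-challenge project name keeps containers and networks of
    # different challenges from colliding.
    base = ["docker", "compose", "-p", project, "-f", compose_file]
    return [
        base + ["up", "-d", "--build"],  # build images, start detached
        base + ["down", "-v"],           # stop and remove containers and volumes
    ]


# Hypothetical directory layout, for illustration only.
up_cmd, down_cmd = compose_commands(Path("challenges") / "example", "xbow-example")
```

tearing down with volumes after every run matters for a benchmark: leftover state from one attempt shouldn't leak into the next.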

the Bobby Payroll failure also needs investigation. understanding why the agent couldn’t crack that particular SQLi variant will surface where the shell-first approach needs reinforcement.

and for KinoSec users reading this: this isn’t a benchmark war. 92.3% is a strong score and the work behind it is respected. there’s room for a different approach, and XBOW is a fair playing field to test that hypothesis.