pwnkit v0.4: shell-first pentesting, 23 XBOW flags, and the bug that broke everything
rebuilding pwnkit's agent architecture from structured tools to shell-first, cracking 23 XBOW benchmark challenges, and the serialization bug that was crashing the agent after 3 turns.
six weeks of work on the wrong thing. ten structured tools — crawl, submit_form, http_request, read_source, extract_links, the works. each one carefully typed, validated, documented. the agent had a purpose-built toolkit for web pentesting.
it couldn’t crack a basic IDOR in 20+ turns.
the shell-first discovery
the structured tools looked great on paper. crawl would spider a target. submit_form would POST data. http_request gave the agent full control over method, headers, body. ten tools covering every action a pentester might need.
the problem was cognitive overhead. the agent had to figure out which tool to use, how to format the parameters, and what the output schema meant — for every single action. it would crawl a page, parse the response, realize it needed to follow a redirect, switch to http_request, format the headers wrong, retry, get confused about cookie state, and spiral.
twenty turns in, it still hadn’t found the IDOR. it was too busy fighting the tool interface.
so the structured tools came out and one tool went in: bash. run any command. that’s it.
```
turn 1: curl -s http://target/api/users/1 | jq .
turn 2: curl -s http://target/api/users/2 | jq .
turn 3: # noticed the IDOR, different user data returned
turn 4: for i in $(seq 1 20); do curl -s http://target/api/users/$i | jq .id; done
```
ten turns. first try. flag extracted.
the insight is embarrassingly simple: the model already knows curl. it’s seen millions of curl commands in training data. there’s zero learning curve. one tool means zero tool-selection overhead. the agent just does the thing instead of figuring out how to ask a tool to do the thing.
bash stayed as the primary tool for the rest of v0.4. every benchmark improvement traces back to this decision.
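the whole tool surface fits in a few lines. here's a minimal sketch of what a single-bash-tool runtime can look like — the schema shape and the `runBash` name are illustrative, not pwnkit's actual code:

```typescript
import { execSync } from "node:child_process";

// illustrative single-tool schema (function-calling style);
// the exact shape pwnkit uses is not shown in this post
const bashTool = {
  type: "function",
  name: "bash",
  description: "run any shell command and return its output",
  parameters: {
    type: "object",
    properties: { command: { type: "string" } },
    required: ["command"],
  },
};

// execute the command and always return text, even on failure:
// the model reads the error and self-corrects instead of crashing the loop
function runBash(command: string, timeoutMs = 30_000): string {
  try {
    return execSync(command, { timeout: timeoutMs, encoding: "utf8" });
  } catch (err: any) {
    return `error: ${err.message}\n${err.stdout ?? ""}${err.stderr ?? ""}`;
  }
}
```

the error path matters as much as the happy path: returning failures as text keeps the agent loop alive, which is exactly the recovery behavior the structured tools kept breaking.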
XBOW benchmark results
XBOW is the gold standard for evaluating AI pentesting agents. 104 challenges across real vulnerability categories, each requiring actual exploitation — not just detection, but flag extraction.
pwnkit v0.4.2 extracted 23 flags across 13 vulnerability categories:
- injection: SQLi, blind SQLi, SSTI, command injection
- access control: IDOR, auth bypass, business logic flaws
- file-based: LFI, file upload, XXE, deserialization
- network: SSRF
- session: cookie manipulation
stacked against published results, 44.2% (23 of 52) is not a winning number. context matters.
KinoSec, XBOW, and Shannon have all been iterating on this benchmark for months. they use multi-agent architectures, custom tool libraries, and proprietary orchestration. pwnkit got to 44% in a few weeks with a single agent and a bash shell. the trajectory matters more than the snapshot, and the trajectory is steep.
the strong categories — SQLi, IDOR, SSTI, SSRF — are the ones where curl knowledge translates directly. the gaps are in challenges that need stateful multi-step exploitation: chained deserialization, complex auth flows, file upload + LFI combos. that’s where shell-first hits its limits and where the agent needs to get smarter about planning.
the responses API bug
this one still stings.
pwnkit was running against the XBOW benchmark on Azure OpenAI using the Responses API. every challenge crashed after turn 3. every single one. the agent would start strong — reconnaissance, initial probing, maybe a first payload — then just die.
two days of debugging. token limits. rate limiting. payload sizes. nothing made sense.
the bug: when the conversation history was serialized for the Responses API, assistant messages were sent as input_text instead of output_text. the API accepted this for the first few turns (lenient parsing), then Azure’s stricter validation kicked in and rejected the entire request.
```javascript
// before (broken)
{ type: "input_text", text: assistantMessage }
// after (fixed)
{ type: "output_text", text: assistantMessage }
```
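to show the fix in context, here's a hedged sketch of a history serializer — the `Turn` and `toResponsesInput` names are invented; only the `input_text`/`output_text` content types come from the Responses API:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// serialize conversation history into Responses API input items.
// the bug: every turn got "input_text", which stricter validation
// rejects for assistant messages once history grows past a few turns
function toResponsesInput(history: Turn[]) {
  return history.map((turn) => ({
    role: turn.role,
    content: [
      {
        // assistant output must round-trip as output_text
        type: turn.role === "assistant" ? "output_text" : "input_text",
        text: turn.text,
      },
    ],
  }));
}
```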
one line. the agent had been crashing on every challenge for the entire first week of benchmarking. every “zero flag” run, every “the agent can’t hack anything” session — it was this bug. the agent wasn’t failing at pentesting. it was failing at having a conversation.
the fix landed and the flag count jumped from 0 to 16 overnight. then the research-backed improvements pushed it to 23.
embarrassing? yes. this is what real development looks like. the most impactful fix in v0.4 was changing one string in a type field.
research-backed improvements
after the shell-first breakthrough and the API fix, six papers and projects shaped the rest of v0.4:
KinoSec showed the value of a planning phase. their agent doesn’t just start attacking — it first builds a mental model of the target, identifies likely vulnerability classes, and creates an attack plan. pwnkit gained a similar planning phase where the agent spends its first few turns on recon and hypothesis formation before throwing any payloads.
XBOW’s own paper documented how challenge hints (a sentence or two describing the vulnerability category) are standard practice in benchmarking. running without hints is like doing a CTF without reading the challenge description. adding hints brought pwnkit in line with how everyone else evaluates.
MAPTA and Cyber-AutoAgent both emphasized reflection — the agent periodically stepping back to assess what’s working and what isn’t. pwnkit gained reflection checkpoints at 60% of the turn budget. if the agent has used 24 of its 40 turns without a flag, it stops, reviews what it’s tried, and pivots strategy.
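the checkpoint trigger itself is simple to sketch. the numbers match the post (40-turn budget, checkpoint at 60%); the `shouldReflect` name is illustrative:

```typescript
// fire a reflection checkpoint once 60% of the turn budget is spent
// without a flag; with a 40-turn budget that's turn 24
function shouldReflect(turn: number, budget: number, flagFound: boolean): boolean {
  return !flagFound && turn >= Math.ceil(budget * 0.6);
}
```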
deadend-cli had a clever approach to detecting when an agent is stuck in a loop. their pattern of tracking repeated actions and forcing a strategy change after three consecutive similar attempts went straight into pwnkit.
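one way to implement that pattern — the `normalize`/`isStuck` names and the digit-collapsing similarity heuristic are mine, not deadend-cli's:

```typescript
// crude similarity: collapse digits so "/users/1" and "/users/2" match;
// a real implementation could use edit distance instead
function normalize(cmd: string): string {
  return cmd.replace(/\d+/g, "N").trim();
}

// stuck if the last three commands normalize to the same thing
function isStuck(recentCommands: string[]): boolean {
  if (recentCommands.length < 3) return false;
  const last3 = recentCommands.slice(-3).map(normalize);
  return last3.every((c) => c === last3[0]);
}
```

when `isStuck` returns true, the agent gets a forced prompt to change strategy instead of burning more turns on the same probe.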
Shannon demonstrated that turn budget matters more than people think. their best results came with generous budgets that let the agent explore. the turn limit went from 20 to 40, and several challenges that were previously timing out started succeeding.
the combined effect: planning + hints + reflection + larger budget + shell-first took the score from 16 flags (post-bug-fix baseline) to 23.
AI/LLM security benchmark
web vulns aren’t the only thing pwnkit needs to find. a custom benchmark of 10 AI/LLM security challenges covers prompt injection, jailbreak detection, system prompt extraction, PII leakage, and MCP-based SSRF.
pwnkit scored 10/10. 100%.
these challenges are closer to pwnkit’s core design — the agent understands AI systems because it is an AI system. it knows how prompt injection works because it has to defend against it. it knows how system prompts can be extracted because it has one.
this is the category where agentic security tools have a genuine structural advantage over traditional scanners. a regex-based tool can’t find a prompt injection. an AI agent can, because it can reason about what the prompt is trying to do.
infrastructure
v0.4 wasn’t just the agent. the scaffolding to support serious development came with it:
docs site at docs.pwnkit.com. proper documentation instead of a README that kept getting longer.
82 tests — 48 unit tests covering the core scanning pipeline, message serialization, tool dispatch, and result parsing. 34 integration tests that run actual agent sessions against test targets. the Responses API bug would have been caught by integration tests if they’d existed earlier. now they do.
Azure OpenAI runtime with full Responses API support. the runtime layer is abstracted so providers can be swapped, but Azure is the primary target for enterprise deployments.
blind PoC verification pipeline with structured verdict records. every finding goes through the double-blind verification process described in the previous post. the verdict records are machine-parseable JSON with confidence scores, data flow traces, and rejection reasons.
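based only on the fields the post names — confidence scores, data flow traces, rejection reasons — a verdict record might look like this. the field names here are guesses, not pwnkit's actual schema:

```typescript
// hypothetical verdict record shape
interface VerdictRecord {
  findingId: string;
  verdict: "confirmed" | "rejected";
  confidence: number;        // 0.0 to 1.0
  dataFlow: string[];        // source-to-sink trace, one hop per entry
  rejectionReason?: string;  // only set when verdict is "rejected"
}

const example: VerdictRecord = {
  findingId: "idor-api-users",
  verdict: "confirmed",
  confidence: 0.92,
  dataFlow: ["GET /api/users/2", "response contains another user's record"],
};
```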
what’s next
v0.5 is about closing the gap on XBOW. specific targets:
full CI benchmark run with all improvements integrated. right now challenges are run manually and tracked in a spreadsheet. automated runs need to report scores per category, track regressions, and flag improvements.
sub-agent spawning for complex exploit chains. the single-agent architecture hits a wall on challenges that need multiple phases — upload a shell, trigger deserialization, pivot to internal services. a coordinator agent that spawns specialized sub-agents for each phase should handle these better.
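a speculative shape for that coordinator — nothing here is implemented yet, and all names are invented:

```typescript
type Phase = { name: string; goal: string };

// run one sub-agent per phase, threading accumulated findings forward
// so each phase builds on what earlier phases discovered
async function runChain(
  phases: Phase[],
  runPhase: (phase: Phase, context: string) => Promise<string>,
): Promise<string> {
  let context = "";
  for (const phase of phases) {
    context += await runPhase(phase, context);
  }
  return context;
}
```

the sequencing is the easy part; the open question is what context each sub-agent actually needs, since dumping the full transcript into every phase defeats the point of splitting them.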
push toward higher scores. 23/52 is a start. the research says planning, reflection, and generous turn budgets are the biggest levers. basic versions of all three are implemented. the next step is tuning — when exactly should reflection trigger, how detailed should the plan be, what’s the optimal turn budget per category.
the shell-first insight was the biggest unlock in pwnkit’s history. it wasn’t planned. it was discovered by running out of ideas with the structured approach and trying something dumb. sometimes the best architecture is no architecture. just give the agent a shell and get out of its way.