2026-06-04 · 8 min read · agents, c, safety, language-design

What goes wrong when LLMs write C: the recurring bugs

The bug shapes we see over and over when an agent generates C, and which of them fastC structurally prevents, converts into a build-time trap, or simply makes louder.

Spend a week reading the C an LLM writes when you ask it nicely and the bug shapes stop feeling random. The same handful of failure modes show up across models, across prompt strategies, across model sizes. Some of them are bugs every C programmer writes occasionally; the agent just writes them at higher density because it does not feel the pain of UB the way a human does after their fifth segfault. Some of them are bugs no careful human writes but every model produces.

This is a short field guide to the shapes we keep seeing in our cross-language benchmark runs against GLM, Kimi, DeepSeek, and Qwen, with a shape-by-shape note on which of them fastC structurally prevents, which it converts into a compile-time or runtime trap, and which it just makes louder so a reviewer can find them.

1. Silent integer overflow

The single most common shape in our T5 large_sum benchmark (sum the integers 1 through 100000) is the agent writing int total = 0; and a loop, then handing you back a confidently wrong number: the true sum, 5,000,050,000, does not fit in a 32-bit int. Go silently wrapped in 3/3 of its runs. Rust did it in 2/3: Rust’s release-mode integer arithmetic wraps by default unless you explicitly opt into checked arithmetic, and the agent did not. The C programs got it right because the agent reflexively reached for long long, but that is a coin flip, not a guarantee; with a tighter time budget the same coin can land tails.

fastC’s stance on this is the contrarian one: signed overflow is a runtime trap by default. The cost is real: our fib(40) benchmark is ~26% slower than C with -O2, precisely because of the overflow checks. The payoff is that fastC ran the same T5 benchmark and produced zero silently-wrong binaries across all four models. It either compiled and computed correctly, or it trapped. There was no third outcome where the bytes looked plausible and the math was off by a billion.

This is also the case for “the contract is the documentation.” A function returning a sum has a natural @ensures(result >= 0) postcondition when the inputs are non-negative. The agent that writes total += i inside the body can attach the postcondition; if the body cannot satisfy it, the compiler complains. The reviewer is now reading the postcondition, not chasing the body.
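For readers who want to see the trap-instead-of-wrap idea in plain C, here is a minimal sketch. It is not fastC and not how fastC implements its checks. It assumes a GCC or Clang toolchain (for the `__builtin_add_overflow` builtin), and the function names `sum_to_i32` and `sum_to_i64` are made up for illustration. The `assert` calls stand in for the `result >= 0` postcondition discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sum 1..n in 32 bits. Returns false instead of wrapping when the
 * exact sum does not fit; *out is only written on success. */
bool sum_to_i32(int32_t n, int32_t *out)
{
    int32_t total = 0;
    for (int32_t i = 1; i <= n; i++) {
        /* GCC/Clang builtin: true means the exact result overflowed. */
        if (__builtin_add_overflow(total, i, &total)) {
            return false;
        }
    }
    assert(total >= 0); /* postcondition: a sum of non-negative terms */
    *out = total;
    return true;
}

/* Same loop in 64 bits. Assumes n is small enough that the sum fits,
 * which holds comfortably for the benchmark's n = 100000. */
int64_t sum_to_i64(int64_t n)
{
    int64_t total = 0;
    for (int64_t i = 1; i <= n; i++) {
        total += i;
    }
    assert(total >= 0); /* same postcondition, checked at runtime */
    return total;
}
```

With n = 100000 the 32-bit version reports failure rather than returning a plausible-looking wrong number, while the 64-bit version returns 5000050000. The point of the sketch is only that there are two outcomes, a correct value or a refusal, and no silent third one.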

2. Off-by-one on `strncpy` / `snprintf` / null terminators

The C standard library’s string functions are a museum of edge cases and the agent has memorized none of them. We see strncpy(dst, src, sizeof(src)) (wrong: should be sizeof(dst) - 1, then an explicit null terminator). We see snprintf(buf, sizeof(buf), "%s", input) with input not bounded and the return value never checked, so truncation goes unnoticed. We see the dance of “did I leave room for the null terminator?” answered wrong, in both directions.
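As a point of reference, the bounded copy the agent keeps getting wrong can be written once and reused. The following is a hedged sketch in plain C; `copy_str` is a hypothetical helper name, not a standard library function and not part of fastC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, which has room for dst_size bytes in total.
 * The result is always null-terminated when dst_size > 0.
 * Returns false if src had to be truncated or dst_size is 0. */
bool copy_str(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0) {
        return false; /* no room even for the terminator */
    }
    size_t src_len = strlen(src);
    /* Reserve one byte of the destination for the terminator. */
    size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n == src_len;
}
```

The bound comes from the destination's size, never the source's, and truncation is reported to the caller instead of being silently accepted.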

fastC’s stance: there is no strncpy. The Str type and the vec::push family on Vec[u8] handle length-prefix accounting. mod cli’s echo_cstr example in the source tree shows the canonical pattern — a usize walking pointer with a proven upper bound from the loop’s max, and an early return on the null byte. The bound is visible in the source; if you want to argue it is correct you argue with the bound, not with whether the agent remembered the trailing zero.

The agent benchmark wedge here is concrete: when we ship fastC code samples with the prompt — what we call the “cheatsheet” — the agent does not have to recall whether strncpy null-terminates (it does not, unless n > strlen(src)). It pattern-matches against echo_cstr or str::clone and reproduces the shape. We measured this. With an inaccurate cheatsheet our first-compile success was 0/9 on T1. With a cheatsheet built around a verified worked example plus a “common mistakes” inverse guide, the same models scored 12/12.

3. Ambient I/O reached for at the wrong layer

Ask an agent to write a “parse a config and return the parsed struct” function in C. There is a non-trivial chance it reaches for fopen() inside the parsing function — because the function “needs a config to parse,” and the function is named load_config, and the simplest path is to do the I/O right there. You have just lost separation of concerns; testing the parser now needs a temp file.

In fastC this is a type error. The parser’s signature does not declare CapFsRead, so fs::read is not callable from its body. The fix is structural: hoist the I/O to main, parse the bytes. The compiler made the architectural choice for you. This is the deepest part of the wedge — capabilities are not just a security mechanism, they are a layering mechanism. A function that does not declare a capability cannot accidentally take an I/O dependency.
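The layering described above can be imitated in plain C by convention, though nothing in C enforces it. A minimal sketch, assuming a made-up one-line config format `port=<number>` and hypothetical names `Config`, `parse_config`, and `load_config`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct { int port; } Config; /* hypothetical config shape */

/* Pure parser: sees only bytes, never a file handle.
 * Accepts "port=<1..65535>" with an optional trailing newline. */
bool parse_config(const char *text, size_t len, Config *out)
{
    static const char key[] = "port=";
    const size_t klen = sizeof key - 1;
    if (len <= klen || memcmp(text, key, klen) != 0) {
        return false;
    }
    long port = 0;
    size_t i = klen;
    for (; i < len && text[i] >= '0' && text[i] <= '9'; i++) {
        port = port * 10 + (text[i] - '0');
        if (port > 65535) {
            return false; /* out of range, stop before it can grow further */
        }
    }
    if (i == klen || port < 1) {
        return false; /* no digits, or port 0 */
    }
    if (i < len && !(text[i] == '\n' && i + 1 == len)) {
        return false; /* trailing garbage */
    }
    out->port = (int)port;
    return true;
}

/* The only function here that touches the filesystem. */
bool load_config(const char *path, Config *out)
{
    char buf[256];
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return false;
    }
    size_t n = fread(buf, 1, sizeof buf, f);
    bool read_ok = !ferror(f);
    fclose(f);
    return read_ok && parse_config(buf, n, out);
}
```

Tests can call `parse_config` on an in-memory string with no temp file. The difference from the capability approach is that in C an agent can still add an `fopen` to the parser and nothing will object.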

4. The `system()` injection that nobody catches

The fastest way to make an agent produce a remote-code-execution vulnerability is to ask it to “do X by shelling out to Y.” It will write system() with interpolated arguments and not escape them. It will write popen() the same way. Code review catches this when the reviewer has the time and the eye for it; code review does not catch it when the agent generated 800 lines yesterday and you are reading the diff during standup.

fastC’s stance: proc.spawn is a capability. Calling it requires a ref(CapProcSpawn) in the function signature. That signal is loud: a function that needs to spawn a process is carrying a capability for it, in its signature, where the reviewer cannot miss it. The function still has to escape its arguments correctly inside the body. The compiler has not solved escaping. What the compiler has done is move the attack surface from “any function in the codebase” to “the small set of functions that explicitly carry CapProcSpawn.”
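On the escaping half of the problem, the usual plain-C answer is to avoid the shell entirely and pass arguments as an array. The sketch below assumes a POSIX system; `run_argv` is a hypothetical helper name, and this is not a description of how proc.spawn works internally.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given NULL-terminated argument vector.
 * No shell is involved, so metacharacters in arguments stay literal.
 * Returns the child's exit status, 127 if exec failed, or -1 if the
 * child could not be started or did not exit normally. */
int run_argv(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached when exec fails */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

An argument such as `x; echo pwned` reaches the child as a single string rather than as a second command. That removes shell injection specifically; the callee can still misinterpret a hostile argument, so the review burden on the spawning function does not disappear.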

5. The use-after-free on the path you did not test

Agent-written C produces a lot of use-after-free, especially in error paths. The success path looks fine. The failure path frees ctx, then jumps to cleanup: which also frees ctx. Or it caches a pointer into a Vec and the next push reallocs. Or it returns a borrowed string from a function that owns the underlying buffer, and the caller’s pointer outlives the buffer.
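For the first of those shapes, the double free on the failure path, the conventional plain-C discipline is a single cleanup label plus a free function that nulls the caller's pointer. A minimal sketch, with `Ctx`, `ctx_free`, and `ctx_run` as hypothetical names chosen for illustration:

```c
#include <stdlib.h>

typedef struct { char *buf; } Ctx; /* hypothetical context type */

/* Free *p and null the caller's pointer, so a repeated call is a no-op. */
void ctx_free(Ctx **p)
{
    if (p == NULL || *p == NULL) {
        return;
    }
    free((*p)->buf);
    free(*p);
    *p = NULL;
}

/* Every path leaves through the cleanup label, so ctx is released
 * exactly once. fail_midway simulates an error after allocation. */
int ctx_run(size_t buf_size, int fail_midway)
{
    int rc = -1;
    Ctx *ctx = calloc(1, sizeof *ctx);
    if (ctx == NULL) {
        goto cleanup;
    }
    ctx->buf = malloc(buf_size);
    if (ctx->buf == NULL) {
        goto cleanup;
    }
    if (fail_midway) {
        goto cleanup; /* error path: no free here, cleanup owns it */
    }
    rc = 0;
cleanup:
    ctx_free(&ctx);
    return rc;
}
```

This is a convention, not a guarantee: it only helps with the pointer that was passed in, and any other copy of that pointer still dangles.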

fastC’s runtime is, frankly, less mature here than Rust’s borrow checker. We do not pretend otherwise. What fastC has is contracts. A function that returns a pointer can carry an @ensures that documents the borrow lifetime, and the SMT discharger can prove the easy cases. A function that frees can carry an @ensures(self.freed == true), and a stricter linter flags the second free. None of this is as airtight as Rust’s borrow checker. All of it is shipping today, and the runtime layer makes the violation a trap, not silent corruption.

The honest framing: if you need memory safety as a hard guarantee, Rust is still the answer. If you need memory safety as a strong default with structural escapes you can audit, fastC’s contracts plus the runtime trap layer get you most of the way, with a compile time that does not punish you for the choice.

6. The build-script payload

This one is not a bug an LLM writes by mistake. This is a bug an LLM writes by helpful imitation. Ask an agent to “add a dependency for X” and it will reach for the most-popular crate, which last week’s typosquat was banking on. The published Rust typosquat history — faster_log, async_println, evm-units, timeapis.io — is exactly the failure surface where the agent’s helpfulness is most dangerous. The agent picked the package the user named. The package executed code at build time. The .env got exfiltrated to a C2 server in Riga.

fastC removes the bug class. There is no build.rs. The package manifest is declarative. Dependencies are git URLs with commit + sha256 + cosign keyless signing. The agent can still pick the wrong dependency — pinning helps, signing helps, but if the human added a malicious URL the human added a malicious URL. The compiler will not silently fetch the dependency, run a payload, and then start compiling. The compiler will fail loudly when the hash does not match, and that is all.

What fastC does not fix

A non-exhaustive list, in honest order:

Logic bugs in the body. If the agent writes the wrong formula, the wrong formula will compile. Contracts only help if you wrote one. The agent can write contracts; humans review them. This is still labor.
Misuse of unsafe. fastC has unsafe blocks. They are smaller in surface area than Rust’s because the surrounding type system is smaller, but a determined or careless agent can defeat the system by reaching for unsafe. The lint flags it. The reviewer has to actually read it.
Concurrency bugs. Stage 2.3 brings async. Until then, fastC is synchronous and we do not try to be better than C on threading semantics. (We are better on capabilities — proc.spawn and net.listen need their caps — but pure data-race bugs in shared memory are a different threat model.)
Bad dependencies you chose to pin. Vendoring with a sha256 does not validate that the code at that hash is good. It only validates that you reviewed it once and it has not changed.

The shape of the bet

If you read the cross-language benchmark numbers in the README, fastC matches or beats every other language on first-compile success in T1, sits in the C / Zig class on binary size, and produces zero silently-wrong outputs on the safety wedge. That is what the language-design choices are for. The bet is that those three properties together change the economics of code review when an agent is the modal author. Reviewing a fastC diff is reviewing signatures, not bodies. Signatures fit on a screen. Bodies do not.

If you want the language-design rationale at length, docs/MANIFESTO.md in the source tree is the longest version of the argument. The version above is the one we tell when someone hands us a stack of bug reports from their AI-pair-programming pilot and asks what is going wrong.

Most of it is the same six shapes. Most of the fixes are structural.

Comments? Issues? Disagreements? Open an issue at github.com/Skelf-Research/fastc/issues.
