Vibe coding stops working as soon as you want to keep what you generated. After a few hundred lines, the agent contradicts decisions you thought were settled, the chat history is no longer the source of truth, and any non-trivial change risks rewriting things you already agreed on.
The responses to this tend to split in two directions. One piles more rules into AGENTS.md, adds more skill files and longer instruction preambles, and hopes the model will read and follow them on every turn, which is essentially asking the model to be both the worker and its own inspector. The other treats the spec as the real artifact, the agent as a generator that runs against it, and the surrounding system as deterministic software rather than another prompt. I am not going to get into it here, but a fair amount of friction in this area comes from continuing to model the agent as a colleague to be instructed and reasoned with, rather than as a generator with fairly predictable failure modes that needs a system built around it.
I started sketching Ossature at the beginning of this year on that second view, and a lot of what has been published since on agentic coding has been moving toward similar conclusions. Stripe's Minions ship through pre-defined blueprints and pre-push hooks rather than open prompts, the OpenAI harness engineering team describes their work as "designing environments, specifying intent, and building feedback loops" rather than writing code, and Sean Grove has been arguing that the spec, not the code, is the artifact worth keeping.
All of this puts more weight on the spec itself. The gap between what you describe in natural language and what a program actually does has always existed in software, and AI generation widens it. As Shuvendu Lahiri at Microsoft Research recently put it, AI-generated code "amplifies it to an unprecedented scale." Test coverage does not save you from a spec that was vague about what the right thing was in the first place. Writing a spec is mostly the work of being precise about intent in places where you would otherwise rely on the agent (or your future self, or a colleague) to guess.
## An example
`yep` is a `yes(1)` clone in C99. It prints a string to stdout in a loop until the pipe closes. The whole project fits in one spec file:
```markdown
---
id: YEP
status: draft
priority: low
depends: []
---

# Yep

## Overview

A command-line utility that repeatedly outputs a line of text until
killed or the output pipe closes. Compatible with GNU coreutils `yes`
behavior.
```
That part is easy. The rest is what the agent actually generates from.
## Be concrete
The most common failure mode in early specs is leaving room for the agent to interpret. "Handles invalid input gracefully" is a wish, not a requirement. The model will resolve it however the easiest tokens come out, and a different model on a different day will resolve it differently. Simon Willison describes the right instinct for production work: "I treat it like a digital intern, hired to type code for me based on my detailed instructions."
How yep describes its main loop:
```markdown
### Output Loop

Continuously print a line to stdout until the process is terminated
or stdout closes (e.g., the pipe consumer exits).

**Accepts:** Zero or more string arguments.

**Returns:**

- If no arguments: prints `y` followed by a newline, repeatedly.
- If one or more arguments: prints all arguments joined by a single
  space, followed by a newline, repeatedly.

**Errors:**

- Write to stdout fails because the pipe consumer has closed
  (broken pipe) -> exit silently with code 0. The implementation
  MUST suppress the broken-pipe signal so the failure is observed
  as a write error rather than terminating the process via signal.
- Write to stdout fails for any other reason -> print an error
  message to stderr and exit with a non-zero code (1), consistent
  with GNU coreutils `yes`.
```
Nothing here is implicit. Arguments join with a single space. There is a newline. The broken-pipe case is named, and the implementation is told to suppress SIGPIPE rather than letting the kernel terminate the process. None of this is exotic. It is the kind of detail a careful programmer would think through, written down before code generation rather than discovered while debugging.
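For concreteness, here is roughly the loop this wording pins down. A sketch of my own covering the no-argument case, assuming POSIX `write(2)`, not the generated implementation:

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Suppress SIGPIPE so a closed pipe surfaces as a failed write
       with errno == EPIPE, instead of the kernel killing the process. */
    signal(SIGPIPE, SIG_IGN);

    static const char line[] = "y\n";
    for (;;) {
        if (write(STDOUT_FILENO, line, sizeof line - 1) < 0) {
            if (errno == EPIPE)
                return 0;   /* broken pipe: exit silently with code 0 */
            fprintf(stderr, "yep: %s\n", strerror(errno));
            return 1;       /* any other write failure: non-zero exit */
        }
    }
}
```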
For each requirement, force yourself to answer three things. What comes in. What goes out. What happens when something goes wrong. If you cannot answer the third one, you have not finished thinking about the requirement.
## Errors are part of the surface
Errors are where most generated code drifts. The agent picks something (a return code, an exception, a printed message), and the choice does not necessarily match anything else in the system. Naming each error case in the spec is the cheapest way to keep that consistent.
Below the per-requirement errors, yep reinforces global behaviour in Constraints:
```markdown
## Constraints

- No third-party runtime dependencies.
- `--help` and `--version` are only recognized when each is the sole
  argument; otherwise all arguments are treated as literal strings to
  output.
- Exit code is 0 under normal operation and on broken-pipe termination;
  non-zero on other write errors.
```
If a rule applies to one requirement, write it inside that requirement. If it spans many, move it into Constraints. This is also where you can put the things you would otherwise place in an AGENTS.md and hope the agent remembered.
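To see the sole-argument constraint doing real work: only a lone `--help` or `--version` is special, so `yep --help world` must print `--help world` forever. Here is a sketch of my own of what that implies; the help text and version string are invented:

```c
#include <stdio.h>
#include <string.h>

/* A flag counts only when it is the sole argument. */
static int sole_flag(int argc, char **argv, const char *flag) {
    return argc == 2 && strcmp(argv[1], flag) == 0;
}

int main(int argc, char **argv) {
    if (sole_flag(argc, argv, "--help")) {
        puts("usage: yep [STRING]...");   /* invented help text */
        return 0;
    }
    if (sole_flag(argc, argv, "--version")) {
        puts("yep 0.1");                  /* invented version string */
        return 0;
    }
    /* Otherwise every argument, flags included, is literal output.
       The real program loops; one iteration shown here, and the
       no-argument `y` case is omitted. */
    for (int i = 1; i < argc; i++)
        printf("%s%s", i > 1 ? " " : "", argv[i]);
    if (argc > 1)
        putchar('\n');
    return 0;
}
```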
## Non-goals
The instinct when writing a spec is to list everything you want. The more useful instinct is to also list what you have decided not to do. Without that, the agent will add things on its own (a flag-parsing library, a `--quiet` mode, locale handling), and you will spend the rest of the project deleting them.
```markdown
## Non-Goals

- Throughput optimization (e.g., syscall batching)
- Locale or internationalization support
```
Real `yes(1)` is heavily optimised for throughput by buffering output in large chunks before writing. If you do not want that complexity, say so. Otherwise the model sometimes adds it and sometimes does not, and you never know which version you ended up with.
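For reference, the batching being declined looks roughly like this: fill one buffer with many copies of the line, then emit the whole buffer per syscall. A sketch of the technique, not GNU's actual code:

```c
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Pack as many copies of "y\n" as fit into one 8 KB buffer. */
    char buf[8192];
    size_t used = 0;
    while (used + 2 <= sizeof buf) {
        memcpy(buf + used, "y\n", 2);
        used += 2;
    }
    /* One write(2) now emits 4096 lines. Error handling from the
       earlier sketch omitted for brevity. */
    for (;;)
        if (write(STDOUT_FILENO, buf, used) < 0)
            return 1;
}
```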
## Examples
Prose is ambiguous in ways that example input and output rarely are.
````markdown
### Multiple Arguments

**Input:**

```plaintext
yep hello world | head -3
```

**Output:**

```plaintext
hello world
hello world
hello world
```
````
These are not test fixtures. They exist to disambiguate the prose. The example shows that "joined by a single space" means no leading or trailing space, and the newline placement is what you would expect. If your prose and your example disagree, the audit will catch it. It is much cheaper for you to catch it while writing.
## When you also need an AMD
yep only has an SMD, the spec file shown above. The build is a Makefile, the program is a single `.c` file, and the audit can infer the file layout from the requirements without losing anything important.
For larger projects, you also want an AMD (Architecture Markdown). The SMD describes what the system should do. The AMD describes how it is laid out. Which files exist, what their interfaces look like, how the components depend on each other. You want one when:
- The project has more than one module and you have an opinion about how they should be split.
- You want to commit to specific public function signatures, so multiple specs can build against the same interface.
- You are working on a multi-spec project where one spec needs the public surface of another, header-file style, without seeing the implementation (see the sketch after this list).
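As a sketch of that last point, an interface commitment can be as small as one signature. The name and contract below are hypothetical, invented for illustration; the real format is in the AMD reference:

```c
/* Hypothetical public surface one spec might pin in an AMD so that
   other specs can build against it without seeing the implementation. */
#include <stddef.h>

/* Repeatedly writes `line` (plus a trailing newline) to `fd` until the
   consumer closes the pipe. Returns 0 on broken pipe, -1 on any other
   write error. */
int repeat_line(int fd, const char *line, size_t len);
```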
If you skip the AMD, the audit decides the file structure on its own. That is fine for small things. For anything where module boundaries matter, an AMD removes a class of "why did the agent put it there" questions. The format is in the AMD reference.
A working heuristic: if you would draw a box-and-arrow diagram on a whiteboard before starting, write an AMD. If you would not, you probably do not need one yet.
## Run the audit before generating
Once a spec is written, run it through the audit before generating any code. The audit reads the spec the way the build will read it later, and flags ambiguities you missed.
When I audited yep, the report came back without errors and surfaced two warnings, both about the same gap:
> The global constraint says broken-pipe termination exits with code 0, but the Help Option error rule says any stdout write failure exits with code 1. A closed pipe while printing help (for example, `yep --help | head -c0`) is both a stdout write failure and a broken pipe, so two implementations could produce different exit statuses.
The global rule for broken pipes and the per-requirement rule for write failures overlap on the boundary case `yep --help | head -c0`. I had not noticed. The audit found it in seconds.
You do not have to fix everything the audit warns about. Some are real ambiguities; others are choices the agent can resolve at build time. The point is that they are surfaced before generation, while the cost of fixing them is still an edit to one file.
## Level of detail
Specs that are too thin produce code full of guesses. Specs that try to predict every line of generated code stop being specs and become a slower way of writing the program by hand. The middle ground is mostly judgment. As a working heuristic, write down anything you would catch in code review.
For yep, that includes the broken-pipe semantics (because most ad-hoc implementations get it wrong), the exact help/version invocation rules, and the Makefile targets. It does not include "use a for loop" or "the buffer should be 4KB." Those are implementation choices the agent is generally fine to make.
If you find yourself writing pseudo-code in the SMD, that is a sign you should either move it into the AMD as an interface signature, or admit you would rather just write the code by hand for that part. Both are fine answers. Mixing them inside an SMD is not.
## The spec is what you maintain
Once a project is generated, the temptation is to start editing the code directly. Try not to. The spec is the file you change when requirements change. Ossature hashes every input to every task, so a small spec edit only re-runs the tasks that actually depend on it. Iteration stays cheap in both time and tokens. That, more than anything, is the practical reason for putting a harness around the model in the first place rather than dumping more rules into AGENTS.md or skill files. The harness keeps the deterministic parts deterministic and asks the model to do only the thing it is good at, generating against a tightly scoped prompt. The model is not asked to remember conventions across turns or to check its own output.
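The hashing itself is nothing exotic. A toy version of the idea, not Ossature's actual code, with a made-up input string:

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a over a task's concatenated inputs (spec sections, prompt
   template, model id). Any byte change produces a different hash. */
static uint64_t fnv1a(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    const char *inputs = "spec:YEP/output-loop|prompt:v3|model:m1";
    uint64_t cached = 0;                 /* stored from the last run */
    if (fnv1a(inputs) == cached)
        puts("inputs unchanged: reuse previous output");
    else
        puts("inputs changed: re-run this task only");
    return 0;
}
```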
After enough years of typing into source files, it feels strange that the source file is no longer the primary artifact. The alternative is the path vibe coding leads to, a generated codebase nobody can confidently modify because the decisions that produced it have evaporated into a chat log somewhere.
The format is the easy part. Deciding what to put in it is the engineering. For SMD frontmatter fields, required sections, and validator rules, see the SMD reference.