TDD by Construction: How the Vimbus Loop Validates Its Own Code at Scale

I run an autonomous loop that generates production code one task at a time — a fresh agent per task, a module's worth of work in a night. At that throughput the constraint stops being whether code gets written and becomes whether it's trustworthy. A passing test suite is necessary and far from sufficient: every pnpm check can come back green while the frontend was never opened in a browser and the backend has drifted off its architecture.

What makes the output trustworthy is not clever code in any one place. It's a validation layer wrapped around the loop, with gates at three altitudes — the task before an agent runs, the code before a checkbox flips, and the codebase before anything is called shippable. This is how that layer is built, and which failure mode each gate is designed to stop.

A task is a typed contract, not a sentence

Every unit of work lives in a loop-state/tasks.md file as a checkbox item with typed sub-fields. The driver parses it line by line, picks the first unchecked item, and builds a five-phase prompt from the fields. Here's a real backend task, lightly trimmed:

- [ ] A1 — Verify the demand-events schema, dedup recipe, and dayBucket
  - Id: class-demand-tracking-a-1-schema-and-dedup-verify
  - TaskType: backend
  - Complexity: high
  - Epic: A
  - Design: design/02-event-model.md
  - State: state/locked-decisions.md
  - Skill: database-design-mongodb
  - Files:
    - apps/api/src/demand/infrastructure/persistence/mongoose/demand-event.schema.ts
    - apps/api/src/demand/application/demand-dedup.ts
    - apps/api/src/demand/application/demand-dedup.spec.ts
  - TestCmd: pnpm test -- --testPathPattern=demand-dedup
  - Acceptance: pnpm test -- --testPathPattern=demand exits 0 with at least one
    test per branch of the dedup recipe (customerId-present, customerPhone-only,
    both-null skip, same-day collapse, next-day produces second row).

The key fields are load-bearing. Id: is a stable identifier matching ^[a-z0-9][a-z0-9-]*$. Skill: names exactly one capability the agent loads before coding. Files: lists concrete repo-root paths. And TestCmd: is the field that makes the task TDD: a real command that must exit zero.
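As a rough illustration of the parse-and-pick step, the sketch below reads a tasks.md-style string line by line and returns the first unchecked item with its scalar sub-fields. The names `ParsedTask` and `firstUncheckedTask` are hypothetical, multi-line fields such as Files: are deliberately ignored, and the real driver's parsing may differ. Only the Id pattern is taken from the text above.

```typescript
// Hypothetical sketch: not the loop's actual parser.
interface ParsedTask {
  title: string;
  fields: Record<string, string>;
}

// Id pattern quoted in the article.
const ID_PATTERN = /^[a-z0-9][a-z0-9-]*$/;

function firstUncheckedTask(markdown: string): ParsedTask | null {
  let current: ParsedTask | null = null;
  for (const line of markdown.split("\n")) {
    const item = /^- \[( |x)\] (.+)$/.exec(line);
    if (item) {
      // A new top-level item ends the task we were collecting.
      if (current) break;
      const [, mark = "", title = ""] = item;
      if (mark === " ") current = { title, fields: {} };
      continue;
    }
    // Scalar sub-field such as "  - TaskType: backend".
    const field = /^\s+- ([A-Za-z0-9]+):\s*(\S.*)$/.exec(line);
    if (current && field) {
      const [, key = "", value = ""] = field;
      current.fields[key] = value.trim();
    }
  }
  return current;
}
```

A caller would then check `ID_PATTERN.test(task.fields["Id"])` before building the prompt, so a malformed identifier is caught before any agent runs.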

These task files are not written by hand. They are produced by a planner pipeline that runs before the loop starts: a Research agent gathers the relevant library patterns, an Interview step fixes the requirements, a Design Writer commits the architecture decisions, a Task Writer converts those into typed tasks, and a Review Agent audits the whole graph — looping back to fix what it flags, up to three rounds, before returning PASS. Model cost is allocated where errors are most expensive: Opus on design and review, Sonnet on lookup and rule-following. By the time the loop executes a single task, the work has been researched, designed, and reviewed.
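The review-and-fix cycle at the end of that pipeline can be sketched as a bounded loop. This is a minimal illustration, assuming a "round" means one review and that a plan still failing after the last round is returned as FAIL. `reviewUntilPass`, `review`, and `fix` are hypothetical names, and the real steps are model invocations rather than plain functions.

```typescript
// Hypothetical sketch of a review-and-fix loop bounded at three rounds.
type Verdict = { pass: boolean; issues: string[] };

function reviewUntilPass<P>(
  plan: P,
  review: (plan: P) => Verdict,
  fix: (plan: P, issues: string[]) => P,
  maxRounds = 3,
): { plan: P; verdict: "PASS" | "FAIL"; rounds: number } {
  let current = plan;
  for (let round = 1; round <= maxRounds; round++) {
    const result = review(current);
    if (result.pass) return { plan: current, verdict: "PASS", rounds: round };
    // Only repair the plan if another review round is still available.
    if (round < maxRounds) current = fix(current, result.issues);
  }
  return { plan: current, verdict: "FAIL", rounds: maxRounds };
}
```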

The verification-hook rule (E3)

The input gate enforces a non-negotiable rule the conventions doc calls E3: any task whose TaskType is backend, frontend, or fullstack must declare at least one verification hook. Six are accepted:

TestCmd: — a shell command, typically pnpm test -- --testPathPattern=<file>.
Route: — a URL to exercise through the Playwright MCP.
Playwright: — a browser interaction narrative.
VisualCheck: — a screenshot or visual-diff anchor.
A11y: — an accessibility check declaration.
Evidence: — non-browser proof: a log excerpt, a network request body, a DB row.

Acceptance: alone does not satisfy E3. An Acceptance: sentence states what should be true; a verification hook states how a machine confirms it. The two are kept distinct by construction.

Static validation, before any agent runs

tools/loop/lib/Validate-Module.ps1 runs with no model call — a static pass over the task file, so a malformed task never consumes an agent invocation. The E3 check is deliberately blunt:

# E3:missing-verification — implementation TaskType without any
# verification hook.
if ($f.TaskType -and ($f.TaskType.ToLower() -in @('backend', 'frontend', 'fullstack'))) {
    # True when at least one of the six accepted hooks is declared.
    $hasHook = [bool]$f.TestCmd -or ($f.Route.Count -gt 0) -or `
               $f.Playwright -or $f.VisualCheck -or $f.A11y -or $f.Evidence
    if (-not $hasHook) {
        # -f binds tighter than +, so the concatenation is parenthesized
        # before formatting; the doubled backticks emit literal backticks.
        $errors.Add((_NewFinding -Rule 'E3:missing-verification' `
            -Message (("TaskType ``{0}`` must declare at least one of " +
                       "TestCmd/Route/Playwright/VisualCheck/A11y/Evidence") -f $f.TaskType)))
    }
}

There is no severity dial. E3 is an error, and an error stops the loop before it starts. The same file rejects vague acceptance criteria ("works correctly", "looks good", "handles edge cases"), duplicate ids, and remote git operations an unattended loop could never authenticate.
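The other three rejections can be pictured the same way. The sketch below is written in TypeScript for illustration only (the validator itself is a PowerShell script). The finding labels, the `TaskDraft` shape, and the list of git subcommands treated as remote are assumptions, not the validator's own; the three vague phrases are the ones named above.

```typescript
// Hypothetical sketch of the remaining static checks.
interface TaskDraft {
  id: string;
  acceptance: string;
  testCmd?: string;
}

const VAGUE_PHRASES = ["works correctly", "looks good", "handles edge cases"];
// Assumption: these subcommands are the ones that contact a remote.
const REMOTE_GIT = /\bgit\s+(push|pull|fetch|clone)\b/;

function lintTasks(tasks: TaskDraft[]): string[] {
  const findings: string[] = [];
  const seen = new Set<string>();
  for (const task of tasks) {
    if (seen.has(task.id)) findings.push(`duplicate-id: ${task.id}`);
    seen.add(task.id);
    const acceptance = task.acceptance.toLowerCase();
    for (const phrase of VAGUE_PHRASES) {
      if (acceptance.includes(phrase)) {
        findings.push(`vague-acceptance: ${task.id} ("${phrase}")`);
      }
    }
    if (task.testCmd && REMOTE_GIT.test(task.testCmd)) {
      findings.push(`remote-git: ${task.id}`);
    }
  }
  return findings;
}
```

An empty result means the plan may proceed; any entry is treated as an error, in keeping with the no-severity-dial rule.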

This is the first gate, and it operates entirely on the plan. What it cannot check is the code, which does not yet exist. For one task at a time, a constraint on the plan is enough. It stops being enough as soon as throughput goes up.

At scale, a green suite stops meaning good

Run one task and a passing pnpm check is a fair proxy for done. Run loops in parallel across a 131-task module spanning a dozen epics, and the proxy leaks — because the two things most likely to degrade at volume are exactly the two a unit-test run does not see.

The frontend was never opened. A frontend task can write a component, write a Jest test that mounts it in isolation, pass, and flip its checkbox — without a real browser ever loading the real route. The component exists; whether it is mounted, whether the page renders, whether the button does anything, whether the console is clean — all unverified. At three tasks the gap is obvious. At a hundred you have a green module and an app that white-screens.

The backend drifts off its architecture. Each task runs in a fresh context — deliberately, so task B47's prompt is not polluted by B1–B46. But a fresh context is also an amnesiac one. An agent that cannot see the existing ports will inject a Mongoose model into a use case, or instantiate an adapter inside the domain. The test still passes — the behavior is correct — but the boundary is breached. Across a hundred tasks, the architecture the project depends on accumulates holes, one reasonable-looking shortcut at a time.

Both failures share a shape: the code is correct and the code is bad, and only a gate that inspects the output distinguishes the two. Each arm of the problem gets its own.

Separation of powers on the frontend

The frontend gate is structural rather than advisory. When a task declares browser acceptance — any of Route:, Playwright:, VisualCheck:, A11y: — the loop stops running it as one agent and splits it into two invocations with different tools.

The implementation pass runs with Context7-only MCP. The browser tools are withheld; the agent writes the code, runs the unit checks, and is instructed not to touch the checkbox:

3. Browser verification is deferred to a second Claude invocation with strict Playwright-only MCP.
   - Do not call browser MCP tools in this implementation pass.
   - Complete the implementation and non-browser checks only.
   - Do not flip the checkbox in this pass; the browser verification pass will flip it if the UI checks pass.

A second agent, with Playwright-only MCP, runs the verification pass. It navigates the real Route:, takes a browser_snapshot to confirm the expected controls exist, performs the Playwright: interaction, screenshots the VisualCheck:, runs axe for A11y:, and reads browser_console_messages. Only this second agent may change - [ ] to - [x].
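The split can be summarized as a small planning function. This is a sketch under assumptions: `planInvocations` and the `Invocation` shape are hypothetical, and giving a single-pass task the Context7 tool set is inferred from the five-phase prompt rather than stated outright.

```typescript
// Hypothetical sketch: decide how many agent invocations a task needs and
// which MCP tool set each one gets.
interface HookFields {
  Route?: string;
  Playwright?: string;
  VisualCheck?: string;
  A11y?: string;
}

interface Invocation {
  pass: "implementation" | "browser-verification";
  mcp: "context7" | "playwright";
  mayFlipCheckbox: boolean;
}

function planInvocations(fields: HookFields): Invocation[] {
  const needsBrowser = Boolean(
    fields.Route || fields.Playwright || fields.VisualCheck || fields.A11y,
  );
  if (!needsBrowser) {
    return [{ pass: "implementation", mcp: "context7", mayFlipCheckbox: true }];
  }
  // Browser acceptance: the writer loses the right to flip the checkbox.
  return [
    { pass: "implementation", mcp: "context7", mayFlipCheckbox: false },
    { pass: "browser-verification", mcp: "playwright", mayFlipCheckbox: true },
  ];
}
```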

The agent that writes the code is not the agent that decides it works. That separation closes the gap described earlier, where a frontend task could go green without a browser ever opening the route: a passing unit test proves a component computes; driving the mounted route in a real browser proves the feature exists.

The verification pass is built to survive a real environment. It preflights the dev server, and when a route redirects to /login it performs an actual credentials login before classifying the failure. When it cannot make the UI pass, it writes BLOCKED.md with a classification (not-mounted, auth-gate, route-missing, acceptance-failed) and halts the loop, surfacing the failure immediately rather than burying it in a log.

The input gate enforces a matching rule so the two halves cannot disagree:

| TaskType | Verification the loop requires |
| --- | --- |
| `backend` | `TestCmd:` (a Jest run, usually against `mongodb-memory-server`) |
| `frontend` | `Route:` + `Playwright:` + `Evidence:` (driven through the Playwright MCP) |
| `fullstack` | both of the above |
| `migration` | `TestCmd:` or `Evidence:` with expected pre/post row counts |
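One way to encode the table above is as groups of alternatives per TaskType. The sketch assumes that "+" means every listed field is required and "or" means any one suffices. `REQUIRED` and `missingHooks` are hypothetical names, not the validator's.

```typescript
// Hypothetical encoding of the table: each inner array is a group in which
// at least one field must be declared.
type TaskType = "backend" | "frontend" | "fullstack" | "migration";
type Hook = "TestCmd" | "Route" | "Playwright" | "Evidence";

const BACKEND: Hook[][] = [["TestCmd"]];
const FRONTEND: Hook[][] = [["Route"], ["Playwright"], ["Evidence"]];

const REQUIRED: Record<TaskType, Hook[][]> = {
  backend: BACKEND,
  frontend: FRONTEND,
  fullstack: [...BACKEND, ...FRONTEND],
  migration: [["TestCmd", "Evidence"]],
};

// Returns a description of every unsatisfied group; empty means valid.
function missingHooks(type: TaskType, declared: ReadonlySet<Hook>): string[] {
  return REQUIRED[type]
    .filter((group) => !group.some((hook) => declared.has(hook)))
    .map((group) => group.join(" or "));
}
```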

The browser:missing-integration rule rejects a frontend task whose Route: acceptance points at a UI that is not actually mounted under apps/web/app/.... The Playwright pass exercises the wired page, or the task does not ship.

Keeping the backend on its architecture

The backend defends its architecture at two moments — before the loop runs, and on every task while it runs. Neither is a compiler-level import-boundary lint.

Before a single line is written, the plan passes through the same Review Agent (tools/planner/agents/review-agent.prompt.md) — "the last gate before the loop starts executing." It loads the codebase convention doc, reads the entire task graph, and fails it on any blocking architectural issue:

every port defined in domain/ has a corresponding adapter task in infrastructure/
no application/ task imports directly from infrastructure/ — it must go through a port
tasks are ordered domain → application → infrastructure → interface/http
injection tokens are Symbol()s in SCREAMING_SNAKE_CASE, defined before they are used
businessId is never read from a request body, query param, or URL — the same header-only tenant rule the API enforces at runtime

The verdict is binary: any blocking issue and it is FAIL, and the loop does not start. A bypassed port is caught as a planning error, before an agent spends a token writing it.
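The ordering rule is the easiest of those checks to picture mechanically. The real review is performed by a model reading the task graph, so the following is only an illustration of the rule, assuming the layer can be read off each file path and that the whole task list is one sequence. `layerRank` and `outOfOrderTasks` are hypothetical names.

```typescript
// Hypothetical sketch of the layer-order rule:
// domain -> application -> infrastructure -> interface/http.
const LAYERS = ["domain", "application", "infrastructure", "interface/http"] as const;

// Index of the layer a path belongs to, or -1 if none is recognizable.
function layerRank(path: string): number {
  return LAYERS.findIndex((layer) => path.includes(`/${layer}/`));
}

interface PlannedTask {
  id: string;
  files: string[];
}

// Flags any task that reaches back to a layer earlier than one already begun.
function outOfOrderTasks(tasks: PlannedTask[]): string[] {
  const offenders: string[] = [];
  let highestSoFar = -1;
  for (const task of tasks) {
    const ranks = task.files.map((file) => layerRank(file)).filter((rank) => rank >= 0);
    if (ranks.length === 0) continue;
    if (Math.min(...ranks) < highestSoFar) offenders.push(task.id);
    highestSoFar = Math.max(highestSoFar, ...ranks);
  }
  return offenders;
}
```

A production check would likely scope this per epic or per module, since a new epic can legitimately start again at domain/.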

While the loop runs, the rules are restated on every task, because an agent starting from a fresh context cannot see the ports that already exist. Build-Prompt.ps1 re-injects them into every backend and fullstack prompt, keyed off the ModuleType: field:

Hexagonal rules (non-negotiable):
- `domain/`          — pure TS (ports, value objects, enums; no NestJS imports)
- `application/`     — `@Injectable()` services depending on ports via `@Inject(TOKEN)`
- `infrastructure/`  — adapters (Mongo, Redis, Meta, etc.)
- `interface/http/`  — controllers + DTOs
Never remove or bypass a port.
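To make the four layers concrete, here is a minimal sketch in plain TypeScript. It leaves NestJS out so it runs standalone; in the codebase described here the service would be `@Injectable()` and would receive the port through `@Inject(TOKEN)`. Every name below (`DemandEventRepository`, `countForDay`, `DemandSummaryService`, and the in-memory adapter) is hypothetical and chosen only for illustration.

```typescript
// domain/: pure TypeScript. A port plus its injection token.
interface DemandEventRepository {
  countForDay(dayBucket: string): number;
}
const DEMAND_EVENT_REPOSITORY = Symbol("DEMAND_EVENT_REPOSITORY");

// application/: depends on the port only, never on a concrete adapter.
class DemandSummaryService {
  private readonly events: DemandEventRepository;
  constructor(events: DemandEventRepository) {
    this.events = events;
  }
  hasDemand(dayBucket: string): boolean {
    return this.events.countForDay(dayBucket) > 0;
  }
}

// infrastructure/: an adapter implementing the port (an in-memory stand-in
// for a Mongo-backed one).
class InMemoryDemandEventRepository implements DemandEventRepository {
  private readonly counts: Map<string, number>;
  constructor(counts: Map<string, number>) {
    this.counts = counts;
  }
  countForDay(dayBucket: string): number {
    return this.counts.get(dayBucket) ?? 0;
  }
}

// Composition root: a tiny stand-in for a DI container's token lookup.
const container = new Map<symbol, unknown>([
  [DEMAND_EVENT_REPOSITORY, new InMemoryDemandEventRepository(new Map([["2024-01-01", 2]]))],
]);
const summaryService = new DemandSummaryService(
  container.get(DEMAND_EVENT_REPOSITORY) as DemandEventRepository,
);
```

The breach the rules guard against would be `DemandSummaryService` constructing `InMemoryDemandEventRepository` (or a Mongoose model) itself: behavior would be identical and a unit test would still pass, but the dependency would point from application/ into infrastructure/.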

pnpm check gates the end of every task. None of these is a hard wall on its own. Together (a plan-time review, rules re-grounded on every task, and a green-or-nothing verify) they make bypassing the architecture the expensive path, and at scale that accounts for most of the protection.

A second failure mode follows from the fresh context. Clean context stops contamination, but it also means task B47 has no knowledge of the conventions tasks B1–B46 settled on. Port naming, event shapes, error-handling style — an amnesiac agent reinvents them, producing a module that is internally inconsistent even though every file passes. The mitigation is not to re-inject the full history; it is a thin, committed ledger the loop keeps beside tasks.md: a PROGRESS.md with one line per PASS, and a history.md recording every PASS, FAIL, and SKIP. That record lets a hundred-task run survive a crash — or a stray git command — without redoing or clobbering finished work, and it is the seam through which a later task can be handed the conventions earlier ones already settled.
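How such a ledger supports resuming after a crash can be sketched briefly. The line format `<STATUS> <task-id>` is an assumption, since the layout of history.md is not specified here, and so is the choice to treat only PASS as finished. `parseLedger` and `tasksToRun` are hypothetical names.

```typescript
// Hypothetical sketch: resume a run from a history ledger.
type Outcome = "PASS" | "FAIL" | "SKIP";

// Keeps the most recent outcome per task id; later lines win.
function parseLedger(text: string): Map<string, Outcome> {
  const latest = new Map<string, Outcome>();
  for (const line of text.split("\n")) {
    const match = /^(PASS|FAIL|SKIP)\s+(\S+)/.exec(line.trim());
    if (match) {
      const [, outcome = "", id = ""] = match;
      latest.set(id, outcome as Outcome);
    }
  }
  return latest;
}

// Everything not recorded as PASS is still eligible to run.
function tasksToRun(allIds: string[], ledger: Map<string, Outcome>): string[] {
  return allIds.filter((id) => ledger.get(id) !== "PASS");
}
```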

The verification chain

Both gates plug into the same five-phase prompt the driver builds for every task:

1. Load patterns. Invoke the named Skill: and pull library docs via Context7.
2. Read design. Open the Design: and State: docs for context.
3. Implement. Write the code under the layer rules for the module type.
4. Verify. Run the TestCmd:, then the whole-repo gate, then (for browser tasks) hand off to the Playwright pass.
5. Flip the checkbox. Change - [ ] to - [x], only when everything is green.

VerifyCmd in Phase 4 defaults to pnpm check, which the root package.json defines as a single chain:

"check": "pnpm lint && pnpm typecheck && pnpm test"

Per-task hooks prove the new behavior; the whole-repo gate proves nothing else regressed. If lint fails, the checkbox stays unset. If a type is wrong, it stays unset. If one test is red — or, for a UI task, if the browser pass cannot drive the route — it stays unset, and the agent is instructed to surface the blocker rather than flip the checkbox. The loop does not record work as done until both proofs hold.
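The two-proof rule reduces to a short sequence: run the per-task command, then the whole-repo command, and stop at the first non-zero exit. The sketch below is a hedged illustration, not the driver's code. `verifyTask` and `RunCommand` are hypothetical, and the command runner is injected so the example does not depend on any particular shell API.

```typescript
// Hypothetical sketch of Phase 4 and Phase 5.
type RunCommand = (command: string) => number; // returns the exit code

interface VerifyResult {
  checkbox: "[x]" | "[ ]";
  failedStep: string | null;
}

function verifyTask(
  testCmd: string,
  run: RunCommand,
  verifyCmd = "pnpm check",
): VerifyResult {
  // Per-task proof first, whole-repo proof second.
  for (const step of [testCmd, verifyCmd]) {
    if (run(step) !== 0) return { checkbox: "[ ]", failedStep: step };
  }
  return { checkbox: "[x]", failedStep: null };
}
```

For a browser task, the result of the Playwright pass would be a third condition in the same chain.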

Codebase audit: verified end-to-end

Every gate so far operates on one task at a time. Above them runs a slower, whole-codebase check: an MVP audit — the product-owner agent dispatching a backend-auditor and a frontend-auditor in parallel — whose single output is an honest answer to what stands between the system and shipping. Its bar for marking anything done is the explicit form of the same standard:

A checkbox is [x] only if the audit traced HTTP entry → controller → service → DB schema → response, AND a test exists exercising the full path.

Code that looks right but has no test, a happy path only, mocks in the critical layers — none of it earns the mark; it stays open or is flagged as claimed-but-unverified. The frontend auditor applies the same rule through a Playwright or MSW test tracing page → query → API → route → render. The instruction to both is one line: no checkboxes based on vibes. It is the per-task standard applied at the scope of the whole product.
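Expressed as a function, the marking rule is a conjunction with no partial credit. The sketch assumes that "claimed-but-unverified" applies when the item was already marked done somewhere and "open" otherwise; that split, and the names `AuditEvidence` and `auditMark`, are illustrative rather than taken from the audit prompts.

```typescript
// Hypothetical sketch of the audit's marking rule.
type AuditMark = "done" | "open" | "claimed-but-unverified";

interface AuditEvidence {
  claimedDone: boolean; // the item was already ticked before the audit
  tracedEndToEnd: boolean; // entry -> controller -> service -> schema -> response
  fullPathTestExists: boolean; // a test exercises that same path
}

function auditMark(evidence: AuditEvidence): AuditMark {
  // Both proofs are required; either one alone earns nothing.
  if (evidence.tracedEndToEnd && evidence.fullPathTestExists) return "done";
  return evidence.claimedDone ? "claimed-but-unverified" : "open";
}
```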

Cost and payoff

The cost is real and worth stating. Up front, a task cannot be written without deciding how it will be proven. At run time, the output gate is not free: the split browser pass is a second agent invocation — more latency, more tokens — and re-injecting the hexagonal rules into every prompt is prompt weight paid on every task. For exploratory work that overhead is a poor trade.

The payoff is that quality becomes a property of the system rather than an act of attention. Verification is wired into the schema, the validator, the prompt, and the separation between the agent that writes and the agent that checks — so it holds at any throughput, including unattended overnight runs with no human reviewing a diff. It is the same approach reached from other directions: in evaluation-driven development for the booking agent, where the eval suite is the spec, and in the hexagonal architecture behind Holocomm, where a dependency cannot point the wrong way. Make the correct outcome the only one the system accepts, and correctness stops depending on discipline that is not always present.

Kent Beck's original framing of TDD asks you to write the test first as an act of will. This system removes the will from the equation: declare how the work will be proven, hand the proving to an agent that did not write the code, or the gate stays shut.