Internal-verified. Built by our AI delivery practice for our parent group's recruitment team. November 2025 – March 2026. Numbers reflect the project's full delivery span across a fractional six-person team.
Executive summary
We built an internal AI recruitment platform: twenty-two fully-specified features, five integrated application modules (API, frontend, end-to-end tests, ETL workflows, database), 745+ commits, four calendar months.
Six fractional team members carried the build: two developers, a QA engineer, a business analyst, a project manager, and a solution architect. Allocation was uneven — the developer pair held the largest share of code-effort, while the BA, PM, QA, and solution architect ramped in around feature handoff points rather than carrying equal load through the engagement. Direct product-delivery effort came to about 1,300 hours across the four calendar months — materially below what a Big-4 engagement estimate would carry for the same scope.
The build matters less than how the build went. We used AI agents (Claude Code was the primary coding tool), but we didn't point them at vague prompts and ask for working software. Every feature went through a four-phase specification process before any production code was written. Requirements first, in a constrained pattern language. Then a technical design. Then a task breakdown. Then test cases. Only then implementation. The output was AI-generated code that matched the system architecture, instead of plausible-looking code that drifted from it.
This is the methodology we now bring to external client engagements. The case below is what it looked like inside one project.
Why AI-assisted development needs structure
The default pattern with AI coding assistants looks productive and often isn't. A developer describes what they want, the assistant generates code, and the developer spends the next several hours debugging the parts the assistant misunderstood and reworking the parts that conflict with code somewhere else in the system. What you get is plausible output. Code that compiles. Code that looks like it does the right thing. Code that doesn't quite. Across a 22-feature project, this compounds into technical debt that has to be paid down before any new feature can be added.
The fix isn't to use less AI. It's to give the AI more structure to work against. We inverted the pattern: instead of using AI primarily to write code, we used it to think through the problem first, producing a structured specification that made the subsequent coding phase fast and architecturally consistent. When the assistant has a 415-line design specification with TypeScript interfaces, data models, and API contracts, the generated code matches the system architecture. Without that, you get plausibility.
We call the methodology Spec-Driven Development. It is a four-phase, mandatory process where every feature is fully specified before a single line of production code is written.
The four-phase process
``
Requirements (Phase 1) → Design (Phase 2) → Tasks (Phase 3) → Test Cases (Phase 4) → Implementation (Code)
``
Phase 1: Requirements. Each feature begins with a requirements document written in EARS format (Easy Approach to Requirements Syntax), a constrained pattern language that produces unambiguous, testable statements:
``
REQ-003: Language Filter AND Logic
WHEN the user selects multiple languages,
THEN the system SHALL display only candidates who speak ALL selected languages,
AND the result count SHALL update to reflect the filtered set.
``
This format eliminates the most expensive class of software defect: building the wrong thing. A requirement written as "WHEN X, THEN Y" maps directly to a test assertion. There is no interpretation gap between what was requested and what gets verified. Each requirements document also carries user stories, edge cases, explicit out-of-scope items, and a domain glossary to prevent terminology drift across features.
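To make the requirement-to-assertion mapping concrete, here is a minimal Python sketch of REQ-003 expressed as a test. It is an illustration rather than the project's code: the `Candidate` type, the `filter_by_languages` helper, and the fixture names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    languages: frozenset


def filter_by_languages(candidates, selected):
    """Keep only candidates who speak ALL selected languages (REQ-003)."""
    wanted = frozenset(selected)
    # Subset test: every selected language must be in the candidate's set.
    return [c for c in candidates if wanted <= c.languages]


def test_req_003_language_filter_and_logic():
    # Hypothetical fixture data, invented for this sketch.
    candidates = [
        Candidate("Ana", frozenset({"English", "German"})),
        Candidate("Ben", frozenset({"English"})),
        Candidate("Cleo", frozenset({"English", "German", "French"})),
    ]
    # WHEN the user selects multiple languages
    result = filter_by_languages(candidates, ["English", "German"])
    # THEN only candidates who speak ALL selected languages are displayed
    assert [c.name for c in result] == ["Ana", "Cleo"]
    # AND the result count reflects the filtered set
    assert len(result) == 2


test_req_003_language_filter_and_logic()
```

Each clause of the EARS statement becomes one comment and one assertion, which is the property that makes the format testable.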
Phase 2: Technical design. Component structure, data models, API contracts, error handling, and the design decisions with their rationale. Every design is written against the system's established architecture documentation, which means feature #22 follows the same patterns as feature #1.
Phase 3: Task breakdown. The design is decomposed into sequenced implementation tasks, each scoped to 2–4 hours of work. Tasks link back to specific requirements, so traceability runs from user need to implementation step. They are tracked with checkboxes and are the single source of truth for implementation progress.
Phase 4: Test cases. Test documentation is split across three purpose-specific files: end-to-end smoke tests (automated critical-path scenarios capped at about 10 per feature), manual test cases (comprehensive user flows covering edge cases), and API test cases (backend endpoint and contract validation). Every test case references the requirement it validates. This is shift-left testing in the truest sense: test cases are authored from the EARS requirements before implementation begins, so coverage is designed into the feature rather than retrofitted after the code ships.
Only then does implementation start.
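Because every test case references the requirement it validates, traceability can be checked mechanically. The sketch below shows one way that check might look in Python. The `REQ-nnn` identifier pattern follows the REQ-003 example; the other requirement titles, the `TC-...` test-case IDs, and both helper functions are assumptions made for illustration.

```python
import re

# Matches identifiers shaped like REQ-003.
REQ_ID = re.compile(r"\bREQ-\d{3}\b")


def requirement_ids(text):
    """Return the set of REQ-nnn identifiers mentioned in a document."""
    return set(REQ_ID.findall(text))


def untested_requirements(requirements_text, testcase_texts):
    """Requirements declared in the spec but referenced by no test case."""
    declared = requirement_ids(requirements_text)
    covered = set()
    for text in testcase_texts:
        covered |= requirement_ids(text)
    return sorted(declared - covered)


# Hypothetical document contents, invented for this sketch.
requirements_md = """
REQ-001: Candidate search
REQ-002: Result pagination
REQ-003: Language Filter AND Logic
"""
testcases = [
    "TC-E2E-01 validates REQ-001 and REQ-003",
    "TC-API-01 validates REQ-001",
]

print(untested_requirements(requirements_md, testcases))  # ['REQ-002']
```

A non-empty result is the kind of gap a coverage matrix is meant to surface before implementation starts.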
Specialized AI agents with ownership boundaries
Rather than one general-purpose AI assistant doing everything, the project defined twelve specialized agents — ten pipeline agents and two on-demand specialists. Each had explicit file ownership and explicit restrictions.
| Agent | Responsibility | Owns / produces |
| --- | --- | --- |
| Business Analyst | Requirements (EARS format, ACs, Goals), PRD validation | `requirements.md`, `cross-feature_requirements/`, PRD validation report |
| Architect | Technical blueprints, feature decomposition, architecture decisions | `design.md`, `api-contract.md`, `architecture.md`, `feature-roadmap.md` |
| UI Designer | UI prototypes, design system, UI-feature alignment | `prototype/`, `_ui-kit/`, ui-feature-alignment doc |
| Developer | Task breakdown + TDD implementation | `tasks.md`, `src/**` |
| Tester | Test case design, post-implementation QA, E2E validation | `tests.md`, `testcases-*.md`, `e2e/**` |
| Reviewer | Code review against specs + coverage tracing | `coverage-matrix.md` |
| Code Analyzer | Static analysis: complexity, vulnerabilities, tech debt | `code-analysis.md`, tech-debt log |
| DevOps | Environment management, migrations, deployment | runbooks, `.env.example`, deployment scripts |
| Triage Analyst | Bug investigation and classification | (read-only: investigation reports) |
| Lessons Analyst | Self-improvement: captures lessons, proposes rules | `lessons.md`, `learned-rules.md`, layer guidance |
| Researcher (on-demand) | Tech research with parallel web search sub-agents | `research-summary.md` |
| Codex Reviewer (on-demand) | Independent second-opinion review of specific artifacts | (read-only: review reports) |
Ownership boundaries do not run on the honor system. A pre-tool hook rejects any write an agent attempts outside its ownership matrix. The Business Analyst agent cannot quietly expand scope by modifying the design. The Tester cannot water down requirements to make tests pass. The Reviewer cannot hide a flagged issue by editing the test it would fail. Hooks enforce mechanically what process documents only request.
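The core of such an ownership check is small. The Python sketch below encodes a subset of the ownership matrix listed above and tests a proposed write against it. It is an assumption-laden illustration: the agent keys, the simplified glob patterns, the `can_write` helper, and the choice of paths relative to a feature folder are ours for this example, and the project's actual hooks are shell scripts.

```python
from fnmatch import fnmatchcase

# Subset of the ownership matrix. Glob syntax is simplified for this sketch:
# with fnmatch, "*" also matches "/" so "src/*" covers nested paths.
OWNERSHIP = {
    "business-analyst": ["requirements.md", "cross-feature_requirements/*"],
    "architect": ["design.md", "api-contract.md", "architecture.md",
                  "feature-roadmap.md"],
    "developer": ["tasks.md", "src/*"],
    "tester": ["tests.md", "testcases-*.md", "e2e/*"],
    "reviewer": ["coverage-matrix.md"],
    "triage-analyst": [],  # read-only agent: owns nothing
}


def can_write(agent, path):
    """True only when the path matches a pattern the agent owns."""
    patterns = OWNERSHIP.get(agent, [])  # unknown agents own nothing
    return any(fnmatchcase(path, pattern) for pattern in patterns)


print(can_write("developer", "src/api/users.ts"))  # True
print(can_write("tester", "requirements.md"))      # False
```

A blocking hook would call a check like this before each write and refuse the tool call when it returns `False`.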
The Codex Reviewer is worth singling out. At any approval gate, the orchestrator can dispatch an independent review of the specific artifact in question — requirements.md, design.md, or the diff for an in-flight feature — to Codex, a different model from the one that produced the artifact. Findings come back to the human approver. It is a cheap way to catch what same-model self-review misses.
The agents were configured against Claude Code (Anthropic's coding assistant). The constraint isn't the tool. The constraint is the process. The same approach works against any sufficiently capable coding agent, and the spec library is portable across them.
Reusable skills as institutional knowledge
Each phase runs through a dedicated skill: a structured template (typically 150–415 lines of guidance) that defines the expected output format, a quality checklist, and patterns specific to that phase.
- Requirements Engineering: EARS syntax patterns, user-story templates, completeness checklist
- Design Documentation: Architecture templates, interface specifications, decision-log format
- Task Breakdown: Sequencing strategies (foundation-first, vertical-slice, risk-first), scope-calibration rules
- QA Test Case Creator: Feature-type assessment logic, test-distribution strategy, naming conventions
Skills make the AI's output consistent regardless of which developer initiates the interaction. They encode the team's standards into a reusable form: institutional knowledge as executable templates. When a new contributor joins, they don't have to learn the team's spec conventions by trial and error; the skill applies them automatically. The skills themselves are version-controlled artifacts, reviewed and improved by the team like any other piece of code.
Where humans stay in control
The AI accelerates production of artifacts. Humans control decisions.
- Phase gates need a human signature. Each phase's output is reviewed and committed to version control independently. The AI cannot advance to design without approved requirements. It cannot advance to tasks without an approved design.
- Scope changes go through specs first. When a new requirement surfaces during implementation, the process enforces a hard stop: stop coding, update requirements, update design, update tasks, update test cases, then resume implementation. No undocumented functionality.
- Architecture stays with humans. Three system-level documents (product context, architecture decisions, engineering standards) are architect-controlled. AI agents read these as immutable context; they cannot modify them.
- Code review judgment is human. A code-review skill provides structured checklists (security, data flow, performance, test coverage), but the accept-or-reject decision belongs to people.
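The gate and scope-change rules above behave like a small state machine. Here is a minimal Python sketch of that idea, assuming a hypothetical `FeaturePipeline` class and phase names of our own choosing. It models two rules from the list: no phase starts before every earlier phase is approved by a person, and a scope change reopens the affected phase plus everything downstream.

```python
PHASES = ["requirements", "design", "tasks", "test_cases", "implementation"]


class PhaseGateError(Exception):
    """Raised when work tries to advance past an unapproved phase."""


class FeaturePipeline:
    def __init__(self):
        self.approved = set()  # phases that carry a human sign-off

    def can_start(self, phase):
        # A phase may start only when every earlier phase is approved.
        earlier = PHASES[: PHASES.index(phase)]
        return all(p in self.approved for p in earlier)

    def approve(self, phase, approver):
        if not approver:
            raise PhaseGateError("approval needs a named human approver")
        if not self.can_start(phase):
            raise PhaseGateError(f"cannot approve {phase}: earlier phase unapproved")
        self.approved.add(phase)

    def reopen(self, phase):
        # Scope change: drop approval for this phase and everything after it.
        for later in PHASES[PHASES.index(phase):]:
            self.approved.discard(later)


pipeline = FeaturePipeline()
pipeline.approve("requirements", approver="analyst")
print(pipeline.can_start("design"))  # True
print(pipeline.can_start("tasks"))   # False
pipeline.reopen("requirements")      # new requirement surfaced mid-build
print(pipeline.can_start("design"))  # False
```

Reopening requirements invalidates the downstream approvals, which mirrors the hard stop: specs are updated and re-approved in order before coding resumes.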
Hooks: mechanical enforcement
Phase gates and ownership boundaries don't rely on agents agreeing to follow them. Twenty-plus shell hooks fire automatically before and after every tool call. The blocking ones (PreToolUse) refuse the action when a violation is detected:
- Phase enforcement: spec edits are blocked during `dev` or `review` phases; feature-spec creation is blocked before `architecture.md` exists
- Ownership enforcement: agents that try to write files outside their ownership matrix are rejected
- Workflow prerequisites: `design.md` writes are blocked until `requirements.md` has Goals; `tasks.md` writes are blocked until the prototype's AC checklist is complete
- Coverage and build gates: the `done` transition is blocked without a complete coverage-matrix; phase transitions are blocked when `npm run build` fails or E2E screenshots are missing
- Destructive operations: `rm -rf`, force-push, `DROP TABLE`, and similar are blocked without explicit user confirmation
Advisory hooks (PostToolUse) handle automation rather than blocking: auto-advancing pipeline state when an artifact is written, regenerating the morning brief, archiving stale lessons, syncing agent metadata. They cannot stop work, but they keep the system tidy.
The effect is that the methodology is enforced mechanically rather than relied upon culturally. An agent — or a developer — that tries to skip a step gets a hard failure, not a polite reminder.
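As an illustration of what a blocking check decides, the Python sketch below covers two of the rules listed above: destructive commands and spec edits during locked phases. The project's hooks are shell scripts, so this is a stand-in under assumptions. The tool names (`shell`, `write`), the set of files treated as specs, the regular expressions, and the `check_tool_call` function are all hypothetical, and the patterns are far from exhaustive.

```python
import re

# Illustrative patterns only; a real hook would need a broader list.
DESTRUCTIVE = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
]
SPEC_FILES = {"requirements.md", "design.md", "tasks.md"}  # assumed spec set
LOCKED_PHASES = {"dev", "review"}


def check_tool_call(phase, tool, target, confirmed=False):
    """Return (allowed, reason) for a proposed tool call."""
    if tool == "shell" and not confirmed:
        if any(pattern.search(target) for pattern in DESTRUCTIVE):
            return False, "destructive command needs explicit user confirmation"
    if tool == "write" and phase in LOCKED_PHASES:
        # Compare on the file name so nested feature folders are covered.
        if target.rsplit("/", 1)[-1] in SPEC_FILES:
            return False, f"spec edits are blocked during the {phase} phase"
    return True, "ok"


print(check_tool_call("dev", "shell", "rm -rf build"))
print(check_tool_call("dev", "write", "features/search/design.md"))
print(check_tool_call("design", "write", "features/search/design.md"))
```

Returning a reason alongside the decision lets the hook tell the agent or developer exactly which rule was hit.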
Quality gates in the CI pipeline
Shift-left testing meant the test cases existed before the code did. Quality gates meant they ran on every change. Four layers of automated coverage feed the CI pipeline that gates every merge:
| Layer | Scope | Source of truth |
| --- | --- | --- |
| Unit tests | Function-level branching and edge cases | Acceptance criteria in `requirements.md` |
| Integration tests | Interactions between components and services within a feature | Data flows in `design.md` |
| API tests | Every endpoint: happy path and documented error responses | `testcases-api.md` |
| UI end-to-end (Playwright) | Critical user flows only: job import, position publish, long-list creation, AI research launch, export | `testcases-e2e.md` |
The pipeline runs all four layers on every pull request. Failures block the merge. We deliberately kept the E2E layer focused on critical functionality rather than exhaustive UI coverage — manual test cases catch the rest, and exhaustive end-to-end suites tend to become flaky and get ignored. The result: every shipped change had verified-passing tests at all four layers, and the regression suite that protected feature #22 was the same one that had protected feature #1.
How the system learns
Every test failure, code-review issue, or spec inconsistency surfaces as a [PENDING] entry in tasks/lessons.md. After every feature's QA stage, the Lessons Analyst agent reviews pending entries, identifies patterns, and proposes them as rules — tagged either [PERMANENT] (foundational, never retire, capped at five) or [ACTIVE] (recent, capped at fifteen, retire after three features without use).
After human approval, rules promote to .claude/rules/learned-rules.md (general) or to layer-specific guidance files for api/, services/, persistence/, ui/, and e2e/. Both auto-load into every subsequent session. The next feature inherits what the previous features learned. By feature #22, the agents were working against a body of version-controlled judgement that didn't exist at feature #1.
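The caps and the retirement rule can be stated precisely in a few lines. This Python sketch uses the numbers given above (five permanent rules, fifteen active rules, retirement after three features without use). The `Rule` type and both functions are hypothetical, and refusing promotion when a cap is reached is our assumption about how the cap is applied.

```python
from dataclasses import dataclass

PERMANENT_CAP = 5
ACTIVE_CAP = 15
RETIRE_AFTER_UNUSED = 3  # features in a row without the rule being used


@dataclass
class Rule:
    text: str
    tag: str  # "PERMANENT" or "ACTIVE"
    unused_features: int = 0


def can_promote(rules, tag):
    """True while the tag's cap still has room (assumed cap behavior)."""
    cap = PERMANENT_CAP if tag == "PERMANENT" else ACTIVE_CAP
    return sum(rule.tag == tag for rule in rules) < cap


def after_feature(rules, used_texts):
    """Update usage counters after a feature and drop retired ACTIVE rules."""
    kept = []
    for rule in rules:
        rule.unused_features = 0 if rule.text in used_texts else rule.unused_features + 1
        if rule.tag == "ACTIVE" and rule.unused_features >= RETIRE_AFTER_UNUSED:
            continue  # retired; PERMANENT rules are never dropped
        kept.append(rule)
    return kept


rules = [Rule("validate DTOs at the API boundary", "ACTIVE"),
         Rule("never edit approved specs in dev", "PERMANENT")]
for _ in range(3):  # three features pass without either rule being used
    rules = after_feature(rules, used_texts=set())
print([rule.tag for rule in rules])  # ['PERMANENT']
```

The two example rule texts are invented. The caps keep the auto-loaded rule set small enough to stay useful as context.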
This is the compounding asset of a spec-driven AI engagement. The spec library, the rule library, the skill templates, and the agent definitions all carry forward — within a project across features, and across projects when the patterns apply.
What we measured
The headline numbers across the project:
- Twenty-two features fully specified through all applicable phases
- 745+ commits across the engagement
- Five application modules integrated (API, frontend, end-to-end tests, ETL workflows, database)
- Six fractional team members at uneven allocation: two developers (the largest share of code-effort), one QA engineer, one business analyst, one project manager, one solution architect
- About 1,300 hours of direct product-delivery effort over roughly four calendar months
- Shift-left testing throughout: test cases were authored from EARS requirements before implementation began, not after, so the regression suite grew as a designed artifact rather than a retrofit
- Comprehensive specification, design, task, and test documentation produced as a byproduct of the development process, not as a separate documentation effort layered on top
We tracked phase-level effort against a traditional baseline. The figures below are expert estimates rather than measured velocity. (We explain why in the next section.)
| Phase | Traditional development | SDD + AI |
| --- | --- | --- |
| Requirements | Often skipped or informal; when done properly, 1–2 days | 4–8 hours per feature across multiple review iterations |
| Technical design | Rarely documented; when done, 1–2 days | 2–4 hours |
| Task breakdown | Ad-hoc or done in sprint planning | 30–60 minutes |
| Test cases | Written after implementation, often incomplete; when thorough, 2–3 days | 2–4 hours for comprehensive coverage |
| Implementation | Variable; 20–40% of effort typically spent on rework and misunderstandings | Focused; about 30–40% less rework |
| Documentation | Separate effort, often deferred indefinitely | Zero additional effort, because the specs are the documentation |
The compound effect showed up across the engagement. EARS-shaped requirements with explicit acceptance criteria meant developers and AI built exactly what was specified. The language-filter feature's AND-not-OR logic decision is a concrete example: caught in requirements review, not discovered in production. Later features built on patterns established by earlier ones, so the spec library functioned as an institutional knowledge base both humans and AI could reference. Onboarding became self-service: a new contributor could understand any feature by reading its spec folder. Test coverage was designed into the feature, not retrofitted, and the regression suite that aggregated individual feature test cases gave us confidence in cross-feature interactions during releases.
Business-analysis effectiveness on the project came in roughly 50% above conventional BA activity, mostly through AI-accelerated domain research, project exploration, stakeholder-meeting transcript analysis, requirements drafting, and prototype generation. The pattern reversed the traditional write-from-scratch approach: the analyst refined AI-drafted specifications rather than authoring them from a blank page.
What we wouldn't claim
The efficiency numbers above are expert opinion, not measured velocity. The next project introduces a stable process, sprint planning, weekly goals, and retrospective cycle-time measurement. Until then, treat the figures as directional evidence rather than benchmarks.
And we wouldn't claim Spec-Driven Development turns AI into a replacement for engineering judgment. The methodology amplifies the team it runs through; it doesn't replace one. A team that doesn't know what good looks like will produce specs that don't know either.
What this pattern unlocks for client engagements
The same delivery shape (specs first, AI-accelerated, bounded agents, human-gated phase reviews) is what compresses a client AI build from a Big-4-style multi-month consulting engagement into an 8–16-week production delivery at mid-market pricing. The economics flip when the spec library, the skill templates, and the agent definitions are reusable assets across engagements. The first project pays the methodology overhead. Every project after carries the previous projects' work forward.
A Pulse Check is the entry point. Free, thirty minutes, no slide deck. We listen to where the workflow breaks today, sketch what specs would need to exist before any code starts, and tell you honestly whether this pattern fits your team or whether something else does.