AI as multiplier, discipline as durability
AI productivity gains require structural discipline to be sustainable. Without systematic rules, documentation, and automation, AI produces inconsistent results — some sessions yield clean code, others introduce technical debt. The solution is six practices: an agents constitution file, Architecture Decision Records, persistent memory, automation infrastructure, adversarial verification, and context hygiene. This system turns AI from an unreliable tool into a consistent multiplier — and lets one person ship at near-team velocity, with a project that doesn't rot underneath them.
The core problem
When I started using AI seriously for production code in late 2024, I observed a pattern: massive variance between sessions. One session produced a clean, well-tested module in 20 minutes. The next session generated code that compiled, ran, and quietly violated the codebase's no-throw rule in three places. The root cause: AI is excellent at producing code that matches the prompt. AI is mediocre at remembering the rules of this specific codebase across sessions. AI is terrible at leaving settled decisions alone when those decisions aren't visible in the local context of the file it's editing; it relitigates them.
The fix wasn't "better prompts." The fix was making the rules of the codebase structurally visible so the AI reads them rather than remembers them. That single shift is what turns AI from a high-variance lottery into a consistent multiplier. The magnitude of the multiplier is task-dependent — large on boilerplate, modest on judgment, sometimes net-negative on long-running refactors — but the shape becomes predictable. The discipline doesn't homogenize the multiplier across tasks; it eliminates the negative tail. No more random sessions that quietly introduce technical debt nobody catches.
Six practices that make it work
The system I converged on has six distinct practices. Each closes a specific failure mode I'd observed.
1. AGENTS.md — the living onboarding doc
A single file at the root of the repository that AI is instructed to read before writing anything. Mine opens with: "This file is law. Every code change, every file created, every pattern used MUST comply. Violations are not 'I'll fix it later' — they are rejected. Read before writing."
The name is deliberate. Not CLAUDE.md, not CURSOR.md — AGENTS.md. The discipline is model-agnostic. If you switch tools next year, the file still works. Any AI assistant that reads the repository root picks it up.
The content is short — about 300 lines — and densely opinionated. It encodes the rules that aren't visible from looking at a random source file:
- No throwing in domain or application layers; use Result types instead
- No DTOs outside the API layer
- No any casts; use unknown with type guards
- No cross-module imports except via events
- Specific carve-outs for atomic writes via forwardRef
Each rule includes concrete examples of the wrong way and the right way, side by side. After AGENTS.md was strict enough, no session ever wrote a throw in a use case again. The compliance rate went from "most of the time" to "every time."
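To make the wrong-way/right-way pairing concrete, here is a minimal sketch of what a no-throw entry could look like. The Result shape and both parseQuantity functions are hypothetical names chosen for illustration, not excerpts from the actual file.

```typescript
// Hypothetical Result type: a failure is a value, not an exception.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Wrong way: the failure path is invisible in the signature.
function parseQuantityThrowing(input: string): number {
  const n = Number(input);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error("invalid quantity");
  }
  return n;
}

// Right way: the failure is part of the return type, so callers must handle it.
function parseQuantity(input: string): Result<number, "invalid_quantity"> {
  const n = Number(input);
  if (!Number.isInteger(n) || n <= 0) {
    return { ok: false, error: "invalid_quantity" };
  }
  return { ok: true, value: n };
}

const parsed = parseQuantity("3");
console.log(parsed.ok ? `quantity: ${parsed.value}` : `rejected: ${parsed.error}`);
```

The compiler refuses to let a caller read value without first checking ok, which is the property the rule is after.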
But here's what actually matters: AGENTS.md isn't a rulebook you write once. It's the onboarding doc for an employee who forgets everything every morning.
Every real team has that doc — "here's how we do things, here's what not to do, here's why." When a new hire violates a convention, you don't blame them — you update the onboarding doc so the next person doesn't make the same mistake. AGENTS.md works the same way.
AI skips a validation step — you add a rule. AI over-engineers a simple CRUD endpoint with unnecessary abstractions — you add "do not add features beyond what the task requires." AI generates Czech error messages in a domain service — you add "error messages in domain layer are English-only."
The file grows from friction, not from planning. My AGENTS.md started at ~50 lines. Nine months later it's ~300. Every line is a scar from a session where the AI did something wrong and I decided it should never happen again.
Honest caveat: not every entry reflects deep principle. Some rules accommodate early codebase states encoded into discipline because consistency outweighed the cost of refactoring. The clearest example: the whole codebase (DB columns, DTOs, event payloads, API responses, and the TypeScript properties that mirror them) uses snake_case throughout, including BE-internal code. The origin is mundane: a non-developer's mock files used snake_case, the FE was built against those mocks, and the API contract followed. By the time the BE was real, snake_case was already wire-deep in the FE.
I tried the conventional NestJS shape first (camelCase internally with a mapper at the entity↔API boundary) and actually built it. The math didn't hold up: two mapping layers, not one (DB row ↔ entity, entity ↔ API DTO), on every endpoint, every test fixture, every event payload, both directions. And the bigger blocker wasn't the BE cost: flipping the API to camelCase would have forced the FE to refactor every consumer, and I wasn't going to make that ask. So I removed the mappers and the BE absorbed the cost: snake_case properties in TypeScript, slightly off-convention but consistent end to end. The FE didn't need to change; every BE file reads user.user_id instead of user.userId. Annoying, not broken.
Not an anti-pattern, but a bounded accommodation with clear trade-offs, and a place where AI didn't push back because a globally-consistent codebase gives it nothing to push back against. The deeper point is that "discipline" in AGENTS.md isn't a synonym for "principle." Both kinds, principled and accommodation, get the same consistent AI compliance, and the operational win is the same. The accommodation kind's downside: AI won't surface it; you only see it when you ask yourself which line in AGENTS.md you wrote because you believed it, versus which line you wrote because the AI didn't push back.
2. ADRs — the decision graph
Architecture Decision Records, written as I go. Each one captures one architectural decision: context, decision, consequences, alternatives considered with rejection reasons, implementation notes. ECM has 38 of them in nine months.
The thing ADRs solve that AGENTS.md doesn't is the "why didn't you just…" failure mode. Future me — or a future contributor, or the AI in a session three months from now — looks at a piece of code that seems weird and thinks "we should just refactor this back to the obvious thing." The ADR has the obvious thing in the "Alternatives Considered" section with the reason it was rejected. The refactor doesn't happen. The hard-won design stays.
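The lookup described above can be sketched as a small data structure. This assumes ADRs were loaded into memory as objects whose fields mirror the sections listed earlier; the Adr interface, the whyNot helper, and the sample record are all hypothetical and not taken from the real ADR set.

```typescript
// Hypothetical in-memory shape for an ADR; fields mirror the sections listed above.
interface RejectedAlternative {
  option: string;
  rejectedBecause: string;
}

interface Adr {
  id: string;
  title: string;
  context: string;
  decision: string;
  consequences: string[];
  alternatives: RejectedAlternative[];
  implementationNotes: string;
}

// Return every recorded rejection whose option mentions the proposed change.
function whyNot(adrs: Adr[], proposal: string): string[] {
  const needle = proposal.toLowerCase();
  const reasons: string[] = [];
  for (const adr of adrs) {
    for (const alt of adr.alternatives) {
      if (alt.option.toLowerCase().includes(needle)) {
        reasons.push(`${adr.id}: ${alt.rejectedBecause}`);
      }
    }
  }
  return reasons;
}

// Illustrative record only; not one of the real ADRs.
const exampleAdrs: Adr[] = [
  {
    id: "ADR-EXAMPLE",
    title: "Return Result values from use cases",
    context: "Exceptions hid failure paths from callers.",
    decision: "Use cases return Result values instead of throwing.",
    consequences: ["Callers must branch on ok before reading a value."],
    alternatives: [
      {
        option: "Throw exceptions and catch at the controller",
        rejectedBecause: "failure paths were invisible in signatures",
      },
    ],
    implementationNotes: "Applies to domain and application layers.",
  },
];

console.log(whyNot(exampleAdrs, "throw exceptions"));
```

The point of the shape is that the rejected option and its reason live next to the decision, so the "obvious" refactor meets its counterargument before it starts.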
I wrote a separate post on ADR-driven solo development. The short version: writing ADRs while making the decision (not in batches afterwards) costs 20 minutes per decision and saves hours per decision when future-you would otherwise relitigate it.
3. Memory files — the persistent context
Conversation-spanning notes that the AI can read at the start of any new session. Mine are organized by category: user profile (who I am, what I'm working on), project facts (key technical details that recur), feedback (corrections I've given the AI that should generalize), references (where external resources live).
This solves the "new session, no context, has to learn everything from scratch" failure mode. A fresh conversation doesn't restart at zero — it inherits the curated, distilled version of what previous sessions discovered. The AI walks in already knowing my codebase has 38 ADRs and one extracted service and a deferred trigger-based audit design.
What matters: the memory isn't a transcript or a journal. It's a curated set of durable facts — things still true a month later. When something changes (pilot date shifts, an ADR supersedes another, my opinion on a pattern reverses), the memory gets updated. Stale memory is worse than no memory; the AI will confidently assert things that were true once and aren't anymore. Curating means being honest about what's still load-bearing.
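One way to make that curation mechanical is to date-stamp each fact and periodically list the ones nobody has re-checked. The sketch below assumes a lastVerified field and a 30-day window, neither of which comes from my actual memory files; the sample facts are placeholders.

```typescript
// Hypothetical memory entry; the categories follow the four listed above.
type MemoryCategory = "user_profile" | "project_fact" | "feedback" | "reference";

interface MemoryEntry {
  category: MemoryCategory;
  fact: string;
  lastVerified: string; // ISO date such as "2025-01-15" (assumed field)
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Entries not re-verified within maxAgeDays are candidates for review or deletion.
function entriesToAudit(entries: MemoryEntry[], today: Date, maxAgeDays: number): MemoryEntry[] {
  const cutoff = today.getTime() - maxAgeDays * DAY_MS;
  return entries.filter((entry) => new Date(entry.lastVerified).getTime() < cutoff);
}

const memory: MemoryEntry[] = [
  { category: "project_fact", fact: "Use cases return Result values", lastVerified: "2025-02-20" },
  { category: "project_fact", fact: "Pilot date is fixed", lastVerified: "2025-01-01" },
];

const stale = entriesToAudit(memory, new Date("2025-03-01"), 30);
console.log(stale.map((entry) => entry.fact));
```

A stale flag does not mean the fact is wrong, only that it is due for the honesty check described above.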
4. Skills, hooks, slash commands, MCPs — the automation layer
Where AGENTS.md and ADRs encode rules, this layer encodes workflows. A "deep-audit" skill that does a full codebase pass with a defined set of phases. A "pull-and-review" skill that fetches the latest commits, runs verification, and produces a structured review. A hook that runs gitleaks before every commit.
This solves the "the right way to do X involves seven steps and I'll skip one" failure mode. When a workflow is automation, you can't skip step 4 because step 4 is a line in the script. When a workflow is "remember to run gitleaks before pushing," step 4 gets skipped sometimes — and that's how secrets end up in commits and force-pushes happen.
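A minimal sketch of "step 4 is a line in the script": the steps are data, the runner executes them in order, and the first failure stops everything. The step names and stubbed outcomes are invented for the example; a real hook would shell out to the actual tools where the stubs are.

```typescript
type StepOutcome = { ok: true } | { ok: false; reason: string };

interface Step {
  name: string;
  run: () => StepOutcome;
}

interface WorkflowReport {
  completed: string[];
  failedAt: string | null;
  reason: string | null;
}

// Run steps in order and stop at the first failure; no step can be skipped.
function runWorkflow(steps: Step[]): WorkflowReport {
  const completed: string[] = [];
  for (const step of steps) {
    const outcome = step.run();
    if (!outcome.ok) {
      return { completed, failedAt: step.name, reason: outcome.reason };
    }
    completed.push(step.name);
  }
  return { completed, failedAt: null, reason: null };
}

// Stubbed steps: a real hook would invoke the actual linters and scanners here.
const preCommit: Step[] = [
  { name: "lint", run: () => ({ ok: true }) },
  { name: "secret-scan", run: () => ({ ok: false, reason: "possible secret in staged files" }) },
  { name: "unit-tests", run: () => ({ ok: true }) },
];

const report = runWorkflow(preCommit);
console.log(report.failedAt, report.completed);
```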
Seven concrete practices once you've passed the basics:
- Push domain logic into skills, not the constitution. AGENTS.md is read on every session and burns tokens; skills load on demand. Anything domain-specific — workflow-engine guidance, billing math, RLS plumbing, integration patterns — belongs in a skill the agent reads only when the work calls for it. The token win is real the moment your constitution starts approaching 500+ lines. The discipline is the same; the cost is much lower.
- Use MCPs for daily-touched tools. Linear / Jira, GitHub, Slack, Confluence, your CI — if you touch it daily, an MCP server gives the agent typed, scoped access through real schemas. Better than scraping CLI output, better than the agent guessing flags from memory. The agent calls the actual API; you get to define what's exposed and what isn't.
- Write your own tools when nothing fits. Custom scripts triggered by skills, or a full MCP server when the surface justifies it. Marginal cost of a one-off script is low; marginal benefit of "this exact workflow, one command, reproducible by future-you or future-AI" is high. Don't wait for someone to publish the perfect tool — write the imperfect one that matches your workflow.
- Add guardrails before they're needed. Settings-level permission rules that prevent destructive operations without explicit approval. Hooks that fail loud on rule violations (gitleaks pre-commit is the canonical one; it caught my .env leak before it ever reached origin in a different session). The cost of writing a guardrail is small; the cost of an unguarded mistake is sometimes catastrophic. Install them like seatbelts: before the crash, not after.
- Wire CI ratchets: quantitative quality gates that only move one direction. Beyond guardrails on individual mistakes, CI-enforced baselines convert "we accept X violations today" into "we never accept more than X tomorrow." Lint warning count, test coverage percentage, cross-module import count, bundle size budget, any-cast count, TypeScript strict-file count: whichever metric matters, capture today's value as a committed baseline and have CI fail if the next PR makes it worse. ECM has a dependency-cruiser ratchet that locks in the current count of cross-module direct imports; any new violation fails the build, forcing it to be either fixed or explicitly justified in the same PR. This matters disproportionately in AI-augmented work: code generation is fast, so without quantitative friction, debt accumulates faster than human review catches it, and AI is genuinely good at satisfying explicit quantitative constraints if they're enforced in CI. Cost: a few hours to capture baselines, ~30 minutes to wire into the pipeline. Benefit: structural. Technical debt becomes mathematically monotone-non-increasing: it can be paid down, it cannot accidentally grow.
- Skills auto-trigger: the description is the load-bearing field. Most people treat skills like slash commands ("invoke when I type /name"). They aren't. The agent scans every skill's description when analyzing a request and decides on its own which to load, without anyone asking. The implication: the description is the matching function. A description like "Helps with billing" fires on every billing-adjacent message and clogs the agent's context; a description like "Use when adding a new tariff, fee type, or modifying billing math. Do NOT use for invoice generation or payment matching (see invoice-workflow skill)" fires precisely when it should. Write descriptions like you're writing the dispatch rule, because that's what they are.
- Compose skills, don't duplicate them. A skill's workflow can reference other skills ("for the migration step, follow the migrations skill; for rollout, the canary-deploy skill") and the agent chains them when it follows the trail. Most teams treat skills as isolated and end up duplicating content (the migration procedure copy-pasted into three skills, each diverging over time). They're not isolated. Cross-reference, keep each skill focused, let the agent compose. The win compounds as the skill set grows: refactor any procedure once, every dependent skill picks it up automatically.
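The CI ratchet from the list above reduces to a comparison between a committed baseline and the current measurement. This is a sketch under assumptions: the metric names and numbers are invented, and it only handles lower-is-better counts.

```typescript
// Metrics where lower is better (violation counts). Names are illustrative.
type Metrics = Record<string, number>;

interface RatchetReport {
  regressions: string[];
  improvements: string[];
}

function checkRatchet(baseline: Metrics, current: Metrics): RatchetReport {
  const report: RatchetReport = { regressions: [], improvements: [] };
  for (const metric of Object.keys(baseline)) {
    const allowed = baseline[metric];
    const measured = current[metric];
    if (allowed === undefined) continue;
    if (measured === undefined) {
      // A metric that silently disappears should not pass the gate.
      report.regressions.push(`${metric}: no measurement reported`);
    } else if (measured > allowed) {
      report.regressions.push(`${metric}: ${measured} exceeds baseline ${allowed}`);
    } else if (measured < allowed) {
      report.improvements.push(`${metric}: baseline can drop from ${allowed} to ${measured}`);
    }
  }
  return report;
}

const baseline: Metrics = { cross_module_imports: 12, any_casts: 4 };
const current: Metrics = { cross_module_imports: 13, any_casts: 3 };

const result = checkRatchet(baseline, current);
console.log(result);
// A CI job would exit non-zero when result.regressions is non-empty.
```

Higher-is-better metrics such as coverage need the comparison inverted. Lowering the baseline whenever an improvement lands is what makes the gate a ratchet instead of a fixed threshold.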
5. The override reflex — adversarial verification
The most underrated discipline. AI is fluent and confident — and sometimes confidently wrong.
Rule one: verify before claiming. Before asserting a fact about the codebase, verify it against the current source via grep, find, or an actual file read — not from memory or "what's typical in NestJS projects."
This rule lives in my memory file as an explicit feedback entry because the AI once confidently told me there were 550 console.log calls polluting my production code. Verification revealed exactly 1 actual call; the other 549 were in OLD_modules/ reference code and one-off scripts. Without verify-before-claiming, I would have launched a cleanup sprint for a problem that didn't exist. The lesson: fluency is not accuracy. At first glance, the AI's confidently wrong claims are indistinguishable from its confidently right ones.
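The verification itself is a few lines once the question is phrased precisely: count the calls, but separate real source from directories that don't ship. The sketch below runs over an in-memory list of files so it stays self-contained; the sample paths (other than OLD_modules/) and the scripts/ prefix are assumptions for illustration.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

interface CallCount {
  production: number;
  excluded: number;
}

// Count console.log calls, separating real source from excluded directories.
function countConsoleLogs(files: SourceFile[], excludedPrefixes: string[]): CallCount {
  const count: CallCount = { production: 0, excluded: 0 };
  for (const file of files) {
    const hits = (file.content.match(/console\.log\(/g) ?? []).length;
    const isExcluded = excludedPrefixes.some((prefix) => file.path.startsWith(prefix));
    if (isExcluded) {
      count.excluded += hits;
    } else {
      count.production += hits;
    }
  }
  return count;
}

// Toy stand-in for the repository; paths other than OLD_modules/ are invented.
const files: SourceFile[] = [
  { path: "src/billing/invoice.service.ts", content: "console.log('debug');" },
  { path: "OLD_modules/legacy.js", content: "console.log(1); console.log(2);" },
  { path: "scripts/one-off.ts", content: "console.log('done');" },
];

console.log(countConsoleLogs(files, ["OLD_modules/", "scripts/"]));
```

The headline number and the number that matters differ only by the exclusion list, which is exactly the detail a fluent summary drops.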
Rule two: treat AI output like a junior engineer's PR — review it, don't merge it. AGENTS.md prevents pattern violations at the syntax level. It cannot prevent the AI from being confidently shallow. When the AI says "your auth system scores 7/10," your job is to ask "did you actually check the boot-time route validation, or did you just see JWT and score it?" The discipline isn't in the rules file — it's in refusing to accept summaries when you know the detail matters.
Corollary: if you can't tell whether the AI's output is shallow or deep, you don't know the domain well enough to use AI for it yet.
Rule three: redirect to source. When AI summarizes, it loses nuance. Build the reflex of saying "go read the actual file" instead of accepting the summary. The AI will form opinions from filenames and patterns before reading the code. Make it prove its claims against the actual source.
6. Context window discipline — know when to start fresh
The most underrated operational discipline. Sessions degrade. Past roughly the 100k-token mark, the agent starts forgetting things established earlier in the conversation, gets confused about which version of the codebase it's looking at, makes suggestions that contradict things you settled an hour ago. Tell-signs: it asks you a question it already asked, it asserts a fact you corrected ten messages ago, its proposed diffs lose precision.
The fix is mechanical: /clear and start fresh. You keep the durable artifacts — AGENTS.md, ADRs, memory files, skills — and lose the session history that was actively hurting more than helping. The agent walks back in with the constitution and the cheat sheet, no accumulated confusion.
Operational rule: if a task is going to span more than a few hours of agent work, plan at least one /clear checkpoint. Summarize the state to memory if useful, then start clean. Better five minutes on a handoff than thirty minutes debugging why the agent suddenly forgot the rules. The agents that ship for half a day aren't running one session; they're running a chain of fresh ones, each with the persistent context but none with the noise.
The empty chair problem
Solo dev means no code review. No architecture review. No "is this good?" from a peer.
Conventional AI-augmented development focuses on AI as a code generator — write faster, scaffold quicker, generate tests. That's real, but it's the smaller win. The larger win is AI as a structural mirror.
The most valuable AI interaction isn't "write this function." It's "here's my architecture — what breaks at 100x scale?" or "evaluate this codebase as if you were deciding whether to join the company." You're using AI to reflect your decisions back through a different lens, catching blind spots you can't see because you've been staring at the code for months.
The catch: the mirror only works if you set up the frame. "Review my code" produces generic feedback. "You're a senior engineer evaluating whether to join this company, and your career depends on this assessment being accurate" produces findings you can act on. The persona isn't a prompt trick — it changes what the AI optimizes for.
Pattern: after any significant architectural phase, run a structured external review. Frame it adversarially. Treat the findings as a punch list, not a report. An external review of my project flagged five concrete issues. Within three days, a structured hardening sprint addressed every one plus twenty-five additional items discovered during the process — boot-time env validation, JSON structured logging, per-adapter latency histograms, coverage thresholds in CI. The review cost an hour. The improvements it triggered took the codebase from "works in production" to "observable in production."
The chair is still empty. But AI fills it better than nothing, and the discipline of treating AI findings as real work items — not just interesting commentary — is what makes it useful.
Zoom, don't dump
Don't ask AI to evaluate everything at once. Start wide, narrow based on findings, go deep where it matters.
Pattern: broad scan → identify interesting areas → targeted deep dive → cross-reference findings. Each step is informed by the previous. This isn't about prompting — it's about treating AI interaction as a conversation with compounding context, not a series of independent requests.
Anti-pattern: pasting 50 files and saying "review this." The AI will give you 50 shallow observations instead of 3 deep ones.
The progressive-delegation model works because AI attention is a finite resource within a session — the same as human attention. Spending it on a surface scan of everything means spending it on a deep scan of nothing. Direct the attention. Let the broad scan tell you where to point the microscope.
What this looks like in practice — a side project in 3 nights
After an AI workshop in 2026, I decided to test how far the system could go in compressed time. The result: a sprint-orchestration and code-review dashboard, built in roughly three evenings of AI-augmented work.
What ended up in it:
- Bun + Hono, JSX server components, SQLite via Bun's native driver
- DDD-lite structure: domain/, application/, infrastructure/, interface/ (the same layering I use in ECM)
- Result pattern across all use cases (same rule, copied via AGENTS.md)
- A 3-layer JSON repair pipeline for LLM output that doesn't quite parse: skill-side validation, parser-layer repair strategies, and a re-prompt fallback. This part is production-grade and worth lifting into its own library someday.
- A self-learning triage loop: human corrects a classification → "upskill" use case writes patterns to a knowledge table → future classifications read learned patterns first.
- A "memory palace": verified review findings accumulate as entries with confidence scores; subsequent reviews cite them via <!-- palace:1,2 --> markers so trust can be tracked.
- Dual-backend AI abstraction: same workflow runs against Claude CLI or Cursor; only the subprocess invocation differs. Both read the same skills folder.
- 13 unit + integration tests; in-memory SQLite for fakes.
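The JSON repair idea from the list above can be sketched in a few dozen lines. This covers only the parser-layer repairs and the re-prompt fallback, with a shape check standing in for validation; the two repair strategies, the Finding shape, and the synchronous reprompt callback are assumptions for illustration, not the project's actual pipeline.

```typescript
type Repair = (raw: string) => string;
type Attempt = { ok: true; value: unknown } | { ok: false };
type Outcome<T> =
  | { ok: true; value: T; via: "direct" | "repair" | "reprompt" }
  | { ok: false };

function tryParse(raw: string): Attempt {
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch {
    return { ok: false };
  }
}

// Repair 1: keep only the outermost {...} span, dropping prose around it.
const sliceToBraces: Repair = (raw) => {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start >= 0 && end > start ? raw.slice(start, end + 1) : raw;
};

// Repair 2: drop trailing commas. Naive: it would also touch ",}" inside strings.
const dropTrailingCommas: Repair = (raw) => raw.replace(/,\s*([}\]])/g, "$1");

function parseWithRepair<T>(
  raw: string,
  isValid: (value: unknown) => value is T,
  repairs: Repair[],
  reprompt: () => string, // stand-in for asking the model again
): Outcome<T> {
  const direct = tryParse(raw);
  if (direct.ok && isValid(direct.value)) return { ok: true, value: direct.value, via: "direct" };

  let candidate = raw;
  for (const repair of repairs) {
    candidate = repair(candidate); // repairs accumulate
    const attempt = tryParse(candidate);
    if (attempt.ok && isValid(attempt.value)) return { ok: true, value: attempt.value, via: "repair" };
  }

  const retry = tryParse(reprompt());
  if (retry.ok && isValid(retry.value)) return { ok: true, value: retry.value, via: "reprompt" };
  return { ok: false };
}

interface Finding {
  severity: string;
}

const isFinding = (value: unknown): value is Finding =>
  typeof value === "object" &&
  value !== null &&
  typeof (value as Record<string, unknown>)["severity"] === "string";

const raw = 'Here is the result: {"severity": "high", "tags": ["a", "b",],} Thanks!';
const outcome = parseWithRepair(raw, isFinding, [sliceToBraces, dropTrailingCommas], () => "{}");
console.log(outcome);
```

Ordering the layers from cheapest to most expensive means the model is only asked again when local repair has genuinely run out of options.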
What is not in it, which is the more interesting part:
- No narration comments. No "// return result" lines.
- No TODO/FIXME debt anywhere in source.
- No commented-out code blocks.
- Zero any casts in domain or application layers; all type-narrowing via guards.
This is what the multiplier looks like. The discipline encoded in AGENTS.md (copied from ECM) and the patterns codified in the ADRs (also copied) produced a ~22K-LOC TypeScript codebase with the same quality profile as something written carefully over months. The work was three nights, not three months, because the rules came pre-loaded.
A side observation: most 3-night side projects have at least one "good enough" shortcut that becomes the technical debt nobody wants to clean up. This one doesn't. Not because I'm a saint — because the rules made the shortcuts impossible. AGENTS.md says no any; the AI knew not to write it. AGENTS.md says no narration comments; the AI knew not to add them. The discipline that took years to develop manually was, by then, a 300-line file the AI honored.
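For readers who haven't used the pattern, this is roughly what "no any, narrow with guards" looks like in practice. The UserPayload shape is a hypothetical example (only user_id echoes a name used earlier in this post); it is a sketch of the technique, not code from either project.

```typescript
interface UserPayload {
  user_id: string;
  email: string;
}

// Type guard: narrows unknown to UserPayload by checking the fields at runtime.
function isUserPayload(value: unknown): value is UserPayload {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["user_id"] === "string" && typeof record["email"] === "string";
}

function describeUser(input: unknown): string {
  if (!isUserPayload(input)) return "rejected: not a user payload";
  // From here on the compiler knows input.user_id is a string.
  return `user ${input.user_id}`;
}

console.log(describeUser(JSON.parse('{"user_id": "u-1", "email": "a@example.com"}')));
console.log(describeUser({ user_id: 42 }));
```

The difference from an any cast is that the unchecked value never reaches code that assumes a shape; the guard is the only door.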
The full anatomy — the JSON-repair pipeline, the self-learning triage upskill loop, the dual-backend AI abstraction (same skills folder consumed by Claude CLI and Cursor agents), the memory palace, the Jira / GitLab / Figma integration surfaces — is in a separate writeup. For this post, the takeaway is the velocity profile, not the architecture.
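As one small, concrete piece of that anatomy, reading the citation markers back out of a review is a short parsing job. The sketch assumes markers always follow the comment format shown earlier, with comma-separated integer ids; the function name is hypothetical, and how the ids feed into confidence scores is left out.

```typescript
// Collect the entry ids cited in markers such as <!-- palace:1,2 -->.
function extractPalaceIds(reviewText: string): number[] {
  const marker = /<!--\s*palace:([\d,\s]*?)\s*-->/g;
  const ids = new Set<number>();
  let match: RegExpExecArray | null;
  while ((match = marker.exec(reviewText)) !== null) {
    for (const part of (match[1] ?? "").split(",")) {
      const id = Number.parseInt(part.trim(), 10);
      if (Number.isInteger(id)) ids.add(id); // skips empty or malformed parts
    }
  }
  return Array.from(ids).sort((a, b) => a - b);
}

const review = [
  "Null check missing in the adapter. <!-- palace:1,2 -->",
  "Same retry issue as the earlier finding. <!-- palace:2,7 -->",
].join("\n");

console.log(extractPalaceIds(review));
```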
What AI is actually good at — and what it isn't
The post so far has talked about AI as if it's uniformly a multiplier. It isn't. The multiplier is real for some kinds of work and absent (or negative) for others. Worth being granular before someone reads this and assumes AI is the answer to every problem.
AI is genuinely good at:
- Coding from clear specification. "Implement this use case following ADR-008's pattern, given this interface and these tests" — huge multiplier, with discipline. The clearer the spec, the bigger the win.
- Generating boilerplate and scaffolds from patterns. New repository, new use case, new value object — the templates are well-defined, and AI matches them faster than you can type.
- Writing tests for code you've already written. Especially edge cases you missed. The AI sees the function shape and can enumerate the corners.
- Documentation generation from code + intent. JSDoc, READMEs, ADRs from a conversation transcript — strong.
- First-draft writing. Blog posts, design docs, change-log entries, commit messages. Your edit pass is fast; the cold-start cost is what AI removes.
- Searching and summarizing codebases too large to read. "Show me every site that calls into the EDC adapter" or "summarize how the workflow engine handles claim-state recovery" — strong, especially with verify-before-claiming.
AI is mixed at:
- Architectural design. Collaborator, not driver. AI defaults to generic patterns ("you could use a workflow engine here…") that often miss the domain's actual constraints. Your domain knowledge has to drive; AI fills in the patterns once you've decided the shape.
- Debugging. Good at hypothesis generation ("here are five things that could cause this symptom"); mediocre at staying focused on the actual symptom once it picks a hypothesis. The "rubber duck that talks back" failure mode is real — AI will confidently chase a wrong lead unless you re-anchor it to the actual evidence.
- Code review. Catches different things than humans — typo bugs, missing null checks, inconsistent naming. Misses architectural drift, judgment calls, and "this is technically right but wrong for this codebase." Complement, not replacement.
- Peer review. Fills the empty chair — but only if you frame the session adversarially and redirect to source when the AI gets shallow.
AI is genuinely bad at:
- Maintaining architectural consistency across sessions without artifacts. The system above (AGENTS.md, ADRs, memory) fixes this; AI alone doesn't. Without the system, every session re-derives the patterns and they drift.
- Catching its own confabulation. AI cannot reliably self-detect when it's confidently wrong. Verify-before-claiming exists because of this. A reader who trusts AI's confidence ships AI's mistakes at scale.
- Long-running multi-file refactors without supervision. Past a certain blast radius, the agent loses the thread — applies the rule to file A, forgets it by file J, contradicts itself at file Q. Human oversight required for any cross-cutting refactor; or break it into many small AI-shaped tasks.
- Anything requiring silence. AI always answers. The "sit with this and think for an afternoon" mode that experienced engineers use to find the actually-elegant solution — it can't. If the right move is "wait, don't decide today," AI will produce a confident decision anyway.
- Business context that isn't documented anywhere. "Why did we choose X" lives in your head until it's in an ADR. AI can't infer it. The first time AI suggests reverting a decision you made for a reason you never wrote down is the moment you wish you'd written it down.
The honest summary: AI is a multiplier where your output is well-specified; it's a coin flip where your thinking is well-specified; and it's a liability where the work requires judgment you haven't externalized. Calibrate accordingly.
The bus factor paradox
Conventional wisdom: solo dev = bus factor 1 (worst possible). AI augmentation makes it worse: more code ships per unit of person-time, so more is at risk.
Counterargument: a solo dev with the system above has a higher effective bus factor than a solo dev without it.
If I disappeared tomorrow and someone else inherited ECM, this is what they'd find:
- AGENTS.md tells them the rules of the codebase in one read. They don't have to derive the conventions empirically.
- 38 ADRs with rejection reasons tell them why every weird-looking thing is the way it is. They don't relitigate decisions I already made.
- Memory files tell them what the past nine months looked like — pilot launch, major refactors, competitive context.
- Skills + slash commands tell them the standard workflows in executable form. They don't have to guess how I review a PR or pull and verify.
- Cheat sheet over the ADRs gives them the map; full ADRs give them the territory.
None of these existed in any prior solo project of mine. The discipline that AI augmentation forced — because without it the multiplier doesn't work — turns out to also be the discipline that turns codified knowledge into a transfer-friendly artifact. The bus factor went from "1 with no documentation" to "1 with about 200 pages of structured, current, internally-consistent documentation." Still 1 in absolute terms. The recovery story is dramatically different.
The interesting corollary: companies hiring senior engineers should be looking for ADR sets in their portfolios, not LinkedIn endorsements. An engineer with 38 well-maintained ADRs in a personal project demonstrates a discipline that's vanishingly rare and trivially transferable to any codebase.
What the system doesn't catch — until it does
AGENTS.md enforces rules the AI can read. ADRs prevent re-litigation. CI ratchets prevent regression. But none of these protect against assumptions you haven't thought to question — until a specific incident exposes one. The system's strength is that each exposure permanently closes the gap. Three concrete examples from my own codebase.
Column types. AGENTS.md's domain is structure; column-type pragmatics aren't part of it. My project had half its timestamp columns as timestamp instead of timestamptz — a difference invisible in development (same timezone) that silently shifts billing period boundaries by an hour across DST transitions in production. An external review surfaced it. The fix was a migration across affected tables, a CI guard preventing future timestamp columns, and an ADR documenting the "Czech midnight" convention. Structural rules now protect that surface; the next time someone writes timestamp without time zone, CI fails the build.
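A guard of this kind can be as simple as a line scan over migration SQL. The sketch below is an assumption about how such a check might look, not the actual CI job: the regex, the function name, and the sample table are invented, and a column literally named timestamp would need a real SQL parser to handle correctly.

```typescript
// Matches the bare type "timestamp" unless it is followed by "with time zone".
// "timestamptz" never matches because there is no word boundary after "timestamp".
const NAIVE_TIMESTAMP = /\btimestamp\b(?!\s*(?:\(\d+\))?\s+with\s+time\s+zone)/i;

// Return the 1-based line numbers that declare a timezone-naive timestamp.
function findNaiveTimestampLines(sql: string): number[] {
  const flagged: number[] = [];
  sql.split("\n").forEach((line, index) => {
    const code = line.replace(/--.*$/, ""); // ignore SQL line comments
    if (NAIVE_TIMESTAMP.test(code)) flagged.push(index + 1);
  });
  return flagged;
}

// Hypothetical migration used only to exercise the check.
const migration = [
  "CREATE TABLE invoice (",
  "  id uuid PRIMARY KEY,",
  "  period_start timestamptz NOT NULL,",
  "  period_end timestamp NOT NULL,",
  "  issued_at timestamp with time zone,",
  "  paid_at timestamp without time zone",
  ");",
].join("\n");

console.log(findNaiveTimestampLines(migration));
// A CI step would fail the build when the returned list is non-empty.
```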
Integration tests. They weren't part of the system from day one. Unit tests existed; domain entities were well-covered; use cases had spec files. But the boundary tests (does the workflow engine actually advance entity status the way the config claims, against a real database?) weren't enforced until a workflow regression slipped through three passing unit tests. The rule landed: every workflow step gets an integration test against a real Postgres via Testcontainers. The cost was real (slower test suite, more setup overhead), but the previous gap was also real, and closing it mattered more than preserving the cheap test run.
Cross-module atomic writes. The original rule was "events only, never direct cross-module DI." That rule was right 95% of the time. The 5% — operations that needed atomic rollback across module boundaries — got worked around with eventually-consistent patterns that were actually worse than the in-process alternative. ADR-033 added the carve-out: forwardRef is allowed specifically for atomic cross-module writes, with a sentinel pattern enforcing rollback through Result types. The gap closed by narrowing the original rule, not abandoning it.
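One way a sentinel can tie rollback to Result values, sketched against an in-memory stand-in instead of real NestJS modules and a real database. FakeDb, RollbackSentinel, and runAtomic are hypothetical names, and this is one reading of the idea; the mechanism ADR-033 actually specifies may differ.

```typescript
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Thrown only to force a rollback; it never escapes runAtomic.
class RollbackSentinel {
  failure: unknown;
  constructor(failure: unknown) {
    this.failure = failure;
  }
}

// Hypothetical in-memory stand-in for a database with transactions.
class FakeDb {
  rows = new Map<string, number>();

  transaction<T>(work: (tx: Map<string, number>) => T): T {
    const draft = new Map(this.rows); // writes go to a copy
    const out = work(draft); // a throw here discards the copy
    this.rows = draft; // commit
    return out;
  }
}

function runAtomic<T, E>(
  db: FakeDb,
  work: (tx: Map<string, number>) => Result<T, E>,
): Result<T, E> {
  try {
    return db.transaction((tx) => {
      const result = work(tx);
      if (!result.ok) throw new RollbackSentinel(result.error);
      return result;
    });
  } catch (caught) {
    if (caught instanceof RollbackSentinel) {
      return { ok: false, error: caught.failure as E };
    }
    throw caught; // unexpected failures are not swallowed
  }
}

const db = new FakeDb();
db.rows.set("meter_balance", 100);

const outcome = runAtomic<number, string>(db, (tx) => {
  tx.set("meter_balance", 0); // write owned by one module
  tx.set("invoice_total", 100); // write owned by another module
  return { ok: false, error: "invoice_rejected" }; // second write fails, so both roll back
});

console.log(outcome, db.rows.get("meter_balance"));
```

Callers still see only a Result; the throw exists purely so the transaction machinery does the rollback, and it is caught at the same boundary where it was raised.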
The pattern is consistent: a structural rule doesn't anticipate every situation. A new situation exposes a gap. The next time it would have happened, a rule, a hook, or an ADR catches it. The gap closes.
This is the real maintenance cost: not writing the file, but noticing when it needs updating. Every time the AI produces something you have to manually correct, ask: "Is this a one-off mistake, or a pattern I'll see again?" If it's a pattern, it goes in the file. If you stop noticing, the file goes stale, and the AI drifts.
The honest costs
None of this is free. The trade-offs I've actually paid:
- Onboarding is heavy. A new contributor (human or AI) has to read AGENTS.md, skim the cheat sheet, dip into the ADRs that pertain to their first task. That's an hour or two before they can write a line. The alternative — discovering the rules empirically through code review — is faster on day one and much slower on day twenty.
- Writing ADRs takes 20 minutes per decision. ~40 decisions × 20 min = ~13 hours of ADR-writing time over nine months. Trivial against the time those ADRs saved by preventing relitigation of past decisions, but real.
- Memory curation requires honesty. Wrong memory is worse than no memory. I've had to delete confidently-asserted memories that turned out to be false (or became outdated). The discipline isn't writing memories; it's auditing them and being willing to remove the wrong ones.
- The verify-before-claiming rule slows the AI down. AI feels less magical when it has to spend five tool calls checking the codebase before answering. The answers are more right. But the surface experience is "I asked a question and it took thirty seconds." That's the cost of accuracy.
- The system is opinionated and won't fit every team. If your team prefers light documentation, this is too heavy. If your team prefers organic conventions, this is too rigid. The point isn't that everyone should copy my AGENTS.md — it's that you need some equivalent if you want AI to be a consistent multiplier instead of a sporadic one.
- The discipline becomes a one-way ratchet. Once AGENTS.md, ADRs, memory, and skills exist, you can't safely turn them off. The codebase is the rules now. The first session that ignores the rule list ships code that drifts; the second session matches the drift; by session five, the system is producing the average of "with discipline" and "without," and your enforcement is structurally weaker. Real lock-in. The good news: maintenance cost is small — roughly half an hour a week to keep memory accurate, ADRs current, rules current. The bad news: it's not zero, and the consequence of skipping it is steady, asymmetric degradation. You can build the system in a month; you can't unbuild the dependency on it without paying weeks of refactor.
- The vigilance tax. The system doesn't maintain itself. Every AI mistake is an opportunity to add a rule — but only if you notice the mistake. Accepting AI output without review means the onboarding doc stops growing, and the AI slowly drifts from your standards. The cost isn't time. It's attention.
The selection effect — AI accelerates, doesn't improve
The hardest truth, before closing.
AI doesn't make you a better engineer. It makes a good engineer faster.
If you can't review what AI writes, AI just produces faster bad code. If you can't articulate your architecture clearly enough to write an AGENTS.md, AI propagates whatever conventions it pattern-matches from its training data. If you can't tell when AI is confabulating, AI confabulates at scale.
A weak engineer with AI ships weak code faster. A strong engineer with AI ships strong code faster. The delta between "weak" and "strong" doesn't compress under AI augmentation; it amplifies. The discipline encoded in the system (AGENTS.md, ADRs, memory, automation, the override reflex, context-window hygiene) is what makes the second case possible. Without it, the first case is the default outcome, and the gap widens: the engineer who was already strong gets much further than the one who was just OK.
This is why the post is about discipline rather than AI. AI is the multiplier; discipline is the precondition. You don't get the multiplier without the precondition. People who lead with "AI made me 20× faster" usually skip articulating the precondition because they treat it as ambient. It isn't ambient. It's the work that makes the multiplier exist.
A useful diagnostic before trusting your AI workflow at scale: turn off the AI for a week and write code yourself. If the quality drops noticeably, you were leaning on the AI to think for you, and the system is masking a skill gap that will show up the moment the AI is wrong. If quality stays consistent, you were using AI to write faster — which is the right shape. The first case is fragile; the second is durable.
Why bother
The system works because it encodes a preference, not just a process. If you don't care whether your code is clean, no amount of AGENTS.md will save you — you'll find workarounds, you'll accept the AI's first draft, you'll skip the verification step because it's slower.
The discipline artifacts formalize a standard that existed before them. The question isn't "should I adopt these practices?" It's "do I already care about this, and want a way to enforce it at AI speed?"
The takeaway
The thing I most want to land:
The output you can get from AI is bounded by the discipline of the system around it, not the cleverness of the prompt. Prompt engineering is real but small. System engineering — what your codebase, your rules, your memory, your automation look like — is what determines whether AI is a multiplier or a lottery.
If you're a solo dev trying to ship something serious with AI:
- Invest in the system, not the prompts
- Write an AGENTS.md before session two — and update it after every session that surprises you
- Start ADRs from the first non-trivial decision
- Curate memory after every conversation that taught you something durable
- Automate workflows that repeat
- Review AI output like you'd review a junior engineer's PR
- Use AI to fill the peer-review chair, not just the coding chair
- Name your gaps honestly — they're the next line in AGENTS.md
The system paid for itself in weeks for me. Whether it does for you depends on the scale and lifetime of what you're building. What survives is the work: with the system built around it, the work tends to outlast the session that produced it.
And — the part I find most worth saying out loud — it's how you ship like a team of one without the project rotting underneath you. The boring patterns are the ones that compound. AI doesn't change that; it amplifies whichever direction you're already going.
Next: the same primitive, the other side of the boundary
Caveat upfront: I'm not a security specialist. I'm a backend architect who works with AI daily and notices the threat surface from inside the systems I build. What follows is a mental model — patterns I see from this adjacency, not specialist doctrine. The next post unpacks them more carefully.
One thing this post deliberately doesn't address: discipline only protects you from yourself. The system above keeps your codebase coherent across hundreds of AI-augmented sessions. It does not keep your codebase safe from someone else's AI-augmented sessions — the attacker who runs the same multiplier against your APIs, your error messages, your login flows, your public surface.
If AI is a multiplier for builders, it's also a multiplier for attackers: autonomous log analysis, library-vulnerability pattern matching at scale, phishing email indistinguishable from the genuine article. And the rule that matters most: everything client-side can be constructed. TLS fingerprint, canvas hash, mouse curves, user-agent, WebGL renderer, audio context, keystroke timing: every signal an attacker controls is a signal an attacker can forge, with off-the-shelf tooling that costs less per year than a developer laptop. The defenses that survive aren't the ones that detect; they're the ones that charge.
The next piece is about what changes when both sides have AI, and what discipline looks like at the boundary rather than inside the codebase. Short version: engineering can be amortized; payment can't. An attacker writes a stealth toolkit once and runs it forever, but they pay per CAPTCHA, per fresh account, per request, every time. The levers: per-account rate limits, proof-of-work on high-cost endpoints, opaque errors for unauthenticated requests, cryptographic identity over passwords, paid access for agentic consumers. Make every attack cost more than its return. Wallet pain for everyone who tries.
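To make "payment can't be amortized" concrete, here is a minimal hashcash-style sketch of proof-of-work on a high-cost endpoint. It is an illustration under stated assumptions, not a hardened design: the function names, the `challenge:nonce` encoding, and the 12-bit difficulty are all hypothetical, and a real deployment would also need single-use, expiring challenges.

```python
import hashlib
import secrets

# Assumption: 12 leading zero bits, i.e. roughly 4,096 hashes per request on average.
DIFFICULTY_BITS = 12


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def issue_challenge() -> str:
    """Server side: hand out a fresh random challenge for each request."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce. This is the per-request cost."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


challenge = issue_challenge()
nonce = solve(challenge, DIFFICULTY_BITS)
print("accepted:", verify(challenge, nonce, DIFFICULTY_BITS))
```

The asymmetry is the point: the server spends one hash to verify, while the client spends thousands to solve, and because each challenge is fresh the work cannot be precomputed or reused across requests. Each extra difficulty bit roughly doubles the expected cost to the caller.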