AI as multiplier, discipline as durability

TL;DR

AI productivity gains require structural discipline to be sustainable. Without systematic rules, documentation, and automation, AI produces inconsistent results — some sessions yield clean code, others introduce technical debt. The solution is six practices: an agents constitution file, Architecture Decision Records, persistent memory, automation infrastructure, adversarial verification, and context hygiene. This system turns AI from an unreliable tool into a consistent multiplier — and lets one person ship at near-team velocity, with a project that doesn't rot underneath them.

The core problem

I noticed a pattern when I started using AI seriously for production code in late 2024: massive variance between sessions. One session produced a clean, well-tested module in 20 minutes. The next generated code that compiled, ran, and quietly violated the no-throw rule in three places. The root cause: AI is excellent at producing code that matches the prompt. AI is mediocre at remembering the rules of this specific codebase across sessions. AI is terrible at respecting decisions you've already made when those decisions aren't visible in the local context of the file it's editing; it relitigates them instead.

The fix wasn't "better prompts." The fix was making the rules of the codebase structurally visible so the AI reads them rather than remembers them. That single shift is what turns AI from a high-variance lottery into a consistent multiplier. The magnitude of the multiplier is task-dependent — large on boilerplate, modest on judgment, sometimes net-negative on long-running refactors — but the shape becomes predictable. The discipline doesn't homogenize the multiplier across tasks; it eliminates the negative tail. No more random sessions that quietly introduce technical debt nobody catches.

Six practices that make it work

The system I converged on has six distinct practices. Each closes a specific failure mode I'd observed.

1. AGENTS.md — the living onboarding doc

A single file at the root of the repository that AI is instructed to read before writing anything. Mine opens with: "This file is law. Every code change, every file created, every pattern used MUST comply. Violations are not 'I'll fix it later' — they are rejected. Read before writing."

The name is deliberate. Not CLAUDE.md, not CURSOR.md, but AGENTS.md. The discipline is model-agnostic. If you switch tools next year, the file still works. Any AI assistant that reads the repository root picks it up.

The content is short — about 300 lines — and densely opinionated. It encodes the rules that aren't visible from looking at a random source file:

Each rule includes concrete examples of the wrong way and the right way, side by side. After AGENTS.md was strict enough, no session ever wrote a throw in a use case again. The compliance rate went from "most of the time" to "every time."
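
As an illustration of what one such wrong-way/right-way pair might look like for the no-throw rule, here is a minimal sketch. The `Result` type, the `findUser` functions, and the error codes are hypothetical; they are not copied from the actual AGENTS.md, and the real codebase's shapes may differ.

```typescript
// Hypothetical Result type; an assumption for illustration only.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

type User = { user_id: string; email: string };
type FindUserError = "USER_NOT_FOUND" | "INVALID_ID";

// Stand-in for a repository; illustrative data only.
const users: User[] = [{ user_id: "u1", email: "a@example.com" }];

// WRONG: the failure path is invisible in the signature.
function findUserThrowing(user_id: string): User {
  const user = users.find((u) => u.user_id === user_id);
  if (!user) throw new Error("User not found");
  return user;
}

// RIGHT: every failure is a value the caller has to handle.
function findUser(user_id: string): Result<User, FindUserError> {
  if (user_id.trim() === "") return err("INVALID_ID");
  const user = users.find((u) => u.user_id === user_id);
  return user ? ok(user) : err("USER_NOT_FOUND");
}
```

The point of putting both versions in the rules file is that the AI sees the rejected shape next to the accepted one, so it has nothing to guess.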

But here's what actually matters: AGENTS.md isn't a rulebook you write once. It's the onboarding doc for an employee who forgets everything every morning.

Every real team has that doc — "here's how we do things, here's what not to do, here's why." When a new hire violates a convention, you don't blame them — you update the onboarding doc so the next person doesn't make the same mistake. AGENTS.md works the same way.

AI skips a validation step — you add a rule. AI over-engineers a simple CRUD endpoint with unnecessary abstractions — you add "do not add features beyond what the task requires." AI generates Czech error messages in a domain service — you add "error messages in domain layer are English-only."

The file grows from friction, not from planning. My AGENTS.md started at ~50 lines. Nine months later it's ~300. Every line is a scar from a session where the AI did something wrong and I decided it should never happen again.

Honest caveat: not every entry reflects deep principle. Some rules accommodate early codebase states encoded into discipline because consistency outweighed the cost of refactoring. The clearest example: the whole codebase (DB columns, DTOs, event payloads, API responses, and the TypeScript properties that mirror them) uses snake_case throughout, including BE-internal code. The origin is mundane: a non-developer's mock files used snake_case, the FE was built against those mocks, and the API contract followed. By the time the BE was real, snake_case was already wire-deep in the FE.

I tried the conventional NestJS shape first, camelCase internally with a mapper at the entity↔API boundary, and actually built it. The math didn't hold up: two mapping layers, not one (DB row ↔ entity, entity ↔ API DTO), on every endpoint, every test fixture, every event payload, both directions. And the bigger blocker wasn't the BE cost: flipping the API to camelCase would have forced the FE to refactor every consumer, and I wasn't going to make that ask. So I removed the mappers and the BE absorbed the cost: snake_case properties in TypeScript, slightly off-convention but consistent end to end. The FE didn't need to change; every BE file reads user.user_id instead of user.userId. Annoying, not broken.

Not an anti-pattern, but a bounded accommodation with clear trade-offs, and a place where AI didn't push back because a globally-consistent codebase gives it nothing to push back against. The deeper point is that "discipline" in AGENTS.md isn't a synonym for "principle." Both kinds, principled and accommodation, get the same consistent AI compliance, and the operational win is the same. The accommodation kind's downside: AI won't surface it; you only see it when you ask yourself which line in AGENTS.md you wrote because you believed it, versus which line you wrote because the AI didn't push back.
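
To make the snake_case trade-off concrete, here is a rough sketch of the two shapes. All type and function names (`UserRow`, `UserEntity`, `rowToEntity`, and so on) are hypothetical; only the `user_id` versus `userId` contrast comes from the codebase described above.

```typescript
// Shape A (built, then removed): camelCase inside, mappers at both boundaries.
type UserRow = { user_id: string; created_at: string }; // DB row
type UserEntity = { userId: string; createdAt: string }; // BE-internal
type UserDto = { user_id: string; created_at: string }; // API contract the FE consumes

const rowToEntity = (row: UserRow): UserEntity => ({
  userId: row.user_id,
  createdAt: row.created_at,
});

const entityToDto = (entity: UserEntity): UserDto => ({
  user_id: entity.userId,
  created_at: entity.createdAt,
});

// Shape B (kept): one snake_case type end to end, no mapping layer at all.
type User = { user_id: string; created_at: string };

const toResponse = (user: User): User => user;
```

In shape A, every new field has to be added in three types and two mappers, in both directions if writes are involved. In shape B it is added once.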

2. ADRs — the decision graph

Architecture Decision Records, written as I go. Each one captures one architectural decision: context, decision, consequences, alternatives considered with rejection reasons, implementation notes. ECM has 38 of them in nine months.

The thing ADRs solve that AGENTS.md doesn't is the "why didn't you just…" failure mode. Future me — or a future contributor, or the AI in a session three months from now — looks at a piece of code that seems weird and thinks "we should just refactor this back to the obvious thing." The ADR has the obvious thing in the "Alternatives Considered" section with the reason it was rejected. The refactor doesn't happen. The hard-won design stays.
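
A minimal sketch of that structure as data, assuming ADRs are kept in a form that can be searched. The `Adr` type, `renderAdr`, and `whyRejected` are hypothetical helpers for illustration, not the tooling actually used in ECM.

```typescript
// Hypothetical ADR shape mirroring the sections named above.
type Alternative = { option: string; rejected_because: string };

type Adr = {
  id: number;
  title: string;
  context: string;
  decision: string;
  consequences: string[];
  alternatives: Alternative[];
  implementation_notes: string;
};

const adrLabel = (id: number): string => `ADR-${String(id).padStart(3, "0")}`;

// Render an ADR as plain text so it can live next to the code it governs.
function renderAdr(adr: Adr): string {
  return [
    `${adrLabel(adr.id)}: ${adr.title}`,
    "",
    "Context",
    adr.context,
    "",
    "Decision",
    adr.decision,
    "",
    "Consequences",
    ...adr.consequences.map((c) => `- ${c}`),
    "",
    "Alternatives Considered",
    ...adr.alternatives.map((a) => `- ${a.option} (rejected: ${a.rejected_because})`),
    "",
    "Implementation Notes",
    adr.implementation_notes,
  ].join("\n");
}

// Look up why an "obvious" option was already turned down, if it was.
function whyRejected(adrs: Adr[], option: string): string | null {
  const wanted = option.toLowerCase();
  for (const adr of adrs) {
    const hit = adr.alternatives.find((a) => a.option.toLowerCase() === wanted);
    if (hit) return `${adrLabel(adr.id)}: ${hit.rejected_because}`;
  }
  return null;
}
```

The lookup is the part that matters: before proposing the obvious refactor, a contributor (or an agent) can check whether the obvious thing already has a rejection reason on record.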

I wrote a separate post on ADR-driven solo development. The short version: writing ADRs while making the decision (not in batches afterwards) costs 20 minutes per decision and saves hours per decision when future-you would otherwise relitigate it.

3. Memory files — the persistent context

Conversation-spanning notes that the AI can read at the start of any new session. Mine are organized by category: user profile (who I am, what I'm working on), project facts (key technical details that recur), feedback (corrections I've given the AI that should generalize), references (where external resources live).

This solves the "new session, no context, has to learn everything from scratch" failure mode. A fresh conversation doesn't restart at zero — it inherits the curated, distilled version of what previous sessions discovered. The AI walks in already knowing my codebase has 38 ADRs and one extracted service and a deferred trigger-based audit design.

What matters: the memory isn't a transcript or a journal. It's a curated set of durable facts — things still true a month later. When something changes (pilot date shifts, an ADR supersedes another, my opinion on a pattern reverses), the memory gets updated. Stale memory is worse than no memory; the AI will confidently assert things that were true once and aren't anymore. Curating means being honest about what's still load-bearing.
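
One way to keep that curation honest is to make staleness visible. The sketch below assumes each memory entry carries a `last_verified` date; that field, the `staleEntries` helper, and the age threshold are assumptions for illustration, not a description of how any particular tool stores memory.

```typescript
// Categories follow the four named above; the entry shape is hypothetical.
type MemoryCategory = "user_profile" | "project_facts" | "feedback" | "references";

type MemoryEntry = {
  category: MemoryCategory;
  fact: string;
  last_verified: string; // ISO date, e.g. "2025-06-01"
};

// Return entries nobody has re-confirmed within maxAgeDays.
function staleEntries(
  entries: MemoryEntry[],
  today: string,
  maxAgeDays: number,
): MemoryEntry[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  const now = Date.parse(today);
  return entries.filter(
    (entry) => (now - Date.parse(entry.last_verified)) / msPerDay > maxAgeDays,
  );
}
```

A stale entry is not necessarily wrong. It is a prompt to re-check the fact against the current source and either re-confirm it or delete it.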

4. Skills, hooks, slash commands, MCPs — the automation layer

Where AGENTS.md and ADRs encode rules, this layer encodes workflows. A "deep-audit" skill that does a full codebase pass with a defined set of phases. A "pull-and-review" skill that fetches the latest commits, runs verification, and produces a structured review. A hook that runs gitleaks before every commit.

This solves the "the right way to do X involves seven steps and I'll skip one" failure mode. When a workflow is automation, you can't skip step 4 because step 4 is a line in the script. When a workflow is "remember to run gitleaks before pushing," step 4 gets skipped sometimes — and that's how secrets end up in commits and force-pushes happen.
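
A sketch of what "step 4 is a line in the script" can mean in practice. The step list is hypothetical; adjust the commands to your own toolchain. The `gitleaks protect --staged` form is from gitleaks v8, and newer releases document a `git` subcommand instead, so check your installed version before copying it.

```typescript
// Injected runner: takes a command and its arguments, returns the exit code.
type CommandRunner = (command: string, args: string[]) => number;

type Step = { name: string; command: string; args: string[] };

// Hypothetical pre-commit steps, in the order they must run.
const preCommitSteps: Step[] = [
  { name: "secret scan", command: "gitleaks", args: ["protect", "--staged"] },
  { name: "type check", command: "npx", args: ["tsc", "--noEmit"] },
  { name: "lint", command: "npx", args: ["eslint", "."] },
];

type RunReport = { passed: string[]; failed: string | null };

// Run every step in order and stop at the first non-zero exit code.
function runSteps(steps: Step[], run: CommandRunner): RunReport {
  const passed: string[] = [];
  for (const step of steps) {
    if (run(step.command, step.args) !== 0) {
      return { passed, failed: step.name };
    }
    passed.push(step.name);
  }
  return { passed, failed: null };
}
```

In a real hook, `run` would wrap a process spawner (for example Node's `child_process.spawnSync`, returning its `status`), and the hook would exit non-zero whenever `failed` is set. The runner is injected here only so the sketch stays testable without the tools installed.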

Seven concrete practices once you've passed the basics:

5. The override reflex — adversarial verification

The most underrated discipline. AI is fluent and confident — and sometimes confidently wrong.

Rule one: verify before claiming. Before asserting a fact about the codebase, verify it against the current source via grep, find, or an actual file read — not from memory or "what's typical in NestJS projects."

This rule lives in my memory file as an explicit feedback entry because the AI once confidently told me there were 550 console.log calls polluting my production code. Verification revealed exactly 1 actual call; the other 549 were in OLD_modules/ reference code and one-off scripts. Without verify-before-claiming, I would have launched a cleanup sprint against a non-problem. The lesson: fluency is not accuracy. AI is fluent enough that its confidently wrong claims are indistinguishable from its confidently right ones at first glance.
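
The check that exposes this kind of claim is cheap: count per directory instead of trusting a total. A minimal sketch, assuming the files are already loaded as a path-to-content map; the `countByTopDir` helper and the sample paths in the usage note are hypothetical.

```typescript
// Count occurrences of a literal string per top-level directory, so that
// reference code and one-off scripts are not mistaken for production code.
function countByTopDir(
  files: Record<string, string>,
  needle: string,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const [path, content] of Object.entries(files)) {
    const hits = content.split(needle).length - 1;
    if (hits === 0) continue;
    const topDir = path.split("/", 1)[0] ?? path;
    counts[topDir] = (counts[topDir] ?? 0) + hits;
  }
  return counts;
}
```

Called with a map containing `src/...`, `OLD_modules/...`, and `scripts/...` paths, it returns one count per top-level directory, which is the breakdown that separates a real problem from an alarming headline number.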

Rule two: treat AI output like a junior engineer's PR — review it, don't merge it. AGENTS.md prevents pattern violations at the syntax level. It cannot prevent the AI from being confidently shallow. When the AI says "your auth system scores 7/10," your job is to ask "did you actually check the boot-time route validation, or did you just see JWT and score it?" The discipline isn't in the rules file — it's in refusing to accept summaries when you know the detail matters.

Corollary: if you can't tell whether the AI's output is shallow or deep, you don't know the domain well enough to use AI for it yet.

Rule three: redirect to source. When AI summarizes, it loses nuance. Build the reflex of saying "go read the actual file" instead of accepting the summary. The AI will form opinions from filenames and patterns before reading the code. Make it prove its claims against the actual source.

6. Context window discipline — know when to start fresh

Another underrated discipline, this one operational. Sessions degrade. Past roughly the 100k-token mark, the agent starts forgetting things established earlier in the conversation, gets confused about which version of the codebase it's looking at, and makes suggestions that contradict things you settled an hour ago. Telltale signs: it asks you a question it already asked, it asserts a fact you corrected ten messages ago, its proposed diffs lose precision.

The fix is mechanical: /clear and start fresh. You keep the durable artifacts — AGENTS.md, ADRs, memory files, skills — and lose the session history that was actively hurting more than helping. The agent walks back in with the constitution and the cheat sheet, no accumulated confusion.

Operational rule: if a task is going to span more than a few hours of agent work, plan at least one /clear checkpoint. Summarize the state to memory if useful, then start clean. Better five minutes on a handoff than thirty minutes debugging why the agent suddenly forgot the rules. The agents that ship for half a day aren't running one session; they're running a chain of fresh ones, each with the persistent context but none with the noise.
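
If you want a number instead of a feeling, a crude budget check is enough. The sketch below assumes roughly four characters per token, which is a rough heuristic for English text and not a real tokenizer. The 100k hard limit comes from the observation above; the 80k soft limit and the `adviseSession` helper are assumptions for illustration.

```typescript
// Rough heuristic only: about 4 characters per token. Not a tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

type SessionAdvice = "continue" | "plan_handoff" | "clear_now";

// Decide whether to keep going, write a handoff summary, or start fresh.
function adviseSession(
  messages: string[],
  softLimit: number = 80000,
  hardLimit: number = 100000,
): SessionAdvice {
  const used = messages.reduce((sum, message) => sum + estimateTokens(message), 0);
  if (used >= hardLimit) return "clear_now";
  if (used >= softLimit) return "plan_handoff";
  return "continue";
}
```

The soft limit exists so the handoff summary gets written while the session is still coherent enough to write a good one.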

The empty chair problem

Solo dev means no code review. No architecture review. No "is this good?" from a peer.

Conventional AI-augmented development focuses on AI as a code generator — write faster, scaffold quicker, generate tests. That's real, but it's the smaller win. The larger win is AI as a structural mirror.

The most valuable AI interaction isn't "write this function." It's "here's my architecture — what breaks at 100x scale?" or "evaluate this codebase as if you were deciding whether to join the company." You're using AI to reflect your decisions back through a different lens, catching blind spots you can't see because you've been staring at the code for months.

The catch: the mirror only works if you set up the frame. "Review my code" produces generic feedback. "You're a senior engineer evaluating whether to join this company, and your career depends on this assessment being accurate" produces findings you can act on. The persona isn't a prompt trick — it changes what the AI optimizes for.

Pattern: after any significant architectural phase, run a structured external review. Frame it adversarially. Treat the findings as a punch list, not a report. An external review of my project flagged five concrete issues. Within three days, a structured hardening sprint addressed every one plus twenty-five additional items discovered during the process — boot-time env validation, JSON structured logging, per-adapter latency histograms, coverage thresholds in CI. The review cost an hour. The improvements it triggered took the codebase from "works in production" to "observable in production."

The chair is still empty. But AI fills it better than nothing, and the discipline of treating AI findings as real work items — not just interesting commentary — is what makes it useful.

Zoom, don't dump

Don't ask AI to evaluate everything at once. Start wide, narrow based on findings, go deep where it matters.

Pattern: broad scan → identify interesting areas → targeted deep dive → cross-reference findings. Each step is informed by the previous. This isn't about prompting — it's about treating AI interaction as a conversation with compounding context, not a series of independent requests.

Anti-pattern: pasting 50 files and saying "review this." The AI will give you 50 shallow observations instead of 3 deep ones.

The progressive-delegation model works because AI attention is a finite resource within a session — the same as human attention. Spending it on a surface scan of everything means spending it on a deep scan of nothing. Direct the attention. Let the broad scan tell you where to point the microscope.

What this looks like in practice — a side project in 3 nights

After an AI workshop in 2026, I decided to test how far the system could go in compressed time. The result: a sprint-orchestration and code-review dashboard, built in roughly three evenings of AI-augmented work.

What ended up in it:

What is not in it, which is the more interesting part:

This is what the multiplier looks like. The discipline encoded in AGENTS.md (copied from ECM) and the patterns codified in the ADRs (also copied) produced a ~22K-LOC TypeScript codebase with the same quality profile as something written carefully over months. The work was three nights, not three months, because the rules came pre-loaded.

A side observation: most 3-night side projects have at least one "good enough" shortcut that becomes the technical debt nobody wants to clean up. This one doesn't. Not because I'm a saint — because the rules made the shortcuts impossible. AGENTS.md says no any; the AI knew not to write it. AGENTS.md says no narration comments; the AI knew not to add them. The discipline that took years to develop manually was, by then, a 300-line file the AI honored.

The full anatomy — the JSON-repair pipeline, the self-learning triage upskill loop, the dual-backend AI abstraction (same skills folder consumed by Claude CLI and Cursor agents), the memory palace, the Jira / GitLab / Figma integration surfaces — is in a separate writeup. For this post, the takeaway is the velocity profile, not the architecture.

What AI is actually good at — and what it isn't

The post so far has talked about AI as if it's uniformly a multiplier. It isn't. The multiplier is real for some kinds of work and absent (or negative) for others. Worth being granular before someone reads this and assumes AI is the answer to every problem.

AI is genuinely good at:

AI is mixed at:

AI is genuinely bad at:

The honest summary: AI is a multiplier where your output is well-specified; it's a coin flip where your thinking is well-specified; and it's a liability where the work requires judgment you haven't externalized. Calibrate accordingly.

The bus factor paradox

Conventional wisdom: solo dev = bus factor 1 (worst possible). AI augmentation makes it worse because more code ships per unit of person-time, so more is at risk.

Counterargument: a solo dev with the system above has a higher effective bus factor than a solo dev without it.

If I disappeared tomorrow and someone else inherited ECM, they'd inherit:

None of these existed in any prior solo project of mine. The discipline that AI augmentation forced — because without it the multiplier doesn't work — turns out to also be the discipline that turns codified knowledge into a transfer-friendly artifact. The bus factor went from "1 with no documentation" to "1 with about 200 pages of structured, current, internally-consistent documentation." Still 1 in absolute terms. The recovery story is dramatically different.

The interesting corollary: companies hiring senior engineers should be looking for ADR sets in their portfolios, not LinkedIn endorsements. An engineer with 38 well-maintained ADRs in a personal project demonstrates a discipline that's vanishingly rare and trivially transferable to any codebase.

What the system doesn't catch — until it does

AGENTS.md enforces rules the AI can read. ADRs prevent re-litigation. CI ratchets prevent regression. But none of these protect against assumptions you haven't thought to question — until a specific incident exposes one. The system's strength is that each exposure permanently closes the gap. Three concrete examples from my own codebase.

Column types. AGENTS.md's domain is structure; column-type pragmatics aren't part of it. My project had half its timestamp columns as timestamp instead of timestamptz — a difference invisible in development (same timezone) that silently shifts billing period boundaries by an hour across DST transitions in production. An external review surfaced it. The fix was a migration across affected tables, a CI guard preventing future timestamp columns, and an ADR documenting the "Czech midnight" convention. Structural rules now protect that surface; the next time someone writes timestamp without time zone, CI fails the build.
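
A minimal sketch of what such a CI guard could look like, assuming migrations are plain SQL loaded as a file-to-content map. The `findBareTimestamps` helper is hypothetical and is not the guard actually used in the project. It is line-based and will also flag a column that is literally named `timestamp` or a mention inside a comment, which a real guard would need to handle.

```typescript
type Violation = { file: string; line: number; text: string };

// Matches bare `timestamp` and `timestamp without time zone`, but not
// `timestamptz` or `timestamp [(p)] with time zone`.
const bareTimestamp = /\btimestamp\b(?!\s*(\(\d+\))?\s*with\s+time\s+zone)/i;

// Scan every line of every migration and collect offending lines.
function findBareTimestamps(migrations: Record<string, string>): Violation[] {
  const violations: Violation[] = [];
  for (const [file, sql] of Object.entries(migrations)) {
    sql.split("\n").forEach((text, index) => {
      if (bareTimestamp.test(text)) {
        violations.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return violations;
}
```

In CI, a non-empty result would fail the build and print the file and line, so the fix is obvious at the point where the mistake was made.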

Integration tests weren't part of the system from day one. Unit tests existed; domain entities were well-covered; use cases had spec files. But the boundary tests — does the workflow engine actually advance entity status the way the config claims, against a real database — those weren't enforced until a workflow regression slipped through three passing unit tests. The rule landed: every workflow step gets an integration test against a real Postgres via Testcontainers. The cost was real (slower test suite, more setup overhead), but the previous gap was also real, and it was closing the gap that mattered, not preserving the cheap test run.

Cross-module atomic writes. The original rule was "events only, never direct cross-module DI." That rule was right 95% of the time. The 5% — operations that needed atomic rollback across module boundaries — got worked around with eventually-consistent patterns that were actually worse than the in-process alternative. ADR-033 added the carve-out: forwardRef is allowed specifically for atomic cross-module writes, with a sentinel pattern enforcing rollback through Result types. The gap closed by narrowing the original rule, not abandoning it.

The pattern is consistent: a structural rule doesn't anticipate every situation. A new situation exposes a gap. The next time it would have happened, a rule, a hook, or an ADR catches it. The gap closes.

This is the real maintenance cost: not writing the file, but noticing when it needs updating. Every time the AI produces something you have to manually correct, ask: "Is this a one-off mistake, or a pattern I'll see again?" If it's a pattern, it goes in the file. If you stop noticing, the file goes stale, and the AI drifts.

The honest costs

None of this is free. The trade-offs I've actually paid:

The selection effect — AI accelerates, doesn't improve

The hardest truth, before closing.

AI doesn't make you a better engineer. It makes a good engineer faster.

If you can't review what AI writes, AI just produces faster bad code. If you can't articulate your architecture clearly enough to write an AGENTS.md, AI propagates whatever conventions it pattern-matches from its training data. If you can't tell when AI is confabulating, AI confabulates at scale.

A weak engineer with AI ships weak code faster. A strong engineer with AI ships strong code faster. The delta between "weak" and "strong" doesn't compress under AI augmentation; it amplifies. The discipline encoded in the system — AGENTS.md, ADRs, memory, automation, the override reflex, context-window hygiene — is what makes the second case possible. Without it, the first case is the default outcome, and the engineer who was already strong gets much further than the one who was just OK.

This is why the post is about discipline rather than AI. AI is the multiplier; discipline is the precondition. You don't get the multiplier without the precondition. People who lead with "AI made me 20× faster" usually skipped articulating the precondition because they treat it as ambient. It isn't ambient. It's the work that makes the multiplier exist.

A useful diagnostic before trusting your AI workflow at scale: turn off the AI for a week and write code yourself. If the quality drops noticeably, you were leaning on the AI to think for you, and the system is masking a skill gap that will show up the moment the AI is wrong. If quality stays consistent, you were using AI to write faster — which is the right shape. The first case is fragile; the second is durable.

Why bother

The system works because it encodes a preference, not just a process. If you don't care whether your code is clean, no amount of AGENTS.md will save you — you'll find workarounds, you'll accept the AI's first draft, you'll skip the verification step because it's slower.

The discipline artifacts formalize a standard that existed before them. The question isn't "should I adopt these practices?" It's "do I already care about this, and want a way to enforce it at AI speed?"

The takeaway

The thing I most want to land:

The output you can get from AI is bounded by the discipline of the system around it, not the cleverness of the prompt. Prompt engineering is real but small. System engineering — what your codebase, your rules, your memory, your automation look like — is what determines whether AI is a multiplier or a lottery.

If you're a solo dev trying to ship something serious with AI:

The system paid for itself in weeks for me. Whether it does for you depends on the scale and lifetime of what you're building. The work it produces is what survives — if you've built the system around it, the work tends to outlast the session that produced it.

And — the part I find most worth saying out loud — it's how you ship like a team of one without the project rotting underneath you. The boring patterns are the ones that compound. AI doesn't change that; it amplifies whichever direction you're already going.

Next: the same primitive, the other side of the boundary

Caveat upfront: I'm not a security specialist. I'm a backend architect who works with AI daily and notices the threat surface from inside the systems I build. What follows is a mental model — patterns I see from this adjacency, not specialist doctrine. The next post unpacks them more carefully.

One thing this post deliberately doesn't address: discipline only protects you from yourself. The system above keeps your codebase coherent across hundreds of AI-augmented sessions. It does not keep your codebase safe from someone else's AI-augmented sessions — the attacker who runs the same multiplier against your APIs, your error messages, your login flows, your public surface.

If AI is a multiplier for builders, it's also a multiplier for attackers. Autonomous log analysis. Library-vulnerability pattern matching at scale. Phishing email indistinguishable from genuine mail. And the rule that matters most: everything client-side can be constructed. TLS fingerprint, canvas hash, mouse curves, user-agent, WebGL renderer, audio context, keystroke timing: every signal an attacker controls is a signal an attacker can forge, with off-the-shelf tooling that costs less per year than a developer laptop. The defenses that survive aren't the ones that detect; they're the ones that charge.

The next piece is about what changes when both sides have AI, and what discipline looks like at the boundary rather than inside the codebase. Short version: engineering can be amortized; payment can't. An attacker writes a stealth toolkit once and runs it forever, but they pay per CAPTCHA, per fresh account, per request, every time. The levers: per-account rate limits, proof-of-work on high-cost endpoints, opaque errors for unauthenticated requests, cryptographic identity over passwords, paid access for agentic consumers. Make every attack cost more than its return. Wallet pain for everyone trying.
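
As a small taste of the "charge, don't detect" idea, here is a sketch of a per-account rate limit using a token bucket. The token bucket is my choice of technique for illustration; the post does not say which algorithm is used, and the class name and parameters are hypothetical. The `cost` argument is how an expensive endpoint can charge more than a cheap one.

```typescript
type Bucket = { tokens: number; last_refill_ms: number };

class PerAccountRateLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true if the request is allowed and deducts its cost.
  allow(account_id: string, now_ms: number, cost: number = 1): boolean {
    const bucket =
      this.buckets.get(account_id) ?? { tokens: this.capacity, last_refill_ms: now_ms };
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSeconds = Math.max(0, now_ms - bucket.last_refill_ms) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond,
    );
    bucket.last_refill_ms = now_ms;
    this.buckets.set(account_id, bucket);
    if (bucket.tokens < cost) return false;
    bucket.tokens -= cost;
    return true;
  }
}
```

Because the budget is keyed by account, getting more throughput means acquiring more accounts, which is exactly the recurring cost the attacker cannot amortize.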