Six months of production: what I've learned

TL;DR

An energy community management platform I built for the Czech regulated market went live in November 2025. Six months later, the architecture has survived contact with real users, real billing cycles, and real EDC integration traffic. This is what survived, what didn't, and what 150+ metering points told me that no design review could have.

Setting

The Czech energy community framework (Lex OZE II/III) opened the regulated market for peer-to-peer electricity sharing in late 2023. Communities register with the regulator, members assign their metering points, energy is shared within the community according to a registered allocation key, and a national data hub provides 15-minute interval consumption and production data. The administrative overhead of running such a community is significant. The opportunity is that the same software primitives apply to every community, which makes a platform layer worth building.

ECM is that platform. It handles community lifecycle, member onboarding, sharing-group registration with the national data hub, energy aggregation, billing, invoice generation, payment matching, and the cron infrastructure that polls measurement data, runs billing cycles, and sends notifications. The architecture is a modular NestJS + TimescaleDB platform with one extracted service (PDF rendering). It's multi-tenant and supports a country/instance/deployment-mode config cascade so different customer profiles share the same codebase.

I built it in evenings and weekends, alongside a full-time job. The first community went live in November 2025. Six full monthly billing cycles later, I have opinions.

The thing I keep going back to

The single architectural choice that paid for itself first and biggest was making the workflow engine config-driven. Every entity lifecycle in the system — user registration, sharing group setup, billing cycle, invoice processing — runs through the same JSON-driven engine. When the first community needed to handle a non-standard onboarding sequence in week 2, it was a JSON change. Not a release. Not a code review. Not a deployment.
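
To make that concrete, here is roughly what a config-driven lifecycle looks like. The type and the step names below are illustrative, not the engine's actual schema; the point is that a new onboarding variant is a change to data like this, not to code.

```typescript
// Minimal sketch of a config-driven workflow definition.
// Type and step names are illustrative, not the engine's actual schema.
interface WorkflowDefinition {
  entity: string;
  states: string[];
  transitions: {
    from: string;
    to: string;
    trigger: string;          // event or action that drives the transition
    guards?: string[];        // named preconditions checked before transitioning
    afterEffects?: string[];  // named side effects fired after the transition commits
  }[];
}

// Adding a non-standard onboarding step is a change to data like this, not a release:
const memberOnboarding: WorkflowDefinition = {
  entity: 'member',
  states: ['invited', 'documentsPending', 'adminReview', 'active'],
  transitions: [
    { from: 'invited', to: 'documentsPending', trigger: 'invitationAccepted' },
    {
      from: 'documentsPending',
      to: 'adminReview',
      trigger: 'documentsSubmitted',
      guards: ['allRequiredDocumentsPresent'],
    },
    {
      from: 'adminReview',
      to: 'active',
      trigger: 'adminApproved',
      afterEffects: ['sendWelcomeEmail', 'registerMeteringPoint'],
    },
  ],
};
```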

I underestimated this when designing it. I knew config-driven was nicer than code-driven; I didn't fully appreciate that it would let me respond to "we need a new workflow variant" in 20 minutes instead of half a day. Six months of "actually, can we add a step here?" has made me a believer.

Three things that worked from day one

The Result<T, string> discipline in the domain and application layers paid off within the first 30 days. That first month surfaced a handful of edge cases — an unusual address parse, a payment with an unexpected variable symbol format, an integration response missing a field I'd assumed was always present — and every one of them showed up as a structured error message in the logs and a readable message in the UI. Not a 500 with a stack trace. The architectural cost was a discipline to enforce; the operational benefit was that "unknown error, check logs" stopped happening.
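
If you haven't used the pattern, a minimal Result is just a discriminated union; this sketch assumes no particular library, and the variable-symbol parser is an illustrative stand-in for the real validators.

```typescript
// Minimal Result type: a discriminated union that forces callers to handle failure.
type Result<T, E = string> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Illustrative domain function: parsing a Czech bank transfer variable symbol.
function parseVariableSymbol(raw: string): Result<string> {
  const trimmed = raw.trim();
  if (!/^\d{1,10}$/.test(trimmed)) {
    return err(`Variable symbol must be 1-10 digits, got "${raw}"`);
  }
  return ok(trimmed);
}

// The caller cannot forget the failure path; the compiler makes it explicit.
const result = parseVariableSymbol(' 20250042 ');
if (!result.ok) {
  console.warn(result.error); // structured, readable error, not a 500
}
```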

Correlation-aware structured logging paid for itself the first time something went wrong with the national data hub integration. The whole problem was diagnosed in 15 minutes by grepping for one correlation ID across BE logs, the BullMQ job queue audit, and the integration adapter trace. Without correlation IDs, the same problem would have been hours of guessing what happened.
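
The mechanics are cheap. A sketch of the idea using Node's AsyncLocalStorage; the header name and the logger shape are illustrative, not the platform's actual API.

```typescript
// Sketch: one correlation ID propagated through every layer via AsyncLocalStorage.
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

const correlationStore = new AsyncLocalStorage<{ correlationId: string }>();

// HTTP entry point (Express/NestJS middleware shape): reuse an incoming ID or mint one.
function correlationMiddleware(req: any, _res: any, next: () => void) {
  const incoming = req.headers['x-correlation-id'];
  const correlationId = typeof incoming === 'string' ? incoming : randomUUID();
  correlationStore.run({ correlationId }, next);
}

// Every log line carries the ID, so one grep follows a request across controllers,
// use cases, BullMQ jobs (put the ID in the job payload too), and integration adapters.
function log(level: 'info' | 'warn' | 'error', message: string, extra: object = {}) {
  const correlationId = correlationStore.getStore()?.correlationId ?? 'none';
  console.log(JSON.stringify({ level, correlationId, message, ...extra }));
}
```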

And the modular platform with extraction-ready primitives turned out to be the right scope for a two-engineer effort. The PDF rendering service was extracted last quarter without changing a single consumer-side line of code; the boundary was BullMQ jobs, and the boundary was clean. Knowing that other modules could be extracted the same way — but choosing not to extract them, because the operational simplicity of one deployment is worth real money to a small team — is what makes the architecture defensible.
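
The boundary is worth sketching, because it is the whole trick: the consumer only ever sees a queue name and a payload shape. The queue name, payload fields, and connection details below are illustrative.

```typescript
// Sketch of an extraction-ready boundary over BullMQ (names and fields illustrative).
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

interface RenderInvoicePdfJob {
  invoiceId: string;
  templateVersion: string;
  correlationId: string; // carried across the process boundary for log correlation
}

// Consumer side (monolith): enqueue and move on. Whether the worker lives in the
// same deployment or an extracted service is invisible from here.
const pdfQueue = new Queue<RenderInvoicePdfJob>('pdf-render', { connection });

async function requestInvoicePdf(job: RenderInvoicePdfJob) {
  await pdfQueue.add('render-invoice', job, {
    attempts: 3,
    backoff: { type: 'exponential', delay: 5_000 },
  });
}

// Provider side (now its own service): same queue name, no shared code path.
new Worker<RenderInvoicePdfJob>(
  'pdf-render',
  async (job) => {
    // render the PDF, store it, emit a completion event...
    console.log(`rendering invoice ${job.data.invoiceId}`);
  },
  { connection },
);
```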

Three things that surprised me

The integration layer is the moat. I underestimated how much of the platform's actual value would come from making the national data hub integration a non-event for the customer. The official integration surface is unfriendly. I invested in a robust adapter that handles polling, retry on transient failures, schema variability, and deferred reconciliation when the hub is unreachable. Six months in, the people running the community don't think about the hub. The integration absorbs the friction. That layer is the platform's strongest point.
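
The adapter's core policy is simple to state, even if the details aren't: retry what's transient, defer what's unreachable, surface what's permanent. A rough sketch, with hypothetical helper names standing in for the real hub client, persistence, and queue calls:

```typescript
// Sketch of the adapter's failure policy (status codes, names, and helpers illustrative).
type HubFailure = 'transient' | 'unreachable' | 'permanent';

function classifyHubError(status: number | undefined): HubFailure {
  if (status === undefined || status === 502 || status === 503 || status === 504) return 'unreachable';
  if (status === 429 || status >= 500) return 'transient';
  return 'permanent';
}

async function fetchIntervalData(meteringPointId: string, day: string): Promise<void> {
  try {
    await callHub(meteringPointId, day); // hypothetical hub client call
  } catch (e: any) {
    switch (classifyHubError(e?.status)) {
      case 'transient':
        throw e; // rethrow so the job queue retries with backoff
      case 'unreachable':
        await enqueueReconciliation(meteringPointId, day); // pick it up once the hub is back
        return;
      case 'permanent':
        await recordIntegrationError(meteringPointId, day, e); // surface for manual review
        return;
    }
  }
}

// Hypothetical helpers standing in for the platform's actual calls.
async function callHub(mp: string, day: string): Promise<void> {}
async function enqueueReconciliation(mp: string, day: string): Promise<void> {}
async function recordIntegrationError(mp: string, day: string, e: unknown): Promise<void> {}
```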

Billing math was easy; payment matching was hard. The energy-to-money side of billing is well-defined — aggregate measurements, apply tariffs, compute amounts, the regulator publishes the formula. The hard part turned out to be the other direction: matching incoming bank transfers to outstanding invoices when the payer puts the wrong variable symbol, sends the wrong amount, or pays for multiple invoices at once. The platform now has a fuzzy-match layer that produces a confidence score and a manual-review queue for low-confidence matches. About 80% of payments clear automatically. The remaining 20% needs human review, which the admin UI now supports gracefully.
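
The matching itself is unglamorous: score each candidate pairing and only auto-clear above a threshold. The weights and threshold below are illustrative, not the production values.

```typescript
// Sketch of confidence-scored payment matching (weights and threshold illustrative).
interface IncomingPayment { amount: number; variableSymbol?: string; counterpartyIban: string; }
interface OpenInvoice { id: string; amount: number; variableSymbol: string; payerIban?: string; }

function matchConfidence(payment: IncomingPayment, invoice: OpenInvoice): number {
  let score = 0;
  if (payment.variableSymbol === invoice.variableSymbol) score += 0.6;
  if (Math.abs(payment.amount - invoice.amount) < 0.01) score += 0.3;
  if (payment.counterpartyIban === invoice.payerIban) score += 0.1;
  return score;
}

// High-confidence matches clear automatically; everything else lands in a review queue.
function routePayment(payment: IncomingPayment, openInvoices: OpenInvoice[]) {
  const ranked = openInvoices
    .map((invoice) => ({ invoice, score: matchConfidence(payment, invoice) }))
    .sort((a, b) => b.score - a.score);
  const best = ranked[0];
  if (best && best.score >= 0.9) return { kind: 'auto' as const, invoice: best.invoice };
  return { kind: 'manual-review' as const, candidates: ranked.slice(0, 3) };
}
```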

The frontend was the friction point, not the backend. The frontend was inherited from an earlier iteration and only lightly refactored before go-live. The backend held up beautifully under real use; the frontend accumulated paper cuts that needed individual triage. Lesson learned: invest in frontend type contracts as seriously as you invest in your domain model. We're now generating frontend response types from backend DTOs to remove a category of type assertions that were the source of several reported bugs.

The decisions I most don't regret

In rough order of how much they earned their keep:

  1. Modular platform with clean extraction primitives. Deploying one service is operationally simpler than deploying twelve. Being able to extract one when you need to is what makes that defensible long-term.
  2. JSON-driven workflow engine over an external orchestrator. One JSON change to add an admin override step. Zero new infrastructure.
  3. Result<T, E> everywhere except the controller boundary. Forces failure paths to be explicit. Makes a 500 a sign of a real bug, not a routine occurrence.
  4. Correlation IDs through every layer. Cheapest observability investment available. Pays back the first time you have to diagnose anything.
  5. Multi-tenant config cascade. Currently overkill for one customer. Will be the cheapest way to onboard customer five.

The decisions that needed correction

Two architectural extensions shipped during the pilot, both in response to specific observed behavior:

The claim-state CAS primitive (post here) shipped after I observed a "billing close" click producing duplicate emails when an impatient admin clicked twice. Optimistic locking caught the double-write at persist time but not the double-action — the action bodies had already fired before either reached the version CAS. The primitive added a transient "claiming" status and a per-module janitor for recovery; duplicates haven't recurred.
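
The core of the primitive fits in one conditional UPDATE. Table and column names here are illustrative; the real engine wraps this in the workflow layer and pairs it with the janitor mentioned above.

```typescript
// Sketch of the claim idea: a single conditional UPDATE means only one caller can
// move the cycle into 'closing', so a second click never runs the action body at all.
import { Pool } from 'pg';

async function tryClaimBillingClose(pool: Pool, billingCycleId: string): Promise<boolean> {
  const result = await pool.query(
    `UPDATE billing_cycle
        SET status = 'closing', claimed_at = now()
      WHERE id = $1 AND status = 'open'`,
    [billingCycleId],
  );
  return result.rowCount === 1; // false => someone else already claimed it
}

// Usage: claim first, act second; a janitor later releases stale 'closing' rows.
async function closeBillingCycle(pool: Pool, id: string) {
  if (!(await tryClaimBillingClose(pool, id))) return; // duplicate click, nothing to do
  // ...run the close actions, then persist the final 'closed' state...
}
```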

After-transition effects with at-least-once semantics shipped to handle side effects that needed to fire after a successful transition, not during it. Originally the action body did everything, including external publishes; this meant the action couldn't safely retry because a partial run had already touched the outside world. Moving the "publish after the transition is committed" logic into a dedicated after-effect executor — fire once, log failures, durability is the use case's responsibility — made the workflow steps idempotent in a way they hadn't fully been before.
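
The shape of the executor is roughly this (names illustrative): effects are registered per transition, fire once each after the commit, and a failure is logged rather than rolling the transition back.

```typescript
// Sketch of an after-effect executor: runs once per effect after the transition commits.
type AfterEffect = (entityId: string) => Promise<void>;

const afterEffects: Record<string, AfterEffect[]> = {
  'invoice:issued': [
    async (invoiceId) => { /* enqueue the e-mail notification for invoiceId */ },
    async (invoiceId) => { /* publish the issued event for invoiceId to the integration adapter */ },
  ],
};

async function runAfterEffects(transition: string, entityId: string) {
  for (const effect of afterEffects[transition] ?? []) {
    try {
      await effect(entityId);
    } catch (e) {
      // Log and continue: the transition is already committed and must not roll back.
      // Durability of the effect itself is the use case's responsibility (e.g. a queue).
      console.error(JSON.stringify({ level: 'error', transition, entityId, error: String(e) }));
    }
  }
}
```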

Both extensions were observed-need, not design-time. I think this is the right pattern: design the engine for the cases you can name, ship, then add primitives for the failure modes you observe in production. Trying to design every primitive up-front leads to engines with features nobody uses and gaps in the features people do.

What I'd do differently

Generate frontend types from day one. Backend DTOs are validated with class-validator. The frontend reads JSON and casts. We're now generating types end-to-end; doing this in week 1 instead of month 4 would have eliminated a category of bugs that took cumulative days to triage.
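
The fix is mechanical once you commit to it. Assuming the backend already exposes an OpenAPI document and frontend types are generated from it (with something like openapi-typescript), the call site goes from a blind cast to a contract-checked type. The DTO name and fields below are hypothetical stand-ins for the generated types.

```typescript
// InvoiceDto stands in for a type generated from the backend's OpenAPI document;
// in the real setup it would be imported from the generated module, not declared here.
interface InvoiceDto {
  id: string;
  totalAmountCzk: number;
  issuedAt: string;
}

// Before: fetch + cast to `any`, so a renamed or removed field becomes a runtime bug.
// After: the response type mirrors the backend DTO, so the change breaks the build instead.
async function loadInvoice(id: string): Promise<InvoiceDto> {
  const res = await fetch(`/api/invoices/${id}`);
  if (!res.ok) throw new Error(`Failed to load invoice ${id}: ${res.status}`);
  return (await res.json()) as InvoiceDto;
}
```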

Write the ADRs in real time, not in batches. I have 38 ADRs. About 25 were written contemporaneously with the decision; the rest I backfilled. The contemporaneous ones are demonstrably better — they record alternatives considered and rejected, which the backfilled ones often miss. The cost of writing an ADR while making the decision is 20 minutes. The cost of backfilling six months later is recovering context you've already partly forgotten.

Build a shared "needs review" UI primitive earlier. Payment matching, document signature verification, integration reconciliation — all of them eventually needed "human-in-the-loop for the cases automation can't handle." I built each one ad-hoc. A shared primitive, designed once, would have been faster and more consistent than three separate incarnations.
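
In hindsight, all three reduce to the same shape, which is what the shared primitive would have captured. The fields below are a sketch of what I'd design now, not what exists in the codebase.

```typescript
// Sketch of a shared "needs review" primitive (field names illustrative).
interface ReviewItem<TPayload> {
  id: string;
  source: 'payment-matching' | 'signature-verification' | 'integration-reconciliation';
  payload: TPayload;       // whatever the automation looked at
  suggestion?: TPayload;   // the automation's best guess, if it has one
  confidence: number;      // 0..1, drives ordering in the review queue
  status: 'pending' | 'approved' | 'rejected';
  resolvedBy?: string;
  resolvedAt?: Date;
}
```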

The operational things I didn't have to build

One of the more useful side effects of choosing entity-is-state (the workflow engine keeps lifecycle state on the entity itself, rather than in an external orchestrator): a lot of operations work simply didn't have to happen.

The platform is deployed on Docker Swarm with Ansible — a setup that's unfashionable in some circles. It takes ~12 minutes of attention per month and has never woken anyone up, which is exactly the operational profile a two-engineer team needs.

The numbers

Six months in, in rough terms:

  - 150+ metering points across the first live community
  - six full monthly billing cycles completed
  - roughly 80% of incoming payments matched to invoices automatically
  - one service (PDF rendering) extracted, with no consumer-side code changes
  - 38 ADRs, about 25 of them written at decision time
  - ~12 minutes of deployment attention per month, and no overnight incidents

What I take from this

The thing I most want to say after six months of production is the unfashionable one: the boring patterns won. DDD, CQRS-lite, Result types, structured logging, ADRs, idempotent migrations, validation-first / fire-second. There was no silver bullet. There was just doing each of those things consistently, and the compound interest of consistency made the platform fit-for-purpose by go-live and better by month six.

The temptation in side projects is to chase novelty — try a new ORM, a new framework, the workflow engine of the year. I am glad I didn't. I picked patterns I trusted, applied them carefully, wrote down why each one was there. Six months later, none of those choices is one I'd reverse, and the few I'd revise (frontend types, contemporaneous ADRs) are about doing the same boring things sooner, not differently.

If you're starting something similar — a small team, a regulated market, a pilot in mind — the takeaway I'd press hardest is: pick patterns you understand deeply and apply them with discipline. The boring stack is durable. The flashy stack rots when you don't have time to maintain it. With two people and one product, you cannot afford rot.