Why we built our own workflow engine instead of using Temporal
For a modular monolith running entity lifecycle workflows (registration, billing, invoicing) in a regulated market, an entity-is-state engine with JSON-driven config beat Temporal on every axis that mattered: deployment footprint, debugging story, extraction-readiness, and time to first pilot. Temporal solves real problems — they're just not the problems we have. This is the reasoning, not the code.
The constraint that shaped everything
We're building an energy community management platform for the Czech regulated market. Two engineers. Pilot deadline tight. Workflows are everywhere — user registration goes through ~15 statuses before someone can share electricity, sharing groups have their own lifecycle through the national data hub, invoices flow through draft / approved / sent / paid / overdue, billing periods open and close on a calendar.
We needed orchestration. The question was: where does it live?
The default answer in 2025 is Temporal. It's well-engineered, the SDK is mature, the community is large, the documentation is excellent. We seriously evaluated it, and then we didn't pick it. This post is about why.
The keystone decision: entity is state
The single architectural call that changed the rest of the design was this: the domain entity's status field IS the workflow state. No separate "process instance" table. No external state store. No replay history. The engine reads JSON config, looks at invoice.status or user.status, and decides what to do next.
This sounds modest until you list everything it implies:
- No sync problem. There is no second source of truth to keep in agreement with the entity. The entity is the truth.
- Crash recovery is structural. The pod dies mid-step. You restart it. The entity is at whatever status was last committed. Call the endpoint again. There is no "did the checkpoint commit before the crash" question.
- Git is the workflow history. Old config in production → old behavior. New config deployed → new behavior. Entities under the old config keep their semantics; new entities pick up new ones. No version markers, no SDK upgrade gymnastics.
- Schema is the API contract for state. When another module wants to know "is this invoice paid?" it reads `invoice.status`. There is no service to query, no workflow ID to track. The state space is a Postgres enum.
- Extraction is free. If we extract the financial-management module into its own service tomorrow, the invoice status column travels with the invoices. No shared workflow database to split. (We already extracted the PDF rendering service exactly this way. The proof of concept was the proof.)
Temporal makes the opposite choice. Workflow state lives in Temporal's database, with replay history. That's the right call if you have long-running workflows where every action must be deterministic on replay, or if you operate at a scale where centralizing orchestration in a dedicated cluster pays for itself. Neither applies to us.
What entity-is-state actually looks like in production
"Entity is state" sounds like a slogan until you watch it work. A concrete walkthrough makes it less abstract.
Take a user registration workflow with these statuses: pending_email_verification → email_verified → pending_documents → documents_uploaded → pending_admin_approval → active.
In a Temporal-style design, the schema and process look like:
- A `users` table with `id`, `email`, etc. — but no status column. Status lives in Temporal.
- A Temporal server with its own database, holding a "workflow instance" record per user plus an event-sourced history of every action.
- A worker process that consumes events, replays history on restart to reconstruct state, and signals status changes back to your app.
- Your app subscribes to those signals to update read models, send notifications, etc.
In entity-is-state, the same domain looks like this:
- A `users` table with `id`, `email`, and a `status TEXT` column containing one of the six values above.
- No external state store. No event-sourced replay history. No worker dance.
At runtime, when "user uploads documents" arrives:
1. HTTP controller calls `RegistrationWorkflow.executeStep({ user_id, requester_type: 'user', data: { documents } })`
2. Engine reads the current state: `SELECT status FROM users WHERE id = $1` → `'pending_documents'`
3. Engine looks at the workflow config: at status `pending_documents`, the step is "upload documents"
4. Engine runs the validation use case (are the documents well-formed? is the user allowed?)
5. Engine runs the action use case (insert documents, link them to the user)
6. Engine persists the transition: `UPDATE users SET status = 'documents_uploaded' WHERE id = $1 AND version = $2` — optimistic CAS via TypeORM's version column
7. Engine publishes a domain event so observability and downstream subscribers can react asynchronously
Seven steps. One write at the end. No external service involved. If the pod dies between steps 4 and 6, the entity is still at pending_documents. Retry calls into step 1, repeats validation, repeats action (idempotent by rule), commits successfully. The same code path is the retry mechanism.
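To make that concrete, here is a compressed sketch of what the engine's code path could look like. The names and wiring (the `deps` object, `resolveStep`, `runUseCase`) are illustrative rather than the engine's actual API; the ordering mirrors the seven steps above.

```typescript
// Sketch only: names, types, and wiring are hypothetical, not the engine's real API.
type StepConfig = {
  key: string;
  validationUseCase: string;
  actionUseCase: string;
  successfulState: string;
};

type Result<T> = { isFailure: boolean; value?: T; error?: string };

interface EngineDeps {
  findUser(id: string): Promise<{ id: string; status: string; version: number }>;
  saveUser(user: { id: string; status: string; version: number }): Promise<void>; // CAS via version column
  resolveStep(status: string, requesterType: string): StepConfig;
  runUseCase(name: string, ctx: unknown): Promise<Result<unknown>>;
  publish(event: Record<string, unknown>): void;
}

async function executeStep(
  deps: EngineDeps,
  req: { entityId: string; requesterType: 'user' | 'admin' | 'system'; data: unknown },
): Promise<Result<unknown>> {
  // 1–2. Read the entity; its status column IS the workflow state.
  const user = await deps.findUser(req.entityId);

  // 3. Resolve the step from JSON config by current status + requester type.
  const step = deps.resolveStep(user.status, req.requesterType);

  // 4. Validation use case: nothing has been written yet.
  const validation = await deps.runUseCase(step.validationUseCase, { user, data: req.data });
  if (validation.isFailure) return validation;

  // 5. Action use case: must be idempotent, because a retry re-runs it.
  const action = await deps.runUseCase(step.actionUseCase, { user, data: req.data });
  if (action.isFailure) return action;

  // 6. Persist the transition; the version column gives optimistic CAS,
  //    so a concurrent writer fails instead of silently overwriting.
  user.status = step.successfulState;
  await deps.saveUser(user);

  // 7. Fire-and-forget: domain event for observability and downstream subscribers.
  deps.publish({ type: 'workflow.transitioned', entityId: user.id, to: user.status });
  return { isFailure: false, value: user };
}
```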
Compare to a Temporal worker dying mid-step: now the question becomes "did the activity complete? Will Temporal replay it correctly? What if the activity wrote to a non-deterministic side effect mid-replay?" Different mental model entirely. Not wrong — just much heavier for the same problem class.
What recovery looks like
A user gets stuck in pending_documents and contacts support. Three steps to triage:
1. `SELECT * FROM users WHERE id = X` — see exactly where they are. One row, one column. No replay history to interpret.
2. `SELECT * FROM workflow_history WHERE entity_id = X ORDER BY changed_at` — see what already happened along the way.
3. If you need to move them manually: `UPDATE users SET status = '<the right state>' WHERE id = X`. The next time they hit the endpoint, the engine picks up from the new state.
In Temporal: query the Temporal CLI or UI to see workflow state, send a signal to advance the workflow, hope the replay semantics work correctly when the worker re-processes. Solvable, but a fundamentally different kind of solve — and one that requires your support engineer to know how Temporal works.
The entity-is-state recovery is the kind of operation any backend engineer can perform with a SQL client. The Temporal recovery requires Temporal-specific knowledge. For a small team, that simplicity compounds.
Git history as workflow versioning
One implication that took me a while to fully internalize: old entities at old statuses keep their old behavior automatically.
Suppose I ship a new step today that changes how documents are validated. The change is a config diff plus a new validation use case. Existing entities at pending_documents — say, 50 users mid-flight — will be picked up by the new step on their next request. But existing entities at later statuses (already past documents) are unaffected by the change, because the new step's status doesn't apply to them.
This means git log application-configuration/workflows/ IS the workflow version history. There's no separate version-marker concept, no "this entity is on workflow version 3" tracking. The config at main applies to whoever hits the engine right now; entities past affected states aren't reprocessed because the changed step doesn't match their current status.
In Temporal, you have explicit version markers in workflow code (workflow.GetVersion() or similar) so that long-running workflows survive code deploys without breaking replay determinism. It's a real solution to a real problem — but the problem only exists because Temporal replays history. In entity-is-state, the problem doesn't exist; the entity's current status is the source of truth, not a reconstructed-from-history view.
Why this is the right shape for our problem
We run regulated workflows: energy community registration, sharing-group setup with the national data hub, invoice lifecycle, billing cycles. None of these have:
- Multi-hour active CPU work that needs replay determinism
- Cross-language coordination (we're all TypeScript)
- Sub-second timer scheduling at high volume
- Strict event-sourcing audit requirements (a structured audit log is sufficient for our regulatory context)
What they do have:
- Long human-driven or system-driven waits (admin approval, data hub response) — handled by BullMQ delayed jobs plus an out-of-band state-change entry point
- Audit requirements (who changed what when) — handled by the `workflow_history` table plus structured logs
- Multi-tenancy — handled by RLS bound through AsyncLocalStorage
- Cross-module atomic writes (e.g. setup community) — handled by `forwardRef` plus the sentinel pattern (the design that makes `Result.failure` trigger a TypeORM rollback without throwing through the use-case layer)
Entity-is-state is the right shape for this problem class. For a system that runs long-running quant trades, cross-language ML pipelines, or strict event-sourced financial ledgers, the answer might be Temporal. Our system is none of those; we picked accordingly. The decision is bounded by the problem, not by ideology about workflow engines.
Validate-first, fire-second
The second decision: every workflow step is atomic in a specific way. The engine enforces this order, every time, no exceptions:
1. Step selection — figure out which step applies given current status and request context
2. Input validation — the validation use case checks the data
3. Action execution — the action use case performs business logic
4. State transition — entity status updates only after everything succeeds
5. Best-effort side effects — Redis flags, etc., never fail the step
If anything fails at steps 1–3, the entity hasn't changed. If step 4 fails, we accept that step 3 was either side-effect-free or idempotent (a rule we enforce). Step 5 is fire-and-forget.
This makes retry trivial. The user's browser shows an error. They click again. The same atomic sequence runs against the same entity at the same status. If it succeeds this time, great. If it fails the same way, they see the same error. There is no half-committed state to compensate for, no saga to coordinate. The engine doesn't need timer activities or signal/query semantics because the failure model is just "call again."
Config-driven, not code-driven
Workflow behavior lives in JSON. Each step declares:
- Which status it applies to
- Which validation use case to run
- Which action use case to run
- What status to transition to on success
- What to do on rejection (different state, different action)
- Whether to auto-advance to the next step
Adding a new step on an existing status is zero code if the use cases already exist. Every field supports per-requester-type granularity — a step can run different validation for users vs. admins vs. system cron, transition to different states, perform different actions. That's expressed in JSON, not in if branches scattered across handler files.
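For illustration, a single step might look roughly like this, written as a TypeScript object rather than raw JSON for readability. The field names are guesses at the shape described above, not the engine's real schema:

```typescript
// Hypothetical step definition. Field names are illustrative, not the engine's schema.
const uploadDocumentsStep = {
  key: 'upload_documents',
  status: 'pending_documents',                 // which status this step applies to
  validation_use_case: 'ValidateUploadedDocuments',
  action_use_case: 'AttachDocumentsToUser',
  successful_state: 'documents_uploaded',      // transition target on success
  rejection: {
    state: 'pending_documents',                // where to go on rejection
    action_use_case: 'NotifyUserOfRejection',
  },
  auto_advance: false,                         // don't chain into the next step automatically
  // Per-requester granularity: admins run a different validation and land elsewhere.
  per_requester: {
    admin: {
      validation_use_case: 'ValidateAdminDocumentOverride',
      successful_state: 'pending_admin_approval',
    },
  },
};
```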
The trade-off: configs grow. Our largest workflow has ~25 steps and ~600 lines of JSON. The compensating discipline is that the configs are readable. A new contributor can read the registration workflow JSON and trace the entire user lifecycle in 10 minutes, including admin overrides and rejection paths. They cannot do this from code distributed across handler classes.
Startup validation: fail loud
The engine refuses to start the application if any workflow config is invalid. Not at deploy time. At boot time. A startup service walks every workflow config and checks:
- Required fields are present
- All status references are in the declared status_types array
- Every referenced use case file exists on disk
- No orphaned statuses (every status has a step or is reachable)
- Auto-advance chains terminate
- Per-requester values are well-formed
If any check fails, the process exits with a specific error. This converts an entire class of "discovered in production when a user clicked a button" bugs into "discovered before the load balancer ever sent traffic." It's a small amount of code for a large reduction in incident surface.
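A minimal sketch of that boot-time check, assuming a config shape like the illustrative one above (the real validator also covers auto-advance chains and per-requester values):

```typescript
// Sketch of a fail-loud boot check. Config shape and checks are simplified.
import { existsSync } from 'node:fs';

type WorkflowConfig = {
  status_types: string[];
  steps: { key: string; status: string; successful_state: string; action_use_case: string }[];
};

function validateWorkflowConfig(
  name: string,
  cfg: WorkflowConfig,
  useCasePath: (useCase: string) => string,
): void {
  const declared = new Set(cfg.status_types);

  for (const step of cfg.steps) {
    // Every status a step references must be in the declared status_types array.
    for (const status of [step.status, step.successful_state]) {
      if (!declared.has(status)) {
        throw new Error(`[${name}] step "${step.key}" references undeclared status "${status}"`);
      }
    }
    // Every referenced use case must exist on disk.
    if (!existsSync(useCasePath(step.action_use_case))) {
      throw new Error(`[${name}] step "${step.key}" points at a missing use case "${step.action_use_case}"`);
    }
  }

  // Orphan check: every status either has a step or is reachable as a transition target.
  const hasStep = new Set(cfg.steps.map((s) => s.status));
  const reachable = new Set(cfg.steps.map((s) => s.successful_state));
  for (const status of declared) {
    if (!hasStep.has(status) && !reachable.has(status)) {
      throw new Error(`[${name}] status "${status}" is orphaned`);
    }
  }
}
// Called from onApplicationBootstrap; any throw aborts startup before traffic arrives.
```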
How long-running workflows actually work
"No timer activities inside the engine" sometimes gets read as "the engine can't do long-running flows." It can. Long-running flows are first-class — they just don't live inside the step pipeline itself. Two mechanisms cover the cases:
BullMQ delayed jobs handle "do something in N days." When a step needs to schedule a payment reminder for 7 days from now, or a regulatory recheck at the end of the quarter, it enqueues a BullMQ job with a delay parameter. BullMQ persists the schedule in Redis and reliably delivers the job when the time comes. The job, when consumed, calls the next workflow step. No engine-internal sleep, no replay semantics — a durable timer that lives outside the request lifecycle and is visible in operational tooling.
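For example, scheduling a reminder seven days out is a single `queue.add` call with a `delay` option. The queue name, job payload, and the `executeStep` call-back are illustrative; only the BullMQ API itself is real:

```typescript
import { Queue, Worker } from 'bullmq';

// Stand-in for the engine's entry point; not the real API.
declare const invoiceWorkflow: {
  executeStep(req: { entityId: string; requesterType: string; data: unknown }): Promise<unknown>;
};

const connection = { host: 'localhost', port: 6379 };
const workflowTimers = new Queue('workflow-timers', { connection });

// Inside a step's action use case: schedule the follow-up instead of sleeping.
async function schedulePaymentReminder(invoiceId: string): Promise<void> {
  await workflowTimers.add(
    'payment-reminder',
    { invoiceId },
    { delay: 7 * 24 * 60 * 60 * 1000 }, // fires in 7 days; BullMQ persists the schedule in Redis
  );
}

// The worker consumes the job when it comes due and calls the next workflow step.
new Worker(
  'workflow-timers',
  async (job) => {
    await invoiceWorkflow.executeStep({
      entityId: job.data.invoiceId,
      requesterType: 'system',
      data: {},
    });
  },
  { connection },
);
```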
Out-of-band state-change entry points handle "wait until something external happens." A polling cron or webhook handler detects that an external system (the national data hub, a bank, an admin via a UI elsewhere) changed state, and notifies the engine that the transition has already occurred. The engine treats observed transitions differently from initiated ones: validation and action use cases don't fire (we didn't make this happen, we noticed it happened), but the state transition is recorded, after-effects fire, and the workflow proceeds.
Together: a workflow that "register the community with the data hub, wait up to 30 days for approval, then proceed" is a step whose action enqueues the registration request, plus a polling cron that notifies the engine when the hub's status flips. The 30-day window is BullMQ's responsibility (if there's a fallback timer) or the external system's (if approval is open-ended). The transition itself is the engine's. Neither requires a workflow-internal timer.
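A sketch of the polling side, with invented names (`dataHubClient`, `sharingGroupWorkflow`, `notifyObservedTransition`) standing in for whatever the real integration looks like:

```typescript
// Hypothetical polling cron. All names below are stand-ins, not actual APIs from the engine.
declare const sharingGroups: {
  findBy(where: { status: string }): Promise<{ id: string; externalId: string }[]>;
};
declare const dataHubClient: {
  getRegistrationStatus(externalId: string): Promise<'PENDING' | 'APPROVED' | 'REJECTED'>;
};
declare const sharingGroupWorkflow: {
  notifyObservedTransition(req: { entityId: string; observedState: string; requesterType: string }): Promise<void>;
};

async function pollDataHubApprovals(): Promise<void> {
  const pending = await sharingGroups.findBy({ status: 'pending_data_hub_approval' });

  for (const group of pending) {
    const remote = await dataHubClient.getRegistrationStatus(group.externalId);
    if (remote !== 'APPROVED') continue;

    // Observed transition: the external system already changed state, so validation
    // and action use cases don't fire; the engine records the transition, runs
    // after-effects, and the workflow proceeds from the new status.
    await sharingGroupWorkflow.notifyObservedTransition({
      entityId: group.id,
      observedState: 'data_hub_approved',
      requesterType: 'system',
    });
  }
}
```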
The trade against Temporal: you don't get the elegant Workflow.sleep() abstraction inside workflow code. The "wait 7 days" lives in BullMQ configuration. For us this is a feature — the wait is visible in BullMQ's operational dashboard rather than buried inside a running workflow instance. For some teams it would be a step backwards. Pick based on which mental model your team prefers.
What history actually looks like
One concern that comes up about entity-is-state engines: "if state lives in the entity column, isn't history lost?" The status column shows where the entity is now, not where it's been.
The answer is that history isn't free, but it isn't absent either. The engine writes to a dedicated workflow_history table (per entity, per user, with rich JSONB metadata, step_key, execution time, workflow type, GIN-indexed for metadata queries). A separate engine-owned workflow_permanent_failure_log table captures every entity that hit a terminal failure state — used for admin "show me failed entities and their latest error" lookups, cross-workflow metrics ("how many credential failures this week"), and compliance/retention. Both are queryable directly via SQL, not just searchable in log streams.
On top of that, the structured workflow logger writes operational logs to stdout (Loki, CloudWatch — your choice), and domain events fan out through the bus so external subscribers (analytics, audit, anything) can react to transitions. Together: four layers, each fit for a different purpose. The entity column for "where is this now"; workflow_history for "what's the per-entity timeline"; workflow_permanent_failure_log for "what failed across the system"; structured logs for "what happened operationally during this request."
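For example, the per-entity timeline and a cross-workflow failure metric are both one query away (column names follow the description above and may not match the real schema exactly):

```typescript
import { DataSource } from 'typeorm';

// Column names are illustrative: step_key/changed_at come from the description above,
// created_at on the failure log is a guess.
async function supportQueries(dataSource: DataSource, userId: string): Promise<void> {
  // Per-entity timeline: what already happened along the way.
  const timeline = await dataSource.query(
    `SELECT step_key, metadata, changed_at
       FROM workflow_history
      WHERE entity_id = $1
      ORDER BY changed_at`,
    [userId],
  );

  // Cross-workflow metric off the failure log, e.g. failures in the last week.
  const failures = await dataSource.query(
    `SELECT workflow_type, count(*) AS failures
       FROM workflow_permanent_failure_log
      WHERE created_at > now() - interval '7 days'
      GROUP BY workflow_type`,
  );

  console.table(timeline);
  console.table(failures);
}
```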
What this trades against Temporal: Temporal's history UI is one place that answers most history questions visually. Ours is several DB tables and a stdout stream that you query or aggregate yourself. The data is there — there's just no single UI rendering it. For us this has been fine; the admin queries we wanted to run all map cleanly to SQL against workflow_history, and the operational ones go to Loki. For a team that wants out-of-the-box visual workflow history, Temporal's UI is genuinely better.
Implicit context inside, explicit handoff at the boundary
One trade-off that didn't make the front of the design but earns its keep every week of production: AsyncLocalStorage inside the request, types at the boundary.
The engine itself doesn't read ALS directly. It can't — the request context (correlation_id, tenant_id, user_id, requester_type) only matters at cross-cutting infrastructure: the logger, the global exception filter, the outbound HTTP client base, the queue and event-bus publishers. The engine consumes those services and they consume ALS. Code-side discipline says: "use cases, repositories, domain code — never read ALS directly. If you need correlation, you have an observability concern, and it belongs in one of the infrastructure layers that's allowed to read."
That works as long as you stay inside the request. The moment work crosses a boundary the ALS scope doesn't follow — a BullMQ job enqueued for 7 days from now, a domain event published to Redis Streams, a cron tick on a new pod — the ALS store is gone. Code that ran fine inside the request silently runs without context on the other side. The logger emits "no-corr". Tenant binding is lost. Audit rows show null user_id.
The pattern we settled on is: inside the request, ALS is implicit and cheap; at every boundary, context is explicit and type-required.
- BullMQ jobs carry a `__ctx` envelope on `job.data`. `QueueService.add()` reads the current ALS context, stamps it onto the payload, and the worker entry point reads it back and opens a fresh ALS scope via `CorrelationContextService.run(...)` before invoking the handler. The job-data type makes `__ctx` required; if you forget, TypeScript refuses to compile.
- Redis Streams events ship the context on `event.metadata`. The publisher stamps; the consumer re-establishes. The event interface enforces it at the type level.
- Cron / system ticks open their own ALS scope via `CrossTenantCronExecutor.run(fn)`, which establishes a fresh correlation_id and a "system" requester_type before invoking `fn`. You cannot call into engine code from a cron without going through this executor — the workflow-engine modules don't accept cron callers any other way.
- Outbound HTTP reads ALS at request time and stamps `X-Correlation-Id` so the trace survives the network hop. The remote side echoes it back; if their logs end up in the same aggregator, you have one correlation_id spanning two services.
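A compressed sketch of the BullMQ side of that handoff. Only the `__ctx` field name comes from the design above; the helper names and types are illustrative:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Illustrative types and helpers; only __ctx as a required envelope field is from the design.
type RequestContext = {
  correlation_id: string;
  tenant_id: string;
  user_id?: string;
  requester_type: 'user' | 'admin' | 'system';
};

const als = new AsyncLocalStorage<RequestContext>();

// The job-data type makes __ctx required: forgetting it is a compile error, not a log gap.
type JobData<T> = T & { __ctx: RequestContext };

// Producer side: stamp the current ALS context onto the payload at enqueue time.
function withContext<T extends object>(data: T): JobData<T> {
  const ctx = als.getStore();
  if (!ctx) throw new Error('enqueueing outside of a request context');
  return { ...data, __ctx: ctx };
}

// Consumer side: the worker entry point re-opens a fresh ALS scope before the handler runs.
async function runWithJobContext<T extends { __ctx: RequestContext }>(
  data: T,
  handler: (payload: Omit<T, '__ctx'>) => Promise<void>,
): Promise<void> {
  const { __ctx, ...payload } = data;
  await als.run(__ctx, () => handler(payload));
}
```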
What we traded: the appeal of "context is always there, never think about it" for "context is explicit when it crosses a boundary, type-checked, impossible to forget." Inside the boundary, ALS keeps it implicit and free. At the boundary, the types keep it explicit and enforced. The visibility cost is real — a new contributor has to learn "where are the boundaries" — but it's a learnable, finite cost. The alternative ("context will be there, I think") is unbounded: a class of "missing correlation_id in production logs, can't trace the incident" bugs that grow with each new async-handoff site you add.
The interesting structural property: this composes with everything else. RLS reads app.tenant_id from the ALS-bound transaction session. The proposed status-change-audit trigger (ADR-041, deferred) reads app.user_id / app.correlation_id from the same session. None of these need new plumbing — they consume the spine that already exists. The decision to make the boundary explicit isn't workflow-engine-specific. It's the design rule that lets every cross-cutting concern in the codebase trust that context is present, without requiring every business-logic function to thread it through their signatures.
How config changes are handled
A reasonable question with any JSON-driven workflow engine: what happens when you change the config in production? The key insight first, because it makes everything else trivial: entities are at statuses. Steps are stateless config that describes how to transition between statuses. An entity's row carries its current status, never a step reference. So config changes only require entity migrations when the set of valid statuses an entity can be at changes — specifically, when a status is removed or renamed.
| Config change | Migration needed? | Why |
|---|---|---|
| Add a new status | No | New status_types entry. Existing entities don't know it exists. |
| Add a new step | No | Step applies at the next request against the matching status. Entity unchanged. |
| Remove or rename a step | No | Entities reference statuses, not steps. Removing a step changes what fires at a given status; it doesn't change where entities are. If you removed the only step at a non-terminal status and want entities to keep moving, you add a new step — that's a config follow-up, not an entity migration. |
| Change action / validation use case for a step | No | Use case names are strings. Next call resolves the new use case. Context shape must stay compatible — that's an interface concern, not a state concern. |
| Change a step's successful_state / rejection_state | No | Existing entities are at their current status; they aren't affected. The next transition from this step picks up the new target. |
| Remove or rename a status | Yes — degrade entities | Entities currently at that status need to move to a status the new config supports. One UPDATE. The startup validator rejects configs with orphaned statuses, so a forgotten migration is caught at boot. |
The unobvious benefit: workflow versioning is almost a non-event. The only config change that requires an entity migration is one that shrinks the set of valid statuses. Everything else — new statuses, new steps, new actions, new transitions — is forward-compatible with whatever entities are already in the database. Old entities at unchanged statuses keep their semantics. New entities pick up the new config. There's no replay, no version markers in code, no deterministic-execution gymnastics. Just one rule: if you shrink the status set, write the one-line UPDATE.
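Sketched as a TypeORM migration, with the illustrative status names from the registration example (assume `documents_uploaded` was removed from the config):

```typescript
// Sketch of the one case that needs a migration: a status was removed from the config,
// so entities parked at it are moved to a status the new config still declares.
import { MigrationInterface, QueryRunner } from 'typeorm';

export class DegradeRemovedStatus1700000000000 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(
      `UPDATE users SET status = 'pending_documents' WHERE status = 'documents_uploaded'`,
    );
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    // Not generally reversible: the removed status no longer exists in the config.
  }
}
```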
What we gave up
I owe an honest list. Choosing this engine cost us a few things Temporal would have given us for free:
- No multi-language workers. We're a TypeScript shop. If we ever needed a Go worker to participate in the workflow, we'd be writing the integration. Temporal would give us that for free.
- No external community / ecosystem. Temporal has thousands of users, established patterns, a vendor with support contracts and managed hosting. Our engine has one team. Bugs, edge cases, and "how do I solve X" questions land on us. The flip side is that the engine fits us exactly; the cost is that nobody else has worn the path before.
- Status type proliferation. Non-idempotent steps require a transient "claiming" status to prevent double-action. Each claim site adds 1–2 statuses to the enum. Configs get larger over time.
- No GUI. Temporal's UI is genuinely useful. We have a debug endpoint that shows step resolution and a path-analyzer that renders workflow graphs as HTML, but that's homemade.
If you need deterministic replay across workflow execution (the workflow code itself must produce identical observable behavior across restarts of the worker), cross-language worker coordination, sub-second high-volume timer scheduling, or your org already operates Temporal and has the community/vendor support around it — choose Temporal. The case for our engine is conditional, not universal. It's the right shape for entity lifecycle orchestration in a TypeScript modular monolith. Different problem class, different answer.
The unobvious win: extraction-readiness
The thing I undersold to myself when designing this was extraction. Modular monolith is a stopover; the real architecture is extraction-ready modular monolith. The engine respects this in two ways.
First, the common/ infrastructure that holds the engine itself imports nothing from any module. It deals in strings: claim states, owning module names, entity table names. The recovery contract is a registry pattern — each module registers its janitor and the claim states it owns. The engine doesn't know there's a billing module or an invoice module. It knows there are workflows whose configs declare an owning module.
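A sketch of what that registry contract might look like. The interface and class names are invented for illustration; the point is that `common/` sees only strings and callbacks, never module imports:

```typescript
// Invented names for illustration. common/ deals in strings and callbacks,
// never in imports from billing, invoicing, or any other module.
interface ClaimJanitor {
  owningModule: string;                              // e.g. 'financial-management'
  entityTable: string;                               // e.g. 'invoices'
  claimStates: string[];                             // transient claiming statuses this module owns
  recoverStale(olderThan: Date): Promise<number>;    // releases stuck claims, returns how many
}

class JanitorRegistry {
  private readonly janitors: ClaimJanitor[] = [];

  register(janitor: ClaimJanitor): void {
    this.janitors.push(janitor);
  }

  // Run on a schedule: each module cleans up its own claims; the engine only iterates.
  async sweep(olderThan: Date): Promise<void> {
    for (const janitor of this.janitors) {
      const released = await janitor.recoverStale(olderThan);
      if (released > 0) {
        console.log(`[${janitor.owningModule}] released ${released} stale claims`);
      }
    }
  }
}
```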
Second, because entity-is-state means workflow state lives in the entity's table, and the entity's table travels with its module, extracting a module is a normal database extraction. You're not also extracting workflow history from a centralized Temporal cluster. You take the table, you take the module, you take the janitor, you keep moving.
When we extracted the PDF rendering service from the monolith last quarter — replacing Puppeteer with Playwright at the same time — there was no workflow refactor. The thing that triggers PDF generation is a BullMQ job; the PDF service consumes it; the workflow on the main side simply waits for an event to come back. The engine couldn't tell the difference.
The hard part wasn't the code
If you read the engine's source today, you might assume it was designed coherently from the start. The shape — entity-is-state, config-driven, validate-first / fire-second, claim-state CAS, per-module janitor recovery, multi-tenancy via config cascade — looks integrated enough that the obvious read is "someone planned this from day one."
That isn't what happened. The engine has the shape it does because it evolved into it. The ADRs record the evolution:
- ADR-015 (workflow-driven document processing) set the initial pattern — JSON config, status-based step resolution, the validation-then-action pipeline
- ADR-026 (workflow configuration validation) came after configs started getting complex enough that runtime-discovered errors became a real operational cost
- ADR-030 (batch-aware use cases) came after I realized 100 invoices going through the same step meant 100 separate use case invocations, each running its own queries — N+1 at the engine level, not just at the ORM level
- ADR-031 (claim-state primitive) came after observing duplicate emails on concurrent billing-close clicks in production. Optimistic locking caught the second write; it didn't catch the second action
- ADR-032 (claim recovery contract) came as the engine-side complement to ADR-031, after thinking carefully about what happens when a process crashes between claim and release — and explicitly choosing per-module janitor over engine-owned janitor to preserve MS-readiness
- ADR-033 (cross-module atomic writes) came when one use case needed to write across two modules atomically and the events-only rule had to acquire a documented carve-out — the sentinel pattern + forwardRef wiring as the bounded escape valve
- ADR-035 (tenancy isolation and RLS) came when the second customer profile entered the design — multi-tenancy stopped being hypothetical
- ADR-036 (ALS request context spine) was the operational answer to "how do we propagate tenant + correlation + user context through every async boundary cleanly"
None of these were designed up front. Each was a response to a specific limit the previous shape couldn't handle well. The pattern was: hit the limit, sit with it for a few days, write the ADR's alternatives section, pick the option, ship it, watch how it behaves. Then either keep going or revise. Several ADRs supersede earlier sketches that didn't ship.
The code itself — around 13,000 lines of TypeScript for the engine, plus roughly another 7,000 across tests, the simulation harness, the path analyzer, and the HTML report — is finite and tractable. Anyone with a year of NestJS experience could write the current engine if they had the design in hand. The hard part is not the code. The hard part is the iteration that produces the design, and that part is recorded in the ADRs, not in the source files. A reader looking at the engine and asking "could I build this?" should be reading the ADR series first, the source second. The source is the artifact. The ADRs are the path.
If you take one thing from this post: when you build something like this, write the ADRs as you go. Not in batches afterwards. The "alternatives considered" sections in particular are how you preserve why the engine isn't shaped like the obvious thing — and "the obvious thing" is exactly what a future reader, including future-you, will be tempted to revise back to if they don't have the rejection reason in front of them.
What I'd tell someone facing the same choice
Three questions to ask, in order:
1. Do your workflows need deterministic replay? If the workflow code itself must replay across worker restarts and produce identical observable behavior — yes, choose Temporal. If your state lives in entity columns and "restart and continue from the committed status" is fine, you don't need replay semantics.
2. Will you ever extract modules into separate services? If yes, entity-is-state is gold. Temporal becomes a coupling point you'll have to disentangle later, because workflow history lives in Temporal's database and travels separately from your entities.
3. Do you have a platform team? If no, Temporal's ops surface is a real tax. If yes, it's free.
Most teams I've worked with answer "no, yes, no." Most teams I've worked with would benefit from considering a small engine. Most teams I've worked with don't, because the wisdom of the industry is to reach for the framework. Sometimes the default isn't the right fit — and the only way to know is to actually consider the problem in front of you rather than the problem the framework solves.