When forwardRef becomes a saga: planning module extraction in a modular monolith
Cross-module atomic writes via NestJS forwardRef work because the modules share a process, a connection pool, and a transaction. The moment you extract one module into its own service, the shared transaction can't span the boundary — DB transactions live on connections, connections live in processes. The replacement patterns are saga (with compensation), transactional outbox + inbox (for at-least-once eventual consistency), and 2PC (which you almost never want). This post walks through each, the cost ledger, and the migration path. Plan for the extraction question from day one, even if the answer is "not yet."
Setting the stage: the patterns named here
Before the migration logic, brief grounding on the distributed-system patterns this post moves between. Skip if you've shipped these patterns before. Stay if your team is still treating "we'll figure it out at extraction time" as a strategy — because the figuring-out cost is much higher mid-extraction than upfront.
The core problem in one paragraph
Two services own different tables. A single user action needs to write to both atomically — community-management creates the Community row; user-management creates the BankAccount row; both must commit or both must roll back. In one process this is one DB transaction. Across two processes there is no shared transaction context, and the question becomes: which pattern preserves the consistency property?
Saga
A long-lived transaction decomposed into local transactions, each with a compensating action. If step 4 fails, fire compensations 3-2-1 in reverse. The compensations make the system eventually consistent — there's a window where you've done part-of-the-thing and not the rest, but the saga is committed to either finishing or undoing.
Two flavors: orchestrated (one coordinator drives the sequence; clear control flow but the coordinator becomes a coupling point) and choreographed (services react to each other's events; looser coupling, harder to reason about past three or four steps).
Failure mode without it: "user clicks setup community" partially succeeds — Community created, BankAccount creation failed — the system is left in a half-state forever. Either you manually clean it up via support, or the user sees inconsistent data and contacts you anyway. Either way, you've discovered the lack of saga in production.
Transactional Outbox
The local write and the "we promise to publish this event" are made atomic by putting them in the same DB transaction. The transaction inserts the data row and a row in an outbox table. After commit, a relay process polls the outbox and publishes to the bus, then marks the row sent.
Why it works: if the COMMIT succeeds, the outbox row is durable. The event will be published, even if the relay crashes and restarts. If the COMMIT fails, neither the data nor the event exist. The atomicity is one-service-local; the cross-service consistency is eventually-via-events.
Failure mode without it: you write the row, then try to publish the event, and the publish fails — now you have data without a corresponding event downstream. Or you publish first and then the write fails — now you have an event for data that doesn't exist. Both lead to drift; both are painful to debug because the symptoms appear days later in a different module.
Inbox
The mirror image. The receiving service stores incoming event IDs in an inbox table, in the same transaction that processes the event. Re-delivery of the same event sees the ID already present and skips.
Why it works: message buses provide at-least-once delivery (you might get the event twice if the consumer crashed mid-processing). Without inbox, that means double-processing. With inbox, the second processing is a no-op.
Failure mode without it: the consumer crashes after processing but before acknowledging the message. Bus redelivers; consumer processes again; you have double payments, double user creations, double whatever-the-event-meant. The naive fix ("make every operation idempotent") works until it doesn't — some operations are naturally side-effecting in ways that aren't trivially idempotent.
Two-Phase Commit (2PC / XA)
The synchronous all-or-nothing answer. Each service prepares its transaction; a coordinator collects all the prepares; commits all if all succeeded, rolls back all otherwise.
Postgres supports it (PREPARE TRANSACTION). It's also the answer most modern systems reject. The reasons:
- Blocking semantics on coordinator failure. Coordinator crashes between prepare and commit → participants hold locks indefinitely until someone resolves them manually. Real ops nightmare.
- Performance. Multiple round trips per operation; pinned connections during the prepare window.
- Ecosystem mismatch. Most ORMs don't optimize for it. Most managed databases discourage it. Most message brokers don't participate.
- The deeper smell. If you're reaching for 2PC, your service decomposition probably has the boundary in the wrong place. The two services trying to commit-or-rollback together are usually actually one service wearing a process-boundary disguise.
Why it's worth knowing about anyway: someone on your team will eventually suggest it. Having the rejection arguments ready saves an hour of meeting.
When you'd combine them
For most real cross-service writes, the answer is outbox + inbox as the delivery substrate plus saga as the coordination shape. The outbox guarantees the event gets published. The inbox guarantees the event gets processed exactly once. The saga decomposes the multi-step operation and provides the compensation story. 2PC stays in your back pocket for the rare case where the rest genuinely doesn't fit (and that case is much rarer than people think).
If you nodded along to all of that — skip to "The pattern that doesn't survive process boundaries" below. The rest of this post is about which of these patterns you reach for to migrate a specific in-process forwardRef site, and the honest cost of each.
The pattern that doesn't survive process boundaries
In a modular monolith, the cleanest way to handle a cross-module atomic write is something like this: the use case lives in one module, depends on repository interfaces rather than concrete classes, and the module wiring uses forwardRef(() => OtherModule) so the interfaces resolve against the data-owning module's implementations. Everything happens inside one runInTransaction call. If any step returns failure, the sentinel pattern throws inside the transaction and TypeORM rolls everything back.
This works beautifully. We use it in one place in our codebase: SetupCommunityUseCase atomically creates a Community (community-management), a Company, an Address, and a BankAccount (user-management). If the bank account insert fails, the community, company, and address all roll back. No saga, no compensation, no eventual consistency. One use case, one transaction, all-or-nothing.
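Concretely, the shape is something like the sketch below. Interface names, injection tokens, and module names are illustrative rather than our exact code; the point is the single runInTransaction call spanning both modules' repositories.

```typescript
import { Inject, Injectable, Module, forwardRef } from '@nestjs/common';

// Stand-ins for the real repository and transaction contracts (illustrative).
interface CommunityRepository { insert(data: object): Promise<void>; }
interface BankAccountRepository { insert(data: object): Promise<void>; }
interface TransactionRunner { runInTransaction<T>(work: () => Promise<T>): Promise<T>; }

@Injectable()
class SetupCommunityUseCase {
  constructor(
    @Inject('TransactionRunner') private readonly tx: TransactionRunner,
    @Inject('CommunityRepository') private readonly communities: CommunityRepository,
    @Inject('BankAccountRepository') private readonly bankAccounts: BankAccountRepository,
  ) {}

  async execute(community: object, bankAccount: object): Promise<void> {
    // One transaction spans both modules' tables: if the bank account
    // insert throws, the community insert rolls back with it.
    await this.tx.runInTransaction(async () => {
      await this.communities.insert(community);
      await this.bankAccounts.insert(bankAccount);
    });
  }
}

declare class UserManagementModule {} // defined in the data-owning module

// The wiring that pins the modules together: forwardRef resolves the
// circular module reference so the repository tokens bind across modules.
@Module({
  imports: [forwardRef(() => UserManagementModule)],
  providers: [SetupCommunityUseCase], // repository providers live in their own modules
})
class CommunityManagementModule {}
```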
It also pins those two modules to the same process, forever. Until you redesign.
Why shared transactions can't cross process boundaries
The mechanics are blunt:
- A database transaction lives on a single connection
- A connection lives in a connection pool
- A connection pool lives in a single process
- Two services in two processes have two separate connection pools, two separate connections, two separate transactions
When the community-management service inserts a Community row, it does so on its connection, in its transaction. When the user-management service — now extracted — needs to insert the BankAccount row, it does so on its own connection, in its own transaction. There is no shared transaction. There is no "all-or-nothing" guarantee at the wire level.
Several intuitions that don't help:
- "But they can connect to the same database." They can. The transactions are still separate. Postgres doesn't know that "transaction 17 on connection A" is logically related to "transaction 42 on connection B" unless you explicitly tell it via two-phase commit (which we'll come back to).
- "Redis can coordinate them." Redis can carry messages between services, but it has no concept of atomicity across two database commits. It's a transport, not a transaction manager.
- "We can manually roll back if the other side fails." Yes, you can — that's the saga pattern, and it's not free. We'll get to it.
So the question is: what gives you cross-service consistency, and what does each option cost?
Option 1: Saga with compensation
Decompose the atomic operation into a sequence of local transactions, each in its own service. Each step has a compensating action that undoes its effect. If step 4 fails, you fire compensations 3-2-1 in reverse.
For our SetupCommunityUseCase if user-management were extracted:
- user-management service: create Company → on later failure, compensate by deleting it
- user-management service: create Address → on later failure, compensate by deleting it
- user-management service: create BankAccount → on later failure, compensate by deleting it
- community-management service: create Community → if successful, the saga commits; if it fails, fire compensations 3-2-1
Two implementation styles, both real:
Orchestrated saga: one service (or a dedicated coordinator service) drives the sequence and tracks state. Easier to reason about, but the coordinator becomes a coupling point.
Choreographed saga: services react to each other's events, each one knowing what to do next. Less coupling, but the event topology becomes hard to keep in your head after about five steps.
The cost ledger:
- Every action needs a compensation. Some compensations are trivial (delete a row). Some are hard (un-send an email — you can't, so design the workflow to send the email only after the saga succeeds, not during it).
- Intermediate states are observable. A user who clicks "create my community" might see, for ~200ms, a state where their bank account exists but no community is attached to it. Compensation eventually cleans it up, but the partial state was real during the window.
- Saga coordinators are non-trivial code. State machines, retry policies, ordered compensation, idempotency of compensations under retry. You will write this. It is the cost of extraction.
- Compensations must be idempotent. If a compensation runs twice (because the coordinator retried), the second run must be a no-op.
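To make the ledger concrete, here is a minimal orchestrated-saga skeleton: run the steps in order, and on failure run the completed steps' compensations in reverse. This is a sketch, not a production coordinator; a real one persists saga state, retries with backoff, and dedupes compensation runs. All function names are hypothetical stand-ins for calls into the owning service.

```typescript
// Hypothetical step functions: in the extracted world these are commands
// sent to user-management / community-management, not local calls.
declare function createCompany(): Promise<void>;
declare function deleteCompany(): Promise<void>;
declare function createAddress(): Promise<void>;
declare function deleteAddress(): Promise<void>;
declare function createBankAccount(): Promise<void>;
declare function deleteBankAccount(): Promise<void>;
declare function createCommunity(): Promise<void>;

interface SagaStep {
  name: string;
  action: () => Promise<void>;
  // Must be idempotent: a retried compensation's second run is a no-op.
  compensate: () => Promise<void>;
}

async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch (err) {
      // Fire compensations 3-2-1: only steps that finished get undone.
      for (const done of completed.reverse()) {
        await done.compensate(); // a real coordinator retries and persists state here
      }
      throw err; // surface the failure after the rollback story has run
    }
  }
}

async function setupCommunitySaga(): Promise<void> {
  await runSaga([
    { name: 'CreateCompany',     action: createCompany,     compensate: deleteCompany },
    { name: 'CreateAddress',     action: createAddress,     compensate: deleteAddress },
    { name: 'CreateBankAccount', action: createBankAccount, compensate: deleteBankAccount },
    // Last step: if it fails it never completed, so it is never compensated.
    { name: 'CreateCommunity',   action: createCommunity,   compensate: async () => {} },
  ]);
}
```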
Option 2: Transactional Outbox
The atomic write becomes local to one service. In the same transaction that writes your data, you also INSERT INTO an outbox table the event(s) that need to fan out to other services.
```sql
BEGIN;
INSERT INTO communities (id, ...) VALUES (...);
INSERT INTO outbox (event_type, payload, target)
VALUES ('CommunityCreated', '{...}', 'user-management');
COMMIT;
```
A separate relay process polls the outbox and publishes events to the bus (Redis Streams, Kafka, whatever your transport is). Once published, the row is marked as sent (or moved to an archive table). Receiving services consume the events asynchronously.
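A minimal relay loop, as a sketch: it assumes node-postgres, a hypothetical publish function for your transport, and an outbox with id and sent_at columns alongside the ones shown above.

```typescript
import { Pool } from 'pg';

// Hypothetical publisher for your bus (Redis Streams, Kafka, ...).
declare function publish(target: string, eventType: string, payload: unknown): Promise<void>;

const pool = new Pool();

// Poll unsent rows, publish, mark sent. FOR UPDATE SKIP LOCKED lets several
// relay instances run concurrently without grabbing the same rows.
async function relayOnce(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const { rows } = await client.query(
      `SELECT id, event_type, payload, target FROM outbox
       WHERE sent_at IS NULL ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED`,
    );
    for (const row of rows) {
      await publish(row.target, row.event_type, row.payload);
      await client.query('UPDATE outbox SET sent_at = now() WHERE id = $1', [row.id]);
    }
    await client.query('COMMIT');
  } catch (err) {
    // Crash-safe: anything not marked sent is retried on the next poll,
    // which is why delivery is at-least-once and consumers need an inbox.
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```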
Why this is good:
- The local write and the "promise to publish" are atomic. If the COMMIT succeeds, the event will eventually be published because the outbox row is durable.
- If the COMMIT fails, neither exists. No half-state.
- No 2PC needed. One database, one transaction, one commit. The whole thing fits in a single Postgres transaction.
- Replayable. If the relay crashes, it restarts and continues from the last unsent row. If consumers are slow, the outbox accumulates and drains naturally.
Why it's not magic:
- Eventually consistent. The downstream service sees the event sometime after the local commit. Usually milliseconds; could be minutes if the relay is backed up.
- The downstream must be idempotent. Or use an inbox (see next section). Otherwise replays from the relay cause double-processing.
- Operational surface. The outbox table needs maintenance: rows marked sent need pruning or archival, the relay needs monitoring, lag metrics are now a thing you care about.
For our SetupCommunity case: the outbox alone is not enough, because the operation spans writes to both services' tables, not just one service writing and the others reading. You'd combine it with a saga: each saga step uses outbox to durably publish its "I did my part" event, and the coordinator drives the sequence.
Option 3: Inbox (the complement to outbox)
The receiving service stores the IDs of incoming events in an inbox table, in the same transaction that processes the event:
```sql
BEGIN;
-- ON CONFLICT DO NOTHING: if we've seen this event before, the insert
-- affects zero rows instead of failing
INSERT INTO inbox (event_id) VALUES ('evt-12345') ON CONFLICT DO NOTHING;
-- The application checks the affected-row count here:
--   0 rows: already processed; ROLLBACK and ack the message (skip silently)
--   1 row:  process the event within this same transaction:
INSERT INTO bank_accounts (...);
COMMIT;
```
This catches the "consumer crashes after processing but before acknowledging the message" failure mode. Without it, your message bus redelivers the event, you process it again, and your domain has duplicates.
Outbox + inbox is the canonical combination. The outbox guarantees "the event will be published." The inbox guarantees "the event will be processed exactly once." Together they convert at-least-once delivery (which is all most message buses provide) into effectively-exactly-once processing.
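In application code, that affected-row-count check looks roughly like the following (node-postgres again; the event shape and domain write are placeholders):

```typescript
import { Pool, PoolClient } from 'pg';

const pool = new Pool();

// Exactly-once processing: the inbox insert and the domain write share one
// transaction, so a redelivery either sees the inbox row (rowCount 0, skip)
// or replays a transaction that never committed in the first place.
async function handleEvent(
  eventId: string,
  processEvent: (client: PoolClient) => Promise<void>,
): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const res = await client.query(
      'INSERT INTO inbox (event_id) VALUES ($1) ON CONFLICT DO NOTHING',
      [eventId],
    );
    if (res.rowCount === 0) {
      // Already processed on a previous delivery: skip silently, ack the message.
      await client.query('ROLLBACK');
      return;
    }
    await processEvent(client); // e.g. INSERT INTO bank_accounts (...)
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // leave the message unacked; the bus redelivers
  } finally {
    client.release();
  }
}
```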
Option 4: Two-Phase Commit (2PC / XA)
Postgres supports prepared transactions. The pattern: each service does its part of the work, then issues PREPARE TRANSACTION 'name' (instead of COMMIT). A coordinator collects all the prepares; only when all return successfully does it tell each service to COMMIT PREPARED. If any prepare fails, everyone runs ROLLBACK PREPARED.
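For concreteness, here is the wire-level shape of the protocol, sketched with node-postgres against two databases. Note that Postgres ships with max_prepared_transactions = 0, so PREPARE TRANSACTION is disabled until you raise it; the table names and transaction ID are hypothetical, and error handling (the ROLLBACK PREPARED path) is elided.

```typescript
import { Client } from 'pg';

// Coordinator's-eye view of 2PC. Sketch only: a real coordinator persists
// its own state so it can resolve in-doubt transactions after a crash,
// and this code is exactly the part that can't.
async function twoPhaseCommit(a: Client, b: Client): Promise<void> {
  const gid = 'setup-community-42'; // hypothetical global transaction id

  await a.query('BEGIN');
  await a.query("INSERT INTO communities (id) VALUES (1)"); // service A's writes
  await a.query(`PREPARE TRANSACTION '${gid}-a'`); // phase 1 on A

  await b.query('BEGIN');
  await b.query("INSERT INTO bank_accounts (id) VALUES (1)"); // service B's writes
  await b.query(`PREPARE TRANSACTION '${gid}-b'`); // phase 1 on B

  // If the coordinator dies right here, both databases hold their prepared
  // transactions, and every lock they took, until an operator intervenes.
  await a.query(`COMMIT PREPARED '${gid}-a'`); // phase 2
  await b.query(`COMMIT PREPARED '${gid}-b'`);
  // On a failed prepare: ROLLBACK PREPARED on whichever side did prepare.
}
```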
This is the synchronous all-or-nothing answer. It's also the answer most modern systems reject. The reasons are:
- Blocking semantics on coordinator failure. If the coordinator crashes between the prepare phase and the commit phase, the participants hold their prepared transactions open — including all the locks they took — until someone resolves them manually. This is a real ops nightmare under partial failure.
- Performance. Every cross-service operation involves multiple round trips and pinned connections during the prepare window.
- Ecosystem mismatch. Most ORMs don't optimize for it. Most message brokers don't participate. Most managed databases discourage it.
- The deeper signal: If you reach for 2PC, your service decomposition is often wrong. You're trying to enforce a transactional boundary across a service boundary, and the service boundary exists precisely because the transactional boundary shouldn't be there.
I include 2PC for completeness. If you're seriously considering it, the better answer is usually "don't extract" or "extract differently so the atomicity stays local."
What I'd actually do (specific to ECM)
If we extracted user-management from community-management tomorrow, the migration for SetupCommunityUseCase would be:
- Orchestrated saga, driven from community-management (because that's where the use case originates). The saga state lives in community-management.
- Each saga step uses outbox at the sending side. The community-management service's outbox holds "PleaseCreateCompany", "PleaseCreateAddress", "PleaseCreateBankAccount", and the eventual "SagaCompletedSuccessfully" or "SagaRolledBack".
- user-management uses inbox on the receiving side. Each command is processed exactly once. Each response is published back via user-management's own outbox.
- Compensations are local to user-management: "DeleteCompany", "DeleteAddress", "DeleteBankAccount". All idempotent. None involve external side effects (no money moved, no emails sent — we deliberately ordered the saga so external side effects happen after community creation, which is the last step).
- User-visible behavior during partial failure: the API returns a 202 Accepted with a saga ID. The frontend polls or subscribes to a status endpoint. "Creating your community... done" or "Something went wrong; we've rolled back; please try again."
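What that user-visible contract could look like, as a sketch; the paths, DTO shape, and SagaStore interface are illustrative, not our actual API:

```typescript
import { Body, Controller, Get, HttpCode, Inject, Param, Post } from '@nestjs/common';

// Hypothetical saga-state store backing the status endpoint.
interface SagaStore {
  start(cmd: unknown): Promise<string>;
  stateOf(id: string): Promise<'running' | 'completed' | 'rolled_back'>;
}

@Controller('communities')
class CommunitySetupController {
  constructor(@Inject('SagaStore') private readonly sagas: SagaStore) {}

  @Post('setup')
  @HttpCode(202) // Accepted: the saga runs asynchronously
  async setup(@Body() cmd: Record<string, unknown>): Promise<{ sagaId: string }> {
    return { sagaId: await this.sagas.start(cmd) };
  }

  @Get('setup/:sagaId') // the frontend polls (or subscribes) here
  async status(@Param('sagaId') id: string): Promise<{ state: string }> {
    return { state: await this.sagas.stateOf(id) };
  }
}
```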
The migration is non-trivial but bounded. ~2 weeks of work for a competent engineer, much of it in writing tests for the compensation paths. The hardest part is shifting the team's mental model from "this either fully succeeds or fully rolls back synchronously" to "this is eventually consistent within seconds, with defined failure modes and visible saga state."
The honest cost of extraction
Extraction is a trade. You exchange:
| Monolith gives you | Extraction gives you |
|---|---|
| One service to deploy | Independent deployment cadence per service |
| One process to monitor | Independent scaling per service |
| Synchronous transactions across modules | Failure isolation — one service down doesn't kill the other |
| One stack trace per request | Team autonomy at the service boundary |
| One CI/CD pipeline | Different SLAs per service |
This is worth it when:
- One module has wildly different scaling requirements (PDF rendering is CPU-bound; the API is I/O-bound — different shapes, different scaling)
- One module has different SLA requirements (the payments service can't be down; the analytics service can)
- You need to deploy modules at different cadences (different teams, different velocity)
- You've outgrown the modular monolith's single-process resource limits
It's not worth it when:
- You're doing it because microservices are fashionable
- The team is small (under 10 engineers)
- You lack the operational maturity to run multiple services (observability, deployment, monitoring, on-call)
- The two modules naturally co-vary — when one changes, the other usually changes too. Extracting them just spreads a coordinated change across two pull requests in two repos
The position we've actually taken
For ECM specifically: we extracted PDF rendering because PDFs are CPU-heavy, fail differently from main app code (Playwright crashes, font issues, timeout patterns), and benefit from independent scaling. The communication boundary is BullMQ jobs — community-management enqueues "please render this PDF", the PDF service consumes the job, the result comes back as an event. No shared transaction. No saga. The natural async boundary made extraction cheap.
We have not extracted user-management, community-management, energy-management, or financial-management. They co-vary with each other, scale with the same factors (per-community user count, per-community billing cycles), and benefit from the operational simplicity of one deployment. The single forwardRef site between community-management and user-management is a documented, deliberate coupling. The day we decide to extract, we have the migration path laid out above. We don't pretend it's free; we don't ban ourselves from using forwardRef because "someday we might extract."
This is what modular monolith with extraction-readiness actually means: every coupling is a deliberate decision, every extraction is a foreseeable migration, neither is imposed by ideology. The forwardRef pattern is a tool. The saga + outbox + inbox combination is also a tool. Different tools, different costs, both legitimate. Choose based on the actual deployment boundary you actually have, not the one you might have someday.
The bottom line
If your team uses forwardRef across module boundaries:
- Document the decision — somewhere a future engineer will read it (ADR is ideal)
- Cross-link from the use cases that depend on it
- Know your migration path: saga with compensation + outbox + inbox, mostly. Don't reach for 2PC.
- Be honest that you've decided those modules ship together for the foreseeable future
If you are extracting:
- Don't ship the extraction without a plan for cross-service consistency. The first time a user sees "your community was partially created, please contact support" is the moment you regret skipping the design.
- Saga + outbox + inbox covers most cases. 2PC is almost never the right answer.
- Compensations are part of the design, not an afterthought. Decide which compensations are easy (delete a row) and which are hard (un-send an email) early.
- Don't skip the inbox. At-least-once becomes exactly-once at the cost of one table. It's the cheapest insurance you'll buy.
The forwardRef pattern is durable in-process and structurally impossible cross-process. The replacement patterns are well-understood and well-documented across the industry. The migration is foreseeable. All of this can coexist in the same codebase, on the same team, in the same calendar year. That's what extraction-ready modular monolith actually means in practice — not "every module will eventually be a service," but "every module could be a service, with a clear plan and an honest cost, and we'll choose deliberately when the moment comes."