Designing Saga Choreography for Workflow Engines
Netflix Conductor orchestrates 30+ critical workflows inside our Freshworks iPaaS. Here is how I pair saga choreography with Change Data Capture (CDC) to keep microservices consistent without slowing developers down.
Why Saga + CDC?
Traditional two-phase commit buckles under the latency profile of omnichannel integrations, because every participant holds its locks until the slowest one has voted. Saga choreography lets every service manage its own compensation logic, while CDC makes state changes visible to Conductor in near real time.
- Each service writes its domain events to DynamoDB, where DynamoDB Streams captures them as change records.
- CDC relays those records into Kafka topics, one set of topics per domain.
- Conductor tasks subscribe to those topics to decide progression or rollback.
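The decision at the end of that flow can be sketched as a small state machine. This is a minimal in-memory illustration, not the Conductor or Kafka API: `DomainEvent`, `SagaState`, `apply_event`, and the step names are all hypothetical, and a real consumer would read events from a topic rather than from a list.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


@dataclass
class DomainEvent:
    saga_id: str
    step: str
    outcome: Outcome


@dataclass
class SagaState:
    steps: list                                    # ordered step names (hypothetical)
    completed: list = field(default_factory=list)
    compensated: list = field(default_factory=list)
    status: str = "RUNNING"


def apply_event(state: SagaState, event: DomainEvent) -> SagaState:
    """Advance on success; on failure or timeout, undo completed steps in reverse."""
    if state.status != "RUNNING":
        return state  # terminal sagas ignore late or duplicate events
    if event.outcome is Outcome.SUCCESS:
        if event.step not in state.completed:
            state.completed.append(event.step)
        if len(state.completed) == len(state.steps):
            state.status = "COMPLETED"
    else:
        # Compensation order is the reverse of completion order.
        state.compensated = list(reversed(state.completed))
        state.status = "COMPENSATED"
    return state


if __name__ == "__main__":
    saga = SagaState(steps=["reserve_inventory", "charge_payment", "notify_customer"])
    apply_event(saga, DomainEvent("s1", "reserve_inventory", Outcome.SUCCESS))
    apply_event(saga, DomainEvent("s1", "charge_payment", Outcome.TIMEOUT))
    print(saga.status, saga.compensated)
```

Treating a timeout the same way as a failure is a simplification made here; a given domain may prefer to retry first.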
Reference Architecture
The diagram below illustrates how domain services, CDC pipelines, and Conductor workers stay coordinated without introducing a single point of failure.
Workflow Contract
Every workflow definition includes an explicit task domain contract so we can route execution to language-specific workers (Java WebFlux vs. Node.js). Domains also control isolation boundaries for throttling and alerting.
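One way to picture such a contract is a validated task-to-domain routing map. The sketch below is an assumption about shape only: `TaskContract`, `KNOWN_DOMAINS`, `build_task_to_domain`, and the domain labels are hypothetical names, not taken from our definitions. Conductor's start-workflow request does accept a `taskToDomain` map, and a mapping like the one returned here could plausibly feed it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    task_name: str   # task name as registered with the workflow engine
    domain: str      # hypothetical domain label, e.g. "java-webflux"


# Assumed set of worker domains; a real list would come from platform config.
KNOWN_DOMAINS = {"java-webflux", "nodejs"}


def build_task_to_domain(contracts):
    """Validate contracts and return a task name -> domain routing map."""
    mapping = {}
    for contract in contracts:
        if contract.domain not in KNOWN_DOMAINS:
            raise ValueError(
                f"unknown domain {contract.domain!r} for task {contract.task_name!r}"
            )
        if contract.task_name in mapping:
            raise ValueError(f"duplicate contract for task {contract.task_name!r}")
        mapping[contract.task_name] = contract.domain
    return mapping


if __name__ == "__main__":
    routing = build_task_to_domain([
        TaskContract("charge_payment", "java-webflux"),
        TaskContract("notify_customer", "nodejs"),
    ])
    print(routing)
```

Failing fast on an unknown or duplicate domain keeps a routing mistake from surfacing later as a task that no worker ever polls.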
Choreography Guardrails
Choreography can devolve into spaghetti if not governed. I enforce a few non-negotiables:
- Idempotent tasks with deterministic compensation.
- Time-boxed states: every task publishes heartbeat events so Conductor can trigger retries or compensations.
- Tracing parity: propagate OpenTelemetry baggage so CDC and workflow logs can be correlated in Grafana dashboards.
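The first guardrail can be illustrated with a small wrapper that deduplicates on an idempotency key and undoes an effect at most once. This is a sketch under stated assumptions: `IdempotentTask` and the key format are hypothetical, and the in-memory dictionaries stand in for a durable store that a real worker would need.

```python
from typing import Callable


class IdempotentTask:
    """Runs an action at most once per idempotency key and can undo it once."""

    def __init__(self, action: Callable[[dict], dict], compensate: Callable[[dict], None]):
        self._action = action
        self._compensate = compensate
        self._results = {}      # in-memory stand-in for a durable result store
        self._undone = set()    # keys whose effect has already been compensated

    def execute(self, key: str, payload: dict) -> dict:
        if key in self._results:
            return self._results[key]   # duplicate delivery: reuse the stored result
        result = self._action(payload)
        self._results[key] = result
        return result

    def compensate(self, key: str) -> bool:
        """Undo the effect for `key`; return False if there is nothing left to undo."""
        if key not in self._results or key in self._undone:
            return False
        self._compensate(self._results[key])
        self._undone.add(key)
        return True


if __name__ == "__main__":
    ledger = []
    charge = IdempotentTask(
        action=lambda p: (ledger.append(p["amount"]) or {"charged": p["amount"]}),
        compensate=lambda r: ledger.append(-r["charged"]),
    )
    charge.execute("saga-1:charge", {"amount": 40})
    charge.execute("saga-1:charge", {"amount": 40})   # redelivered event, no second charge
    charge.compensate("saga-1:charge")
    print(ledger)
```

The compensation takes the stored result as its only input, which keeps it deterministic: the undo depends on what was actually done, not on the current state of the world.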
Implementation Checklist
If you are rolling this pattern into your own stack, I recommend advancing through these gates:
- Align on the event taxonomy: what constitutes a success, a failure, or a timeout?
- Codify retry semantics per domain and document the compensating action.
- Instrument end-to-end tracing before onboarding any production workloads.
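Retry semantics are easier to review when they live in one declarative table per domain. The sketch below assumes exponential backoff and uses hypothetical names throughout (`RetryPolicy`, `POLICIES`, the domain keys, and the compensating action labels); the numbers are illustrative, not our production settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    base_delay_s: float
    backoff_factor: float
    compensating_action: str   # documented name of the undo step for this domain

    def delay_before(self, attempt: int) -> Optional[float]:
        """Seconds to wait before retry `attempt` (1-based).

        Returns None once retries are exhausted, which signals that the
        compensating action should run instead.
        """
        if attempt > self.max_attempts:
            return None
        return self.base_delay_s * self.backoff_factor ** (attempt - 1)


# Hypothetical per-domain policies.
POLICIES = {
    "payments": RetryPolicy(3, 2.0, 2.0, "refund_payment"),
    "notifications": RetryPolicy(5, 1.0, 1.5, "none"),
}


if __name__ == "__main__":
    policy = POLICIES["payments"]
    for attempt in range(1, 5):
        print(attempt, policy.delay_before(attempt))
```

Keeping the compensating action next to the retry limits means the document and the code cannot drift apart on what happens after the last attempt.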
Metrics That Matter
We treat sagas like product experiences and monitor them with KPIs:
- Median orchestration time per workflow.
- Percentage of compensations triggered per domain.
- CDC lag between source tables and Conductor consumption.
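Each of these KPIs reduces to a small calculation. The helpers below are a sketch with hypothetical names and input shapes (`median_orchestration_ms`, `compensation_rate`, `cdc_lag_ms`); in practice the inputs would come from workflow execution records and consumer timestamps.

```python
from statistics import median


def median_orchestration_ms(durations_ms):
    """Median end-to-end duration of a workflow, in milliseconds."""
    return median(durations_ms)


def compensation_rate(total_runs: int, compensated_runs: int) -> float:
    """Percentage of runs in a domain that ended in compensation."""
    if total_runs == 0:
        return 0.0
    return 100.0 * compensated_runs / total_runs


def cdc_lag_ms(source_commit_ms: int, consumed_ms: int) -> int:
    """Delay between the source write and its consumption; clamped at zero
    so clock skew between hosts never reports a negative lag."""
    return max(0, consumed_ms - source_commit_ms)


if __name__ == "__main__":
    print(median_orchestration_ms([120, 80, 200]))
    print(compensation_rate(200, 7))
    print(cdc_lag_ms(1_000, 1_350))
```

The median is used rather than the mean so that a handful of long-running sagas does not mask the typical experience.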
Operational Outcomes
By combining CDC with sagas, we reduced reconciliation incidents by 60%, and onboarding a new workflow now takes hours rather than days. Developers work inside their preferred stack, while platform teams maintain real-time visibility through Conductor’s summary index.