Himanshu Sharma

Lead Platform Engineer

Microservices • Saga Pattern

Designing Saga Choreography for Workflow Engines

Netflix Conductor orchestrates 30+ critical workflows inside our Freshworks iPaaS. Here is how I pair saga choreography with Change Data Capture (CDC) to keep microservices consistent without slowing down developer velocity.

Why Saga + CDC?

Traditional two-phase commit approaches buckle under the latency profile of omnichannel integrations. Saga choreography allows every service to manage its own compensation logic, while CDC guarantees that state changes are visible to Conductor in near real time.

  • Each service writes its state changes to DynamoDB, where DynamoDB Streams captures them as domain events.
  • CDC relays events into Kafka topics dedicated per domain.
  • Conductor tasks subscribe to those topics to decide progression or rollback.
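
As a minimal sketch of the last step in that flow, the function below maps a consumed domain event to a saga decision. The event fields, the `saga.<domain>.events` topic naming, and the three outcome values are assumptions made for illustration; they are not taken from Conductor or from the production setup described here.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ADVANCE = "advance"
    RETRY = "retry"
    COMPENSATE = "compensate"


@dataclass(frozen=True)
class DomainEvent:
    saga_id: str
    domain: str      # e.g. "billing"; also used to derive the topic name
    outcome: str     # assumed taxonomy: "success" | "failure" | "timeout"
    attempt: int = 1


def topic_for(event: DomainEvent) -> str:
    """One topic per domain (hypothetical naming scheme)."""
    return f"saga.{event.domain}.events"


def decide(event: DomainEvent, max_attempts: int = 3) -> Decision:
    """Advance on success, retry timeouts while attempts remain, else roll back."""
    if event.outcome == "success":
        return Decision.ADVANCE
    if event.outcome == "timeout" and event.attempt < max_attempts:
        return Decision.RETRY
    return Decision.COMPENSATE
```

Keeping the decision a pure function makes it easy to unit test without a running Kafka broker; the consumer loop only has to deserialize the record and act on the returned value.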

Reference Architecture

The diagram below illustrates how domain services, CDC pipelines, and Conductor workers stay coordinated without introducing a single point of failure.

[Diagram: Service A (emits domain events) → CDC stream (DynamoDB → Kafka) → Conductor tasks (workflow progress), alongside a saga orchestration context carrying metadata: task domain, retry policy, compensations.]
Domain events flow through CDC before Conductor advances the saga.

Workflow Contract

Every workflow definition includes an explicit task domain contract so we can route execution to language-specific workers (Java WebFlux vs. Node.js). Domains also control isolation boundaries for throttling and alerting.

Pro tip: expose the contract via metadata so UI and API consumers can search workflows by domain (now supported upstream via my PR #492).
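
To make the idea of a searchable contract concrete, here is a small sketch that validates a domain contract and filters workflows by domain. The `metadata` layout and the `taskDomain` and `workerRuntime` keys are hypothetical names chosen for this example, not confirmed Conductor fields.

```python
from __future__ import annotations

# Assumed set of runtimes, based on the two worker stacks mentioned above.
ALLOWED_RUNTIMES = {"java-webflux", "nodejs"}

# Hypothetical workflow definitions carrying an explicit domain contract.
WORKFLOWS = [
    {"name": "order_sync", "version": 1,
     "metadata": {"taskDomain": "orders", "workerRuntime": "java-webflux"}},
    {"name": "invoice_push", "version": 2,
     "metadata": {"taskDomain": "billing", "workerRuntime": "nodejs"}},
    {"name": "legacy_import", "version": 1, "metadata": {}},
]


def validate_contract(workflow: dict) -> list[str]:
    """Return contract violations for one workflow; an empty list means valid."""
    meta = workflow.get("metadata", {})
    problems = []
    if not meta.get("taskDomain"):
        problems.append("missing taskDomain")
    if meta.get("workerRuntime") not in ALLOWED_RUNTIMES:
        problems.append("unknown workerRuntime")
    return problems


def find_by_domain(workflows: list[dict], domain: str) -> list[str]:
    """Names of workflows whose contract routes them to the given domain."""
    return [w["name"] for w in workflows
            if w.get("metadata", {}).get("taskDomain") == domain]
```

Running `validate_contract` at registration time is one way to reject workflows that would otherwise fall outside any throttling or alerting boundary.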

Choreography Guardrails

Choreography can devolve into spaghetti if not governed. I enforce a few non-negotiables:

  1. Idempotent tasks with deterministic compensation.
  2. Time-boxed states: every task publishes heartbeat events so Conductor can trigger retries or compensations.
  3. Tracing parity: propagate OpenTelemetry baggage so CDC and workflow logs can be correlated in Grafana dashboards.
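
The first two guardrails can be sketched as follows. This is an illustration under stated assumptions: the `IdempotentTask` wrapper, its in-memory stores, and the `heartbeat_expired` helper are hypothetical, and a real worker would need a durable store for the idempotency keys.

```python
from typing import Any, Callable, Dict, Set


class IdempotentTask:
    """Runs an action at most once per key and compensates at most once."""

    def __init__(self, action: Callable[[Any], Any],
                 compensation: Callable[[Any], None]) -> None:
        self._action = action
        self._compensation = compensation
        self._results: Dict[str, Any] = {}   # in-memory only, for illustration
        self._compensated: Set[str] = set()

    def run(self, key: str, payload: Any) -> Any:
        # A redelivered event with the same key returns the cached result
        # instead of repeating the side effect.
        if key not in self._results:
            self._results[key] = self._action(payload)
        return self._results[key]

    def compensate(self, key: str) -> bool:
        """Undo a completed run once; return True if compensation ran."""
        if key not in self._results or key in self._compensated:
            return False
        self._compensation(self._results[key])
        self._compensated.add(key)
        return True


def heartbeat_expired(last_heartbeat: float, now: float,
                      timeout_seconds: float) -> bool:
    """True when a task has been silent for longer than its time box."""
    return (now - last_heartbeat) > timeout_seconds
```

Because the compensation receives the recorded result of the original run rather than fresh input, replaying it yields the same undo operation, which is what makes it deterministic in this sketch.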

Implementation Checklist

If you are rolling this pattern into your own stack, I recommend advancing through these gates:

  • Align on the event taxonomy: what constitutes a success, a failure, or a timeout?
  • Codify retry semantics per domain and document the compensating action.
  • Instrument end-to-end tracing before onboarding any production workloads.
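
One way to codify retry semantics per domain is a small policy table that keeps the compensating action next to the retry settings. The domains, numbers, and action names below are made up for illustration, and capped exponential backoff is an assumed strategy rather than the one used in production here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    base_delay_seconds: float
    max_delay_seconds: float
    compensating_action: str   # documented alongside the policy

    def should_retry(self, attempt: int) -> bool:
        """attempt is 1-based: the number of tries made so far."""
        return attempt < self.max_attempts

    def delay_for(self, attempt: int) -> float:
        """Exponential backoff, base * 2^(attempt - 1), capped at the maximum."""
        if attempt < 1:
            raise ValueError("attempt is 1-based")
        delay = self.base_delay_seconds * 2 ** (attempt - 1)
        return min(delay, self.max_delay_seconds)


# Hypothetical per-domain policies.
POLICIES = {
    "billing": RetryPolicy(3, 2.0, 30.0, "refund_charge"),
    "notifications": RetryPolicy(5, 1.0, 10.0, "none_required"),
}
```

Keeping the table in code (or in versioned config) means a review of the retry numbers is also a review of the documented compensation.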

Metrics That Matter

We treat sagas like product experiences and monitor them with KPIs:

  • Median orchestration time per workflow.
  • Percentage of compensations triggered per domain.
  • CDC lag between source tables and Conductor consumption.
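
The three KPIs reduce to simple aggregations once run records are available. The sketch below assumes a hypothetical record shape (`workflow`, `domain`, `started_at`, `ended_at`, `compensated`) with timestamps in seconds; the real fields would depend on how workflow executions are exported.

```python
from statistics import median
from typing import List, Optional


def median_orchestration_seconds(runs: List[dict], workflow: str) -> Optional[float]:
    """Median wall-clock duration for one workflow, or None with no data."""
    durations = [r["ended_at"] - r["started_at"]
                 for r in runs if r["workflow"] == workflow]
    return median(durations) if durations else None


def compensation_rate(runs: List[dict], domain: str) -> Optional[float]:
    """Percentage of runs in a domain that triggered a compensation."""
    in_domain = [r for r in runs if r["domain"] == domain]
    if not in_domain:
        return None
    compensated = sum(1 for r in in_domain if r["compensated"])
    return 100.0 * compensated / len(in_domain)


def cdc_lag_seconds(committed_at: float, consumed_at: float) -> float:
    """Delay between the source write and its consumption, floored at zero
    so that clock skew never reports a negative lag."""
    return max(0.0, consumed_at - committed_at)
```

Returning `None` for an empty sample, rather than zero, keeps a dashboard from showing a misleadingly healthy value for a workflow or domain that has not run yet.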

Operational Outcomes

By combining CDC with sagas, we reduced reconciliation incidents by 60%, and onboarding a new workflow now takes hours rather than days. Developers work inside their preferred stack, while platform teams maintain real-time visibility through Conductor’s summary index.