Himanshu Sharma

Lead Platform Engineer

Scaling • Observability

Scaling Omnichannel Services to 100K Daily Interactions

Messaging is unforgiving—latency spikes translate directly to revenue hits. This is the reference architecture I used at Freshworks to keep WhatsApp, email, SMS, and voice channels reliable as traffic tripled.

Layered Capacity Planning

Instead of planning capacity per microservice, we scope by conversation lane (SMS, WhatsApp, email, voice). Each lane owns:

  • An autoscaled WebFlux ingress tier with a bounded connection pool.
  • Dedicated Kafka partitions tuned for the lane’s throughput distribution.
  • Failover templates that reroute traffic to a warm standby region.
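The per-lane partition sizing in the second bullet falls out of simple arithmetic: peak throughput times a headroom factor, divided by what a single partition's consumer can sustain. A minimal sketch of that calculation (the throughput figures below are illustrative assumptions, not production values):

```python
import math

def partitions_needed(peak_msgs_per_sec: float,
                      per_partition_msgs_per_sec: float,
                      headroom: float = 0.3) -> int:
    """Size a lane's dedicated Kafka partition count from its peak throughput.

    headroom is the fraction of spare capacity reserved for bursts.
    """
    required = peak_msgs_per_sec * (1.0 + headroom)
    return math.ceil(required / per_partition_msgs_per_sec)

# e.g. a lane peaking at 5,000 msg/s, with consumers that sustain
# roughly 1,000 msg/s per partition:
print(partitions_needed(5000, 1000))  # → 7
```

Sizing from the lane's own peak (rather than a shared cluster-wide number) is what lets each lane scale and fail independently.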

Channel Architecture Map

Each conversation lane has an ingress tier, a buffering tier, and a delivery tier. The following diagram captures the core flow.

[Diagram: Ingress (WebFlux Gateway) → Buffering (Kafka + Redis) → Delivery (Channel Adapters), all reporting into an Observability & Control Plane (Grafana dashboards • Jenkins canary • PagerDuty).]
Each lane owns ingress, buffering, and delivery tiers backed by a shared control plane.

Back-pressure Everywhere

Spring WebFlux gives us a reactive backbone, but the magic happens when we propagate demand signals end-to-end:

  1. Apply rate-aware batching before calling third-party APIs.
  2. Emit custom Micrometer meters for queue depth and consumer lag, surfacing them in Grafana.
  3. Wire Jenkins canary jobs to watch these meters during deploy and auto-roll back when SLOs degrade.
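The rate-aware batching in step 1 is implemented with Reactor operators in the WebFlux services; the gating logic itself is easiest to see in a plain sketch. Below is an illustrative token bucket with an injectable clock (all names are mine, not the production API) that admits a batch only when the downstream provider has capacity for it:

```python
class TokenBucket:
    """Rate-aware gate in front of a third-party channel API.

    Refills `rate` tokens per second up to `capacity`; a batch of n
    messages is admitted only when n tokens are available, so upstream
    demand is throttled to what the provider actually accepts.
    """

    def __init__(self, rate: float, capacity: float, clock) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock          # injectable so tests are deterministic
        self.last = clock()

    def try_acquire(self, n: int) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False                # caller buffers the batch and retries later

# Deterministic walk-through with a fake clock:
t = [0.0]
bucket = TokenBucket(rate=10.0, capacity=20.0, clock=lambda: t[0])
assert bucket.try_acquire(15)       # 20 tokens available
assert not bucket.try_acquire(10)   # only 5 left → back-pressure upstream
t[0] = 1.0                          # one second passes → +10 tokens
assert bucket.try_acquire(10)
```

A rejected `try_acquire` is exactly the demand signal the reactive pipeline propagates upstream instead of letting queues grow unbounded.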

Design Tenets

We describe every channel lane using four tenets:

  • Isolation: noisy neighbors can’t impact other lanes.
  • Observability: SLOs and budget alerts exist before GA.
  • Automatability: deployments + rollbacks require a single Jenkins job.
  • Compliance: data residency settings per lane (EU-only, US-only).
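The four tenets are enforceable because every lane is described declaratively. A hypothetical sketch of such a lane descriptor (field names and values are illustrative, not the actual schema), where each field maps to one tenet and validation runs before a lane ships:

```python
from dataclasses import dataclass

ALLOWED_RESIDENCY = {"EU", "US"}     # compliance: per-lane data residency

@dataclass(frozen=True)
class LaneSpec:
    """Declarative description of a conversation lane against the four tenets."""
    name: str
    kafka_partitions: int            # isolation: dedicated partitions per lane
    slo_dashboard: str               # observability: dashboard must exist pre-GA
    jenkins_job: str                 # automatability: one job deploys and rolls back
    residency: str                   # compliance: EU-only or US-only

    def validate(self) -> None:
        if self.kafka_partitions < 1:
            raise ValueError("a lane needs at least one dedicated partition")
        if self.residency not in ALLOWED_RESIDENCY:
            raise ValueError(f"unknown residency {self.residency!r}")

lane = LaneSpec("whatsapp", 7, "grafana/whatsapp-slo", "deploy-whatsapp", "EU")
lane.validate()   # raises if any tenet is unmet
```

Keeping the tenets in a typed spec means a lane that skips observability or residency simply fails validation instead of shipping quietly.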

Runbook Snapshot

Our runbook includes pre-built flows for:

  1. Fast-draining a Kafka partition when consumer lag exceeds thresholds.
  2. Switching traffic to a warm region and rehydrating Redis caches.
  3. Automated post-mortem collection (logs, traces, dashboards).
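The first flow starts with a triage step: decide which partitions actually need fast-draining. A minimal sketch of that decision (function and threshold names are illustrative):

```python
def partitions_to_drain(lag_by_partition: dict[int, int],
                        lag_threshold: int) -> list[int]:
    """Runbook flow 1, triage step: pick the partitions whose consumer
    lag has crossed the fast-drain threshold, worst offenders first."""
    hot = [(lag, p) for p, lag in lag_by_partition.items() if lag > lag_threshold]
    return [p for lag, p in sorted(hot, reverse=True)]

lag = {0: 1_200, 1: 80_000, 2: 450, 3: 52_000}
print(partitions_to_drain(lag, lag_threshold=50_000))  # → [1, 3]
```

Ordering by severity matters: draining the worst partition first recovers the most head-of-line latency per unit of operator time.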

Observability Guardrails

Engineering teams get a single Grafana dashboard with four tiles: ingestion latency, downstream success, queue depth, and error budget burn. Every tile maps to PagerDuty policies so on-call responders have consistent triggers.
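The error-budget-burn tile is the one that pages. Its math is small enough to show inline: observed error rate divided by the budgeted error rate, so 1.0 means burning exactly on budget and anything above it consumes the budget early. A sketch (thresholds are illustrative):

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Error-budget burn for one window: observed error rate divided by
    the budgeted error rate (1 - SLO target). >1.0 should page."""
    budget = 1.0 - slo_target
    return (errors / requests) / budget

# 20 errors in 10,000 requests against a 99.9% SLO burns budget 2x too fast:
print(round(burn_rate(20, 10_000, 0.999), 2))  # → 2.0
```

Mapping the PagerDuty policy to a burn-rate threshold rather than a raw error count keeps the trigger meaningful across lanes with very different traffic volumes.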

Bonus: pair the dashboard with synthetic monitors that replay common journeys (OTP, ticket assignment) every minute.
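One tick of such a synthetic monitor can be sketched as below; the probe names are hypothetical stand-ins for real journey replays, and a scheduler would invoke this every minute:

```python
from typing import Callable

def run_synthetics(probes: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Replay each journey probe once and report pass/fail; failures feed
    the same PagerDuty policies as the dashboard tiles."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = probe()
        except Exception:
            results[name] = False   # a crashing probe counts as a failed journey
    return results

print(run_synthetics({
    "otp": lambda: True,                     # stand-in for a real OTP replay
    "ticket-assignment": lambda: 1 / 0 > 0,  # simulated probe crash
}))  # → {'otp': True, 'ticket-assignment': False}
```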

Results

The platform now handles 100K+ daily interactions with sub-100 ms latency and a 20% lift in engagement. Teams deploy 2–3 times per day because the guardrails make regressions obvious.

  • p99 response time: 95 ms during peak events.
  • Error budget burn reduced by 35% quarter over quarter.
  • Time-to-detect (TTD) for channel regressions under 3 minutes.