Scaling • Observability
Scaling Omnichannel Services to 100K Daily Interactions
Messaging is unforgiving: latency spikes translate directly into revenue hits. This is the reference architecture I used at Freshworks to keep WhatsApp, email, SMS, and voice channels reliable as traffic tripled.
Layered Capacity Planning
Instead of planning capacity per microservice, we scope it by conversation lane (WhatsApp, email, SMS, voice). Each lane owns:
- An autoscaled WebFlux ingress tier with bounded connection pool.
- Dedicated Kafka partitions tuned for the lane’s throughput distribution.
- Failover templates that reroute traffic to a warm standby region.
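The lane-level sizing boils down to simple arithmetic: partitions needed = peak message rate divided by measured per-partition capacity, with headroom on top. A minimal sketch of that math (the class name and any figures you plug in are illustrative, not our production values):

```java
// Sketch: derive a lane's Kafka partition count from its peak throughput.
// Inputs are assumptions supplied by the caller, not measured values.
public class LanePartitionPlanner {

    /**
     * Partitions = ceil(peak rate / per-partition capacity), scaled by a
     * headroom factor (e.g. 1.5x) so a lane can absorb bursts without resizing.
     */
    public static int partitionsFor(double peakMsgsPerSec,
                                    double perPartitionMsgsPerSec,
                                    double headroom) {
        int base = (int) Math.ceil(peakMsgsPerSec / perPartitionMsgsPerSec);
        return (int) Math.ceil(base * headroom);
    }
}
```

Keeping headroom explicit (rather than baked into the capacity number) makes the trade-off visible in capacity reviews.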
Channel Architecture Map
Each conversation lane has an ingress tier, a buffering tier, and a delivery tier: requests enter through the WebFlux ingress, buffer in the lane's dedicated Kafka partitions, and exit through a delivery tier that calls the channel provider's API.
Back-pressure Everywhere
Spring WebFlux gives us a reactive backbone, but the magic happens when we propagate demand signals end-to-end:
- Apply rate-aware batching before calling third-party APIs.
- Emit custom Micrometer meters for queue depth and consumer lag, surfacing them in Grafana.
- Wire Jenkins canary jobs to watch these meters during deploy and auto-roll back when SLOs degrade.
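The demand-propagation idea above can be sketched with just the JDK: a bounded queue stands in for Reactor's demand signaling (producers block when the lane is saturated), and `drainTo` stands in for rate-aware batching before a third-party call. Class name, capacity, and batch size below are illustrative, not the production implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal back-pressure sketch: a bounded buffer per lane. In production this
// role is played by Reactor/WebFlux demand signals; the mechanics are the same.
public class BoundedLaneBuffer {
    private final BlockingQueue<String> queue;
    private final int maxBatch;

    public BoundedLaneBuffer(int capacity, int maxBatch) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatch = maxBatch;
    }

    /** Producer side: put() blocks when the buffer is full -- back-pressure. */
    public void offer(String message) {
        try {
            queue.put(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while applying back-pressure", e);
        }
    }

    /** Consumer side: drain up to maxBatch messages for one downstream API call. */
    public List<String> nextBatch() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    /** Queue depth -- the meter we export via Micrometer and chart in Grafana. */
    public int depth() {
        return queue.size();
    }
}
```

The `depth()` accessor is the hook for the Micrometer gauge mentioned above: queue depth is the earliest leading indicator that a lane is falling behind.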
Design Tenets
We describe every channel lane using four tenets:
- Isolation: noisy neighbors can’t impact other lanes.
- Observability: SLOs and budget alerts exist before GA.
- Automatability: deployments + rollbacks require a single Jenkins job.
- Compliance: data residency settings per lane (EU-only, US-only).
Runbook Snapshot
Our runbook includes pre-built flows for:
- Fast-draining a Kafka partition when consumer lag exceeds thresholds.
- Switching traffic to a warm region and rehydrating Redis caches.
- Automated post-mortem collection (logs, traces, dashboards).
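The first runbook flow hinges on a lag check: compare each partition's log-end offset against the consumer group's committed offset and flag partitions past the threshold. A hedged sketch of that trigger logic; in production the offsets come from Kafka's admin APIs, but here they are passed in as plain maps so the logic stays self-contained (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the lag-based drain trigger: lag = log-end offset minus committed
// offset, per partition. Partitions over the threshold get the drain flow.
public class LagDrainTrigger {

    public static List<Integer> partitionsToDrain(Map<Integer, Long> endOffsets,
                                                  Map<Integer, Long> committedOffsets,
                                                  long lagThreshold) {
        List<Integer> drain = new ArrayList<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long lag = e.getValue() - committedOffsets.getOrDefault(e.getKey(), 0L);
            if (lag > lagThreshold) {
                drain.add(e.getKey());
            }
        }
        return drain;
    }
}
```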
Observability Guardrails
Engineering teams get a single Grafana dashboard with four tiles: ingestion latency, downstream success, queue depth, and error budget burn. Every tile maps to PagerDuty policies so on-call responders have consistent triggers.
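The error budget burn tile reduces to one ratio: observed error rate divided by the error rate the SLO permits. A burn rate above 1 means the lane is spending budget faster than the SLO allows. A minimal sketch of that math (class name illustrative):

```java
// Sketch of the burn-rate calculation behind the fourth dashboard tile.
// sloTarget of 0.999 allows an error rate of 0.001; burn rate is how many
// multiples of that allowance the lane is currently consuming.
public class ErrorBudgetBurn {

    public static double burnRate(long failedRequests, long totalRequests, double sloTarget) {
        if (totalRequests == 0) {
            return 0.0; // no traffic, no budget spent
        }
        double errorRate = (double) failedRequests / totalRequests;
        return errorRate / (1.0 - sloTarget);
    }
}
```

Paging on burn rate rather than raw error count is what keeps the PagerDuty triggers consistent across lanes with very different traffic volumes.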
Results
The platform now handles 100K+ daily interactions with sub-100 ms latency and a 20% lift in engagement. Teams deploy 2–3 times per day because the guardrails make regressions obvious.
- p99 response time: 95 ms during peak events.
- Error budget burn reduced by 35% quarter over quarter.
- Time-to-detect (TTD) for channel regressions under 3 minutes.