Scaling • Observability
Scaling Omnichannel Services to 100K Daily Interactions
Messaging is unforgiving: latency spikes translate directly into revenue hits. This is the reference architecture I used at Freshworks to keep WhatsApp, email, SMS, and voice channels reliable as traffic tripled.
Layered Capacity Planning
Instead of planning capacity per microservice, we scope it by conversation lane (WhatsApp, email, SMS, voice). Each lane owns:
- An autoscaled WebFlux ingress tier with bounded connection pool.
- Dedicated Kafka partitions tuned for the lane’s throughput distribution.
- Failover templates that reroute traffic to a warm standby region.
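The lane-level sizing boils down to simple arithmetic: partitions needed = peak message rate divided by measured per-partition capacity, with headroom on top. A minimal sketch of that math (the class name and any figures you plug in are illustrative, not our production values):

```java
// Sketch: derive a lane's Kafka partition count from its peak throughput.
// Inputs are assumptions supplied by the caller, not measured values.
public class LanePartitionPlanner {

    /**
     * Partitions = ceil(peak rate / per-partition capacity), scaled by a
     * headroom factor (e.g. 1.5x) so a lane can absorb bursts without resizing.
     */
    public static int partitionsFor(double peakMsgsPerSec,
                                    double perPartitionMsgsPerSec,
                                    double headroom) {
        int base = (int) Math.ceil(peakMsgsPerSec / perPartitionMsgsPerSec);
        return (int) Math.ceil(base * headroom);
    }
}
```

Keeping headroom explicit (rather than baked into the capacity number) makes the trade-off visible in capacity reviews.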
Channel Architecture Map
Each conversation lane has an ingress tier, a buffering tier, and a delivery tier: requests enter through the WebFlux ingress, buffer in the lane's dedicated Kafka partitions, and exit through a delivery tier that calls the channel provider's API.
Back-pressure Everywhere
Spring WebFlux gives us a reactive backbone, but the magic happens when we propagate demand signals end-to-end:
- Apply rate-aware batching before calling third-party APIs.
- Emit custom Micrometer meters for queue depth and consumer lag, surfacing them in Grafana.
- Wire Jenkins canary jobs to watch these meters during deploy and auto-roll back when SLOs degrade.
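The demand-propagation idea above can be sketched with just the JDK: a bounded queue stands in for Reactor's demand signaling (producers block when the lane is saturated), and `drainTo` stands in for rate-aware batching before a third-party call. Class name, capacity, and batch size below are illustrative, not the production implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal back-pressure sketch: a bounded buffer per lane. In production this
// role is played by Reactor/WebFlux demand signals; the mechanics are the same.
public class BoundedLaneBuffer {
    private final BlockingQueue<String> queue;
    private final int maxBatch;

    public BoundedLaneBuffer(int capacity, int maxBatch) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatch = maxBatch;
    }

    /** Producer side: put() blocks when the buffer is full -- back-pressure. */
    public void offer(String message) {
        try {
            queue.put(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while applying back-pressure", e);
        }
    }

    /** Consumer side: drain up to maxBatch messages for one downstream API call. */
    public List<String> nextBatch() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    /** Queue depth -- the meter we export via Micrometer and chart in Grafana. */
    public int depth() {
        return queue.size();
    }
}
```

The `depth()` accessor is the hook for the Micrometer gauge mentioned above: queue depth is the earliest leading indicator that a lane is falling behind.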
Design Tenets
We describe every channel lane using four tenets:
- Isolation: noisy neighbors can’t impact other lanes.
- Observability: SLOs and budget alerts exist before GA.
- Automatability: deployments + rollbacks require a single Jenkins job.
- Compliance: data residency settings per lane (EU-only, US-only).
Runbook Snapshot
Our runbook includes pre-built flows for:
- Fast-draining a Kafka partition when consumer lag exceeds thresholds.
- Switching traffic to a warm region and rehydrating Redis caches.
- Automated post-mortem collection (logs, traces, dashboards).
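The first runbook flow hinges on a lag check: compare each partition's log-end offset against the consumer group's committed offset and flag partitions past the threshold. A hedged sketch of that trigger logic; in production the offsets come from Kafka's admin APIs, but here they are passed in as plain maps so the logic stays self-contained (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the lag-based drain trigger: lag = log-end offset minus committed
// offset, per partition. Partitions over the threshold get the drain flow.
public class LagDrainTrigger {

    public static List<Integer> partitionsToDrain(Map<Integer, Long> endOffsets,
                                                  Map<Integer, Long> committedOffsets,
                                                  long lagThreshold) {
        List<Integer> drain = new ArrayList<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long lag = e.getValue() - committedOffsets.getOrDefault(e.getKey(), 0L);
            if (lag > lagThreshold) {
                drain.add(e.getKey());
            }
        }
        return drain;
    }
}
```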
Observability Guardrails
Engineering teams get a single Grafana dashboard with four tiles: ingestion latency, downstream success, queue depth, and error budget burn. Every tile maps to PagerDuty policies so on-call responders have consistent triggers.
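The error budget burn tile reduces to one ratio: observed error rate divided by the error rate the SLO permits. A burn rate above 1 means the lane is spending budget faster than the SLO allows. A minimal sketch of that math (class name illustrative):

```java
// Sketch of the burn-rate calculation behind the fourth dashboard tile.
// sloTarget of 0.999 allows an error rate of 0.001; burn rate is how many
// multiples of that allowance the lane is currently consuming.
public class ErrorBudgetBurn {

    public static double burnRate(long failedRequests, long totalRequests, double sloTarget) {
        if (totalRequests == 0) {
            return 0.0; // no traffic, no budget spent
        }
        double errorRate = (double) failedRequests / totalRequests;
        return errorRate / (1.0 - sloTarget);
    }
}
```

Paging on burn rate rather than raw error count is what keeps the PagerDuty triggers consistent across lanes with very different traffic volumes.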
Results
The platform now handles 100K+ daily interactions with sub-100 ms latency and a 20% lift in engagement. Teams deploy 2–3 times per day because the guardrails make regressions obvious.
- p99 response time: 95 ms during peak events.
- Error budget burn reduced by 35% quarter over quarter.
- Time-to-detect (TTD) for channel regressions under 3 minutes.