Real-time chat systems power everything from customer support platforms to social messaging apps used by billions of people. While sending a single message seems simple, delivering millions of messages per second with ordering guarantees, low latency, and high availability is a complex distributed systems problem.
At scale, even moderate usage grows quickly. For example, 20 million users sending just 30 messages per day already generate 600 million messages daily — nearly 7,000 per second on average, with peak traffic far higher.
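The arithmetic above is easy to sketch as a back-of-envelope calculation. The peak multiplier below is an assumption for illustration; real peak-to-average ratios depend heavily on usage patterns.

```python
# Back-of-envelope throughput estimate using the numbers above.
users = 20_000_000
messages_per_user_per_day = 30
seconds_per_day = 24 * 60 * 60  # 86,400

daily_messages = users * messages_per_user_per_day  # 600,000,000
avg_per_second = daily_messages / seconds_per_day   # ~6,944 msg/s

# Peak load is commonly sized as a multiple of the average;
# the 3x factor here is an assumption, not a universal rule.
peak_per_second = avg_per_second * 3
print(f"{avg_per_second:,.0f} avg msg/s, ~{peak_per_second:,.0f} at peak")
```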
Designing for this scale requires careful separation of concerns, asynchronous processing, and strong consistency strategies.
Core Requirements
Before jumping into components, define what “good” looks like:
- Low Latency: Messages should appear in real time (sub-second delivery).
- High Throughput: Support millions of concurrent connections.
- Ordering Guarantees: Messages in a conversation must arrive in sequence.
- Reliability: No message loss, even during failures.
- Offline Support: Messages must be delivered when users reconnect.
- Scalability: Horizontal scaling without central bottlenecks.
- Observability: Real-time monitoring of delivery health.
High-Level Architecture
A scalable chat system is built around persistent connections, distributed message routing, and durable storage.
Typical core components include:
- Connection Gateways (WebSocket servers)
- Message Router / Broker
- Presence Service
- Message Storage
- Notification Service (for offline users)
Each layer scales independently.
Connection Layer (WebSocket Gateways)
Clients establish long-lived WebSocket connections to gateway servers.
Responsibilities:
- Maintain active user connections
- Authenticate sessions
- Forward outgoing messages to routing layer
- Push incoming messages to users
Gateways are stateless and horizontally scalable behind load balancers.
Sticky sessions or consistent hashing are often used to keep users on the same node.
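The consistent-hashing approach can be sketched with a minimal hash ring that maps each user to a gateway node. This is a simplified illustration (node names, virtual-node count, and the MD5 choice are all assumptions, not a prescribed implementation):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 used only for even key distribution, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring mapping users to gateway nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` virtual positions to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node at/after the key's hash.
        idx = bisect.bisect(self._keys, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["gw-1", "gw-2", "gw-3"])
```

Because the mapping is deterministic, a user's reconnects land on the same gateway, and adding or removing a node only remaps the keys near its ring positions.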
Message Routing & Fanout
When a user sends a message:
1. The gateway forwards the message to the routing service
2. The router determines the recipients (1-to-1 or group)
3. The message is published to a distributed broker (Kafka, Pulsar, Redis Streams)
4. Subscribed gateways push the message to online recipients
Partitioning by conversation_id preserves ordering within chats.
This decoupling enables massive parallelism.
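The partitioning step can be illustrated with a small sketch. The partition count is an assumption here; real brokers configure this per topic:

```python
import hashlib

NUM_PARTITIONS = 64  # assumption; brokers configure this per topic

def partition_for(conversation_id: str) -> int:
    """All messages in a conversation hash to the same partition,
    so the broker preserves their relative order."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

Two messages in the same conversation always land in the same partition and are consumed in publish order; messages in different conversations can spread across all 64 partitions in parallel.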
Presence & Online Status
A presence service tracks:
- user online/offline
- active devices
- last seen timestamp
Presence data is typically stored in an in-memory datastore such as Redis, with each entry refreshed by client heartbeats and expired via a TTL.
This allows:
- routing messages to live connections
- triggering push notifications for offline users
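The TTL-heartbeat pattern can be shown with an in-memory stand-in for the Redis store. The 30-second TTL is an assumption; in practice it is tuned to a small multiple of the client heartbeat interval:

```python
import time

HEARTBEAT_TTL = 30.0  # seconds; assumption — tune to the heartbeat interval

class PresenceService:
    """In-memory stand-in for a Redis-style presence store with TTL expiry."""

    def __init__(self):
        self._last_seen = {}  # user_id -> (device_id, unix timestamp)

    def heartbeat(self, user_id: str, device_id: str) -> None:
        # Each heartbeat refreshes the entry, effectively resetting its TTL.
        self._last_seen[user_id] = (device_id, time.time())

    def is_online(self, user_id: str) -> bool:
        entry = self._last_seen.get(user_id)
        if entry is None:
            return False
        return time.time() - entry[1] < HEARTBEAT_TTL

    def last_seen(self, user_id: str):
        entry = self._last_seen.get(user_id)
        return entry[1] if entry else None
```

The router consults `is_online` to decide between pushing to a live connection and handing the message to the notification service.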
Message Storage Strategy
To support history and offline delivery:
Hot Storage
Recent messages are stored in a fast NoSQL database (e.g., Cassandra, DynamoDB).
Cold Storage
Older messages are archived to object storage or a data warehouse.
Indexes are usually keyed by:
- conversation_id
- timestamp
This enables fast pagination and retrieval.
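One way to picture the index is as a composite sort key of (conversation_id, timestamp), with the message ID breaking ties. A minimal pagination sketch over an in-memory list standing in for the table (the row shape here is a hypothetical, not a prescribed schema):

```python
def message_key(row: dict) -> tuple:
    """Sort key matching the index: conversation first, then time.
    The message ID breaks ties within the same millisecond."""
    return (row["conversation_id"], row["timestamp_ms"], row["message_id"])

def page_before(rows, conversation_id: str, before_ts_ms: int, limit: int = 50):
    """Return up to `limit` messages in the conversation older than
    `before_ts_ms`, oldest first — one page of scroll-back history."""
    older = [
        r for r in rows
        if r["conversation_id"] == conversation_id
        and r["timestamp_ms"] < before_ts_ms
    ]
    older.sort(key=message_key)
    return older[-limit:]  # the newest `limit` of the older messages
```

A real store performs the filter and sort via the index rather than a scan, but the key shape is the same: everything needed for "older than X in conversation Y" lives in one clustered range.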
Offline Message Handling
If recipients are offline:
- The message is persisted
- A push notification is sent (optional)
- The message is delivered when the user reconnects
Gateways fetch undelivered messages on reconnect.
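The reconnect path can be sketched as a per-user pending queue that the gateway drains when the user comes back online. In production the queue is backed by the durable message store rather than process memory:

```python
from collections import defaultdict, deque

class OfflineQueue:
    """Per-user queue of undelivered messages, drained on reconnect.
    An in-memory sketch; real systems back this with durable storage."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def enqueue(self, user_id: str, message: str) -> None:
        self._pending[user_id].append(message)

    def drain(self, user_id: str) -> list:
        # Atomically hand back everything pending, in arrival order.
        return list(self._pending.pop(user_id, ()))
```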
Scaling Strategies
Horizontal Scaling
- Stateless gateways
- Partitioned brokers
- Sharded databases
Load Isolation
- Separate clusters for heavy group chats
- Priority lanes for real-time delivery
Caching
- Recent conversations cached in Redis
- User metadata cached aggressively
Reliability & Failure Handling
At-Least-Once Delivery
Messages are persisted before fanout.
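The ordering of those two steps is the whole guarantee, and it fits in a few lines. `DurableLog` here is a hypothetical stand-in for the broker append or database insert; `publish` stands in for fanout:

```python
class DurableLog:
    """Stand-in for a durable write (broker append or DB insert)."""

    def __init__(self):
        self.records = []

    def append(self, msg: str) -> None:
        self.records.append(msg)

def send_message(log: DurableLog, publish, msg: str) -> None:
    log.append(msg)  # 1. persist first: once this returns, the message survives
    publish(msg)     # 2. then fan out; a crash between the two steps means the
                     #    message is redelivered on recovery, never lost
```

A crash after step 1 but before step 2 triggers redelivery on recovery, which is exactly why this is at-least-once rather than exactly-once, and why the dedup step below exists.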
Idempotency
Clients ignore duplicate message IDs.
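Client-side dedup is a bounded set of recently seen message IDs. The capacity limit below is an assumption to keep memory finite:

```python
from collections import OrderedDict

class Deduplicator:
    """Drops repeated message IDs so at-least-once delivery looks
    exactly-once to the UI. Capacity bound keeps memory finite."""

    def __init__(self, capacity: int = 10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def accept(self, message_id: str) -> bool:
        if message_id in self._seen:
            return False  # duplicate: already rendered, ignore it
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```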
Backpressure
Gateways throttle slow consumers to avoid overload.
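One common throttling mechanism is a bounded per-connection send buffer: when a slow consumer lets it fill, further sends are rejected so the router can slow down, fall back to the offline queue, or close the connection. A minimal sketch, with the buffer size as an assumed tuning parameter:

```python
from collections import deque

class BoundedOutbox:
    """Bounded per-connection send buffer for backpressure."""

    def __init__(self, max_size: int = 1024):
        self._buf = deque()
        self._max = max_size

    def push(self, msg: str) -> bool:
        if len(self._buf) >= self._max:
            return False  # buffer full: signal backpressure upstream
        self._buf.append(msg)
        return True

    def pop(self):
        # Called by the socket writer as the connection drains.
        return self._buf.popleft() if self._buf else None
```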
Replication
Brokers and databases replicate across availability zones.
Observability
Critical metrics:
- message delivery latency
- connection counts
- broker lag
- failed sends
- offline queue depth
Alerts trigger on abnormal spikes.
Tradeoffs & Bottlenecks
| Area | Tradeoff |
|---|---|
| Strong ordering | Reduces parallelism |
| Persistent connections | Higher memory cost |
| Durability | Slight latency increase |
| Global scale | Cross-region replication complexity |
Every chat system balances these differently based on product needs.
Conclusion
A scalable real-time chat system is fundamentally a distributed streaming platform built on persistent connections, message brokers, and durable storage.
By separating connection handling, routing, presence, and storage — and by designing for failure from day one — teams can support millions of users while maintaining real-time performance.
These architectural patterns underpin the messaging platforms used by modern consumer and SaaS applications at global scale.