Real-time chat systems power everything from customer support platforms to social messaging apps used by billions of people. While sending a single message seems simple, delivering millions of messages per second with ordering guarantees, low latency, and high availability is a complex distributed systems problem.
At scale, even moderate usage grows quickly. For example, 20 million users sending just 30 messages per day already generate 600 million messages daily — nearly 7,000 per second on average, with peak traffic far higher.
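The arithmetic above is easy to sketch as a back-of-envelope calculation. The peak multiplier below is an assumption for illustration; real peak-to-average ratios depend heavily on usage patterns.

```python
# Back-of-envelope throughput estimate using the numbers above.
users = 20_000_000
messages_per_user_per_day = 30
seconds_per_day = 24 * 60 * 60  # 86,400

daily_messages = users * messages_per_user_per_day  # 600,000,000
avg_per_second = daily_messages / seconds_per_day   # ~6,944 msg/s

# Peak load is commonly sized as a multiple of the average;
# the 3x factor here is an assumption, not a universal rule.
peak_per_second = avg_per_second * 3
print(f"{avg_per_second:,.0f} avg msg/s, ~{peak_per_second:,.0f} at peak")
```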
Designing for this scale requires careful separation of concerns, asynchronous processing, and strong consistency strategies.
Core Requirements
Before jumping into components, define what “good” looks like:
- Low Latency: Messages should appear in real time (sub-second delivery).
- High Throughput: Support millions of concurrent connections.
- Ordering Guarantees: Messages in a conversation must arrive in sequence.
- Reliability: No message loss, even during failures.
- Offline Support: Messages must be delivered when users reconnect.
- Scalability: Horizontal scaling without central bottlenecks.
- Observability: Real-time monitoring of delivery health.
High-Level Architecture
A scalable chat system is built around persistent connections, distributed message routing, and durable storage.
Typical core components include:
- Connection Gateways (WebSocket servers)
- Message Router / Broker
- Presence Service
- Message Storage
- Notification Service (for offline users)
Each layer scales independently.
Connection Layer (WebSocket Gateways)
Clients establish long-lived WebSocket connections to gateway servers.
Responsibilities:
- Maintain active user connections
- Authenticate sessions
- Forward outgoing messages to routing layer
- Push incoming messages to users
Gateways are stateless and horizontally scalable behind load balancers.
Sticky sessions or consistent hashing are often used to keep users on the same node.
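The consistent-hashing approach can be sketched with a minimal hash ring that maps each user to a gateway node. This is a simplified illustration (node names, virtual-node count, and the MD5 choice are all assumptions, not a prescribed implementation):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 used only for even key distribution, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring mapping users to gateway nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` virtual positions to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node at/after the key's hash.
        idx = bisect.bisect(self._keys, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["gw-1", "gw-2", "gw-3"])
```

Because the mapping is deterministic, a user's reconnects land on the same gateway, and adding or removing a node only remaps the keys near its ring positions.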
Message Routing & Fanout
When a user sends a message:
1. The gateway forwards the message to the routing service
2. The router determines the recipients (1-to-1 or group)
3. The message is published to a distributed broker (Kafka, Pulsar, Redis Streams)
4. Subscribed gateways push the message to online recipients
Partitioning by conversation_id preserves ordering within chats.
This decoupling enables massive parallelism.
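The partitioning step can be illustrated with a small sketch. The partition count is an assumption here; real brokers configure this per topic:

```python
import hashlib

NUM_PARTITIONS = 64  # assumption; brokers configure this per topic

def partition_for(conversation_id: str) -> int:
    """All messages in a conversation hash to the same partition,
    so the broker preserves their relative order."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

Two messages in the same conversation always land in the same partition and are consumed in publish order; messages in different conversations can spread across all 64 partitions in parallel.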
Presence & Online Status
A presence service tracks:
- user online/offline
- active devices
- last seen timestamp
Presence data is typically stored in an in-memory datastore such as Redis, with each entry refreshed by client heartbeats and expired via a TTL.
This allows:
- routing messages to live connections
- triggering push notifications for offline users
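The TTL-heartbeat pattern can be shown with an in-memory stand-in for the Redis store. The 30-second TTL is an assumption; in practice it is tuned to a small multiple of the client heartbeat interval:

```python
import time

HEARTBEAT_TTL = 30.0  # seconds; assumption — tune to the heartbeat interval

class PresenceService:
    """In-memory stand-in for a Redis-style presence store with TTL expiry."""

    def __init__(self):
        self._last_seen = {}  # user_id -> (device_id, unix timestamp)

    def heartbeat(self, user_id: str, device_id: str) -> None:
        # Each heartbeat refreshes the entry, effectively resetting its TTL.
        self._last_seen[user_id] = (device_id, time.time())

    def is_online(self, user_id: str) -> bool:
        entry = self._last_seen.get(user_id)
        if entry is None:
            return False
        return time.time() - entry[1] < HEARTBEAT_TTL

    def last_seen(self, user_id: str):
        entry = self._last_seen.get(user_id)
        return entry[1] if entry else None
```

The router consults `is_online` to decide between pushing to a live connection and handing the message to the notification service.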
Message Storage Strategy
To support history and offline delivery:
Hot Storage
Recent messages are stored in a fast NoSQL database (e.g., Cassandra, DynamoDB).
Cold Storage
Older messages are archived to object storage or a data warehouse.
Indexes are usually keyed by:
- conversation_id
- timestamp
This enables fast pagination and retrieval.
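One way to picture the index is as a composite sort key of (conversation_id, timestamp), with the message ID breaking ties. A minimal pagination sketch over an in-memory list standing in for the table (the row shape here is a hypothetical, not a prescribed schema):

```python
def message_key(row: dict) -> tuple:
    """Sort key matching the index: conversation first, then time.
    The message ID breaks ties within the same millisecond."""
    return (row["conversation_id"], row["timestamp_ms"], row["message_id"])

def page_before(rows, conversation_id: str, before_ts_ms: int, limit: int = 50):
    """Return up to `limit` messages in the conversation older than
    `before_ts_ms`, oldest first — one page of scroll-back history."""
    older = [
        r for r in rows
        if r["conversation_id"] == conversation_id
        and r["timestamp_ms"] < before_ts_ms
    ]
    older.sort(key=message_key)
    return older[-limit:]  # the newest `limit` of the older messages
```

A real store performs the filter and sort via the index rather than a scan, but the key shape is the same: everything needed for "older than X in conversation Y" lives in one clustered range.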
Offline Message Handling
If recipients are offline:
- The message is persisted
- A push notification is sent (optional)
- The message is delivered when the user reconnects
Gateways fetch undelivered messages on reconnect.
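The reconnect path can be sketched as a per-user pending queue that the gateway drains when the user comes back online. In production the queue is backed by the durable message store rather than process memory:

```python
from collections import defaultdict, deque

class OfflineQueue:
    """Per-user queue of undelivered messages, drained on reconnect.
    An in-memory sketch; real systems back this with durable storage."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def enqueue(self, user_id: str, message: str) -> None:
        self._pending[user_id].append(message)

    def drain(self, user_id: str) -> list:
        # Atomically hand back everything pending, in arrival order.
        return list(self._pending.pop(user_id, ()))
```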
Scaling Strategies
Horizontal Scaling
- Stateless gateways
- Partitioned brokers
- Sharded databases
Load Isolation
- Separate clusters for heavy group chats
- Priority lanes for real-time delivery
Caching
- Recent conversations cached in Redis
- User metadata cached aggressively
Reliability & Failure Handling
At-Least-Once Delivery
Messages are persisted before fanout.
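The ordering of those two steps is the whole guarantee, and it fits in a few lines. `DurableLog` here is a hypothetical stand-in for the broker append or database insert; `publish` stands in for fanout:

```python
class DurableLog:
    """Stand-in for a durable write (broker append or DB insert)."""

    def __init__(self):
        self.records = []

    def append(self, msg: str) -> None:
        self.records.append(msg)

def send_message(log: DurableLog, publish, msg: str) -> None:
    log.append(msg)  # 1. persist first: once this returns, the message survives
    publish(msg)     # 2. then fan out; a crash between the two steps means the
                     #    message is redelivered on recovery, never lost
```

A crash after step 1 but before step 2 triggers redelivery on recovery, which is exactly why this is at-least-once rather than exactly-once, and why the dedup step below exists.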
Idempotency
Clients ignore duplicate message IDs.
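Client-side dedup is a bounded set of recently seen message IDs. The capacity limit below is an assumption to keep memory finite:

```python
from collections import OrderedDict

class Deduplicator:
    """Drops repeated message IDs so at-least-once delivery looks
    exactly-once to the UI. Capacity bound keeps memory finite."""

    def __init__(self, capacity: int = 10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def accept(self, message_id: str) -> bool:
        if message_id in self._seen:
            return False  # duplicate: already rendered, ignore it
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```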
Backpressure
Gateways throttle slow consumers to avoid overload.
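One common throttling mechanism is a bounded per-connection send buffer: when a slow consumer lets it fill, further sends are rejected so the router can slow down, fall back to the offline queue, or close the connection. A minimal sketch, with the buffer size as an assumed tuning parameter:

```python
from collections import deque

class BoundedOutbox:
    """Bounded per-connection send buffer for backpressure."""

    def __init__(self, max_size: int = 1024):
        self._buf = deque()
        self._max = max_size

    def push(self, msg: str) -> bool:
        if len(self._buf) >= self._max:
            return False  # buffer full: signal backpressure upstream
        self._buf.append(msg)
        return True

    def pop(self):
        # Called by the socket writer as the connection drains.
        return self._buf.popleft() if self._buf else None
```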
Replication
Brokers and databases replicate across availability zones.
Observability
Critical metrics:
- message delivery latency
- connection counts
- broker lag
- failed sends
- offline queue depth
Alerts trigger on abnormal spikes.
Tradeoffs & Bottlenecks
| Area | Tradeoff |
|---|---|
| Strong ordering | Reduces parallelism |
| Persistent connections | Higher memory cost |
| Durability | Slight latency increase |
| Global scale | Cross-region replication complexity |
Every chat system balances these differently based on product needs.
Conclusion
A scalable real-time chat system is fundamentally a distributed streaming platform built on persistent connections, message brokers, and durable storage.
By separating connection handling, routing, presence, and storage — and by designing for failure from day one — teams can support millions of users while maintaining real-time performance.
These architectural patterns underpin the messaging platforms used by modern consumer and SaaS applications at global scale.