
How to Design a Scalable Real-Time Chat System

February 13, 2026 · 10 min read

Real-time chat systems power everything from customer support platforms to social messaging apps used by billions of users. While sending a single message seems simple, delivering millions of messages per second with ordering guarantees, low latency, and high availability is a complex distributed systems problem.

At scale, even moderate usage grows quickly. For example, 20 million users sending just 30 messages per day already generate 600 million messages daily — nearly 7,000 per second on average, with peak traffic far higher.
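The back-of-envelope numbers above can be checked with a few lines of arithmetic:

```python
users = 20_000_000
msgs_per_user_per_day = 30

daily = users * msgs_per_user_per_day      # total messages per day
avg_per_sec = daily / 86_400               # 86,400 seconds in a day

print(f"{daily:,} messages/day, ~{avg_per_sec:,.0f} msg/s average")
```

Peak traffic is typically several multiples of this average, so capacity planning should target the peak, not the mean.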

Designing for this scale requires careful separation of concerns, asynchronous processing, and strong consistency strategies.

Core Requirements

Before jumping into components, define what “good” looks like:

  • Low Latency: Messages should appear in real time (sub-second delivery).
  • High Throughput: Support millions of concurrent connections.
  • Ordering Guarantees: Messages in a conversation must arrive in sequence.
  • Reliability: No message loss, even during failures.
  • Offline Support: Messages must be delivered when users reconnect.
  • Scalability: Horizontal scaling without central bottlenecks.
  • Observability: Real-time monitoring of delivery health.

High-Level Architecture

A scalable chat system is built around persistent connections, distributed message routing, and durable storage.

Typical core components include:

  • Connection Gateways (WebSocket servers)
  • Message Router / Broker
  • Presence Service
  • Message Storage
  • Notification Service (for offline users)

Each layer scales independently.

Connection Layer (WebSocket Gateways)

Clients establish long-lived WebSocket connections to gateway servers.

Responsibilities:

  • Maintain active user connections
  • Authenticate sessions
  • Forward outgoing messages to routing layer
  • Push incoming messages to users

Gateways hold only in-memory connection state (no durable data), so they scale horizontally behind load balancers.

Sticky sessions or consistent hashing are often used to keep a given user's connection pinned to the same node.
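One way to pin users to gateway nodes is a consistent hash ring. The sketch below is illustrative (the node names and virtual-node count are made up, not tied to any particular load balancer): adding or removing a gateway only remaps a small fraction of users.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps user IDs to gateway nodes. Each physical node is
    placed on the ring many times ("virtual nodes") to smooth
    out the distribution."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash.
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._ring)
        return self._ring[idx][1]
```

A user always lands on the same node as long as the node set is unchanged, which is what makes reconnect routing predictable.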

Message Routing & Fanout

When a user sends a message:

  1. Gateway forwards message to routing service
  2. Router determines recipients (1-to-1 or group)
  3. Message is published to a distributed broker (Kafka, Pulsar, Redis Streams)
  4. Subscribed gateways push messages to online recipients

Partitioning by conversation_id preserves ordering within chats.

This decoupling enables massive parallelism.
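The partitioning step can be sketched as a pure function: hashing the conversation ID means every message in one chat lands on the same broker partition, so a single consumer sees them in publish order. The partition count here is an illustrative assumption.

```python
import hashlib

NUM_PARTITIONS = 64  # illustrative; real clusters size this to throughput

def partition_for(conversation_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """All messages in one conversation map to one partition,
    preserving per-conversation ordering."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note the tradeoff: one very hot group chat is confined to a single partition, which caps its parallelism.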

Presence & Online Status

A presence service tracks:

  • user online/offline
  • active devices
  • last seen timestamp

Typically stored in an in-memory datastore like Redis with TTL heartbeats.

This allows:

  • routing messages to live connections
  • triggering push notifications for offline users
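The TTL-heartbeat pattern can be illustrated with an in-memory stand-in for the Redis store (the class name, TTL value, and injected clock are assumptions for the sketch):

```python
import time

class PresenceService:
    """In-memory stand-in for a Redis-style presence store.
    Each heartbeat refreshes a TTL; a user whose entry has
    expired is considered offline."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # user_id -> last heartbeat time

    def heartbeat(self, user_id: str) -> None:
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        seen = self._last_seen.get(user_id)
        return seen is not None and self._clock() - seen <= self._ttl

    def last_seen(self, user_id: str):
        return self._last_seen.get(user_id)
```

With Redis, the same effect comes from `SET user:<id> ... EX <ttl>` refreshed on each heartbeat; the expiry handles crashed clients that never send a clean disconnect.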

Message Storage Strategy

To support history and offline delivery:

Hot Storage

Recent messages are stored in a fast NoSQL store (e.g., Cassandra or DynamoDB).

Cold Storage

Older messages archived to object storage or data warehouses.

Indexes are usually built by:

  • conversation_id
  • timestamp

This enables fast pagination and retrieval.
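The (conversation_id, timestamp) index maps naturally onto the "scroll up for history" query. Here is a toy in-memory stand-in for a wide-column store (the function names and page size are assumptions, not a real database API):

```python
from bisect import bisect_left

# Toy stand-in for a store keyed by (conversation_id, timestamp):
# each conversation's messages are kept sorted by time.
store = {}  # conversation_id -> sorted list of (timestamp, message)

def append(conversation_id: str, timestamp: int, message: str) -> None:
    rows = store.setdefault(conversation_id, [])
    rows.append((timestamp, message))
    rows.sort()  # a real store keeps rows sorted on write

def page_before(conversation_id: str, before_ts: int, limit: int = 50):
    """Return up to `limit` messages strictly older than `before_ts`,
    newest first: the typical history-pagination query."""
    rows = store.get(conversation_id, [])
    idx = bisect_left(rows, (before_ts,))
    return list(reversed(rows[max(0, idx - limit):idx]))
```

In Cassandra or DynamoDB the equivalent is a partition key of `conversation_id` with a clustering/sort key of `timestamp`, queried with a range condition and a limit.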

Offline Message Handling

If recipients are offline:

  • Messages are persisted
  • Push notification is sent (optional)
  • Delivered when user reconnects

Gateways fetch undelivered messages on reconnect.
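The persist-then-drain flow can be sketched as a per-user pending queue (a minimal in-memory sketch; a production system would back this with the message store itself):

```python
from collections import defaultdict

class OfflineQueue:
    """Per-user queue of undelivered messages, drained when the
    user's gateway reports a reconnect."""

    def __init__(self):
        self._pending = defaultdict(list)

    def enqueue(self, user_id: str, message: str) -> None:
        self._pending[user_id].append(message)

    def drain(self, user_id: str) -> list:
        # Atomically hand over and clear the user's backlog.
        return self._pending.pop(user_id, [])
```

Draining must be idempotent in practice: if the client crashes mid-delivery, the same backlog is fetched again, which is why duplicate suppression (below) matters.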

Scaling Strategies

Horizontal Scaling

  • Stateless gateways
  • Partitioned brokers
  • Sharded databases

Load Isolation

  • Separate clusters for heavy group chats
  • Priority lanes for real-time delivery

Caching

  • Recent conversations cached in Redis
  • User metadata cached aggressively

Reliability & Failure Handling

At-Least-Once Delivery

Messages are persisted before fanout.

Idempotency

Clients ignore duplicate message IDs.
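With at-least-once delivery, duplicate suppression on the client is usually a bounded window of recently seen message IDs. A minimal sketch (the capacity is an assumption; real clients also persist the high-water mark):

```python
from collections import OrderedDict

class Deduplicator:
    """Remembers recently seen message IDs in a bounded LRU
    window and rejects repeats."""

    def __init__(self, capacity: int = 10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def accept(self, message_id: str) -> bool:
        if message_id in self._seen:
            self._seen.move_to_end(message_id)
            return False  # duplicate: drop it
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict oldest ID
        return True
```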

Backpressure

Gateways throttle slow consumers to avoid overload.
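One common backpressure primitive is a bounded per-connection send buffer: when a slow client cannot keep up, new messages are rejected (or the connection is dropped) rather than exhausting gateway memory. A sketch, with an assumed buffer size:

```python
import queue

class SlowConsumerBuffer:
    """Bounded send buffer for one client connection."""

    def __init__(self, max_messages: int = 1_000):
        self._q = queue.Queue(maxsize=max_messages)

    def offer(self, message: str) -> bool:
        try:
            self._q.put_nowait(message)
            return True
        except queue.Full:
            return False  # signal backpressure upstream
```

Whether to drop messages, close the connection, or pause the upstream consumer on a full buffer is a product decision; all three are used in practice.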

Replication

Brokers and databases replicate across availability zones.

Observability

Critical metrics:

  • message delivery latency
  • connection counts
  • broker lag
  • failed sends
  • offline queue depth

Alerts trigger on abnormal spikes.

Tradeoffs & Bottlenecks

| Area | Tradeoff |
| --- | --- |
| Strong ordering | Reduces parallelism |
| Persistent connections | Higher memory cost |
| Durability | Slight latency increase |
| Global scale | Cross-region replication complexity |

Every chat system balances these differently based on product needs.

Conclusion

A scalable real-time chat system is fundamentally a distributed streaming platform built on persistent connections, message brokers, and durable storage.

By separating connection handling, routing, presence, and storage — and by designing for failure from day one — teams can support millions of users while maintaining real-time performance.

These architectural patterns underpin the messaging platforms used by modern consumer and SaaS applications at global scale.
