Designing a Scalable Online Chess Platform: A System Design Case Study

Online chess looks deceptively simple.

At first glance, a chess app feels like a small turn-based game. Two players connect, one player makes a move, the other player responds, and the game continues until checkmate, resignation, timeout, stalemate, or draw.

But once we design it at scale, it becomes a very interesting system design problem.

The hard part is not the size of a chess move. A move like e2e4 is tiny. The hard part is the combination of real-time communication, server-authoritative game state, matchmaking, long-lived WebSocket connections, fault tolerance, reconnection handling, spectator fanout, cheating detection, rating updates, game analysis, and safe deployments without causing reconnect storms.

This case study walks through the design of a scalable online chess website or app similar to Chess.com or Lichess. The goal is to make this both a blog-style explanation and a revision document for system design interviews.


1. Problem Statement

We want to design an online chess platform where users can come online, find opponents, play real-time chess games, chat, review games later, and receive accurate game results.

The system should support:

  • Two online players playing a chess game.
  • A matching system that pairs players based on rating or challenge preferences.
  • A game engine that validates moves and manages game state.
  • A chat system between players.
  • A move log for every game.
  • Game termination through checkmate, timeout, resignation, stalemate, forfeit, or draw.
  • Post-game analysis.
  • Rating updates.
  • Cheating detection.
  • Optional spectator support.

The system must be low-latency, low-bandwidth, fault-tolerant, and consistent.


2. Functional Requirements

The core functional requirements are:

  • Users should be able to create or accept chess challenges.
  • The system should match players against suitable opponents.
  • Once a match is found, the system should create a game and assign one player as white and the other as black.
  • White should make the first move.
  • Both players should make moves one after the other.
  • The system should validate every move.
  • Players should not be able to cancel or roll back moves.
  • The system should maintain a log of all moves.
  • The system should detect game-ending states such as checkmate, stalemate, resignation, timeout, or forfeit.
  • The system should support in-game chat.
  • The system should allow users to retrieve game state.
  • The system should optionally support spectators.
  • The system should support post-game analysis.
  • The system should support cheating detection.
  • The system should update player ratings after the game ends.

Important: The client cannot be trusted to validate chess moves. The browser or mobile app can be modified, hacked, or made to send illegal moves. Therefore, all important game validation must happen on the server.


3. Non-Functional Requirements

The main non-functional requirements are:

  • Low latency: Moves must feel real-time.
  • Low bandwidth: Keep message payloads small.
  • Fault tolerance: Server crashes should not result in lost games.
  • Consistency: One authoritative sequence of moves per game.
  • Scalability: Handle millions of connected users and active games.
  • High availability: The matchmaking and gameplay paths must remain operational.
  • Safe deployments: Deployments must not disconnect millions of active players at once.
  • Reconnect support: Seamless session restoration after transient network drops.
  • Observability: Real-time metrics on move latency, reconnect rates, and engine health.

For chess, "low latency" does not mean the same thing as a first-person shooter game. We do not need 20 ms server reaction time. But the move should reach the opponent quickly enough that the game feels real-time. For online chess, a reasonable target would be:

  • Move acknowledgement $p_{95}$: under 100–200 ms.
  • Opponent move delivery $p_{95}$: under 200–500 ms.
  • Reconnect recovery: ideally within a few seconds.

The system must also be consistent. For one chess game, there must be exactly one valid sequence of moves. We cannot allow two different realities where one server thinks the game is at move 24 and another server thinks it is at move 25. The game engine must be authoritative.


4. Capacity Estimation

Let us start with a simple estimate.

Assumptions:

  • Daily active users (DAU): 1 million.
  • Average arrival rate: $1,000,000 / 1440 \approx 694$ users per minute, or roughly 700–900 per minute once uneven traffic is allowed for.
  • Average game duration: 5 minutes.
  • Normal connected users: around 5,000.
  • Peak connected users: around 20,000.

Pending match requests may be much smaller than total online users because challenges expire quickly and many users are already in games.

If we assume each pending match request takes 1 KB, then: $$20,000 \text{ pending requests} \times 1\text{ KB} = 20\text{ MB}$$

This is small enough to keep in memory.

If the matching engine uses a Balanced BST, TreeSet, Redis Sorted Set, or similar range-query structure, lookup complexity is $O(\log N)$. For $N = 20,000$ pending requests, $\log_2(20,000) \approx 14$ operations. So the matcher is not likely to be the hardest bottleneck.

The bigger challenge is maintaining long-lived WebSocket connections and safely routing messages between users. Chess is usually connection-heavy but not bandwidth-heavy. A chess move is tiny. Even if each move payload is 1 KB after metadata and overhead, the move rate is still moderate compared to video streaming or telemetry-heavy games.

Example:

  • Suppose the platform grows well beyond the baseline above and 200,000 active games are running.
  • If each game averages one move every 10 seconds, that is: $$200,000 / 10 = 20,000 \text{ moves per second}$$

This is significant, but still manageable with sharded game servers, efficient move logs, and lightweight payloads.
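
The estimates in this section can be re-derived in a few lines. This is only a back-of-envelope sketch using the figures quoted above (20,000 pending requests at 1 KB each, 200,000 games, one move every 10 seconds, 1 KB per move); the variable names are illustrative.

```python
import math

# Pending match pool: 20,000 requests at 1 KB each (decimal units, as in the text).
pending_requests = 20_000
bytes_per_request = 1_000
pending_memory_mb = pending_requests * bytes_per_request / 1_000_000

# Depth of a balanced index over the pending pool.
index_depth = math.log2(pending_requests)

# Move rate for the larger example: one move every 10 seconds per game.
active_games = 200_000
seconds_per_move = 10
moves_per_second = active_games / seconds_per_move

# Worst-case payload of 1 KB per move gives the aggregate move bandwidth.
move_bandwidth_mb_s = moves_per_second * 1_000 / 1_000_000

print(f"pending pool:   {pending_memory_mb:.0f} MB")
print(f"index depth:    ~{index_depth:.1f} comparisons")
print(f"move rate:      {moves_per_second:,.0f} moves/s")
print(f"move bandwidth: {move_bandwidth_mb_s:.0f} MB/s")
```

Even with the pessimistic 1 KB payload, aggregate move traffic stays around 20 MB/s, which supports the point that connections, not bytes, are the scarce resource.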

The harder part is:

  1. Maintaining millions of live connections.
  2. Handling reconnect storms.
  3. Routing messages to the correct gateway.
  4. Recovering game state after crashes.
  5. Avoiding spectator fanout from hurting players.
  6. Ensuring every move is applied exactly once, even when clients retry.

5. High-Level Architecture

At a high level, the architecture separates the critical low-latency gameplay path from asynchronous downstream services.

[Figure: High-level architecture for a scalable online chess platform, showing WebSocket gateways, connection routing, matchmaking, game engine shards, and an isolated spectator fanout path.]

The gameplay path (move validation, clock updates, move persistence, and opponent notification) is synchronous, thin, and low-latency.

Everything else—game analysis, rating updates, cheating checks, emails, and long-term analytics—should be processed asynchronously off an event stream.


6. Why WebSockets?

A normal HTTP request-response model is not enough for active chess gameplay.

When Player A makes a move, Player B should receive it immediately. The server needs to push data to the opponent without waiting for the opponent to poll.

With polling, Player B asks the server every second, "Did my opponent move?" This creates unnecessary traffic and poor latency:

  • If the polling interval is 1 second, the opponent may see moves up to 1 second late.
  • If the polling interval is very small, the server gets hammered by repeated useless requests.

WebSockets solve this by keeping a long-lived bidirectional connection between the client and server.

With WebSockets:

  • Client can send a move to the server.
  • Server can push the move to the opponent.
  • Server can notify about resignation, timeout, draw offer, disconnect, reconnect, chat messages, and game end.

The gameplay experience becomes smoother. However, WebSockets introduce new complexity. HTTP servers are mostly stateless. WebSocket servers are stateful because they hold long-lived connections. This affects load balancing, deployment, failure recovery, connection routing, and capacity planning.


7. Keep the Gateway Dumb

One of the most important design decisions is to keep the WebSocket gateway as dumb and stable as possible.

The gateway should mainly do:

  • Accept WebSocket connections.
  • Verify authentication tokens.
  • Maintain socket lifecycle.
  • Handle ping/pong heartbeats.
  • Track connection metadata.
  • Forward inbound messages to the connection service.
  • Send outbound messages to connected clients.

It should not contain frequently changing business logic. The gateway should not deeply understand chess move validation, matchmaking rules, profile logic, rating updates, anti-cheat logic, or game analysis.

Why?

Because WebSocket gateways hold long-lived connections. If we frequently deploy gateway changes, we risk disconnecting many users. If thousands of clients disconnect at once and reconnect immediately, we create a thundering herd.

A better design separates concerns:

  • Dumb WebSocket Gateway: Stable and rarely changes.
  • Connection Service: Orchestrates the mappings.
  • Downstream Services (Game / Match / Profile / Chat): Frequently changing business logic.

8. Connection Service

The connection service is the smart layer behind the gateway.

It handles:

  • User-to-connection mapping.
  • Connection-to-gateway mapping.
  • Protocol versioning.
  • Message routing & fanout coordination.
  • Presence & session restoration.
  • Backpressure & reconnect handling.

For example, the gateway may receive this message:

{
  "type": "MOVE_SUBMIT",
  "gameId": "game-123",
  "move": "e2e4",
  "clientMoveId": "move-789",
  "expectedMoveNumber": 12
}

The gateway does not validate the chess move. It wraps the message with connection metadata:

{
  "connectionId": "conn-456",
  "userId": "user-1",
  "gatewayId": "gw-7",
  "receivedAt": 1710000000,
  "rawMessage": {
    "type": "MOVE_SUBMIT",
    "gameId": "game-123",
    "move": "e2e4",
    "clientMoveId": "move-789",
    "expectedMoveNumber": 12
  }
}

Then the connection service routes it to the correct game engine shard.

The connection service maintains mappings like:

  • userId -> (gatewayId, connectionId)
  • gameId -> gameEngineShardId
  • connectionId -> userId
  • userId -> activeGameId

Redis is a common choice for this type of ephemeral mapping because it supports fast reads/writes and TTLs. Mappings should expire automatically if heartbeats stop.
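
As a rough illustration of the TTL behaviour described above, here is an in-memory stand-in for the registry. It is not a Redis client: the class name `ConnectionRegistry`, its methods, and the injectable clock are all assumptions made for this sketch. The idea it shows is that a mapping survives only as long as heartbeats keep extending its expiry.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ConnectionRegistry:
    """In-memory stand-in for a TTL-based userId -> (gatewayId, connectionId) map."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # userId -> (gatewayId, connectionId, expiresAt)
        self._by_user: Dict[str, Tuple[str, str, float]] = {}

    def register(self, user_id: str, gateway_id: str, connection_id: str) -> None:
        expires_at = self._clock() + self._ttl
        self._by_user[user_id] = (gateway_id, connection_id, expires_at)

    def lookup(self, user_id: str) -> Optional[Tuple[str, str]]:
        entry = self._by_user.get(user_id)
        if entry is None:
            return None
        gateway_id, connection_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._by_user[user_id]  # lazy expiry on read
            return None
        return gateway_id, connection_id

    def heartbeat(self, user_id: str) -> bool:
        """Extend the TTL. Returns False if the mapping has already expired."""
        current = self.lookup(user_id)
        if current is None:
            return False
        self.register(user_id, current[0], current[1])
        return True
```

In a Redis-backed version the same effect would come from setting a key with an expiry and refreshing that expiry on each heartbeat.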


9. Matching Engine

The matching engine is responsible for pairing players.

A user may create a challenge with constraints like:

  • Time control: 3+0 (Blitz), 5+0, 10+0 (Rapid), etc.
  • Opponent rating: Minimum/maximum bounds.
  • Mode: Rated or unrated.

Example challenge:

{
  "challengeId": "challenge-123",
  "userId": "user-1",
  "rating": 1500,
  "minOpponentRating": 1400,
  "maxOpponentRating": 1600,
  "timeControl": "5+0",
  "rated": true,
  "createdAt": 1710000000,
  "expiresAt": 1710000030
}

A simple approach:

  1. Store pending challenges in memory.
  2. Index them by rating using a TreeSet, Balanced BST, or Redis Sorted Set.
  3. Expire challenges after a short TTL, such as 30 seconds.
  4. When a new challenge arrives, search for compatible existing challenges.
  5. If a match is found, atomically remove both challenges and create a game.
  6. If no match is found, insert the challenge into the pending pool.

The important operation is atomic match creation. We must prevent a bug where User A's challenge is accepted by User B and User C at the same time. To prevent this, we need compare-and-set operations, distributed locks, database transactions, or single-threaded partition ownership inside the matching shard.
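
One way to get atomic match creation is the single-threaded partition option: if exactly one thread owns a partition, "find and remove" cannot interleave with another match attempt. The sketch below assumes that ownership model. `Challenge` and `MatchPartition` are hypothetical names, and a sorted Python list stands in for the TreeSet or sorted set (so removal here is linear, unlike a real balanced tree).

```python
import bisect
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Challenge:
    challenge_id: str
    user_id: str
    rating: int
    min_opponent_rating: int
    max_opponent_rating: int


class MatchPartition:
    """One matchmaking partition, assumed to be owned by a single thread."""

    def __init__(self) -> None:
        self._keys: List[Tuple[int, str]] = []  # sorted (rating, challengeId)
        self._by_id: Dict[str, Challenge] = {}

    def submit(self, new: Challenge) -> Optional[Tuple[Challenge, Challenge]]:
        # Range scan: start at the first pending rating >= the lower bound.
        start = bisect.bisect_left(self._keys, (new.min_opponent_rating, ""))
        for i in range(start, len(self._keys)):
            rating, challenge_id = self._keys[i]
            if rating > new.max_opponent_rating:
                break
            candidate = self._by_id[challenge_id]
            accepts_new = (candidate.min_opponent_rating
                           <= new.rating
                           <= candidate.max_opponent_rating)
            if accepts_new and candidate.user_id != new.user_id:
                # Remove before returning, so no later submit can pair with it.
                del self._keys[i]
                del self._by_id[challenge_id]
                return candidate, new
        # No compatible opponent yet: park the challenge in the pending pool.
        bisect.insort(self._keys, (new.rating, new.challenge_id))
        self._by_id[new.challenge_id] = new
        return None
```

If User A is pending and Users B and C both submit compatible challenges, B is paired with A and C is simply parked, because A has already left the pool by the time C is processed.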

A good sharding strategy partitions match requests by time control and rating bucket:

  • matchmaking:blitz:1400-1600
  • matchmaking:rapid:1600-1800
  • matchmaking:bullet:1200-1400

As wait time increases, we can gradually widen the rating range (e.g., first 5 seconds: $\pm 100$ rating, next 10 seconds: $\pm 200$ rating, etc.). This improves both fairness and match availability.
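
The widening rule can be expressed as a small lookup. The first two steps follow the example above; the final cap of ±400 is an assumption added here only so the function has a defined value for long waits.

```python
from typing import List, Tuple

# (waited less than N seconds, allowed rating difference)
WIDENING_SCHEDULE: List[Tuple[float, int]] = [(5.0, 100), (15.0, 200)]
MAX_RATING_DIFF = 400  # assumed cap once the schedule is exhausted


def rating_window(rating: int, waited_seconds: float) -> Tuple[int, int]:
    """Return the (min, max) opponent rating allowed after waiting this long."""
    for threshold, diff in WIDENING_SCHEDULE:
        if waited_seconds < threshold:
            return rating - diff, rating + diff
    return rating - MAX_RATING_DIFF, rating + MAX_RATING_DIFF
```

A 1500-rated player would search 1400 to 1600 for the first 5 seconds, then 1300 to 1700 for the next 10 seconds.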


10. Game Creation

Once the match engine finds two compatible users, it creates a new game.

Game creation involves:

  • Assigning a unique gameId.
  • Assigning white and black colors.
  • Initializing the board state (usually FEN format) and clocks.
  • Storing game metadata and setting game status to ACTIVE.
  • Notifying both players.

Example game object:

{
  "gameId": "game-123",
  "whiteUserId": "user-1",
  "blackUserId": "user-2",
  "timeControl": "5+0",
  "rated": true,
  "status": "ACTIVE",
  "currentFen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
  "turn": "WHITE",
  "whiteTimeMs": 300000,
  "blackTimeMs": 300000,
  "moveNumber": 0,
  "createdAt": 1710000000
}

The system sends both players a notification:

{
  "type": "GAME_STARTED",
  "gameId": "game-123",
  "color": "WHITE",
  "opponent": {
    "userId": "user-2",
    "rating": 1520
  },
  "timeControl": "5+0",
  "initialFen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
}

11. Game Engine

The game engine is the core component in the system.

It is responsible for:

  • Maintaining active game state.
  • Validating moves.
  • Enforcing turn order.
  • Managing clocks.
  • Detecting checkmate, stalemate, draw, resignation, and timeout.
  • Appending moves to durable storage.
  • Sending updates to players.
  • Publishing events for asynchronous consumers.

The game engine must be authoritative. The client may show the board and calculate legal moves for UX, but the server decides whether a move is valid. This prevents cheating where a modified client sends illegal moves or claims a different clock state.


12. Move Flow

A typical move flow:

  1. Player sends a move over WebSocket.
  2. Gateway receives the message.
  3. Connection service routes it to the correct game engine shard.
  4. Game engine validates the move.
  5. Game engine appends the move to the move log.
  6. Game engine updates in-memory state.
  7. Game engine sends MOVE_ACCEPTED to the active player.
  8. Game engine sends OPPONENT_MOVED to the opponent.
  9. Game engine publishes MOVE_COMMITTED to the event stream.

A move request:

{
  "type": "MOVE_SUBMIT",
  "gameId": "game-123",
  "clientMoveId": "client-move-999",
  "expectedMoveNumber": 0,
  "move": "e2e4"
}

A successful response:

{
  "type": "MOVE_ACCEPTED",
  "gameId": "game-123",
  "clientMoveId": "client-move-999",
  "moveNumber": 1,
  "move": "e2e4",
  "fenAfterMove": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
  "whiteTimeMs": 288000,
  "blackTimeMs": 300000,
  "serverTimestamp": 1710000012
}

A rejection response:

{
  "type": "MOVE_REJECTED",
  "gameId": "game-123",
  "clientMoveId": "client-move-999",
  "reason": "NOT_YOUR_TURN"
}

Possible rejection reasons:

  • Invalid game.
  • User is not part of the game.
  • Game already ended.
  • Clock expired.
  • Not the user's turn.
  • Illegal move.
  • Stale move number.
  • Duplicate or malformed payload.
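
The validation order in the move flow can be sketched as a single function. This is a simplified model, not a chess engine: real legality checking is delegated to an `is_legal` callback that the caller must supply, the in-memory `moves` list stands in for the durable append, and every reason code other than `NOT_YOUR_TURN` is a name invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GameState:
    game_id: str
    white_user_id: str
    black_user_id: str
    status: str = "ACTIVE"
    move_number: int = 0  # number of moves committed so far
    moves: List[str] = field(default_factory=list)

    def player_to_move(self) -> str:
        # White moves whenever an even number of moves has been committed.
        if self.move_number % 2 == 0:
            return self.white_user_id
        return self.black_user_id


def handle_move(game: GameState, user_id: str, move: str,
                expected_move_number: int,
                is_legal: Callable[[GameState, str], bool]) -> Dict[str, object]:
    def reject(reason: str) -> Dict[str, object]:
        return {"type": "MOVE_REJECTED", "gameId": game.game_id, "reason": reason}

    if user_id not in (game.white_user_id, game.black_user_id):
        return reject("NOT_A_PLAYER")
    if game.status != "ACTIVE":
        return reject("GAME_ENDED")
    if expected_move_number != game.move_number:
        return reject("STALE_MOVE_NUMBER")
    if user_id != game.player_to_move():
        return reject("NOT_YOUR_TURN")
    if not is_legal(game, move):
        return reject("ILLEGAL_MOVE")

    game.moves.append(move)  # stand-in for the durable move-log append
    game.move_number += 1
    return {"type": "MOVE_ACCEPTED", "gameId": game.game_id,
            "moveNumber": game.move_number, "move": move}
```

Cheap checks (membership, status, move number, turn) run before the legality check, so malformed or stale requests are rejected without touching the board.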

13. Server-Authoritative Clock

Chess clocks are highly sensitive. The client cannot be trusted to report remaining time.

The server stores:

  • Last move committed timestamp.
  • Remaining time for white and black.
  • Whose turn it is.

When a player submits a move, the server calculates elapsed time: $$\Delta t = T_{\text{received}} - T_{\text{last\_committed}}$$

The server subtracts $\Delta t$ from the moving player's clock, applies any increment (if the time control uses increment), and commits the updated clock state.

While the client displays a local countdown for smooth UX, official timeout decisions are always made by the server. This prevents cheating where a client claims it moved earlier than it actually did.
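
A minimal sketch of the clock update, assuming millisecond timestamps taken on the server and an optional per-move increment. `ClockState` and `apply_move_to_clock` are illustrative names; the logic follows the $\Delta t$ calculation above.

```python
from dataclasses import dataclass


@dataclass
class ClockState:
    white_time_ms: int
    black_time_ms: int
    turn: str  # "WHITE" or "BLACK"
    last_committed_at_ms: int


def apply_move_to_clock(clock: ClockState, received_at_ms: int,
                        increment_ms: int = 0) -> bool:
    """Charge elapsed time to the side to move. Returns False on timeout."""
    elapsed = max(0, received_at_ms - clock.last_committed_at_ms)
    is_white = clock.turn == "WHITE"
    remaining = (clock.white_time_ms if is_white else clock.black_time_ms) - elapsed

    if remaining <= 0:
        # Flag fall: the move arrived too late, so it is not committed.
        if is_white:
            clock.white_time_ms = 0
        else:
            clock.black_time_ms = 0
        return False

    remaining += increment_ms  # increment is only granted for a move made in time
    if is_white:
        clock.white_time_ms = remaining
    else:
        clock.black_time_ms = remaining
    clock.turn = "BLACK" if is_white else "WHITE"
    clock.last_committed_at_ms = received_at_ms
    return True
```

Because only server timestamps enter the calculation, a client that lies about when it moved cannot gain time.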


14. Move Log and Fault Tolerance

The game engine keeps active game state in memory for low latency. However, memory is volatile. If a game server crashes, we should not lose active games.

Therefore, every accepted move is appended to durable storage.

A move log schema:

  • gameId
  • moveNumber
  • playerId
  • move
  • fenAfterMove
  • serverReceivedAt
  • whiteTimeMs
  • blackTimeMs
  • clientMoveId

The move log is append-only because players cannot roll back moves. The history of a game is naturally an ordered event log. Append-only logs are easier to recover from, audit, debug, replay, and analyze.

If the game engine crashes:

  1. Connection service detects that the shard is unavailable.
  2. The game is reassigned to another game engine shard.
  3. The new game engine loads the game metadata.
  4. The new game engine replays the move log.
  5. The new game engine reconstructs board state and clocks.
  6. Players reconnect and the game continues.

This gives us the best of both worlds: in-memory state for fast move validation, and a durable move log for fault tolerance.
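
Recovery by replay might look like the sketch below. It leans on the fact that each log record already stores `fenAfterMove` and both clocks, so the replay only has to verify ordering and adopt the latest values; a stricter version could re-validate each move with a chess library. `MoveRecord`, `RecoveredGame`, and `recover_game` are hypothetical names using a subset of the schema above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


@dataclass
class MoveRecord:
    move_number: int
    player_id: str
    move: str
    fen_after_move: str
    white_time_ms: int
    black_time_ms: int


@dataclass
class RecoveredGame:
    current_fen: str
    move_number: int
    white_time_ms: int
    black_time_ms: int
    turn: str
    moves: List[str] = field(default_factory=list)


def recover_game(initial_time_ms: int, log: Iterable[MoveRecord]) -> RecoveredGame:
    """Rebuild in-memory state for one game from its append-only move log."""
    state = RecoveredGame(START_FEN, 0, initial_time_ms, initial_time_ms, "WHITE")
    for record in sorted(log, key=lambda r: r.move_number):
        if record.move_number != state.move_number + 1:
            raise ValueError(f"gap in move log before move {record.move_number}")
        state.current_fen = record.fen_after_move
        state.white_time_ms = record.white_time_ms
        state.black_time_ms = record.black_time_ms
        state.move_number = record.move_number
        state.moves.append(record.move)
        # Odd move numbers are White's moves, so Black is next (and vice versa).
        state.turn = "BLACK" if record.move_number % 2 == 1 else "WHITE"
    return state
```

Refusing to continue when a move number is missing is deliberate: resuming from a log with a gap would silently create a second version of the game.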


15. Consistency Model

For each individual game, we require strong consistency:

  • Only one move should be accepted for a given move number.
  • Moves must be processed in order.
  • The same player cannot move twice in a row.
  • Illegal moves must never be committed.

The easiest way to achieve this is to route all moves for a given game to one game-engine owner at a time. We do this by sharding by gameId: $$\text{gameId} \xrightarrow{\text{hash}} \text{gameEngineShard}$$

Within that shard, moves for a game are processed sequentially. If we allowed multiple servers to write moves for the same game concurrently, we would need distributed locking or complex transactions, which increases latency. Since chess games have low write throughput (at most a few writes per second), single-owner-per-game is the most reasonable model.


16. Idempotency and Retries

Networks are unreliable. A client may send a move, the server may process it, but the response may get lost. The client will then retry the same move. Without idempotency, the server may process the same move twice.

To avoid this, each move includes a unique clientMoveId.

The server stores recently processed clientMoveIds per game/user. If the same move is retried, the server returns the previous cached result instead of applying it again.

{
  "gameId": "game-123",
  "userId": "user-1",
  "clientMoveId": "move-abc-1",
  "expectedMoveNumber": 18,
  "move": "g1f3"
}

The game engine checks:

  • Have I already processed clientMoveId = move-abc-1 for this game/user?
  • If yes, return the previous acknowledgement.
  • If no, validate normally.

Making retryable operations idempotent is a standard system design technique.
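
The check above amounts to a small result cache keyed by (gameId, userId, clientMoveId). In this sketch the cache is an in-process `OrderedDict` with a size bound; the class name, the `apply_move` callback, and the bound of 1024 entries are assumptions for illustration.

```python
from collections import OrderedDict
from typing import Callable, Dict, Tuple

MoveKey = Tuple[str, str, str]  # (gameId, userId, clientMoveId)


class IdempotentMoveProcessor:
    def __init__(self, apply_move: Callable[[str, str, str], Dict[str, object]],
                 max_entries: int = 1024) -> None:
        self._apply_move = apply_move
        self._max_entries = max_entries
        self._results: "OrderedDict[MoveKey, Dict[str, object]]" = OrderedDict()

    def submit(self, game_id: str, user_id: str, client_move_id: str,
               move: str) -> Dict[str, object]:
        key = (game_id, user_id, client_move_id)
        cached = self._results.get(key)
        if cached is not None:
            return cached  # retry: replay the earlier acknowledgement

        result = self._apply_move(game_id, user_id, move)
        self._results[key] = result
        if len(self._results) > self._max_entries:
            self._results.popitem(last=False)  # evict the oldest entry
        return result
```

Rejections are cached too, so a retried illegal move gets the same rejection instead of being re-evaluated against a board that may have changed.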


17. Chat Service

The platform supports in-game chat. Chat is less consistency-critical than moves and can be eventually consistent. It should be stored separately from the game move log.

Basic APIs:

  • sendMessage(gameId, userId, message)
  • getMessages(gameId, userId)

A chat message object:

{
  "messageId": "msg-123",
  "gameId": "game-123",
  "senderId": "user-1",
  "message": "Good luck!",
  "createdAt": 1710000000
}

For live chat, the chat service can use the same WebSocket connection but a different message type (e.g., CHAT_MESSAGE).

Tip: Do not let chat processing block move processing. Chat messages should be rate-limited, filtered for abusive content, and stored asynchronously.


18. Spectator Service

With only two players, each move needs to be delivered to two users. But with spectators, a popular game may have thousands or millions of viewers. This creates a fanout problem.

We must not overload the game engine by making it push updates directly to every spectator.

Instead:

  1. Game engine publishes move events to a message broker.
  2. Spectator service consumes those events.
  3. Spectator service fans out updates to viewers.

Game Engine
   |
   +---> Player move delivery (Priority Path)
   |
   +---> Event Stream
            |
            v
        Spectator Service ---> Spectator Clients (Isolated Path)

This prevents spectator traffic from harming core gameplay. For spectators, we can also optimize bandwidth:

  • Send move deltas, not full board state.
  • Batch spectator count updates (e.g., update every 5 seconds).
  • Use CDN-like fanout or edge servers for popular games.
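
To make the isolation concrete, here is a toy in-process version of the spectator side. The game engine would publish one event per move, and the spectator service alone pays the per-viewer cost. `SpectatorFanout`, the event field names, and the bounded per-viewer queue are assumptions for this sketch; a real deployment would sit behind a message broker and push over WebSockets.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class SpectatorFanout:
    def __init__(self, max_queue: int = 64) -> None:
        self._max_queue = max_queue
        # gameId -> viewerId -> bounded queue of pending move deltas
        self._queues: Dict[str, Dict[str, Deque[dict]]] = defaultdict(dict)

    def watch(self, game_id: str, viewer_id: str) -> None:
        self._queues[game_id][viewer_id] = deque(maxlen=self._max_queue)

    def on_move_committed(self, event: dict) -> int:
        """Consume one move event and enqueue a small delta for each viewer."""
        delta = {"moveNumber": event["moveNumber"], "move": event["move"]}
        viewers = self._queues.get(event["gameId"], {})
        for queue in viewers.values():
            queue.append(delta)  # a full deque silently drops its oldest delta
        return len(viewers)

    def drain(self, game_id: str, viewer_id: str) -> List[dict]:
        queue = self._queues[game_id][viewer_id]
        pending = list(queue)
        queue.clear()
        return pending
```

The bounded queue means a slow spectator loses old deltas rather than consuming unbounded memory; such a viewer would then need a full snapshot to resynchronize.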

19. Analysis Engine

Post-game analysis is CPU-heavy. A chess engine like Stockfish analyzes positions to identify mistakes, blunders, inaccuracies, best moves, and accuracy. This should not happen in the move path.

Instead:

  1. Game ends.
  2. Game engine publishes GAME_ENDED.
  3. Analysis engine consumes the event.
  4. Analysis engine loads the game move history and runs Stockfish.
  5. Analysis result is stored.
  6. User views analysis later.

Example event:

{
  "eventType": "GAME_ENDED",
  "gameId": "game-123",
  "whiteUserId": "user-1",
  "blackUserId": "user-2",
  "result": "WHITE_WON",
  "terminationReason": "CHECKMATE",
  "endedAt": 1710000100
}

This is an offline, asynchronous pipeline. It may be delayed by seconds or minutes under peak load, which is acceptable since it is not needed for live gameplay.


20. Rating Engine

Player ratings should be updated after the game ends.

  • The rating engine consumes GAME_ENDED events.
  • It calculates new ratings based on current ratings, game result, and the rating system (e.g., Elo or Glicko-2).
  • The rating update must be reliable and idempotent. If the same GAME_ENDED event is processed twice, ratings should not be updated twice.
  • We use an idempotency key: rating_update:{gameId}.

Rating updates do not block the move path and happen asynchronously.
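
As an illustration, the sketch below applies the standard Elo update guarded by the `rating_update:{gameId}` key. The K-factor of 32 and the in-memory set of processed keys are assumptions; in a real system the key and the rating change would be written in one database transaction.

```python
from typing import Dict, Set

SCORE_FOR_WHITE = {"WHITE_WON": 1.0, "DRAW": 0.5, "BLACK_WON": 0.0}


class RatingEngine:
    def __init__(self, ratings: Dict[str, float], k_factor: float = 32.0) -> None:
        self.ratings = ratings
        self._k = k_factor
        self._applied: Set[str] = set()  # idempotency keys already processed

    def on_game_ended(self, event: dict) -> bool:
        """Apply the Elo update once per game. Returns False for duplicates."""
        key = f"rating_update:{event['gameId']}"
        if key in self._applied:
            return False

        white, black = event["whiteUserId"], event["blackUserId"]
        rating_gap = self.ratings[black] - self.ratings[white]
        expected_white = 1.0 / (1.0 + 10 ** (rating_gap / 400.0))
        score_white = SCORE_FOR_WHITE[event["result"]]

        # With equal K-factors, whatever White gains Black loses.
        change = self._k * (score_white - expected_white)
        self.ratings[white] += change
        self.ratings[black] -= change
        self._applied.add(key)
        return True
```

For two 1500-rated players and a White win, White moves to 1516 and Black to 1484; delivering the same event again changes nothing.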


21. Cheating Checker

Cheating detection uses multiple signals:

  • Move similarity to engine recommendations.
  • Consistently high accuracy in complex positions.
  • Suspicious timing patterns (e.g., constant 5-second delays on simple moves).
  • Reports from opponents.
  • Browser/device focus behavior.

The cheating checker consumes move and game events asynchronously. It does not block a move from being accepted in the normal path unless there is a clear real-time abuse rule. Cheating analysis is probabilistic, expensive, and requires looking at patterns across many games.


22. Notifications and Campaigns

Email, SMS, and campaign systems must be fully outside the gameplay path.

  • Examples: Tournament announcements, friend challenge notifications, game reminders.
  • These systems are slow and failure-prone compared to gameplay. If the email service is down, chess moves should still work.
  • This is classic system design isolation: non-critical features should fail independently.

23. Data Model

Here is the proposed relational or document layout for the core entities:

User

  • userId (Primary Key)
  • username
  • email
  • ratingBlitz
  • ratingRapid
  • ratingBullet
  • createdAt

Game

  • gameId (Primary Key)
  • whiteUserId
  • blackUserId
  • timeControl
  • rated (Boolean)
  • status (ACTIVE, ENDED)
  • currentFen
  • turn (WHITE, BLACK)
  • whiteTimeMs
  • blackTimeMs
  • moveNumber
  • createdAt
  • endedAt
  • result (WHITE_WON, BLACK_WON, DRAW)
  • terminationReason (CHECKMATE, TIMEOUT, RESIGNATION, STALEMATE)

Move

  • gameId (Composite Key Part 1)
  • moveNumber (Composite Key Part 2)
  • playerId
  • move (e.g., e2e4)
  • fenAfterMove
  • serverReceivedAt
  • whiteTimeMs
  • blackTimeMs
  • clientMoveId

24. APIs

We separate HTTP APIs (for metadata and setup) from WebSocket APIs (for active gameplay).

HTTP APIs

  • POST /challenges — Create a challenge
  • DELETE /challenges/{challengeId} — Cancel a challenge
  • GET /games/{gameId} — Retrieve game metadata
  • GET /games/{gameId}/moves — Fetch move history
  • GET /games/{gameId}/analysis — Get post-game analysis
  • GET /users/{userId}/profile — Fetch user profile

WebSocket Message Types

  • SESSION_INIT — Client connects and receives startup payload
  • CHALLENGE_MATCHED — Match found, client redirected to game
  • MOVE_SUBMIT — Client submits a move
  • MOVE_ACCEPTED / MOVE_REJECTED — Server response to move submission
  • OPPONENT_MOVED — Server pushes opponent's move to client
  • CHAT_MESSAGE — Live in-game chat message
  • DRAW_OFFERED / DRAW_ACCEPTED — Draw negotiation
  • RESIGN — Resign game
  • GAME_ENDED — Game termination notification
  • PING / PONG — Heartbeat

25. Request Batching

Request batching is an important optimization. When a user connects, a naive system may make many separate calls to load profile, rating, active games, pending challenges, unread notifications, and friend status.

If every small item requires a separate round trip, connection startup becomes slow and expensive. A better design batches startup data.

Example SESSION_INIT response:

{
  "type": "SESSION_INIT",
  "user": {
    "userId": "user-1",
    "username": "ayush",
    "rating": 1500
  },
  "activeGame": {
    "gameId": "game-123",
    "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    "moveNumber": 18,
    "whiteTimeMs": 120000,
    "blackTimeMs": 115000
  },
  "pendingChallenges": [],
  "notifications": []
}

Batching reduces latency, network overhead, backend request count, and CPU serialization overhead. However, do not delay critical moves just to batch them. Critical gameplay updates should be sent immediately.


26. Low Bandwidth Design

Chess is naturally low bandwidth, but bad design can still waste bytes.

  • Gameplay: Use move deltas (e.g., e2e4 plus updated clock times) for normal updates.
  • Reconnect/Resync: Send full snapshots (full board FEN, entire move list) only when a client reconnects or experiences state desynchronization.
  • Recovery: Replay from the move log for crash recovery and deep debugging.

This reduces bandwidth and allows each gateway server to support more concurrent connections.


27. Scaling Strategy

Different parts of the system scale on different axes:

| Component | Scaling Bottleneck | Strategy |
|---|---|---|
| Gateway | Concurrent connections, memory, file descriptors | Horizontal scaling, lightweight connections |
| Connection Service | Message routing QPS, registry lookups | Redis sharding, lookup caching |
| Match Engine | Pending challenge volume | Partition by time control + rating bucket |
| Game Engine | Active games, move validation QPS | Shard by gameId, in-memory state |
| Spectator Service | Fanout volume to viewers | Pub/Sub event distribution, edge cache |
| Analysis Engine | CPU-heavy Stockfish processing | Async workers, priority message queues |

28. Sharding

For sharding:

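  • Game Engine: Shard by gameId. The simplest scheme is modulo hashing, $$\text{shard} = \text{hash}(\text{gameId}) \bmod \text{number\_of\_shards}$$ but changing the shard count then remaps most games, so a consistent-hash ring is the better fit.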
  • Connection Registry: Shard by userId.
  • Match Engine: Shard by time control and rating bucket.
  • Move Store: Partition by gameId or createdAt time.

Consistent hashing reduces remapping when shards are added or removed. However, because games are active and stateful, shard movement must be done carefully. A game should not bounce between servers during active play unless there is a failure recovery or planned migration.
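
A compact hash ring shows why consistent hashing limits remapping. This is a generic sketch, not a production router: `HashRing`, the choice of SHA-256, and 100 virtual nodes per shard are assumptions. Each shard owns many points on a ring, and a game belongs to the first point clockwise from its own hash.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    def __init__(self, shards: Iterable[str], vnodes: int = 100) -> None:
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self._ring: List[Tuple[int, str]] = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, game_id: str) -> str:
        # First ring point clockwise from the key, wrapping around at the end.
        index = bisect.bisect_right(self._points, self._hash(game_id))
        return self._ring[index % len(self._ring)][1]
```

When a fifth shard joins a four-shard ring, the only games that change owner are the ones the new shard takes over; games never shuffle between the existing shards. With modulo hashing, most games would change owner on the same resize.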


29. Caching

Caching can be used for:

  • User profiles and ratings.
  • Active game metadata.
  • Connection registry and pending challenges.
  • Recent chat messages.

However, the cache should not become the source of truth for critical move validation. The authoritative state is always the in-memory game engine backed by the durable move log.


30. Deployment and Thundering Herds

If a gateway process is restarted, all clients connected to it disconnect. If thousands of clients reconnect at the same time, they can overload the load balancer, authentication service, session service, connection registry, Redis, and game engine. This is a thundering herd.

To defend against this:

  • Jittered Reconnect: Implement exponential backoff with random jitter: $$t_{\text{wait}} = \min(t_{\text{max}},\ 2^{\text{attempt}} \times \text{base}) \pm \text{random\_jitter}$$
  • Connection Draining: Stop sending new connections to old gateway instances. Allow existing games to finish before closing remaining connections gracefully.
  • Gateway Stability: Because the gateway is dumb and stable, most feature deployments happen behind it. This reduces the frequency of risky gateway deployments.
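
On the client side, the jittered reconnect delay could be computed as below. The default base of 0.5 seconds, the 30-second cap, and jitter of ±20% of the capped delay are assumptions chosen for the example; the shape follows the backoff formula above.

```python
import random


def reconnect_delay(attempt: int, base: float = 0.5, t_max: float = 30.0,
                    jitter_fraction: float = 0.2, rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    # Cap the exponent so very large attempt counts cannot overflow a float.
    capped = min(t_max, (2 ** min(attempt, 32)) * base)
    jitter = capped * jitter_fraction
    # Jitter is applied after the cap, matching the formula's "plus or minus" term.
    return max(0.0, capped + rng.uniform(-jitter, jitter))


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Because every client draws its own random offset, clients dropped by the same gateway restart spread their reconnects over a window instead of arriving in one spike.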

31. Failure Modes and Defenses

  • Gateway Crash: Reconnect with jitter, session resume, connection registry TTL.
  • Game Engine Crash: Durable move log, game recovery through replay, shard reassignment.
  • Duplicate Move Retry: Use clientMoveId and expected move number.
  • Out-of-order Messages: Validate expected move number, process moves sequentially.
  • Spectator Fanout Spike: Separate spectator service, CDN fanout, prioritize players.
  • Rating Double Update: Idempotency key per game, transactional database update.

32. Monitoring and Observability

Key Metrics to track:

  • Gateway: Active connections, connect/disconnect rates, gateway CPU/memory/file descriptors.
  • Gameplay: Move submit latency ($p_{50}, p_{95}, p_{99}$), move validation latency, opponent delivery latency.
  • Matchmaking: Matchmaking wait times, challenge expiration rate, game creation rate.
  • Downstream: Analysis queue lag, rating update failures, cheating checker backlog.

33. Summary

A scalable chess platform is not difficult because chess moves are large. It is difficult because real-time correctness, connection management, fault tolerance, and safe deployments all interact.

Key Takeaways

  1. Use WebSockets for live gameplay, but keep gateways dumb and stable.
  2. Move business logic into downstream services (Connection, Game, Match).
  3. Use the server as the absolute source of truth for moves and clocks.
  4. Shard by gameId to process moves sequentially per game.
  5. Log every accepted move to an append-only durable store for recovery.
  6. Isolate spectators from players via a separate fanout service.
  7. Process non-critical operations (analysis, ratings, notifications) asynchronously.
  8. Prevent thundering herds using connection draining and jittered reconnects.

Ayush Jaipuriar

Full Stack Software Engineer at TransUnion, specializing in modern web technologies and cloud solutions.


© 2026 Ayush Jaipuriar. All rights reserved.