Designing a Scalable Online Chess Platform: A System Design Case Study
Online chess looks deceptively simple.
At first glance, a chess app feels like a small turn-based game. Two players connect, one player makes a move, the other player responds, and the game continues until checkmate, resignation, timeout, stalemate, or draw.
But once we design it at scale, it becomes a very interesting system design problem.
The hard part is not the size of a chess move. A move like e2e4 is tiny. The hard part is the combination of real-time communication, server-authoritative game state, matchmaking, long-lived WebSocket connections, fault tolerance, reconnection handling, spectator fanout, cheating detection, rating updates, game analysis, and safe deployments without causing reconnect storms.
This case study walks through the design of a scalable online chess website or app similar to Chess.com or Lichess. The goal is to make this both a blog-style explanation and a revision document for system design interviews.
1. Problem Statement
We want to design an online chess platform where users can come online, find opponents, play real-time chess games, chat, review games later, and receive accurate game results.
The system should support:
- Two online players playing a chess game.
- A matching system that pairs players based on rating or challenge preferences.
- A game engine that validates moves and manages game state.
- A chat system between players.
- A move log for every game.
- Game termination through checkmate, timeout, resignation, stalemate, forfeit, or draw.
- Post-game analysis.
- Rating updates.
- Cheating detection.
- Optional spectator support.
The system must be low-latency, low-bandwidth, fault-tolerant, and consistent.
2. Functional Requirements
The core functional requirements are:
- Users should be able to create or accept chess challenges.
- The system should match players against suitable opponents.
- Once a match is found, the system should create a game and assign one player as white and the other as black.
- White should make the first move.
- Both players should make moves one after the other.
- The system should validate every move.
- Players should not be able to cancel or roll back moves.
- The system should maintain a log of all moves.
- The system should detect game-ending states such as checkmate, stalemate, resignation, timeout, or forfeit.
- The system should support in-game chat.
- The system should allow users to retrieve game state.
- The system should optionally support spectators.
- The system should support post-game analysis.
- The system should support cheating detection.
- The system should update player ratings after the game ends.
Important: The client cannot be trusted to validate chess moves. The browser or mobile app can be modified, hacked, or made to send illegal moves. Therefore, all important game validation must happen on the server.
3. Non-Functional Requirements
The main non-functional requirements are:
- Low latency: Moves must feel real-time.
- Low bandwidth: Keep message payloads small.
- Fault tolerance: Server crashes should not result in lost games.
- Consistency: One authoritative sequence of moves per game.
- Scalability: Handle millions of connected users and active games.
- High availability: The matchmaking and gameplay paths must remain operational.
- Safe deployments: Deployments must not disconnect millions of active players at once.
- Reconnect support: Seamless session restoration after transient network drops.
- Observability: Real-time metrics on move latency, reconnect rates, and engine health.
For chess, "low latency" does not mean the same thing as a first-person shooter game. We do not need 20 ms server reaction time. But the move should reach the opponent quickly enough that the game feels real-time. For online chess, a reasonable target would be:
- Move acknowledgement $p_{95}$: under 100–200 ms.
- Opponent move delivery $p_{95}$: under 200–500 ms.
- Reconnect recovery: ideally within a few seconds.
The system must also be consistent. For one chess game, there must be exactly one valid sequence of moves. We cannot allow two different realities where one server thinks the game is at move 24 and another server thinks it is at move 25. The game engine must be authoritative.
4. Capacity Estimation
Let us start with a simple estimate.
Assumptions:
- Daily active users (DAU): 1 million.
- Average arrival rate, if users were spread evenly across the day: $1,000,000 / 1440 \approx 694$, so roughly 700 users per minute.
- Average game duration: 5 minutes.
- Normal connected users: around 5,000.
- Peak connected users: around 20,000.
Pending match requests may be much smaller than total online users because challenges expire quickly and many users are already in games.
If we assume each pending match request takes 1 KB, then: $$20,000 \text{ pending requests} \times 1\text{ KB} = 20\text{ MB}$$
This is small enough to keep in memory.
If the matching engine uses a Balanced BST, TreeSet, Redis Sorted Set, or similar range-query structure, lookup complexity is $O(\log N)$. For $N = 20,000$ pending requests, $\log_2(20,000) \approx 14$ operations. So the matcher is not likely to be the hardest bottleneck.
The bigger challenge is maintaining long-lived WebSocket connections and safely routing messages between users. Chess is usually connection-heavy but not bandwidth-heavy. A chess move is tiny. Even if each move payload is 1 KB after metadata and overhead, the move rate is still moderate compared to video streaming or telemetry-heavy games.
Example:
- Suppose the platform grows well beyond the estimate above and 200,000 active games are running.
- If each game averages one move every 10 seconds, that is: $$200,000 / 10 = 20,000 \text{ moves per second}$$
This is significant, but still manageable with sharded game servers, efficient move logs, and lightweight payloads.
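The arithmetic in this section can be reproduced in a few lines. This is only a back-of-envelope sketch; the variable names are illustrative, and the inputs are the assumptions stated above.

```python
import math

# Inputs taken from the assumptions in this section.
DAU = 1_000_000
MINUTES_PER_DAY = 24 * 60

arrivals_per_minute = DAU / MINUTES_PER_DAY        # about 694 users per minute

pending_requests = 20_000
bytes_per_request = 1_000                          # 1 KB per pending request
pending_memory_mb = pending_requests * bytes_per_request / 1_000_000

# Depth of a balanced search structure holding all pending requests.
lookup_steps = math.log2(pending_requests)

active_games = 200_000
seconds_per_move = 10
moves_per_second = active_games / seconds_per_move

print(round(arrivals_per_minute), pending_memory_mb, round(lookup_steps, 1), moves_per_second)
```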
The harder part is:
- Maintaining millions of live connections.
- Handling reconnect storms.
- Routing messages to the correct gateway.
- Recovering game state after crashes.
- Preventing spectator fanout from hurting players.
- Ensuring every move is processed exactly once logically.
5. High-Level Architecture
At a high level, the architecture separates the critical low-latency gameplay path from asynchronous downstream services.
The gameplay path (move validation, clock updates, move persistence, and opponent notification) is synchronous, thin, and low-latency.
Everything else—game analysis, rating updates, cheating checks, emails, and long-term analytics—should be processed asynchronously off an event stream.
6. Why WebSockets?
A normal HTTP request-response model is not enough for active chess gameplay.
When Player A makes a move, Player B should receive it immediately. The server needs to push data to the opponent without waiting for the opponent to poll.
Polling would look like this:
- Player B asks the server every second, "Did my opponent move?"
- This creates unnecessary traffic and poor latency. If the polling interval is 1 second, the opponent may see moves up to 1 second late. If the polling interval is very small, the server gets hammered by repeated useless requests.
WebSockets solve this by keeping a long-lived bidirectional connection between the client and server.
With WebSockets:
- Client can send a move to the server.
- Server can push the move to the opponent.
- Server can notify about resignation, timeout, draw offer, disconnect, reconnect, chat messages, and game end.
The gameplay experience becomes smoother. However, WebSockets introduce new complexity. HTTP servers are mostly stateless. WebSocket servers are stateful because they hold long-lived connections. This affects load balancing, deployment, failure recovery, connection routing, and capacity planning.
7. Keep the Gateway Dumb
One of the most important design decisions is to keep the WebSocket gateway as dumb and stable as possible.
The gateway should mainly do:
- Accept WebSocket connections.
- Verify authentication tokens.
- Maintain socket lifecycle.
- Handle ping/pong heartbeats.
- Track connection metadata.
- Forward inbound messages to the connection service.
- Send outbound messages to connected clients.
It should not contain frequently changing business logic. The gateway should not deeply understand chess move validation, matchmaking rules, profile logic, rating updates, anti-cheat logic, or game analysis.
Why?
Because WebSocket gateways hold long-lived connections. If we frequently deploy gateway changes, we risk disconnecting many users. If thousands of clients disconnect at once and reconnect immediately, we create a thundering herd.
A better design separates concerns:
- Dumb WebSocket Gateway: Stable and rarely changes.
- Connection Service: Owns the connection mappings and routes messages between gateways and downstream services.
- Downstream Services (Game / Match / Profile / Chat): Frequently changing business logic.
8. Connection Service
The connection service is the smart layer behind the gateway.
It handles:
- User-to-connection mapping.
- Connection-to-gateway mapping.
- Protocol versioning.
- Message routing & fanout coordination.
- Presence & session restoration.
- Backpressure & reconnect handling.
For example, the gateway may receive this message:
{
"type": "MOVE_SUBMIT",
"gameId": "game-123",
"move": "e2e4",
"clientMoveId": "move-789",
"expectedMoveNumber": 12
}
The gateway does not validate the chess move. It wraps the message with connection metadata:
{
"connectionId": "conn-456",
"userId": "user-1",
"gatewayId": "gw-7",
"receivedAt": 1710000000,
"rawMessage": {
"type": "MOVE_SUBMIT",
"gameId": "game-123",
"move": "e2e4",
"clientMoveId": "move-789",
"expectedMoveNumber": 12
}
}
Then the connection service routes it to the correct game engine shard.
The connection service maintains mappings like:
- userId -> (gatewayId, connectionId)
- gameId -> gameEngineShardId
- connectionId -> userId
- userId -> activeGameId
Redis is a common choice for this type of ephemeral mapping because it supports fast reads/writes and TTLs. Mappings should expire automatically if heartbeats stop.
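To make the TTL behavior concrete, here is a minimal in-memory sketch of a registry whose entries expire when heartbeats stop. It is a stand-in for a Redis-backed store, not a Redis client; the class name, method names, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ConnectionRegistry:
    """In-memory stand-in for a TTL-based registry (hypothetical API)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # userId -> (gatewayId, connectionId, expiresAt)
        self._by_user: Dict[str, Tuple[str, str, float]] = {}

    def heartbeat(self, user_id: str, gateway_id: str, connection_id: str) -> None:
        """Register or refresh a mapping; every heartbeat pushes the expiry forward."""
        expires_at = self._clock() + self._ttl
        self._by_user[user_id] = (gateway_id, connection_id, expires_at)

    def lookup(self, user_id: str) -> Optional[Tuple[str, str]]:
        """Return (gatewayId, connectionId), or None if missing or expired."""
        entry = self._by_user.get(user_id)
        if entry is None:
            return None
        gateway_id, connection_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._by_user[user_id]  # lazy expiry, mimicking a TTL
            return None
        return gateway_id, connection_id
```

With a 30-second TTL, a user who heartbeats at time 0 is still routable at second 29 and gone at second 31, so stale gateway mappings clean themselves up without an explicit disconnect event.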
9. Matching Engine
The matching engine is responsible for pairing players.
A user may create a challenge with constraints like:
- Time control: 3+0 (Blitz), 5+0, 10+0 (Rapid), etc.
- Opponent rating: Minimum/maximum bounds.
- Mode: Rated or unrated.
Example challenge:
{
"challengeId": "challenge-123",
"userId": "user-1",
"rating": 1500,
"minOpponentRating": 1400,
"maxOpponentRating": 1600,
"timeControl": "5+0",
"rated": true,
"createdAt": 1710000000,
"expiresAt": 1710000030
}
A simple approach:
- Store pending challenges in memory.
- Index them by rating using a TreeSet, Balanced BST, or Redis Sorted Set.
- Expire challenges after a short TTL, such as 30 seconds.
- When a new challenge arrives, search for compatible existing challenges.
- If a match is found, atomically remove both challenges and create a game.
- If no match is found, insert the challenge into the pending pool.
The important operation is atomic match creation. We must prevent a bug where User A's challenge is accepted by User B and User C at the same time. To prevent this, we need compare-and-set operations, distributed locks, database transactions, or single-threaded partition ownership inside the matching shard.
A good sharding strategy partitions match requests by time control and rating bucket:
- matchmaking:blitz:1400-1600
- matchmaking:rapid:1600-1800
- matchmaking:bullet:1200-1400
As wait time increases, we can gradually widen the rating range (e.g., first 5 seconds: $\pm 100$ rating, next 10 seconds: $\pm 200$ rating, etc.). This improves both fairness and match availability.
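The pool and the widening window can be sketched as follows. This is a single-threaded, in-memory sketch for one matchmaking shard, assuming a match is valid only when the rating gap fits inside both players' current windows. The class and function names are hypothetical, and the 300-point step after 15 seconds is an assumed value for the "etc." above.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(order=True, frozen=True)
class Challenge:
    rating: int          # first field, so challenges sort by rating
    challenge_id: str
    user_id: str
    created_at: float


def rating_window(waited_seconds: float) -> int:
    """Allowed rating gap, widened as the wait grows."""
    if waited_seconds < 5:
        return 100
    if waited_seconds < 15:
        return 200
    return 300  # assumed value for longer waits


class MatchPool:
    def __init__(self) -> None:
        self._pending: List[Challenge] = []  # kept sorted by rating

    def find_opponent(self, c: Challenge, now: float) -> Optional[Challenge]:
        window = rating_window(now - c.created_at)
        lo = bisect.bisect_left(self._pending, Challenge(c.rating - window, "", "", 0.0))
        best = None
        for other in self._pending[lo:]:
            if other.rating > c.rating + window:
                break
            if other.user_id == c.user_id:
                continue
            gap = abs(other.rating - c.rating)
            if gap > rating_window(now - other.created_at):
                continue  # outside the waiting player's own window
            if best is None or gap < abs(best.rating - c.rating):
                best = other
        return best

    def try_match(self, c: Challenge, now: float) -> Optional[Challenge]:
        """Return the matched opponent, or keep `c` queued and return None."""
        opponent = self.find_opponent(c, now)
        if opponent is None:
            if c not in self._pending:
                bisect.insort(self._pending, c)
            return None
        # One thread owns the shard, so removing both sides is atomic here.
        self._pending.remove(opponent)
        if c in self._pending:
            self._pending.remove(c)
        return opponent
```

Calling `try_match` again for a challenge that is still waiting is how the wider window takes effect: a 1500 and a 1650 player do not match on arrival, but do once both have waited more than 5 seconds.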
10. Game Creation
Once the match engine finds two compatible users, it creates a new game.
Game creation involves:
- Assigning a unique gameId.
- Assigning white and black colors.
- Initializing the board state (usually FEN format) and clocks.
- Storing game metadata and setting game status to ACTIVE.
- Notifying both players.
Example game object:
{
"gameId": "game-123",
"whiteUserId": "user-1",
"blackUserId": "user-2",
"timeControl": "5+0",
"rated": true,
"status": "ACTIVE",
"currentFen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
"turn": "WHITE",
"whiteTimeMs": 300000,
"blackTimeMs": 300000,
"moveNumber": 0,
"createdAt": 1710000000
}
The system sends both players a notification:
{
"type": "GAME_STARTED",
"gameId": "game-123",
"color": "WHITE",
"opponent": {
"userId": "user-2",
"rating": 1520
},
"timeControl": "5+0",
"initialFen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
}
11. Game Engine
The game engine is the core component in the system.
It is responsible for:
- Maintaining active game state.
- Validating moves.
- Enforcing turn order.
- Managing clocks.
- Detecting checkmate, stalemate, draw, resignation, and timeout.
- Appending moves to durable storage.
- Sending updates to players.
- Publishing events for asynchronous consumers.
The game engine must be authoritative. The client may show the board and calculate legal moves for UX, but the server decides whether a move is valid. This prevents cheating where a modified client sends illegal moves or claims a different clock state.
12. Move Flow
A typical move flow:
- Player sends a move over WebSocket.
- Gateway receives the message.
- Connection service routes it to the correct game engine shard.
- Game engine validates the move.
- Game engine appends the move to the move log.
- Game engine updates in-memory state.
- Game engine sends MOVE_ACCEPTED to the active player.
- Game engine sends OPPONENT_MOVED to the opponent.
- Game engine publishes MOVE_COMMITTED to the event stream.
A move request:
{
"type": "MOVE_SUBMIT",
"gameId": "game-123",
"clientMoveId": "client-move-999",
"expectedMoveNumber": 0,
"move": "e2e4"
}
A successful response (here for the first move of the game, so the FEN shows the position right after e2e4):
{
"type": "MOVE_ACCEPTED",
"gameId": "game-123",
"moveNumber": 1,
"move": "e2e4",
"fenAfterMove": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
"whiteTimeMs": 288000,
"blackTimeMs": 300000,
"serverTimestamp": 1710000012
}
A rejection response:
{
"type": "MOVE_REJECTED",
"gameId": "game-123",
"clientMoveId": "client-move-999",
"reason": "NOT_YOUR_TURN"
}
Possible rejection reasons:
- Invalid game or user not part of the game.
- Game already ended or clock expired.
- Not the user's turn or illegal move.
- Stale move number or duplicate/malformed payload.
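The validation order in the move flow can be sketched like this. It is a minimal sketch, not a full engine: chess legality is delegated to a caller-supplied `is_legal` hook (a hypothetical stand-in for a real rules library), the in-memory list stands in for the durable move log, and every reason code except NOT_YOUR_TURN is an assumed name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

# Hypothetical legality hook: receives the move history and the candidate move.
LegalityCheck = Callable[[List[str], str], bool]
Response = Dict[str, Union[str, int]]


@dataclass
class Game:
    game_id: str
    white_user_id: str
    black_user_id: str
    status: str = "ACTIVE"
    moves: List[str] = field(default_factory=list)  # stands in for the move log

    @property
    def move_number(self) -> int:
        return len(self.moves)

    @property
    def player_to_move(self) -> str:
        # White moves when the count is even (0, 2, 4, ...), black when it is odd.
        return self.white_user_id if self.move_number % 2 == 0 else self.black_user_id


def reject(reason: str) -> Response:
    return {"type": "MOVE_REJECTED", "reason": reason}


def submit_move(game: Game, user_id: str, move: str,
                expected_move_number: int, is_legal: LegalityCheck) -> Response:
    if user_id not in (game.white_user_id, game.black_user_id):
        return reject("NOT_A_PLAYER")
    if game.status != "ACTIVE":
        return reject("GAME_ENDED")
    if expected_move_number != game.move_number:
        return reject("STALE_MOVE_NUMBER")
    if user_id != game.player_to_move:
        return reject("NOT_YOUR_TURN")
    if not is_legal(game.moves, move):
        return reject("ILLEGAL_MOVE")
    game.moves.append(move)  # append to the log before acknowledging
    return {"type": "MOVE_ACCEPTED", "moveNumber": game.move_number, "move": move}
```

Cheap checks (membership, status, move number, turn) run before the legality check, so malformed or stale traffic is rejected without touching the rules engine.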
13. Server-Authoritative Clock
Chess clocks are highly sensitive. The client cannot be trusted to report remaining time.
The server stores:
- Last move committed timestamp.
- Remaining time for white and black.
- Whose turn it is.
When a player submits a move, the server calculates elapsed time: $$\Delta t = T_{\text{received}} - T_{\text{last\_committed}}$$
The server subtracts $\Delta t$ from the moving player's clock, applies any increment (if the time control uses increment), and commits the updated clock state.
While the client displays a local countdown for smooth UX, official timeout decisions are always made by the server. This prevents cheating where a client claims it moved earlier than it actually did.
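A minimal sketch of this clock logic, assuming all timestamps are server times in milliseconds. The class and field names are illustrative. A real server would also need a timer that flags a player who never sends a move at all, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class ServerClock:
    white_time_ms: int
    black_time_ms: int
    increment_ms: int
    turn: str               # "WHITE" or "BLACK"
    last_committed_ms: int  # server time of the last committed move

    def commit_move(self, received_ms: int) -> bool:
        """Charge elapsed time to the mover. Returns False if the mover ran out of time."""
        elapsed = max(0, received_ms - self.last_committed_ms)
        mover_is_white = self.turn == "WHITE"
        current = self.white_time_ms if mover_is_white else self.black_time_ms
        remaining = current - elapsed
        flagged = remaining <= 0
        # A flagged player gets no increment; the clock is pinned at zero.
        remaining = 0 if flagged else remaining + self.increment_ms
        if mover_is_white:
            self.white_time_ms = remaining
        else:
            self.black_time_ms = remaining
        if not flagged:
            self.turn = "BLACK" if mover_is_white else "WHITE"
            self.last_committed_ms = received_ms
        return not flagged
```

Because only `received_ms` from the server is used, nothing the client reports about its own clock can change the outcome.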
14. Move Log and Fault Tolerance
The game engine keeps active game state in memory for low latency. However, memory is volatile. If a game server crashes, we should not lose active games.
Therefore, every accepted move is appended to durable storage.
A move log schema:
- gameId
- moveNumber
- playerId
- move
- fenAfterMove
- serverReceivedAt
- whiteTimeMs
- blackTimeMs
- clientMoveId
The move log is append-only because players cannot roll back moves. The history of a game is naturally an ordered event log. Append-only logs are easier to recover from, audit, debug, replay, and analyze.
If the game engine crashes:
- Connection service detects that the shard is unavailable.
- The game is reassigned to another game engine shard.
- The new game engine loads the game metadata.
- The new game engine replays the move log.
- The new game engine reconstructs board state and clocks.
- Players reconnect and the game continues.
This gives us the best of both worlds: in-memory state for fast move validation, and a durable move log for fault tolerance.
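The recovery steps above can be sketched as a replay over the log records. This sketch trusts the fenAfterMove and clock values stored with each record instead of re-validating every move, and it assumes the game metadata carries `initialFen` and `initialTimeMs` fields, which are hypothetical names.

```python
from typing import Any, Dict, List


def recover_game(metadata: Dict[str, Any], move_log: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Rebuild in-memory game state from metadata plus the append-only move log."""
    state: Dict[str, Any] = {
        "gameId": metadata["gameId"],
        "fen": metadata["initialFen"],             # hypothetical metadata field
        "whiteTimeMs": metadata["initialTimeMs"],  # hypothetical metadata field
        "blackTimeMs": metadata["initialTimeMs"],
        "moveNumber": 0,
        "lastCommittedAt": None,
    }
    for record in sorted(move_log, key=lambda r: r["moveNumber"]):
        # The log must be a gapless sequence 1, 2, 3, ... for a single game.
        if record["moveNumber"] != state["moveNumber"] + 1:
            raise ValueError(f"gap or duplicate at move {record['moveNumber']}")
        state["moveNumber"] = record["moveNumber"]
        state["fen"] = record["fenAfterMove"]
        state["whiteTimeMs"] = record["whiteTimeMs"]
        state["blackTimeMs"] = record["blackTimeMs"]
        state["lastCommittedAt"] = record["serverReceivedAt"]
    state["turn"] = "WHITE" if state["moveNumber"] % 2 == 0 else "BLACK"
    return state
```

The restored `lastCommittedAt` matters for the clock: time keeps running for the player to move while the shard is being reassigned, so the new owner has to decide whether to charge or forgive that gap.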
15. Consistency Model
For each individual game, we require strong consistency:
- Only one move should be accepted for a given move number.
- Moves must be processed in order.
- The same player cannot move twice in a row.
- Illegal moves must never be committed.
The easiest way to achieve this is to route all moves for a given game to one game-engine owner at a time. We do this by sharding by gameId:
$$\text{gameId} \xrightarrow{\text{hash}} \text{gameEngineShard}$$
Within that shard, moves for a game are processed sequentially. If we allowed multiple servers to write moves for the same game concurrently, we would need distributed locking or complex transactions, which increases latency. Since chess games have low write throughput (at most a few writes per second), single-owner-per-game is the most reasonable model.
16. Idempotency and Retries
Networks are unreliable. A client may send a move, the server may process it, but the response may get lost. The client will then retry the same move. Without idempotency, the server may process the same move twice.
To avoid this, each move includes a unique clientMoveId.
The server stores recently processed clientMoveIds per game/user. If the same move is retried, the server returns the previous cached result instead of applying it again.
{
"gameId": "game-123",
"userId": "user-1",
"clientMoveId": "move-abc-1",
"expectedMoveNumber": 18,
"move": "g1f3"
}
The game engine checks:
- Have I already processed clientMoveId = move-abc-1 for this game/user?
- If yes, return the previous acknowledgement.
- If no, validate normally.
Making retryable operations idempotent is a standard system design technique.
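A minimal sketch of that check, assuming the cached results live in a plain dictionary. A production version would bound or expire the cache. The class name and the `apply_move` callback are hypothetical.

```python
from typing import Callable, Dict, Tuple

MoveResult = Dict[str, object]


class IdempotentMoveHandler:
    """Returns the cached result when the same clientMoveId is retried."""

    def __init__(self, apply_move: Callable[[str, str, str], MoveResult]):
        self._apply_move = apply_move
        # (gameId, userId, clientMoveId) -> result of the first attempt
        self._results: Dict[Tuple[str, str, str], MoveResult] = {}

    def handle(self, game_id: str, user_id: str, client_move_id: str, move: str) -> MoveResult:
        key = (game_id, user_id, client_move_id)
        if key in self._results:
            return self._results[key]  # retry: do not apply the move again
        result = self._apply_move(game_id, user_id, move)
        self._results[key] = result
        return result
```

Rejections are cached too, so a retried illegal move gets the same rejection back rather than being re-evaluated against a position that may have changed.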
17. Chat Service
The platform supports in-game chat. Chat is less consistency-critical than moves and can be eventually consistent. It should be stored separately from the game move log.
Basic APIs:
- sendMessage(gameId, userId, message)
- getMessages(gameId, userId)
A chat message object:
{
"messageId": "msg-123",
"gameId": "game-123",
"senderId": "user-1",
"message": "Good luck!",
"createdAt": 1710000000
}
For live chat, the chat service can use the same WebSocket connection but a different message type (e.g., CHAT_MESSAGE).
Tip: Do not let chat processing block move processing. Chat messages should be rate-limited, filtered for abusive content, and stored asynchronously.
18. Spectator Service
With only two players, each move needs to be delivered to two users. But with spectators, a popular game may have thousands or millions of viewers. This creates a fanout problem.
We must not overload the game engine by making it push updates directly to every spectator.
Instead:
- Game engine publishes move events to a message broker.
- Spectator service consumes those events.
- Spectator service fans out updates to viewers.
Game Engine
  |
  +---> Player move delivery (Priority Path)
  |
  +---> Event Stream
            |
            v
        Spectator Service ---> Spectator Clients (Isolated Path)
This prevents spectator traffic from harming core gameplay. For spectators, we can also optimize bandwidth:
- Send move deltas, not full board state.
- Batch spectator count updates (e.g., update every 5 seconds).
- Use CDN-like fanout or edge servers for popular games.
19. Analysis Engine
Post-game analysis is CPU-heavy. A chess engine like Stockfish analyzes positions to identify mistakes, blunders, inaccuracies, best moves, and accuracy. This should not happen in the move path.
Instead:
- Game ends.
- Game engine publishes GAME_ENDED.
- Analysis engine consumes the event.
- Analysis engine loads the game move history and runs Stockfish.
- Analysis result is stored.
- User views analysis later.
Example event:
{
"eventType": "GAME_ENDED",
"gameId": "game-123",
"whiteUserId": "user-1",
"blackUserId": "user-2",
"result": "WHITE_WON",
"terminationReason": "CHECKMATE",
"endedAt": 1710000100
}
This is an offline, asynchronous pipeline. It may be delayed by seconds or minutes under peak load, which is acceptable since it is not needed for live gameplay.
20. Rating Engine
Player ratings should be updated after the game ends.
- The rating engine consumes GAME_ENDED events.
- It calculates new ratings based on current ratings, game result, and the rating system (e.g., Elo or Glicko-2).
- The rating update must be reliable and idempotent. If the same GAME_ENDED event is processed twice, ratings should not be updated twice.
- We use an idempotency key: rating_update:{gameId}.
Rating updates do not block the move path and happen asynchronously.
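As an illustration, here is a sketch of an idempotent consumer using the standard Elo update. The K-factor of 20 is an assumed value, the class name is hypothetical, and the in-memory set stands in for a durable store of processed idempotency keys.

```python
from typing import Dict, Set


def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score for the player with `rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))


class RatingEngine:
    def __init__(self, k_factor: float = 20.0) -> None:  # K = 20 is an assumption
        self._k = k_factor
        self._processed: Set[str] = set()  # stands in for a durable key store

    def on_game_ended(self, ratings: Dict[str, float], event: Dict[str, str]) -> bool:
        """Apply the update once per game. Returns False for a duplicate event."""
        key = f"rating_update:{event['gameId']}"
        if key in self._processed:
            return False
        white, black = event["whiteUserId"], event["blackUserId"]
        white_score = {"WHITE_WON": 1.0, "DRAW": 0.5, "BLACK_WON": 0.0}[event["result"]]
        delta = self._k * (white_score - expected_score(ratings[white], ratings[black]))
        ratings[white] += delta
        ratings[black] -= delta
        self._processed.add(key)
        return True
```

In a real deployment the key check and the rating write would need to happen in one transaction, otherwise a crash between the two could lose or repeat an update.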
21. Cheating Checker
Cheating detection uses multiple signals:
- Move similarity to engine recommendations.
- Consistently high accuracy in complex positions.
- Suspicious timing patterns (e.g., constant 5-second delays on simple moves).
- Reports from opponents.
- Browser/device focus behavior.
The cheating checker consumes move and game events asynchronously. It does not block a move from being accepted in the normal path unless there is a clear real-time abuse rule. Cheating analysis is probabilistic, expensive, and requires looking at patterns across many games.
22. Notifications and Campaigns
Email, SMS, and campaign systems must be fully outside the gameplay path.
- Examples: Tournament announcements, friend challenge notifications, game reminders.
- These systems are slow and failure-prone compared to gameplay. If the email service is down, chess moves should still work.
- This is classic system design isolation: non-critical features should fail independently.
23. Data Model
Here is the proposed relational or document layout for the core entities:
User
- userId (Primary Key)
- username
- email
- ratingBlitz
- ratingRapid
- ratingBullet
- createdAt
Game
- gameId (Primary Key)
- whiteUserId
- blackUserId
- timeControl
- rated (Boolean)
- status (ACTIVE, ENDED)
- currentFen
- turn (WHITE, BLACK)
- whiteTimeMs
- blackTimeMs
- moveNumber
- createdAt
- endedAt
- result (WHITE_WON, BLACK_WON, DRAW)
- terminationReason (CHECKMATE, TIMEOUT, RESIGNATION, STALEMATE)
Move
- gameId (Composite Key Part 1)
- moveNumber (Composite Key Part 2)
- playerId
- move (e.g., e2e4)
- fenAfterMove
- serverReceivedAt
- whiteTimeMs
- blackTimeMs
- clientMoveId
24. APIs
We separate HTTP APIs (for metadata and setup) from WebSocket APIs (for active gameplay).
HTTP APIs
- POST /challenges: Create a challenge
- DELETE /challenges/{challengeId}: Cancel a challenge
- GET /games/{gameId}: Retrieve game metadata
- GET /games/{gameId}/moves: Fetch move history
- GET /games/{gameId}/analysis: Get post-game analysis
- GET /users/{userId}/profile: Fetch user profile
WebSocket Message Types
- SESSION_INIT: Client connects and receives startup payload
- CHALLENGE_MATCHED: Match found, client redirected to game
- MOVE_SUBMIT: Client submits a move
- MOVE_ACCEPTED / MOVE_REJECTED: Server response to move submission
- OPPONENT_MOVED: Server pushes opponent's move to client
- CHAT_MESSAGE: Live in-game chat message
- DRAW_OFFERED / DRAW_ACCEPTED: Draw negotiation
- RESIGN: Resign game
- GAME_ENDED: Game termination notification
- PING / PONG: Heartbeat
25. Request Batching
Request batching is an important optimization. When a user connects, a naive system may make many separate calls to load profile, rating, active games, pending challenges, unread notifications, and friend status.
If every small item requires a separate round trip, connection startup becomes slow and expensive. A better design batches startup data.
Example SESSION_INIT response:
{
"type": "SESSION_INIT",
"user": {
"userId": "user-1",
"username": "ayush",
"rating": 1500
},
"activeGame": {
"gameId": "game-123",
"fen": "rnbq1rk1/2p1bppp/p2p1n2/1p2p3/4P3/1BP2N1P/PP1P1PP1/RNBQR1K1 w - - 1 10",
"moveNumber": 18,
"whiteTimeMs": 120000,
"blackTimeMs": 115000
},
"pendingChallenges": [],
"notifications": []
}
Batching reduces latency, network overhead, backend request count, and CPU serialization overhead. However, do not delay critical moves just to batch them. Critical gameplay updates should be sent immediately.
26. Low Bandwidth Design
Chess is naturally low bandwidth, but bad design can still waste bytes.
- Gameplay: Use move deltas (e.g., e2e4 plus the updated clock times) for normal updates.
- Reconnect/Resync: Send full snapshots (full board FEN, entire move list) only when a client reconnects or experiences state desynchronization.
- Recovery: Replay from the move log for crash recovery and deep debugging.
This reduces bandwidth and allows each gateway server to support more concurrent connections.
27. Scaling Strategy
Different parts of the system scale on different axes:
| Component | Scaling Bottleneck | Strategy |
|---|---|---|
| Gateway | Concurrent connections, memory, file descriptors | Horizontal scaling, lightweight connections |
| Connection Service | Message routing QPS, registry lookups | Redis sharding, lookup caching |
| Match Engine | Pending challenge volume | Partition by time control + rating bucket |
| Game Engine | Active games, move validation QPS | Shard by gameId, in-memory state |
| Spectator Service | Fanout volume to viewers | Pub/Sub event distribution, edge cache |
| Analysis Engine | CPU-heavy Stockfish processing | Async workers, priority message queues |
28. Sharding
For sharding:
- Game Engine: Shard by gameId. The simplest mapping is modulo hashing, $$\text{shard} = \text{hash}(\text{gameId}) \bmod \text{number\_of\_shards}$$ but changing the shard count then remaps most games, so a consistent-hashing ring is the better fit.
- Connection Registry: Shard by userId.
- Match Engine: Shard by time control and rating bucket.
- Move Store: Partition by gameId or createdAt time.
Consistent hashing reduces remapping when shards are added or removed. However, because games are active and stateful, shard movement must be done carefully. A game should not bounce between servers during active play unless there is a failure recovery or planned migration.
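To show why a ring limits remapping, here is a small consistent-hashing sketch with virtual nodes. The shard names, the virtual-node count, and the use of SHA-256 are illustrative assumptions, not a description of any particular library.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Stable 64-bit hash derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, shards: List[str], vnodes: int = 100) -> None:
        # Each shard owns `vnodes` points on the ring to even out the load.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, game_id: str) -> str:
        """Walk clockwise from the key's hash to the next shard point."""
        idx = bisect.bisect_right(self._points, _hash(game_id)) % len(self._ring)
        return self._ring[idx][1]
```

When a fourth shard joins a three-shard ring, the only games that change owner are the ones that land on the new shard; games never shuffle between the existing shards. With plain modulo hashing, most games would move.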
29. Caching
Caching can be used for:
- User profiles and ratings.
- Active game metadata.
- Connection registry and pending challenges.
- Recent chat messages.
However, the cache should not become the source of truth for critical move validation. The authoritative state is always the in-memory game engine backed by the durable move log.
30. Deployment and Thundering Herds
If a gateway process is restarted, all clients connected to it disconnect. If thousands of clients reconnect at the same time, they can overload the load balancer, authentication service, session service, connection registry, Redis, and game engine. This is a thundering herd.
To defend against this:
- Jittered Reconnect: Implement exponential backoff with random jitter: $$t_{\text{wait}} = \min(t_{\text{max}}, 2^{\text{attempt}} \times \text{base}) \pm \text{random\_jitter}$$
- Connection Draining: Stop sending new connections to old gateway instances. Allow existing games to finish before closing remaining connections gracefully.
- Gateway Stability: Because the gateway is dumb and stable, most feature deployments happen behind it. This reduces the frequency of risky gateway deployments.
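The backoff formula above can be sketched on the client side as follows. The base delay, the cap, and the jitter fraction are assumed values, and the function name is hypothetical. Note that with symmetric jitter the final wait can exceed the cap by up to the jitter amount.

```python
import random


def reconnect_delay(attempt: int, rng: random.Random, base_s: float = 0.5,
                    max_s: float = 30.0, jitter_fraction: float = 0.5) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    backoff = min(max_s, base_s * (2 ** attempt))
    # Symmetric jitter: spread clients across +/- jitter_fraction of the backoff.
    jitter = rng.uniform(-jitter_fraction, jitter_fraction) * backoff
    return backoff + jitter


if __name__ == "__main__":
    rng = random.Random()
    for attempt in range(5):
        print(attempt, round(reconnect_delay(attempt, rng), 2))
```

Because every client draws its own jitter, clients that were disconnected at the same instant spread their reconnects over a window instead of arriving together.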
31. Failure Modes and Defenses
- Gateway Crash: Reconnect with jitter, session resume, connection registry TTL.
- Game Engine Crash: Durable move log, game recovery through replay, shard reassignment.
- Duplicate Move Retry: Use clientMoveId and expected move number.
- Out-of-order Messages: Validate expected move number, process moves sequentially.
- Spectator Fanout Spike: Separate spectator service, CDN fanout, prioritize players.
- Rating Double Update: Idempotency key per game, transactional database update.
32. Monitoring and Observability
Key Metrics to track:
- Gateway: Active connections, connect/disconnect rates, gateway CPU/memory/file descriptors.
- Gameplay: Move submit latency ($p_{50}, p_{95}, p_{99}$), move validation latency, opponent delivery latency.
- Matchmaking: Matchmaking wait times, challenge expiration rate, game creation rate.
- Downstream: Analysis queue lag, rating update failures, cheating checker backlog.
33. Summary
A scalable chess platform is not difficult because chess moves are large. It is difficult because real-time correctness, connection management, fault tolerance, and safe deployments all interact.
Key Takeaways
- Use WebSockets for live gameplay, but keep gateways dumb and stable.
- Move business logic into downstream services (Connection, Game, Match).
- Use the server as the absolute source of truth for moves and clocks.
- Shard by gameId to process moves sequentially per game.
- Log every accepted move to an append-only durable store for recovery.
- Isolate spectators from players via a separate fanout service.
- Process non-critical operations (analysis, ratings, notifications) asynchronously.
- Prevent thundering herds using connection draining and jittered reconnects.
Ayush Jaipuriar
Full Stack Software Engineer at TransUnion, specializing in modern web technologies and cloud solutions.