We want to design an online platform that allows users to purchase tickets for concerts, sports events, theater, and other live entertainment. It has around 100 million DAU (Daily Active Users). The system should be able to handle high traffic and provide a smooth user experience. We will follow these steps to design the system:
- Requirements: We will first gather the requirements of the system, including functional and non-functional requirements.
- Core Entities: We will identify the core entities of the system, such as users, events, tickets, and orders.
- APIs: We will design the APIs for the system, including the endpoints, request and response formats.
- High-Level Design: We will design the high level architecture of the system, including the components, services, and data flow.
- Deep Dive: We will dive deeper into the components and services, including the data models, algorithms, and data flow.
Example:
At midnight (00:00), we want to sell 100 phones at $1,000 each, limited to 1 phone per user. We expect a huge traffic spike right at the start time, so we need to design the system to handle the spike and ensure that users can purchase the phones smoothly.
Scenario
QPS Calculation
QPS (queries per second) measures how many requests the system needs to handle per second. To estimate it, we consider the number of users, the number of requests per user, and how long the traffic burst lasts.
- ~1,000 QPS on normal days
- ~100,000 QPS during a sales event
So traffic is roughly 100 times higher during a sales event, and the system must be designed for this spike. A rough estimate is sketched below.
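The numbers below are assumptions for illustration (not measured values); they only show how a back-of-the-envelope peak-QPS estimate is derived from users, requests per user, and the burst window.
# Back-of-envelope estimate of peak QPS during the flash sale burst.
# All inputs are illustrative assumptions, not measured values.
interested_users = 1_000_000     # users who open the sale page
requests_per_user = 3            # page load + stock check + buy click
burst_window_seconds = 30        # most traffic arrives in the first seconds
peak_qps = interested_users * requests_per_user / burst_window_seconds
print(f"estimated peak QPS: {peak_qps:,.0f}")  # ~100,000 QPS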
Workflow
flowchart LR
A[Start] --> B{Logged in?}
B -->|Yes| C[Show Home]
B -->|No| D[Show Login]
C --> E[End]
D --> F[User Login]
F --> G{Login success?}
G -->|Yes| C
G -->|No| H[Show Error]
H --> D
The flow below outlines the flash sale journey from entering the product page to payment success or failure.
flowchart TD
U[User] --> C[Click]
C --> P[Product page]
P --> S{Start time reached?}
S -->|No| W[Show countdown and disable purchase button]
S -->|Yes| O{Sold out?}
O -->|Yes| E[Flash sale ends]
O -->|No| B[Show purchase button]
B -->|Buy| CO[Create order]
CO --> L{Order created and stock locked?}
L -->|No| F1[Order failed]
L -->|Yes| PC[Payment countdown]
PC --> T{Paid on time?}
T -->|Yes| DS[Deduct stock]
DS --> SU[Purchase success]
T -->|No| RS[Release stock]
RS --> F2[Purchase failed]
Problems to solve
- Handle high traffic during the sales event: a typical server (4-core CPU, 8 GB RAM) can handle only about 1,000 QPS, but during the sale we need to handle 100,000 QPS. Without a proper design the servers will be overwhelmed and crash.
- Prevent overselling: we have only 100 phones to sell, but a naive design may sell more than 100 phones (overselling) or leave some unsold.
- Malicious users: some users may try to exploit the system to buy more than 1 phone, or flood it with millions of scripted requests.
- Start time precision: everyone tries to buy at the same instant, so purchases must be accepted exactly at the start time, not before and not after.
- One phone per user: each user may buy at most the allowed quantity (1 in this example). We must identify users reliably and prevent them from creating multiple accounts to exceed the limit.
Requirements
Seller
- Add flash sale event
- Set up flash sale event
Client Side
- Flash sale page (frontend or app)
- Buy
- Order
- Pay
Services
We can choose between a single-server (monolithic) structure and a microservices structure. A monolith couples many services together on one server; microservices split each service onto its own servers. A monolith is harder to scale, updating one service means redeploying the whole server, it limits the technology stack to one language and framework, and it concentrates all load on a single database. Microservices are easier to scale and update independently, but they are more complex to design, develop, and maintain, and cascading failures become a concern: one slow or failing service may lead to failures in the services that call it.
A microservices structure is more suitable for this system, as we need to handle high traffic and ensure high availability. We can split the services into different servers and scale them independently based on the traffic. From the gateway, we can route the requests to different services based on the endpoint.
- decoupling: each service is independent.
- simplicity: each service is simple.
- scalability: for specific services.
- collaboration: different teams can work on different services.
- fault isolation: one service failure does not affect other services.
- technology diversity: each service can use a different technology stack.
- database isolation: each service can have its own database.
Storage
- Design storage for each service
- Schema design
Tables
commodity_info
| Field | Description |
|---|---|
| id | Unique commodity ID |
| name | Commodity name |
| description | Commodity description |
| price | Original price |
| id | name | description | price |
|---|---|---|---|
| 1 | iphone 17 128g | smart phone | 1000 |
seckill_info
| Field | Description |
|---|---|
| id | Flash sale event ID |
| name | Flash sale event name |
| commodity_id | Commodity ID |
| price | Flash sale price |
| number | Flash sale stock |
| id | name | commodity_id | price | number |
|---|---|---|---|---|
| 28 | iphone 17 128g seckill | 1 | 900 | 100 |
stock_info
| Field | Description |
|---|---|
| id | Stock record ID |
| commodity_id | Commodity ID |
| seckill_id | Flash sale event ID (0 for regular, non-sale stock) |
| stock | Available stock quantity |
| lock | Stock locked by unpaid orders |
| id | commodity_id | seckill_id | stock | lock |
|---|---|---|---|---|
| 1 | 1 | 0 | 1000000 | 0 |
| 2 | 1 | 28 | 100 | 5 |
order_info
| Field | Description |
|---|---|
| id | Order ID |
| commodity_id | Commodity ID |
| seckill_id | Flash sale event ID |
| user_id | User ID |
| paid | Paid status (0 = unpaid, 1 = paid) |
| id | commodity_id | seckill_id | user_id | paid |
|---|---|---|---|---|
| 1 | 1 | 28 | Jack | 1 |
How to add an index?
Normally, the fields that often appear in the WHERE clause should be indexed. For example, in the order_info table, the user_id and seckill_id fields are often used to query the orders of a user for a specific seckill event. So we can add a composite index on (user_id, seckill_id) to speed up the query.
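As a minimal sketch (assuming MySQL and the pymysql driver; connection parameters are placeholders), the composite index could be added like this:
# Sketch: add a composite index on (user_id, seckill_id) to order_info.
# Assumes MySQL syntax; connection parameters are placeholders.
import pymysql
conn = pymysql.connect(host="localhost", user="app", password="...", database="shop")
with conn.cursor() as cur:
    # Speeds up "orders of this user for this flash sale event" lookups.
    cur.execute("CREATE INDEX idx_user_seckill ON order_info (user_id, seckill_id)")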
Data Stream
The two diagrams below split the data flow by actor: merchants manage catalog and inventory, while users read sale data and place orders.
flowchart TB
M[Merchant] -->|select| CI[commodity_info]
M -->|insert| SI[seckill_info]
M -->|insert| ST[stock_info]
flowchart TB
U[User] -->|select| SI[seckill_info]
U -->|insert| OI[order_info]
SI --> CI[commodity_info]
CI --> ST[stock_info]
U -->|update| ST
Operations
Get stock
SELECT stock FROM stock_info WHERE commodity_id = 1 AND seckill_id = 28;
Reduce stock
UPDATE stock_info SET stock = stock - 1 WHERE commodity_id = 1 AND seckill_id = 28;
Problem 1:
If two requests arrive at the same time, both read the stock as 1, both decide the purchase can proceed, and both decrement it, so two units are sold when only one was left: overselling.
Use lock to prevent overselling (pessimistic locking)
START TRANSACTION;
-- lock the row; other transactions that want this row block until we commit
SELECT stock FROM stock_info WHERE commodity_id = 1 AND seckill_id = 28 FOR UPDATE;
UPDATE stock_info SET stock = stock - 1 WHERE commodity_id = 1 AND seckill_id = 28;
COMMIT;
db.startTransaction()
# SELECT ... FOR UPDATE locks the row: a second request blocks here until the
# first transaction commits, and then sees the already-reduced stock; if the
# stock is 0 it does not buy.
stock = select_stock_for_update(commodity_id, seckill_id)
if stock > 0:
    buy()
db.commit()
Use UPDATE with condition (optimistic locking)
SELECT stock FROM stock_info WHERE commodity_id = 1 AND seckill_id = 28;
UPDATE stock_info SET stock = stock - 1 WHERE commodity_id = 1 AND seckill_id = 28 AND stock > 0;
stock = select_stock(commodity_id, seckill_id)  # plain read, no lock
if stock > 0:
    if try_buy():   # UPDATE ... AND stock > 0 affected a row: purchase succeeded
        pass        # continue to create the order
    else:
        pass        # stock ran out between the read and the update: quit
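A minimal runnable sketch of this optimistic approach, assuming pymysql and the stock_info table above (connection parameters and the calling code are illustrative); the number of affected rows tells us whether we won the race:
# Sketch: optimistic stock deduction via a conditional UPDATE (pymysql assumed).
import pymysql
def try_buy(conn, commodity_id: int, seckill_id: int) -> bool:
    # Returns True if we deducted one unit, False if stock ran out.
    with conn.cursor() as cur:
        affected = cur.execute(
            "UPDATE stock_info SET stock = stock - 1 "
            "WHERE commodity_id = %s AND seckill_id = %s AND stock > 0",
            (commodity_id, seckill_id),
        )
    conn.commit()
    return affected == 1
conn = pymysql.connect(host="localhost", user="app", password="...", database="shop")
if try_buy(conn, commodity_id=1, seckill_id=28):
    print("purchase succeeded, create the order")
else:
    print("sold out")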
Problem 2:
If we have 100k requests at the same time, the database will be overwhelmed and crash.
A flash sale (seckill) is essentially a fight over a small amount of inventory in storage. We need to reduce the load on the database with caching and a message queue. Redis is a good caching choice: it is in-memory and can handle very high QPS (on the order of 100k QPS on a single instance).
We can load data into Redis and use Redis to handle the requests. The database is only used to persist the data.
Redis stores data in memory as key-value pairs. We can use a key such as seckill:{seckill_id}:commodity:{commodity_id}:stock with the stock number as the value. Redis is a kind of NoSQL database, and it can also persist data by saving it to disk periodically.
- Supported data structures: string, hash, list, set, sorted set.
- Redis executes commands on a single thread and uses I/O multiplexing to handle many connections (recent versions add separate I/O threads for networking).
- It uses an event-driven architecture to handle requests.
- Redis supports replication and failover for disaster recovery. Individual commands are atomic, and a Lua script runs atomically as a whole.
- Redis supports pub/sub channels, and its lists or streams can back a lightweight message queue. It is most often used as a cache to reduce the load on the database.
- Is Redis single-threaded as of 2026?
- What is I/O multiplexing?
Warm up Redis
Before the flash sale starts, preload the stock into Redis under a well-known key.
SET seckill:28:commodity:1:stock 100
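A sketch of this warm-up step using redis-py and pymysql (connection details are placeholders; the key name follows the convention above):
# Sketch: preload flash sale stock from the database into Redis before the event.
import pymysql
import redis
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
conn = pymysql.connect(host="localhost", user="app", password="...", database="shop")
with conn.cursor() as cur:
    cur.execute(
        "SELECT stock FROM stock_info WHERE commodity_id = %s AND seckill_id = %s",
        (1, 28),
    )
    stock = cur.fetchone()[0]
# Same key as the SET command above.
r.set("seckill:28:commodity:1:stock", stock)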
Data persistence
Redis supports two methods for data persistence: RDB (Redis Database Backup) and AOF (Append Only File). RDB creates point-in-time snapshots of the dataset at specified intervals, while AOF logs every write operation received by the server. AOF provides better durability as it can be configured to fsync data to disk after every write operation, but it may result in larger file sizes and slower recovery times compared to RDB.
A cluster of Redis nodes can be set up to provide high availability and fault tolerance. In a Redis cluster, data is automatically sharded across multiple nodes, and each node can have one or more replicas for redundancy. If a master node fails, one of its replicas can be promoted to master to ensure continuous availability.
SQL vs NoSQL
- SQL databases are relational; NoSQL databases are non-relational.
- SQL suits structured data; NoSQL suits semi-structured or unstructured data.
- NoSQL is usually chosen for big data volumes and very high throughput; SQL is chosen when strong guarantees and rich queries matter.
- SQL databases are ACID compliant; NoSQL systems generally follow BASE.
- SQL databases typically scale vertically; NoSQL databases scale horizontally.
- SQL is suitable for complex queries (joins, transactions); NoSQL is suitable for simple, high-volume access patterns.
Operations with Redis
GET seckill:28:commodity:1:stock
DECR seckill:28:commodity:1:stock
Redis should absorb all requests before the database is touched: if the decremented value drops below 0, we return "sold out" directly without accessing the database.
A plain GET followed by a DECR is a check-then-act sequence across two commands, so the pair is not atomic. We still have to validate the purchase at the database level even when Redis gives the go-ahead!
We can use a Lua script to make the operation atomic in Redis.
if (redis.call('exists', KEYS[1]) == 1) then
    local stock = tonumber(redis.call('get', KEYS[1]));
    if (stock <= 0) then
        return -1;
    end;
    redis.call('decr', KEYS[1]);
    return stock - 1;
end;
return -1;
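A sketch of invoking this script atomically from the application with redis-py (the key name is the one used above; connection details are assumptions):
# Sketch: run the stock-decrement Lua script atomically via redis-py.
import redis
r = redis.Redis(host="localhost", port=6379)
DECR_STOCK_LUA = """
if (redis.call('exists', KEYS[1]) == 1) then
    local stock = tonumber(redis.call('get', KEYS[1]));
    if (stock <= 0) then
        return -1;
    end;
    redis.call('decr', KEYS[1]);
    return stock - 1;
end;
return -1;
"""
decr_stock = r.register_script(DECR_STOCK_LUA)
remaining = decr_stock(keys=["seckill:28:commodity:1:stock"])
if remaining < 0:
    print("sold out")
else:
    print(f"reserved one unit, {remaining} left in Redis")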
Problem 3:
If we have 100k requests at the same time, a single Redis instance is near its limit, and the order service and the SQL database behind it will be overwhelmed by the requests that get through.
flowchart TD
U[User] --> L[Lua script reads Redis and decrements]
L --> D{Decrement success?}
D -->|Yes| O[Lock DB stock and create order]
O --> P[Payment]
P --> S[Stock deducted]
D -->|No| E[Flash sale ends]
We need to slow down the requests between Redis and the SQL database by using a message queue. The message queue can buffer the requests and process them one by one. The database is only used to persist the data.
Design with Message Queue
- MQ is used to decouple the services and buffer the requests.
- A producer can send messages to the MQ with high throughput.
- A consumer can process messages from the MQ at its own pace.
- MQs often have built-in retry mechanisms to ensure messages are eventually processed.
- An MQ can still fail to deliver messages, so the system must handle delivery failures, for example by keeping enough data in Redis to retry or reconcile later.
flowchart TD
U[User] --> L[Lua script reads Redis and decrements]
L --> D{Decrement success?}
D -->|No| E[Flash sale ends]
D -->|Yes| Q[Publish message to order system]
Q --> C[Message consumed]
C --> O[Lock DB stock and create order]
O --> P[Payment]
P --> S[Stock deducted]
S --> U
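A sketch of the buffering step above; RabbitMQ via pika is an assumption (any queue with acknowledgements works), and the queue name, payload shape, and helper are illustrative. The web tier publishes only after the Redis decrement succeeds; the order service consumes at its own pace and acknowledges after the database work is done.
# Sketch: buffer order creation between Redis and the database with a message queue.
# Assumes RabbitMQ via pika; queue name and payload shape are illustrative.
import json
import pika
def publish_order_request(user_id: str, seckill_id: int, commodity_id: int) -> None:
    # Producer (web tier): called only after the Redis decrement succeeded.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="seckill_orders", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="seckill_orders",
        body=json.dumps({"user_id": user_id, "seckill_id": seckill_id,
                         "commodity_id": commodity_id}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    conn.close()
def lock_stock_and_create_order(user_id, seckill_id, commodity_id) -> None:
    ...  # placeholder for the database logic (lock stock row, insert order)
def handle_order(channel, method, properties, body) -> None:
    # Consumer (order service): process at its own pace, then acknowledge.
    msg = json.loads(body)
    lock_stock_and_create_order(msg["user_id"], msg["seckill_id"], msg["commodity_id"])
    channel.basic_ack(delivery_tag=method.delivery_tag)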
When to reduce stock in database?
- Deduct stock in the database when the order is created. Best user experience, but stock held by unpaid orders cannot be bought by anyone else.
- Create the order first and deduct stock only when paying. Worse user experience: more orders than stock can be created, so some users who ordered find nothing left when they try to pay.
- Create the order, lock the stock in the database, deduct it when payment succeeds, and release it when the order is canceled or times out. This matches the flash sale workflow above; a sketch follows below.
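A sketch of that third option against the stock_info table above, using its stock and lock columns (pymysql-style placeholders; the helper names are illustrative, and lock is backquoted because it is a reserved word in MySQL):
# Sketch: lock stock at order time, deduct at payment, release on timeout.
# Uses the stock and `lock` columns of stock_info; MySQL syntax assumed.
def lock_stock(cur, commodity_id, seckill_id) -> bool:
    # Reserve one unit only if unreserved stock remains.
    return cur.execute(
        "UPDATE stock_info SET `lock` = `lock` + 1 "
        "WHERE commodity_id = %s AND seckill_id = %s AND stock - `lock` > 0",
        (commodity_id, seckill_id),
    ) == 1
def deduct_on_payment(cur, commodity_id, seckill_id) -> None:
    # Payment succeeded: the reserved unit is now really sold.
    cur.execute(
        "UPDATE stock_info SET stock = stock - 1, `lock` = `lock` - 1 "
        "WHERE commodity_id = %s AND seckill_id = %s AND `lock` > 0",
        (commodity_id, seckill_id),
    )
def release_on_timeout(cur, commodity_id, seckill_id) -> None:
    # Payment window expired or the order was canceled: give the unit back.
    cur.execute(
        "UPDATE stock_info SET `lock` = `lock` - 1 "
        "WHERE commodity_id = %s AND seckill_id = %s AND `lock` > 0",
        (commodity_id, seckill_id),
    )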
How to limit one phone per user?
flowchart LR
A[Orders table] --> B{Existing order for user?}
B -- No --> C[Create order]
B -- Yes --> D[Purchase failed]
One option is to check the database for an existing order from this user, but that adds load: every request must query the database first, then reduce the stock and create the order.
We can use Redis to cache the user purchase info. When a user buys a phone, we set a key in Redis with the user ID and seckill ID. For each request, we check the key in Redis first. If the key exists, we return “already bought”. If the key does not exist, we proceed to reduce the stock and create the order.
flowchart LR
U[User] --> C{Is the user id in the Redis set?}
C -- Yes --> F[Purchase failed]
C -- No --> R[Decrement Redis stock]
R --> A[Add user id to Redis set]
A --> O[Place order and payment]
O --> S{Success?}
S -- No --> D[Remove user id from Redis set]
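A sketch that makes the "check the set, then decrement stock" step atomic with one Lua script, so two requests from the same user cannot both pass; the key names and user id are illustrative assumptions:
# Sketch: atomically enforce one unit per user and decrement stock in Redis.
import redis
r = redis.Redis(host="localhost", port=6379)
BUY_ONCE_LUA = """
-- KEYS[1] = stock key, KEYS[2] = buyers set, ARGV[1] = user id
if redis.call('sismember', KEYS[2], ARGV[1]) == 1 then
    return -2                       -- user already bought
end
local stock = tonumber(redis.call('get', KEYS[1]) or '0')
if stock <= 0 then
    return -1                       -- sold out
end
redis.call('decr', KEYS[1])
redis.call('sadd', KEYS[2], ARGV[1])
return stock - 1
"""
buy_once = r.register_script(BUY_ONCE_LUA)
result = buy_once(keys=["seckill:28:commodity:1:stock", "seckill:28:buyers"],
                  args=["user_42"])
if result == -2:
    print("already bought")
elif result == -1:
    print("sold out")
else:
    print(f"reserved, {result} left")
If the downstream order or payment later fails, the consumer can SREM the user from the buyers set so they may retry, matching the last step of the flow above.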
Data Consistency
flowchart LR
A[Payment] --> B[Payment service<br>Payment record]
B --> C[Order service<br>Order payment success]
C --> D[Product service<br>Stock deduction]
Distributed Transaction (2PC)
Ensure payment, order, and stock updates across services either all succeed or all fail to keep strong consistency.
sequenceDiagram
participant TC as Transaction Coordinator
participant Pay as Payment Service
participant Ord as Order Service
participant Inv as Inventory Service
TC->>Pay: Prepare (can commit?)
Pay-->>TC: Yes/No
TC->>Ord: Prepare (can commit?)
Ord-->>TC: Yes/No
TC->>Inv: Prepare (can commit?)
Inv-->>TC: Yes/No
alt All Yes
TC->>Pay: Commit
TC->>Ord: Commit
TC->>Inv: Commit
else Any No
TC->>Pay: Rollback
TC->>Ord: Rollback
TC->>Inv: Rollback
end
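A toy coordinator sketch of the prepare/commit protocol shown in the diagram; the participant interface is an assumption, and a real deployment would use a transaction manager (for example an XA implementation) rather than hand-rolled code:
# Sketch: a toy two-phase commit coordinator matching the sequence diagram above.
from typing import List, Protocol
class Participant(Protocol):
    def prepare(self) -> bool: ...    # "can commit?" vote
    def commit(self) -> None: ...
    def rollback(self) -> None: ...
def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: every participant must vote Yes, otherwise abort the whole transaction.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()                # Phase 2: commit everywhere
        return True
    for p in participants:
        p.rollback()                  # Phase 2: roll back everywhere
    return False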
Scale
Next we look at how to optimize the system further.
Redis
We can refuse new connections when Redis finds its storage (memory) is full, instead of letting it degrade.
Statistics page
We can use a separate service to handle the statistics page. The statistics page only reads data, so it can be served, for example, from a cache or a read replica without touching the write path.
CDN: Frontend Static Resources
Use CDN edge nodes to serve static assets (HTML/CSS/JS/images) so most traffic never reaches the origin, reducing load and improving latency. Edge servers are deployed close to users; requests are routed to the nearest edge, which responds from its cache. On a miss, the edge fetches from the origin, caches the response, and serves subsequent users locally for lower RTT and faster TTFB.
flowchart LR
U[User] --> E{CDN edge cache hit?}
E -- Yes --> A[Serve cached static assets]
E -- No --> O[Origin]
O --> S[Static asset store]
O --> E
E --> A
Frontend request rate limiting: make the buy button unclickable for a short period after each click, and randomly reject some requests on the frontend.
Frontend countdown guard: disable the buy button before start time and show a countdown. The client gets server time (or a time offset) on page load and periodically polls to correct drift, then enables the button only when the event starts.
flowchart LR
P[Open page] --> T[Fetch server time / offset]
T --> C[Render countdown]
C --> D{Countdown reached 0?}
D -- No --> P2[Poll server time periodically]
P2 --> C
D -- Yes --> B[Enable buy button]
High availability
We need to ensure the flash sale does not affect other services, and we need to avoid the avalanche effect.
Avalanche effect: in a fan-out call chain, a slow/unavailable downstream service causes upstream retries and resource buildup, spreading failures to healthy services.
flowchart LR
A[Service A] --> B[Service B]
A --> C[Service C]
B --> D[Service D]
D --> F[Service F]
C --> E[Service E]
E --> G[Service G]
D -. slow/timeout .-> A
E -. slow/timeout .-> A
Circuit breaker (fuse): when a service is overloaded, we stop sending requests to it so it does not crash. We set a load threshold; once the threshold is exceeded, requests to that service are rejected, and after a cool-down period requests are allowed through again. A minimal sketch follows below.
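A minimal circuit-breaker sketch illustrating the threshold and cool-down described above (the thresholds and timings are assumptions; a production breaker would also have a half-open probe state):
# Sketch: a minimal circuit breaker with a failure threshold and a cool-down window.
import time
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 10.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"          # closed = normal, open = refusing requests
    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request refused")
            self.state = "closed"      # cool-down over, allow traffic again
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise
        self.failures = 0              # a success resets the failure counter
        return result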
Rate limiting
- Blacklist mechanism: blacklist IPs or users that send too many requests in a short time. We set a request threshold; when a client exceeds it, that IP or user is blacklisted for a certain period of time. A sketch follows below.
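A sketch of a per-user fixed-window rate limiter with a temporary blacklist on top of Redis; the thresholds, key names, and TTLs are assumptions:
# Sketch: per-user fixed-window rate limit with a temporary blacklist in Redis.
# Thresholds, key names, and TTLs are illustrative.
import redis
r = redis.Redis(host="localhost", port=6379)
REQUESTS_PER_WINDOW = 20      # allowed requests per user per window
WINDOW_SECONDS = 1
BLACKLIST_SECONDS = 600       # block abusive users for 10 minutes
def allow_request(user_id: str) -> bool:
    if r.exists(f"blacklist:{user_id}"):
        return False
    key = f"ratelimit:{user_id}"
    count = r.incr(key)                    # count requests in the current window
    if count == 1:
        r.expire(key, WINDOW_SECONDS)      # start the window on the first request
    if count > REQUESTS_PER_WINDOW:
        r.set(f"blacklist:{user_id}", 1, ex=BLACKLIST_SECONDS)
        return False
    return True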
Ticketmaster vs Flash Sale
For an iPhone flash sale there is no real difference, but for train or event tickets we must handle seat selection: the user picks a seat class before buying. This adds complexity to the design: the ticket_info table gains a seat-class field, stock_info tracks stock per seat class, and the system must check the stock for the selected seat class before deducting it and creating the order.
Requirements
Functional Requirements
These are the features of the system, often phrased as "The user should be able to …". We list the core features first, then the additional ones. The users should be able to:
- Book the tickets
- View the available events
- Search for events: A dropdown menu to select the type of event, such as concerts, sports, theater, etc. A search bar to search for events by name or keyword.
- Cancel a booking
- View the booking history
Non-Functional Requirements
CAP Theorem
- Strong consistency for booking tickets: To ensure no double bookings.
- High availability for viewing events: To ensure users can always view the available events.
- Read » Write: The system should be able to handle high read traffic, as users will frequently view the available events. The write traffic will be lower, as users will only book tickets once they make the decision.
- Scalability to handle surges: When there is a popular event, such as the Super Bowl or NBA Finals, the system should be able to handle the surge in traffic. The system should be able to scale horizontally, by adding more servers to handle the increased traffic.
These topics are out of scope:
- GDPR compliance: The system should be able to handle user data in compliance with GDPR regulations. This includes data encryption, data anonymization, and data retention policies.
- Fault tolerance: The system should be able to handle failures gracefully, such as server crashes, network failures, and database failures. This includes data replication, data backup, and data recovery policies.
Core Entities
Here we need to specify what data is processed by the system and exchanged by the APIs.
- Event
- Venue (stadium, theater, etc.)
- Performer (artist, team, etc.)
- Ticket: the individual tickets issued for an event.
APIs
These are the user-facing APIs, which the users will interact with to satisfy the functional requirements.