Summary
Requirements
Functional Requirements
- Given a URL, generate a shorter and unique alias (short link).
- When users access a short link, redirect to the original link.
- Users should optionally be able to pick a custom short link for their URL.
- Links will expire after a standard default timespan. Users should also be able to specify the expiration time.
Non-Functional Requirements
- The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
- URL redirection should happen in real-time with minimal latency.
- Shortened links should not be guessable (not predictable).
Extended Requirements
- Analytics; e.g., how many times a redirection happened?
- Be accessible through REST APIs by other services.
Capacity Estimation and Constraints
Assumption
- Read-heavy. More redirection requests compared to new URL shortenings.
- Assume 100:1 ratio between read and write.
Traffic estimates
- 500M new URL shortenings per month, 100 * 500M => 50B redirections per month.
- New URL shortenings per second
- 500 million / (30 days 24 hours 3600 seconds) = ~200 URLs/s
- URLs redirections per second
- 50 billion / (30 days 24 hours 3600 sec) = ~19K/s
Storage estimates
- Assume storing every URL shortening request for 5 years, each object takes 500 bytes
- Total objects: 500 million 5 years 12 months = 30 billion
- Total storage: 30 billion 500 bytes = *15 TB
Bandwidth estimates
- Write: 200 URL/s 500 bytes/URL = *100 KB/s
- Read: 19K URL/s 500 bytes/URL = *~9 MB/s
Cache memory estimates
- Follow the 80-20 rule, assuming 20% of URLs generate 80% of traffic, cache 20% hot URLs
- Requests per day: 19K 3600 seconds 24 hours = ~1.7 billion/day
- Cache 20%: 0.2 1.7 billion 500 bytes = ~170GB
Summary
- Assuming 500 million new URLs per month and 100:1 read:write ratio
Category | Calculation | Estimate ---- | ---- | ---- New URLs | 500 million / (30 days 24 hours 3600 seconds) | 200 /s URL redirections | 500 million 100 / (30 days 24 hours 3600 seconds) | 19 K/s Incoming data | 500 bytes/URL 200 URL/s | 100 KB/s Outgoing data | 500 bytes/URL 19K URL/s | 9 MB/s Storage for 5 years | 500 bytes/URL 500 million 60 months | 15 TB Memory for cache | 19K URL 3600 seconds 24 hours 500 bytes * 20% | 170 GB
System APIs
createUrl
- Parameters
Name | Type | Note
---- | ---- | ----
api_dev_key
|string
| The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.original_url
|string
| Original URL to be shortened.custom_alias
|string
| Optional custom key for the URL.user_name
|string
| Optional user name to be used in encoding.expire_date
|string
| Optional expiration date for the shortened URL. - Return
string
- A successful insertion returns the shortened URL; otherwise, it returns an error code.
deleteUrl
- Parameters
Name | Type | Note
---- | ---- | ----
api_dev_key
|string
| The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.url_key
|string
| Short URL. - Return
string
- A successful deletion returns ‘URL Removed’.
Database design
Observations
- Need to store billions of records.
- Each object is small (less than 1K).
- No relationships between records—other than storing which user created a URL.
- Read-heavy.
- A NoSQL choice would also be easier to scale.
- Comment: SQL with sharding should also work
Schema
- URL
Column | Type
---- | ----
hash
| varchar(16)original_url
| varchar(512)creation_date
| datetimeexpiration_date
| datetimeuser_id
| int - User
Column | Type
---- | ----
name
| varchar(20)email
| varchar(32)creation_date
| datetimelast_login
| datetime
- URL
Column | Type
---- | ----
Basic System Design and Algorithm
Encoding actual URL
- Compute unique hash
base64
: A-Z, a-z, 0-9,-
,.
- 6 letters: 64 ^ 6 = ~68.7 billion
- 8 letters: 64 ^ 8 = ~281 trillion
- Use 6 letters
MD5
generates 128 bit hash value- Each
base64
character encodes 6 bits base64
encoding generates 22 characters- Select 8 characters
- Issues with this approach
- Same URL from multiple users
- URL-encoded
- Workaround
- Append an increasing sequence number to each input URL, and generate a hash for it
- Append user id to input URL
Generating keys offline
Standalone Key Generation Service (KGS)
- Generate random 6 letter strings and store them in a database (key DB)
- When a short URL is needed, take one from the key DB
Key DB size
- 6 characters/key * 68.7B unique keys = 412 GB
Concurrency issue
- If there are multiple servers reading keys concurrently, two or more servers try to read the same key from the database.
Workaround
- Servers can use KGS to read/mark keys in the database.
- KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys.
- KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
- KGS needs to make sure not to give the same key to multiple servers.
- Comment: keys are sharded. Each KGS server only serves one application server.
Key lookup
- When a key is found, issue an "HTTP 302 Redirect" status and passing the stored URL.
- When a key is not found, issue an "HTTP 404 Not Found", or redirect to homepage.
UUID
Replace KGS with UUID.
Data Partitioning and Replication
- Range Based Partitioning
- Store URLs in separate partitions based on the first letter of the URL or the hash key.
- Combine certain less frequently occurring letters into one database partition.
Problem with this approach
- Unbalanced servers.
Hash-Based Partitioning
- Take a hash of the short URL we are storing, and calculate which partition to use based upon the hash.
- Use consistent hashing
Cache
- Eviction policy
- LRU: discard the least recently used URL first
- Cache update
- Cache miss: hit backend database and pass new entry to all cache replicas
Load Balancer (LB)
- LB locations
- Between Clients and Application servers
- Between Application Servers and database servers
- Between Application Servers and Cache servers
DB Sweeping
A separate Cleanup service can run periodically to remove expired links from our storage and cache.
Telemetry
Statistics about the system: how many times a short URL has been used
Security and Permissions
- Store permission level (public/private) with each URL in the database
- Send an error (HTTP 401) for unauthorized access