Design TinyURL
Design a TinyURL service that takes a long URL and converts it to a short URL.
Assumptions
To start designing, we first need to make assumptions so that we can base our decisions on them.
Number of users:
30 million / month (i.e., ~30 million new short URLs / month)
Data Capacity Model:
Long URL: 2 KB
Short URL: 7 B (xxxxxxx)
Create time: 4 B
Expire time: 4 B
Total: ~2 KB per record (2048 + 7 + 4 + 4 = 2063 B)
Storage needed:
30 million * 2063 B ≈ 2000 B * 30 million = 60,000,000,000 B = 60 GB
60 GB / month
720 GB / year ≈ 0.7 TB / year
~3.5 TB / 5 years
These numbers lead to conclusions such as: we should build scalability into our architecture at some point, or we should shard the database since one machine can't hold this much data, although 3.5 TB is pretty reasonable for one machine lol. Then you can look further at the traffic (QPS) load and draw conclusions like needing multiple machines to handle the traffic (maybe 1 GB/s, etc.).
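A quick back-of-the-envelope check of the arithmetic above, in Python (the per-field sizes are the assumptions from the capacity model, not measurements):

```python
# Back-of-the-envelope storage estimate from the capacity model above.
LONG_URL_BYTES  = 2048  # ~2 KB
SHORT_URL_BYTES = 7
CREATE_TS_BYTES = 4
EXPIRE_TS_BYTES = 4

record_bytes   = LONG_URL_BYTES + SHORT_URL_BYTES + CREATE_TS_BYTES + EXPIRE_TS_BYTES
urls_per_month = 30_000_000

monthly_gb = record_bytes * urls_per_month / 1e9
print(f"{record_bytes} B/record -> {monthly_gb:.0f} GB/month, "
      f"{monthly_gb * 12 / 1000:.2f} TB/year, "
      f"{monthly_gb * 60 / 1000:.1f} TB/5 years")
# 2063 B/record -> 62 GB/month, 0.74 TB/year, 3.7 TB/5 years
```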
Algorithms for generating short URLs
Base62 (integer -> str): is this a 1-to-1 mapping?
Yep.. but why?
If we use base62 to encode the integer, we're essentially converting a base-10 integer to a base-62 integer,
then we have 62⁷ ≈ 3.5 trillion possibilities
Each number from 0 to 62⁷ − 1 maps to a unique 7-char base62 string (see the sketch below)
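A minimal base62 encoder sketch (the alphabet ordering here is an arbitrary choice):

```python
import string

# 0-9, a-z, A-Z: 62 symbols; the ordering is an arbitrary choice.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int, width: int = 7) -> str:
    """Convert a non-negative integer to a zero-padded base62 string."""
    chars = []
    while n > 0:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])

print(base62_encode(125))  # "0000021" (125 = 2*62 + 1)
print(62 ** 7)             # 3521614606208, ~3.5 trillion ids
```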
MD5 (str -> str)
The problem is collisions if we only take the first 7 characters of the hash
Choosing a Database
Relational Database
ACID
What does each property mean?
Atomicity: a transaction is treated as a single unit; it either succeeds completely or fails completely. This prevents partial writes during an outage
Consistency: a transaction brings the database from one valid state to another valid state, i.e., database constraints still hold
Isolation: concurrent transactions are executed as if they were carried out sequentially
Durability: once a transaction is committed, it remains committed even in the case of system failure
(There's much more to ACID than these definitions: different isolation levels, how ACID is implemented, whether ACID still holds on sharded MySQL servers, etc.)
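A tiny illustration of atomicity using Python's built-in sqlite3 (the table and values are made up for the example; the duplicate primary key forces the whole transaction to roll back):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (short_url TEXT PRIMARY KEY, long_url TEXT)")

# Atomicity: both inserts commit together or not at all. The second insert
# violates the primary key, so the rollback also undoes the first one.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO urls VALUES ('abc1234', 'https://example.com/a')")
        conn.execute("INSERT INTO urls VALUES ('abc1234', 'https://example.com/b')")
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # 0 -> no partial write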
Sharding is complex; harder to scale
With a relational DB like MySQL, sharding doesn't come out of the box, so we'd have to build our own services to maintain these shards and scale them (e.g., keep track of the shard-to-machine mapping with ZK, etc.)
Here's a good article on database sharding (Pinterest's engineering blog also describes how they shard their MySQL database)
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
NoSQL
Eventual consistency
High availability
Easy to scale
Why? Explain
https://stackoverflow.com/questions/8729779/why-nosql-is-better-at-scaling-out-than-rdbms
https://stackoverflow.com/questions/34458082/scaling-in-nosql-vs-rdbms
An article on SQL vs NoSQL: https://www.thorntech.com/2019/03/sql-vs-nosql/
Techniques to generate URLs
Technique 1
For a long URL input, generate a random number; short URL = base62(random number); insert into the database
Problems with this approach:
Multiple app servers could generate the same short URL for two different long URLs, and NoSQL has no transactions to catch this
As the free pool of short URLs saturates, generating a URL gets slower since you'd constantly need to re-generate, plus there's the risk of race conditions in a NoSQL database (sketch below)
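A sketch of Technique 1, reusing base62_encode from the earlier sketch; db.insert_if_absent is a hypothetical conditional-write interface (think Redis SETNX or a DynamoDB attribute_not_exists condition), not a specific database API:

```python
import random

MAX_ID = 62 ** 7  # size of the 7-char base62 keyspace

def shorten_random(long_url: str, db) -> str:
    """Technique 1: draw random ids until an insert wins.

    db.insert_if_absent(key, value) is a hypothetical atomic
    "insert only if the key is free" operation; without such a check,
    two app servers can race and silently map the same short URL to
    different long URLs.
    """
    while True:
        candidate = base62_encode(random.randrange(MAX_ID))
        if db.insert_if_absent(candidate, long_url):
            return candidate
        # Collision: retry. As the keyspace fills up, retries get more frequent.
```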
Technique 2
For a long URL input, short URL = first 7 chars of md5(long URL); insert into the database
Similarly, short URL collisions are possible here, and again NoSQL has no transactions to catch them
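A sketch of Technique 2's truncated-hash mapping (the hex digest only gives 16 symbols per char rather than 62; the example URL is made up):

```python
import hashlib

def shorten_md5(long_url: str) -> str:
    """Technique 2: first 7 hex chars of md5(long URL).

    Deterministic, but truncating the 32-char digest to 7 chars makes
    collisions between different long URLs likely once the table grows
    (birthday bound: roughly sqrt(16**7) ≈ 16k URLs).
    """
    return hashlib.md5(long_url.encode()).hexdigest()[:7]

print(shorten_md5("https://example.com/some/long/path"))  # 7-char hex prefix
```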
Technique 3
Use a counter service to keep track of a counter; short URL = base62(counter); insert into the database
The counter service would be a single point of failure, and we'd need to scale it if we have lots of app servers.
What if we have multiple counter nodes, each responsible for a certain range of numbers?
For that we need ZooKeeper to coordinate the assignment of these ranges,
i.e., each app server asks for a counter range on startup and allocates short URLs from this range; if the app server goes down, it simply asks for another range
(More details on the ZooKeeper configuration)
For example, we could have znodes like below in ZK
/servers/app1/range (counter_range str, consumed bool)
/servers/app2/range
/servers/app3/range
On startup, each app consumes the range stored for it in ZK and uses that range for generating counters
A dedicated service is responsible for watching changes in these ranges; once a range is consumed, the service fills in a new counter range
The dedicated service can just be a thread running on the app servers; we could use ZK leader election to replace it if it goes down.
Or we could run the service on a separate server, with a backup to avoid a single point of failure. (A sketch of the startup flow follows.)
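A sketch of the startup flow, assuming the kazoo ZooKeeper client and the znode layout above; encoding the range payload as "start:end:consumed" is an assumption made for this sketch:

```python
from kazoo.client import KazooClient

def claim_counter_range(app_name: str) -> range:
    """On startup, read this app server's counter range from ZooKeeper and
    flag it as consumed so the range-filling service knows to top it up.

    Follows the /servers/<app>/range layout above; the "start:end:consumed"
    payload format is an assumption for illustration.
    """
    zk = KazooClient(hosts="127.0.0.1:2181")  # hypothetical ensemble address
    zk.start()
    path = f"/servers/{app_name}/range"
    data, _stat = zk.get(path)
    start, end, _consumed = data.decode().split(":")
    zk.set(path, f"{start}:{end}:true".encode())  # mark the range as consumed
    zk.stop()
    return range(int(start), int(end))

# Each app server then hands out ids from its private range, e.g.:
#   for counter in claim_counter_range("app1"):
#       short_url = base62_encode(counter)
```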
Technique 3 is highly available since we can have as many servers serving shortening requests as we like.
Performance is good since we don't require an extra lookup into the database to ensure there are no duplicates.
It is scalable: we can easily spin up more app servers to handle more load; just add one more app entry in ZK and have the dedicated service assign it a counter range.
Extra thoughts:
How do we shard the database in case there's not enough space on one database instance? We could shard based on hash(short URL) using consistent hashing in NoSQL. Other than hash-based partitioning, there's also key-range-based partitioning, which is good for range scans; in our case, though, range-scan efficiency isn't useful, and we care more about an even load balance between partitions, which is exactly what consistent hashing is good at (sketch below).
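A minimal consistent-hash ring sketch for picking a shard by hash(short URL); the node names and virtual-node count are made up:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map short URLs to shards; adding or removing a shard only remaps
    the keys falling in its arc, not the whole keyspace."""

    def __init__(self, nodes, vnodes=100):
        # Place each node on the ring multiple times (virtual nodes)
        # to smooth out the load distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, short_url: str) -> str:
        # Walk clockwise to the first node at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(short_url)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.node_for("abc1234"))  # one of db1/db2/db3
```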