Design TinyURL
Design a TinyURL service that takes a long URL and converts it to a short URL.
Assumptions
To start designing, we first need to make assumptions so that we can base our decisions on them.
Number of users:
30 million / month (i.e., ~30 million new short URLs / month)
Data Capacity Model:
Long URL: 2 KB
Short URL: 7 B (xxxxxxx)
Create time: 4 B
Expire time: 4 B
Total: ~2 KB per record (2048 + 7 + 4 + 4 = 2063 B)
Storage needed:
30 million * 2063 B ≈ 2000 B * 30 million = 60,000,000,000 B = 60 GB
60 GB / month
720 GB / year ≈ 0.7 TB / year
~3.5 TB / 5 years
These numbers lead to conclusions such as: we should build scalability into our architecture at some point, or we should shard the database since one machine can't hold this much data, although 3.5 TB is pretty reasonable for one machine lol. Then you can look further at the traffic (QPS) load and draw conclusions like needing multiple machines to handle the traffic (maybe 1 GB/s, etc.).
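A quick back-of-the-envelope check of the arithmetic above, in Python (the per-field sizes are the assumptions from the capacity model, not measurements):

```python
# Back-of-the-envelope storage estimate from the capacity model above.
LONG_URL_BYTES  = 2048  # ~2 KB
SHORT_URL_BYTES = 7
CREATE_TS_BYTES = 4
EXPIRE_TS_BYTES = 4

record_bytes   = LONG_URL_BYTES + SHORT_URL_BYTES + CREATE_TS_BYTES + EXPIRE_TS_BYTES
urls_per_month = 30_000_000

monthly_gb = record_bytes * urls_per_month / 1e9
print(f"{record_bytes} B/record -> {monthly_gb:.0f} GB/month, "
      f"{monthly_gb * 12 / 1000:.2f} TB/year, "
      f"{monthly_gb * 60 / 1000:.1f} TB/5 years")
# 2063 B/record -> 62 GB/month, 0.74 TB/year, 3.7 TB/5 years
```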
Algorithms for generating short URLs
Base62 (integer -> str): is this a 1-to-1 mapping?
Yep.. but why?
If we use base62 to encode the integer, we're essentially converting a base-10 integer to a base-62 integer,
then we have 62⁷ ≈ 3.5 trillion possibilities
Each number from 0 to 62⁷ − 1 maps to a unique 7-char base62 string (see the sketch below)
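A minimal base62 encoder sketch (the alphabet ordering here is an arbitrary choice):

```python
import string

# 0-9, a-z, A-Z: 62 symbols; the ordering is an arbitrary choice.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int, width: int = 7) -> str:
    """Convert a non-negative integer to a zero-padded base62 string."""
    chars = []
    while n > 0:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])

print(base62_encode(125))  # "0000021" (125 = 2*62 + 1)
print(62 ** 7)             # 3521614606208, ~3.5 trillion ids
```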
MD5 (str -> str)
The problem is collisions if we only take the first 7 characters of the hash
Choosing a Database
Relational Database
ACID
What does each property mean?
Atomicity: a transaction is treated as a single unit; it either succeeds completely or fails completely. This prevents partial writes during an outage
Consistency: a transaction brings the database from one valid state to another valid state, i.e., database constraints still hold
Isolation: concurrent transactions are executed as if they were carried out sequentially
Durability: once a transaction is committed, it remains committed even in the case of system failure
(There's much more to ACID than these definitions: different isolation levels, how ACID is implemented, whether ACID still holds on sharded MySQL servers, etc.)
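A tiny illustration of atomicity using Python's built-in sqlite3 (the table and values are made up for the example; the duplicate primary key forces the whole transaction to roll back):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (short_url TEXT PRIMARY KEY, long_url TEXT)")

# Atomicity: both inserts commit together or not at all. The second insert
# violates the primary key, so the rollback also undoes the first one.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO urls VALUES ('abc1234', 'https://example.com/a')")
        conn.execute("INSERT INTO urls VALUES ('abc1234', 'https://example.com/b')")
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # 0 -> no partial write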
Sharding is complex; harder to scale
With a relational DB like MySQL, sharding doesn't come out of the box, so we'd have to build our own services to maintain these shards and scale them (e.g., keep track of the shard-to-machine mapping with ZK, etc.)
Here's a good article on database sharding (Pinterest's engineering blog also describes how they shard their MySQL database)
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
NoSQL
Eventual consistency
High availability
Easy to scale
Why? Explain
https://stackoverflow.com/questions/8729779/why-nosql-is-better-at-scaling-out-than-rdbms
https://stackoverflow.com/questions/34458082/scaling-in-nosql-vs-rdbms
An article on SQL vs NoSQL: https://www.thorntech.com/2019/03/sql-vs-nosql/
Techniques to generate URLs
Technique 1
For a long URL input, generate a random number; short URL = base62(random number); insert into the database
Problems with this approach:
Multiple app servers could generate the same short URL for two different long URLs, and NoSQL has no transactions to catch this
As the free pool of short URLs saturates, generating a URL gets slower since you'd constantly need to re-generate, plus there's the risk of race conditions in a NoSQL database (sketch below)
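A sketch of Technique 1, reusing base62_encode from the earlier sketch; db.insert_if_absent is a hypothetical conditional-write interface (think Redis SETNX or a DynamoDB attribute_not_exists condition), not a specific database API:

```python
import random

MAX_ID = 62 ** 7  # size of the 7-char base62 keyspace

def shorten_random(long_url: str, db) -> str:
    """Technique 1: draw random ids until an insert wins.

    db.insert_if_absent(key, value) is a hypothetical atomic
    "insert only if the key is free" operation; without such a check,
    two app servers can race and silently map the same short URL to
    different long URLs.
    """
    while True:
        candidate = base62_encode(random.randrange(MAX_ID))
        if db.insert_if_absent(candidate, long_url):
            return candidate
        # Collision: retry. As the keyspace fills up, retries get more frequent.
```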
Technique 2
For a long URL input, short URL = first 7 chars of md5(long URL); insert into the database
Similarly, short URL collisions are possible here, and again NoSQL has no transactions to catch them
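A sketch of Technique 2's truncated-hash mapping (the hex digest only gives 16 symbols per char rather than 62; the example URL is made up):

```python
import hashlib

def shorten_md5(long_url: str) -> str:
    """Technique 2: first 7 hex chars of md5(long URL).

    Deterministic, but truncating the 32-char digest to 7 chars makes
    collisions between different long URLs likely once the table grows
    (birthday bound: roughly sqrt(16**7) ≈ 16k URLs).
    """
    return hashlib.md5(long_url.encode()).hexdigest()[:7]

print(shorten_md5("https://example.com/some/long/path"))  # 7-char hex prefix
```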
Technique 3
Use a counter service to keep track of a counter; short URL = base62(counter); insert into the database
The counter service would be a single point of failure, and we'd need to scale it if we have lots of app servers.
What if we have multiple counter nodes, each responsible for a certain range of numbers?
For that we need ZooKeeper to coordinate the assignment of these ranges,
i.e., each app server asks for a counter range on startup and allocates short URLs from this range; if the app server goes down, it simply asks for another range
(More details on the ZooKeeper configuration)
For example, we could have znodes like below in ZK
/servers/app1/range (counter_range str, consumed bool)
/servers/app2/range
/servers/app3/range
On startup, each app consumes the range stored for it in ZK and uses that range for generating counters
A dedicated service is responsible for watching changes in these ranges; once a range is consumed, the service fills in a new counter range
The dedicated service can just be a thread running on the app servers; we could use ZK leader election to replace it if it goes down.
Or we could run the service on a separate server, with a backup to avoid a single point of failure. (A sketch of the startup flow follows.)
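A sketch of the startup flow, assuming the kazoo ZooKeeper client and the znode layout above; encoding the range payload as "start:end:consumed" is an assumption made for this sketch:

```python
from kazoo.client import KazooClient

def claim_counter_range(app_name: str) -> range:
    """On startup, read this app server's counter range from ZooKeeper and
    flag it as consumed so the range-filling service knows to top it up.

    Follows the /servers/<app>/range layout above; the "start:end:consumed"
    payload format is an assumption for illustration.
    """
    zk = KazooClient(hosts="127.0.0.1:2181")  # hypothetical ensemble address
    zk.start()
    path = f"/servers/{app_name}/range"
    data, _stat = zk.get(path)
    start, end, _consumed = data.decode().split(":")
    zk.set(path, f"{start}:{end}:true".encode())  # mark the range as consumed
    zk.stop()
    return range(int(start), int(end))

# Each app server then hands out ids from its private range, e.g.:
#   for counter in claim_counter_range("app1"):
#       short_url = base62_encode(counter)
```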
Technique 3 is highly available since we can have as many servers serving shortening requests as we like.
Performance is good since we don't require an extra lookup into the database to ensure there are no duplicates.
It is scalable: we can easily spin up more app servers to handle more load; just add one more app entry in ZK and have the dedicated service assign it a counter range.
Extra thoughts:
How do we shard the database in case there's not enough space on one database instance? We could shard based on hash(short URL) using consistent hashing in NoSQL. Other than hash-based partitioning, there's also key-range-based partitioning, which is good for range scans; in our case, though, range-scan efficiency isn't useful, and we care more about an even load balance between partitions, which is exactly what consistent hashing is good at (sketch below).
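A minimal consistent-hash ring sketch for picking a shard by hash(short URL); the node names and virtual-node count are made up:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map short URLs to shards; adding or removing a shard only remaps
    the keys falling in its arc, not the whole keyspace."""

    def __init__(self, nodes, vnodes=100):
        # Place each node on the ring multiple times (virtual nodes)
        # to smooth out the load distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, short_url: str) -> str:
        # Walk clockwise to the first node at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(short_url)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.node_for("abc1234"))  # one of db1/db2/db3
```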