Modern Web Serving in 2020: DNS and Edge Network (Part 2)

In Part 1 of this series, we covered a simple web serving architecture that uses a CDN and an Nginx load balancer in front of a cluster of app servers. That architecture is reasonably good for maybe a million users. How about aiming higher than that? In this part, I'll present an overview of the traffic infrastructure at a medium/large tech company, starting with an (over-simplified) diagram.

Note: there are many ways to do load balancing; in this article I'm describing just one particular approach. For more details, feel free to check out the references at the bottom.

As you can see, the diagram is heavily over-simplified (I didn't even draw any load balancers!) just so it could fit on this page, or in one's brain :) First we'll briefly walk through the steps an HTTP request flows through and what each component does (DNS, PoP, data center, etc.), and then we'll dive into each component in detail.

Going through the diagram

Steps 1, 2, 3:

Let's say a user wants to access https://www.example.com. Before the request can be sent, the user's computer needs to do a DNS lookup on the domain name to find the IP address behind it. The DNS lookup goes through a chain of servers (user's computer -> DNS resolver -> DNS root server -> top-level domain server -> authoritative name server) and eventually reaches the authoritative name server, which knows the IP address mapped to www.example.com. Note that the IP address the user gets depends on the user's location: for example, if you're located near San Francisco, you'll likely get the IP address of the San Francisco PoP (Point of Presence). These PoPs are strategically placed around the world, near users, and forward requests to data centers. I'll describe the DNS lookup in more detail in the section #DNS Lookup.

Steps 4, 5:

Once the user's request hits one of our PoPs, it goes through a bunch of routers and L4/L7 load balancers inside the PoP and gets sent to some data center. Wondering why the user's request shouldn't just go to the data center directly over the public Internet? Well, requests actually reach the data center faster by going through our PoP! To avoid repetition, I'll explain in the section #Edge PoP, stay tuned!

Now that we have a rough picture of the life of a request from user to data center, let's expand each component and learn the details. Next, we'll cover how DNS works internally and what's inside a PoP and a data center.

DNS Lookup

DNS lookup is the process of resolving a domain name into an IP address.

There are two parts to a DNS lookup:

  1. How the user's computer walks through a chain of DNS servers (eventually reaching the authoritative name server) to get the IP address mapped to the domain name
  2. How the authoritative name server decides which IP address to return to the user

The first part has stayed mostly the same over the past decades, and there isn't much need to change it. The magic happens in the second part, where an authoritative name server provider like NS1 or Amazon Route 53 lets you employ different strategies for resolving a domain name into different IP addresses. Such strategies include GeoDNS, RumDNS, or a simple one-to-one domain-name-to-IP mapping.

The first part of the DNS lookup is well explained in this article from Cloudflare: https://www.cloudflare.com/learning/dns/what-is-dns/. I highly encourage you to read it if you're not already familiar with it.

Here's a diagram of the first part of the DNS lookup, just for completeness.

Image from Cloudflare blog
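
If you want to watch this chain in action yourself, here's a rough sketch of the iterative walk a resolver performs on your behalf, using the third-party dnspython library. The root server IP is one of the public root servers; error handling, CNAME chasing, and retries are omitted on purpose, so treat it as an illustration rather than a real resolver:

```python
# Rough sketch of the iterative walk (root -> TLD -> authoritative).
# Requires the third-party dnspython package.
import dns.message
import dns.query
import dns.rdatatype

def iterative_resolve(name, nameserver="198.41.0.4"):  # a.root-servers.net
    while True:
        query = dns.message.make_query(name, dns.rdatatype.A)
        response = dns.query.udp(query, nameserver, timeout=3)

        # The authoritative name server answers with the A record directly.
        for rrset in response.answer:
            if rrset.rdtype == dns.rdatatype.A:
                return rrset[0].address

        # Otherwise this is a referral: follow a glue A record one level down the chain.
        glue = [r for r in response.additional if r.rdtype == dns.rdatatype.A]
        if not glue:
            raise RuntimeError("no glue record; a real resolver would now resolve the NS name")
        nameserver = glue[0][0].address

print(iterative_resolve("www.example.com"))
```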

Domain name -> IP translation methods

This section describes the methods an authoritative name server can use to translate a domain name into an IP address.

1. Static domain -> IP mapping with plain DNS records

This method is the most intuitive: the web server owner defines which IP address example.com maps to by adding an "A record" to a "zone" hosted on an authoritative name server provider (NS1, Amazon Route 53). An A record points a logical domain name, such as "example.com", to the IP address of some hosting server, such as "1.2.3.4".

You can also add multiple A records for the same domain name, each pointing to a different IP address. In this case, the DNS system acts as a crude load balancer that distributes users roughly evenly across the IP addresses. The downside is that it can't be used as a failover solution, since some users may still be sent to a dead server.
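
A quick way to see this behavior is to resolve a name that carries several A records and notice that clients simply end up spread across the returned addresses. This is only a small standard-library illustration; substitute a domain you know has multiple A records:

```python
# Resolving a name with several A records returns multiple IPs; resolvers rotate
# the order, which is what spreads users roughly evenly across the servers.
import random
import socket

hostname = "example.com"  # substitute a domain you know has multiple A records

_, _, ip_addresses = socket.gethostbyname_ex(hostname)
print("A records:", ip_addresses)

# Many clients just take the first entry; picking one at random has a similar effect.
print("connecting to", random.choice(ip_addresses))
```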

2. GeoDNS

Image from NS1 Blog

The basic idea of GeoDNS (aka GeoIP or geolocation routing) is to resolve a domain name dynamically based on the location of the initiator of the DNS request. For example, the pink-pin user on the map above is physically closer to a server in US-WEST-1, so it'd be reasonable to route this user there instead of to US-EAST-1.

For the authoritative name server to achieve this, two pieces of information are required:

  1. The GPS coordinates of the web servers (identified by their IP addresses) that we want the domain name to resolve to (provided by the web server owner)
  2. The GPS coordinates of the initiator of the DNS request (provided by the MaxMind GeoIP database, which is actively maintained and refined)

With this information, when a DNS request like "IP 1.2.3.x would like to know the IP address of example.com" arrives at the authoritative name server, we can easily calculate which web server (or PoP) is closest to the user.
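
Here's a toy sketch of that "closest PoP" decision. The PoP names, coordinates, and client coordinates below are all made up for illustration; in reality the client coordinates would come from a GeoIP database such as MaxMind:

```python
# Toy version of the "closest PoP" decision GeoDNS makes.
import math

POPS = {
    "us-west-1": (37.77, -122.42),  # San Francisco
    "us-east-1": (38.90, -77.04),   # Washington, DC
    "eu-west-1": (53.34, -6.26),    # Dublin
}

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def pick_pop(client_coords):
    """Return the name of the PoP nearest to the client."""
    return min(POPS, key=lambda pop: haversine_km(client_coords, POPS[pop]))

# In production, client_coords would come from a GeoIP lookup on the client's
# (or resolver's) IP address.
print(pick_pop((37.0, -121.0)))  # -> us-west-1
```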

With GeoDNS, we have more flexible control over user traffic routing and can provide better latency, since distance usually correlates positively with latency. However, due to network congestion or other oddities, it's sometimes possible for a user to have lower latency to a physically farther web server. In that case GeoDNS is not optimal, which is why we have RumDNS.

3. RumDNS (the fancy stuff!)

Last but not least, RumDNS! RUM stands for Real User Metrics.

The idea is: instead of routing users to web servers based on their distance to those servers, why not route based on their measured latency to them?

This is analogous to taking a detour on purpose to avoid traffic congestion.

Unlike GeoDNS, RumDNS requires quite a bit of work from the web server owner. Without going into a full deep dive: we need some sort of daemon running on desktop/web clients that constantly pings our web servers to measure the RTT from the client to each server. With this data, we can decide which IP subnets should go to which web servers. This mapping is then fed into the authoritative name server, which directly handles DNS requests from users.
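
To make the idea concrete, here's an illustrative-only sketch of the aggregation step: given RTT samples collected by client-side probes, pick the lowest-latency PoP for each client subnet. The subnets, PoP names, and numbers are invented:

```python
# Given RTT samples reported by client-side probes, pick the lowest-latency PoP
# for each client subnet. This routing map is what gets pushed to the
# authoritative name server.
from collections import defaultdict
from statistics import median

# (client /24 subnet, PoP, measured RTT in ms) -- invented sample data
samples = [
    ("1.2.3.0/24", "us-west-1", 38), ("1.2.3.0/24", "us-east-1", 95),
    ("1.2.3.0/24", "us-west-1", 41), ("5.6.7.0/24", "us-east-1", 22),
    ("5.6.7.0/24", "us-west-1", 80), ("5.6.7.0/24", "us-east-1", 25),
]

rtts = defaultdict(list)
for subnet, pop, rtt_ms in samples:
    rtts[(subnet, pop)].append(rtt_ms)

routing_map = {}
for (subnet, pop), values in rtts.items():
    med = median(values)
    if subnet not in routing_map or med < routing_map[subnet][1]:
        routing_map[subnet] = (pop, med)

for subnet, (pop, med) in routing_map.items():
    print(f"{subnet} -> {pop} (median RTT {med} ms)")
```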

Many tech companies, including Dropbox and LinkedIn, use RumDNS to improve overall latency and thus make the user experience on their sites much better.

Edge PoP

Before anything else, why should we even have PoPs?

Why not just send the HTTP request to the data center directly over the public Internet? The reason is that the public Internet is slow! Congestion and other issues on the public Internet can cause packets to be dropped, time out, and so on. With a PoP, once the user's request hits our PoP, it is carried over our backbone network (think of it as a network connecting our data centers and PoPs via dedicated physical circuits), which reduces packet loss because less time is spent on the public Internet.

From the picture below, we can see that latency is significantly reduced for a request going through a PoP near the user:

Stolen from the Dropbox blog (in the second picture, PoP -> data center latency is 150 ms, user -> PoP latency is 20 ms, and server time is 100 ms)

With less congestion, the TCP congestion window (CWND) also settles at a higher value, which leads to better overall throughput.

stolen from Dropbox blog

So both latency and throughput improve with the use of PoPs!

Yet another benefit of having PoPs is that static assets can be served directly from a PoP near the user instead of from the data center. This improves latency when the user fetches static assets for the first time (only), since subsequent fetches come from a CDN that's even closer to the user.

Load balancing across PoPs (aka GSLB, Global Server Load Balancing)

There are a couple of ways to load-balance user requests across PoPs.

1. BGP Anycast for load balancing PoPs

This means that all our PoPs advertise the same IP subnet, 1.2.3.0/24, via the BGP protocol. Then we set up a DNS record pointing example.com to 1.2.3.1 (some address in that subnet). This way, when a user makes a request to example.com, it will be routed to the closest PoP advertising 1.2.3.0/24 and eventually end up at 1.2.3.1. All this happens by magic, I mean, BGP.

The pro of BGP Anycast is that it's easy to set up and we get automatic failover for free: if one PoP goes down, subsequent requests are directed to another PoP by BGP.

The cons are that its performance is not optimal in terms of latency: BGP optimizes for the number of hops, not latency. Also, gracefully draining a PoP (moving traffic from PoP A to other PoPs for maintenance) is impossible with BGP, since BGP routes packets, not connections. Packets of a TCP connection destined for a drained PoP will be routed to another PoP, so the TCP connection will be reset/terminated (RST).

2. GeoDNS/RumDNS for load balancing PoPs

BGP Anycast's performance is worse than that of GeoDNS and RumDNS. With GeoDNS/RumDNS, each PoP advertises its own IP subnet, and the authoritative DNS server is responsible for pointing the user to the best PoP.

In general, performance-wise, RumDNS > GeoDNS > BGP Anycast.

However, in practice we use a hybrid approach: most DNS requests are resolved by RumDNS; if RumDNS misses, GeoDNS is used; and if neither works, we fall back to BGP Anycast, with example.com (without www) pointing into the shared IP subnet 1.2.3.0/24. This way we gain performance, graceful PoP drains, and some degree of failover.
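
In pseudocode, the decision order might look like the following sketch. The lookup functions are stubs standing in for real RUM/GeoIP data sources, and the anycast VIP is made up:

```python
# Fallback order: RumDNS first, then GeoDNS, then the shared anycast subnet.
ANYCAST_VIP = "1.2.3.1"  # an address in the subnet every PoP advertises

def rum_dns_lookup(client_ip):
    """Stub: would return a PoP VIP chosen from real-user RTT data, or None on a miss."""
    return None

def geo_dns_lookup(client_ip):
    """Stub: would return the nearest PoP's VIP from a GeoIP database, or None."""
    return None

def resolve_pop(client_ip):
    # Prefer measured latency, fall back to distance, and finally let BGP pick the PoP.
    return rum_dns_lookup(client_ip) or geo_dns_lookup(client_ip) or ANYCAST_VIP

print(resolve_pop("203.0.113.7"))  # -> "1.2.3.1" with the stub lookups above
```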

Load balancing layers inside PoP

Up to this point, the load balancing we've talked about is all handled by BGP Anycast and DNS. The next thing I want to talk about is the real load balancers that live at layers 4 and 7 of the OSI model.

Again, I'll just present one possible load balancing topology; here's a diagram:

This load balancer design is referred to as fault tolerance and scaling via clustering and distributed consistent hashing. It works as follows:

Step 1:

n edge routers each announce the same set of anycast VIPs at an identical BGP weight. ECMP (Equal-Cost Multi-Path, a routing strategy) with consistent hashing is used to ensure that all packets from the same 4-tuple (src_ip, src_port, dst_ip, dst_port) end up at the same edge router.

Although it's not strictly necessary to force packets from the same connection to always land on the same edge router, it's generally beneficial because it avoids out-of-order packets, which would degrade performance.
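
Since the same consistent-hashing idea is reused at each layer below (edge router -> L4LB -> L7LB -> backend), here's a toy consistent-hash ring to show the mechanism. Node names and virtual-node counts are illustrative; real implementations (and hardware ECMP hashing) differ in detail:

```python
# A toy consistent-hash ring: a flow key is hashed onto a ring of virtual nodes,
# and the first node clockwise from that point owns the flow.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node appears `vnodes` times on the ring for smoother balance.
        self._ring = sorted((_hash(f"{node}#{i}"), node)
                            for node in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def pick(self, flow_key: str) -> str:
        """Return the node that owns this flow key, wrapping around the ring."""
        idx = bisect.bisect(self._points, _hash(flow_key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["l4lb-1", "l4lb-2", "l4lb-3"])
flow = "198.51.100.9:54321->1.2.3.1:443"  # the 4-tuple of one TCP connection
print(ring.pick(flow))  # every packet of this flow maps to the same node
```

The useful property here is that adding or removing one node only remaps a small fraction of flows, instead of reshuffling everything.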

Step 2:

Using the same ECMP + consistent hashing, the edge router selects a layer 4 load balancer. Again, we want packets of the same flow to end up at the same L4 load balancer.

Step 3:

The L4 load balancer uses consistent hashing to select an L7 load balancer and forwards packets from the same TCP connection to the same L7 load balancer.

The internals of L4 load balancers are a topic big enough for another book. But basically, you have a set of predefined config files describing the backends (in our case a set of L7LBs), including how to health-check each backend and how to reach it (IP/port, etc.). The L4LB then constantly health-checks the L7LBs and makes sure requests are forwarded to a live host (and again, packets of the same flow end up at the same L7LB via consistent hashing).
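
A much-simplified health-check loop might look like this; the backend endpoints and the check itself (a plain TCP connect) are placeholders for whatever the real config files define:

```python
# The L4LB periodically checks each configured L7LB and keeps forwarding only
# to the ones that respond.
import socket
import time

L7_BACKENDS = [("10.0.0.11", 443), ("10.0.0.12", 443), ("10.0.0.13", 443)]

def is_alive(host, port, timeout=1.0):
    """Treat a backend as healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    alive = [backend for backend in L7_BACKENDS if is_alive(*backend)]
    # The forwarding path would rebuild its consistent-hash ring over `alive` here.
    print(f"{len(alive)}/{len(L7_BACKENDS)} L7 load balancers healthy")
    time.sleep(5)
```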

Another thing worth mentioning is that the L4LB uses GRE (Generic Routing Encapsulation) to encapsulate the IP packets it sends to the L7LBs and eventually to the backend server. With GRE, the backend server can decapsulate the packet, see the user's IP, and send the response directly to the user without going through the PoP again. This significantly reduces the PoP's workload, so its capacity can be better used to serve incoming user traffic, making things more reliable; as we know, less is more. A load balancer that uses GRE this way is called a DSR (Direct Server Return) load balancer.
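
For a rough picture of what that encapsulation looks like, here's an illustration using scapy (if you have it installed); all IPs are made up. The inner packet keeps the user's source IP, which is what lets the backend reply directly to the user:

```python
# The L4LB wraps the user's original packet in an outer IP + GRE header addressed
# to the L7LB/backend; the inner packet still carries the user's source IP.
from scapy.all import GRE, IP, TCP  # pip install scapy

user_packet = IP(src="203.0.113.7", dst="1.2.3.1") / TCP(sport=54321, dport=443)
encapsulated = IP(src="10.0.0.2", dst="10.0.0.11") / GRE() / user_packet

encapsulated.show()  # outer IP -> GRE -> original IP/TCP, left untouched
```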

Step 4:

The L7LB forwards layer 7 traffic such as HTTP, gRPC, etc.

In the L7LB, consistent hashing is also used to select the same server for the same 5-tuple (2 IPs, 2 ports, 1 protocol), because servers may have a local cache that stores data related to recent requests. However, backend services should not be implemented to rely on this behavior.

The advantages of this design are obvious: it's highly scalable and reliable. With each layer being a fleet of routers or load balancers, we can tolerate outages in a few of them and the system will keep working. We can add more load balancers as load increases, and with consistent hashing we don't have to worry about packets from the same flow going to different load balancers/backends and degrading performance.

Questions

Why do we need load balancers at all? Why don't the edge routers in the PoP talk to the backends directly with ECMP?

With load balancers we can mitigate many types of DoS attacks by doing things like limiting the number of connections. Also, having a load balancer in front of multiple servers already makes it less likely that any particular server gets overloaded.

Why do we need an L7 load balancer? Why not just use an L4 load balancer?

In short, an L7 load balancer is aware of the contents of the packet, such as the URL and cookies in the case of a layer 7 protocol like HTTP. This means it can make smarter routing decisions and load balance more efficiently than an L4 load balancer. It can also handle compression and encryption.

Why do we need an L4 load balancer? Why not just use an L7 load balancer?

The benefits of having L4LBs in front of L7LBs are:

1. The L4LB is less smart, and that's actually a benefit: it's less CPU-intensive and more resilient to DoS attacks.

2. L7LBs tend to have more bugs and need to be updated more often; having a fleet of L4LBs in front of them makes rolling deployments easier and lets us route around failures when they happen.
