Mitigating CDN issues with a Multi-CDN setup

CDNs are a critical piece of IT infrastructure; when your CDN goes down, your site goes down with it. Such outages sometimes happen on a large scale, like a recent outage described by Cedexis. Outages can also be smaller and less noticeable: these are usually local and caused by performance degradation rather than by the full unavailability of a CDN. These so-called micro-outages are harder to detect, but can still have a significant impact on your users’ experience. It is possible to minimise the impact of both types of outages (big and small), for which you’ll need three things:

  1. Proper monitoring
  2. CDN load balancing technology
  3. Multiple CDNs

Monitoring

There are 2 main types of monitoring: Real User Measurements (RUM) and Synthetic monitoring. Both types serve their purpose and are, at least partially, complementary. In short:

RUM

Short for Real User Measurements. These measurements include everything from the user’s browser to the test target.

Advantages
  • Can expose issues specific to your users’ environment and behaviour, such as particular site browsing patterns
  • Includes all sorts of (network) issues associated with consumer-grade internet connections
  • Allows you to gain insight into the correlation between performance and business metrics
Disadvantages
  • Can be influenced by activity on the end user’s computer
  • Hard to get a clean view of particular improvements to your application, as there are many interfering elements in the measurements
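
Because a single RUM sample can be skewed by whatever is happening on the end user’s computer, RUM data is usually more useful in aggregate. The sketch below is a minimal illustration, not a description of any particular RUM product: it groups samples per country and CDN and takes the median, which is far less sensitive to a single slow outlier than the mean. The field names (country, cdn, rtt_ms) and the sample values are assumptions made up for this example.

```python
from collections import defaultdict
from statistics import median


def median_rtt_by_cdn(samples):
    """Group RUM samples by (country, cdn) and return the median RTT in ms per group."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[(sample["country"], sample["cdn"])].append(sample["rtt_ms"])
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical samples; the 900 ms value stands in for a user with a busy machine.
samples = [
    {"country": "DE", "cdn": "cdn_a", "rtt_ms": 40},
    {"country": "DE", "cdn": "cdn_a", "rtt_ms": 45},
    {"country": "DE", "cdn": "cdn_a", "rtt_ms": 900},
    {"country": "DE", "cdn": "cdn_b", "rtt_ms": 60},
    {"country": "DE", "cdn": "cdn_b", "rtt_ms": 62},
]

result = median_rtt_by_cdn(samples)
print(result[("DE", "cdn_a")])  # 45: the outlier does not dominate the result
```

A percentile (for example the 95th) could be used instead of the median if you care more about your slowest users than about the typical one.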

Synthetic monitoring

Synthetic tests are most often performed from servers inside a datacenter with near perfect internet connectivity, and little or no interference from other devices on the same network. Sometimes a synthetic test platform provider ships testing units to people’s homes or offices to test locally. These nodes can then be used to perform tests, and to gain insight into the so-called last-mile.

Advantages
  • Great for getting a clean view of performance improvements that can be used in pre- and post-build testing procedures
  • Acts as a canary: if something doesn’t work from a server with a great connection, it’s likely that it isn’t working on a consumer line either
  • Allows you to gain deep insight into the various page elements
Disadvantages
  • Does not provide a real-world view of your application’s overall performance

Real life example

The graph below illustrates the time it takes a CDN to respond to a request from Germany. You can clearly see the response times of one CDN increasing in the right half of the graph, and eventually spiking in the rightmost quarter.

CDN_in Multi-CDN_Ger_last30_rtt

Round-trip time (RTT) for 5 CDNs in Germany. Lower is better.

Response times are not the only thing measured; throughput is measured as well. For throughput, higher is better, and the graph below shows the same pattern as the response times, only inverted: a relatively slow decrease in the right half of the graph and a huge drop in the rightmost quarter, approaching zero throughput.

CDN_in Multi-CDN_Ger_last30_trp

Throughput measurement for 5 CDNs in Germany. Higher is better.

The CDN provider showing the drop in the graphs above has a global footprint, and the same effect was visible at its other European Points of Presence (PoPs). The affected PoP in Germany is the provider’s easternmost PoP in Europe, so we can expect a country like Poland to be affected as well. Let’s review the effect of these issues on the performance of this CDN, as measured from Poland:

The response times.

CDN_in Multi-CDN_Pol_last30_rtt

Round-trip time (RTT) for 5 CDNs in Poland. Lower is better.

And the throughput.

CDN_in Multi-CDN_Pol_last30_trp

Throughput measurement for 5 CDNs in Poland. Higher is better.
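
The pattern in these graphs, a slow drift followed by a sharp spike, is the kind of micro-outage that is easy to miss by eye but simple to flag automatically. The sketch below is one possible approach, not the method used to produce the graphs: it compares the median RTT of the last few days against the median of the period before it. The window sizes, the factor of 1.5 and the sample values are all assumptions chosen for illustration.

```python
from statistics import median


def is_degraded(rtt_ms, baseline_days=14, recent_days=3, factor=1.5):
    """Return True when the recent median RTT exceeds the baseline median by `factor`.

    rtt_ms is a list of daily RTT values in ms, oldest first.
    """
    if len(rtt_ms) < baseline_days + recent_days:
        raise ValueError("not enough data points")
    # Baseline: the window directly before the most recent days.
    baseline = median(rtt_ms[-(baseline_days + recent_days):-recent_days])
    recent = median(rtt_ms[-recent_days:])
    return recent > factor * baseline


# Hypothetical daily values for one CDN in one country.
healthy = [40] * 17
degrading = [40] * 14 + [70, 90, 200]

print(is_degraded(healthy))    # False
print(is_degraded(degrading))  # True
```

The same check works for throughput if you invert the comparison, since there lower values indicate a problem.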

If there is an issue with a particular PoP, for instance a misconfiguration, it tends to affect users in neighbouring countries without a local PoP as well. If, however, the issue is caused by limited capacity at a certain PoP, part of that traffic will usually be routed to other PoPs nearby (in this case Amsterdam or Paris). If demand exceeds the available capacity by a wide enough margin, these other PoPs will be affected as well.
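
The overflow behaviour described above can be pictured with a very simplified model. This is a sketch under stated assumptions, not how any specific CDN routes traffic: each PoP has a fixed spare capacity, traffic fills the preferred PoP first, and whatever does not fit moves on to the next one. The capacity numbers are invented for the example.

```python
def route_with_overflow(demand_gbps, pops):
    """Fill PoPs in order of preference and pass the overflow to the next PoP.

    pops is a list of (name, spare_capacity_gbps) tuples, preferred PoP first.
    Returns the traffic assigned per PoP and the demand that could not be served.
    """
    assignments = {}
    remaining = demand_gbps
    for name, capacity in pops:
        served = min(remaining, capacity)
        assignments[name] = served
        remaining -= served
    return assignments, remaining


# Hypothetical spare capacities in Gbps.
pops = [("Germany", 80), ("Amsterdam", 30), ("Paris", 30)]

# Moderate overload: the neighbouring PoPs absorb the excess.
print(route_with_overflow(100, pops))
# ({'Germany': 80, 'Amsterdam': 20, 'Paris': 0}, 0)

# Severe overload: every PoP is full and 10 Gbps remains unserved.
print(route_with_overflow(150, pops))
# ({'Germany': 80, 'Amsterdam': 30, 'Paris': 30}, 10)
```

In the second call all three PoPs are saturated, which corresponds to the situation where the neighbouring PoPs are affected as well.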

CDN load balancing

To use multiple CDNs (Multi-CDN) for one property, you need a load balancing technology that enables you to switch between the CDNs. Ideally, the load balancing technology is connected to the RUM data, which gives you insight into the performance as seen by end users.

The two graphs below show the number of decisions for each CDN in the load balancing mix. This particular property has not been configured with the five CDNs depicted above, but with three. You can see that one of the CDNs was added around 10 days after the start of the graph, and that it immediately takes a share of the total traffic.

CDN_in Multi-CDN_impact_on_ww_decisions

Number of decisions on a global level taken in favour of each CDN

Decreased performance of one of the providers has an impact on the number of decisions made in its favour, as is clearly visible in the graph above. Let’s zoom in on the decisions taken in Germany:

CDN_in Multi-CDN_impact_on_ger_decisions

Number of decisions on a national level (Germany) taken in favour of each CDN

As the performance of the CDN gets worse, the criteria for selecting the fastest CDN, or the one with the highest throughput, rule out that particular CDN more and more, until it hardly receives any traffic. Users won’t notice any of this, as another CDN seamlessly takes its place.
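
To make the selection step more concrete, here is a minimal sketch of how a decision could be taken from recent RUM measurements. It is an illustration under assumptions, not the algorithm of any specific load balancing product: it combines both criteria by first dropping CDNs that miss a minimum throughput or exceed a maximum RTT, and then picking the lowest RTT among the rest. The function name, the thresholds and the metric values are hypothetical.

```python
def pick_cdn(metrics, max_rtt_ms=150, min_throughput_kbps=1000):
    """Pick the CDN with the lowest RTT among those that meet both thresholds.

    metrics maps a CDN name to {"rtt_ms": ..., "throughput_kbps": ...}.
    If no CDN meets the thresholds, fall back to the lowest RTT overall.
    """
    if not metrics:
        raise ValueError("no CDN metrics available")
    eligible = {
        name: m
        for name, m in metrics.items()
        if m["rtt_ms"] <= max_rtt_ms and m["throughput_kbps"] >= min_throughput_kbps
    }
    candidates = eligible or metrics
    return min(candidates, key=lambda name: candidates[name]["rtt_ms"])


# Hypothetical measurements for one country.
# cdn_a still answers quickly, but its throughput has collapsed.
metrics = {
    "cdn_a": {"rtt_ms": 40, "throughput_kbps": 50},
    "cdn_b": {"rtt_ms": 55, "throughput_kbps": 8000},
    "cdn_c": {"rtt_ms": 70, "throughput_kbps": 9000},
}

print(pick_cdn(metrics))  # cdn_b
```

Running a decision like this per country, on fresh measurements, is what makes a degraded CDN gradually lose its share of the traffic in one region while it keeps serving others.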

Multiple CDNs

The last thing you need is multiple CDN providers to work with. If you think setting up one CDN is hard, try setting up multiple!

Some of the issues you’re likely to encounter when setting up a Multi-CDN solution are:

  • Configuration sync

Keeping the configuration of the various CDNs in sync can be laborious and error-prone. Providers use different terminology for the same functionality, and have different APIs and completely different web interfaces.

  • Overview

It can be hard to keep an overview of the various CDNs. Statistics are spread over multiple web interfaces, and you might have multiple commitments in place that you need to meet.

  • Loss of bargaining power

There’s an inverse relationship between the amount of traffic you do and the price per unit. Said differently, the more traffic you do, the lower the price. If you have 300TB of traffic to negotiate with, you might get a good price. If you split the traffic over multiple CDNs, let’s say three, each taking the same portion of traffic, you only have 100TB to negotiate with per CDN provider.
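
The effect of the 300TB versus 3 x 100TB split is easy to work through with numbers. The price tiers below are entirely hypothetical and only serve to show the mechanism; real CDN pricing is negotiated and will look different.

```python
# Hypothetical volume tiers: (minimum monthly volume in TB, price per TB).
# These numbers are invented for illustration and are not real CDN prices.
TIERS = [(0, 30), (100, 25), (300, 20)]


def price_per_tb(volume_tb, tiers=TIERS):
    """Return the unit price of the highest tier whose minimum volume is reached."""
    price = tiers[0][1]
    for minimum, unit_price in tiers:
        if volume_tb >= minimum:
            price = unit_price
    return price


def total_cost(total_tb, cdn_count):
    """Total cost when the traffic is split evenly over cdn_count providers."""
    per_cdn = total_tb / cdn_count
    return cdn_count * per_cdn * price_per_tb(per_cdn)


single = total_cost(300, 1)  # 300 TB at 20 per TB
split = total_cost(300, 3)   # 3 x 100 TB at 25 per TB

print(single, split)  # 6000.0 7500.0
```

With these made-up tiers, the same 300TB costs 25% more once it is split three ways, which is the price you pay for the added resilience.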

Why Multi-CDN

Having an available and well-performing site is critically important to acquiring and retaining your users and customers. A Multi-CDN solution gives you peace of mind and a fully automated ‘plan B’ in case of outages and performance degradation of CDNs.

 

About the Author

Thijs de Zoete

Technical guy at Warpcache