The Deeper Root Cause of the Fastly and Akamai Outages

June 22, 2021 / November 9, 2021

By Alban Kwan, regional director, CSC East Asia

As we finished this article, the world was hit by another global outage, this time at content delivery network (CDN) provider Akamai, on June 17, 2021. The cause appears to be related to insufficient capacity in a “routing table” used by its distributed denial of service (DDoS) mitigation service. Although the technical analysis is not yet available, the central premise of this article also applies to this incident, and it serves as a timely testimony.

Around 6 a.m. ET on June 8, 2021, websites served through the CDN provider Fastly, such as Amazon^®, Reddit, Spotify^®, eBay^®, Twitch, Pinterest^®, and CNET, were taken offline in an incident that CNET described as “the day the internet broke^[1].” Merely nine days after the Fastly outage, Akamai, a top three CDN provider, also went down, taking some global banks and airlines with it, including Southwest Airlines, United Airlines, Commonwealth Bank of Australia, Westpac, and the Australia and New Zealand Banking Group, as well as the Hong Kong Stock Exchange website.

According to some research on the Fastly incident^[2], none of these major global corporations seemed to have any automated response system in place to mitigate such an incident, and had to manually adjust their domain name system (DNS) records to remove Fastly.

The cause of the Fastly outage has now been widely reported.

Fastly deployed a software update that contained an undiscovered bug.
One customer triggered the bug during a perfectly normal operation.

An undiscovered software bug is innocent enough; bugs are an expected part of all software. Aside from this incident having cost these brands millions in losses, what’s the big deal? While it seems paradoxical, it’s a valid question.

From zero trust to deep-seated trust in cloud

Those not directly impacted by this outage may not ask the question “what’s the big deal?” In our practical experience, however, this question is most definitely asked by most IT and security managers. This is evidenced by the lack of action from the practitioner community after similar incidents, such as the Dyn outage in 2016 and the Cloudflare^®, Azure, and Amazon Web Services outages in 2020. Akamai and Fastly will not be the last.

There are certainly practical limitations, such as budget and conflicting priorities that cause inaction. However, what’s more alarming is that cloud and CDN are often viewed as THE solution to mitigate network outages. If these giants were to also go down, decision makers would generally feel like they had done their best.

In recent years, zero trust has been a very popular concept in cybersecurity. Although zero trust mainly focuses on the endpoint, it has also evolved into a security philosophy covering the entire network architecture. This creates a stark contrast with the deep-seated, sometimes disproportionate trust of cloud services by IT and security professionals. Google^® published an article called “The cloud trust paradox”^[3] which describes the role of trust in using cloud services, and that “the very concept of using public cloud” is inseparable from “trusting your cloud provider.” Instead of building a truly zero-trust network, what we can see in practice is probably better described as outsourcing the risk to a trusted source.

As the title of the Google article suggested, “to trust cloud computing more, you need the ability to trust it less.” The IT security community may have to start trusting the CDN and cloud sources less to maintain a better security posture.

As the technical root causes of the Fastly and Akamai incidents are beyond the control of the enterprises affected, how can we trust these providers less and mitigate the risk? For this, we dive deep into the non-technical root cause.

The root cause: internet consolidation

In 2019, the Internet Society (ISOC) released an important report titled “Consolidation in the Internet Economy.” Part of the report describes how the concentration of internet infrastructure among only a few providers presents risks to the internet and to the wider community. I understand that internet consolidation is not a concept commonly discussed in the IT and security community. However, I would argue that such consolidation and concentration is one of the root causes of the Fastly and Akamai incidents, and any meaningful mitigation strategy must approach the problem from this angle.

One of the consequences of internet consolidation is that it creates deep dependencies on only a few service providers. When one of them fails, much of the internet fails with it, as happened in these two cases.

The ISOC report described the internet consolidation happening at three levels:

Internet application – Today, a small number of companies operate some of the internet’s most popular services. Google alone holds 90% of the global search market, over 60% of the web browser market, the number one (by far) mobile operating system (Android^TM), the top user-generated video platform (YouTube^TM), and has more than 1.5 billion active users of its email service (Gmail). There is a similar concentration in China with Alibaba^® and Tencent dominating eCommerce and social media platforms, respectively. Concentration at the internet application level is easy to see and problematic in itself.
Access provision – Internet service providers are concentrating because of a shrinking revenue base and a high cost of entry.
Service infrastructure – Consolidation is happening among CDNs and cloud service providers, and both are now a fundamental infrastructure component. Among the top 1,000 websites globally, CDN use grew to an estimated 87.5% in August 2018 from 50% in June 2014. Of the websites in the sample that use CDNs, 27% use Amazon CloudFront, 27% use Akamai, and 8% use Fastly. While an 8% market share is significant, it has yet to reach a critical mass, so why was the impact from Fastly’s outage felt so widely?

An in-depth investigation of the incident by The Internet Report^[4] provided an interesting answer. By looking at the IP addresses used by some of the largest online services, they uncovered that some of these CDN and cloud service providers are also using Fastly for redundancy. The internet landscape has become so concentrated that even for redundancy amongst the CDNs and cloud providers themselves, only a few players are being used. The result is that someone who is using CDN A (as an example) may actually be routing some of their traffic via Fastly, and when Fastly went down, it also impacted clients who are using CDN A. Such deep dependencies seem to go deeper than anyone may think.
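One way to look for this kind of hidden dependency in your own properties is to inspect the CNAME chain a hostname resolves through and flag every provider that appears in it. The sketch below is only an illustration of that idea, not the method used by The Internet Report: the suffix map is a small hand-picked assumption, the hostnames in the usage example are hypothetical, and the chain itself would have to come from your own DNS lookups.

```python
# Illustrative suffix-to-provider map (an assumption, far from complete).
PROVIDER_SUFFIXES = {
    "fastly.net": "Fastly",
    "akamaiedge.net": "Akamai",
    "edgekey.net": "Akamai",
    "cloudfront.net": "Amazon CloudFront",
}


def providers_in_chain(cname_chain):
    """Return the providers whose suffixes appear in a CNAME chain, in order."""
    found = []
    for name in cname_chain:
        host = name.rstrip(".").lower()
        for suffix, provider in PROVIDER_SUFFIXES.items():
            # Match the suffix itself or any subdomain of it.
            if host == suffix or host.endswith("." + suffix):
                if provider not in found:
                    found.append(provider)
    return found


# Hypothetical chain: a site on "CDN A" that ends up on Fastly.
chain = ["www.shop.example.", "shop.cdn-a.example.net.", "cdn-a.map.fastly.net."]
print(providers_in_chain(chain))  # ['Fastly']
```

A provider that shows up in the chain without appearing on any of your contracts is a candidate for the kind of deep dependency described above.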

Another interesting fact uncovered in The Internet Report is that the impacted enterprises most likely used the DNS to recover from the outage, and some actually recovered quicker than Fastly did. This brings us to another critical service infrastructure highlighted in the ISOC report: the DNS.

ISOC reported that both recursive DNS and authoritative DNS have experienced significant consolidation. An academic paper looks into whether the market learned from the massive Dyn DNS incident. Four years after the Dyn incident, the market seems to have learned very little from the impact of concentrated DNS providers^[5]. Amongst the top DNS hosting providers, only a few seem to embrace diversity and encourage clients to take up such best practices^[6]. One of the key problems not highlighted in the paper is that the most popular DNS service providers are also the largest CDN and cloud providers, which creates another layer of vertical concentration to worsen the situation.
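A quick way to see whether a domain depends on a single authoritative DNS provider is to group its nameserver hostnames by provider. The sketch below assumes you have already collected the NS hostnames, and it guesses the provider naively from the last two labels of each name. That heuristic is an assumption for illustration only: a provider that spreads its nameservers across several domains would be counted more than once.

```python
from collections import Counter


def nameserver_providers(nameservers):
    """Group NS hostnames by their last two labels (a naive provider guess)."""
    counts = Counter()
    for ns in nameservers:
        labels = ns.rstrip(".").lower().split(".")
        counts[".".join(labels[-2:])] += 1
    return counts


def has_single_dns_provider(nameservers):
    """True when every nameserver appears to belong to the same provider."""
    return len(nameserver_providers(nameservers)) == 1


# Hypothetical nameserver sets for two domains.
concentrated = ["ns1.dns-a.example.", "ns2.dns-a.example."]
diversified = ["ns1.dns-a.example.", "ns1.dns-b.example."]
print(has_single_dns_provider(concentrated))  # True
print(has_single_dns_provider(diversified))   # False
```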

DNS deserves to be highlighted because while CDN is classified as a critical infrastructure, it only impacts the online properties that are using the CDN. In the Fastly incident, anything not directly connected to Fastly, such as email servers, was basically unaffected. However, DNS is linked to almost EVERYTHING internally and externally, including your connection to a CDN and any cloud service. Thus, the impact DNS concentration has on the overall security posture of a business should not be underestimated. Since DNS was the method used by some of the corporations to recover from the Fastly outage, this further highlights its critical position in modern network and inter-network design.
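The DNS-based recovery described above can in principle be automated. The following is a minimal sketch of the decision step only, assuming an ordered list of CDN targets and a health probe you supply yourself; the function names and hostnames are hypothetical, and actually applying the change would go through whatever interface your DNS provider offers.

```python
def choose_target(candidates, is_healthy):
    """Return the first healthy target from an ordered list, else None."""
    for target in candidates:
        if is_healthy(target):
            return target
    return None


def plan_dns_change(current, candidates, is_healthy):
    """Return the CNAME target to switch to, or None if no change is needed."""
    best = choose_target(candidates, is_healthy)
    if best is None or best == current:
        # Either nothing is healthy or we already point at the best option.
        return None
    return best


# Hypothetical targets in priority order, with a stubbed health probe.
candidates = ["cdn-a.example.net", "cdn-b.example.net"]
health = {"cdn-a.example.net": False, "cdn-b.example.net": True}
print(plan_dns_change("cdn-a.example.net", candidates, health.get))
# cdn-b.example.net
```

Note that a switch like this only takes effect as quickly as the record's time to live (TTL) allows, so a short TTL on the affected record is part of the design.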

The solution and recommendations

The Fastly incident once again exposes the problem of concentration and consolidation of critical internet services, such as CDN, cloud services, and DNS. This issue is exacerbated because of the hyper dependency between the major CDN and cloud providers. The following are recommended steps to minimize the risks.

Step 1: Harness the benefits of internet concentration

While we discussed a lot of the issues with internet concentration, it’s important to stress that internet concentration also brings a lot of positives. For example, the concentration of CDN providers creates economies of scale for content delivery and also significantly reduces the data transit costs. The concentration of DNS providers allows the largest one to have enough scale to sustain Terabit-level DDoS attacks. The prominence of Google also helps the testing and development of critical new protocols such as QUIC.

Therefore, if you still maintain self-owned infrastructure but depend heavily on the internet for your main operation, we recommend first harnessing the benefits of internet concentration by moving the appropriate services to a CDN, the cloud, and enterprise DNS hosting to drive down cost and improve resilience. However, we don’t stop there.

Step 2: Prioritize diversification by going back to the root of the internet

Once cost is driven down and the baseline security posture improved, consider how to diversify these critical infrastructures to reduce risks. Prioritization implies that some things are more critical than others, so how do we determine whether CDN, cloud, or DNS is more critical?

Internet invariants are the fundamental building blocks of the internet and they will not change no matter what. In ISOC’s words, these are “what really matters^[7].”

A practical example of an internet invariant is the internet protocol (IP), as it underpins the interoperability foundation of the internet. DNS is also an internet invariant, as the internet requires a “global, managed addressing and naming service^[8]” to function with high integrity. Whether you host a server locally or on the cloud, and how you speed up website delivery, will evolve over time, so these are not invariants.

As such, we have always maintained that DNS is one of the most critical internet infrastructures and should receive higher priority. It should be decoupled from other internet service infrastructure and hosted with a DNS provider dedicated to this space to avoid the deep dependencies.

Step 3: Avoid vertical consolidation

Internet consolidation and concentration can happen horizontally (most CDN services consolidated to Akamai, Cloudflare, and a few others) and vertically (a corporation consolidates their CDN, cloud hosting and computing, and DNS to a single vendor). Horizontal consolidation is beyond what a single company can control and is determined mostly by market forces. As The Internet Report investigation shows, vertical consolidation and concentration can also happen without you knowing (i.e., the deep dependencies of service providers). As such, to reduce risk, it’s important for corporations to start decoupling the complex web of interdependencies of these large providers. The best way to start is to reduce your vertical consolidation by decoupling and diversifying the CDN, cloud hosting, and enterprise DNS providers.

Going back to our first recommendation, internet consolidation will bring cost benefits, so if you are not using an enterprise-class cloud-based provider already, this should be where you start. However, these internet service infrastructures are each so critical that the failure of any one of them will result in significant downtime. Therefore, vertical integration (using the same provider for cloud, CDN, DNS, and DDoS protection) should be avoided as much as possible.
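A simple audit can make vertical consolidation visible. The sketch below assumes you keep an inventory that maps each infrastructure layer to the vendor providing it; the layer names and vendors are hypothetical. It reports every vendor that serves more than one layer.

```python
from collections import defaultdict


def shared_vendors(inventory):
    """Map each vendor to the layers it serves, keeping vendors with 2+ layers."""
    layers_by_vendor = defaultdict(list)
    for layer, vendor in inventory.items():
        layers_by_vendor[vendor].append(layer)
    return {
        vendor: sorted(layers)
        for vendor, layers in layers_by_vendor.items()
        if len(layers) > 1
    }


# Hypothetical inventory: one vendor behind three of the four layers.
inventory = {
    "cdn": "Vendor A",
    "dns": "Vendor A",
    "ddos": "Vendor A",
    "cloud": "Vendor B",
}
print(shared_vendors(inventory))  # {'Vendor A': ['cdn', 'ddos', 'dns']}
```

An empty result means no vendor appears in more than one layer of the inventory, although hidden dependencies between the vendors themselves would still need a separate check.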

Step 4: Consolidate management; diversify infrastructure

The problems brought about by internet consolidation are very difficult to tackle, mainly because consolidation also brings a lot of benefits. Commercially, there are two main benefits: financial and managerial. Financially, there is always a tradeoff between higher security and cost, but when it comes to managerial or operational benefits, one may reap the benefits of consolidated management while diversifying the infrastructure. Using DNS management as an example, you can use one provider as primary while all updates are copied automatically to the secondary DNS infrastructure. Logically, similar arrangements can be set up for both CDN and cloud. When choosing vendors that provide your internet service infrastructure, make diversification one of the factors, provided that the chosen provider meets the necessary enterprise security level. A diversified internet is ultimately more stable and secure, and only through our collective commercial choice can we achieve it.
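To illustrate the primary and secondary DNS arrangement described above, the sketch below computes which records would have to be added to or removed from a secondary so that it mirrors the primary. It is only a conceptual illustration under the assumption that each record is a (name, type, value) tuple; in practice this copying is typically handled by standard zone transfers or by the providers' own tooling, and all names and addresses here are hypothetical.

```python
def sync_plan(primary, secondary):
    """Compute the changes that make the secondary record set match the primary."""
    to_add = sorted(primary - secondary)
    to_remove = sorted(secondary - primary)
    return to_add, to_remove


# Hypothetical zone contents, managed once on the primary.
primary = {
    ("www", "CNAME", "cdn-a.example.net."),
    ("mail", "A", "192.0.2.10"),
}
secondary = {
    ("www", "CNAME", "old.example.net."),
    ("mail", "A", "192.0.2.10"),
}

to_add, to_remove = sync_plan(primary, secondary)
print(to_add)     # [('www', 'CNAME', 'cdn-a.example.net.')]
print(to_remove)  # [('www', 'CNAME', 'old.example.net.')]
```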

[1] cnet.com/news/fastly-internet-outage-explained-how-one-customer-broke-amazon-reddit-and-half-the-web/

[2] Ep. 40 Fastly’s Outage and Why CDN Redundancy Matters, The Internet Report, youtube.com/watch?v=VNOFxULD3Lo

[3] cloud.google.com/blog/products/identity-security/trust-a-cloud-provider-that-enables-you-to-trust-them-less

[4] youtube.com/channel/UCewXUwLMfn7Y69C6vGRVwyw

[5] zdnet.com/article/four-years-after-the-dyn-ddos-attack-critical-dns-dependencies-have-only-gone-up/

[6] dl.acm.org/doi/pdf/10.1145/3419394.3423664

[7] internetsociety.org/internet-invariants-what-really-matters

[8] internetsociety.org/internet-invariants-what-really-matters
