Why do Internet service outages happen? The importance of network monitoring (and benchmarking) to understand service outages and network reliability


author: MedUX



The Internet is a mission-critical asset for business productivity and continuity wherever you are, especially now that many employees have shifted to remote work because of COVID-19 circumstances and policies.

The recent Fastly outage was a reminder of the complexity of the Internet and the importance of redundancy.

In addition, global Internet disruptions saw an unprecedented rise and remained elevated through the first half of 2020 according to recent ThousandEyes reports, with 44% more disruptions in June 2020 than in January 2020. MedUX has been reporting on some of these incidents, especially since March 2020, when everything changed.

Service outages may occur during normal network operations across ISP, public cloud and edge service networks, and should not be taken as an indication that Internet infrastructure cannot hold up under strain. A variety of factors can interrupt service, from temporary disruptions to longer-term incidents or degradations.

These kinds of issues can (and do) happen more often than we would like, and whenever we suffer them, we tend to blame the ISPs or telecommunications operators. In many cases, this is unfair.

At MedUX, our goal with network monitoring capabilities and Internet outages assessments is straightforward:

  • to provide an indication of how actual Internet users are experiencing Internet service quality in real time
  • to explore the health of Internet services and enable the identification of network issues, degradations, and outages
  • to help improve network performance and experience, so that service networks can meet end-users’ connectivity requirements.

Having said this, can you imagine not being able to connect to an important video meeting while you are working from home? Or have you ever found it impossible to access some of your most-used websites or online services?

Just because your network is “UP” does not mean the service is working well. Service outages may affect the end-user experience to a great extent, but we often do not realize how interconnected and complex the Internet really is until things break.

The MedUX approach goes beyond typical latency and packet loss measurements, and focuses on measuring, monitoring and benchmarking a wide variety of performance indicators, services and applications directly from customer premises, i.e., the TRUE customer experience or Quality of Experience (QoE). Benchmarking is as important as monitoring your own service: when an outage occurs, comparing against your competitors’ performance and dependencies gives a wider understanding of what is happening.
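As a rough illustration of why benchmarking helps during an outage, the sketch below compares per-operator success rates to suggest whether a degradation looks operator-specific or points to a shared dependency. It is not MedUX's method: the function names, the operator labels and the 0.9 "healthy" threshold are all assumptions made for this example.

```python
from typing import Dict


def success_rate(successes: int, attempts: int) -> float:
    """Share of test attempts that succeeded, as a value between 0 and 1."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def classify_incident(rates: Dict[str, float], threshold: float = 0.9) -> str:
    """Compare operators against a hypothetical 'healthy' threshold.

    If every operator is below the threshold, the cause is more likely a
    shared dependency (for example a CDN or DNS provider) than one network.
    """
    degraded = [op for op, rate in rates.items() if rate < threshold]
    if not degraded:
        return "no incident"
    if len(degraded) == len(rates):
        return "shared dependency suspected"
    return "operator-specific: " + ", ".join(sorted(degraded))


# Invented sample figures, for illustration only.
rates = {
    "operator_a": success_rate(55, 100),
    "operator_b": success_rate(97, 100),
    "operator_c": success_rate(98, 100),
}
print(classify_incident(rates))
```

With only your own measurements you would see the drop for operator_a, but not whether the other operators were affected too; the comparison is what separates the two cases.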

We analyse both network and service issues at different levels, including the user experience with OTT services, to understand not only the causes but also the impact of these service disruptions on Customer Experience.

A look back at some recent outages and their QoE impact assessment

The MedUX network testing ecosystem helps to understand a wide range of incidents affecting the customer experience, from Quality of Service (QoS) problems and network outages to Quality of Experience (QoE) issues and service degradation.

How much a service outage or degradation impacts customer experience depends on the duration of the incident, the scope of the services affected and, equally important, when the incident occurs. In fact, many outages happen at night or during off-peak hours, and many go unnoticed.

Over the last couple of years, MedUX has analysed and assessed the impact of major outages that significantly affected end-users’ Quality of Experience worldwide:

  • Fastly CDNs went down worldwide (08/06/2021) – As mentioned before, Fastly brought down Amazon, Twitter, Twitch and many more. Our analysis shows that the Web Browsing Experience was affected the most between 12 PM and 1 PM CET, as end-users either could not access websites hosted by Fastly or experienced significantly longer loading times.

  • YouTube went down worldwide (12/11/2020) – MedUX detected a service degradation between 12 AM and 2 AM (UTC). This service incident affected YouTube’s availability and, in general, the Streaming Experience, since it prevented videos from loading for at least one hour.

  • COVID-19 impact on residential networks in Europe (2020) – Users in countries such as the United Kingdom, Italy, Germany, and Spain saw their overall Internet Experience affected to some degree, partially and/or temporarily. This mostly happened during the first months of lockdowns and in areas with strong preventive measures.

  • CenturyLink/Level3 outage worldwide (30/08/2020) – CenturyLink/Level 3 suffered an incident lasting several hours that affected large enterprises using its peering services, such as Cloudflare or Google. CenturyLink/Level 3 terminates a large portion of Internet traffic around the world.

    The CenturyLink outage was identified as the cause of the service disruption that occurred at Cloudflare, a web infrastructure and website security provider that helps optimize websites and keep them up and running. MedUX analysed the impact of the service disruption in Spain, which partially affected overall Customer Experience at national level. The most-used services (Web Browsing, Cloud Storage and Streaming) were affected because of the failure at CenturyLink. Web Browsing Experience success rates were below 60% at 12 PM for some of the operators in Spain.


  • TalkTalk’s DNS service disruption affecting Internet service across the UK (29/05/2020) – MedUX detected a service outage during morning hours (between 10 AM and 12 PM BST) in TalkTalk VDSL services, which prevented its users from getting online and browsing the Internet.

  • Telekom’s, Vodafone’s, O2’s and 1&1’s Internet outage affecting xDSL operations and users’ QoE in Germany (12/02/2020) – MedUX observed a service outage during the morning hours, mostly between 2 and 9 AM (CET), on February 12th, 2020. Service availability was at its lowest level between 3 and 4 AM and gradually recovered afterwards. According to MedUX measurements, service availability was even below 60% in certain regions during the peak of the outage.

  • Vodafone’s DNS service disruption partially affecting network and customer experience in Germany (17/01/2020) – The MedUX Executive Insights Report covered the analysis and impact assessment of the service disruption and degradation between 3 AM and 9 AM (CET) on January 17th, 2020. This service degradation affected Internet connectivity and customer experience on the most in-demand services.

    Service availability was at its lowest level between 5 AM and 6 AM, when up to 75% of Vodafone client locations, and even some 1&1 client locations, showed at least one error related to DNS resolution and Internet connectivity. However, overall customer experience was not heavily affected, as the issue happened during early morning hours and seemed to be fixed around 8 AM (delayed maintenance window).


  • Gaming disruption: Riot Games service interruption during game release (01/07/2019) – MedUX detected how the interruption of Riot Games servers (League of Legends and Teamfight Tactics) affected Customer Experience in some European countries. The service suffered an overload, preventing users from getting online and playing the video game.

  • Vodafone service outage in Europe (13/06/2019) – MedUX analysed Vodafone’s fixed network outage, which had limited impact in some countries such as the United Kingdom, Italy, Portugal and Ireland, where affected users recovered their Internet service in less than an hour.
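Statements in the list above such as "up to 75% of client locations presented at least an error related to DNS resolution" are, in essence, per-hour ratios over probe results. The sketch below shows one way such a figure could be computed. The record layout (hour, location id, success flag) and the function name are hypothetical, and the sample records are invented, not MedUX measurements.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical probe record: (hour of day, location id, DNS lookup succeeded?)
Probe = Tuple[int, str, bool]


def share_of_locations_with_errors(probes: Iterable[Probe]) -> Dict[int, float]:
    """For each hour, return the share of locations with at least one failed probe."""
    seen: Dict[int, Set[str]] = defaultdict(set)
    failed: Dict[int, Set[str]] = defaultdict(set)
    for hour, location, ok in probes:
        seen[hour].add(location)
        if not ok:
            failed[hour].add(location)
    return {hour: len(failed[hour]) / len(seen[hour]) for hour in sorted(seen)}


# Invented sample records, for illustration only.
sample_probes = [
    (5, "loc1", False), (5, "loc1", True),   # one error is enough to count loc1
    (5, "loc2", False),
    (5, "loc3", False),
    (5, "loc4", True),
    (8, "loc1", True), (8, "loc2", True),
]
print(share_of_locations_with_errors(sample_probes))
```

Counting a location once, however many of its probes failed, matches the "at least one error" wording; a per-probe failure rate would be a different, usually lower, number.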

Why do Internet service outages happen? – Analysing their Root Causes

Shared platforms, hosting services, DNS servers and even physical infrastructure all contribute to the interconnected, collective fragility of the Internet.

As the Internet ecosystem of applications, services and physical infrastructure becomes more and more interconnected, outages can affect increasingly large portions of end-users’ daily lives. A variety of factors can interrupt service, from temporary disruptions to longer-term incidents or degradations. Common points of failure include major ISPs, DNS providers, CDN providers, hosting or infrastructure vendors, and even APIs for information exchange.

When these service disruptions happen, the service is usually unavailable or degraded from the end-user’s point of view, but what actually causes an outage?

Operators worldwide work hard to maintain service quality and the Internet usage experience, and in general terms, Internet service is stable. However, many problems can affect network performance, and some of them are very complex to identify and understand. Some of the most recurrent ones are listed below:

  • Internet application (content providers): Concentration at the Internet application level is easy to see, and it makes outages more problematic when they occur. Today, companies such as Google hold the Internet’s most popular services, including web search, email hosting (Gmail) and video platforms (YouTube). Application updates and application server issues or misconfigurations are frequent root causes among content providers.
  • Service infrastructure (cloud service providers): CDNs, DNS and cloud service providers are now a fundamental part of the infrastructure. For example, over 80% of top websites globally use CDNs, such as Akamai or Fastly. Software updates during regular maintenance, misconfigurations and failures of hardware or software components are frequent root causes among service infrastructure providers.
  • International connectivity (ISPs): Global communications depend heavily on subsea cables and interconnection/peering providers that connect regions as well as telecommunication networks and content providers.
  • Others: Service might be down because of natural events such as earthquakes or hurricanes, as well as electrical power outages.
  • Telecommunications operators:
    • Access provision: Internet connectivity depends on access and last-mile providers to get the content to the end-user, and any failure in the access part of the network is critical. A teardown of access links or any other access-related failure disconnects customers from their provider, affecting end-user connectivity.
    • Network Congestion: This is the reduced quality of service that mostly occurs when too many users try to access a network at the same time in a certain geographical area. Typical effects include queueing delay, packet loss or the blocking of new connections. Circuit quality may deteriorate or gridlock, causing the network to collapse and preventing users from making efficient use of it. Failing to forecast or support user demand could be considered either a business-related matter or a technical one, depending on the circumstances.
    • Transport Link failures: This failure happens when the physical or logical links between network systems or equipment assets suffer an interruption. Link failures are often associated with slow convergence times, previously allocated delay and bandwidth, and routing loops, all of which degrade network performance.
    • Equipment or node failure: Lockups and overloads can also cause equipment failure. Furthermore, not grounding the equipment or protecting it from surges can leave it vulnerable to circuit damage. These technical issues can, however, be mitigated with the appropriate hardware setup and maintenance. This category may also include crashes or unplanned reboots, line card failures or resets, CPU overload or even human misconfiguration.
    • Routing problems: In case of link and node failures, routing protocols may be able to automatically find a new stable configuration, guaranteeing good connections between any pair of nodes in the network. However, sometimes routing protocols do not repair connectivity issues the way they should, or worse, they can create outages of their own that would not have occurred in a properly configured network.
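To make the link-failure and routing points above more concrete, the following sketch models a tiny network as a set of links and recomputes a path after a link is removed. It is a deliberate simplification: it uses a plain breadth-first search over hop counts rather than any real routing protocol, and the topology and node names are invented for this example.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

Link = Tuple[str, str]


def shortest_path(links: Iterable[Link], source: str, target: str) -> Optional[List[str]]:
    """Return a fewest-hops path from source to target, or None if unreachable."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path: List[str] = []
            step: Optional[str] = node
            while step is not None:  # walk back to the source
                path.append(step)
                step = previous[step]
            return path[::-1]
        for neighbour in sorted(adjacency[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def without_link(links: Iterable[Link], failed: Link) -> List[Link]:
    """Return the links that remain after one (undirected) link fails."""
    return [link for link in links if set(link) != set(failed)]


# Invented topology: one access link, two redundant core paths to a CDN node.
links = [("home", "access"), ("access", "core1"), ("access", "core2"),
         ("core1", "cdn"), ("core2", "cdn")]

print(shortest_path(links, "home", "cdn"))
print(shortest_path(without_link(links, ("core1", "cdn")), "home", "cdn"))
print(shortest_path(without_link(links, ("home", "access")), "home", "cdn"))
```

In this toy topology a core link failure can be routed around because a second core path exists, while losing the single access link leaves the user with no path at all, which is why access failures are described as critical.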

MedUX’s network monitoring and outages detection

MedUX helps to understand a wide range of incidents affecting Customer Experience. The aim of MedUX technology is to help ISPs analyse and improve broadband Quality of Experience with performance insights from the perspective of their own end-users and of their competitors’ end-users.

MedUX monitors the public’s most-used services, including Web Browsing, YouTube, Dropbox and many other OTT applications, such as WhatsApp or Facebook. In addition, the analytical capabilities of the MedUX Ecosystem make it possible to locate and resolve incidents affecting service quality and Customer Experience.

At MedUX, we are always working hard to improve network performance, monitor Customer Experience and deliver innovative solutions to support the telecommunications industry.

Stay tuned for our next reports and insights, and get in touch with us at hello@medux.com if you need further information. Our team will be glad to discuss our new features to prevent Customer Experience issues in this innovative and hyper-connected era.

Don’t forget to follow us on social networks and subscribe to the MedUX newsletter!
