On Tuesday, 12 August 2014, the world's carriers, service providers and internet users experienced widespread and prolonged disruption from packet loss.
The disruption, which could cost businesses billions in lost trade and fees from broken service agreements, may well have been caused by the global routing table exceeding the legacy size limit of 512K.
The global routing what?
The global routing table, the ever-changing table that contains the routes between Autonomous Systems (ASes), has been growing ever since its creation at the inception of the decentralised internet controlled by the Border Gateway Protocol (BGP).
The routing table is updated constantly as new routes are learnt and old routes removed. The 'routes' are what tell every router on the internet how to reach any other router via announced IP address ranges. It’s similar to how a post or zip code allows the postal service to determine where to send mail for local sorting.
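To make the postcode analogy concrete, here is a minimal sketch of how a router might pick a route for a destination address: it chooses the most specific announced prefix that contains the address (longest-prefix match). The prefixes are reserved documentation ranges, and the next-hop labels and the `lookup` helper are hypothetical, chosen purely for illustration.

```python
import ipaddress

# Hypothetical routes: announced prefix -> next-hop label.
# The prefixes are documentation ranges, not real announcements.
ROUTES = {
    ipaddress.ip_network("198.51.100.0/24"): "peer-A",
    ipaddress.ip_network("198.51.100.128/25"): "peer-B",
    ipaddress.ip_network("203.0.113.0/24"): "peer-C",
}


def lookup(table, address):
    """Return the next hop of the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    if not matches:
        return None  # no route: the packet cannot be forwarded
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]


print(lookup(ROUTES, "198.51.100.200"))  # peer-B (the /25 beats the /24)
print(lookup(ROUTES, "198.51.100.10"))   # peer-A
print(lookup(ROUTES, "192.0.2.1"))       # None
```

Real routers use specialised data structures and hardware for this lookup; the linear scan here is only meant to show the rule being applied.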
The table's growth, however, has accelerated in recent years as carriers and ISPs tackle the impending issue of IPv4 depletion. To cope with the inevitable exhaustion of IPv4 addresses, ASes have been forced to get savvy and change the ways in which they use these valuable and limited resources. One by-product has been that address space is announced as a larger number of smaller prefixes, and every announced prefix adds an entry to the global routing table.
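As a rough illustration of why more prefixes mean a bigger table, the sketch below splits one block into smaller announcements and counts the entries. The private `10.0.0.0/22` block is a stand-in picked for the example, not a real announcement.

```python
import ipaddress

# One aggregate announcement occupies a single routing table entry.
block = ipaddress.ip_network("10.0.0.0/22")

# Announcing the same address space as /24s instead costs four entries.
pieces = list(block.subnets(new_prefix=24))
print(len(pieces))  # 4
for net in pieces:
    print(net)

# Re-aggregating the pieces collapses them back into a single entry.
merged = list(ipaddress.collapse_addresses(pieces))
print(merged)  # [IPv4Network('10.0.0.0/22')]
```

The address space covered is identical in both cases; only the number of entries each router has to store changes.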
According to the CIDR Report, a web service that tracks changes in the routing table, the table has just exceeded 512K routes for the first time ever. It crossed the threshold briefly on Friday, 8 August 2014, quickly dropped back down, then rose over it again, and stayed there, on Tuesday, 12 August 2014.
This major internet 'event' coincided with global internet service disruption and is, we are convinced, the underlying reason for it.
How can something so small break the internet?
Like any large, distributed system, the internet is vulnerable to technical limitations and assumptions that hark back to its creation.
When the BGP internet was created, 512K was a large number and was probably considered ample, if not overkill, for the global routing table, which performs a very complex function but needs to hold only very simple data. This assumption became embodied as a hardware limitation in devices that set aside memory for only 512K routes, or thereabouts, in the table.
When the table exceeded that size this week, some Default-Free Zone (DFZ) routers would no longer have been able to store every route in the table, meaning they could no longer tell other ASes where to send some of their data. Due to the central role of the DFZ routers and the interconnected nature of the internet, the effects of the resulting packet loss were felt the world over.
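The failure mode described above can be sketched with a toy table that has a hard capacity. This is a deliberate simplification: the `FixedTable` class is hypothetical, the capacity of 2 stands in for the 512K limit, and real routers react to a full table in vendor-specific ways that this sketch does not model.

```python
import ipaddress


class FixedTable:
    """Toy routing table with a hard entry limit (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.routes = {}
        self.dropped = 0  # routes that could not be stored

    def install(self, prefix, next_hop):
        """Store a route; return False if the table is already full."""
        net = ipaddress.ip_network(prefix)
        if net not in self.routes and len(self.routes) >= self.capacity:
            self.dropped += 1
            return False
        self.routes[net] = next_hop
        return True

    def lookup(self, address):
        """Longest-prefix match over the routes that were stored."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


table = FixedTable(capacity=2)  # stand-in for the 512K limit
table.install("198.51.100.0/24", "peer-A")
table.install("203.0.113.0/24", "peer-B")
stored = table.install("192.0.2.0/24", "peer-C")  # arrives after the table is full

print(stored)                       # False
print(table.dropped)                # 1
print(table.lookup("203.0.113.5"))  # peer-B
print(table.lookup("192.0.2.1"))    # None: traffic for this range has nowhere to go
```

In this model, traffic to any prefix that missed the cut is simply unroutable, which is one way packet loss could appear for some destinations while others keep working.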
How was it fixed?
While most carriers are yet to come forward and announce that it was the 512K threshold that caused the issues of 12 August, we are confident that it was at the heart of them. We expect that the affected networks responded by swapping out old hardware (that had the limitation hard-wired), upgrading firmware and changing configurations to make space for the larger routing table.
What has ServerSpace done?
Although it was impossible for anyone, including us, to fully protect themselves from the effects on other networks when the table crossed the threshold, there are steps that we can take, and have taken, to make sure that our network continues to operate properly, even if others' are on the fritz.
Phase one of our response has been to reconfigure the core routing hardware in our network to allocate more RAM to the IPv4 routing table. This has been done in case the routing table continues to grow, possibly faster than expected.
In phase two we will further upgrade our network core. This will involve a complete swap of the hardware to the latest generation of routers which have support for a considerably larger routing table. Phase one has bought us the time to research the issue, identify hardware that will need changing and also to embrace what the wider community has to say on this matter.
These measures resolve the immediate problem, pushing out the boundaries once more. But eventually these boundaries too will be reached. So our third commitment is to stay alert to the ever-changing internet landscape, keep pace with that change and remain transparent, as we were on the 12th, with our customers about events taking place on our network and beyond.