Monday, February 27, 2017

Five nines

In the world of cloud computing (which has recently become a big part of my professional life), the holy grail is to achieve "five nines".

Five nines has a variety of definitions, for example "no more than 1 failure per 10,000 requests", or "99.999% availability" (do you see the "five nines" in that second formulation?).

Achieving this level of availability in your service is UNBELIEVABLY hard; only a tiny handful of organizations in existence today can even aspire to that level of service.

But how do you know if you've achieved it?

Well, here's one way to know: Spanner, TrueTime and the CAP Theorem

For locking and consistent read/write operations, modern geographically distributed Chubby cells provide an average availability of 99.99958% (for 30s+ outages) due to various network, architectural and operational improvements. Starting in 2009, due to “excess” availability, Chubby’s Site Reliability Engineers (SREs) started forcing periodic outages to ensure we continue to understand dependencies and the impact of Chubby failures.

The above paragraph may have seemed like gobbledy-gook, so let me try to rephrase it:

Some of Google's internal services are now SO reliable that Google actually intentionally crashes them once in a while, just to let their engineers practice handling those failures.

"Excess availability." That's a problem that you don't ever hear about. Amazing.

No comments:

Post a Comment