Percona Live 2016: Operational Buddhism — Building Reliable Services From Unreliable Components


It’s another packed day here at Percona Live 2016, with many, many database topics under discussion. Some are technical, some strategic, and some operational. One talk I sat in on was given by Ernie Souhrada, Database Engineer and Bit Wrangler at Pinterest. His talk was called Operational Buddhism: Building Reliable Services From Unreliable Components.

In it, he discussed how the rise of utility computing has revolutionized much about the way organizations think about infrastructure and back-end serving systems, compared to the “olden days” of physical data centers. But success is still driven by meeting your SLAs. If services are up and sufficiently performant, you win. If not, you lose. In the traditional data center environment, fighting the uptime battle was typically driven by a philosophy Ernie calls “Operational Materialism.” The primary goal of OM is to prevent failures at the infrastructure layer. The mechanisms for doing so are plentiful and well understood, and many of them boil down to spending enough money to have at least N+1 of anything that might fail.

Ernie contends that in the cloud, Operational Materialism cannot succeed. Although the typical cloud provider tends to be reliable as a whole, there is no guarantee that any individual virtual instance will not randomly or intermittently drop off the network or be terminated outright. Since we still need to keep our services up and running and meet our SLAs, we need a different mindset that accounts for the fundamentally opaque and ephemeral nature of the public cloud.

Ernie presented an alternative to OM, a worldview that he referred to as “Operational Buddhism.” Like traditional Buddhism, OB has Four Noble Truths:

  1. Cloud-based servers can fail at any time for any reason.
  2. Trying to prevent this server failure is an endless source of suffering for DBAs and SREs alike.
  3. By accepting the impermanence of individual servers, we can focus on designing systems that are failure-resilient rather than failure-resistant.
  4. We can escape the cycle of suffering and create a better experience for our customers, users, and colleagues.

To illustrate these concepts with concrete examples, he discussed how configuration management, automation, and service discovery help Pinterest to practice Operational Buddhism for both stateful (MySQL, HBase) and stateless (web) services. He also talked about some of the roads not taken, including the debate over Infrastructure-as-a-Service (IaaS) vs. Platform-as-a-Service (PaaS).
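The recap doesn't describe Pinterest's actual tooling, but the core idea of a failure-resilient (rather than failure-resistant) stateless service can be sketched in a few lines. The example below is a minimal, hypothetical sketch, assuming a client that reads endpoints from a service-discovery registry and treats a vanished instance as a routine event: it stops routing to that instance and tries the next one. `ServiceRegistry`, `call_with_failover`, `fake_request`, and the `web-N` hostnames are all illustrative names invented here, not part of any real library or of Pinterest's stack.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class NoHealthyEndpoint(Exception):
    """Raised when every registered endpoint has been marked down."""


@dataclass
class ServiceRegistry:
    """Hypothetical in-memory stand-in for a service-discovery system.

    A real registry would be updated by health checks or instance
    lifecycle events; here we just keep an endpoint -> healthy flag map.
    """
    endpoints: Dict[str, bool] = field(default_factory=dict)

    def register(self, endpoint: str) -> None:
        self.endpoints[endpoint] = True

    def mark_down(self, endpoint: str) -> None:
        self.endpoints[endpoint] = False

    def healthy(self) -> List[str]:
        # Registration order is preserved because dicts keep insertion order.
        return [e for e, ok in self.endpoints.items() if ok]


def call_with_failover(registry: ServiceRegistry,
                       request: Callable[[str], str]) -> str:
    """Try each healthy endpoint in turn, treating failures as routine."""
    for endpoint in registry.healthy():
        try:
            return request(endpoint)
        except ConnectionError:
            # The instance disappeared: accept it, stop routing to it, move on.
            registry.mark_down(endpoint)
    raise NoHealthyEndpoint("all registered endpoints are down")


# Usage: web-1 has "dropped off the network"; the caller still gets an answer.
registry = ServiceRegistry()
for host in ("web-1", "web-2", "web-3"):
    registry.register(host)


def fake_request(endpoint: str) -> str:
    if endpoint == "web-1":
        raise ConnectionError(f"{endpoint} dropped off the network")
    return f"200 OK from {endpoint}"


print(call_with_failover(registry, fake_request))  # 200 OK from web-2
print(registry.healthy())  # ['web-2', 'web-3']
```

The design choice mirrors the third Noble Truth: instead of spending effort keeping `web-1` alive, the client assumes any instance can vanish and routes around it. Replacing the dead instance would be the job of the automation and configuration management the talk also covered.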

I was able to have a quick chat with Ernie after the talk.


See the rest of the Percona Live 2016 schedule here.
