Dec 15, 2020

AWS introduces new Chaos Engineering as a Service offering

When large companies like Netflix or Amazon want to test the resilience of their systems, they use chaos engineering tools designed to help them simulate worst-case scenarios and find potential issues before they even happen. Today at AWS re:Invent, Amazon CTO Werner Vogels introduced the company’s Chaos Engineering as a Service offering called AWS Fault Injection Simulator.

The name may lack a certain marketing panache, but Vogels said that the service is designed to help bring this capability to all companies. “We believe that chaos engineering is for everyone, not just shops running at Amazon or Netflix scale. And that’s why today I’m excited to pre-announce a new service built to simplify the process of running chaos experiments in the cloud,” Vogels said.

As he explained, the goal of chaos engineering is to understand how your application responds to issues by injecting failures into it, usually by running these experiments against production systems. AWS Fault Injection Simulator offers a fully managed service to run these experiments on applications running on AWS hardware.

AWS Fault Injection Simulator workflow.

Image Credits: Amazon / Getty Images

“FIS makes it easy to run safe experiments. We built it to follow the typical chaos experimental workflow where you understand your steady state, set a hypothesis and inject faults into your application. When the experiment is over, FIS will tell you if your hypothesis was confirmed, and you can use the data collected by CloudWatch to decide where you need to make improvements,” he explained.
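The workflow Vogels describes (observe the steady state, state a hypothesis, inject a fault, check the result) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use the FIS API, and `ToyService`, `error_rate` and `run_experiment` are hypothetical names invented for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for an application with several replicas."""
    replicas: int = 3
    retries: int = 2
    down: set = field(default_factory=set)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def handle_request(self) -> bool:
        # Try replicas in random order, giving up after `retries` extra attempts.
        order = self.rng.sample(range(self.replicas), self.replicas)
        for replica in order[: self.retries + 1]:
            if replica not in self.down:
                return True
        return False


def error_rate(service: ToyService, requests: int = 1000) -> float:
    """Fraction of simulated requests that fail."""
    failures = sum(not service.handle_request() for _ in range(requests))
    return failures / requests


def run_experiment(service: ToyService, replica_to_kill: int,
                   max_error_rate: float) -> dict:
    """Hypothesis: the error rate stays at or below max_error_rate under fault."""
    steady_state = error_rate(service)         # 1. observe the steady state
    service.down.add(replica_to_kill)          # 2. inject the fault
    try:
        under_fault = error_rate(service)      # 3. measure during the fault
    finally:
        service.down.discard(replica_to_kill)  # 4. always roll the fault back
    return {
        "steady_state": steady_state,
        "under_fault": under_fault,
        "hypothesis_confirmed": under_fault <= max_error_rate,
    }
```

With the default two retries, the toy service tolerates the loss of one of its three replicas, so the hypothesis holds. With `retries=0`, roughly a third of requests fail during the fault and the hypothesis is rejected.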

While the company announced the service today, Vogels indicated it won’t actually be available until sometime next year.

It’s worth noting that similar services already exist from companies like Gremlin, which provides a broad Chaos Engineering as a Service offering.

May 19, 2020

Gremlin brings chaos engineering to Windows platform

Chaos engineering is about helping companies set up worst-case scenarios and testing them to see what causes the operating system to fall over, but up until now, it has mostly been for teams running Linux servers. Gremlin, the startup that offers Chaos Engineering as a Service, released a new tool to give engineers working on Microsoft Windows systems access to a similar set of experiments.

Gremlin co-founder and CEO Kolton Andrus says that the four-year-old company started with Linux support, then moved to Docker containers and Kubernetes, but there has been significant demand for Windows support, and the company decided it was time to build this into the platform too.

“The same types of failure can occur, but it happens in different ways on different operating systems. And people need to be able to respond to that. So it’s been the blind spot, and we [decided to] prioritize the types of experiments that people [running Windows] need the most,” he said.

He added, “What we’re launching here is that core set of capabilities for customers so they can go out and get started right away.”

To that end, the Gremlin Windows agent lets engineers run shutdown, CPU, disk, I/O, memory and latency attacks. It’s worth noting that a third of the world’s servers still run on Windows, and the ability to test these systems in this way has been mostly confined to companies that could afford to build their own systems in-house.

What Gremlin is doing for Windows is what it has done for the other supported systems. It’s enabling any company to take advantage of chaos engineering tools to help prevent system failure. During the pandemic, as some systems have become flooded with traffic, the ability to experiment with different worst-case scenarios and figure out what brings your system to its knees is more important than ever.

The Gremlin Windows agent not only gives the company a wider range of operating system support, it also broadens its revenue base, which is increasingly important at a time of economic uncertainty.

The company, which is based in the San Francisco area, was founded in 2016 and has raised more than $26 million, according to Crunchbase data. The company raised the bulk of that, $18 million, in 2018.

Nov 18, 2019

Gremlin brings Chaos Engineering as a Service to Kubernetes

The practice of Chaos Engineering developed at Amazon and Netflix a decade ago to help those web-scale companies test their complex systems for worst-case scenarios before they happened. Gremlin was started by a former employee of both these companies to make it easier to perform this type of testing without a team of Site Reliability Engineers (SREs). Today, the company announced that it now supports Chaos Engineering-style testing on Kubernetes clusters.

The company made the announcement at the beginning of KubeCon, the Kubernetes conference taking place in San Diego this week.

Gremlin co-founder and CEO Kolton Andrus says that the idea is to be able to test and configure Kubernetes clusters so they will not fail, or are at least less likely to. He says that to do this, it’s critical to run chaos testing (tests of mission-critical systems under extreme duress) in live environments, whether you’re testing Kubernetes clusters or anything else, but that doing so is also a bit dangerous. To mitigate the risk, he says, best practices suggest limiting the experiment to the smallest test possible that gives you the most information.

“We can come in and say I’m going to deal with just these clusters. I want to cause failure here to understand what happens in Kubernetes when these pieces fail. For instance, being able to see what happens when you pause the scheduler. The goal is being able to help people understand this concept of the blast radius, and safely guide them to running an experiment,” Andrus explained.

In addition, Gremlin is helping customers harden their Kubernetes clusters to help prevent failures with a set of best practices. “We clearly have the tooling that people need [to conduct this type of testing], but we’ve also learned through many, many customer interactions and experiments to help them really tune and configure their clusters to be fault tolerant and resilient,” he said.

The Gremlin interface is designed to facilitate this kind of targeted experimentation. You can check the areas you want to apply a test, and you can see graphically which parts of the system are being tested. If things get out of control, there is a kill switch to stop the tests.

Gremlin Kubernetes testing screen (Screenshot: Gremlin)
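The two safety ideas described above, a limited blast radius and a kill switch, can be sketched in a few lines. This is an illustration only, not Gremlin's implementation or API: `pick_targets` and `run_with_kill_switch` are hypothetical helpers, and the `attack` and `healthy` callbacks are assumed to be supplied by the caller.

```python
import random
from typing import Callable, List


def pick_targets(hosts: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Limit the blast radius: choose a small, reproducible subset of hosts."""
    if not hosts:
        raise ValueError("hosts must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)


def run_with_kill_switch(targets: List[str],
                         attack: Callable[[str], None],
                         healthy: Callable[[], bool]) -> List[str]:
    """Attack targets one at a time; halt as soon as the health check fails."""
    attacked = []
    for host in targets:
        if not healthy():
            break  # kill switch: stop expanding the experiment
        attack(host)
        attacked.append(host)
    return attacked
```

Starting with a small fraction and widening it only after a clean run keeps the experiment close to the "smallest test possible" that Andrus recommends.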

Gremlin launched in 2016. Its headquarters are in San Jose. It offers both a freemium and a paid product. The company has raised almost $27 million, according to Crunchbase data.

Oct 31, 2019

How you react when your systems fail may define your business

Just around 9:45 a.m. Pacific Time on February 28, 2017, websites like Slack, Business Insider, Quora and other well-known destinations became inaccessible. For millions of people, the internet itself seemed broken.

It turned out that Amazon Web Services was having a massive outage involving S3 storage in its Northern Virginia datacenter, a problem that created a cascading impact and culminated in an outage that lasted four agonizing hours.

Amazon eventually figured it out, but you can only imagine how stressful it might have been for the technical teams who spent hours tracking down the cause of the outage so they could restore service. A few days later, the company issued a public post-mortem explaining what went wrong and which steps they had taken to make sure that particular problem didn’t happen again. Most companies try to anticipate these types of situations and take steps to keep them from ever happening. In fact, Netflix came up with the notion of chaos engineering, where systems are tested for weaknesses before they turn into outages.

Unfortunately, no tool can anticipate every outcome.

It’s highly likely that your company will encounter a problem of immense proportions like the one that Amazon faced in 2017. It’s what every startup founder and Fortune 500 CEO worries about — or at least they should. What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you learn.

We spoke to a group of highly trained disaster experts to learn more about preventing these types of moments from having a profoundly negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running.

Tammy Butow, principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy. If the site is up and running, that’s generally the key to happiness. “SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says.

Companies measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number. According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and your business’s bottom line.

Someone needs to be calm and just keep asking the right questions.

“It’s money at the end of the day, but also over time, user sentiment can change [if your site is having issues],” she says. “How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their family members. The nines don’t capture any of that.”

Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to rethink the idea of the nines. “Maybe we need to change that term. Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our products.”
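To put the nines discussed above in concrete terms, the short calculation below converts an availability percentage into the downtime it permits over a year. It is a back-of-the-envelope sketch: `allowed_downtime_minutes` is a name made up for this example, and the figure assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Five nines works out to about 5.26 minutes of downtime a year, so a single four-hour outage (240 minutes) uses up many years' worth of that budget.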

When things go wrong

Companies go to great lengths to prevent disasters and avoid disappointing their customers, and usually have contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control. When that happens, SREs need to execute, which takes planning, too: knowing what to do when the going gets tough.
