Oct 08, 2021

Disaster Recovery for MongoDB on Kubernetes

As per the glossary, Disaster Recovery (DR) protocols are an organization’s method of regaining access and functionality to its IT infrastructure after events like a natural disaster, a cyber attack, or even business disruptions related to the COVID-19 pandemic. When we talk about data, storing backups on remote servers is enough to pass DR compliance checks for some companies. For others, Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) are extremely tight and require more than just a backup/restore procedure.

In this blog post, we are going to show you how to set up MongoDB on two distant Kubernetes clusters with Percona Distribution for MongoDB Operator to meet the toughest DR requirements.

What to Expect

Here is what we are going to do:

  1. Set up two Kubernetes clusters.
  2. Deploy Percona Distribution for MongoDB Operator on both of them. The Disaster Recovery site will run a MongoDB cluster in unmanaged mode.
  3. Simulate a failure and perform a failover to the DR site.

In version 1.10.0 of the Operator, we added a Technology Preview of a new feature that enables users to deploy unmanaged MongoDB nodes and connect them to existing Replica Sets.

Set it All Up

We are not going to cover the configuration of the Kubernetes clusters themselves; in our tests we relied on two Google Kubernetes Engine (GKE) clusters deployed in different regions.

Prepare Main Site

We have shared all the resources for this blog post in this GitHub repo. As a first step, we are going to deploy the Operator on the Main site:

$ kubectl apply -f bundle.yaml

Deploy the managed MongoDB cluster with cr-main.yaml:

$ kubectl apply -f cr-main.yaml
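
Before moving on, it does not hurt to check that the Operator has reconciled the cluster. A quick sanity check (psmdb is the short name of the PerconaServerMongoDB custom resource; my-cluster-name is the cluster name used in the default cr-main.yaml):

$ kubectl get psmdb my-cluster-name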

It is important to understand that every ReplicaSet node, including the Config Servers, must be exposed through a dedicated Service. This is required so that ReplicaSet nodes on the Main and DR sites can reach each other, forming a full mesh:

[Diagram: full mesh of exposed ReplicaSet nodes across the Main and DR sites]

To get there, cr-main.yaml has the following changes:

spec:
  replsets:
  - name: rs0
    expose:
      enabled: true
      exposeType: LoadBalancer
  sharding:
    configsvrReplSet:
      expose:
        enabled: true
        exposeType: LoadBalancer

We are using the LoadBalancer Kubernetes Service object as it is just simpler for us, but there are other options (ClusterIP, NodePort). It is also possible to use third-party tools like Submariner to implement a private connection.

If you already have a MongoDB cluster running in Kubernetes, you can expose the ReplicaSets without downtime by changing these variables.
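
For example, a minimal sketch of such a change on a running cluster (the cluster name my-cluster-name is taken from the default cr-main.yaml): a merge patch flips exposure for the config server Replica Set in place; for the data-bearing Replica Sets it is simpler to edit cr-main.yaml and re-apply it.

$ kubectl patch psmdb my-cluster-name --type merge -p '
spec:
  sharding:
    configsvrReplSet:
      expose:
        enabled: true
        exposeType: LoadBalancer
'
$ kubectl apply -f cr-main.yaml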

Prepare Disaster Recovery Site

The configuration of the Disaster Recovery site could be broken down into the following steps:

  1. Copy the Secrets from the Main cluster:
    1. system users’ Secret
    2. SSL keys, used both for external connections and for internal replication traffic
  2. Tune the Custom Resource:
    1. run the nodes in unmanaged mode: the Operator does not control Replica Set configuration or Secrets generation
    2. expose the ReplicaSets (the same way we do it on the Main cluster)
    3. disable backups: backups can only be taken on a cluster managed by the Operator

Copy the Secrets

System users’ credentials are stored by default in the my-cluster-name-secrets Secret object and referenced in spec.secrets.users. Apply this Secret in the DR cluster with kubectl apply -f yaml-with-secrets. If you don’t have it in your source code repository, or if you rely on the Operator to generate it, you can get the Secret from Kubernetes itself, remove the unnecessary metadata, and apply it.

On the Main site, execute:

$ kubectl get secret my-cluster-name-secrets -o yaml > my-cluster-secrets.yaml

Now remove the following lines from metadata:

annotations
creationTimestamp
resourceVersion
selfLink
uid

Save the file and apply it to the DR cluster.
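
Putting the steps together, a minimal sketch of the whole copy (the kubeconfig context names main and dr are assumptions; substitute your own):

# export the Secret from the Main cluster
$ kubectl --context main get secret my-cluster-name-secrets -o yaml > my-cluster-secrets.yaml
# edit my-cluster-secrets.yaml, delete the metadata fields listed above,
# and then create the Secret on the DR cluster
$ kubectl --context dr apply -f my-cluster-secrets.yaml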

The procedure for copying the SSL keys is almost the same as for the users. The difference is the names of the Secret objects: they are usually called <CLUSTER_NAME>-ssl and <CLUSTER_NAME>-ssl-internal. It is also possible to specify them in secrets.ssl and secrets.sslInternal in the Custom Resource. Copy these two Secrets from Main to DR and reference them in the CR.
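
A sketch of that copy, renaming the Secrets to the names referenced in cr-replica.yaml below (again, the main and dr context names are assumptions):

$ for s in ssl ssl-internal; do
    kubectl --context main get secret "my-cluster-name-$s" -o yaml \
      | sed "s/name: my-cluster-name-$s/name: replica-cluster-$s/" \
      > "replica-cluster-$s.yaml"
  done
# strip the same metadata fields as for the users Secret, then:
$ kubectl --context dr apply -f replica-cluster-ssl.yaml -f replica-cluster-ssl-internal.yaml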

Tune Custom Resource

cr-replica.yaml will have the following changes:

  unmanaged: true

  secrets:
    users: my-cluster-name-secrets
    ssl: replica-cluster-ssl
    sslInternal: replica-cluster-ssl-internal

  replsets:
  - name: rs0
    size: 3
    expose:
      enabled: true
      exposeType: LoadBalancer

  sharding:
    enabled: true
    configsvrReplSet:
      size: 3
      expose:
        enabled: true
        exposeType: LoadBalancer

  backup:
    enabled: false

Once the Custom Resource is applied, the Services are going to be created. We will need the external IP address of each exposed ReplicaSet node on the DR site to configure the Main site:

$ kubectl get services
NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)           AGE
replica-cluster-cfg-0    LoadBalancer   10.111.241.213   34.78.119.1       27017:31083/TCP   5m28s
replica-cluster-cfg-1    LoadBalancer   10.111.243.70    35.195.138.253    27017:31957/TCP   4m52s
replica-cluster-cfg-2    LoadBalancer   10.111.246.94    146.148.113.165   27017:30196/TCP   4m6s
...
replica-cluster-rs0-0    LoadBalancer   10.111.241.41    34.79.64.213      27017:31993/TCP   5m28s
replica-cluster-rs0-1    LoadBalancer   10.111.242.158   34.76.238.149     27017:32012/TCP   4m47s
replica-cluster-rs0-2    LoadBalancer   10.111.242.191   35.195.253.107    27017:31209/TCP   4m22s
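
If you prefer not to copy the addresses by hand, a small jsonpath helper like the one below prints each Service name with its LoadBalancer IP. The app.kubernetes.io/instance label is an assumption about how the Operator labels its Services; verify it with kubectl get services --show-labels first.

$ kubectl get services -l app.kubernetes.io/instance=replica-cluster \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.loadBalancer.ingress[0].ip}{"\n"}{end}'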

Add External Nodes to Main

In this step, we are going to add the unmanaged nodes to the Replica Sets on the Main site. In cr-main.yaml we should add externalNodes under replsets[] and sharding.configsvrReplSet:

  replsets:
  - name: rs0
    externalNodes:
    - host: 34.79.64.213
      priority: 1
      votes: 1
    - host: 34.76.238.149
      priority: 1
      votes: 1
    - host: 35.195.253.107
      priority: 0
      votes: 0

  sharding:
    configsvrReplSet:
      externalNodes:
      - host: 34.78.119.1
        priority: 1
        votes: 1
      - host: 35.195.138.253
        priority: 1
        votes: 1
      - host: 146.148.113.165
        priority: 0
        votes: 0

Please note that we add three nodes, but only two of them are voters. We do this to avoid split-brain situations and to avoid starting a primary election if the DR site is down or there is a network disruption between the Main and DR sites.
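
With cr-main.yaml updated, re-apply it on the Main site so that the Operator adds the DR members to both Replica Set configurations:

$ kubectl apply -f cr-main.yaml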

Failover

Once all the configuration above is applied, the situation will look like this:

[Diagram: each Replica Set now spans both sites: three voting members on Main, plus two voting and one non-voting member on DR]

We have three voters in the main cluster and two voters in the replica cluster. That means the replica nodes won’t have a majority if the main cluster fails, and they won’t be able to elect a new primary on their own. Therefore, we need to step in and perform a manual failover.

Let’s kill the main cluster:

gcloud compute instances list | 
grep my-main-gke-demo | 
awk '{print $1}' | 
xargs gcloud compute instances delete --zone europe-west3-b

gcloud container node-pools delete default-pool \
--zone europe-west3-b \
--cluster my-main-gke-demo

I deleted the nodes and the node pool of the main Kubernetes cluster, so now the cluster is in an unhealthy state. Let’s see what mongos on the DR site says when we try to read or write through it (psmdb-tester can be found in the Git repo as well):

% ./psmdb-tester
2021/09/03 18:19:19 Successfully connected and pinged 34.141.3.189:27017
2021/09/03 18:19:40 read failed: (FailedToSatisfyReadPreference) Encountered non-retryable error during query :: caused by :: Could not find host matching read preference { mode: "primary" } for set cfg
2021/09/03 18:19:49 write failed: (FailedToSatisfyReadPreference) Could not find host matching read preference { mode: "primary" } for set cfg

Normally, we can only alter the replica set configuration from the primary node, but in a situation like this, where there is no primary and only a few surviving members, MongoDB allows us to force the reconfiguration from any surviving member.

Let’s connect to one of the secondary nodes in the replica cluster and perform the failover:

kubectl exec -it psmdb-client-7b9f978649-pjb2k -- mongo 'mongodb://clusterAdmin:<pass>@replica-cluster-rs0-0.replica.svc.cluster.local/admin?ssl=false'
...
rs0:SECONDARY> cfg = rs.config()
rs0:SECONDARY> cfg.members = [cfg.members[3], cfg.members[4], cfg.members[5]]
rs0:SECONDARY> rs.reconfig(cfg, {force: true})

Note that the indexes of the surviving members may differ in your environment, so check the rs.status() and rs.config() output first. The main idea is to repopulate cfg.members with only the surviving members.
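
If you prefer to select the survivors by host rather than by array index, here is a sketch of the same reconfiguration (the addresses are the DR LoadBalancer IPs listed earlier; the default port 27017 is assumed):

rs0:SECONDARY> var survivors = ["34.79.64.213:27017", "34.76.238.149:27017", "35.195.253.107:27017"]
rs0:SECONDARY> cfg = rs.config()
rs0:SECONDARY> cfg.members = cfg.members.filter(function (m) { return survivors.indexOf(m.host) !== -1 })
rs0:SECONDARY> rs.reconfig(cfg, {force: true})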

After the reconfiguration, the replica set will have just three members, two of which have votes and form a majority, so they will be able to elect a new primary. After performing the same process on the cfg replica set, we will be able to read and write through mongos again:

% ./psmdb-tester
2021/09/03 18:41:48 Successfully connected and pinged 34.141.3.189:27017
2021/09/03 18:41:49 read succeed
2021/09/03 18:41:50 read succeed
2021/09/03 18:41:51 read succeed
2021/09/03 18:41:52 read succeed
2021/09/03 18:41:53 read succeed
2021/09/03 18:41:54 read succeed
2021/09/03 18:41:55 read succeed
2021/09/03 18:41:56 read succeed
2021/09/03 18:41:57 read succeed
2021/09/03 18:41:58 read succeed
2021/09/03 18:41:58 write succeed

Once the replica cluster has become the primary, you should reconfigure all clients that connect to the old main cluster and point them to the DR site.

Conclusion

Disaster Recovery is important for business continuity. The goal of administrators and SREs is to have a plan in place. With the new release of Percona Distribution for MongoDB Operator, setting up DR is fast and automated, enabling IT teams to meet their RTO and RPO requirements.

We encourage you to try out our operator. See our GitHub repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions please raise the topic in the community forum.

Are you a developer looking to contribute? Please read our CONTRIBUTING.md and send us a Pull Request.

Percona Distribution for MongoDB Operator

The Percona Distribution for MongoDB Operator simplifies running Percona Server for MongoDB on Kubernetes and provides automation for day-1 and day-2 operations. It’s based on the Kubernetes API and enables highly available environments. Regardless of where it is used, the Operator creates a member that is identical to other members created with the same Operator. This provides an assured level of stability to easily build test environments or deploy a repeatable, consistent database environment that meets Percona expert-recommended best practices.


Jan 19, 2021

StackPulse announces $28M investment to help developers manage outages

When a system outage happens, chaos can ensue as the team tries to figure out what’s happening and how to fix it. StackPulse, a new startup that wants to help developers manage these crisis situations more efficiently, emerged from stealth today with a $28 million investment.

The round actually breaks down to a previously unannounced $8 million seed investment and a new $20 million Series A. GGV led the A round, while Bessemer Venture Partners led the seed and also participated in the A. Glenn Solomon at GGV and Amit Karp at Bessemer will join the StackPulse board.

Nobody is immune to these outages. We’ve seen incidents from companies as varied as Amazon and Slack in recent months. The biggest companies like Google, Facebook and Amazon employ site reliability engineers and build customized platforms to help remediate these kinds of situations. StackPulse hopes to put this kind of capability within reach of companies whose only defense is their on-call developers.

Company co-founder and CEO Ofer Smadari says that in the midst of a crisis with signals coming at you from Slack and PagerDuty and other sources, it’s hard to figure out what’s happening. StackPulse is designed to help sort out the details to get you back to equilibrium as quickly as possible.

First off, it helps identify the severity of the incident. Is it a false alarm or something that requires your team’s immediate attention or something that can be put off for a later maintenance cycle? If there is something going wrong that needs to be fixed right now, StackPulse can not only identify the source of the problem, but also help fix it automatically, Smadari explained.

After the incident has been resolved, it can also help with a post-mortem to figure out what exactly went wrong by pulling in all of the alert communications and incident data into the platform.

As the company emerges from stealth, it has some early customers, and 35 employees based in Portland, Oregon and Tel Aviv. Smadari says that he hopes to have 100 employees by the end of this year. As he builds the organization, he is thinking about how to build a diverse team for a diverse customer base. He believes that people with diverse backgrounds build a better product. He adds that diversity is a top level goal for the company, which already has an HR leader in place to help.

Glenn Solomon from GGV, who will be joining the company board, saw a strong founding team solving a big problem for companies and wanted to invest. “When they described the vision for the product they wanted to build, it made sense to us,” he said.

Customers are impatient with down time and Solomon sees developers on the front line trying to solve these issues. “Performance is more important than ever. When there is downtime, it’s damaging to companies,” he said. He believes StackPulse can help.

Sep 02, 2020

Transposit scores $35M to build data-driven runbooks for faster disaster recovery

Transposit is a company built by engineers to help engineers, and one big way to help them is to get systems up and running faster when things go wrong — as they always will at some point. Transposit has come up with a way to build runbooks for faster disaster recovery, while using data to update them in an automated fashion.

Today, the company announced a $35 million Series B investment led by Altimeter Capital, with participation from existing investors Sutter Hill Ventures, SignalFire and Unusual Ventures. Today’s investment brings the total raised to $50.4 million, according to the company.

Company CEO Divanny Lamas and CTO and founder Tina Huang see technology issues as less an engineering problem and more as a human problem, because it’s humans who have to clean up the messes when things go wrong. Huang says forgetting the human side of things is where she thinks technology has gone astray.

“We know that the real superpower of the product is that we focus on the human and the user side of things. And as a result, we’re building an engineering culture that I think is somewhat differentiated,” Huang told TechCrunch.

Transposit is a platform that at its core helps manage APIs, connections to other programs, so it starts with a basic understanding of how various underlying technologies work together inside a company. This is essential for a tool that is trying to help engineers in a moment of panic figure out how to get back to a working state.

When it comes to disaster recovery, there are essentially two pieces: getting the systems working again, then figuring out what happened. For the first piece, the company is building data-driven runbooks. By being data-driven, they aren’t static documents. Instead, the underlying machine learning algorithms can look at how the engineers recovered and adjust accordingly.

[Image: Transposit disaster recovery dashboard. Image Credits: Transposit]

“We realized that no one was focusing on what we realize is the root problem here, which is how do I have access to the right set of data to make it easier to reconstruct that timeline, and understand what happened? We took those two pieces together, this notion that runbooks are a critical piece of how you spread knowledge and spread process, and this other piece, which is the data, is critical,” Huang said.

Today the company has 26 employees, including Huang and Lamas, who Huang brought on board from Splunk last year to be CEO. The company is somewhat unique having two women running the organization, and they are trying to build a diverse workforce as they build their company to 50 people in the next 12 months.

The current make-up is 47% female engineers, and the goal is to remain diverse as they build the company, something that Lamas admits is challenging to do. “I wish I had a magic answer, or that Tina had a magic answer. The reality is that we’re just very demanding on recruiters. And we are very insistent that we have a diverse pipeline of candidates, and are constantly looking at our numbers and looking at how we’re doing,” Lamas said.

She says being diverse actually makes it easier to recruit good candidates. “People want to work at diverse companies. And so it gives us a real edge from a kind of culture perspective, and we find that we get really amazing candidates that are just tired of the status quo. They’re tired of the old way of doing things and they want to work in a company that reflects the world that they want to live in,” she said.

The company, which launched in 2016, took a few years to build the first piece, the underlying API platform. This year it added the disaster recovery piece on top of that platform, and has been running its beta since the beginning of the summer. They hope to add additional beta customers before making it generally available later this year.

Jul 07, 2020

OwnBackup lands $50M as backup for Salesforce ecosystem thrives

OwnBackup has made a name for itself primarily as a backup and disaster recovery system for the Salesforce ecosystem, and today the company announced a $50 million investment.

Insight Partners led the round, with participation from Salesforce Ventures and Vertex Ventures. This chunk of money comes on top of a $23 million round from a year ago, and brings the total raised to more than $100 million, according to the company.

It shouldn’t come as a surprise that Salesforce Ventures chipped in when the majority of the company’s backup and recovery business involves the Salesforce ecosystem, although the company will be looking to expand beyond that with the new money.

“We’ve seen such growth over the last two and a half years around the Salesforce ecosystem, and the other ISV partners like Veeva and nCino that we’ve remained focused within the Salesforce space. But with this funding, we will expand over the next 12 months into a few new ecosystems,” company CEO Sam Gutmann told TechCrunch.

In spite of the pandemic, the company continues to grow, adding 250 new customers last quarter, bringing it to over 2,000 customers and 250 employees, according to Gutmann.

He says that raising the round, which closed at the beginning of May, had some hairy moments as the pandemic began to take hold across the world and worsen in the U.S. For a time, he began talking to new investors in case his existing ones got cold feet. As it turned out, when the quarterly numbers came in strong, the existing ones came back and the round was oversubscribed, Gutmann said.

“Q2 frankly was a record quarter for us, adding over 250 new accounts, and we’re seeing companies start to really understand how critical this is,” he said.

The company plans to continue hiring through the pandemic, although he says it might not be quite as aggressively as they once thought. Like many companies, even though they plan to hire, they are continually assessing the market. At this point, he foresees growing the workforce by about another 50 people this year, but that’s about as far as he can look ahead right now.

Gutmann says he is working with his management team to make sure he has a diverse workforce right up to the executive level, but he says it’s challenging. “I think our lower ranks are actually quite diverse, but as you get up into the leadership team, you can see on the website unfortunately we’re not there yet,” he said.

They are instructing their recruiting teams to look for diverse candidates whether by gender or ethnicity, and employees have formed a diversity and inclusion task force with internal training, particularly for managers around interviewing techniques.

He says going remote has been difficult, and he misses seeing his employees in the office. He hopes to have at least some come back before the end of the summer and slowly add more as we get into the fall, but that will depend on how things go.

May 20, 2020

FireHydrant lands $8M Series A for disaster management tool

When I spoke to Robert Ross, CEO and co-founder at FireHydrant, we had a technology adventure. First the audio wasn’t working correctly on Zoom, then Google Meet. Finally we used cell phones to complete the interview. It was like a case study in what FireHydrant is designed to do — help companies manage incidents and recover more quickly when things go wrong with their services.

Today the company announced an $8 million Series A from Menlo Ventures and Work-Bench. That brings the total raised to $9.5 million, including the $1.5 million seed round we reported on last April.

In the middle of a pandemic with certain services under unheard of pressure, understanding what to do when your systems crash has become increasingly important. FireHydrant has literally developed a playbook to help companies recover faster.

These run books are digital documents that are unique to each company and include what to do to help manage the recovery process. Some of that is administrative. For example, certain people have to be notified by email, a Jira ticket has to be generated and a Slack channel opened to provide a communications conduit for the team.

While Ross says you can’t define the exact recovery process itself because each incident tends to be unique, you can set up an organized response to an incident and that can help you get to work on the recovery much more quickly. That ability to manage an incident can be a difference maker when it comes to getting your system back to a steady state.

Ross is a former site reliability engineer (SRE) himself. He has experienced the kinds of problems his company is trying to solve, and that background was something that attracted investor Matt Murphy from Menlo Ventures.

“I love his authentic perspective, as a former SRE, on the problem and how to create something that would make the SRE function and processes better for all. That value prop really resonated with us in a time when the shift to online is accelerating and remote coordination between people tasked with identifying and fixing problems is at all time high in terms of its importance. Ultimately we’re headed toward more and more automation in problem resolution and FH helps pave the way,” Murphy told TechCrunch.

It’s not easy being an early-stage company in the current climate, but Ross believes his company has created something that will resonate, perhaps even more right now. As he says, every company has incidents, and how you react can define you as a company. Having tooling to help you manage that process helps give you structure at a time you need it most.

Oct 31, 2019

How you react when your systems fail may define your business

Just around 9:45 a.m. Pacific Time on February 28, 2017, websites like Slack, Business Insider, Quora and other well-known destinations became inaccessible. For millions of people, the internet itself seemed broken.

It turned out that Amazon Web Services was having a massive outage involving S3 storage in its Northern Virginia datacenter, a problem that created a cascading impact and culminated in an outage that lasted four agonizing hours.

Amazon eventually figured it out, but you can only imagine how stressful it might have been for the technical teams who spent hours tracking down the cause of the outage so they could restore service. A few days later, the company issued a public post-mortem explaining what went wrong and which steps they had taken to make sure that particular problem didn’t happen again. Most companies try to anticipate these types of situations and take steps to keep them from ever happening. In fact, Netflix came up with the notion of chaos engineering, where systems are tested for weaknesses before they turn into outages.

Unfortunately, no tool can anticipate every outcome.

It’s highly likely that your company will encounter a problem of immense proportions like the one that Amazon faced in 2017. It’s what every startup founder and Fortune 500 CEO worries about — or at least they should. What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you learn.

We spoke to a group of highly-trained disaster experts to learn more about preventing these types of moments from having a profoundly negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running.

Tammy Butow, principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy. If the site is up and running, that’s generally the key to happiness. “SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says.

Companies measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number. According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and your business’s bottom line.

Someone needs to be calm and just keep asking the right questions.

“It’s money at the end of the day, but also over time, user sentiment can change [if your site is having issues],” she says. “How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their family members. The nines don’t capture any of that.”

Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to rethink the idea of the nines. “Maybe we need to change that term. Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our products.”

When things go wrong

Companies go to great lengths to prevent disasters to avoid disappointing their customers and usually have contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control. When that happens, SREs need to execute, which takes planning, too; knowing what to do when the going gets tough.

Apr 02, 2019

FireHydrant lands $1.5M seed investment to bring order to IT disaster recovery

FireHydrant, an NYC startup, wants to help companies recover from IT disasters more quickly, and understand why they happened — with the goal of preventing similar future scenarios from happening again. Today, the fledgling startup announced a $1.5 million seed investment from Work-Bench, a New York City venture capital firm that invests in early-stage enterprise startups.

In addition to the funding, the company announced it was opening registration for its FireHydrant incident management platform. The product has been designed with Google’s Site Reliability Engineering (SRE) methodology in mind, but company co-founder and CEO Bobby Ross says the tool is designed to help anyone understand the cause of a disaster, regardless of what happened, and whether they practice SRE or not.

“I had been involved in several fire fighting scenarios — from production databases being dropped to Kubernetes upgrades gone wrong — and every incident had a common theme: ‘absolute chaos’,” Ross wrote in a blog post announcing the new product.

The product has two main purposes, according to Ross. It helps you figure out what’s happening as you attempt to recover from an ongoing disaster scenario, and once you’ve put out the fire, it lets you do a post-mortem to figure out exactly what happened with the hope of making sure that particular disaster doesn’t happen again.

As Ross describes it, a tool like PagerDuty can alert you that there’s a problem, but FireHydrant lets you figure out what specifically is going wrong and how to solve it. He says that the tool works by analyzing change logs, as a change is often the primary culprit of IT incidents. When you have an incident, FireHydrant will surface that suspected change, so you can check it first.

“We’ll say, hey, you had something change recently in this vicinity where you have an alert going off. There is a high likelihood that this change was actually causing your incident. And we actually bubble that up and mark it as a suspect,” Ross explained.

[Screenshot: FireHydrant]

Like so many startups, the company developed from a pain point the founders were feeling. The three founders were responsible for solving major outages at companies like Namely, DigitalOcean, CoreOS and Paperless Post.

But the actual idea for the company came about almost accidentally. In 2017, Ross was working on a series of videos and needed a way to explain what he was teaching. “I began writing every line of code with live commentary, and soon FireHydrant started to take the shape of what I envisioned as an SRE while at Namely, and I started to want it more than the video series. 40 hours of screencasts recorded later, I decided to stop recording and focus on the product…,” Ross wrote in the blog post.

Today it integrates with PagerDuty, GitHub and Slack, but the company is just getting started with the three founders, all engineers, working on the product and a handful of beta customers. It is planning to hire more engineers to keep building out the product. It’s early days, but if this tool works as described, it could go a long way toward solving the fire-fighting issues that every company faces at some point.

Jan 08, 2019

Amazon reportedly acquired Israeli disaster recovery service CloudEndure for around $200M

Amazon has reportedly acquired Israeli disaster recovery startup CloudEndure. Neither company has responded to our request for confirmation, but we have heard from multiple sources that the deal has happened. While some outlets have been reporting the deal was worth $250 million, we are hearing it’s closer to $200 million.

The company provides disaster recovery for cloud customers. You may be thinking that disaster recovery is precisely why we put our trust in cloud vendors. If something goes wrong, it’s the vendor’s problem — and you would be right to make this assumption, but nothing is simple. If you have a hybrid or multi-cloud scenario, you need to have ways to recover your data in the event of a disaster like weather, a cyberattack or political issue.

That’s where a company like CloudEndure comes into play. It can help you recover and get back up and running in another place, no matter where your data lives, by providing continuous backup and migration between clouds and private data centers. While CloudEndure currently works with AWS, Azure and Google Cloud Platform, it’s not clear if Amazon would continue to support these other vendors.

The company was backed by Dell Technologies Capital, Infosys and Magma Venture Partners, among others. Ray Wang, founder and principal analyst at Constellation Research, says Infosys recently divested its part of the deal and that might have precipitated the sale. “So much information is sitting in the cloud that you need backups and regions to make sure you have seamless recovery in the event of a disaster,” Wang told TechCrunch.

While he isn’t clear what Amazon will do with the company, he says it will test just how open it is. “If you have multi-cloud and want your on-prem data backed up, or if you have backup on one cloud like AWS and want it on Google or Azure, you could do this today with CloudEndure,” he said. “That’s why I’m curious if they’ll keep supporting Azure or GCP,” he added.

CloudEndure was founded in 2012 and has raised just over $18 million. Its most recent investment came in 2016 when it raised $6 million, led by Infosys and Magma.

Dec 07, 2017

Heptio teams up with Microsoft to build a better Kubernetes disaster recovery solution

With the rise of Kubernetes as the de facto standard for container orchestration, it’s no surprise that there’s now a whole ecosystem of companies springing up around this open source project. Heptio is one of the most interesting ones, in no small part due to the fact that it was founded by Kubernetes co-founders Joe Beda and Craig McLuckie. Today, Heptio announced that it is…

Nov 28, 2017

VMware expands AWS partnership with new migration and disaster recovery tools

Remember how VMware was supposed to be disrupted by AWS? Somewhere along the way it made a smart move. Instead of fighting the popular cloud platform, it decided to make it easier for IT to use its products on AWS. Today, at the opening of the AWS re:invent customer conference, it announced plans to expand that partnership with some new migration and disaster recovery services. As Mark…
