Apr 02, 2019

FireHydrant lands $1.5M seed investment to bring order to IT disaster recovery

FireHydrant, an NYC startup, wants to help companies recover from IT disasters more quickly and understand why they happened — with the goal of preventing similar scenarios from happening again. Today, the fledgling startup announced a $1.5 million seed investment from Work-Bench, a New York City venture capital firm that invests in early-stage enterprise startups.

In addition to the funding, the company announced it was opening registration for its FireHydrant incident management platform. The product has been designed with Google’s Site Reliability Engineering (SRE) methodology in mind, but company co-founder and CEO Bobby Ross says the tool is designed to help anyone understand the cause of a disaster, regardless of what happened, and whether they practice SRE or not.

“I had been involved in several fire fighting scenarios — from production databases being dropped to Kubernetes upgrades gone wrong — and every incident had a common theme: absolute chaos,” Ross wrote in a blog post announcing the new product.

The product has two main purposes, according to Ross. It helps you figure out what’s happening as you attempt to recover from an ongoing disaster scenario, and once you’ve put out the fire, it lets you do a post-mortem to figure out exactly what happened with the hope of making sure that particular disaster doesn’t happen again.

As Ross describes it, a tool like PagerDuty can alert you that there’s a problem, but FireHydrant lets you figure out what specifically is going wrong and how to solve it. He says that the tool works by analyzing change logs, as a change is often the primary culprit of IT incidents. When you have an incident, FireHydrant will surface that suspected change, so you can check it first.

“We’ll say, hey, you had something change recently in this vicinity where you have an alert going off. There is a high likelihood that this change was actually causing your incident. And we actually bubble that up and mark it as a suspect,” Ross explained.

Screenshot: FireHydrant

Like so many startups, the company developed from a pain point the founders were feeling. The three founders were responsible for solving major outages at companies like Namely, DigitalOcean, CoreOS and Paperless Post.

But the actual idea for the company came about almost accidentally. In 2017, Ross was working on a series of videos and needed a way to explain what he was teaching. “I began writing every line of code with live commentary, and soon FireHydrant started to take the shape of what I envisioned as an SRE while at Namely, and I started to want it more than the video series. 40 hours of screencasts recorded later, I decided to stop recording and focus on the product…,” Ross wrote in the blog post.

Today it integrates with PagerDuty, GitHub and Slack, but the company is just getting started: the three founders, all engineers, are building the product alongside a handful of beta customers, and it plans to hire more engineers to keep expanding it. It’s early days, but if the tool works as described, it could go a long way toward solving the fire-fighting issues that every company faces at some point.

Jan 08, 2019

Amazon reportedly acquired Israeli disaster recovery service CloudEndure for around $200M

Amazon has reportedly acquired Israeli disaster recovery startup CloudEndure. Neither company has responded to our request for confirmation, but we have heard from multiple sources that the deal has happened. While some outlets have been reporting the deal was worth $250 million, we are hearing it’s closer to $200 million.

The company provides disaster recovery for cloud customers. You may be thinking that disaster recovery is precisely why we put our trust in cloud vendors: if something goes wrong, it’s the vendor’s problem. You would be right to make this assumption, but nothing is that simple. If you have a hybrid or multi-cloud scenario, you need a way to recover your data in the event of a disaster such as a weather event, a cyberattack or a political issue.

That’s where a company like CloudEndure comes into play. It can help you recover and get back up and running in another place, no matter where your data lives, by providing continuous backup and migration between clouds and private data centers. While CloudEndure currently works with AWS, Azure and Google Cloud Platform, it’s not clear whether Amazon would continue to support the other vendors.

The company was backed by Dell Technologies Capital, Infosys and Magma Venture Partners, among others. Ray Wang, founder and principal analyst at Constellation Research, says Infosys recently divested its stake, which might have precipitated the sale. “So much information is sitting in the cloud that you need backups and regions to make sure you have seamless recovery in the event of a disaster,” Wang told TechCrunch.

While he isn’t sure what Amazon will do with the company, he says the acquisition will test just how open Amazon is. “If you have multi-cloud and want your on-prem data backed up, or if you have backup on one cloud like AWS and want it on Google or Azure, you could do this today with CloudEndure,” he said. “That’s why I’m curious if they’ll keep supporting Azure or GCP,” he added.

CloudEndure was founded in 2012 and has raised just over $18 million. Its most recent investment came in 2016 when it raised $6 million, led by Infosys and Magma.

Dec 07, 2017

Heptio teams up with Microsoft to build a better Kubernetes disaster recovery solution

With the rise of Kubernetes as the de facto standard for container orchestration, it’s no surprise that there’s now a whole ecosystem of companies springing up around this open source project. Heptio is one of the most interesting ones, in no small part due to the fact that it was founded by Kubernetes co-founders Joe Beda and Craig McLuckie. Today, Heptio announced that it is…

Nov 28, 2017

VMware expands AWS partnership with new migration and disaster recovery tools

Remember how VMware was supposed to be disrupted by AWS? Somewhere along the way it made a smart move: instead of fighting the popular cloud platform, it decided to make it easier for IT to use its products on AWS. Today, at the opening of the AWS re:Invent customer conference, it announced plans to expand that partnership with some new migration and disaster recovery services. As Mark…

Oct 16, 2017

When Should I Enable MongoDB Sharding?

In this blog post, we will talk about MongoDB sharding and walk through the main reasons why you should start a cluster (independent of the approach you have chosen).

Note: I will cover this subject in my webinar How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

Sharding is the most complex architecture you can deploy using MongoDB, and there are two main schools of thought on when to shard. The first says to configure the cluster as soon as possible, when you predict high throughput and fast data growth.

The second says you should move to a cluster only when the application demands more resources than a replica set can offer (such as low memory, an overloaded disk or high processor load). This approach is more corrective than preventative, but we will discuss that in the future.
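
Whichever approach you follow, the mechanics of enabling sharding are the same. Here is a minimal mongo shell sketch, where the database, collection and shard key names are only illustrative:

// Run against a mongos router, not an individual mongod
sh.enableSharding("mydb")

// A hashed shard key spreads inserts evenly across shards
sh.shardCollection("mydb.orders", { customerId: "hashed" })

// Review how chunks are distributed across the shards
sh.status()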

1) Disaster recovery plan

Disaster recovery (DR) is a very delicate topic: how long can you tolerate an outage? If necessary, how long would it take you to restore the entire database? Depending on the database size and on disk speed, a backup/restore process might take hours or even days!

There is no hard number in gigabytes that justifies a cluster, but in general you should consider one once the database passes roughly 200GB, because at that size the backup and restore processes start to take a long time to finish.

Let’s consider a replica set with a 300GB database. The full restore process might last around four hours. If the same database is split across two shards, the shards can be restored in parallel and the process takes about two hours, and with more shards we can improve that time further. The math is simple: with two shards, each holds roughly half the data, so the restore takes about half the time of a single replica set.

2) Hardware limitations

Disk and memory are inexpensive nowadays. However, this stops being true when companies need to scale to very large numbers (terabytes of RAM, for example). Suppose your cloud provider can only offer up to 5,000 IOPS in the disk subsystem, but the application needs more than that to work correctly. To work around this performance limitation, it is better to start a cluster and divide the writes among instances: with two shards, the application has 10,000 IOPS available for reads and writes across the disk subsystems.

3) Storage engine limitations

There are a few storage engine limitations that can be a bottleneck in your use case. MMAPv1 has a per-collection lock, while WiredTiger uses tickets that limit the number of reads and writes happening concurrently. Although we can tweak the number of tickets available in WiredTiger, there is a practical limit: raising the ticket count can generate processor overload instead of increasing performance. If one of these situations becomes a bottleneck in your system, you can start a cluster. Once you shard the collection, you distribute the load/lock among the different instances.
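
As a minimal mongo shell sketch (the server parameter is real, but the value 256 is only an illustration; the WiredTiger default is 128 tickets each way):

// Inspect current ticket usage ("out" vs. "available")
db.serverStatus().wiredTiger.concurrentTransactions

// Raise the write-ticket ceiling; watch CPU afterwards, since more
// concurrency can add processor overhead instead of throughput
db.adminCommand({ setParameter: 1, wiredTigerConcurrentWriteTransactions: 256 })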

4) Hot data vs. cold data

Many databases actively use only a small percentage of the data they store. This frequently accessed portion is called hot data, or the working set. Cold data, or historical data, is rarely read, and demands considerable system resources when it is. So why spend money on expensive machines that only store cold or low-value data? With a cluster deployment we can choose where the cold data is stored, and use cheap devices and disks to do so. The same is true for hot data: we can use better machines to get better performance. This approach also speeds up reads and writes on the hot data, as the indexes are smaller and add less overhead to the system. (The zone sharding sketch in the next section is the mechanism that makes this placement possible.)

5) Geo-distributed data

It doesn’t matter whether this need comes from application design or legal compliance: if the data must stay within continent or country borders, a cluster helps make that happen. It is possible to limit data localization so that it is stored solely in a specific “part of the world.” The number of shards and their geographic positions are invisible to the application, which only sees the database. This is commonly used by worldwide companies for better performance, or simply to comply with local law.
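
The mechanism for both this and the hot/cold split above is zone (tag-aware) sharding. A minimal sketch, assuming MongoDB 3.4+, a shard key of { country: 1, userId: 1 } and hypothetical shard and zone names:

// Assign a shard running in an EU data center to the "EU" zone
sh.addShardToZone("shard0000", "EU")

// Route every document with country "DE" to shards in that zone
sh.updateZoneKeyRange(
    "mydb.users",
    { country: "DE", userId: MinKey },
    { country: "DE", userId: MaxKey },
    "EU"
)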

6) Infrastructure limitations

Infrastructure limitations are very similar to hardware limitations. When thinking about infrastructure, however, we focus on cases where the instances must be small, such as running MongoDB on Mesos. Some providers only offer a few cores and a limited amount of RAM, and even if you are willing to pay more, it is simply not possible to purchase more than they offer. A cluster lets you split a small amount of data among many shards, reaching the same performance a big and expensive machine would provide.

7) Failure isolation

Consider a replica set or a single instance that holds all the data: if for any reason it goes down, the whole application goes down. In a five-shard cluster, losing one shard still leaves 80% of the data available. Running several shards helps to isolate failures. Obviously, running many instances makes it more likely that some instance is down at any given moment, but since each shard must have at least three members, the probability of an entire shard being down is minimal. For providers that offer different zones, it is good practice to place the members of each shard in different availability zones (or even different regions).

8) Speed up queries

Queries can take a long time, depending on how much data they need to read. In a clustered deployment, queries can run in parallel across shards, which speeds up response time. If a query runs in ten seconds on a replica set, it is very likely that the same query will run in five to six seconds on a cluster with two shards, and so on.

I hope this helps you decide when to enable MongoDB sharding. A cluster solves several other problems as well; we have listed only a few of them. Don’t miss our webinar on scaling out MongoDB next Wednesday, October 18, 2017!

Jul 18, 2017

Backups and Disaster Recovery

In this post, we’ll look at strategies for backups and disaster recovery.

Note: I am giving a talk on Backups and Disaster Recovery Best Practices on July 27th.

When discussing disaster recovery, it’s important to take your business’ continuity plan into consideration. Backup and recovery processes are a critical part of any application infrastructure.

A well-tested backup and recovery system can be the difference between a minor outage and the end of your business.

You will want to take three things into consideration when planning your disaster recovery strategy: recovery time objective, recovery point objective and risk mitigation.

Recovery time objective (RTO) is how long it takes to restore your backups. Recovery point objective (RPO) is the point in time to which you can recover (in other words, how much data you can afford to lose). Finally, you need to understand what risks you are trying to mitigate. Risks to your data include (but are not limited to) bad actors, data corruption, user error, host failure and data center failure.

Recommended Backup Strategies

We recommend that you use both physical (Percona XtraBackup, RDS/LVM Snapshots, MySQL Enterprise Backup) and logical backups (mysqldump, mydumper, mysqlpump). Logical backups protect against the loss of single data points, while physical backups protect against total data loss or host failure.

The best practice is running Percona XtraBackup nightly, followed by mysqldump (or in 5.7+, mysqlpump). Percona XtraBackup enables you to quickly restore a server, and mysqldump enables you to quickly restore data points. These address recovery time objectives.
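
As a rough sketch of such a nightly job (paths are placeholders, credentials are assumed to live in ~/.my.cnf, and the flags shown are common choices rather than the only valid set):

# Physical backup with Percona XtraBackup (fast full-server restore, good RTO)
xtrabackup --backup --target-dir=/backups/physical/$(date +%F)

# Logical backup with mysqldump (restores individual schemas, tables or rows)
mysqldump --single-transaction --routines --triggers --all-databases \
    > /backups/logical/$(date +%F).sql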

For point-in-time recovery, it is recommended that you download binlogs on a regular basis (once an hour, for example).

Another option is binlog streaming. You can find more information on binlog streaming in our blog: Backing up binary log files with mysqlbinlog.
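
The streaming approach from that post looks roughly like this; the host, user and starting binlog name are placeholders:

# Connect as a replication client and mirror the server's binary logs to the
# local directory as they are written; --stop-never keeps the stream open
mysqlbinlog --raw --read-from-remote-server --stop-never \
    --host=db1.example.com --user=repl binlog.000001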

There is also a whitepaper that is the basis of my webinar here: MySQL Backup and Recovery Best Practices.

Delayed Slave

One way to save on operational overhead is to create a 24-hour delayed slave. This takes the place of the logical backup (mysqldump) as well as the binlog streaming. Make sure you stop the delayed slave immediately after any incident, so that the bad change does not replicate to the backup copy as well.

A delayed slave is created in 5.6 and above with:

-- N is the delay in seconds; 86400 gives a 24-hour delayed slave
CHANGE MASTER TO MASTER_DELAY = N;

After a disaster, you would issue:

-- Freeze the delayed slave before the bad event replays on it
STOP SLAVE;

Then, in order to get a point-in-time, you can use:

-- Replay events up to just before the disaster; take the file name and
-- position from the master's binary logs
START SLAVE UNTIL MASTER_LOG_FILE = 'log_name', MASTER_LOG_POS = log_pos;
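
To find those coordinates, you can inspect the binary log around the time of the incident; the datetime and file name here are placeholders:

# Decode row events in human-readable form and scan for the bad statement;
# the position printed just before it is what START SLAVE UNTIL needs
mysqlbinlog --base64-output=decode-rows --verbose \
    --start-datetime="2017-07-18 09:00:00" mysql-bin.000042 | less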

Restore

It is a good idea to test your backups at least once a quarter. Backups do not exist unless you know you can restore them. There have been recent high-profile cases where developers dropped tables or schemas, or data was corrupted in production, and in one case none of the five backup mechanisms in place proved usable for a restore.

The best case scenario is an automated restore test that runs after your backup, and gives you information on how long it takes to restore (RTO) and how much data you can restore (RPO).

For more details on backups and disaster recovery, come to my webinar.

Apr 25, 2017

Backup service Rubrik now works natively in AWS and Azure

Rubrik, the startup that provides data management services like backup and recovery to large enterprises, is in the process of raising between $150 million and $200 million on a valuation of $1 billion, as we reported yesterday. And as a measure of how it’s growing, today it’s announcing an expansion of its product set, specifically in cloud services. Now Rubrik — which…

Apr 24, 2017

Data management startup Rubrik is raising up to $200M on a $1B valuation

Make way for another juggernaut amongst enterprise startups: Rubrik, a data backup company that only emerged from stealth in 2015, is in the process of raising between $150 million and $200 million on a valuation of $1 billion as the company enters a period of strong demand for its storage and data management products, according to sources. TechCrunch first learned of the new fundraise via…

Dec 09, 2015

CloudEndure Disaster Recovery Service Secures $7 Million Investment

Disasters can take many forms, from weather events to database corruption. CloudEndure, a cloud-based disaster recovery service, announced a $7 million investment today led by Indian consulting firm Infosys and previous investor Magma Venture Partners. Today’s investment brings the total to just over $12 million. At first blush, Infosys may seem like an odd partner, a traditional…

Dec 11, 2014

Datto Snags Cloud Service Backupify

Datto, a backup and disaster recovery service from Norwalk, Connecticut, purchased cloud-to-cloud backup service Backupify today. It was a move designed to give Datto a more complete product portfolio. Terms were not disclosed, but Backupify co-founder and CEO Rob May told TechCrunch that his investors were more than satisfied. “It was a lot more than we raised,” he said.…
