Jan
19
2021
--

StackPulse announces $28M investment to help developers manage outages

When a system outage happens, chaos can ensue as the team tries to figure out what’s happening and how to fix it. StackPulse, a new startup that wants to help developers manage these crisis situations more efficiently, emerged from stealth today with a $28 million investment.

The round actually breaks down to a previously unannounced $8 million seed investment and a new $20 million Series A. GGV led the A round, while Bessemer Venture Partners led the seed and also participated in the A. Glenn Solomon at GGV and Amit Karp at Bessemer will join the StackPulse board.

Nobody is immune to these outages. We’ve seen incidents from companies as varied as Amazon and Slack in recent months. The biggest companies like Google, Facebook and Amazon employ site reliability engineers and build customized platforms to help remediate these kinds of situations. StackPulse hopes to put this kind of capability within reach of companies whose only defense is their on-call developers.

Company co-founder and CEO Ofer Smadari says that in the midst of a crisis with signals coming at you from Slack and PagerDuty and other sources, it’s hard to figure out what’s happening. StackPulse is designed to help sort out the details to get you back to equilibrium as quickly as possible.

First off, it helps identify the severity of the incident. Is it a false alarm or something that requires your team’s immediate attention or something that can be put off for a later maintenance cycle? If there is something going wrong that needs to be fixed right now, StackPulse can not only identify the source of the problem, but also help fix it automatically, Smadari explained.

After the incident has been resolved, it can also help with a post-mortem to figure out what exactly went wrong by pulling in all of the alert communications and incident data into the platform.

As the company emerges from stealth, it has some early customers, and 35 employees based in Portland, Oregon and Tel Aviv. Smadari says that he hopes to have 100 employees by the end of this year. As he builds the organization, he is thinking about how to build a diverse team for a diverse customer base. He believes that people with diverse backgrounds build a better product. He adds that diversity is a top-level goal for the company, which already has an HR leader in place to help.

Glenn Solomon from GGV, who will be joining the company board, saw a strong founding team solving a big problem for companies and wanted to invest. “When they described the vision for the product they wanted to build, it made sense to us,” he said.

Customers are impatient with downtime, and Solomon sees developers on the front line trying to solve these issues. “Performance is more important than ever. When there is downtime, it’s damaging to companies,” he said. He believes StackPulse can help.

Sep
02
2020
--

Transposit scores $35M to build data-driven runbooks for faster disaster recovery

Transposit is a company built by engineers to help engineers, and one big way to help them is to get systems up and running faster when things go wrong — as they always will at some point. Transposit has come up with a way to build runbooks for faster disaster recovery, while using data to update them in an automated fashion.

Today, the company announced a $35 million Series B investment led by Altimeter Capital, with participation from existing investors Sutter Hill Ventures, SignalFire and Unusual Ventures. Today’s investment brings the total raised to $50.4 million, according to the company.

Company CEO Divanny Lamas and CTO and founder Tina Huang see technology issues as less an engineering problem and more as a human problem, because it’s humans who have to clean up the messes when things go wrong. Huang says forgetting the human side of things is where she thinks technology has gone astray.

“We know that the real superpower of the product is that we focus on the human and the user side of things. And as a result, we’re building an engineering culture that I think is somewhat differentiated,” Huang told TechCrunch.

Transposit is a platform that at its core helps manage APIs, connections to other programs, so it starts with a basic understanding of how various underlying technologies work together inside a company. This is essential for a tool that is trying to help engineers in a moment of panic figure out how to get back to a working state.

When it comes to disaster recovery, there are essentially two pieces: getting the systems working again, then figuring out what happened. For the first piece, the company is building data-driven runbooks. By being data-driven, they aren’t static documents. Instead, the underlying machine learning algorithms can look at how the engineers recovered and adjust accordingly.

Transposit disaster recovery dashboard

Image Credits: Transposit

“We realized that no one was focusing on what we realize is the root problem here, which is how do I have access to the right set of data to make it easier to reconstruct that timeline, and understand what happened? We took those two pieces together, this notion that runbooks are a critical piece of how you spread knowledge and spread process, and this other piece, which is the data, is critical,” Huang said.

Today the company has 26 employees, including Huang and Lamas, whom Huang brought on board from Splunk last year to be CEO. The company is somewhat unique in having two women running the organization, and they are trying to build a diverse workforce as they build their company to 50 people in the next 12 months.

The current make-up is 47% female engineers, and the goal is to remain diverse as they build the company, something that Lamas admits is challenging to do. “I wish I had a magic answer, or that Tina had a magic answer. The reality is that we’re just very demanding on recruiters. And we are very insistent that we have a diverse pipeline of candidates, and are constantly looking at our numbers and looking at how we’re doing,” Lamas said.

She says being diverse actually makes it easier to recruit good candidates. “People want to work at diverse companies. And so it gives us a real edge from a kind of culture perspective, and we find that we get really amazing candidates that are just tired of the status quo. They’re tired of the old way of doing things and they want to work in a company that reflects the world that they want to live in,” she said.

The company, which launched in 2016, took a few years to build the first piece, the underlying API platform. This year it added the disaster recovery piece on top of that platform, and has been running its beta since the beginning of the summer. They hope to add additional beta customers before making it generally available later this year.

Jul
07
2020
--

OwnBackup lands $50M as backup for Salesforce ecosystem thrives

OwnBackup has made a name for itself primarily as a backup and disaster recovery system for the Salesforce ecosystem, and today the company announced a $50 million investment.

Insight Partners led the round, with participation from Salesforce Ventures and Vertex Ventures. This chunk of money comes on top of a $23 million round from a year ago, and brings the total raised to more than $100 million, according to the company.

It shouldn’t come as a surprise that Salesforce Ventures chipped in when the majority of the company’s backup and recovery business involves the Salesforce ecosystem, although the company will be looking to expand beyond that with the new money.

“We’ve seen such growth over the last two and a half years around the Salesforce ecosystem, and the other ISV partners like Veeva and nCino that we’ve remained focused within the Salesforce space. But with this funding, we will expand over the next 12 months into a few new ecosystems,” company CEO Sam Gutmann told TechCrunch.

In spite of the pandemic, the company continues to grow, adding 250 new customers last quarter, bringing it to over 2,000 customers and 250 employees, according to Gutmann.

He says that raising the round, which closed at the beginning of May, had some hairy moments as the pandemic began to take hold across the world and worsen in the U.S. For a time, he began talking to new investors in case his existing ones got cold feet. As it turned out, when the quarterly numbers came in strong, the existing ones came back and the round was oversubscribed, Gutmann said.

“Q2 frankly was a record quarter for us, adding over 250 new accounts, and we’re seeing companies start to really understand how critical this is,” he said.

The company plans to continue hiring through the pandemic, although he says it might not be quite as aggressively as they once thought. Like many companies, even though they plan to hire, they are continually assessing the market. At this point, he foresees growing the workforce by about another 50 people this year, but that’s about as far as he can look ahead right now.

Gutmann says he is working with his management team to make sure he has a diverse workforce right up to the executive level, but he says it’s challenging. “I think our lower ranks are actually quite diverse, but as you get up into the leadership team, you can see on the website unfortunately we’re not there yet,” he said.

They are instructing their recruiting teams to look for diverse candidates whether by gender or ethnicity, and employees have formed a diversity and inclusion task force with internal training, particularly for managers around interviewing techniques.

He says going remote has been difficult, and he misses seeing his employees in the office. He hopes to have at least some come back before the end of the summer and slowly add more as we get into the fall, but that will depend on how things go.

May
20
2020
--

FireHydrant lands $8M Series A for disaster management tool

When I spoke to Robert Ross, CEO and co-founder at FireHydrant, we had a technology adventure. First the audio wasn’t working correctly on Zoom, then Google Meet. Finally we used cell phones to complete the interview. It was like a case study in what FireHydrant is designed to do — help companies manage incidents and recover more quickly when things go wrong with their services.

Today the company announced an $8 million Series A from Menlo Ventures and Work-Bench. That brings the total raised to $9.5 million, including the $1.5 million seed round we reported on last April.

In the middle of a pandemic with certain services under unheard-of pressure, understanding what to do when your systems crash has become increasingly important. FireHydrant has literally developed a playbook to help companies recover faster.

These runbooks are digital documents that are unique to each company and include what to do to help manage the recovery process. Some of that is administrative. For example, certain people have to be notified by email, a Jira ticket has to be generated and a Slack channel opened to provide a communications conduit for the team.

While Ross says you can’t define the exact recovery process itself because each incident tends to be unique, you can set up an organized response to an incident and that can help you get to work on the recovery much more quickly. That ability to manage an incident can be a difference maker when it comes to getting your system back to a steady state.
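
To make the idea concrete, here is a minimal sketch of a runbook expressed as data: an ordered list of administrative steps that run when an incident is declared. It is not FireHydrant's implementation. The `Step` type, the `run_runbook` function and the three actions are hypothetical stand-ins that only describe what a real email, ticketing or chat integration would do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[str], str]  # takes the incident title, returns a log line


# Hypothetical actions: each returns a description instead of calling a real service.
def notify_by_email(incident: str) -> str:
    return f"emailed on-call list about '{incident}'"


def open_ticket(incident: str) -> str:
    return f"created tracking ticket for '{incident}'"


def open_chat_channel(incident: str) -> str:
    return f"opened chat channel for '{incident}'"


RUNBOOK: List[Step] = [
    Step("notify", notify_by_email),
    Step("ticket", open_ticket),
    Step("channel", open_chat_channel),
]


def run_runbook(runbook: List[Step], incident: str) -> List[str]:
    """Run every step in order and return a timeline of what was done."""
    return [f"{step.name}: {step.action(incident)}" for step in runbook]


timeline = run_runbook(RUNBOOK, "checkout API returning 500s")
for entry in timeline:
    print(entry)
```

Because the steps are plain data, the same list that drives the response also yields a timeline that can feed a later post-mortem.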

Ross is a former site reliability engineer (SRE) himself. He has experienced the kinds of problems his company is trying to solve, and that background was something that attracted investor Matt Murphy from Menlo Ventures.

“I love his authentic perspective, as a former SRE, on the problem and how to create something that would make the SRE function and processes better for all. That value prop really resonated with us in a time when the shift to online is accelerating and remote coordination between people tasked with identifying and fixing problems is at all time high in terms of its importance. Ultimately we’re headed toward more and more automation in problem resolution and FH helps pave the way,” Murphy told TechCrunch.

It’s not easy being an early-stage company in the current climate, but Ross believes his company has created something that will resonate, perhaps even more right now. As he says, every company has incidents, and how you react can define you as a company. Having tooling to help you manage that process helps give you structure at a time you need it most.

Oct
31
2019
--

How you react when your systems fail may define your business

Just around 9:45 a.m. Pacific Time on February 28, 2017, websites like Slack, Business Insider, Quora and other well-known destinations became inaccessible. For millions of people, the internet itself seemed broken.

It turned out that Amazon Web Services was having a massive outage involving S3 storage in its Northern Virginia datacenter, a problem that created a cascading impact and culminated in an outage that lasted four agonizing hours.

Amazon eventually figured it out, but you can only imagine how stressful it might have been for the technical teams who spent hours tracking down the cause of the outage so they could restore service. A few days later, the company issued a public post-mortem explaining what went wrong and which steps they had taken to make sure that particular problem didn’t happen again. Most companies try to anticipate these types of situations and take steps to keep them from ever happening. In fact, Netflix came up with the notion of chaos engineering, where systems are tested for weaknesses before they turn into outages.

Unfortunately, no tool can anticipate every outcome.

It’s highly likely that your company will encounter a problem of immense proportions like the one that Amazon faced in 2017. It’s what every startup founder and Fortune 500 CEO worries about — or at least they should. What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you learn.

We spoke to a group of highly trained disaster experts to learn more about preventing these types of moments from having a profoundly negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running.

Tammy Butow, principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy. If the site is up and running, that’s generally the key to happiness. “SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says.

Companies measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number. According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and your business’s bottom line.
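
For a sense of scale, the short sketch below converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes of downtime per year, which helps explain why Jones argues the number alone says little about how customers experienced an outage.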

Someone needs to be calm and just keep asking the right questions.

“It’s money at the end of the day, but also over time, user sentiment can change [if your site is having issues],” she says. “How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their family members. The nines don’t capture any of that.”

Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to rethink the idea of the nines. “Maybe we need to change that term. Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our products.”

When things go wrong

Companies go to great lengths to prevent disasters to avoid disappointing their customers and usually have contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control. When that happens, SREs need to execute, which takes planning, too; knowing what to do when the going gets tough.

Apr
02
2019
--

FireHydrant lands $1.5M seed investment to bring order to IT disaster recovery

FireHydrant, an NYC startup, wants to help companies recover from IT disasters more quickly, and understand why they happened — with the goal of preventing similar future scenarios from happening again. Today, the fledgling startup announced a $1.5 million seed investment from Work-Bench, a New York City venture capital firm that invests in early-stage enterprise startups.

In addition to the funding, the company announced it was opening registration for its FireHydrant incident management platform. The product has been designed with Google’s Site Reliability Engineering (SRE) methodology in mind, but company co-founder and CEO Bobby Ross says the tool is designed to help anyone understand the cause of a disaster, regardless of what happened, and whether they practice SRE or not.

“I had been involved in several fire fighting scenarios, from production databases being dropped to Kubernetes upgrades gone wrong, and every incident had a common theme: absolute chaos,” Ross wrote in a blog post announcing the new product.

The product has two main purposes, according to Ross. It helps you figure out what’s happening as you attempt to recover from an ongoing disaster scenario, and once you’ve put out the fire, it lets you do a post-mortem to figure out exactly what happened with the hope of making sure that particular disaster doesn’t happen again.

As Ross describes it, a tool like PagerDuty can alert you that there’s a problem, but FireHydrant lets you figure out what specifically is going wrong and how to solve it. He says that the tool works by analyzing change logs, as a change is often the primary culprit of IT incidents. When you have an incident, FireHydrant will surface that suspected change, so you can check it first.

“We’ll say, hey, you had something change recently in this vicinity where you have an alert going off. There is a high likelihood that this change was actually causing your incident. And we actually bubble that up and mark it as a suspect,” Ross explained.
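
The idea Ross describes can be sketched as a simple filter: given a change log and an alert, flag changes to the same service that landed shortly before the alert fired. This is an illustration only, not FireHydrant's actual logic. The `Change` type, the `suspect_changes` function and the one-hour window are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    service: str
    description: str
    deployed_at: datetime


def suspect_changes(changes: List[Change], alert_service: str, alert_time: datetime,
                    window: timedelta = timedelta(hours=1)) -> List[Change]:
    """Return changes to the alerting service made within `window` before the alert, newest first."""
    recent = [
        c for c in changes
        if c.service == alert_service
        and timedelta(0) <= alert_time - c.deployed_at <= window
    ]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)


alert_time = datetime(2019, 4, 2, 12, 0)
changes = [
    Change("payments", "bump database driver", datetime(2019, 4, 2, 11, 40)),
    Change("payments", "update README", datetime(2019, 4, 1, 9, 0)),
    Change("search", "reindex job config", datetime(2019, 4, 2, 11, 55)),
]
suspects = suspect_changes(changes, "payments", alert_time)
for c in suspects:
    print(f"suspect: {c.description} ({c.deployed_at:%H:%M})")
```

Here only the driver bump is flagged: the README change is too old and the reindex job touched a different service.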

Screenshot: FireHydrant

Like so many startups, the company developed from a pain point the founders were feeling. The three founders were responsible for solving major outages at companies like Namely, DigitalOcean, CoreOS and Paperless Post.

But the actual idea for the company came about almost accidentally. In 2017, Ross was working on a series of videos and needed a way to explain what he was teaching. “I began writing every line of code with live commentary, and soon FireHydrant started to take the shape of what I envisioned as an SRE while at Namely, and I started to want it more than the video series. 40 hours of screencasts recorded later, I decided to stop recording and focus on the product…,” Ross wrote in the blog post.

Today it integrates with PagerDuty, GitHub and Slack, but the company is just getting started with the three founders, all engineers, working on the product and a handful of beta customers. It is planning to hire more engineers to keep building out the product. It’s early days, but if this tool works as described, it could go a long way toward solving the fire-fighting issues that every company faces at some point.

Jan
08
2019
--

Amazon reportedly acquired Israeli disaster recovery service CloudEndure for around $200M

Amazon has reportedly acquired Israeli disaster recovery startup CloudEndure. Neither company has responded to our request for confirmation, but we have heard from multiple sources that the deal has happened. While some outlets have been reporting the deal was worth $250 million, we are hearing it’s closer to $200 million.

The company provides disaster recovery for cloud customers. You may be thinking that disaster recovery is precisely why we put our trust in cloud vendors. If something goes wrong, it’s the vendor’s problem, and you would be right to make this assumption, but nothing is simple. If you have a hybrid or multi-cloud scenario, you need to have ways to recover your data in the event of a disaster such as severe weather, a cyberattack or a political issue.

That’s where a company like CloudEndure comes into play. It can help you recover and get back up and running in another place, no matter where your data lives, by providing continuous backup and migration between clouds and private data centers. While CloudEndure currently works with AWS, Azure and Google Cloud Platform, it’s not clear if Amazon would continue to support these other vendors.

The company was backed by Dell Technologies Capital, Infosys and Magma Venture Partners, among others. Ray Wang, founder and principal analyst at Constellation Research, says Infosys recently divested its part of the deal and that might have precipitated the sale. “So much information is sitting in the cloud that you need backups and regions to make sure you have seamless recovery in the event of a disaster,” Wang told TechCrunch.

While he isn’t clear what Amazon will do with the company, he says it will test just how open it is. “If you have multi-cloud and want your on-prem data backed up, or if you have backup on one cloud like AWS and want it on Google or Azure, you could do this today with CloudEndure,” he said. “That’s why I’m curious if they’ll keep supporting Azure or GCP,” he added.

CloudEndure was founded in 2012 and has raised just over $18 million. Its most recent investment came in 2016 when it raised $6 million, led by Infosys and Magma.

Dec
07
2017
--

Heptio teams up with Microsoft to build a better Kubernetes disaster recovery solution

With the rise of Kubernetes as the de facto standard for container orchestration, it’s no surprise that there’s now a whole ecosystem of companies springing up around this open source project. Heptio is one of the most interesting ones, in no small part due to the fact that it was founded by Kubernetes co-founders Joe Beda and Craig McLuckie. Today, Heptio announced that it is…

Nov
28
2017
--

VMware expands AWS partnership with new migration and disaster recovery tools

Remember how VMware was supposed to be disrupted by AWS? Somewhere along the way it made a smart move. Instead of fighting the popular cloud platform, it decided to make it easier for IT to use its products on AWS. Today, at the opening of the AWS re:Invent customer conference, it announced plans to expand that partnership with some new migration and disaster recovery services. As Mark…

Oct
16
2017
--

When Should I Enable MongoDB Sharding?

In this blog post, we will talk about MongoDB sharding and walk through the main reasons why you should start a cluster (independent of the approach you have chosen).

Note: I will cover this subject in my webinar How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

Sharding is the most complex architecture you can deploy using MongoDB, and there are two main approaches to deciding when to shard. The first is to configure the cluster as soon as possible – when you predict high throughput and fast data growth.

The second says you should use a cluster as the best alternative when the application demands more resources than the replica set can offer (such as low memory, an overloaded disk or high processor load). This approach is more corrective than preventative, but we will discuss that in the future.

1) Disaster recovery plan

Disaster recovery (DR) is a very delicate topic: how long would you tolerate an outage? If necessary, how long would it take you to restore the entire database? Depending on the database size and on disk speed, a backup/restore process might take hours or even days!
There is no hard number in gigabytes to justify a cluster. But in general, you should consider sharding when the database is larger than 200GB, because the backup and restore processes might take a while to finish.
Let’s consider the case where we have a replica set with a 300GB database. The full restore process might last around four hours, whereas if the database has two shards, it will take about two hours – and depending on the number of shards we can improve that time. Simple math: if there are two shards, the restore process takes half of the time to restore when compared to a single replica set.
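
The arithmetic above can be written down as a small estimate. This is a rough model, assuming the data is spread evenly across shards, all shards are restored in parallel, and each shard restores at the same rate (75GB per hour here, the rate implied by 300GB in four hours). The function name is made up for illustration.

```python
def restore_hours(db_size_gb: float, throughput_gb_per_hour: float, shards: int = 1) -> float:
    """Estimate the wall-clock restore time when shards are restored in parallel."""
    per_shard_gb = db_size_gb / shards  # assumes an even data distribution
    return per_shard_gb / throughput_gb_per_hour


for shards in (1, 2, 4):
    print(f"{shards} shard(s): {restore_hours(300, 75, shards):.1f} hours")
```

In practice the gain is smaller if the data is unevenly distributed or if the restores compete for the same network or backup storage.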

2) Hardware limitations

Disk and memory are inexpensive nowadays. However, this is not true when companies need to scale out to high numbers (such as TB of RAM). Suppose your cloud provider can only offer you up to 5,000 IOPS in the disk subsystem, but the application needs more than that to work correctly. To work around this performance limitation, it is better to start a cluster and divide the writes among instances. With two shards, for example, the application will have 10,000 IOPS available for writes and reads in the disk subsystem.
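
As a back-of-the-envelope check, the number of shards needed for a given I/O requirement can be estimated as follows. The sketch assumes the load is spread evenly across shards, which depends on choosing a good shard key. The function name and the default cap are illustrative.

```python
import math


def shards_needed(required_iops: int, iops_per_shard: int = 5000) -> int:
    """Return the minimum number of shards whose combined IOPS meets the requirement."""
    return math.ceil(required_iops / iops_per_shard)


for required in (4000, 10000, 12000):
    print(f"{required} IOPS -> {shards_needed(required)} shard(s)")
```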

3) Storage engine limitations

There are a few storage engine limitations that can be a bottleneck in your use case. MMAPv1 has a lock per collection, while WiredTiger has tickets that limit the number of writes and reads happening concurrently. Although we can tweak the number of tickets available in WiredTiger, there is a practical limit, which means that raising the number of available tickets might generate processor overload instead of increasing performance. If one of these situations becomes a bottleneck in your system, you should start a cluster. Once you shard the collection, you distribute the load/lock among the different instances.

4) Hot data vs. cold data

Several databases only work with a small percentage of the data being stored. This is called hot data or working set. Cold data or historical data is rarely read, and demands considerable system resources when it is. So why spend money on expensive machines that only store cold data or low-value data? With a cluster deployment we can choose where the cold data is stored, and use cheap devices and disks to do so. The same is true for hot data – we can use better machines to have better performance. This methodology also speeds up writes and reads on the hot data, as the indexes are smaller and add less overhead to the system.
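
MongoDB supports this kind of placement through zones (tag-aware sharding). The sketch below only models the routing decision in plain Python, assuming documents are classified by age. The 90-day threshold, the `zone_for` function and the hardware descriptions are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical hardware tiers; in a real cluster these would be groups of shards.
ZONE_HARDWARE = {"hot": "fast SSD, large RAM", "cold": "cheap HDD, small RAM"}


def zone_for(created_on: date, today: date, hot_days: int = 90) -> str:
    """Classify a document as 'hot' if it is at most `hot_days` old, else 'cold'."""
    return "hot" if today - created_on <= timedelta(days=hot_days) else "cold"


today = date(2017, 10, 16)
for created in (date(2017, 10, 1), date(2015, 3, 12)):
    zone = zone_for(created, today)
    print(f"{created} -> {zone} ({ZONE_HARDWARE[zone]})")
```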

5) Geo-distributed data

It doesn’t matter whether this need comes from application design or legal compliance. If the data must stay within continent or country borders, a cluster helps make that happen. It is possible to pin data so that it is stored solely in a specific “part of the world.” The number of shards and their geographic locations do not matter to the application, as it sees only a single database. This is commonly used in worldwide companies for better performance, or simply to comply with the local law.

6) Infrastructure limitations

Infrastructure and hardware limitations are very similar. When thinking about infrastructure, however, we focus on specific cases when the instances should be small. An example is running MongoDB on Mesos. Some providers only offer a few cores and a limited amount of RAM. Even if you are willing to pay more for that, it is not possible to purchase more than they offer as their products. A cluster provides the option to split a small amount of data among a lot of shards, reaching the same performance a big and expensive machine provides.

7) Failure isolation

Consider that a replica set or a single instance holds all the data. If for any reason this instance/replica set goes down, the whole application goes down. In a cluster with five shards, if we lose one of them, 80% of the data is still available. Running a few shards helps to isolate failures. Obviously, running many instances makes it more likely that some instance in the cluster will fail, but as each shard should have at least three instances the probability of the entire shard being down is minimal. For providers that offer different zones, it is good practice to have different members of the shard in different availability zones (or even different regions).
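
Both claims can be checked with a few lines of arithmetic. The sketch assumes data is spread evenly across shards and that members of a shard fail independently, which is the reason for placing them in different availability zones. The 1% failure probability is an arbitrary example value, and the function names are made up.

```python
def data_available_fraction(total_shards: int, failed_shards: int) -> float:
    """Fraction of the data still reachable when some shards are down (even distribution)."""
    return (total_shards - failed_shards) / total_shards


def shard_down_probability(p_member_down: float, members: int = 3) -> float:
    """Probability that every member of a shard is down at once (independent failures)."""
    return p_member_down ** members


print(f"available data: {data_available_fraction(5, 1):.0%}")
print(f"whole shard down: {shard_down_probability(0.01):.6%}")
```

This counts a shard as down only when every member is lost. A shard that loses a majority of its members can no longer elect a primary and stops accepting writes, so the write-availability figure is somewhat less favorable.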

8) Speed up queries

Queries can take too long, depending on the number of reads they perform. In a clustered deployment, queries can run in parallel and speed up the query response time. If a query runs in ten seconds in a replica set, it is very likely that the same query will run in five to six seconds if the cluster has two shards, and so on.
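
The mechanism behind this speed-up is scatter-gather: a router (mongos in MongoDB) sends the query to each shard, the shards work in parallel on their own slice of the data, and the partial results are merged. The sketch below imitates that with in-memory lists and a thread pool. The data, the `qty` field and the function names are invented for illustration, and no real database is involved.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the documents held by one shard.
SHARDS = [
    [{"_id": 1, "qty": 5}, {"_id": 4, "qty": 50}],
    [{"_id": 2, "qty": 30}, {"_id": 3, "qty": 7}],
]


def query_shard(docs, min_qty):
    """Run the filter against a single shard's documents."""
    return [d for d in docs if d["qty"] >= min_qty]


def scatter_gather(shards, min_qty):
    """Query all shards in parallel, then merge and sort the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda docs: query_shard(docs, min_qty), shards)
        merged = [d for part in partials for d in part]
    return sorted(merged, key=lambda d: d["_id"])


results = scatter_gather(SHARDS, min_qty=30)
print([d["_id"] for d in results])
```

The merge step is extra work that a single replica set does not have, which is one reason a two-shard cluster tends to land at five to six seconds rather than exactly half of ten.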

I hope this helps with MongoDB sharding. Having a cluster solves several other problems as well, and we have listed only a few of them. Don’t miss our webinar regarding scaling out MongoDB next Wednesday, October 18, 2017!
