Jun
02
2021
--

With buyout, Cloudera hunts for relevance in a changing market

When Cloudera announced its sale to a pair of private equity firms yesterday for $5.3 billion, along with a couple of acquisitions of its own, the company detailed a new path that could help it drive back towards relevance in the big data market.

When the company launched in 2008, Hadoop was in its early days. The open source project developed at Yahoo three years earlier was built to deal with the large amounts of data that the Internet pioneer generated. It became increasingly clear over time that every company would have to deal with growing data stores, and it seemed that Cloudera was in the right market at the right time.

And for a while things went well. Cloudera rode the Hadoop startup wave, garnering a cool billion in funding along the way, including a stunning $740 million check from Intel Capital in 2014. It then went public in 2017 to much fanfare.

But the markets had already started to shift by the time of its public debut. Hadoop, a highly labor-intensive way to manage data, was being supplanted by cheaper and less complex cloud-based solutions.

“The excitement around the original promise of the Hadoop market has contracted significantly. It’s incredibly expensive and complex to get it working effectively in an enterprise context,” Casey Aylward, an investor at Costanoa Ventures, told TechCrunch.

The company likely saw that writing on the wall when it merged with another Hadoop-based company, Hortonworks, in 2019. That transaction valued the combined entity at $5.2 billion, almost the same amount it sold for yesterday, two years down the road. The decision to sell and go private may also have been spurred by Carl Icahn buying an 18% stake in the company that same year.

Looking to the future, Cloudera’s sale could give the enterprise unicorn some breathing room as it regroups.

Patrick Moorhead, founder and principal analyst at Moor Insights & Strategy, sees the deal as a positive step for the company. “I think this is good news for Cloudera because it now has the capital and flexibility to dive head first into SaaS. The company invented the entire concept of a data life cycle, implemented initially on premises, then extended to private and public clouds,” Moorhead said.

Adam Ronthal, a Gartner research VP, agrees that the deal at least gives Cloudera more room to make necessary adjustments to its market strategy, as long as it doesn’t get stifled by its private equity overlords. “It should give Cloudera an opportunity to focus on their future direction with increased flexibility — provided they are able to invest in that future and that this does not just focus on cost cutting and maximizing profits. Maintaining a culture of innovation will be key,” Ronthal said.

Which brings us to the two purchases Cloudera also announced as part of its news package.

If you want to change direction in a hurry, there are worse ways than via acquisitions. And grabbing Datacoral and Cazena should help Cloudera alter its course more quickly than it could have managed on its own.

“[The] two acquisitions will help Cloudera capture some of the value on top of the lake storage layer — perhaps moving into different data management features and/or expanding into the compute layer for analytics and AI/ML use cases, where there has been a lot of growth and excitement in recent years,” Aylward said.

Chandana Gopal, research director for the future of intelligence at IDC, agrees that the transactions give Cloudera some more modern options that could help speed up the data wrangling process. “Both the acquisitions are geared towards making the management of cloud infrastructure easier for end-users. Our research shows that data prep and integration takes 70%-80% of an analyst’s time versus the time spent in actual analysis. It seems like both these companies’ products will provide technology to improve the data integration/preparation experience,” she said.

The company couldn’t stay on the path it was on forever, certainly not with an activist investor breathing down its neck. Its recent moves could give it the time away from public markets it needs to regroup. How successful Cloudera’s turnaround proves to be will depend on whether the private equity firms buying it can agree on a direction and strategy for the company while providing the resources needed to push it in that new direction. All of that and more will determine if these moves pay off in the end.

Jun
01
2021
--

Cloudera to go private as KKR & CD&R grab it for $5.3B

Cloudera was once one of the hottest Hadoop startups, but over time the shine has come off that market, and today it agreed to go private as KKR and Clayton, Dubilier & Rice, a pair of private equity firms, announced their intent to purchase Cloudera for $5.3 billion. The company has a market cap of around $3.7 billion.

Cloudera and Hortonworks, two key startups in the Hadoop space, merged in 2018 for $5.2 billion. Cloudera was likely under pressure from activist investor Carl Icahn, who took an 18% stake in the company in 2019 and now stands to gain from the sale, which the company stated represented a 24% premium for shareholders at $16 a share. Prior to the market opening this morning, the stock was sitting at $12.86.

Back in the day, about a decade ago, when Hadoop was the way to process big data, venture money was pouring into the space. Over time it lost some of its glow. That’s because it was highly labor intensive, and companies began moving to the cloud and looking at software services that did more of the work for them. More modern technologies like data lakes began replacing it, and the company recognized that it had to change its approach to survive in the modern data processing marketplace.

Cloudera CEO Rob Bearden sees the transaction as a way to do just that. “We believe that as a private company with the expertise and support of experienced investors such as CD&R and KKR, Cloudera will have the resources and flexibility to drive product-led growth and expand our addressable market opportunity,” Bearden said in a statement.

While there is a lot of executive jargon in that statement, it basically means that the company hopes that these private equity firms can give it some additional financial resources to move toward a more modern approach for processing large amounts of data.

While it was at it, Cloudera also announced a couple of acquisitions of its own to help it move toward that modernization goal. For starters, it grabbed Datacoral, a startup that abstracts away the infrastructure needed to build a data pipeline without using code. It also acquired Cazena, a startup that helps customers build cloud data lakes, giving the company a more modern approach to processing big data. Bearden sees both of these services helping Cloudera reposition itself in the big data self-service market.

“Both businesses will enable our combined customers to enjoy a reduction in complexity and faster time to value for their data initiatives, leading to improved insights, faster innovation, and stronger engagements with their customers and partners,” Bearden said in a statement.

Cloudera went public in 2017, closing its first day of trading at $18.09 a share after raising more than $1 billion in venture funding. The vast majority of that was a $740 million investment from Intel Capital in 2014. It’s worth noting that Cloudera bought Intel’s stake in the company at the end of last year for $314 million.

Hortonworks raised another $248 million. A third Hadoop startup, MapR, raised $280 million. MapR’s assets were sold rather unceremoniously to HPE in 2019 for a price pegged at under $50 million, showing just how far the market has fallen since its earlier glory days.

The Cloudera deal includes a brief “go shop” provision that allows it to continue to look for a better deal. It’s doubtful it will find one, and if it doesn’t, the transaction with KKR and CD&R is expected to close in the second half of this year, subject to typical regulatory review. The company will announce earnings later today.

Oct
29
2019
--

Datameer announces $40M investment as it pivots away from Hadoop roots

Datameer, the company that was born as a data prep startup on top of the open-source Hadoop project, announced a $40 million investment and a big pivot away from Hadoop, while staying true to its big data roots.

The investment was led by existing investor ST Telemedia. Existing investors Redpoint Ventures, Kleiner Perkins, Nextworld Capital, Citi Ventures and Top Tier Capital Partners also participated. Today’s investment brings the total raised to almost $140 million, according to Crunchbase data.

Company CEO Christian Rodatus says the company’s original mission was about making Hadoop easier to use for data scientists, business analysts and engineers. In the last year, the three biggest commercial Hadoop vendors — Cloudera, Hortonworks and MapR — fell on hard times. Cloudera and Hortonworks merged and MapR was sold to HPE in a fire sale.

Starting almost two years ago, Datameer recognized that against this backdrop, it was time for a change. It began developing a couple of new products. It didn’t want to abandon its existing customer base entirely, of course, so it began rebuilding its Hadoop product and is now calling it Datameer X. It is a modern cloud-native product built to run on Kubernetes, the popular open-source container orchestration tool. Instead of Hadoop, it will be based on Spark. Rodatus reports the company is about two-thirds of the way through this pivot, though the product is already in the hands of customers.

The company also announced Neebo, an entirely new SaaS tool to give data scientists the ability to process data in whatever form it takes. Rodatus sees a world coming where data will take many forms, from traditional data to Python code from data analysts or data scientists to SaaS vendor dashboards. He sees Neebo bringing all of this together in a managed service with the hope that it will free data scientists to concentrate on getting insight from the data. It will work with data visualization tools like Tableau and Looker, and should be generally available in the coming weeks.

The money should help them get through this pivot, hire more engineers to continue the process and build a go-to-market team for the new products. It’s never easy pivoting like this, but the investors are likely hoping that the company can build on its existing customer base, while taking advantage of the market need for data science processing tools. Time will tell if it works.

Oct
15
2019
--

Databricks brings its Delta Lake project to the Linux Foundation

Databricks, the big data analytics service founded by the original developers of Apache Spark, today announced that it is bringing its Delta Lake open-source project for building data lakes to the Linux Foundation under an open governance model. The company announced the launch of Delta Lake earlier this year, and, even though it’s still a relatively new project, it has already been adopted by many organizations and has found backing from companies like Intel, Alibaba and Booz Allen Hamilton.

“In 2013, we had a small project where we added SQL to Spark at Databricks […] and donated it to the Apache Foundation,” Databricks CEO and co-founder Ali Ghodsi told me. “Over the years, slowly people have changed how they actually leverage Spark and only in the last year or so it really started to dawn upon us that there’s a new pattern that’s emerging and Spark is being used in a completely different way than maybe we had planned initially.”

This pattern, he said, is that companies are taking all of their data and putting it into data lakes and then doing a couple of things with this data, machine learning and data science being the obvious ones. But they are also doing things that are more traditionally associated with data warehouses, like business intelligence and reporting. The term Ghodsi uses for this kind of usage is “Lake House.” More and more, Databricks is seeing that Spark is being used for this purpose and not just to replace Hadoop and do ETL (extract, transform, load). “This kind of Lake House patterns we’ve seen emerge more and more and we wanted to double down on it.”

Spark 3.0, which is launching soon, enables more of these use cases and speeds them up significantly, and it adds a pluggable data catalog to Spark.

Delta Lake, Ghodsi said, is essentially the data layer of the Lake House pattern. It brings support for ACID transactions to data lakes, scalable metadata handling and data versioning, for example. All the data is stored in the Apache Parquet format and users can enforce schemas (and change them with relative ease if necessary).
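For a concrete feel of those features, here is a minimal PySpark sketch, assuming a Spark session configured with the open-source delta-spark package and a hypothetical table path: each write is transactional, the data lands as Parquet plus a transaction log, the schema is enforced on append unless evolution is requested, and earlier versions stay queryable.

```python
from pyspark.sql import SparkSession

# Assumes the open-source delta-spark package is on the classpath and the
# session is configured with the Delta extensions.
spark = (
    SparkSession.builder.appName("delta-lake-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "/tmp/example/events_delta"  # hypothetical table location

df = spark.createDataFrame([(1, "login"), (2, "purchase")], ["user_id", "event"])

# Each write is an ACID transaction; the data is stored as Parquet files
# plus a transaction log.
df.write.format("delta").mode("overwrite").save(table_path)

# Schema enforcement: appending a frame with extra or mismatched columns
# fails unless schema evolution is explicitly requested.
more = spark.createDataFrame([(3, "logout", "web")], ["user_id", "event", "channel"])
more.write.format("delta").mode("append").option("mergeSchema", "true").save(table_path)

# Data versioning ("time travel"): read the table as of an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load(table_path).show()
```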

It’s interesting to see Databricks choose the Linux Foundation for this project, given that its roots are in the Apache Foundation. “We’re super excited to partner with them,” Ghodsi said about why the company chose the Linux Foundation. “They run the biggest projects on the planet, including the Linux project but also a lot of cloud projects. The cloud-native stuff is all in the Linux Foundation.”

“Bringing Delta Lake under the neutral home of the Linux Foundation will help the open-source community dependent on the project develop the technology addressing how big data is stored and processed, both on-prem and in the cloud,” said Michael Dolan, VP of Strategic Programs at the Linux Foundation. “The Linux Foundation helps open-source communities leverage an open governance model to enable broad industry contribution and consensus building, which will improve the state of the art for data storage and reliability.”

Aug
07
2019
--

With MapR fire sale, Hadoop’s promise has fallen on hard times

If you go back about a decade, Hadoop was hot and getting hotter. It was a platform for processing big data, just as big data was emerging from the domain of a few web-scale companies to one where every company was suddenly concerned about processing huge amounts of data. The future looked bright: an open-source project with a bunch of startups emerging to fulfill that big data promise in the enterprise.

Three companies in particular emerged out of that early scrum — Cloudera, Hortonworks and MapR — and between them raised more than $1.5 billion. The lion’s share of that went to Cloudera in one massive chunk when Intel Capital invested a whopping $740 million in the company. But times have changed.

[Chart via TechCrunch, Crunchbase, Infogram]

Falling hard

Just yesterday, HPE bought the assets of MapR, a company that had raised $280 million. The deal was pegged at under $50 million, according to multiple reports. That’s not what you call a healthy return on investment.

Jun
10
2019
--

Qubole launches Quantum, its serverless database engine

Qubole, the data platform founded by Apache Hive creator and former head of Facebook’s Data Infrastructure team Ashish Thusoo, today announced the launch of Quantum, its first serverless offering.

Qubole may not necessarily be a household name, but its customers include the likes of Autodesk, Comcast, Lyft, Nextdoor and Zillow. For these users, Qubole has long offered a self-service platform that allowed their data scientists and engineers to build their AI, machine learning and analytics workflows on the public cloud of their choice. The platform sits on top of open-source technologies like Apache Spark, Presto and Kafka, for example.

Typically, enterprises have to provision a considerable amount of infrastructure to give these platforms the resources they need. Those resources often go unused, and the infrastructure can quickly become complex.

Qubole already abstracts most of this away, offering what is essentially a serverless platform. With Quantum, however, it is going a step further by launching a high-performance serverless SQL engine that allows users to query petabytes of data with nothing else but ANSI-SQL, giving them the choice between using a Presto cluster or a serverless SQL engine to run their queries, for example.

The data can be stored on AWS and users won’t have to set up a second data lake or move their data to another platform to use the SQL engine. Quantum automatically scales up or down as needed, of course, and users can still work with the same metastore for their data, no matter whether they choose the clustered or serverless option. Indeed, Quantum is essentially just another SQL engine within Qubole’s overall suite of engines.

Typically, Qubole charges enterprises by compute minutes. When using Quantum, the company uses the same metric, but enterprises pay for the execution time of the query. “So instead of the Qubole compute units being associated with the number of minutes the cluster was up and running, it is associated with the Qubole compute units consumed by that particular query or that particular workload, which is even more fine-grained,” Thusoo explained. “This works really well when you have to do interactive workloads.”
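A rough back-of-the-envelope illustration of the difference (all numbers hypothetical, not Qubole’s actual rates): with cluster-based billing you pay for every minute the cluster is up, whether or not queries are running, while the Quantum model charges only for the minutes each query actually executes.

```python
# All numbers are hypothetical and purely illustrative; they are not
# Qubole's actual rates.
CUS_PER_MINUTE = 2.0                  # compute units accrued per minute

cluster_uptime_min = 8 * 60           # cluster kept running for an 8-hour day
query_exec_minutes = [3, 7, 12, 5]    # four interactive queries actually run

cluster_billing = cluster_uptime_min * CUS_PER_MINUTE
per_query_billing = sum(query_exec_minutes) * CUS_PER_MINUTE

print(f"cluster-uptime billing: {cluster_billing:.0f} compute units")
print(f"per-query billing:      {per_query_billing:.0f} compute units")
# For a bursty, interactive workload, the per-query model charges for the
# 27 minutes of actual execution rather than the full 480 minutes of uptime.
```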

Thusoo notes that Quantum is targeted at analysts who often need to perform interactive queries on data stored in object stores. Qubole integrates with services like Tableau and Looker (which Google is now in the process of acquiring). “They suddenly get access to very elastic compute capacity, but they are able to come through a very familiar user interface,” Thusoo noted.

 

Jan
03
2019
--

Cloudera and Hortonworks finalize their merger

Cloudera and Hortonworks, two of the biggest players in the Hadoop big data space, today announced that they have finalized their all-stock merger. The new company will use the Cloudera brand and will continue to trade under the CLDR symbol on the New York Stock Exchange.

“Today, we start an exciting new chapter for Cloudera as we become the leading enterprise data cloud provider,” said Tom Reilly, chief executive officer of Cloudera, in today’s announcement. “This combined team and technology portfolio establish the new Cloudera as a clear market leader with the scale and resources to drive continued innovation and growth. We will provide customers a comprehensive solution-set to bring the right data analytics to data anywhere the enterprise needs to work, from the Edge to AI, with the industry’s first Enterprise Data Cloud.”

The companies describe the deal as a “merger of equals,” though Cloudera stockholders will own about 60 percent of the equity in the company.

The combined company expects to generate more than $720 million in revenue from its 2,500 customers that rely on it to help them manage the complexities of processing their data. While Hadoop itself is open source and freely available, Cloudera and Hortonworks abstract away most of the infrastructure. Both focused on slightly different markets, though, with Hortonworks going after a more technical user and a pure open-source approach, while Cloudera also offered some proprietary tools.

“Together, we are well-positioned to continue growing and competing in the streaming and IoT, data management, data warehousing, machine learning/AI and hybrid cloud markets,” said Hortonworks CEO Rob Bearden back when the deal was first announced. “Importantly, we will be able to offer a broader set of offerings that will enable our customers to capitalize on the value of their data.”

May
31
2018
--

Don’t Drown in your Data Lake

A data lake is “…a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms…”1. Many companies find value in using a data lake but aren’t aware that they need to properly plan for it and maintain it in order to prevent issues.

The idea of a data lake rose from the need to store data in a raw format that is accessible to a variety of applications and authorized users. Hadoop is often used to query the data, and the necessary structures for querying are created through the query tool (schema on read) rather than as part of the data design (schema on write). There are other tools available for analysis, and many cloud providers are actively developing additional options for creating and managing your data lake. The cloud is often viewed as an ideal place for your data lake since it is inherently elastic and can expand to meet the needs of your data.
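To make the schema-on-read idea concrete, here is a minimal PySpark sketch; the path and column names are hypothetical. The structure is supplied by the consumer at query time rather than baked into the files when they were written.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Raw JSON files sit in the lake with no predefined structure
# (schema on write was never applied).
raw_path = "s3a://example-lake/raw/clickstream/"  # hypothetical location

# Schema on read: the consumer declares the structure at query time.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("page", StringType()),
    StructField("duration_sec", DoubleType()),
])

events = spark.read.schema(clickstream_schema).json(raw_path)

# A different consumer could read the same files with a different schema,
# or simply let Spark infer one.
inferred = spark.read.json(raw_path)

events.createOrReplaceTempView("events")
spark.sql("SELECT page, COUNT(*) AS views FROM events GROUP BY page").show()
```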

Data Lake or Data Swamp?

One of the key components of a functioning data lake is the continuing inflow and egress of data. Some data must be kept indefinitely, but some can be archived or deleted after a defined period of time. Failure to remove stale data can result in a data swamp, where out-of-date data takes up valuable and costly space and may cause queries to take longer to complete. This is one of the first issues that companies encounter in maintaining their data lake. Often, people view the data lake as a “final resting place” for data, but it really should be used for data that is accessed often, or at least occasionally.

A natural spring-fed lake can turn into a swamp due to a variety of factors. If fresh water is not allowed to flow into the lake, stagnation sets in, and plants and animals that the lake previously could not support take hold. Similarly, if water cannot exit the lake at some point, the borders will be breached and the surrounding land will be inundated. Both of these conditions can cause a once pristine lake to turn into a fetid and undesirable swamp. If data is no longer being added to your data lake, the results will become dated and eventually unreliable. Conversely, if data is always being added to the lake but is not accessed on a regular basis, this can lead to unrestricted growth of your data lake, with no real plan for how the data will be used. The lake then becomes a "cold storage" facility that is likely more expensive than proper archival storage.

If bad or undesirable items, like old cars or garbage, are thrown into a lake, this can damage the ecosystem, causing unwanted reactions. In a data lake, this is akin to simply throwing data into the data lake with no real rules or rationale. While the data is saved, it may not be useful and can cause negative consequences across the whole environment since it is consuming space and may slow response times. Even though a basic concept of a data lake is that the data does not need to conform to a predefined structure, like you would see with a relational database, it is important that some rules and guidelines exist regarding the type and quality of data that is included in the lake. In the absence of some guidelines, it becomes difficult to access the relevant data for your needs. Proper definition and tagging of content help to ensure that the correct data is accessible and available when needed.
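One lightweight way to enforce such guidelines is to validate and tag data on its way into the lake. The sketch below is illustrative only, with hypothetical tag names and paths: files missing required metadata are rejected, and accepted files are stored with a sidecar manifest recording their owner, source, and retention period.

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical guideline: every file entering the lake must carry these tags.
REQUIRED_TAGS = {"owner", "source_system", "retention_days"}

def ingest(raw_file: Path, lake_root: Path, tags: dict) -> Path:
    """Copy a file into the lake only if it carries the required tags."""
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(f"refusing to ingest {raw_file}: missing tags {missing}")

    # Partition by source and ingest date so stale data is easy to find later.
    target_dir = lake_root / tags["source_system"] / f"ingest_date={date.today()}"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / raw_file.name
    target.write_bytes(raw_file.read_bytes())

    # Store the tags as a sidecar manifest next to the data.
    manifest = target_dir / f"{raw_file.stem}.manifest.json"
    manifest.write_text(json.dumps(tags, indent=2))
    return target

# Example call (hypothetical paths and tag values):
# ingest(Path("readings.csv"), Path("/data/lake"),
#        {"owner": "iot-team", "source_system": "sensors", "retention_days": 90})
```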

Unrestricted Growth Consequences

Many people have a junk drawer somewhere in their house: a drawer filled with old receipts, used tickets, theater programs, and the like. Some of this may be stored for sentimental reasons, but a lot of it is put into this drawer simply because it was a convenient dropping place. Similarly, if we treat the data lake as the "junk drawer" for our company, it is guaranteed to be bigger and more expensive than it truly needs to be.

It is important that the data that is stored in your data lake has a current or expected purpose. While you may not have a current use for some data, it can be helpful to keep it around in case a need arises. An example of this is in the area of machine learning: providing more ancillary data enables better decisions because it gives a deeper view into the decision process. Therefore, maintaining some data that may not have a specific and current need can be helpful. However, there are cases where maintaining a huge volume of data can be counterproductive. Consider temperature information delivered from a switch. If the temperature reaches a specific threshold, the switch should be shut down. Reporting on the temperature in an immediate and timely manner is important to make an informed decision, but stable temperature data from days, weeks, or months ago could be summarized and stored in a more efficient manner. The granular details can then be purged from the lake.
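As a sketch of that summarize-then-purge idea (paths and column names are hypothetical), daily aggregates of stale sensor readings can be written to a compact summary area before the granular rows are dropped from the lake.

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retention-sketch").getOrCreate()

raw_path = "s3a://example-lake/raw/switch_temps/"          # hypothetical
summary_path = "s3a://example-lake/summary/switch_temps/"  # hypothetical
cutoff = str(date.today() - timedelta(days=30))

readings = spark.read.parquet(raw_path)  # columns: switch_id, ts, temp_c

# Collapse stale, stable readings into one row per switch per day.
old = readings.filter(F.to_date("ts") < F.lit(cutoff))
daily = (old.groupBy("switch_id", F.to_date("ts").alias("day"))
            .agg(F.min("temp_c").alias("min_c"),
                 F.max("temp_c").alias("max_c"),
                 F.avg("temp_c").alias("avg_c")))
daily.write.mode("append").partitionBy("day").parquet(summary_path)

# The raw area is then rewritten without the purged rows. With plain Parquet
# that means writing a compacted copy; a table format with delete support
# (Delta, Iceberg, Hive ACID) could drop the old partitions in place instead.
recent = readings.filter(F.to_date("ts") >= F.lit(cutoff))
recent.write.mode("overwrite").parquet(raw_path.rstrip("/") + "_compacted")
```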

So, where is the balance? If you keep all the data, it can make your data lake unwieldy and costly. If you only keep data that has a specific current purpose, you may be impairing your future plans. Obviously, the key is to monitor your access and use of the data frequently, and purge or archive some of the data that is not being regularly used.

Uncontrolled Access Concerns

Since much of the data in your data lake is company confidential, it is imperative that access to that data be controlled. The fact that the data in the lake is stored in its raw format means that it is more difficult to control access. The structures of a relational database provide some of the basis for access control, allowing us to limit who has access to specific queries, tables, fields, schemas, databases, and other objects. In the absence of these structures, controlling access requires more finesse. You must determine who has access to which parts of the data in the lake, and the data must be isolated within your own network environment. Many of these restrictions may already be in place in your current environment, but they should be reviewed before being relied on fully, since the data lake may store information that was previously unavailable to some users. Access should be regularly reviewed to identify potential rogue activities. Encryption options also exist to further secure the data from unwanted access, and file system security can be used to limit access. All of these components must be considered, implemented, and reviewed to ensure that the data is secure.
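As one illustration of that kind of control (the bucket, prefix, account, and role names below are hypothetical), an object-store bucket policy can restrict who may read a confidential portion of the lake and reject unencrypted uploads; comparable mechanisms exist for HDFS permissions and ACLs.

```python
import json
import boto3  # assumes credentials that are allowed to manage the bucket

BUCKET = "example-data-lake"   # hypothetical bucket holding the lake
RAW_PREFIX = "raw/finance/*"   # confidential portion of the lake

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Only the analytics role may read the confidential prefix.
            "Sid": "AllowFinanceReaders",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/finance-analysts"},
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{RAW_PREFIX}",
        },
        {   # Reject any upload that does not request server-side encryption.
            "Sid": "DenyUnencryptedWrites",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```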

User Considerations

In a relational database, the data structure inherently determines some of the consistencies and format of the data. This enables users to easily query the data and be assured that they are returning valid results. The lack of such structures in the data lake means that users must be more highly skilled at data manipulation. Having users with less skill accessing the data is possible, but it may not provide the best results. A data scientist is better positioned to access and query the complete data set. Obviously, users with a higher skill set are rare and cost more to hire, but the return may be worth it in the long run.

So What Do I Do Now?

This is an area where there are no hard and fast rules. Each company must develop and implement processes and procedures that make sense for their individual needs. Only with a plan for monitoring inputs, outputs, access patterns, and the like are you able to make a solid determination for your company’s needs. Percona can help to determine a plan for reporting usage, assess security settings, and more. As you are using the data in your data lake, we can also provide guidance regarding tools used to access the data.

1 Wikipedia, May 22, 2018


Oct
03
2017
--

Investors place $25M on AtScale to get the big picture of big data

AtScale, a four-year-old startup that helps companies get a big-picture view of their big data inside their BI tools, announced a $25 million Series C investment today. The round was led by Atlantic Bridge with participation from new investors Wells Fargo and Industry Ventures along with returning investors Storm Ventures, UMC, Comcast and XSeed Capital. With today’s investment, the…

Jun
06
2017
--

Databricks releases serverless platform for Apache Spark along with new library supporting deep learning

Today to kick off Spark Summit, Databricks announced a Serverless Platform for Apache Spark — welcome news for developers looking to reduce time spent on cluster management. The move to simplify developer experiences is set to be a major theme of the event overall. In addition to Serverless, the company also introduced Deep Learning Pipelines, a library that makes it easy to mix…
