Jun
10
2019
--

Qubole launches Quantum, its serverless database engine

Qubole, the data platform founded by Apache Hive creator and former head of Facebook’s Data Infrastructure team Ashish Thusoo, today announced the launch of Quantum, its first serverless offering.

Qubole may not be a household name, but its customers include the likes of Autodesk, Comcast, Lyft, Nextdoor and Zillow. For these users, Qubole has long offered a self-service platform that lets their data scientists and engineers build AI, machine learning and analytics workflows on the public cloud of their choice. The platform sits on top of open-source technologies like Apache Spark, Presto and Kafka.

Typically, enterprises have to provision a considerable amount of infrastructure to give these platforms the resources they need. Those resources often go unused, and the infrastructure can quickly become complex.

Qubole already abstracts most of this away, offering what is essentially a serverless platform. With Quantum, however, it is going a step further by launching a high-performance serverless SQL engine that allows users to query petabytes of data with nothing but ANSI SQL, giving them the choice between running their queries on a Presto cluster or on the serverless SQL engine.

The data can be stored on AWS, and users won’t have to set up a second data lake or move their data to another platform to use the SQL engine. Quantum automatically scales up or down as needed, of course, and users can still work with the same metastore for their data, whether they choose the clustered or serverless option. Indeed, Quantum is essentially just another SQL engine within Qubole’s overall suite of engines.
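To make that concrete, here is a minimal sketch of what an interactive ANSI-SQL query over data sitting in object storage can look like. Since Qubole’s engines are Presto-compatible, the sketch uses the community PyHive client rather than Qubole’s own SDK; the host, port, schema and table names are assumptions for illustration.

```python
# Illustrative sketch only: connects to an assumed Presto-compatible endpoint.
# Host, port, catalog, schema and table names are hypothetical.
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8080,
                      catalog="hive", schema="analytics")
cur = conn.cursor()

# Plain ANSI SQL against a table defined in the shared metastore -- no cluster
# sizing and no data movement appear in the query itself.
cur.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM clickstream
    WHERE event_date >= DATE '2019-05-01'
    GROUP BY event_date
    ORDER BY event_date
""")

for row in cur.fetchall():
    print(row)
```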

Typically, Qubole charges enterprises by compute minutes. When using Quantum, the company uses the same metric, but enterprises pay for the execution time of the query. “So instead of the Qubole compute units being associated with the number of minutes the cluster was up and running, it is associated with the Qubole compute units consumed by that particular query or that particular workload, which is even more fine-grained,” Thusoo explained. “This works really well when you have to do interactive workloads.”
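A rough back-of-the-envelope comparison shows the difference between the two billing models; all numbers below are made up for illustration and do not reflect Qubole’s actual rates.

```python
# Hypothetical numbers only: compares paying for cluster uptime vs. paying for
# the compute actually consumed by one interactive query.
CLUSTER_MINUTES_UP = 180          # cluster kept running for three hours
QUERY_EXECUTION_MINUTES = 12      # the query itself ran for twelve minutes
QCU_PER_MINUTE = 2                # assumed compute-unit burn rate

cluster_based_cost = CLUSTER_MINUTES_UP * QCU_PER_MINUTE       # billed on uptime
per_query_cost = QUERY_EXECUTION_MINUTES * QCU_PER_MINUTE      # billed on the query

print(f"cluster-based billing: {cluster_based_cost} compute units")
print(f"per-query billing:     {per_query_cost} compute units")
```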

Thusoo notes that Quantum is targeted at analysts who often need to perform interactive queries on data stored in object stores. Qubole integrates with services like Tableau and Looker (which Google is now in the process of acquiring). “They suddenly get access to very elastic compute capacity, but they are able to come through a very familiar user interface,” Thusoo noted.


Jan
03
2019
--

Cloudera and Hortonworks finalize their merger

Cloudera and Hortonworks, two of the biggest players in the Hadoop big data space, today announced that they have finalized their all-stock merger. The new company will use the Cloudera brand and will continue to trade under the CLDR symbol on the New York Stock Exchange.

“Today, we start an exciting new chapter for Cloudera as we become the leading enterprise data cloud provider,” said Tom Reilly, chief executive officer of Cloudera, in today’s announcement. “This combined team and technology portfolio establish the new Cloudera as a clear market leader with the scale and resources to drive continued innovation and growth. We will provide customers a comprehensive solution-set to bring the right data analytics to data anywhere the enterprise needs to work, from the Edge to AI, with the industry’s first Enterprise Data Cloud.”

The companies describe the deal as a “merger of equals,” though Cloudera stockholders will own about 60 percent of the equity in the combined company.

The combined company expects to generate more than $720 million in revenue from its 2,500 customers, which rely on it to help them manage the complexities of processing their data. While Hadoop itself is open source and freely available, Cloudera and Hortonworks abstract away most of the infrastructure. The two companies focused on slightly different markets, though: Hortonworks went after more technical users with a pure open-source approach, while Cloudera also offered some proprietary tools.

“Together, we are well-positioned to continue growing and competing in the streaming and IoT, data management, data warehousing, machine learning/AI and hybrid cloud markets,” said Hortonworks CEO Rob Bearden back when the deal was first announced. “Importantly, we will be able to offer a broader set of offerings that will enable our customers to capitalize on the value of their data.”

May
31
2018
--

Don’t Drown in your Data Lake


A data lake is “…a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms…”[1]. Many companies find value in using a data lake, but they aren’t always aware that they need to properly plan for it and maintain it in order to prevent issues.

The idea of a data lake rose from the need to store data in a raw format that is accessible to a variety of applications and authorized users. Hadoop is often used to query the data, and the necessary structures for querying are created through the query tool (schema on read) rather than as part of the data design (schema on write). There are other tools available for analysis, and many cloud providers are actively developing additional options for creating and managing your data lake. The cloud is often viewed as an ideal place for your data lake since it is inherently elastic and can expand to meet the needs of your data.
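As a rough illustration of schema on read, the sketch below uses PySpark to declare a structure only at query time, while the files in the lake stay in their raw format; the path and column names are assumptions.

```python
# Schema-on-read sketch with PySpark: structure is declared when the data is
# read, not when it landed in the lake. Path and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

events_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# The raw JSON files are untouched; the schema applies only to this read.
events = spark.read.schema(events_schema).json("s3a://example-data-lake/raw/events/")
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT event_type, SUM(amount) AS total_amount
    FROM events
    GROUP BY event_type
""").show()
```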

Data Lake or Data Swamp?

One of the key components of a functioning data lake is the continuing inflow and outflow of data. Some data must be kept indefinitely, but some can be archived or deleted after a defined period of time. Failure to remove stale data can result in a data swamp, where out-of-date data takes up valuable and costly space and may cause queries to take longer to complete. This is one of the first issues companies encounter in maintaining their data lake. People often view the data lake as a “final resting place” for data, but it really should hold data that is accessed often, or at least occasionally.
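One simple way to keep data flowing out as well as in is a periodic retention sweep. The sketch below moves files that have not been modified within a retention window into an archive area; the paths and the 180-day window are assumptions, and a lake built on object storage would more likely use the store’s own lifecycle or archival features.

```python
# Retention-sweep sketch: move files untouched for RETENTION_DAYS out of the
# "hot" lake area into an archive area. Paths and window are assumptions.
import shutil
import time
from pathlib import Path

LAKE_DIR = Path("/data/lake/raw")
ARCHIVE_DIR = Path("/data/lake/archive")
RETENTION_DAYS = 180

cutoff = time.time() - RETENTION_DAYS * 24 * 3600

for path in LAKE_DIR.rglob("*"):
    if path.is_file() and path.stat().st_mtime < cutoff:
        target = ARCHIVE_DIR / path.relative_to(LAKE_DIR)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))  # archive rather than delete outright
```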

A natural spring-fed lake can turn into a swamp due to a variety of factors. If fresh water is not allowed to flow into the lake, it stagnates, and plants and animals that the lake previously could not support take hold. Conversely, if water cannot exit the lake, its borders are breached and the surrounding land is inundated. Either condition can turn a once pristine lake into a fetid and undesirable swamp. The data lake behaves similarly: if data is no longer being added, the results become dated and eventually unreliable; if data is always being added but rarely accessed, the lake grows without restriction and with no real plan for how the data will be used. At that point it is little more than a “cold storage” facility that is likely more expensive than true archival storage.

If bad or undesirable items, like old cars or garbage, are thrown into a lake, this can damage the ecosystem, causing unwanted reactions. In a data lake, this is akin to simply throwing data into the data lake with no real rules or rationale. While the data is saved, it may not be useful and can cause negative consequences across the whole environment since it is consuming space and may slow response times. Even though a basic concept of a data lake is that the data does not need to conform to a predefined structure, like you would see with a relational database, it is important that some rules and guidelines exist regarding the type and quality of data that is included in the lake. In the absence of some guidelines, it becomes difficult to access the relevant data for your needs. Proper definition and tagging of content help to ensure that the correct data is accessible and available when needed.
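Even a lightweight rule can help here. The sketch below rejects any incoming dataset that does not carry a minimal set of descriptive tags; the required tags are an example policy, not a standard.

```python
# Example ingestion rule: a dataset must carry these tags before it is allowed
# into the lake. The tag names are an illustrative policy choice.
REQUIRED_TAGS = {"owner", "source_system", "ingest_date", "retention_class"}

def missing_tags(tags: dict) -> list:
    """Return the required tags that are absent (empty list means acceptable)."""
    return sorted(REQUIRED_TAGS - set(tags))

incoming = {"owner": "marketing", "source_system": "crm", "ingest_date": "2018-05-30"}

gaps = missing_tags(incoming)
if gaps:
    print(f"Rejecting dataset: missing tags {gaps}")  # route to quarantine, not the lake
else:
    print("Dataset accepted into the lake")
```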

Unrestricted Growth Consequences

Many people have a junk drawer somewhere in their house; a drawer that is filled with old receipts, used tickets, theater programs, and the like. Some of this may be stored for sentimental reasons, but a lot of it is put into this drawer since it was a convenient dropping place for things. Similarly, if we look to the data lake as the “junk drawer” for our company, it is guaranteed to be bigger and more expensive than it truly needs to be.

It is important that the data that is stored in your data lake has a current or expected purpose. While you may not have a current use for some data, it can be helpful to keep it around in case a need arises. An example of this is in the area of machine learning. Providing more ancillary data enables better decisions since it provides a deeper view into the decision process. Therefore, maintaining some data that may not have a specific and current need can be helpful. However, there are cases where maintaining a huge volume of data can be counterproductive. Consider temperature information delivered from a switch. If the temperature reaches a specific threshold, the switch should be shut down. Reporting on the temperature in an immediate and timely manner is important to make an informed decision, but stable temperature data from days, weeks, or months ago could be summarized and stored in a more efficient manner. The granular details can then be purged from the lake.
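For the temperature example, here is a small sketch of the summarize-then-purge idea: readings older than a window are rolled up to daily minimum, mean and maximum values, and only the recent granular rows are kept. The seven-day window and column names are assumptions.

```python
# Summarize-then-purge sketch with pandas. Window and columns are assumptions.
import pandas as pd

readings = pd.DataFrame({
    "ts": pd.date_range("2018-05-01", periods=2000, freq="H"),
    "temp_c": 40.0,  # placeholder sensor values
})

cutoff = readings["ts"].max() - pd.Timedelta(days=7)

# Stable, older readings are rolled up to daily summaries...
old = readings[readings["ts"] < cutoff]
daily_summary = old.resample("D", on="ts")["temp_c"].agg(["min", "mean", "max"])

# ...and only the recent granular rows stay in the lake's hot storage.
recent = readings[readings["ts"] >= cutoff]
print(daily_summary.tail())
print(f"granular rows retained: {len(recent)}")
```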

So, where is the balance? If you keep all the data, it can make your data lake unwieldy and costly. If you only keep data that has a specific current purpose, you may be impairing your future plans. Obviously, the key is to monitor your access and use of the data frequently, and purge or archive some of the data that is not being regularly used.

Uncontrolled Access Concerns

Since much of the data in your data lake is company confidential, it is imperative that access to that data be controlled. The fact that the data in the lake is stored in its raw format means that it is more difficult to control access. The structures of a relational database provide some of the basis for access control, allowing us to limit who has access to specific queries, tables, fields, schemas, databases, and other objects. In the absence of these structures, controlling access requires more finesse. You must determine who has access to which parts of the data in the lake, and you must isolate the data within your own network environment. Many of these restrictions may already be in place in your current environment, but they should be reviewed before being relied on fully, since the data lake may store information that was previously unavailable to some users. Access should be reviewed regularly to identify potential rogue activity. Encryption can further secure the data from unwanted access, and file-system security can be used to limit access. All of these components must be considered, implemented, and reviewed to ensure that the data is secure.
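Without table- and schema-level grants, access rules for a lake often end up expressed against storage paths or object-key prefixes. The sketch below shows that idea in its simplest form; the roles, prefixes and check function are hypothetical, not a feature of any particular product.

```python
# Hypothetical prefix-based access check for objects in a raw data lake.
ROLE_PREFIXES = {
    "analyst":        ["lake/curated/"],
    "data_scientist": ["lake/curated/", "lake/raw/"],
    "finance":        ["lake/curated/finance/"],
}

def can_read(role: str, object_key: str) -> bool:
    """Allow a read only if the key falls under a prefix granted to the role."""
    return any(object_key.startswith(prefix) for prefix in ROLE_PREFIXES.get(role, []))

assert can_read("data_scientist", "lake/raw/events/2018/05/30/part-0001.json")
assert not can_read("analyst", "lake/raw/events/2018/05/30/part-0001.json")
```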

User Considerations

In a relational database, the data structure inherently enforces some consistency and formatting of the data. This lets users easily query the data and be assured that they are getting valid results. The lack of such structures in the data lake means that users must be more highly skilled at data manipulation. Having less-skilled users access the data is possible, but it may not produce the best results. A data scientist is better positioned to access and query the complete data set. Obviously, users with a higher skill set are rarer and cost more to hire, but the return may be worth it in the long run.

So What Do I Do Now?

This is an area where there are no hard and fast rules. Each company must develop and implement processes and procedures that make sense for their individual needs. Only with a plan for monitoring inputs, outputs, access patterns, and the like are you able to make a solid determination for your company’s needs. Percona can help to determine a plan for reporting usage, assess security settings, and more. As you are using the data in your data lake, we can also provide guidance regarding tools used to access the data.

[1] Wikipedia, May 22, 2018

The post Don’t Drown in your Data Lake appeared first on Percona Database Performance Blog.

Oct
03
2017
--

Investors place $25M on AtScale to get the big picture of big data

AtScale, a four-year-old startup that helps companies get a big-picture view of their big data inside their BI tools, announced a $25 million Series C investment today. The round was led by Atlantic Bridge with participation from new investors Wells Fargo and Industry Ventures, along with returning investors Storm Ventures, UMC, Comcast and XSeed Capital. With today’s investment, the… Read More

Jun
06
2017
--

Databricks releases serverless platform for Apache Spark along with new library supporting deep learning

Today to kick off Spark Summit, Databricks announced a Serverless Platform for Apache Spark — welcome news for developers looking to reduce time spent on cluster management. The move to simplify developer experiences is set to be a major theme of the event overall. In addition to Serverless, the company also introduced Deep Learning Pipelines, a library that makes it easy to mix… Read More

Apr
01
2017
--

Cloudera finally ready for the public stage

When I first met Cloudera CEO Tom Reilly in 2015 at the Intel Capital Summit, we were about to go on stage for a fireside chat to discuss, among other things, Intel’s massive investment in his company.
While on stage, the conversation inevitably turned to when the company might go public. As you might expect, he gave me the standard startup CEO answer. While Cloudera was certainly of… Read More

Jan
30
2017
--

MXNet accepted to the Apache Incubator

MXNet, Amazon Web Services’ preferred deep learning framework, was accepted to the Apache Incubator today. Admission to the incubator is the first step necessary for the open-source initiative to officially become part of the Apache Software Foundation. The Apache Software Foundation supports the efforts of thousands of developers maintaining open-source projects around the world.… Read More

Jan
12
2017
--

Talend looks to ease big data prep with latest release

Talend, the big data integration vendor that went public last July, announced its winter release today with new tools to help automate data preparation, a sticky problem for enterprise customers. Surely, there are ever-increasing amounts of data and companies struggle to keep up. There aren’t enough data scientists in the world to fill the need. It requires software to pick up some of… Read More

Sep
27
2016
--

IBM releases DataWorks to give enterprise data a home and a brain

While the gears of research are turning fast, developing new methods of machine intelligence, another, perhaps more impactful, trend is brewing in the field. Open source frameworks like Apache Spark are hitting their stride at the ideal time to put data analytics in the hands of the business development analyst without forgetting about the needs of the data scientist. IBM’s new… Read More

May
23
2016
--

Cray’s latest supercomputer runs OpenStack and open source big data tools

Cray has always been associated with speed and power, and its latest computing beast, the Cray Urika-GX system, has been designed specifically for big data workloads. What’s more, it runs on OpenStack, the open source cloud platform, and supports open source big data processing tools like Hadoop and Spark. Cray recognizes that the computing world has evolved since Seymour Cray… Read More
