May
31
2018
--

Don’t Drown in your Data Lake


A data lake is “…a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms…”1. Many companies find value in using a data lake, but they are not always clear that it needs to be properly planned and maintained in order to prevent issues.

The idea of a data lake rose from the need to store data in a raw format that is accessible to a variety of applications and authorized users. Hadoop is often used to query the data, and the necessary structures for querying are created through the query tool (schema on read) rather than as part of the data design (schema on write). There are other tools available for analysis, and many cloud providers are actively developing additional options for creating and managing your data lake. The cloud is often viewed as an ideal place for your data lake since it is inherently elastic and can expand to meet the needs of your data.
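
To make schema on read concrete, here is a minimal sketch using PySpark; the bucket path, file format, and field names are assumptions for illustration only, not a prescribed layout. The point is that structure is declared by the query, not baked into the stored files.

```python
# Minimal schema-on-read sketch with PySpark (paths and field names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# The schema is supplied at read time by the query tool, not when the files were written.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("occurred_at", TimestampType()),
])

# Raw files stay in their natural format in the lake; only this query imposes structure.
events = spark.read.schema(event_schema).json("s3a://example-data-lake/raw/events/")
events.groupBy("event_type").count().show()
```

A different team could read the same raw files with a different schema suited to its own questions, which is exactly the flexibility schema on write does not give you.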

Data Lake or Data Swamp?

One of the key components of a functioning data lake is the continuing inflow and egress of data. Some data must be kept indefinitely, but some can be archived or deleted after a defined period of time. Failure to remove stale data can result in a data swamp, where out-of-date data takes up valuable and costly space and may cause queries to take longer to complete. This is one of the first issues companies encounter in maintaining their data lake. Often, people view the data lake as a “final resting place” for data, but it really should be reserved for data that is accessed often, or at least occasionally.
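
As one hedged example of keeping data flowing out of the lake, a lifecycle rule like the one below (using boto3 against a hypothetical S3 bucket and prefix) archives raw data after a defined period and expires it later. The bucket name, prefix, and day counts are illustrative assumptions, not recommendations.

```python
# Sketch: archive-then-expire lifecycle rule for a date-partitioned raw prefix.
# Bucket, prefix, and retention periods are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                # Move data to cheaper archival storage after 90 days...
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # ...and delete it outright after two years.
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```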

A natural spring-fed lake can turn into a swamp due to a variety of factors. If fresh water is not allowed to flow into the lake, it stagnates, and plants and animals that the lake previously could not support take hold. Similarly, if water cannot exit the lake at some point, its borders will be breached and the surrounding land will be inundated. Both of these conditions can turn a once pristine lake into a fetid and undesirable swamp. If data is no longer being added to your data lake, the results it yields will become dated and eventually unreliable. Conversely, if data is always being added to the lake but is not accessed on a regular basis, the lake grows without restraint and without any real plan for how the data will be used. It becomes a de facto “cold storage” facility that is likely more expensive than purpose-built archival storage.

If bad or undesirable items, like old cars or garbage, are dumped into a lake, they can damage the ecosystem and cause unwanted reactions. The equivalent in a data lake is simply throwing data in with no real rules or rationale. While the data is saved, it may not be useful, and it can have negative consequences across the whole environment since it consumes space and may slow response times. Even though a basic concept of a data lake is that the data does not need to conform to a predefined structure, as it would in a relational database, it is important that some rules and guidelines exist regarding the type and quality of data that is included in the lake. In the absence of such guidelines, it becomes difficult to access the relevant data for your needs. Proper definition and tagging of content help to ensure that the correct data is accessible and available when needed.
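
One lightweight way to apply such guidelines is to tag content as it lands in the lake. The sketch below uses S3 object tags via boto3; the bucket, key, and tag names are assumptions meant only to show the idea of recording ownership, source, sensitivity, and retention alongside the raw data.

```python
# Sketch: tag a newly ingested file so it can be found and governed later.
# Bucket, key, and tag names are illustrative, not a prescribed standard.
import boto3

s3 = boto3.client("s3")
s3.put_object_tagging(
    Bucket="example-data-lake",
    Key="raw/events/2018/05/31/events.json",
    Tagging={
        "TagSet": [
            {"Key": "owner", "Value": "payments-team"},
            {"Key": "source", "Value": "checkout-service"},
            {"Key": "contains_pii", "Value": "false"},
            {"Key": "retention_days", "Value": "730"},
        ]
    },
)
```

Tags like these make it possible to answer “who owns this, where did it come from, and how long should we keep it?” without opening the files themselves.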

Unrestricted Growth Consequences

Many people have a junk drawer somewhere in their house; a drawer that is filled with old receipts, used tickets, theater programs, and the like. Some of this may be stored for sentimental reasons, but a lot of it is put into this drawer since it was a convenient dropping place for things. Similarly, if we look to the data lake as the “junk drawer” for our company, it is guaranteed to be bigger and more expensive than it truly needs to be.

It is important that the data stored in your data lake has a current or expected purpose. While you may not have a current use for some data, it can be helpful to keep it around in case a need arises. Machine learning is one example: supplying more ancillary data enables better decisions because it gives a deeper view into the decision process. Therefore, maintaining some data that has no specific, current need can be helpful. However, there are cases where maintaining a huge volume of data is counterproductive. Consider temperature information delivered from a switch. If the temperature reaches a specific threshold, the switch should be shut down. Reporting the temperature in an immediate and timely manner is important for making an informed decision, but stable temperature data from days, weeks, or months ago can be summarized and stored in a more efficient manner. The granular details can then be purged from the lake.
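
As a hedged sketch of that summarize-then-purge approach, the following PySpark job rolls granular temperature readings up to daily statistics; the paths and column names are assumptions.

```python
# Sketch: roll granular switch-temperature readings up to daily summaries
# before purging the raw rows. Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telemetry-rollup").getOrCreate()

readings = spark.read.parquet("s3a://example-data-lake/raw/switch_temps/")

daily = (
    readings
    .groupBy("switch_id", F.to_date("reading_ts").alias("reading_date"))
    .agg(
        F.min("temp_c").alias("min_temp_c"),
        F.max("temp_c").alias("max_temp_c"),
        F.avg("temp_c").alias("avg_temp_c"),
    )
)

# Keep the compact summary; the granular raw partitions can then be expired.
daily.write.mode("overwrite").parquet("s3a://example-data-lake/summaries/switch_temps_daily/")
```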

So, where is the balance? If you keep all the data, it can make your data lake unwieldy and costly. If you only keep data that has a specific current purpose, you may be impairing your future plans. Obviously, the key is to monitor your access and use of the data frequently, and purge or archive some of the data that is not being regularly used.

Uncontrolled Access Concerns

Since much of the data in your data lake is company confidential, it is imperative that access to it be controlled. Because the data in the lake is stored in its raw format, access is more difficult to control. The structures of a relational database provide some of the basis for access control, allowing us to limit who has access to specific queries, tables, fields, schemas, databases, and other objects. In the absence of these structures, controlling access requires more finesse. You must determine who has access to which parts of the data in the lake, and the data must be isolated within your own network environment. Many of these restrictions may already be in place in your current environment, but they should be reviewed before being relied on fully, since the data lake may store information that was previously unavailable to some users. Access should be regularly reviewed to identify potential rogue activity. Encryption options exist to further secure the data from unwanted access, and file system security can limit access as well. All of these components must be considered, implemented, and reviewed to ensure that the data is secure.
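
As a sketch of two of those layers, the boto3 calls below enable default encryption at rest on the lake’s bucket and deny reads on a sensitive prefix to everything except a single analyst role. The bucket name, prefix, and role ARN are placeholders, and a real deployment would layer on more nuanced policies.

```python
# Sketch: default encryption at rest plus a restrictive read policy on one prefix.
# Bucket, prefix, account ID, and role name are hypothetical placeholders.
import json
import boto3

s3 = boto3.client("s3")

# Encrypt everything written to the lake's bucket by default.
s3.put_bucket_encryption(
    Bucket="example-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Deny GetObject on the sensitive prefix to everyone except the analyst role.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "LimitSensitivePrefixToAnalysts",
        "Effect": "Deny",
        "NotPrincipal": {"AWS": "arn:aws:iam::123456789012:role/lake-analysts"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-data-lake/raw/customer_pii/*",
    }],
}
s3.put_bucket_policy(Bucket="example-data-lake", Policy=json.dumps(policy))
```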

User Considerations

In a relational database, the data structure inherently enforces a degree of consistency and formatting on the data. This enables users to query the data easily and be assured that they are getting valid results. The lack of such structures in a data lake means that users must be more highly skilled at data manipulation. Users with less skill can still access the data, but they may not get the best results. A data scientist is better positioned to access and query the complete data set. Such users are rarer and cost more to hire, but the return may be worth it in the long run.

So What Do I Do Now?

This is an area where there are no hard and fast rules. Each company must develop and implement processes and procedures that make sense for their individual needs. Only with a plan for monitoring inputs, outputs, access patterns, and the like are you able to make a solid determination for your company’s needs. Percona can help to determine a plan for reporting usage, assess security settings, and more. As you are using the data in your data lake, we can also provide guidance regarding tools used to access the data.

1. Wikipedia, “Data lake,” retrieved May 22, 2018

The post Don’t Drown in your Data Lake appeared first on Percona Database Performance Blog.

Oct
03
2017
--

Investors place $25M on AtScale to get the big picture of big data

AtScale, a four-year-old startup that helps companies get a big-picture view of their big data inside their BI tools, announced a $25 million Series C investment today. The round was led by Atlantic Bridge with participation from new investors Wells Fargo and Industry Ventures, along with returning investors Storm Ventures, UMC, Comcast and XSeed Capital. With today’s investment, the… Read More

Jun
06
2017
--

Databricks releases serverless platform for Apache Spark along with new library supporting deep learning

Today, to kick off Spark Summit, Databricks announced a Serverless Platform for Apache Spark — welcome news for developers looking to reduce time spent on cluster management. The move to simplify developer experiences is set to be a major theme of the event overall. In addition to Serverless, the company also introduced Deep Learning Pipelines, a library that makes it easy to mix… Read More

Apr
01
2017
--

Cloudera finally ready for the public stage

TechCrunch's Ron Miller on stage with Cloudera CEO Tom Reilly at the Intel Capital Summit in 2014.
When I first met Cloudera CEO Tom Reilly in 2015 at the Intel Capital Summit, we were about to go on stage for a fireside chat to discuss, among other things, Intel’s massive investment in his company.
While on stage, the conversation inevitably turned to when the company might go public. As you might expect, he gave me the standard startup CEO answer. While Cloudera was certainly of… Read More

Jan
30
2017
--

MXNet accepted to the Apache Incubator

MXNet, Amazon Web Services’ preferred deep learning framework, was accepted to the Apache Incubator today. Admission to the incubator is the first step necessary for the open-source initiative to officially become part of the Apache Software Foundation. The Apache Software Foundation supports the efforts of thousands of developers maintaining open-source projects around the world.… Read More

Jan
12
2017
--

Talend looks to ease big data prep with latest release

Talend, the big data integration vendor that went public last July, announced its winter release today with new tools to help automate data preparation, a sticky problem for enterprise customers. Surely, there are ever-increasing amounts of data and companies struggle to keep up. There aren’t enough data scientists in the world to fill the need. It requires software to pick up some of… Read More

Sep
27
2016
--

IBM releases DataWorks to give enterprise data a home and a brain

While the gears of research are turning fast, developing new methods of machine intelligence, another, perhaps more impactful, trend is brewing in the field. Open source frameworks like Apache Spark are hitting their stride at the ideal time to put data analytics in the hands of the business development analyst without forgetting about the needs of the data scientist. IBM’s new… Read More

May
23
2016
--

Cray’s latest supercomputer runs OpenStack and open source big data tools

Cray has always been associated with speed and power, and its latest computing beast, called the Cray Urika-GX system, has been designed specifically for big data workloads. What’s more, it runs on OpenStack, the open source cloud platform, and supports open source big data processing tools like Hadoop and Spark. Cray recognizes that the computing world had evolved since Seymour Cray… Read More

Mar
24
2016
--

Newcomer Galactic Exchange can spin up a Hadoop cluster in five minutes

A new company with a cool name, Galactic Exchange, came out of stealth today with a great idea. It claims it can spin up a Hadoop cluster for you in five minutes, ready to go. That’s no small feat if it works as advertised, and it greatly simplifies what has traditionally been a process fraught with complexity.
The new product called ClusterGX is being released in Beta this week… Read More

Mar
15
2016
--

Altiscale’s latest cloud service brings Hadoop to business users

Altiscale, a company that has always been about reducing the complexity related to using Hadoop, has taken that to the next level today with the release of Altiscale Insight Cloud, a cloud service aimed at making Hadoop accessible to business users.
Altiscale Insight Cloud provides services for data ingestion, processing, analysis and visualization. That includes the ability to… Read More
