Feb 07, 2019

Microsoft Azure sets its sights on more analytics workloads

Enterprises now amass huge amounts of data, both from their own tools and applications, as well as from the SaaS applications they use. For a long time, that data was basically exhaust. Maybe it was stored for a while to fulfill some legal requirements, but then it was discarded. Now, data is what drives machine learning models, and the more data you have, the better. It’s maybe no surprise, then, that the big cloud vendors started investing in data warehouses and lakes early on. But that’s just a first step. After that, you also need the analytics tools to make all of this data useful.

Today, it’s Microsoft’s turn to shine the spotlight on its data analytics services. The actual news here is pretty straightforward: two services are moving into general availability, the second generation of Azure Data Lake Storage for big data analytics workloads and Azure Data Explorer, a managed service that makes ad-hoc analysis of massive data volumes easier. Microsoft is also previewing a new feature in Azure Data Factory, its graphical no-code service for building data transformations: Data Factory can now map data flows.

Those individual news pieces are interesting if you are a user or are considering Azure for your big data workloads, but what’s maybe more important here is that Microsoft is trying to offer a comprehensive set of tools for managing and storing this data — and then using it for building analytics and AI services.


“AI is a top priority for every company around the globe,” Julia White, Microsoft’s corporate VP for Azure, told me. “And as we are working with our customers on AI, it becomes clear that their analytics often aren’t good enough for building an AI platform.” These companies are generating plenty of data, which then has to be pulled into analytics systems. She stressed that she couldn’t remember a customer conversation in recent months that didn’t focus on AI. “There is urgency to get to the AI dream,” White said, but the growth and variety of data presents a major challenge for many enterprises. “They thought this was a technology that was separate from their core systems. Now it’s expected for both customer-facing and line-of-business applications.”

Data Lake Storage helps with managing this variety of data since it can handle both structured and unstructured data (and is optimized for the Spark and Hadoop analytics engines). The service can ingest any kind of data — yet Microsoft still promises that it will be very fast. “The world of analytics tended to be defined by having to decide upfront and then building rigid structures around it to get the performance you wanted,” explained White. Data Lake Storage, on the other hand, wants to offer the best of both worlds.

Likewise, White argued that many enterprises used to keep these services on their own on-premises servers, and many of those deployments are still appliance-based. But she believes the cloud has now reached the point where the price/performance calculation is in its favor. It took a while to get to this point, though, and to convince enterprises. White noted that for the longest time, enterprises thought of analytics as $300 million projects that took forever, tied up lots of people and were frankly a bit scary. “But also, what we had to offer in the cloud hasn’t been amazing until some of the recent work,” she said. “We’ve been on a journey — as well as the other cloud vendors — and the price performance is now compelling.” And it sure helps that if enterprises want to meet their AI goals, they’ll now have to tackle these workloads, too.

Jan 03, 2019

Cloudera and Hortonworks finalize their merger

Cloudera and Hortonworks, two of the biggest players in the Hadoop big data space, today announced that they have finalized their all-stock merger. The new company will use the Cloudera brand and will continue to trade under the CLDR symbol on the New York Stock Exchange.

“Today, we start an exciting new chapter for Cloudera as we become the leading enterprise data cloud provider,” said Tom Reilly, chief executive officer of Cloudera, in today’s announcement. “This combined team and technology portfolio establish the new Cloudera as a clear market leader with the scale and resources to drive continued innovation and growth. We will provide customers a comprehensive solution-set to bring the right data analytics to data anywhere the enterprise needs to work, from the Edge to AI, with the industry’s first Enterprise Data Cloud.”

The companies describe the deal as a “merger of equals,” though Cloudera stockholders will own about 60 percent of the equity in the company.

The combined company expects to generate more than $720 million in revenue from its 2,500 customers that rely on it to help them manage the complexities of processing their data. While Hadoop itself is open source and freely available, Cloudera and Hortonworks abstract away most of the infrastructure work. The two companies focused on slightly different markets, though, with Hortonworks going after more technical users with a pure open-source approach, while Cloudera also offered some proprietary tools.

“Together, we are well-positioned to continue growing and competing in the streaming and IoT, data management, data warehousing, machine learning/AI and hybrid cloud markets,” said Hortonworks CEO Rob Bearden back when the deal was first announced. “Importantly, we will be able to offer a broader set of offerings that will enable our customers to capitalize on the value of their data.”

Dec 19, 2018

Google’s Cloud Spanner database adds new features and regions

Cloud Spanner, Google’s globally distributed relational database service, is getting a bit more distributed today with the launch of a new region and new ways to set up multi-region configurations. The service is also getting a new feature that gives developers deeper insights into their most resource-consuming queries.

With this update, Google is adding Hong Kong (asia-east2), its newest data center location, to the Cloud Spanner lineup. Cloud Spanner is now available in 14 out of 18 Google Cloud Platform (GCP) regions, including seven the company added this year alone. The plan is to bring Cloud Spanner to every new GCP region as it comes online.

The other new region-related news is the launch of two new configurations for multi-region coverage. One, called eur3, focuses on the European Union and is obviously meant for users there who mostly serve a local customer base. The other, called nam6, focuses on North America, with coverage across both coasts and the middle of the country, using data centers in Oregon, Los Angeles, South Carolina and Iowa. Previously, the service only offered a North American configuration with three regions and a global configuration with three data centers spread across North America, Europe and Asia.

While Cloud Spanner is obviously meant for global deployments, these new configurations are great for users who only need to serve certain markets.
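
For readers who want to try one of these configurations, here is a minimal sketch of creating an instance with the gcloud CLI; the instance name, description and node count are illustrative and not part of the announcement:

$ gcloud spanner instances create example-instance --config=nam6 --description="North America multi-region" --nodes=3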

As far as the new query features are concerned, Cloud Spanner is now making it easier for developers to view, inspect and debug queries. The idea here is to give developers better visibility into their most frequent and expensive queries (and maybe make them less expensive in the process).

In addition to the Cloud Spanner news, Google Cloud today announced that its Cloud Dataproc Hadoop and Spark service now supports the R language, in addition to Python 3.7 support on App Engine.

Jul 11, 2013

MySQL and Hadoop integration


Dolphin and Elephant: an Introduction

This post is intended for MySQL DBAs or sysadmins who need to start using Apache Hadoop and want to integrate the two solutions. In this post I will cover some basic information about Hadoop, focusing on Hive, as well as MySQL and Hadoop/Hive integration.

First of all, if you have been dealing with MySQL or any other relational database for most of your professional life (like I have), Hadoop may look different. Very different. In many ways, Hadoop is the opposite of a relational database. Unlike a database, where we have a set of tables and indexes, Hadoop works with a set of text files. And… there are no indexes at all. And yes, this may be shocking, but all scans are sequential (full “table” scans, in MySQL terms).

So, when does Hadoop make sense?

First, Hadoop is great if you need to store huge amounts of data (we are talking about petabytes now) and that data does not require real-time (millisecond) response times. Hadoop works as a cluster of nodes (similar to MySQL Cluster) and all data is spread across the cluster (with redundancy), so it provides both high availability (if implemented correctly) and scalability. The data retrieval process (map/reduce) is parallel, so the more data nodes you add to Hadoop, the faster the process will be.

Second, Hadoop may be very helpful if you need to store your historical data for a long period of time. For example: store the online orders for the last three years in MySQL, and store all orders (including mail and phone orders going back to 1986) in Hadoop for trend analysis and historical purposes.

Integration

The next step after installing and configuring Hadoop is to implement a data flow between Hadoop and MySQL. If you have an OLTP system based on MySQL and want to use Hadoop for data analysis (data science), you may want to add a constant data flow between the two. For example, you may want to implement data archiving, where old data is not deleted but rather moved into Hadoop and kept available for further analysis. There are two major ways of doing this:

  1. Non-realtime: Apache Sqoop
  2. Realtime: MySQL Applier for Hadoop

Using Apache Sqoop for MySQL and Hadoop integration

Apache Sqoop can be run from a cron job to get the data from MySQL and load it into Hadoop. Apache Hive is probably the best way to store data in Hadoop, as it uses a table concept and has a SQL-like language, HiveQL. Here is how we can import a whole table from MySQL to Hive:

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import

If you do not have BLOB or TEXT columns in your table, you can use the “--direct” option, which will probably be faster (it uses mysqldump under the hood). Another useful option is “--default-character-set”; for example, for utf8 one can use “--default-character-set=utf8”. The “--verify” option will help to check data integrity.
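
To illustrate, here is how these options might be combined. This is a sketch only: in direct mode, options meant for mysqldump (such as --default-character-set) are typically passed after a standalone “--” at the end of the command, and the exact behavior depends on your Sqoop version.

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import --direct -- --default-character-set=utf8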

To regularly import only the new rows from the table, we can use the “--where” option. For example:

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import --where "order_date > '2013-07-01'"

The following picture illustrates the process:

[Image: Sqoop importing data from MySQL into Hadoop/Hive]
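
As an alternative to maintaining the “--where” filter by hand, newer Sqoop versions also offer a built-in incremental mode that tracks the last imported value for you. A minimal sketch, assuming an auto-increment order_id column (whether --incremental combines cleanly with --hive-import depends on the Sqoop version):

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import --incremental append --check-column order_id --last-value 1000000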

Using MySQL Applier for Hadoop

Sqoop is great if you need to perform a “batch” import. For realtime data integration, we can use the MySQL Applier for Hadoop. With the applier, Hadoop/Hive is integrated as if it were an additional MySQL slave: the MySQL Applier reads binlog events from MySQL and “applies” them to our Hive table.
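
For illustration only: the labs release of the applier ships a binary called happlier that takes a MySQL connection URI and an HDFS URI. The flag and URI formats below are assumptions based on the labs documentation and may differ in your build, so treat this as a sketch rather than a reference:

$ ./happlier --field-delimiter=',' mysql://root@127.0.0.1:3306 hdfs://localhost:9000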

The following picture illustrates this process:

[Image: MySQL Applier streaming binlog events from MySQL into Hadoop/Hive]

Conclusion

In this post I have shown ways to integrate MySQL and Hadoop (the big picture). In a subsequent post I will show how to implement data archiving with MySQL, using Hadoop/Hive as the target.

