Aug 08, 2013

Big Data with MySQL and Hadoop at MySQL Connect 2013

I will be talking about Big Data with MySQL and Hadoop at MySQL Connect 2013 (Sept. 21-22) in San Francisco, as well as at Percona University in Washington, DC (September 12, 2013). Apache Hadoop is a very popular Big Data solution, and nowadays we can easily integrate it with MySQL. I will start with a brief introduction to Apache Hadoop and its components (HDFS, Map/Reduce, Hive, HBase/HCatalog, Flume, Sqoop, etc.). Then I will show two major Big Data scenarios:

  • From file to Hadoop to MySQL. This is an example of an “ELT” process: Extract data from an external source; Load data into Hadoop; Transform/analyze the data; Extract the results to MySQL. It is similar to the classic Data Warehouse ETL (Extract; Transform; Load) process, but instead of “transforming” data before loading it into the Data Warehouse, we load it “as is” and then run the data analysis. As a result of this analysis (a map/reduce process) we can generate a report and load it into MySQL (using Sqoop export). To illustrate this process I will show two classic examples: Clickstream analysis and Twitter feed analysis. On top of those examples I will also show how to use MySQL / Full Text Search solutions to generate near real-time reports from HBase.

Picture 1: ELT pipeline, from File to Hadoop to MySQL

[Image: clickstream_example]

  • From OLTP MySQL to Hadoop to MySQL reporting. In this scenario we extract data (potentially close to real time) from MySQL, load it into Hadoop for storage and analysis, and later generate reports and load them into another MySQL instance (reporting), which can be used to generate and display graphs.

Picture 2: From OLTP MySQL to Hadoop to MySQL reporting.

[Image: hadoop_mysql_reporting]

Note: The reason we need additional storage for MySQL reports is that generating a Hive report may take a long time (it is executed with Map/Reduce, which reads all the files; there are no indexes). So it makes sense to “offload” the results of common reports into separate storage (MySQL).

In both scenarios we will need a way to integrate Hadoop and MySQL. In my previous post, MySQL and Hadoop integration, I demonstrated how to integrate Hadoop and MySQL with Sqoop and the Hadoop Applier for MySQL. This case is similar, but we can use a different toolset. In scenario 1 (i.e. Clickstream) we can use Apache Flume to grab files (or read “events”) and load them into Hadoop. With Flume we define a “source” and a “sink”. Flume supports a range of different sources, including HTTP requests, Syslog, TCP, etc. The HTTP source is interesting, as we can convert all (or some) HTTP requests (“source”) into an “event” which can be loaded into Hadoop (“sink”).
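
As a teaser, here is a minimal sketch of what such a Flume agent configuration could look like (the agent, channel, and path names below are hypothetical, not the exact configuration from the talk): an HTTP source turns incoming requests into events, a memory channel buffers them, and an HDFS sink writes them into Hadoop.

# Hypothetical agent "clickstream": HTTP source -> memory channel -> HDFS sink
clickstream.sources  = http-src
clickstream.channels = mem-ch
clickstream.sinks    = hdfs-sink

# HTTP source: turns POSTed JSON events into Flume events
clickstream.sources.http-src.type     = http
clickstream.sources.http-src.port     = 8080
clickstream.sources.http-src.channels = mem-ch

# In-memory buffer between the source and the sink
clickstream.channels.mem-ch.type     = memory
clickstream.channels.mem-ch.capacity = 10000

# HDFS sink: writes plain-text events into date-based directories
clickstream.sinks.hdfs-sink.type                   = hdfs
clickstream.sinks.hdfs-sink.channel                = mem-ch
clickstream.sinks.hdfs-sink.hdfs.path              = /flume/clickstream/%Y-%m-%d
clickstream.sinks.hdfs-sink.hdfs.fileType          = DataStream
clickstream.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

Such an agent would be started with something like “flume-ng agent -n clickstream -f clickstream.conf”.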

During my presentation I will show the exact configurations for the sample Clickstream process, including:

  1. Flume configuration
  2. HiveQL queries to generate a report
  3. Sqoop export queries to load the report into MySQL
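
As a preview of items 2 and 3 above, here is a hedged sketch of what the report and export steps could look like (the table names, column names, and HDFS paths are hypothetical, not taken from the actual presentation). First, a HiveQL query that aggregates the raw clickstream into a small report table:

hive> CREATE TABLE clickstream_report AS SELECT request_url, COUNT(*) AS hits FROM clickstream GROUP BY request_url;

The report can then be pushed into MySQL with Sqoop export (the target MySQL table has to exist already; '\0001' is Hive's default Ctrl-A field delimiter):

$ sqoop export --connect jdbc:mysql://mysql_host/db_name --table clickstream_report --export-dir /user/hive/warehouse/clickstream_report --input-fields-terminated-by '\0001'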

See you at MySQL Connect 2013!


Jul 11, 2013

MySQL and Hadoop integration

[Image: hadoop_and_mysql]

Dolphin and Elephant: an Introduction

This post is intended for MySQL DBAs or sysadmins who need to start using Apache Hadoop and want to integrate the two solutions. In this post I will cover some basic information about Hadoop, focusing on Hive, as well as MySQL and Hadoop/Hive integration.

First of all, if you have been dealing with MySQL or any other relational database for most of your professional life (like I have), Hadoop may look different. Very different. In many ways, Hadoop is the opposite of a relational database. Unlike a database, where we have a set of tables and indexes, Hadoop works with a set of text files. And… there are no indexes at all. And yes, this may be shocking, but all scans are sequential (full “table” scans in MySQL terms).

So, when does Hadoop make sense?

First, Hadoop is great if you need to store huge amounts of data (we are talking about petabytes now) and that data does not require real-time (millisecond) response times. Hadoop works as a cluster of nodes (similar to MySQL Cluster) and all data is spread across the cluster (with redundancy), so it provides both high availability (if implemented correctly) and scalability. The data retrieval process (map/reduce) is a parallel process, so the more data nodes you add to Hadoop, the faster the process will be.

Second, Hadoop may be very helpful if you need to store your historical data for a long period of time. For example: store the online orders for the last 3 years in MySQL, and store all orders (including mail and phone orders) since 1986 in Hadoop for trend analysis and historical purposes.

Integration

The next step after installing and configuring Hadoop is to implement a data flow between Hadoop and MySQL. If you have an OLTP system based on MySQL and want to use Hadoop for data analysis (data science), you may want to add a constant data flow between Hadoop and MySQL. For example, one may want to implement data archiving, where old data is not deleted but rather placed into Hadoop and remains available for further analysis. There are two major ways of doing it:

  1. Non real-time: Sqoop
  2. Real-time: Hadoop Applier for MySQL

Using Apache Sqoop for MySQL and Hadoop integration

Apache Sqoop can be run from a cron job to get data from MySQL and load it into Hadoop. Apache Hive is probably the best way to store data in Hadoop, as it uses the table concept and has a SQL-like language, HiveQL. Here is how we can import a whole table from MySQL into Hive:

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import

If you do not have BLOBs or TEXTs in your table, you can use the “--direct” option, which will probably be faster (it will use mysqldump). Another useful option is “--default-character-set”; for example, for utf8 one can use “--default-character-set=utf8”. The “--verify” option will help to check for data integrity.
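
For example, a direct-mode import could look like the sketch below (using the same hypothetical connection string as above); arguments after the standalone “--” are passed through to the underlying mysqldump:

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import --direct -- --default-character-set=utf8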

To regularly import only the new rows from the table, we can use the “--where” option. For example:

$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import --where "order_date > '2013-07-01'"

The following picture illustrates the process:

[Image: Sqoop]

Using MySQL Applier for Hadoop

Sqoop is great if you need to perform a “batch” import. For real-time data integration, we can use the MySQL Applier for Hadoop. With the MySQL Applier, Hadoop/Hive is integrated as if it were an additional MySQL slave: the MySQL Applier reads binlog events from MySQL and “applies” them to our Hive table.

The following picture illustrates this process:

[Image: applier]
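
Since the Applier connects to MySQL essentially as a replication client, binary logging must be enabled on the source server. A minimal my.cnf sketch, assuming everything else stays at its defaults (the Applier examples use row-based logging):

[mysqld]
# binary logging must be on so the Applier can stream events
log-bin       = mysql-bin
server-id     = 1
binlog_format = ROW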

Conclusion

In this post I have shown ways to integrate MySQL and Hadoop (the big picture). In a subsequent post I will show how to implement data archiving with MySQL using Hadoop/Hive as the target.

