Jun
02
2014
--

Using InfiniDB MySQL server with Hadoop cluster for data analytics

In my previous post about Hadoop and Impala I benchmarked performance of analytical queries in Impala.

This time I’ve tried InfiniDB for Hadoop (open-source version) on modern hardware with an 8-node Hadoop cluster. One of the main advantages (at least for me) of InfiniDB for Hadoop is that it stores the data inside the Hadoop cluster but uses the MySQL server to execute queries. This allows for an easy “migration” of existing analytical tools. The results are quite interesting and promising.

Quick How-To

The InfiniDB documentation is not very clear on step-by-step instructions so I’ve created this quick guide:

  1. Install a Hadoop cluster (a minimum install will work). I’ve used Cloudera Manager (CDH5) so that I could compare the speed of InfiniDB to Cloudera Impala. Install the tools listed in the “Pre-requirements” section of the InfiniDB for Hadoop manual.
  2. Install the InfiniDB for Hadoop binaries on 1 Hadoop node (you can choose any node). This will install InfiniDB and its version of MySQL (based on MySQL 5.1).
  3. After installation, the installer tells you which variables to set and asks you to run the postConfigure script. Example:
    export JAVA_HOME=/usr/java/jdk1.6.0_31
    export LD_LIBRARY_PATH=/usr/java/jdk1.6.0_31/jre/lib/amd64/server
    . /root/setenv-hdfs-20
    /usr/local/Calpont/bin/postConfigure
  4. The postConfigure script will ask a series of questions. A couple of important notes:
  • Make sure to use HDFS as the “type of Data Storage”.
  • The performance module 1 (pm1) should point to the host (hostname and IP) you are running the postConfigure script on. Other pm(s) should point to the other Hadoop nodes.

When the installation is finished you will be able to log in to the MySQL server using a script called idbmysql, which calls the mysql CLI with the correct socket and port. Check that InfiniDB is enabled by running “show engines”; InfiniDB should be in the list.

The next step will be importing data.

Data import

First we will need to create a MySQL table with “engine=InfiniDB”:

CREATE TABLE `ontime` (
  `YearD` int(11) NOT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `MonthD` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
...
) ENGINE=InfiniDB DEFAULT CHARSET=latin1

Second, I’ve used cpimport to load the data. It turned out to be much more efficient and easier to load 1 big file rather than 20×12 smaller files (the original “ontime” data is 1 file per month), so I’ve exported the “ontime” data from a MySQL table and created 1 big file, “ontime.psv”.

I used the following command to import the data into InfiniDB:

[root@n0 ontime]# /usr/local/Calpont/bin/cpimport -s '|' ontime ontime ontime.psv
2014-05-20 15:12:58 (18787) INFO : Running distributed import (mode 1) on all PMs...
2014-05-20 15:25:28 (18787) INFO : For table ontime.ontime: 155083620 rows processed and 155083620 rows inserted.
2014-05-20 15:25:28 (18787) INFO : Bulk load completed, total run time : 751.561 seconds
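
As a rough sanity check, the log above implies a load rate of a little over 200 thousand rows per second. Below is a minimal Python sketch of that arithmetic; the function name is only illustrative, and the numbers are copied from the cpimport output.

```python
def rows_per_second(rows: int, seconds: float) -> float:
    """Average load throughput in rows per second."""
    return rows / seconds


# Values copied from the cpimport log above.
rows_inserted = 155083620
run_time_sec = 751.561

rate = rows_per_second(rows_inserted, run_time_sec)
print(f"{rate:,.0f} rows/sec")
```

This is a single-run average over the whole import, so it says nothing about how the rate varied during the load.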

The data is stored in Hadoop:

[root@n0 ontime]# hdfs dfs -du -h /usr/local/Calpont
1.4 G /usr/local/Calpont/data1
1.4 G /usr/local/Calpont/data2
1.4 G /usr/local/Calpont/data3
1.4 G /usr/local/Calpont/data4
1.4 G /usr/local/Calpont/data5
1.4 G /usr/local/Calpont/data6
1.4 G /usr/local/Calpont/data7
1.4 G /usr/local/Calpont/data8

The total size of the data is 8×1.4G = 11.2G (compressed). For comparison, the same dataset in Impala Parquet format is 3.6G. The original size was ~60G.

[root@n0 ontime]# hdfs dfs -du -h /user/hive/warehouse
3.6 G /user/hive/warehouse/ontime_parquet_snappy
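
To put those sizes side by side, here is a small Python sketch of the compression ratios they imply. It assumes the ~60G original size is only approximate and takes the other figures as reported by hdfs dfs -du; the helper name is illustrative.

```python
def compression_ratio(original_gb: float, stored_gb: float) -> float:
    """How many times smaller the stored data is than the original."""
    return original_gb / stored_gb


original_gb = 60.0      # approximate original size quoted above
infinidb_gb = 8 * 1.4   # eight InfiniDB data directories of 1.4G each
parquet_gb = 3.6        # Impala Parquet + snappy table

infinidb_ratio = compression_ratio(original_gb, infinidb_gb)
parquet_ratio = compression_ratio(original_gb, parquet_gb)

print(f"InfiniDB: {infinidb_gb:.1f}G stored, about {infinidb_ratio:.1f}x smaller")
print(f"Parquet:  {parquet_gb:.1f}G stored, about {parquet_ratio:.1f}x smaller")
```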

Now we can run the 2 queries I’ve tested before:

1. Simple group-by

mysql> select yeard, count(*) from ontime group by yeard order by yeard;
+-------+----------+
| yeard | count(*) |
+-------+----------+
|  1988 |  5202096 |
|  1989 |  5041200 |
|  1990 |  5270893 |
|  1991 |  5076925 |
|  1992 |  5092157 |
|  1993 |  5070501 |
|  1994 |  5180048 |
|  1995 |  5327435 |
|  1996 |  5351983 |
|  1997 |  5411843 |
|  1998 |  5384721 |
|  1999 |  5527884 |
|  2000 |  5683047 |
|  2001 |  5967780 |
|  2002 |  5271359 |
|  2003 |  6488540 |
|  2004 |  7129270 |
|  2005 |  7140596 |
|  2006 |  7141922 |
|  2007 |  7455458 |
|  2008 |  7009726 |
|  2009 |  6450285 |
|  2010 |  6450117 |
|  2011 |  6085281 |
|  2012 |  6096762 |
|  2013 |  6369482 |
|  2014 |  1406309 |
+-------+----------+
27 rows in set (0.22 sec)

2. The complex query from my original post:

mysql> select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC, cnt desc LIMIT  1000;
+------------+------------+---------+----------+-----------------+------+
| min(yeard) | max(yeard) | Carrier | cnt      | flights_delayed | rate |
+------------+------------+---------+----------+-----------------+------+
|       2003 |       2009 | EV      |  1454777 |          237698 | 0.16 |
|       2003 |       2009 | FL      |  1082489 |          158748 | 0.15 |
|       2006 |       2009 | YV      |   740608 |          110389 | 0.15 |
|       2006 |       2009 | XE      |  1016010 |          152431 | 0.15 |
|       2003 |       2009 | B6      |   683874 |          103677 | 0.15 |
|       2001 |       2009 | MQ      |  3238137 |          448037 | 0.14 |
|       2003 |       2005 | DH      |   501056 |           69833 | 0.14 |
|       2004 |       2009 | OH      |  1195868 |          160071 | 0.13 |
|       2003 |       2006 | RU      |  1007248 |          126733 | 0.13 |
|       1988 |       2009 | UA      |  9593284 |         1197053 | 0.12 |
|       2003 |       2006 | TZ      |   136735 |           16496 | 0.12 |
|       1988 |       2001 | TW      |  2656286 |          280283 | 0.11 |
|       1988 |       2009 | AA      | 10568437 |         1183786 | 0.11 |
|       1988 |       2009 | CO      |  6023831 |          673354 | 0.11 |
|       1988 |       2009 | DL      | 11866515 |         1156048 | 0.10 |
|       2003 |       2009 | OO      |  2654259 |          257069 | 0.10 |
|       1988 |       2009 | AS      |  1506003 |          146920 | 0.10 |
|       2007 |       2009 | 9E      |   577244 |           59440 | 0.10 |
|       1988 |       2009 | US      | 10276862 |          990995 | 0.10 |
|       1988 |       2009 | NW      |  7601727 |          725460 | 0.10 |
|       1988 |       2005 | HP      |  2607603 |          235675 | 0.09 |
|       1988 |       2009 | WN      | 12722174 |         1107840 | 0.09 |
|       2005 |       2009 | F9      |   307569 |           28679 | 0.09 |
|       1988 |       1991 | PA      |   203401 |           19263 | 0.09 |
+------------+------------+---------+----------+-----------------+------+
24 rows in set (0.86 sec)

The same query in Impala (on the same hardware) runs for 7.18 seconds:

[n8.local:21000] > select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_parquet_snappy WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC LIMIT  1000;
Query: select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_parquet_snappy WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC LIMIT  1000
  +------------+------------+---------+----------+-----------------+------+
  | min(yeard) | max(yeard) | carrier | cnt      | flights_delayed | rate |
  +------------+------------+---------+----------+-----------------+------+
  | 2003       | 2009       | EV      | 1454777  | 237698          | 0.16 |
  | 2003       | 2009       | FL      | 1082489  | 158748          | 0.15 |
  | 2006       | 2009       | XE      | 1016010  | 152431          | 0.15 |
  | 2006       | 2009       | YV      | 740608   | 110389          | 0.15 |
  | 2003       | 2009       | B6      | 683874   | 103677          | 0.15 |
  | 2001       | 2009       | MQ      | 3238137  | 448037          | 0.14 |
  | 2003       | 2005       | DH      | 501056   | 69833           | 0.14 |
  | 2004       | 2009       | OH      | 1195868  | 160071          | 0.13 |
  | 2003       | 2006       | RU      | 1007248  | 126733          | 0.13 |
  | 1988       | 2009       | UA      | 9593284  | 1197053         | 0.12 |
  | 2003       | 2006       | TZ      | 136735   | 16496           | 0.12 |
  | 1988       | 2001       | TW      | 2656286  | 280283          | 0.11 |
  | 1988       | 2009       | CO      | 6023831  | 673354          | 0.11 |
  | 1988       | 2009       | AA      | 10568437 | 1183786         | 0.11 |
  | 1988       | 2009       | US      | 10276862 | 990995          | 0.10 |
  | 2007       | 2009       | 9E      | 577244   | 59440           | 0.10 |
  | 1988       | 2009       | DL      | 11866515 | 1156048         | 0.10 |
  | 2003       | 2009       | OO      | 2654259  | 257069          | 0.10 |
  | 1988       | 2009       | NW      | 7601727  | 725460          | 0.10 |
  | 1988       | 2009       | AS      | 1506003  | 146920          | 0.10 |
  | 1988       | 1991       | PA      | 203401   | 19263           | 0.09 |
  | 1988       | 2009       | WN      | 12722174 | 1107840         | 0.09 |
  | 1988       | 2005       | HP      | 2607603  | 235675          | 0.09 |
  | 2005       | 2009       | F9      | 307569   | 28679           | 0.09 |
  +------------+------------+---------+----------+-----------------+------+
  Returned 24 row(s) in 7.18s
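
The two timings above can be turned into a speedup factor. The sketch below is only the arithmetic on the two reported response times (one run each), not a rigorous benchmark; the variable names are illustrative.

```python
# Response times for the complex query, copied from the output above.
infinidb_sec = 0.86
impala_sec = 7.18

# How many times faster InfiniDB returned the result in this single run.
speedup = impala_sec / infinidb_sec
print(f"InfiniDB was about {speedup:.1f}x faster on this query")
```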

Conclusion and charts

To summarize, I’ve created the following charts:

[Chart: simple query]

As we can see, InfiniDB looks pretty good here. It also uses the MySQL protocol, so existing applications that use MySQL will be able to work with it without any additional “connectors”.

One note regarding my query example: the “complex” query is designed in a way that makes it hard to use any particular set of indexes; it has to scan more than 70% of the table to generate the result set. That is why it is so slow in MySQL compared to columnar databases. Another “issue” is that the table is very wide and most of the columns are declared as varchar (the table is not normalized), which makes it large in MySQL. All this makes it ideal for columnar storage and compression. Other cases may not show that large a difference.

So far I have tested with a small dataset (60G); I plan to run a big data benchmark next.

The post Using InfiniDB MySQL server with Hadoop cluster for data analytics appeared first on MySQL Performance Blog.

Apr
21
2014
--

Using Apache Hadoop and Impala together with MySQL for data analysis

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format) and run reports on top of that. For the examples below I will use the “ontime flight performance” data from my previous post (Increasing MySQL performance with parallel query execution). I’ve used Cloudera Manager v.4 to install Apache Hadoop and Impala. For this test I’ve (intentionally) used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

Purpose | Server specs
Namenode, Hive metastore, etc. + Datanodes | 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16G RAM, RAID 10 with 8 SAS drives
Datanodes only | 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8G RAM, single 4TB drive

As you can see, those are pretty old servers; the only change I made was adding a 4TB drive to be able to store more data. Hadoop provides redundancy at the server level (it writes 3 copies of each block, spread across the datanodes), so we do not need RAID on the datanodes (we still need redundancy for the namenodes, though).

Data export

There are a couple of ways to export data from MySQL to Hadoop. For the purpose of this test I have simply exported the ontime table into a text file with:

select * into outfile '/tmp/ontime.psv' 
FIELDS TERMINATED BY ','
from ontime;

(You can use “|” or any other symbol as a delimiter.) Alternatively, you can download the data directly from the www.transtats.bts.gov site using this simple script:

for y in {1988..2013}
do
        for i in {1..12}
        do
                u="http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${y}_${i}.zip"
                wget "$u" -a ontime.log
                unzip "On_Time_On_Time_Performance_${y}_${i}.zip"
        done
done
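
For readers who prefer to build the download list first and fetch it with a tool of their choice, here is a minimal Python sketch of the same loop. It only generates the URLs and does not download anything. The URL pattern is the one used in the script above and may no longer be served by the site; the function name is hypothetical.

```python
from typing import List

# URL pattern copied from the shell script above (may be outdated).
URL_TEMPLATE = (
    "http://www.transtats.bts.gov/Download/"
    "On_Time_On_Time_Performance_{year}_{month}.zip"
)


def ontime_urls(first_year: int = 1988, last_year: int = 2013) -> List[str]:
    """Return one URL per month for every year in the inclusive range."""
    return [
        URL_TEMPLATE.format(year=year, month=month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


urls = ontime_urls()
print(len(urls), "files to download")
print(urls[0])
print(urls[-1])
```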

Load into Hadoop HDFS

The first thing we need to do is load the data into HDFS as a set of files. Hive or Impala will work with the directory into which you have imported your data and will concatenate all files inside it. In our case it is easy to simply copy all our files into a directory inside HDFS:

$ hdfs dfs -mkdir /data/ontime/
$ hdfs -v dfs -copyFromLocal On_Time_On_Time_Performance_*.csv /data/ontime/

Create external table in Impala

Now that all data files are loaded, we can create an external table:

CREATE EXTERNAL TABLE ontime_csv (
YearD int ,
Quarter tinyint ,
MonthD tinyint ,
DayofMonth tinyint ,
DayOfWeek tinyint ,
FlightDate string ,
UniqueCarrier string ,
AirlineID int ,
Carrier string ,
TailNum string ,
FlightNum string ,
OriginAirportID int ,
OriginAirportSeqID int ,
OriginCityMarketID int ,
Origin string ,
OriginCityName string ,
OriginState string ,
OriginStateFips string ,
OriginStateName string ,
OriginWac int ,
DestAirportID int ,
DestAirportSeqID int ,
DestCityMarketID int ,
Dest string ,
...
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE 
LOCATION '/data/ontime';

Note the “EXTERNAL” keyword and LOCATION (LOCATION points to a directory inside HDFS, not a file). Impala will create only the metadata (it will not modify the data files). We can query this table right away; however, Impala will need to scan all files (full scan) for queries.

Example:

[d30.local:21000] > select yeard, count(*) from ontime_csv  group by yeard;
Query: select yeard, count(*) from ontime_csv  group by yeard
+-------+----------+
| yeard | count(*) |
+-------+----------+
| 2010  | 6450117  |
| 2013  | 5349447  |
| 2009  | 6450285  |
| 2002  | 5271359  |
| 2004  | 7129270  |
| 1997  | 5411843  |
| 2012  | 6096762  |
| 2005  | 7140596  |
| 1999  | 5527884  |
| 2007  | 7455458  |
| 1994  | 5180048  |
| 2008  | 7009726  |
| 1988  | 5202096  |
| 2003  | 6488540  |
| 1996  | 5351983  |
| 1989  | 5041200  |
| 2011  | 6085281  |
| 1998  | 5384721  |
| 1991  | 5076925  |
| 2006  | 7141922  |
| 1993  | 5070501  |
| 2001  | 5967780  |
| 1995  | 5327435  |
| 1990  | 5270893  |
| 1992  | 5092157  |
| 2000  | 5683047  |
+-------+----------+
Returned 26 row(s) in 131.38s

(Note that, unlike in MySQL, “group by” in Impala will not sort the rows. To sort them we need to add “ORDER BY yeard”.)

Explain plan:

Query: explain select yeard, count(*) from ontime_csv  group by yeard
+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| PLAN FRAGMENT 0                                           |
|   PARTITION: UNPARTITIONED                                |
|                                                           |
|   4:EXCHANGE                                              |
|                                                           |
| PLAN FRAGMENT 1                                           |
|   PARTITION: HASH_PARTITIONED: yeard                      |
|                                                           |
|   STREAM DATA SINK                                        |
|     EXCHANGE ID: 4                                        |
|     UNPARTITIONED                                         |
|                                                           |
|   3:AGGREGATE (merge finalize)                            |
|   |  output: SUM(COUNT(*))                                |
|   |  group by: yeard                                      |
|   |                                                       |
|   2:EXCHANGE                                              |
|                                                           |
| PLAN FRAGMENT 2                                           |
|   PARTITION: RANDOM                                       |
|                                                           |
|   STREAM DATA SINK                                        |
|     EXCHANGE ID: 2                                        |
|     HASH_PARTITIONED: yeard                               |
|                                                           |
|   1:AGGREGATE                                             |
|   |  output: COUNT(*)                                     |
|   |  group by: yeard                                      |
|   |                                                       |
|   0:SCAN HDFS                                             |
|      table=ontime.ontime_csv #partitions=1/1 size=45.68GB |
+-----------------------------------------------------------+
Returned 31 row(s) in 0.13s

As we can see, it will scan about 45G of data (45.68GB).

Impala with columnar format and compression

A great benefit of Impala is that it supports columnar format and compression. I’ve tried the new “parquet” format with the “snappy” compression codec. As our table is very wide (and de-normalized), using a columnar format will help a lot. To take advantage of the “parquet” format we need to load data into it, which is easy to do when we already have a table inside Impala and files inside HDFS:

[d30.local:21000] > set PARQUET_COMPRESSION_CODEC=snappy;
[d30.local:21000] > create table ontime_parquet_snappy LIKE ontime_csv STORED AS PARQUET;
[d30.local:21000] > insert into ontime_parquet_snappy select * from ontime_csv;
Query: insert into ontime_parquet_snappy select * from ontime_csv
Inserted 152657276 rows in 729.76s

Then we can test our query against the new table:

Query: explain select yeard, count(*) from ontime_parquet_snappy  group by yeard
+---------------------------------------------------------------------+
| Explain String                                                      |
+---------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                     |
|   PARTITION: UNPARTITIONED                                          |
|                                                                     |
|   4:EXCHANGE                                                        |
|                                                                     |
| PLAN FRAGMENT 1                                                     |
|   PARTITION: HASH_PARTITIONED: yeard                                |
|                                                                     |
|   STREAM DATA SINK                                                  |
|     EXCHANGE ID: 4                                                  |
|     UNPARTITIONED                                                   |
|                                                                     |
|   3:AGGREGATE (merge finalize)                                      |
|   |  output: SUM(COUNT(*))                                          |
|   |  group by: yeard                                                |
|   |                                                                 |
|   2:EXCHANGE                                                        |
|                                                                     |
| PLAN FRAGMENT 2                                                     |
|   PARTITION: RANDOM                                                 |
|                                                                     |
|   STREAM DATA SINK                                                  |
|     EXCHANGE ID: 2                                                  |
|     HASH_PARTITIONED: yeard                                         |
|                                                                     |
|   1:AGGREGATE                                                       |
|   |  output: COUNT(*)                                               |
|   |  group by: yeard                                                |
|   |                                                                 |
|   0:SCAN HDFS                                                       |
|      table=ontime.ontime_parquet_snappy #partitions=1/1 size=3.95GB |
+---------------------------------------------------------------------+
Returned 31 row(s) in 0.02s

As we can see, it will scan a much smaller amount of data: 3.95GB (with compression) compared to 45GB.

Results:

Query: select yeard, count(*) from ontime_parquet_snappy  group by yeard
+-------+----------+
| yeard | count(*) |
+-------+----------+
| 2010  | 6450117  |
| 2013  | 5349447  |
| 2009  | 6450285  |
...
Returned 26 row(s) in 4.17s

And the response time is much better as well.
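
To quantify the difference between the text table and the Parquet table, the sketch below divides the scan sizes from the two explain plans and the two response times reported above. It is plain arithmetic on single-run numbers; the variable names are illustrative.

```python
# Scan sizes from the two explain plans above (GB).
csv_scan_gb = 45.68
parquet_scan_gb = 3.95

# Response times of the same group-by query (seconds).
csv_sec = 131.38
parquet_sec = 4.17

data_reduction = csv_scan_gb / parquet_scan_gb
speedup = csv_sec / parquet_sec

print(f"Parquet table is about {data_reduction:.1f}x smaller to scan")
print(f"Query ran about {speedup:.1f}x faster")
```

The speedup is larger than the size reduction, which is consistent with a columnar format reading only the column the query needs rather than the whole table.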

Impala complex query example

I’ve used the complex query from my previous post. I had to adapt it for use with Impala: it does not support the “sum(ArrDelayMinutes>30)” notation, but “sum(if(ArrDelayMinutes>30, 1, 0))” works fine.

select
   min(yeard), max(yeard), Carrier, count(*) as cnt,
   sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
   round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime_parquet_snappy
WHERE
   DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI')
   and DestState not in ('AK', 'HI', 'PR', 'VI')
   and flightdate < '2010-01-01'
GROUP by carrier
HAVING cnt > 100000 and max(yeard) > 1990
ORDER by rate DESC
LIMIT 1000;
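
The sum(if(...)) idiom in the query above is a conditional count: each row contributes 1 when the condition holds and 0 otherwise. Below is a minimal Python sketch of the same logic. The delay values are made up for illustration and are not taken from the ontime data; None stands in for SQL NULL.

```python
# Hypothetical arrival delays in minutes; None stands in for SQL NULL.
arr_delay_minutes = [5, 45, None, 31, 30, 120]


def count_delayed(delays, threshold=30):
    """Mirror sum(if(ArrDelayMinutes > threshold, 1, 0)).

    A NULL delay never satisfies the comparison, so it contributes 0.
    """
    return sum(1 if d is not None and d > threshold else 0 for d in delays)


flights_delayed = count_delayed(arr_delay_minutes)
cnt = len(arr_delay_minutes)              # count(*) counts every row
rate = round(flights_delayed / cnt, 2)

print(flights_delayed, cnt, rate)
```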

The query is intentionally designed so that it cannot take advantage of indexes: most of the conditions filter out less than 30% of the data.

Impala results:

+------------+------------+---------+----------+-----------------+------+
| min(yeard) | max(yeard) | carrier | cnt      | flights_delayed | rate |
+------------+------------+---------+----------+-----------------+------+
| 2003       | 2009       | EV      | 1454777  | 237698          | 0.16 |
| 2003       | 2009       | FL      | 1082489  | 158748          | 0.15 |
| 2006       | 2009       | XE      | 1016010  | 152431          | 0.15 |
| 2003       | 2009       | B6      | 683874   | 103677          | 0.15 |
| 2006       | 2009       | YV      | 740608   | 110389          | 0.15 |
| 2003       | 2005       | DH      | 501056   | 69833           | 0.14 |
| 2001       | 2009       | MQ      | 3238137  | 448037          | 0.14 |
| 2004       | 2009       | OH      | 1195868  | 160071          | 0.13 |
| 2003       | 2006       | RU      | 1007248  | 126733          | 0.13 |
| 2003       | 2006       | TZ      | 136735   | 16496           | 0.12 |
| 1988       | 2009       | UA      | 9593284  | 1197053         | 0.12 |
| 1988       | 2009       | AA      | 10600509 | 1185343         | 0.11 |
| 1988       | 2001       | TW      | 2659963  | 280741          | 0.11 |
| 1988       | 2009       | CO      | 6029149  | 673863          | 0.11 |
| 2007       | 2009       | 9E      | 577244   | 59440           | 0.10 |
| 1988       | 2009       | US      | 10276941 | 991016          | 0.10 |
| 2003       | 2009       | OO      | 2654259  | 257069          | 0.10 |
| 1988       | 2009       | NW      | 7601727  | 725460          | 0.10 |
| 1988       | 2009       | DL      | 11869471 | 1156267         | 0.10 |
| 1988       | 2009       | AS      | 1506003  | 146920          | 0.10 |
| 1988       | 2005       | HP      | 2607603  | 235675          | 0.09 |
| 2005       | 2009       | F9      | 307569   | 28679           | 0.09 |
| 1988       | 1991       | PA      | 206841   | 19465           | 0.09 |
| 1988       | 2009       | WN      | 12722174 | 1107840         | 0.09 |
+------------+------------+---------+----------+-----------------+------+
Returned 24 row(s) in 15.28s

15.28 seconds is significantly faster than the original MySQL results (15 min 56.40 sec without parallel execution and 5 min 47 sec with parallel execution). However, this is not an “apples-to-apples” comparison:

  • MySQL will scan 45G of data and Impala with parquet will only scan 3.5G
  • MySQL will run on a single server, Hadoop + Impala will run in parallel on 6 servers.
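
With those caveats in mind, the raw ratios between the timings quoted above can be worked out as follows. This is a small Python sketch of the arithmetic only; the helper name is illustrative.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a 'min sec' timing into seconds."""
    return minutes * 60 + seconds


mysql_single_sec = to_seconds(15, 56.40)   # MySQL without parallel execution
mysql_parallel_sec = to_seconds(5, 47)     # MySQL with parallel execution
impala_sec = 15.28                         # Impala on the Parquet table

single_ratio = mysql_single_sec / impala_sec
parallel_ratio = mysql_parallel_sec / impala_sec

print(f"vs. single-threaded MySQL: about {single_ratio:.1f}x faster")
print(f"vs. parallel MySQL:        about {parallel_ratio:.1f}x faster")
```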

Nevertheless, Hadoop + Impala shows impressive performance and the ability to scale out of the box, which can help a lot with the analysis of large data volumes.

Conclusion

Hadoop + Impala gives us an easy way to analyze large datasets using SQL, with the ability to scale even on old hardware.

In my next posts I will plan to explore:

As always, please share your thoughts in the comments.

The post Using Apache Hadoop and Impala together with MySQL for data analysis appeared first on MySQL Performance Blog.

Sep
12
2013
--

MySQL webinar: ‘Introduction to open source column stores’

Join me Wednesday, September 18 at 10 a.m. PDT for an hour-long webinar where I will introduce the basic concepts behind column store technology. The webinar’s title is: “Introduction to open source column stores.”

What will be discussed?

This webinar will talk about Infobright, LucidDB, MonetDB, Hadoop (Impala) and other column stores.

  • I will compare features between major column stores (both open and closed source).
  • Some benchmarks will be used to demonstrate the basic performance characteristics of the open source column stores.
  • There will be a question and answer session to ask me anything you like about column stores (you can also ask in the comments section below if you prefer to submit them in advance).

Who should attend?

  • Anyone interested in analytics or OLAP
  • Those new to column store technology who want to find out how they will benefit from using a column store
  • Database administrators or users that have to query large amounts of data quickly, especially in aggregate

T-Shirts!

A randomly selected participant in the webinar will be chosen to receive a free Percona t-shirt.  You must be present at the end of the webinar during the question and answer section to be eligible for the shirt.

You can also win a Percona t-shirt by answering one or more of the following questions (the first to answer correctly wins; one winner only, sorry):

  • What is one advantage of columnar storage?
  • What is one big drawback of columnar storage?
  • Describe (or link to) an application or project that uses columnar storage

Register here.

The post MySQL webinar: ‘Introduction to open source column stores’ appeared first on MySQL Performance Blog.
