Jul 12, 2010
--

Intro to OLAP

This is the first of a series of posts about business intelligence tools, particularly OLAP (online analytical processing) tools, using MySQL and other free open source software. OLAP tools are part of the larger topic of business intelligence, a topic that has not had much coverage on MPB. Because of this, I am going to start out talking about these topics in general, rather than getting right to the gritty details of their performance.

I plan on covering the following topics:

  1. Introduction to OLAP and business intelligence. (this post)
  2. Identifying the differences between a data warehouse and a data mart.
  3. Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is BI?
Chances are that you have heard the term business intelligence. Business intelligence (or BI) is a term which encompasses many different tools and methods for analyzing data, usually presenting it in a way that is easily consumed by upper management. This analysis is often used to determine how effective the business has been at meeting certain performance goals, and to forecast how it will do in the future. To put it another way, the tools are designed to provide insight into the business process, hence the name. Probably the most popular BI activity for web sites is click analysis.

As far as BI is concerned, this series of posts focuses on OLAP analysis and, to a lesser extent, on data warehousing. Data warehouses often provide the information upon which OLAP analysis is performed, but more on this in post #2.

OLAP? What is that?
OLAP is an acronym which stands for online analytical processing. OLAP analysis, which is really just another name for multidimensional analysis, consists of displaying summary aggregations of the data broken down into different groups. A typical OLAP analysis might show “sale total, by year, by sales rep, by product category”. OLAP analysis is usually used for reporting on current data, looking at historical trends and trying to make predictions about future trends.

Multidimensional Analysis
Multidimensional analysis is a form of statistical analysis in which samples representing a particular measure are compared or broken down along different dimensions. For example, in a sales analysis, the "sale amount" is a measure. Measures are always aggregated values: total sales might be expressed as SUM(sale_amt), because the sum of the individual sales will be grouped along different dimensions, such as by year or by product. I'm getting a little ahead of myself, though. Before we talk about measures and dimensions, we should talk about the two ways in which this information can be stored.
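Before moving on, here is roughly what "sale total, by year, by product" looks like in plain SQL. This is a minimal sketch against a hypothetical flat sales table (table and column names are invented for illustration; the star-schema version appears further down):

CODE:

  -- measure: SUM(sale_amt); dimensions: year and product
  SELECT YEAR(sale_date) AS sale_year,
         product_name,
         SUM(sale_amt) AS total_sales
  FROM sales
  GROUP BY YEAR(sale_date), product_name;

Every OLAP query boils down to some variation of this: an aggregated measure grouped along one or more dimensions.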

There are two main ways to store multidimensional data for OLAP analysis
OLAP servers typically come in two basic flavors. Some servers have specialized data stores which hold data in a form that is highly effective for multidimensional analysis. These servers are termed MOLAP, and they tend to have exceptional performance due to their specialized data store. Almost all MOLAP solutions pre-compute many (or even all) of the possible answers to multidimensional queries. Palo is an example of an open source version of this technology; ESSbase is an example of a closed source product. MOLAP servers often feature extensive compression of data, which can improve performance. Loading data into a MOLAP server usually takes a very long time because many of the answers in the cube must be calculated; the extra time spent during the load is usually called "processing" time.

A relational OLAP (or ROLAP) server uses data stored in an RDBMS. These systems trade the performance of a multidimensional store for the convenience of an RDBMS. These servers almost always query a database structured as a star or snowflake type schema. To go back to the sales analysis example above, in a star schema the facts about the sales would be stored in the fact table, and the lists of customers and products would be stored in separate dimension tables. Some ROLAP servers support the aggregation of data into additional tables, and can use those tables automatically. These servers can approach the performance of MOLAP with the convenience of ROLAP, but there are still challenges with this approach. The biggest are the amount of time it takes to keep the tables updated and the complexity of the many scripts or jobs which might be necessary to keep the tables in sync. Part five of my series will introduce materialized views, which attempt to address these challenges in a manageable way.
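To make the aggregate-table idea concrete, here is a minimal sketch (table and column names are illustrative only) of the kind of summary table a ROLAP server could use to answer year-level questions without scanning the full fact table:

CODE:

  -- pre-aggregate sales to one row per (year, product)
  CREATE TABLE agg_sales_year_product AS
    SELECT d.year, f.product_id, SUM(f.sale_amt) AS total_sales
    FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year, f.product_id;

The hard part, as noted above, is not creating such tables but keeping them in sync with the fact table as new data arrives.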

What makes ROLAP so great?
An OLAP server usually returns information to the user as a 'pivot table' or 'pivot report'. While you could create such a report in a spreadsheet, a ROLAP tool is designed to deal with millions or even billions of rows of data, much more than a spreadsheet can usually handle. (MOLAP servers usually require that all, or almost all, of the data fit in memory.) Another difference is the ease with which this analysis is constructed: you don't necessarily have to write queries, or even drag a report together by hand, to analyze multidimensional data using an OLAP tool.

Data before pivoting:
Example image from Wikimedia Commons showing detail data for sales

Data summarized in pivot form:
Wikimedia Commons image showing data summarized in pivot format

ROLAP tools use a star schema
As I said before, a sale amount would be considered a measure, and it would usually be aggregated with SUM. The other information about the sale, such as the product, when it was sold, and to whom, would be defined in dimension tables. The fact table contains columns which are joined to the dimension tables, such as product_id and customer_id. These are often defined as foreign keys from the fact table to the dimension tables.

A note about degenerate dimensions:
Any values in the fact table that don't join to dimensions are considered either degenerate dimensions or measures. In the example below, the status of the order is a degenerate dimension. A degenerate dimension is often stored as an ENUM; there is no actual dimension table containing the two order statuses, because such a dimension would add an extra join, which is expensive. Any yes/no field, or field with very low cardinality (such as gender or order status), will probably be stored in the fact table instead of in a dedicated dimension. In the "pivot data" example above, all the dimensions are degenerate: gender, region, style, date.

Star schema with degenerate dimension

Example star schema about sales.
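To make the figure concrete, here is a minimal sketch of such a star schema in MySQL (names, types and ENUM values are illustrative, not the exact schema from the figure):

CODE:

  CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50),
    sub_category VARCHAR(50)   -- hierarchy stored denormalized in the dimension
  ) ENGINE=InnoDB;

  CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100)
  ) ENGINE=InnoDB;

  CREATE TABLE fact_sales (
    product_id   INT NOT NULL,             -- joins to dim_product
    customer_id  INT NOT NULL,             -- joins to dim_customer
    order_status ENUM('open','shipped'),   -- degenerate dimension, no join needed
    sale_amt     DECIMAL(10,2),            -- the measure
    FOREIGN KEY (product_id)  REFERENCES dim_product (product_id),
    FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)
  ) ENGINE=InnoDB;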

Often a dimension will include redundant information to make reporting easier, a process called "denormalization". Hierarchical information may be stored in a single dimension; for example, a dimension for products may include both the category AND a sub-category, and a time dimension may include year, quarter and month. You can create multiple different hierarchies from a single dimension. This allows 'drill down' into the dimension: by default the data would be summarized by year, but you can drill down to quarter- or month-level aggregation.
Sample date hierarchy, showing quarter, month, year and day hierarchies.
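In SQL terms, drilling down just means grouping by progressively finer levels of the hierarchy. A rough sketch, assuming a denormalized dim_date dimension with year and quarter columns and a date_id key on the fact table (hypothetical names, as above):

CODE:

  -- top level: summarized by year
  SELECT d.year, SUM(f.sale_amt) AS total_sales
  FROM fact_sales f
  JOIN dim_date d ON f.date_id = d.date_id
  GROUP BY d.year;

  -- one drill-down step: by year, then quarter
  SELECT d.year, d.quarter, SUM(f.sale_amt) AS total_sales
  FROM fact_sales f
  JOIN dim_date d ON f.date_id = d.date_id
  GROUP BY d.year, d.quarter;

The OLAP tool generates this kind of SQL for you as you click through the hierarchy.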

The screenshots in the jPivot (an OLAP cube browser) documentation can give you a better idea of how the data is displayed. The examples break down sales by product, by category, and by region.

The information is presented in such a fashion that it can be "drilled into" and "filtered on", providing an easy-to-use interface to the underlying data. Graphical display of the data as pie, line or bar charts is also possible.

Focusing on ROLAP
This is the MySQL Performance Blog, and as such an in-depth discussion of MOLAP technology is not particularly warranted here. Our discussion will focus on Mondrian, an open source ROLAP server featuring an in-memory OLAP cache. Mondrian is part of the Pentaho open source business intelligence suite, and is also used by other projects such as Wabit and Jaspersoft. If you are using open source BI then you are probably already using Mondrian. Closed source ROLAP servers include MicroStrategy, Microsoft Analysis Services and Oracle BI.

Mondrian speaks MDX, olap4j and XML for Analysis. This means that there is a very high chance that your existing BI tools (if you have them) will work with it. MDX is a query language that looks similar to SQL but is actually very different. Olap4j is an OLAP interface for Java applications. XML for Analysis (XMLA) is an industry-standard analytical interface originally created by Microsoft, SAS and Hyperion.
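To give a taste of the difference, here is a small MDX query against a hypothetical Sales cube (the cube, measure and dimension names are made up for illustration). It asks for sales broken down by product category for one year, something SQL would express with joins and GROUP BY:

CODE:

  SELECT
    { [Measures].[Sale Amount] }     ON COLUMNS,
    { [Product].[Category].Members } ON ROWS
  FROM [Sales]
  WHERE ( [Time].[2010] )

Note that MDX selects members from dimensions rather than columns from tables; the ROLAP server translates this into SQL against the star schema.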

What's next?
Next we'll talk about the difference between data marts and data warehouses. The former are usually what OLAP analysis runs against, but they are fundamentally related to data warehouses.


Entry posted by Justin Swanhart


Nov 05, 2009
--

Air traffic queries in MyISAM and Tokutek (TokuDB)

This is the next post in the series. The previous posts are:
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB
Air traffic queries in InfiniDB: early alpha

Let me explain the reason for choosing these engines. After the initial three posts I was often asked: "What is the baseline? Can we compare the results with standard MySQL engines?" So here comes MyISAM, to serve as the base point and show how much better the column-oriented analytic engines do here.

However, take into account that for MyISAM we need to choose proper indexes to execute the queries effectively, and there is pain that comes with indexes: loading data gets slower, and designing proper indexes requires additional research, especially since the MySQL optimizer is not always smart about picking the best one.

The really nice thing about MonetDB, InfoBright and InfiniDB is that they do not need indexes, so you do not have to worry about maintaining them or picking the best ones. I am not sure about LucidDB; I was told indexes are needed, but creating a new index was really fast even on the full database, so I guess they are not B-Tree indexes. This reflection on indexes is what turned me in the TokuDB direction.

What is so special about TokuDB? Two things. First, its indexes have a special structure and are "cheap": by "cheap" I mean that the maintenance cost stays roughly constant and largely independent of data size, whereas with regular B-Tree indexes the maintenance cost grows steeply as the data outgrows memory (Bradley Kuszmaul from Tokutek will correct me if I am wrong in this statement). Second, TokuDB uses compression, so I expect a smaller size of loaded data and fewer IO operations during query execution.
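For reference, switching a table to TokuDB is just a storage engine clause. A cut-down sketch of the ontime table (only a few of the real columns, and just the first of the indexes listed below):

CODE:

  CREATE TABLE ontime (
    `Year`      SMALLINT,
    `Month`     TINYINT,
    `DayOfWeek` TINYINT,
    `DepDelay`  INT,
    `Carrier`   CHAR(2),
    KEY `Year` (`Year`,`Month`)   -- maintained by TokuDB's write-optimized indexes
  ) ENGINE=TokuDB;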

So what indexes do we need for the queries? To recall the details: the schema is available in this post
http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/, and
the queries are posted on the "Queries" sheet in my summary spreadsheet.

With Bradley's help we chose the following indexes:

CODE:

  KEY `Year` (`Year`,`Month`),
  KEY `Year_2` (`Year`,`DayOfWeek`),
  KEY `DayOfWeek` (`DayOfWeek`,`Year`,`DepDelay`),
  KEY `DestCityName` (`DestCityName`,`OriginCityName`,`Year`),
  KEY `Year_3` (`Year`,`DestCityName`,`OriginCityName`),
  KEY `Year_4` (`Year`,`Carrier`,`DepDelay`),
  KEY `Origin` (`Origin`,`Year`,`DepDelay`)

I measured the load time for both MyISAM and TokuDB, loading into an empty table with the indexes already created.

Load time for MyISAM: 16608 sec
For TokuDB: 19131 sec

Datasize (including indexes)

MyISAM: 36.7GB
TokuDB: 6.7GB

I am a bit surprised that TokuDB is slower at loading the data; my guess is that this is related to compression, and I expect that with a bigger amount of data TokuDB will load faster than MyISAM.

Now to the queries. Bradley pointed out to me that query Q5:

SELECT t.carrier, c, c2, c*1000/c2 as c3
FROM (SELECT carrier, count(*) AS c FROM ontime
      WHERE DepDelay>10 AND Year=2007 GROUP BY carrier) t
JOIN (SELECT carrier, count(*) AS c2 FROM ontime
      WHERE Year=2007 GROUP BY carrier) t2
  ON (t.Carrier=t2.Carrier)
ORDER BY c3;

can be rewritten as:

SELECT carrier, totalflights, ndelayed, ndelayed*1000/totalflights as c3
FROM (SELECT carrier, count(*) as totalflights,
             sum(if(depdelay>10,1,0)) as ndelayed
      FROM ontime WHERE year=2007 GROUP BY carrier) t
ORDER BY c3 DESC;

(I name it Query Q5i.) The rewrite replaces two passes over ontime with a single one, counting the delayed flights with sum(if(depdelay>10,1,0)) instead of a second grouped subquery.

The summary table with query execution times (in sec, less is better):

Query MyISAM TokuDB
Q0 72.84 50.25
Q1 61.03 55.01
Q2 98.12 58.36
Q3 123.04 66.87
Q4 6.92 6.91
Q5 13.61 11.86
Q5i 7.68 6.96
Q6 123.84 69.03
Q7 187.22 159.62
Q8 (1y) 8.75 7.59
Q8 (2y) 102.17 64.95
Q8 (3y) 104.7 69.76
Q8 (4y) 107.05 70.46
Q8 (10y) 119.54 84.64
Q9 69.05 47.67

For reference I used 5.1.36-Tokutek-2.1.0 for both MyISAM and TokuDB tests.

And if you are interested in comparing MyISAM with the previously tested engines:

Query MyISAM MonetDB InfoBright LucidDB InfiniDB
Q0 72.84 29.9 4.19 103.21 NA
Q1 61.03 7.9 12.13 49.17 6.79
Q2 98.12 0.9 6.73 27.13 4.59
Q3 123.04 1.7 7.29 27.66 4.96
Q4 6.92 0.27 0.99 2.34 0.75
Q5 13.61 0.5 2.92 7.35 NA
Q6 123.84 12.5 21.83 78.42 NA
Q7 187.22 27.9 8.59 106.37 NA
Q8 (1y) 8.75 0.55 1.74 6.76 8.13
Q8 (2y) 102.17 1.1 3.68 28.82 16.54
Q8 (3y) 104.7 1.69 5.44 35.37 24.46
Q8 (4y) 107.05 2.12 7.22 41.66 32.49
Q8 (10y) 119.54 29.14 17.42 72.67 70.35
Q9 69.05 6.3 0.31 76.12 9.54

All the results are available in the summary spreadsheet.

I deliberately do not put TokuDB in the same table as the analytics-oriented databases, to highlight that TokuDB is a general-purpose OLTP engine.
As you see, it does better than MyISAM on all queries.


Entry posted by Vadim


Nov 02, 2009
--

Air traffic queries in InfiniDB: early alpha

When Calpont announced the availability of InfiniDB, I certainly could not miss the chance to compare it with the previously tested databases in the same environment.
See my previous posts on this topic:
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB

I could not run all the queries against InfiniDB, and I hit some hiccups during the experiment, so it was a less smooth experience than with the other databases.

So let's go through the same steps:

Load data

InfiniDB supports MySQL's LOAD DATA statement and its own colxml / cpimport utilities. As LOAD DATA is more familiar to me, I started with that; however, soon after issuing LOAD DATA on a 180MB file (year 1989, 1st month) it caused extensive swapping (my box has 4GB of RAM), and the statement failed with:
ERROR 1 (HY000) at line 1: CAL0001: Insert Failed: St9bad_alloc

Alright, colxml / cpimport was more successful; however, it is less flexible in syntax than LOAD DATA, so I had to transform the input files into a format that cpimport could understand.

Total load time was 9747 sec or 2.7h (not counting the time spent on file transformation).

I put the summary data on load time, data size and query time into a Google Spreadsheet so you can easily compare with the previous results. There are separate sheets for queries, data size and load time.

Datasize

The size of the database after loading is another confusing point. The InfiniDB data directory has a complex structure, like:

CODE:

  ./000.dir/000.dir/003.dir/233.dir
  ./000.dir/000.dir/003.dir/233.dir/000.dir
  ./000.dir/000.dir/003.dir/233.dir/000.dir/FILE000.cdf
  ./000.dir/000.dir/003.dir/241.dir
  ./000.dir/000.dir/003.dir/241.dir/000.dir
  ./000.dir/000.dir/003.dir/241.dir/000.dir/FILE000.cdf
  ./000.dir/000.dir/003.dir/238.dir
  ./000.dir/000.dir/003.dir/238.dir/000.dir
  ./000.dir/000.dir/003.dir/238.dir/000.dir/FILE000.cdf
  ./000.dir/000.dir/003.dir/235.dir
  ./000.dir/000.dir/003.dir/235.dir/000.dir
  ./000.dir/000.dir/003.dir/235.dir/000.dir/FILE000.cdf

so it is hard to say which files belong to which table. After the load, the size of 000.dir is 114GB, which is twice as big as the original data files. SHOW TABLE STATUS does not really help there; it shows

CODE:

             Name: ontime
           Engine: InfiniDB
          Version: 10
       Row_format: Dynamic
             Rows: 2000
   Avg_row_length: 0
      Data_length: 0
  Max_data_length: 0
     Index_length: 0
        Data_free: 0
   Auto_increment: NULL
      Create_time: NULL
      Update_time: NULL
       Check_time: NULL
        Collation: latin1_swedish_ci
         Checksum: NULL
   Create_options:
          Comment:

with totally misleading information.

So I put down 114GB as the data size after load, until someone points me to how to get the real size, and also explains what takes so much space.

Queries

The first "count star" query, SELECT count(*) FROM ontime, took 2.67 sec, which shows that InfiniDB does not store a record count, but calculates it pretty fast.

Q0:
select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t;

Another bump: on this query InfiniDB complains:

ERROR 138 (HY000):
The query includes syntax that is not supported by InfiniDB. Use 'show warnings;' to get more information. Review the Calpont InfiniDB Syntax guide for additional information on supported distributed syntax or consider changing the InfiniDB Operating Mode (infinidb_vtable_mode).
mysql> show warnings;
+-------+------+------------------------------------------------------------+
| Level | Code | Message |
+-------+------+------------------------------------------------------------+
| Error | 9999 | Subselect in From clause is not supported in this release. |
+-------+------+------------------------------------------------------------+

OK, so InfiniDB does not support derived tables, which is a big limitation from my point of view.
As a workaround I tried to create a temporary table, but got another error:

CODE:

  mysql> create temporary table tq2 as (select Year,Month,count(*) as c1 from ontime group by Year, Month);
  ERROR 122 (HY000): Cannot open table handle for ontime.

As the message suggests, I switched to infinidb_vtable_mode = 2, which is:

CODE:

  2) auto-switch mode: InfiniDB will attempt to process the query internally; if it
  cannot, it will automatically switch the query to run in row-by-row mode.
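For reference, a minimal sketch of the switch itself, assuming the setting is exposed as a regular session variable (it could presumably also go into my.cnf):

CODE:

  mysql> SET infinidb_vtable_mode = 2;   -- auto-switch mode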

In auto-switch mode the query ran, but it took 667 sec, so I drop queries Q5, Q6 and Q7, which are also based on derived tables, from consideration as not supported by InfiniDB.

Other queries (again, see the comparison with the other engines in the Google Spreadsheet or in the summary table at the bottom):

Query Q1:
mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC;
7 rows in set (6.79 sec)

Query Q2:
mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC;

7 rows in set (4.59 sec)

Query Q3:
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;

4.96 sec

Query Q4:
mysql> SELECT Carrier, count(*) FROM ontime WHERE DepDelay > 10 AND YearD=2007 GROUP BY Carrier ORDER BY 2 DESC;

I had another surprise with this query: after 15 min it had not returned a result. I checked the system and it was totally idle, but the query was stuck. I killed the query and restarted mysqld, but could not connect to mysqld anymore. In the process list I saw that InfiniDB had started a couple of external processes (ExeMgr, DDLProc, PrimProc, controllernode fg, workernode DBRM_Worker1 fg) which cooperate with each other using IPC shared memory and semaphores. To clean up the system I rebooted the server, and only after that was mysqld able to start.

After the restart, query Q4 took 0.75 sec.

Queries Q5-Q7 skipped.

Query Q8:

SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE YearD BETWEEN 2008 and 2008 GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10;

And the times for InfiniDB (the N-year variants simply widen the BETWEEN range):

1y: 8.13 sec
2y: 16.54 sec
3y: 24.46 sec
4y: 32.49 sec
10y: 1 min 10.35 sec

Query Q9:

select Year, count(*) as c1 from ontime group by Year;

Time: 9.54 sec

OK, here is the summary table with query times (in sec, less is better):

Query MonetDB InfoBright LucidDB InfiniDB
Q0 29.9 4.19 103.21 NA
Q1 7.9 12.13 49.17 6.79
Q2 0.9 6.73 27.13 4.59
Q3 1.7 7.29 27.66 4.96
Q4 0.27 0.99 2.34 0.75
Q5 0.5 2.92 7.35 NA
Q6 12.5 21.83 78.42 NA
Q7 27.9 8.59 106.37 NA
Q8 (1y) 0.55 1.74 6.76 8.13
Q8 (2y) 1.1 3.68 28.82 16.54
Q8 (3y) 1.69 5.44 35.37 24.46
Q8 (4y) 2.12 7.22 41.66 32.49
Q8 (10y) 29.14 17.42 72.67 70.35
Q9 6.3 0.31 76.12 9.54

Conclusions

  • The InfiniDB server version shows Server version: 5.1.39-community InfiniDB Community Edition 0.9.4.0-5-alpha (GPL), so I consider it an alpha release, and it is doing OK for an alpha. I will wait for a more stable release for further tests, as it took a good amount of time to deal with the various glitches.
  • InfiniDB shows really good times for the queries it can handle, quite often better than InfoBright.
  • The inability to handle derived tables is a significant drawback for me; I hope it will be fixed.

Entry posted by Vadim


Oct 26, 2009
--

Air traffic queries in LucidDB

After my first post, Analyzing air traffic performance with InfoBright and MonetDB, where I was not able to finish the task with LucidDB, John Sichi contacted me and helped with the setup. You can find instructions on how to load the data on the LucidDB Wiki page.

You can find the description of the benchmark in the original post; here I will show the numbers I got for LucidDB vs the previous systems.

Load time
Loading data into LucidDB in a single thread took me 15273 sec or 4.24h. Unlike the other systems, LucidDB supports multi-threaded load; with concurrency 2 (as I have only 2 cores on that box) the load time is 9955 sec or 2.76h. For comparison, the load time for InfoBright is 2.45h and for MonetDB it is 2.6h.

DataSize
Another interesting metric is the data size after the load. The LucidDB db file takes 9.3GB after the load.
UPDATE 27-Oct-2009: According to the metadata tables, the actual size of the data is 4.5GB; the 9.3GB is the size of the physical file db.dat, which probably was not truncated after several loads of data.

For InfoBright it is 1.6GB, and for MonetDB 65GB. Obviously LucidDB uses some compression, but it is not as aggressive as in InfoBright's case. As the original dataset is 55GB, the compression ratio for LucidDB is roughly 1:12.

Query times

Here is the list of queries and the times for all systems.

– Lame query "count star"
LucidDB:
SELECT count(*) FROM otp."ontime";
1 row selected (55.165 seconds)

Both InfoBright and MonetDB returned the result immediately.
It seems LucidDB has to scan the whole table to get the result.

– Q0:
select avg(c1) from (select "Year","Month",count(*) as c1 from otp."ontime" group by "Year","Month") t;
LucidDB: 103.205 seconds
InfoBright: 4.19 sec
MonetDB: 29.9 sec

– Q1:
SELECT "DayOfWeek", count(*) AS c FROM OTP."ontime" WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER BY c DESC;
LucidDB: 49.17 seconds
InfoBright: 12.13 sec
MonetDB: 7.9 sec

– Q2:
SELECT "DayOfWeek", count(*) AS c FROM otp."ontime" WHERE "DepDelay">10 AND "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER BY c DESC;
LucidDB: 27.131 seconds
InfoBright: 6.73 sec
MonetDB: 0.9 sec

– Q3:
!set rowlimit 10
SELECT "Origin", count(*) AS c FROM otp."ontime" WHERE "DepDelay">10 AND "Year" BETWEEN 2000 AND 2008 GROUP BY "Origin" ORDER BY c DESC;
LucidDB: 27.664 seconds
InfoBright: 7.29 sec
MonetDB: 1.7 sec

– Q4:
SELECT "Carrier", count(*) FROM otp."ontime" WHERE "DepDelay">10 AND "Year"=2007 GROUP BY "Carrier" ORDER BY 2 DESC;
LucidDB: 2.338 seconds
InfoBright: 0.99 sec
MonetDB: 0.27 sec

– Q5:
SELECT t."Carrier", c, c2, c*1000/c2 as c3 FROM (SELECT "Carrier", count(*) AS c FROM OTP."ontime" WHERE "DepDelay">10 AND "Year"=2007 GROUP BY "Carrier") t JOIN (SELECT "Carrier", count(*) AS c2 FROM OTP."ontime" WHERE "Year"=2007 GROUP BY "Carrier") t2 ON (t."Carrier"=t2."Carrier") ORDER BY c3 DESC;
LucidDB: 7.351 seconds
InfoBright: 2.92 sec
MonetDB: 0.5 sec

– Q6:
SELECT t."Carrier", c, c2, c*1000/c2 as c3 FROM (SELECT "Carrier", count(*) AS c FROM OTP."ontime" WHERE "DepDelay">10 AND "Year" BETWEEN 2000 AND 2008 GROUP BY "Carrier") t JOIN (SELECT "Carrier", count(*) AS c2 FROM OTP."ontime" WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "Carrier") t2 ON (t."Carrier"=t2."Carrier") ORDER BY c3 DESC;
LucidDB: 78.423 seconds
InfoBright: 21.83 sec
MonetDB: 12.5 sec

– Q7:
SELECT t."Year", c1/c2 FROM (select "Year", count(*)*1000 as c1 from OTP."ontime" WHERE "DepDelay">10 GROUP BY "Year") t JOIN (select "Year", count(*) as c2 from OTP."ontime" GROUP BY "Year") t2 ON (t."Year"=t2."Year");
LucidDB: 106.374 seconds
InfoBright: 8.59 sec
MonetDB: 27.9 sec

– Q8:
SELECT "DestCityName", COUNT(DISTINCT "OriginCityName") FROM "ontime" WHERE "Year" BETWEEN 2008 and 2008 GROUP BY "DestCityName" ORDER BY 2 DESC;

Years, LucidDB, InfoBright, MonetDB
1y, 6.76s, 1.74s, 0.55s
2y, 28.82s, 3.68s, 1.10s
3y, 35.37s, 5.44s, 1.69s
4y, 41.66s, 7.22s, 2.12s
10y, 72.67s, 17.42s, 29.14s

– Q9:
select "Year", count(*) as c1 from "ontime" group by "Year";
LucidDB: 76.121 seconds
InfoBright: 0.31 sec
MonetDB: 6.3 sec

As you see, LucidDB is not showing the best results. However, on the good side, LucidDB is very feature-rich, with full support for DML statements. The ETL features are also very impressive: you can extract, filter and transform external data (there is even access to MySQL via a JDBC driver) directly in SQL queries (compare that with the single LOAD DATA statement in the InfoBright ICE edition). Also, I am not much into Java, but as I understand it, LucidDB can be easily integrated with Java applications, which is important if your development is Java-based.

It is worth mentioning that in LucidDB a single query execution takes 100% of user time on a single CPU, which may signal that there is some low-hanging fruit for optimization. OProfile could show clear places to fix.


Entry posted by Vadim

