Mar
10
2015
--

Advanced JSON for MySQL

What is JSON

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing documents whose attributes/schema can differ from document to document. Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to inter-operate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with in MySQL. Until recently, JSON was just a TEXT document. I said until recently, so what has changed? The biggest thing is that there are new JSON UDFs by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDFs are up to version 0.0.4. While these new UDFs are a welcome addition to the MySQL database, they don’t solve the really tough JSON problems we face.

Searching

The JSON UDFs provide a number of functions that make working with JSON easier, including the ability to extract portions of a document or search a document for a particular key. That being said, you can’t use JSON_EXTRACT() or JSON_SEARCH() in the WHERE clause, because doing so will initiate a dreaded full table scan (what MongoDB would call a full collection scan). This is a big problem, and the common wisdom is that JSON can’t be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

I have, however, come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is. Now, when you think of indexes you probably think of B-TREE or HASH indexes (or bitmap indexes), but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index where words (tokens) point to documents. While fulltext indexes are great, they aren’t normally usable for JSON. The reason is that MySQL splits words on whitespace and non-alphanumeric characters, so the name of a field (the key) can no longer be associated with its value. But what if we transform the JSON? You can “flatten” the JSON down into key/value pairs and use a fulltext index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ RapidJSON library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:

{
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
	"batters":
		{
			"batter":
				[
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
				]
		},
	"topping":
		[
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }
		]
}

Flattened:

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))\G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=donut
name=Cake
ppu=0.55
id=1001
type=Regular
id=1002
type=Chocolate
id=1003
type=Blueberry
id=1004
type=Devil's Food
type=Devil's
type=Food
id=5001
type=None
id=5002
type=Glazed
id=5005
type=Sugar
id=5007
type=Powdered Sugar
type=Powdered
type=Sugar
id=5006
type=Chocolate with Sprinkles
type=Chocolate
type=with
type=Sprinkles
id=5003
type=Chocolate
id=5004
type=Maple
1 row in set (0.00 sec)

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another column (or another table) and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will also split words on the equal sign. We don’t want that, as it destroys the association between the key and the value. To fix this you’ll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added the equals sign (=) to the list of lower-case characters, as described in the manual. You just have to change two text files; there is no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.

By the way, I used a UDF, because MySQL FULLTEXT indexes don’t support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:

create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:

alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. Note that you can create a trigger to keep the second column in sync (I leave writing one as an exercise for the reader, though a rough sketch appears after the example below), or you can use Flexviews to maintain a copy in a second table automatically.

mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0
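
As mentioned above, triggers can keep the flat column in sync automatically. A rough sketch (assuming the json2 table and the RAPID_FLATTEN_JSON UDF from this post; untested):

-- keep the flattened copy current when documents are inserted or updated
create trigger json2_flat_bi before insert on json2
for each row set NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);

create trigger json2_flat_bu before update on json2
for each row set NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);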

Using the index:

mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

The documents I searched for that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to only index a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and only flatten that.
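
For instance, to index only the batters sub-document of each row (the choice of key is hypothetical and assumes documents shaped like the example above):

-- flatten and index only the 'batters' portion of each document
update json2 set flat = RAPID_FLATTEN_JSON(JSON_EXTRACT(doc, 'batters'));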

Aggregating

JSON documents may contain sub-documents, as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but the result is still a text document. There is no function that can extract ALL values for a particular key (like invoice_price) and aggregate the results. So, if you have a document called orders which contains a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDFs to aggregate a “total sales” figure from all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, ‘key’). This UDF extracts all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item. The UDF returns the values separated by newlines. I also created a few stored routines called jsum, jmin, jmax, jcount, and javg; they process the output of RAPID_EXTRACT_ALL and aggregate it. If you want to run RAPID_EXTRACT_ALL over only a portion of a document, extract that portion with the JSON_EXTRACT UDF first, then process the result with RAPID_EXTRACT_ALL.

For example:

mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1\G
*************************** 1. row ***************************
ids: 888
889
2312
5869
8702
jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:

mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
+-----------------------------------------+
| sum( jsum(json_extract_all(doc,'id')) ) |
+-----------------------------------------+
|                         296615411.00000 |
+-----------------------------------------+
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.
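
For example, a hypothetical grouping query (the field names assume documents shaped like the earlier donut example):

select json_extract(doc, 'type') as doc_type,
       sum(jsum(json_extract_all(doc, 'ppu'))) as total_ppu
  from json2
 group by doc_type
 order by total_ppu desc;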

Where to get the tools:
You can find the UDFs in the swanhart-tools GitHub repo. I think you will find these tools very useful in working with JSON documents in MySQL.

(This post was originally posted on my personal blog: swanhart.livejournal.com, but is reposted here for wider distribution)

The post Advanced JSON for MySQL appeared first on MySQL Performance Blog.

Feb
17
2015
--

‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

I recently got back from FOSDEM, in Brussels, Belgium. While I was there I got to see a great talk by Sveta Smirnova about her MySQL 5.7 Labs release JSON UDF functions. It is important to note that while the UDFs come in a 5.7 Labs release, it is absolutely possible to compile and use them with earlier versions of MySQL, because the UDF interface has not changed for a long time. However, the UDFs should still be considered alpha/preview quality and should not be used in production yet! For this example I am using Percona Server 5.6 with the UDFs.

That being said, the proof-of-concept that I’m about to present here uses only one JSON function (JSON_EXTRACT) and it has worked well enough in my testing to present my idea here. The JSON functions will probably be GA sometime soon anyway, and this is a useful test of the JSON_EXTRACT function.

The UDFs let you parse, search and manipulate JSON data inside of MySQL, bringing MySQL closer to the capabilities of a document store.

Since I am using Percona Server 5.6, I needed to compile and install the UDFs myself. Here are the steps I took to build and install them:

  1. $ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
  2. $ cmake -DMYSQL_DIR=/usr/local/mysql .
  3. $ sudo make install
  4. $ sudo cp *.so /usr/local/mysql/lib/plugin
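
After copying the shared library into the plugin directory, each UDF also has to be registered with CREATE FUNCTION … SONAME. A minimal sketch for the one function used in this post (the library file name is an assumption; use whatever .so your build actually produced):

mysql> CREATE FUNCTION json_extract RETURNS STRING SONAME 'libmy_json_udf.so'; -- library name assumed; register the other JSON UDFs the same way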

The JSON UDFs are great, but what’s the problem?

The JSON functions work very well for manipulating individual JSON objects, but like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This means the functions are virtually useless for searching through large volumes of JSON data. If you want to use MySQL as a document store, this limits its usefulness in the extreme: the ability to extract key/value pairs from JSON documents is powerful, but without indexing it can’t scale well.

What can be done to index JSON in MySQL for efficient access?

The JSON UDFs provide a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.

  1. Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
  2. Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is a slight delay before the index is populated. This is similar to eventual consistency in a document store.

Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.
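
For reference, here is a rough sketch of what an option 1 trigger might look like, extracting a single key into an indexed column (the column name and CAST match the materialized view defined later; untested, and a matching BEFORE UPDATE trigger would also be needed):

alter table json add column LO_OrderKey bigint unsigned, add key LO_OrderKey_Idx (LO_OrderKey);

create trigger json_bi before insert on json
for each row set NEW.LO_OrderKey = cast(json_extract(NEW.doc,'LO_OrderKey') as unsigned);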

What is Flexviews?

Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views are able to be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed fast, it is possible to refresh it frequently and have a low latency index, but not one perfectly in sync with the base table at all times.

The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.

See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql

Demo/POC using materialized view as an index

The first step to creating an incrementally refreshable materialized view with Flexviews is to create a materialized view change log on all of the tables used in the view. The CREATE_MVLOG($schema, $table) function creates the log, and FlexCDC will immediately begin to collect changes into it.

mysql> call flexviews.create_mvlog('ssb_json','json');
Query OK, 1 row affected (0.01 sec)

Next, the materialized view name and refresh type must be registered with the CREATE($schema, $mvname, $refreshtype) function:

mysql> call flexviews.create('ssb_json','json_idx','INCREMENTAL');
Query OK, 0 rows affected (0.00 sec)
-- save the just generated identifier for the view.  You can use GET_ID($schema,$mvname) later.
mysql> set @mvid := last_insert_id();
Query OK, 0 rows affected (0.00 sec)

Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias,$joinclause) function. This example will use only one table, but Flexviews supports joins too.

mysql> call flexviews.add_table(@mvid, 'ssb_json','json','json',null);
Query OK, 1 row affected (0.00 sec)

Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The ADD_EXPR($mvid, $expression_type, $expression, $alias) function is used to add expressions. Note that JSON_EXTRACT returns TEXT, so I’ve CAST each extracted value (to DATE or UNSIGNED) so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', "cast(json_extract(doc,'D_DateKey') as date)", 'D_DateKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'C_CustomerKey') as unsigned)", 'C_CustomerKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'S_SuppKey') as unsigned)", 'S_SuppKey');
Query OK, 1 row affected (0.01 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'P_PartKey') as unsigned)", 'P_PartKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_OrderKey') as unsigned)", 'LO_OrderKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_LineNumber') as unsigned)", 'LO_LineNumber');
Query OK, 1 row affected (0.00 sec)

I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id');
Query OK, 1 row affected (0.00 sec)

Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.

mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_LineNumber", 'LO_LineNumber_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_OrderKey", 'LO_OrderKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"P_PartKey", 'P_PartKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"S_SuppKey", 'S_SuppKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"D_DateKey", 'D_DateKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"C_CustomerKey", 'C_CustomerKey_Idx');
Query OK, 1 row affected (0.00 sec)

Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDFs so they are not as fast as built-in functions, and I indexed a lot of things (six different indexes are being populated at once), so it takes some time to build the index:

mysql> call flexviews.enable(@mvid);
Query OK, 2 rows affected (35 min 53.17 sec)

After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.

mysql> show tables;
+--------------------+
| Tables_in_ssb_json |
+--------------------+
| json               |
| json_idx           |
| json_idx_delta     |
+--------------------+
3 rows in set (0.00 sec)

Here is the table definition of json_idx, our materialized view. You can see it is indexed:

CREATE TABLE `json_idx` (
  `mview$pk` bigint(20) NOT NULL AUTO_INCREMENT,
  `D_DateKey` date DEFAULT NULL,
  `C_CustomerKey` bigint(21) unsigned DEFAULT NULL,
  `S_SuppKey` bigint(21) unsigned DEFAULT NULL,
  `P_PartKey` bigint(21) unsigned DEFAULT NULL,
  `LO_OrderKey` bigint(21) unsigned DEFAULT NULL,
  `LO_LineNumber` bigint(21) unsigned DEFAULT NULL,
  `id` bigint(20) NOT NULL DEFAULT '0',
  `mview$hash` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`mview$pk`),
  KEY `LO_LineNumber_Idx` (`LO_LineNumber`),
  KEY `LO_OrderKey_Idx` (`LO_OrderKey`),
  KEY `P_PartKey_Idx` (`P_PartKey`),
  KEY `S_SuppKey_Idx` (`S_SuppKey`),
  KEY `D_DateKey_Idx` (`D_DateKey`),
  KEY `C_CustomerKey_Idx` (`C_CustomerKey`),
  KEY `mview$hash_key` (`mview$hash`)
) ENGINE=InnoDB AUTO_INCREMENT=6029221 DEFAULT CHARSET=latin1;

Here are some sample contents. You can see the integer values extracted out of the JSON:

mysql> select * from json_idx limit 10;
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
|        1 | 1996-08-08 |          6748 |         1 |    178778 |       35620 |             2 |  1 | 3151656687 |
|        2 | 1994-05-20 |          5272 |         1 |     52362 |      102790 |             4 |  2 | 2181615425 |
|        3 | 1995-05-04 |         22870 |         1 |    136407 |      146757 |             3 |  3 |  544130577 |
|        4 | 1996-06-16 |         12217 |         1 |    129103 |      151200 |             1 |  4 | 2969697839 |
|        5 | 1992-07-20 |         21757 |         1 |     35243 |      151745 |             1 |  5 | 1438921571 |
|        6 | 1997-08-16 |         18760 |         1 |    150766 |      159232 |             6 |  6 | 3941775529 |
|        7 | 1994-03-04 |           757 |         1 |     15750 |      188902 |             3 |  7 | 2142628088 |
|        8 | 1993-11-04 |         17830 |         1 |    192023 |      253828 |             5 |  8 | 3480147565 |
|        9 | 1993-07-12 |         16933 |         1 |     59997 |      269062 |             5 |  9 | 3572286145 |
|       10 | 1998-06-16 |         26660 |         1 |     30023 |      298272 |             3 | 10 | 1971966244 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
10 rows in set (0.00 sec)

Now, there needs to be an easy way to use this index in a select statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense to access the data. We’ll call this the index view:

mysql> create view json_idx_v as select * from json natural join json_idx;
Query OK, 0 rows affected (0.00 sec)

And just for completeness, here are the contents of a row from our new index view:

mysql> select * from json_idx_v limit 1\G
*************************** 1. row ***************************
           id: 1
          doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU     5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU     0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
     mview$pk: 1
    D_DateKey: 1996-08-08
C_CustomerKey: 6748
    S_SuppKey: 1
    P_PartKey: 178778
  LO_OrderKey: 35620
LO_LineNumber: 2
   mview$hash: 3151656687
1 row in set (0.00 sec)

Using the UDF to find a document

The UDF does a full table scan, parsing all six million documents (TWICE!) as it goes along. Unsurprisingly, this is slow:

mysql> select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2\G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (54.49 sec)
mysql> explain select * from json
                where json_extract(doc,'LO_OrderKey') = 35620
                  and json_extract(doc,'LO_LineNumber') = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5233236
        Extra: Using where
1 row in set (0.00 sec)

Using the index view to find a document

mysql> select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2\G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (0.00 sec)
mysql> explain select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json_idx
         type: ref
possible_keys: LO_LineNumber_Idx,LO_OrderKey_Idx
          key: LO_OrderKey_Idx
      key_len: 9
          ref: const
         rows: 4
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: ssb_json.json_idx.id
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Keeping the index in sync

Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a REFRESH($mvid, $mode, $transaction_id) function.

I am going to remove one document from the table:

mysql> delete from json where id = 10000;
Query OK, 1 row affected (0.01 sec)

Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:

mysql> select * from flexviews.mvlog_f1673fac9814a93508a1c917566ecd4d\G
*************************** 1. row ***************************
    dml_type: -1
      uow_id: 113
fv$server_id: 33
      fv$gsn: 1083
          id: 10000
         doc: {"LO_OrderKey":"3359521","LO_LineNumber":"2","LO_CustKey":"10306","LO_PartKey":"77997","LO_SuppKey":"4","LO_OrderDateKey":"19951010","LO_OrderPriority":"2-HIGH","LO_ShipPriority":"0","LO_Quantity":"43","LO_ExtendedPrice":"8492457","LO_OrdTotalPrice":"27032802","LO_Discount":"2","LO_Revenue":"8322607","LO_SupplyCost":"118499","LO_Tax":"4","LO_CommitDateKey":"19951228","LO_ShipMode":"FOB","C_CustomerKey":"10306","C_Name":"Customer#000010306","C_Address":"4UR9tz8","C_City":"ROMANIA  5","C_Nation":"ROMANIA","C_Region":"EUROPE","C_Phone":"29-361-986-3513","C_MktSegment":"BUILDING","S_SuppKey":"4","S_Name":"Supplier#000000004","S_Address":"qGTQJXogS83a7MB","S_City":"MOROCCO  4","S_Nation":"MOROCCO","S_Region":"AFRICA","S_Phone":"25-128-190-5944","D_DateKey":"19951010","D_Date":"Octorber 10, 1995","D_DayOfWeek":"Wednesday","D_Month":"Octorber","D_Year":"1995","D_YearMonthNum":"199510","D_YearMonth":"Oct1995","D_DayNumInWeek":"4","D_DayNumInMonth":"10","D_DayNumInYear":"283","D_MonthNumInYear":"10","D_WeekNumInYear":"41","D_SellingSeason":"Fall","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"77997","P_Name":"burnished olive","P_MFGR":"MFGR#2","P_Category":"MFGR#24","P_Brand":"MFGR#2426","P_Colour":"orchid","P_Type":"MEDIUM PLATED TIN","P_Size":"16","P_Container":"WRAP PKG"}
1 row in set (0.01 sec)

Now we can verify the materialized view is out of date:

mysql> select * from json_idx where id = 10000;
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id    | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
|    10000 | 1995-10-10 |         10306 |         4 |     77997 |     3359521 |             2 | 10000 | 2937185172 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
1 row in set (2.60 sec)

To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes at the same time, but I am going to use COMPUTE mode first, to show you what ends up in the delta table.
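
A sketch of the COMPUTE step, assuming it follows the REFRESH($mvid, $mode, $transaction_id) signature and mirrors the APPLY call shown later in this post:

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'COMPUTE', NULL);

With the changes computed, the delta table contains: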

mysql> select * from json_idx_delta\G
*************************** 1. row ***************************
     dml_type: -1
       uow_id: 113
       fv$gsn: 1083
    D_DateKey: 1995-10-10
C_CustomerKey: 10306
    S_SuppKey: 4
    P_PartKey: 77997
  LO_OrderKey: 3359521
LO_LineNumber: 2
           id: 10000
   mview$hash: 2937185172
1 row in set (0.00 sec)

Delta tables are similar to materialized view change log tables, except that they contain insertions into and deletions from the view contents. In this case, you can see dml_type is -1 and id = 10000, so the row in the view corresponding to the row we deleted will be removed when the change is applied.

Finally the change can be applied:

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'APPLY',NULL);
Query OK, 2 rows affected (0.47 sec)
mysql> select * from json_idx where id = 10000; -- note, we didn't index id in the MV
Empty set (2.61 sec)

Finally, it makes sense to try to keep the index in sync as quickly as possible using a MySQL event:

DELIMITER ;;
CREATE EVENT IF NOT EXISTS flexviews.refresh_json_idx
ON SCHEDULE EVERY 1 SECOND
DO
BEGIN
  DECLARE v_got_lock tinyint default 0;
  SELECT GET_LOCK('JSON_IDX_LOCK', 0) INTO v_got_lock;
  IF v_got_lock = 1 THEN
    CALL flexviews.refresh(flexviews.get_id('ssb_json','json_idx'),'BOTH',NULL);
    SELECT RELEASE_LOCK('JSON_IDX_LOCK') INTO @discard;
  END IF;
END;;
DELIMITER ;

So there you have it. A way to index and quickly search through JSON documents and keep the index in sync automatically.

The post ‘Indexing’ JSON documents for efficient MySQL queries over JSON data appeared first on MySQL Performance Blog.

Sep
22
2014
--

How to scale big data applications using MySQL sharding frameworks

This Wednesday I’ll be discussing two common types of big data: machine-generated data and user-generated content. These types of big data are amenable to sharding, a commonly used technique for spreading data over more than one database server.

I’ll be discussing this in-depth during a live webinar at 10 a.m. Pacific time on Sept. 24. I’ll also talk about two major sharding frameworks: MySQL Fabric and Shard-Query, for OLTP and OLAP workloads, respectively. Following the webinar there will be a brief Q&A session.

Find the webinar link here: “How to Scale Big Data Applications Using MySQL Sharding Frameworks” for more information or register directly here.

Find Shard-Query (part of Swanhart-Tools) here, in Github
Find MySQL Fabric (part of MySQL Utilities) here, at the MySQL documentation page

The post How to scale big data applications using MySQL sharding frameworks appeared first on Percona Performance Blog.

Sep
10
2014
--

Generating test data from the mysql> prompt

There are a lot of tools that generate test data. Many of them have complex XML scripts or GUI interfaces that let you specify characteristics of the data. For testing query performance and many other applications, however, a simple, quick-and-dirty data generator that can be constructed at the MySQL command line is useful.

First, let’s talk about what kind of data you can easily create with MySQL function calls:

You can generate a decimal number between zero and another number using the MySQL RAND() function, as in the following query (here between 0 and 10000):

SELECT RAND() * 10000;

Similarly, you can generate a random integer by adding the FLOOR() function:

SELECT FLOOR(RAND() * 10000)

You can generate a random string of 32 characters using MD5():

SELECT MD5(RAND() * 10000)

You can return a random integer between 500 and 1000 with the following:

SELECT FLOOR( 500 + RAND() * (1000 - 500))

You can return a random string from a list of strings by using a table to hold the list. A subselect can then pick a random name from the list:

create table names(id int auto_increment primary key, name varchar(20));
insert into names (name) values ('Justin'),('Jerry'),('James'),('Josh'),('Julien');
select (select name from names where id = 1 + rand() * 4);

Now we can generate a “fact” table with many rows using fairly simple SQL statements.

First create a table to generate data into:

CREATE TABLE fact (
  dim1 int,
  dim2 int,
  name varchar(20),
  hash varchar(32),
  measure1 double
);

Seed the table with one initial row:

INSERT INTO fact
VALUES (1,1,'Justin',md5(''), .1);

Now grow the table by selecting from the table but providing new random values for the inserted rows:

INSERT INTO fact
SELECT FLOOR(1+ rand()*9999),
       FLOOR(1 + rand()*499),
       (select name from names where id = 1 + rand() * 4),
       MD5(1+rand()*9999),
       rand()
 FROM fact;

As you repeat the INSERT … SELECT, the table will grow exponentially. You may want to add a LIMIT clause to the INSERT … SELECT to reduce the amount of data generated as the table grows.
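
For example, once the table is reasonably large, a capped variant of the same statement looks like this (the LIMIT value is arbitrary):

INSERT INTO fact
SELECT FLOOR(1+ rand()*9999),
       FLOOR(1 + rand()*499),
       (select name from names where id = 1 + rand() * 4),
       MD5(1+rand()*9999),
       rand()
 FROM fact
 LIMIT 100000;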

This creates a table with a fairly even data distribution for each column. You can then run some additional statements to add skew, using either INSERT … SELECT or UPDATE, for example:

INSERT INTO fact
SELECT 1,1,'Justin',md5(''), .1
  FROM fact
 LIMIT 10000;

That will skew the values by creating many rows with the same data as our initial row.

Using these simple tools, you can generate a data set that is great for testing purposes. For example, dim1 might be a customer_id and dim2 a product_id, and you would populate the corresponding dimension tables with 10000 and 500 rows, respectively; a sketch follows.
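
A rough sketch of that last step (the dimension table and column names are hypothetical; dim1 and dim2 were generated in the ranges used above):

CREATE TABLE customer (customer_id int primary key, customer_hash varchar(32));
CREATE TABLE product (product_id int primary key, product_hash varchar(32));

-- one row per distinct key value that appears in the fact table
INSERT INTO customer SELECT DISTINCT dim1, MD5(dim1) FROM fact;
INSERT INTO product SELECT DISTINCT dim2, MD5(dim2) FROM fact;

-- example test query joining the fact table to a dimension
SELECT c.customer_id, SUM(f.measure1) AS total
  FROM fact f JOIN customer c ON c.customer_id = f.dim1
 GROUP BY c.customer_id
 ORDER BY total DESC
 LIMIT 10;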

The post Generating test data from the mysql> prompt appeared first on MySQL Performance Blog.

Aug
27
2014
--

Trawling the binlog with FlexCDC and new FlexCDC plugins for MySQL

Swanhart-Tools includes FlexCDC, a change data capture tool for MySQL. FlexCDC follows a server’s binary log and usually writes “changelogs” that track the changes to tables in the database. I say usually because the latest version of Swanhart-Tools (only in github for now) supports FlexCDC plugins, which allow you to send the updates to a remote data source, or to any other place of your liking.  You can find out more about FlexCDC basics in a previous blog post.

Please note that FlexCDC still needs to have source and destination instances defined in the configuration, even when using plugins. This is because the FlexCDC state (how far into which binary log FlexCDC has progressed, and which tables are being captured) is stored in the “dest”. Normally when using a plugin, the source and destination instances will be the same. FlexCDC will create a ‘flexviews’ database with a number of state tables in the destination instance. This also means you still have to use the create_mvlog.php / add_table.php scripts, or Flexviews’ create_mvlog(…) stored procedure, to mark which tables to capture! See the previous blog post about FlexCDC.
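
For example, marking a table for capture via the SQL_API stored procedure looks like this (the schema and table names are simply the ones used in the sample output at the end of this post):

mysql> call flexviews.create_mvlog('test', 't3');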

When you create the mvlog, there will still be a changelog table created in the dest, just like when not using a plugin. This is because the INFORMATION_SCHEMA is used to get column data types and additional information (such as whether an int is signed or unsigned), and this lookup is done against the table in the dest. The reason this is needed is that mysqlbinlog, the utility used to scrape the binlog, produces ambiguous output for large signed integers (it provides both the signed and the unsigned interpretation), so FlexCDC must figure out the right one from the actual DDL of the changelog table. FlexCDC can’t look at the DDL of the source table, though, because the consumer may be behind, and the current structure may not match the structure of the rows in the log.

The new plugin system allows you to do a lot of nifty things like:

  • Replicate to external databases
  • Publish changes to a message queue (this is like Facebook’s Wormhole)
  • Keep a remote cache server in sync
  • and more…

The latest version of Swanhart-Tools includes an Example plugin (in flexviews/consumer/include/example_plugin.php) that simply prints the events that come through it, not logging them into the changelog table at all. There is an example of the output at the end of the post.

The example plugin looks like this:

<?php
class FlexCDC_Plugin {
	static function begin_trx($uow_id, $gsn) {
		echo "START TRANSACTION: trx_id: $uow_id, Prev GSN: $gsn";
	}
	static function commit_trx($uow_id, $gsn) {
		echo "COMMIT: trx_id: $uow_id, Last GSN: $gsn";
	}
	static function rollback_trx($uow_id) {
		echo "ROLLBACK: trx_id: $uow_id";
	}
	static function insert($row, $db, $table, $trx_id, $gsn) {
		echo "TRX_ID: $trx_id, Schema:$db, Table: $table, DML: INSERT, AT: $gsn"; print_r($row);
	}
	static function delete($row, $db, $table, $trx_id, $gsn) {
		echo "TRX_ID: $trx_id, Schema:$db, Table: $table, DML: DELETE, AT: $gsn"; print_r($row);
	}
	static function update_before($row, $db, $table, $trx_id, $gsn) {
		echo "TRX_ID: $trx_id, Schema:$db, Table: $table, DML: UPDATE (OLD), AT: $gsn"; print_r($row);
	}
	static function update_after($row, $db, $table, $trx_id, $gsn) {
		echo "TRX_ID: $trx_id, Schema:$db, Table: $table, DML: UPDATE (NEW), AT: $gsn"; print_r($row);
	}
}

Important Note: You must define all seven of these functions in your plugin, even if you do not plan to have actions for each of the callbacks; just leave the function body empty to do nothing (the call is simply a no-op in that case). Note that the plugin functions must be declared STATIC. This is due to the way that FlexCDC calls the functions.

Transaction state callbacks
There are three callback functions which notify the plugin of changes in transaction state. Before I go into what they do, I want to note the $trx_id and $gsn parameters which are present in every callback. Each transaction is assigned a monotonically increasing transaction identifier. This sequence uniquely identifies each transaction that FlexCDC processes. In addition, each row change is assigned a unique sequence number which FlexCDC calls the Generic Sequence Number (or GSN).

As you can see, the begin_trx(…) callback (called when a transaction starts) is passed both the new transaction number and the highest GSN used in the previous transaction. This is called the GSN high water mark (GSNHWM). At transaction commit, the commit_trx(…) callback is called and the transaction id and the last GSN assigned in the transaction are passed into the callback. This same value will appear as the GSNHWM in the next begin_trx(…) callback. Finally, at rollback any sequence numbers assigned in that transaction will be re-used, so no GSN is passed to the rollback callback, but a transaction id is, which lets you determine exactly which transaction is rolling back.

Row change callbacks

Each of the four row change callback functions captures a particular change to the data set. Each function takes five parameters. The first ($row) is an array which contains the row being acted upon. The second ($db) is the schema which contains the row. The third ($table) is the table that contains the row. Each callback also receives the transaction identifier ($trx_id), and of course each row change is assigned a unique GSN ($gsn).

For example:
An update will fire both update_before(…) and update_after(…) callbacks with the row images before and after the change, respectively. There is an example of this at the end of the post.

Configuring FlexCDC to use a plugin
FlexCDC uses a configuration file called consumer.ini by default.  To the [flexcdc] section add:
plugin=plugin_file.php

The plugin must be in the FlexCDC include/ directory.  You will find example_plugin.php in this directory, to serve as an example.

How it works
Flexviews uses mysqlbinlog to decode the binary log from the source server. It uses the --base64-output=DECODE-ROWS and --verbose options to decode row-based replication events into a format which can be parsed by an external utility. FlexCDC collects information about each transaction and the row changes that happen in the database (which means ROW-based binary logging must be used). When a plugin is defined, the normal actions used by FlexCDC are overridden with the callback functions.

Here is the output from the example plugin, for an update that affected 3 rows (update test.t3 set c1 = c1 - 1):

START TRANSACTION: trx_id: 44, Prev GSN: 107
TRX_ID: 44, Schema:test, Table: t3, DML: UPDATE (OLD), AT: 108
Array
(
    [c1] => -3
    [c2] => 1
)
TRX_ID: 44, Schema:test, Table: t3, DML: UPDATE (NEW), AT: 109
Array
(
    [c1] => -4
    [c2] => 1
)
TRX_ID: 44, Schema:test, Table: t3, DML: UPDATE (OLD), AT: 110
Array
(
    [c1] => -5
    [c2] => 2
)
TRX_ID: 44, Schema:test, Table: t3, DML: UPDATE (NEW), AT: 111
Array
(
    [c1] => -6
    [c2] => 2
)
TRX_ID: 44, Schema:test, Table: t3, DML: UPDATE (OLD), AT: 112
Array
(
    [c1] => -5
    [c2] => 2
)
TRX_ID: 44, Schema:test, Table: t3, DML: UPDATE (NEW), AT: 113
Array
(
    [c1] => -6
    [c2] => 2
)
COMMIT: trx_id: 44, Last GSN: 113

One thing you should notice, is that FlexCDC provides column names for the data coming from the binary log. This is because the log table exists in the dest instance and FlexCDC can get the list of column names from there. When you use other CDC tools, like the C binlog API, you don’t get column names.

The post Trawling the binlog with FlexCDC and new FlexCDC plugins for MySQL appeared first on MySQL Performance Blog.

May
01
2014
--

Parallel Query for MySQL with Shard-Query

While Shard-Query can work over multiple nodes, this blog post focuses on using Shard-Query with a single node. Shard-Query can add parallelism to queries which use partitioned tables. Very large tables can often be partitioned fairly easily, and Shard-Query can leverage partitioning to add parallelism, because each partition can be queried independently. Because MySQL 5.6 supports the PARTITION hint, Shard-Query can add parallelism to any partitioning method (even subpartitioning) on 5.6, but it is limited to RANGE/LIST partitioning methods on earlier versions.

The output from Shard-Query shown here is from the command-line client, but you can use MySQL Proxy to communicate with Shard-Query too.

In the examples I am going to use the schema from the Star Schema Benchmark.  I generated data for scale factor 10, which means about 6GB of data in the largest table. I am going to show a few different queries, and explain how Shard-Query executes them in parallel.

Here is the DDL for the lineorder table, which I will use for the demo queries:

CREATE TABLE IF NOT EXISTS lineorder
(
 LO_OrderKey bigint not null,
 LO_LineNumber tinyint not null,
 LO_CustKey int not null,
 LO_PartKey int not null,
 LO_SuppKey int not null,
 LO_OrderDateKey int not null,
 LO_OrderPriority varchar(15),
 LO_ShipPriority char(1),
 LO_Quantity tinyint,
 LO_ExtendedPrice decimal,
 LO_OrdTotalPrice decimal,
 LO_Discount decimal,
 LO_Revenue decimal,
 LO_SupplyCost decimal,
 LO_Tax tinyint,
 LO_CommitDateKey int not null,
 LO_ShipMode varchar(10),
 primary key(LO_OrderDateKey,LO_PartKey,LO_SuppKey,LO_Custkey,LO_OrderKey,LO_LineNumber)
) PARTITION BY HASH(LO_OrderDateKey) PARTITIONS 8;

Notice that the lineorder table is partitioned by HASH(LO_OrderDateKey) into 8 partitions.  I used 8 partitions and my test box has 4 cores. It does not hurt to have more partitions than cores. A number of partitions that is two or three times the number of cores generally works best because it keeps each partition small, and smaller partitions are faster to scan. If you have a very large table, a larger number of partitions may be acceptable. Shard-Query will submit a query to Gearman for each partition, and the number of Gearman workers controls the parallelism.
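
If you want to verify how evenly the HASH partitioning spreads the rows (and therefore how balanced the parallel work will be), INFORMATION_SCHEMA gives an approximate per-partition row count (the 'ssb' schema name is taken from the EXPLAIN output later in this post):

SELECT PARTITION_NAME, TABLE_ROWS
  FROM INFORMATION_SCHEMA.PARTITIONS
 WHERE TABLE_SCHEMA = 'ssb' AND TABLE_NAME = 'lineorder';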

The SQL for the first demo is:

SELECT COUNT(DISTINCT LO_OrderDateKey) FROM lineorder;

Here is the explain from regular MySQL:

mysql> explain select count(distinct LO_OrderDateKey) from lineorder\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 25
          ref: NULL
         rows: 58922188
        Extra: Using index
1 row in set (0.00 sec)

 

So it is basically a full table scan. It takes a long time:

mysql> select count(distinct LO_OrderDateKey) from lineorder;
+---------------------------------+
| count(distinct LO_OrderDateKey) |
+---------------------------------+
|                            2406 |
+---------------------------------+
1 row in set (4 min 48.63 sec)

 

Shard-Query executes this query differently from MySQL. It sends a query to each partition, in parallel like the following queries:

Array
(
    [0] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p0)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [1] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p1)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [2] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p2)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [3] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p3)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [4] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p4)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [5] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p5)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [6] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p6)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [7] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p7)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
)

You will notice that there is one query for each partition. Those queries will be sent to Gearman and executed in parallel by as many Gearman workers as possible (in this case 4). The output of the queries goes into a coordinator table, and then another query does the final aggregation. That query looks like this:

SELECT COUNT(distinct expr_2839651562) AS `count`
FROM `aggregation_tmp_73522490`

The Shard-Query time:

select count(distinct LO_OrderDateKey) from lineorder;
Array
(
    [count ] => 2406
)
1 rows returned
Exec time: 0.10923719406128

That isn’t a typo, it really is sub-second compared to minutes in regular MySQL.

This is because Shard-Query uses GROUP BY to answer this query and a  loose index scan of the PRIMARY KEY is possible:

mysql> explain partitions SELECT LO_OrderDateKey AS expr_2839651562
    -> FROM lineorder  PARTITION(p7)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    -> \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
   partitions: p7
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 80108
        Extra: Using index for group-by
1 row in set (0.00 sec)

Next another simple query will be tested, first on regular MySQL:

mysql> select count(*) from lineorder;
+----------+
| count(*) |
+----------+
| 59986052 |
+----------+
1 row in set (4 min 8.70 sec)

Again, the EXPLAIN shows a full table scan:

mysql> explain select count(*) from lineorder\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 25
          ref: NULL
         rows: 58922188
        Extra: Using index
1 row in set (0.00 sec)

Now, Shard-Query can’t do anything special to speed up this query, except to execute it in parallel, similar to the first query:

[0] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1 AND 1=1
[1] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p1) AS `lineorder` WHERE 1=1 AND 1=1
[2] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p2) AS `lineorder` WHERE 1=1 AND 1=1
[3] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 AND 1=1
...

The aggregation SQL is similar, but this time the aggregate function is changed to SUM to combine the COUNT from each partition:

SELECT SUM(expr_3190753946) AS ` count `
FROM `aggregation_tmp_51969525`

And the query is quite a bit faster, at 140.24 seconds compared with MySQL’s 248.7-second result:

Array
(
[count ] => 59986052
)
1 rows returned
Exec time: 140.24419403076

Finally, I want to look at a more complex query that uses joins and aggregation.

mysql> explain select d_year, c_nation,  sum(lo_revenue - lo_supplycost) as profit  from lineorder
join dim_date  on lo_orderdatekey = d_datekey
join customer  on lo_custkey = c_customerkey
join supplier  on lo_suppkey = s_suppkey
join part  on lo_partkey = p_partkey
where  c_region = 'AMERICA'  and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1'  or p_mfgr = 'MFGR#2')
group by d_year, c_nation  order by d_year, c_nation;
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+------+---------------------------------+
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                      | rows | Extra                           |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+------+---------------------------------+
|  1 | SIMPLE      | dim_date  | ALL    | PRIMARY       | NULL    | NULL    | NULL                     |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | lineorder | ref    | PRIMARY       | PRIMARY | 4       | ssb.dim_date.D_DateKey   |   89 | NULL                            |
|  1 | SIMPLE      | supplier  | eq_ref | PRIMARY       | PRIMARY | 4       | ssb.lineorder.LO_SuppKey |    1 | Using where                     |
|  1 | SIMPLE      | customer  | eq_ref | PRIMARY       | PRIMARY | 4       | ssb.lineorder.LO_CustKey |    1 | Using where                     |
|  1 | SIMPLE      | part      | eq_ref | PRIMARY       | PRIMARY | 4       | ssb.lineorder.LO_PartKey |    1 | Using where                     |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+------+---------------------------------+
5 rows in set (0.01 sec)

Here is the query on regular MySQL:

mysql> select d_year, c_nation,  sum(lo_revenue - lo_supplycost) as profit  from lineorder  join dim_date  on lo_orderdatekey = d_datekey  join customer  on lo_custkey = c_customerkey  join supplier  on lo_suppkey = s_suppkey  join part  on lo_partkey = p_partkey  where  c_region = 'AMERICA'  and s_region = 'AMERICA'  and (p_mfgr = 'MFGR#1'  or p_mfgr = 'MFGR#2')  group by d_year, c_nation  order by d_year, c_nation;
+--------+---------------+--------------+
| d_year | c_nation      | profit       |
+--------+---------------+--------------+
|   1992 | ARGENTINA     | 102741829748 |
...
|   1998 | UNITED STATES |  61345891337 |
+--------+---------------+--------------+
35 rows in set (11 min 56.79 sec)

Again, Shard-Query splits up the query to run over each partition (I won’t bore you with the details) and it executes the query faster than MySQL, in 343.3 seconds compared to roughly 717 seconds:

Array
(
    [d_year] => 1998
    [c_nation] => UNITED STATES
    [profit] => 61345891337
)
35 rows returned
Exec time: 343.29854893684

I hope you see how using Shard-Query can speed up queries without using sharding, on just a single server. All you really need to do is add partitioning.
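
For completeness, here is a hedged sketch of what adding partitioning could look like on the fact table (hash partitioning on the date key is just one illustrative choice, and keep in mind that MySQL requires the partitioning column to be part of every unique key on the table):

-- Illustrative only: split lineorder into 8 partitions so Shard-Query can
-- run one worker query per partition
ALTER TABLE lineorder
  PARTITION BY HASH (LO_OrderDateKey)
  PARTITIONS 8;

-- Confirm the partitions that Shard-Query will enumerate
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'ssb' AND TABLE_NAME = 'lineorder';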

You can get Shard-Query from GitHub at http://github.com/greenlion/swanhart-tools

Please note: Configure and install Shard-Query as normal, but simply use one node and set the column option (the shard column) to “nocolumn” or false, because you are not required to use a shard column if you are not sharding.

The post Parallel Query for MySQL with Shard-Query appeared first on MySQL Performance Blog.

Sep
12
2013
--

MySQL webinar: ‘Introduction to open source column stores’

Join me Wednesday, September 18 at 10 a.m. PDT for an hour-long webinar where I will introduce the basic concepts behind column store technology. The webinar’s title is: “Introduction to open source column stores.”

What will be discussed?

This webinar will talk about Infobright, LucidDB, MonetDB, Hadoop (Impala) and other column stores.

  • I will compare features between major column stores (both open and closed source).
  • Some benchmarks will be used to demonstrate the basic performance characteristics of the open source column stores.
  • There will be a question and answer session to ask me anything you like about column stores (you can also ask in the comments section below if you prefer to submit them in advance).

Who should attend?

  • Anyone interested in analytics or OLAP
  • Those new to column store technology who want to find out how they will benefit from using a column store
  • Database administrators or users that have to query large amounts of data quickly, especially in aggregate

T-Shirts!

A randomly selected participant in the webinar will be chosen to receive a free Percona t-shirt.  You must be present at the end of the webinar during the question and answer section to be eligible for the shirt.

You can also win a Percona t-shirt by answering one or more of the following questions (the first to answer correctly wins; one winner only, sorry):

  • What is one advantage of columnar storage?
  • What is one big drawback of columnar storage?
  • Describe (or link to) an application or project that uses columnar storage

Register here.

The post MySQL webinar: ‘Introduction to open source column stores’ appeared first on MySQL Performance Blog.

May
22
2013
--

MySQL and the SSB – Part 2 – MyISAM vs InnoDB low concurrency

This blog post is part two in what is now a continuing series on the Star Schema Benchmark.

In my previous blog post I compared MySQL 5.5.30 to MySQL 5.6.10, both with default settings using only the InnoDB storage engine.  In my testing I discovered that innodb_old_blocks_time had an effect on performance of the benchmark.  There was some discussion in the comments and I promised to follow up with more SSB tests at a later date.

I also promised more low concurrency SSB tests when Peter blogged about the importance of performance at low concurrency.

The SSB
The SSB tests a database’s ability to optimize queries for a star schema. A star schema presents some unique challenges to the database optimizer. The SSB benchmark consists of four sets of queries. Each set is known as a “flight”. I have labeled each query as Q{FLIGHT_NUMBER}.{QUERY_NUMBER}. In general, each flight examines different time periods or different regions. The flights represent the type of investigations and drill-downs that are common in OLAP analysis.

Each query in each flight (Q1.1 for example) is tested with a cold buffer pool. Then the query is tested again without restarting the database. The first test is described as the cold test, and the second as the hot test. The database software is restarted after the hot test. All OS caches are dropped at this time as well.
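
As a quick sanity check (my addition, not part of the original methodology), you can confirm that no SSB pages remain cached in the buffer pool before a cold run; on MySQL 5.6 the buffer pool contents are exposed through information_schema:

-- Should return 0 (or close to it) immediately after a restart
SELECT COUNT(*) AS ssb_pages
FROM information_schema.innodb_buffer_page_lru
WHERE table_name LIKE '%ssb%';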

This set of queries was tested on the SSB at SCALE FACTOR 20. This means there is approximately 12GB of data in the largest table.

You can find the individual SSB query definitions in my previous blog post.

Test environment
These tests were done on a relatively fast machine with a Xeon E5-2680 (8 cores, 16 threads), fast IO (an OCZ R4 1.6TB card) and 128GB of memory. For the hot test, the data fits in the buffer pool and has already been loaded by the cold test. The buffer pool and adaptive hash index are cold for the cold test. All tests were done with no concurrency. The hardware for this test was provided by Adotomi. I will be blogging about the raw performance of the OCZ card in another post.

Also, while it is labeled on the graphs, it is important to note that in all cases, lower times are better.

SSB Flight #1
Here you will see the start of an interesting trend. MyISAM is faster when the data is not cached (the cold run) but is slower in the hot (cached) run. I did some investigation during the testing and found that InnoDB does more IO than MyISAM when the database is cold, but uses less CPU time when the database is hot. I am only speculating (and I can investigate further), but I believe the adaptive hash index is improving performance of InnoDB significantly during the hot run, as hash indexes are faster than a b-tree index. Also accessing pages from the buffer pool should be faster than getting them from the OS cache, which is another advantage of InnoDB.

[Charts: SSB Flight #1 response times, MyISAM vs InnoDB, cold and hot runs (lower is better)]

SSB Flight #2
Flight #2 is similar to Flight #1. MyISAM is faster than InnoDB when the database is cold, but the opposite is true when the database is hot.

[Charts: SSB Flight #2 response times, MyISAM vs InnoDB, cold and hot runs (lower is better)]

SSB Flight #3
Here, in some cases, MyISAM is substantially faster than InnoDB both cold and hot.

[Charts: SSB Flight #3 response times, MyISAM vs InnoDB, cold and hot runs (lower is better)]

SSB Flight #4
There is one query in this flight, Q4.3, which is faster using MyISAM than InnoDB. Like the queries in Flight #3 that are faster using MyISAM, Q4.3 examines very little data. It seems that InnoDB performs better when a larger number of rows must be joined together (Q4.1, Q4.2) but worse when small amounts of data are examined.

[Charts: SSB Flight #4 response times, MyISAM vs InnoDB, cold and hot runs (lower is better)]

Conclusion

In some cases MyISAM is faster than InnoDB, but usually only when the buffer pool is cold. Please don’t take away that you should be using MyISAM for everything! MyISAM may be good for raw performance, but it imposes limitations that are difficult to work with. MyISAM does not maintain checksum consistency during regular operations and is not ACID compliant. MyISAM and InnoDB may also perform differently under concurrency, which this benchmark does not cover; I will follow up on concurrency in another blog post in this series. Regardless, when the working set fits in memory, InnoDB almost always performs better, at least for this workload.

Notes

MySQL version used: 5.6.11, custom compiled to remove performance_schema

For the InnoDB tests, a 64GB buffer pool was used. O_DIRECT was used, so there was no caching of data at the filesystem level. The InnoDB indexes were built using ALTER TABLE fast index creation (merge sort).

For the MyISAM tests I used a 10GB key buffer. I used ALTER TABLE DISABLE KEYS and built the keys with sort via ALTER TABLE ENABLE KEYS.
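
For reference, here is a hedged sketch of those two index-build approaches (the exact DDL used for the benchmark isn’t shown here; the index names are illustrative):

-- InnoDB: load the data first, then add secondary indexes so they are
-- built with fast index creation (merge sort) rather than row by row
ALTER TABLE lineorder ADD KEY (LO_OrderDateKey), ADD KEY (LO_CustKey);

-- MyISAM: disable non-unique indexes, bulk load, then rebuild the keys by sort
ALTER TABLE lineorder DISABLE KEYS;
-- ... LOAD DATA INFILE ... ;
ALTER TABLE lineorder ENABLE KEYS;
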
my.cnf

[mysqld]
datadir=/mnt/mysql56/data
basedir=/usr/local/mysql
socket=/var/lib/mysql/mysql.sock
user=justin
innodb_buffer_pool_size=64G
innodb_log_file_size=4G
innodb_file_per_table
innodb_stats_on_metadata=off
innodb_file_format=barracuda
innodb_log_buffer_size=32M
innodb_buffer_pool_instances=16
metadata_locks_hash_instances=32
table_open_cache_instances=8
sort_buffer_size=128k
read_rnd_buffer_size=8M
join_buffer_size=8M
default_tmp_storage_engine=myisam
tmpdir=/dev/shm
innodb_undo_logs=32
innodb_old_blocks_time=0
table_open_cache=2048
table_definition_cache=16384
innodb_flush_method=O_DIRECT
key_buffer_size=10G
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
innodb_stats_persistent
innodb_stats_auto_update=off
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

The post MySQL and the SSB – Part 2 – MyISAM vs InnoDB low concurrency appeared first on MySQL Performance Blog.

Mar
11
2013
--

MySQL 5.6 vs MySQL 5.5 and the Star Schema Benchmark


So far most of the benchmarks posted about MySQL 5.6 use the sysbench OLTP workload.  I wanted to test a set of queries which, unlike sysbench, utilize joins.  I also wanted an easily reproducible set of data which is more rich than the simple sysbench table.  The Star Schema Benchmark (SSB) seems ideal for this.

I wasn’t going to focus on the performance of individual queries in this post, but instead intended to focus only on the overall response time for answering all of the queries in the benchmark. I got some strange results, however, which showed MySQL 5.6.10 to be much slower than MySQL 5.5.30 even with only a single connection. I felt these results warranted deeper investigation, so I did some research and detailed my findings here.

Just a few notes:
I tested two scenarios: a buffer pool much smaller than the data set (the default size of 128MB, which is about 1/8th of the data) and a 4G buffer pool, which is larger than the data. Very little tuning was done. The goal was to see how MySQL 5.6 performs out-of-the-box compared to 5.5.30 with default settings. Non-default settings were tried only to dig deeper into the performance differences and are documented in the post.

This blog post is not a definitive conclusion about innodb_old_blocks_pct or innodb_old_blocks_time. It does highlight how a data set much larger than the buffer pool may perform worse with innodb_old_blocks_time=1000, but as I said this needs further investigation. In particular, follow-up is needed on testing innodb_old_blocks_time=1000 on MySQL 5.5.30 and on testing multiple buffer pools on MySQL 5.5.30. Finally, MySQL 5.6.10 has many additional tuning options which must be investigated (MRR, BKA, ICP, etc.) before coming to further conclusions. These will be the topic of further blog posts.

Benchmark Details:

The SSB employs a data generator which produces data for a star schema.  Star schema are commonly used for analytics because it is extremely easy to construct queries against.  It is also very easy to define an OLAP cube over a star schema, so they are popular for use with tools like Mondrian and also for data mining.  I wrote an earlier blog post which describes the differences between major schema types.

  • I used the SSB data set at scale factor 1. Scale factor 1 results in 587MB of raw data, mostly in one table (lineorder).
  • Each of the 13 queries was executed serially in a single connection.
  • I modified the queries to use ANSI JOIN syntax. No other changes to the queries were made.

Test Environment

  • The MySQL versions used at the time of this post are 5.5.30 and 5.6.10, each of which was GA when this was written.
    • I compiled both servers from source (cmake -gui .; make; make install)
    • The only changes from the defaults were that both servers were compiled without the PERFORMANCE_SCHEMA, and that unique paths were used for basedir and datadir
  • I tested three configurations:
    • Config 1: Default config for MySQL 5.5 and MySQL 5.6, no tuning at all
    • Config 2: MySQL 5.6 with all default settings except innodb_old_blocks_time=0
    • Config 3: MySQL 5.5 and 5.6 with a 4G buffer pool instead of the default 128M

Rationale:

  • Since O_DIRECT is not used by default, the file system cache will give better read performance after first run (but not as good as warm buffer pool)
  • Thus, the results marked COLD are the results after the server reboot, when the FS cache is cold
  • The remaining results are runs without a server restart.  For the default size BP, this means the FS cache is warm.  For the 4G BP, the BP is completely warm.
    • The idea here is to test the situation when the buffer pool is smaller than data and the IO is slow (when the  FS cache is cold, IO to slow IO subsystem happens)
    • Repeated runs test a buffer pool which is smaller than the data but underlying IO is fast (a warm FS cache reduces IO cost significantly)
    • And finally, testing with a 4G buffer pool shows how the system performs when the data fits completely into the buffer pool (no IO on repeat runs)

Test Server:

    • Intel Core i7-970 @ 3.20GHz. 12 logical cores (six physical cores).
    • 12GB memory
    • 4 disk 7200RPM RAID 10 array with 512MB write-back cache

Star Schema Benchmark – Scale Factor 1 – MySQL 5.5 vs 5.6
Response times are in seconds (lower is better)

Version                            Buffer  Cold    Run1    Run2    Run3
5.5.30                             128M    361.49  189.29  189.34  189.40
5.6.10                             128M    362.31  324.25  320.74  318.84
5.6.10 (innodb_old_blocks_time=0)  128M    349.24  178.80  178.55  179.07
5.5.30                             4G      200.87   20.53   20.36   20.35
5.6.10                             4G      195.33   14.41   14.45   14.61

I started by running the benchmark against MySQL 5.5.30.  It took 361.49 seconds to complete all 13 queries.  I then repeated the run three more times.  The speed is very consistent, just a few tenths of a second off per run.  I then rebooted the machine and fired up 5.6.10.   I ran the test, and to my surprise MySQL 5.6.10 did not get much faster during the repeat runs, compared to the initial cold run.  I stopped the MySQL 5.6 server, rebooted and verified again.  Same issue.  This was very different from MySQL 5.5.30, which performs significantly better on the repeat warm runs.

Just to be sure it wasn’t a disk problem, I pointed MySQL 5.6.10 at the MySQL 5.5.30 data directory. The speed was essentially the same. I did some further investigation and determined that there was a lower buffer pool hit ratio during the MySQL 5.6 runs, and MySQL 5.6.10 was doing more IO as a consequence. To confirm that this was indeed the problem, I decided to compare performance with a buffer pool much larger than the data size, so I configured the server with a 4GB buffer pool. I tested both versions, and as you can see above, MySQL 5.6 outperformed MySQL 5.5.30 with the big buffer pool.
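
As an aside (my addition, not part of the original post), the buffer pool hit ratio can be approximated from the same counters queried later in this post:

-- Approximate buffer pool hit ratio: 1 - (disk reads / logical read requests)
SELECT 1 - (
         (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
           WHERE VARIABLE_NAME = 'INNODB_BUFFER_POOL_READS')
       / (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
           WHERE VARIABLE_NAME = 'INNODB_BUFFER_POOL_READ_REQUESTS')
       ) AS buffer_pool_hit_ratio;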

Why is the MySQL 5.6.10 with default settings test significantly slower than MySQL 5.5.30 in repeat runs?

I thought about the differences in the defaults between MySQL 5.5 and MySQL 5.6 and innodb_old_blocks_time immediately came to mind.  The InnoDB plugin introduced innodb_old_blocks_time to help control the behavior of the new split LRU mechanism which was implemented in the plugin.  In the original InnoDB, the LRU was implemented as a classic LRU which is subject to “pollution” by full table scans.  In the classic LRU, a full table scan pushes out important hot pages from the buffer pool often for an infrequent scan, like a backup or report.  In an OLTP system this can have very negative performance consequences.

The plugin attempts to fix this problem by splitting the LRU into hot and cold sections.  When a page is first read into the buffer pool it is placed at the head of the cold section of the LRU, where it begins to age naturally.  If the page is touched again while on the cold portion, it is moved to the head of the hot portion.

This sounds good in theory, but in practice it is problematic.  What usually happens is that the full table scans access the table by primary key.  This forces the storage engine to touch the same page numerous times in rapid succession.  This invariably moves the page onto the hot area, defeating the split.  In order to prevent this from happening, another variable innodb_old_blocks_time was introduced.

Innodb_old_blocks_time controls how long a page must be on the cold portion of the LRU before it is eligible to be moved to the hot portion.  In MySQL 5.5 and earlier, innodb_old_blocks_time defaults to a value of 0 (zero), which means that pages move rapidly from the cold portion to the hot portion because they must stay on the cold LRU for zero milliseconds before being able to move to the hot list.  In MySQL 5.6 the default value of innodb_old_blocks_time is changed to 1000.  The location at which a page is initially placed into the LRU is defined by innodb_old_blocks_pct.  The default value in both versions is 38, which happens to be about 3/8 of the buffer pool.

For this workload with a small buffer pool (the buffer pool is smaller than the working set), having innodb_old_blocks_time=1000 appears to cause a major performance regression.  The new setting changes which pages end up staying in the buffer pool and which are aged out.
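
If you want to experiment with this yourself, both settings are dynamic. A minimal sketch (the tests below only set innodb_old_blocks_time back to 0; the pct statement simply shows the default):

-- Inspect the LRU-related settings discussed above
SHOW GLOBAL VARIABLES LIKE 'innodb_old_blocks%';

-- Revert to the MySQL 5.5 behavior: pages may be promoted to the hot list immediately
SET GLOBAL innodb_old_blocks_time = 0;

-- innodb_old_blocks_pct controls where new pages enter the LRU (38 is the default)
SET GLOBAL innodb_old_blocks_pct = 38;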

Digging into why innodb_old_blocks_time changes the performance

Each “flight” of queries represents a set of drill-down queries to find an anomaly.  I am going to focus on the first query, which uses only one join. Since it is practical for a query with only one join, I’ve tested performance of the query with the join in both directions.
Explain for query Q1.1:

mysql> select straight_join sum(lo_extendedprice*lo_discount) as revenue
from lineorder join  dim_date on lo_orderdatekey = d_datekey
where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
+--------------+
| revenue      |
+--------------+
| 446268068091 |
+--------------+
1 row in set (33.94 sec)
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
         type: ALL
possible_keys: LO_OrderDateKey
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5996539
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: dim_date
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: ssb.lineorder.LO_OrderDateKey
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

After running the query, see how many pages were read from disk versus how many page requests there were:

mysql> select * from information_schema.global_status where variable_name like '%innodb%read%';
+---------------------------------------+----------------+
| VARIABLE_NAME                         | VARIABLE_VALUE |
+---------------------------------------+----------------+
| INNODB_BUFFER_POOL_READ_AHEAD_RND     | 0              |
| INNODB_BUFFER_POOL_READ_AHEAD         | 38392          |
| INNODB_BUFFER_POOL_READ_AHEAD_EVICTED | 0              |
| INNODB_BUFFER_POOL_READ_REQUESTS      | 6731100        |
| INNODB_BUFFER_POOL_READS              | 570            |
| INNODB_DATA_PENDING_READS             | 0              |
| INNODB_DATA_READ                      | 640536576      |
| INNODB_DATA_READS                     | 38972          |
| INNODB_PAGES_READ                     | 38961          |
| INNODB_ROWS_READ                      | 6611119        |
+---------------------------------------+----------------+
10 rows in set (0.00 sec)
Here are the contents of the buffer pool in pages afterwards:
mysql> select sq.*, pages / (@@innodb_buffer_pool_size / 16384) * 100 pct_buffer_pool
from (
select table_name, index_name, count(*) pages, sum(is_old='YES') old,
count(*) - sum(is_old='YES') hot, sum(number_records) records
from information_schema.innodb_buffer_page_lru
where  table_name like '%ssb%' group by 1,2
) sq
order by pct_buffer_pool desc;
+-------------------+------------------+-------+------+------+---------+-----------------+
| table_name        | index_name       | pages | old  | hot  | records | pct_buffer_pool |
+-------------------+------------------+-------+------+------+---------+-----------------+
| `ssb`.`lineorder` | GEN_CLUST_INDEX  |  6909 | 2559 | 4350 | 1083172 |         84.3384 |
| `ssb`.`lineorder` | LO_PartKey       |    17 |    0 |   17 |    9979 |          0.2075 |
| `ssb`.`lineorder` | LO_CommitDateKey |    17 |    0 |   17 |   10776 |          0.2075 |
| `ssb`.`lineorder` | LO_OrderDateKey  |    17 |    0 |   17 |   10376 |          0.2075 |
| `ssb`.`dim_date`  | PRIMARY          |    17 |    0 |   17 |    2481 |          0.2075 |
| `ssb`.`lineorder` | LO_CustKey       |    16 |    0 |   16 |    8616 |          0.1953 |
| `ssb`.`lineorder` | LO_OrderKey      |    16 |    0 |   16 |   10943 |          0.1953 |
| `ssb`.`lineorder` | LO_SuppKey       |    15 |    0 |   15 |   11466 |          0.1831 |
+-------------------+------------------+-------+------+------+---------+-----------------+
8 rows in set (0.12 sec)
And the InnoDB stats:
mysql> select * from information_schema.innodb_buffer_pool_stats\G
*************************** 1. row ***************************
                         POOL_ID: 0
                       POOL_SIZE: 8191
                    FREE_BUFFERS: 1024
                  DATABASE_PAGES: 7162
              OLD_DATABASE_PAGES: 2623
         MODIFIED_DATABASE_PAGES: 0
              PENDING_DECOMPRESS: 0
                   PENDING_READS: 0
               PENDING_FLUSH_LRU: 0
              PENDING_FLUSH_LIST: 0
                PAGES_MADE_YOUNG: 3
            PAGES_NOT_MADE_YOUNG: 4824154
           PAGES_MADE_YOUNG_RATE: 0
       PAGES_MADE_NOT_YOUNG_RATE: 0
               NUMBER_PAGES_READ: 38960
            NUMBER_PAGES_CREATED: 0
            NUMBER_PAGES_WRITTEN: 1
                 PAGES_READ_RATE: 0
               PAGES_CREATE_RATE: 0
              PAGES_WRITTEN_RATE: 0
                NUMBER_PAGES_GET: 6731253
                        HIT_RATE: 0
    YOUNG_MAKE_PER_THOUSAND_GETS: 0
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: 0
         NUMBER_PAGES_READ_AHEAD: 38457
       NUMBER_READ_AHEAD_EVICTED: 0
                 READ_AHEAD_RATE: 0
         READ_AHEAD_EVICTED_RATE: 0
                    LRU_IO_TOTAL: 431
                  LRU_IO_CURRENT: 0
                UNCOMPRESS_TOTAL: 0
              UNCOMPRESS_CURRENT: 0
1 row in set (0.00 sec)

Now compare the difference with innodb_old_blocks_time=0:

mysql> set global innodb_old_blocks_time=0;
Query OK, 0 rows affected (0.00 sec)
mysql> select straight_join sum(lo_extendedprice*lo_discount) as revenue
from lineorder join  dim_date on lo_orderdatekey = d_datekey
where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25\G
*************************** 1. row ***************************
revenue: 446268068091
1 row in set (7.81 sec)
mysql> select * from information_schema.global_status where variable_name like '%innodb%read%';
+---------------------------------------+----------------+
| VARIABLE_NAME                         | VARIABLE_VALUE |
+---------------------------------------+----------------+
| INNODB_BUFFER_POOL_READ_AHEAD_RND     | 0              |
| INNODB_BUFFER_POOL_READ_AHEAD         | 38461          |
| INNODB_BUFFER_POOL_READ_AHEAD_EVICTED | 0              |
| INNODB_BUFFER_POOL_READ_REQUESTS      | 6731687        |
| INNODB_BUFFER_POOL_READS              | 550            |
| INNODB_DATA_PENDING_READS             | 0              |
| INNODB_DATA_READ                      | 641339392      |
| INNODB_DATA_READS                     | 39021          |
| INNODB_PAGES_READ                     | 39010          |
| INNODB_ROWS_READ                      | 6611119        |
+---------------------------------------+----------------+
10 rows in set (0.00 sec)
mysql> select sq.*, pages / (@@innodb_buffer_pool_size / 16384) * 100 pct_buffer_pool
from (
select table_name, index_name, count(*) pages, sum(is_old='YES') old,
count(*) - sum(is_old='YES') hot, sum(number_records) records
from information_schema.innodb_buffer_page_lru
where  table_name like '%ssb%' group by 1,2
) sq
order by pct_buffer_pool desc;
+-------------------+-----------------+-------+------+------+---------+-----------------+
| table_name        | index_name      | pages | old  | hot  | records | pct_buffer_pool |
+-------------------+-----------------+-------+------+------+---------+-----------------+
| `ssb`.`lineorder` | GEN_CLUST_INDEX |  7085 | 2547 | 4538 | 1104291 |         86.4868 |
| `ssb`.`dim_date`  | PRIMARY         |    17 |   17 |    0 |    2481 |          0.2075 |
+-------------------+-----------------+-------+------+------+---------+-----------------+
2 rows in set (0.11 sec)
So there is more of lineorder in the buffer pool and the other secondary indexes have been pushed out of the buffer pool.
mysql> select * from information_schema.innodb_buffer_pool_stats\G
*************************** 1. row ***************************
                         POOL_ID: 0
                       POOL_SIZE: 8192
                    FREE_BUFFERS: 1024
                  DATABASE_PAGES: 7163
              OLD_DATABASE_PAGES: 2624
         MODIFIED_DATABASE_PAGES: 0
              PENDING_DECOMPRESS: 0
                   PENDING_READS: 0
               PENDING_FLUSH_LRU: 0
              PENDING_FLUSH_LIST: 0
                PAGES_MADE_YOUNG: 29501
            PAGES_NOT_MADE_YOUNG: 0
           PAGES_MADE_YOUNG_RATE: 951.6144640495468
       PAGES_MADE_NOT_YOUNG_RATE: 0
               NUMBER_PAGES_READ: 39009
            NUMBER_PAGES_CREATED: 0
            NUMBER_PAGES_WRITTEN: 1
                 PAGES_READ_RATE: 1249.8306506241734
               PAGES_CREATE_RATE: 0
              PAGES_WRITTEN_RATE: 0.032257023966968806
                NUMBER_PAGES_GET: 6731790
                        HIT_RATE: 995
    YOUNG_MAKE_PER_THOUSAND_GETS: 4
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: 0
         NUMBER_PAGES_READ_AHEAD: 38459
       NUMBER_READ_AHEAD_EVICTED: 0
                 READ_AHEAD_RATE: 1240.5728847456533
         READ_AHEAD_EVICTED_RATE: 0
                    LRU_IO_TOTAL: 531
                  LRU_IO_CURRENT: 0
                UNCOMPRESS_TOTAL: 0
              UNCOMPRESS_CURRENT: 0
1 row in set (0.01 sec)

Here is the difference in innodb_buffer_pool_stats, side by side:

INNODB_OLD_BLOCKS_TIME=0                     INNODB_OLD_BLOCKS_TIME=1000
*************************** 1. row ************************************************ 1. row *******
                         POOL_ID: 0                    *                         POOL_ID: 0
                       POOL_SIZE: 8192                 *                       POOL_SIZE: 8191
                    FREE_BUFFERS: 1024                 *                    FREE_BUFFERS: 1024
                  DATABASE_PAGES: 7163                 *                  DATABASE_PAGES: 7162
              OLD_DATABASE_PAGES: 2624                 *              OLD_DATABASE_PAGES: 2623
         MODIFIED_DATABASE_PAGES: 0                    *         MODIFIED_DATABASE_PAGES: 0
              PENDING_DECOMPRESS: 0                    *              PENDING_DECOMPRESS: 0
                   PENDING_READS: 0                    *                   PENDING_READS: 0
               PENDING_FLUSH_LRU: 0                    *               PENDING_FLUSH_LRU: 0
              PENDING_FLUSH_LIST: 0                    *              PENDING_FLUSH_LIST: 0
                PAGES_MADE_YOUNG: 29501                *                PAGES_MADE_YOUNG: 3
            PAGES_NOT_MADE_YOUNG: 0                    *            PAGES_NOT_MADE_YOUNG: 4824154
           PAGES_MADE_YOUNG_RATE: 951.6144640495468    *           PAGES_MADE_YOUNG_RATE: 0
       PAGES_MADE_NOT_YOUNG_RATE: 0                    *       PAGES_MADE_NOT_YOUNG_RATE: 0
               NUMBER_PAGES_READ: 39009                *               NUMBER_PAGES_READ: 38960
            NUMBER_PAGES_CREATED: 0                    *            NUMBER_PAGES_CREATED: 0
            NUMBER_PAGES_WRITTEN: 1                    *            NUMBER_PAGES_WRITTEN: 1
                 PAGES_READ_RATE: 1249.8306506241734   *                 PAGES_READ_RATE: 0
               PAGES_CREATE_RATE: 0                    *               PAGES_CREATE_RATE: 0
              PAGES_WRITTEN_RATE: 0.032257023966968806 *              PAGES_WRITTEN_RATE: 0
                NUMBER_PAGES_GET: 6731790              *                NUMBER_PAGES_GET: 6731253
                        HIT_RATE: 995                  *                        HIT_RATE: 0
    YOUNG_MAKE_PER_THOUSAND_GETS: 4                    *    YOUNG_MAKE_PER_THOUSAND_GETS: 0
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: 0                    *NOT_YOUNG_MAKE_PER_THOUSAND_GETS: 0
         NUMBER_PAGES_READ_AHEAD: 38459                *         NUMBER_PAGES_READ_AHEAD: 38457
       NUMBER_READ_AHEAD_EVICTED: 0                    *       NUMBER_READ_AHEAD_EVICTED: 0
                 READ_AHEAD_RATE: 1240.5728847456533   *                 READ_AHEAD_RATE: 0
         READ_AHEAD_EVICTED_RATE: 0                    *         READ_AHEAD_EVICTED_RATE: 0
                    LRU_IO_TOTAL: 531                  *                    LRU_IO_TOTAL: 431
                  LRU_IO_CURRENT: 0                    *                  LRU_IO_CURRENT: 0
                UNCOMPRESS_TOTAL: 0                    *                UNCOMPRESS_TOTAL: 0
              UNCOMPRESS_CURRENT: 0                                   UNCOMPRESS_CURRENT: 0

As promised, here are the results from joining the tables in the other direction:

mysql> select straight_join sum(lo_extendedprice*lo_discount) as revenue
from dim_date join lineorder on lo_orderdatekey = d_datekey
where d_year = 1993 and lo_discount between 1 and 3
and lo_quantity < 25\G
*************************** 1. row ***************************
revenue: 446268068091
1 row in set (22.54 sec)
mysql> explain select straight_join sum(lo_extendedprice*lo_discount) as revenue
    -> from dim_date join lineorder on lo_orderdatekey = d_datekey
    -> where d_year = 1993 and lo_discount between 1 and 3
    -> and lo_quantity < 25\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dim_date
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2704
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
         type: ref
possible_keys: LO_OrderDateKey
          key: LO_OrderDateKey
      key_len: 4
          ref: ssb.dim_date.D_DateKey
         rows: 2837
        Extra: Using where
2 rows in set (0.00 sec)
mysql> select * from information_schema.global_status where variable_name like '%innodb%read%';
+---------------------------------------+----------------+
| VARIABLE_NAME                         | VARIABLE_VALUE |
+---------------------------------------+----------------+
| INNODB_BUFFER_POOL_READ_AHEAD_RND     | 0              |
| INNODB_BUFFER_POOL_READ_AHEAD         | 0              |
| INNODB_BUFFER_POOL_READ_AHEAD_EVICTED | 0              |
| INNODB_BUFFER_POOL_READ_REQUESTS      | 3776369        |
| INNODB_BUFFER_POOL_READS              | 191571         |
| INNODB_DATA_PENDING_READS             | 0              |
| INNODB_DATA_READ                      | 3140882432     |
| INNODB_DATA_READS                     | 191581         |
| INNODB_PAGES_READ                     | 191570         |
| INNODB_ROWS_READ                      | 910844         |
+---------------------------------------+----------------+
10 rows in set (0.01 sec)
mysql> select sq.*, pages / ((@@innodb_buffer_pool_size / 16384)) * 100 pct_buffer_pool from (select table_name, index_name, count(*) pages, sum(is_old='YES') old, count(*) - sum(is_old='YES') hot, sum(number_records) records from information_schema.innodb_buffer_page_lru where  table_name like '%ssb%' group by 1,2) sq order by pct_buffer_pool desc;
+-------------------+------------------+-------+------+------+---------+-----------------+
| table_name        | index_name       | pages | old  | hot  | records | pct_buffer_pool |
+-------------------+------------------+-------+------+------+---------+-----------------+
| `ssb`.`lineorder` | GEN_CLUST_INDEX  |  6001 | 2095 | 3906 |  964974 |         73.2544 |
| `ssb`.`lineorder` | LO_OrderDateKey  |    31 |   28 |    3 |   18223 |          0.3784 |
| `ssb`.`dim_date`  | PRIMARY          |    17 |   11 |    6 |    2414 |          0.2075 |
| `ssb`.`lineorder` | LO_OrderKey      |    17 |   17 |    0 |   11320 |          0.2075 |
| `ssb`.`lineorder` | LO_PartKey       |    17 |   17 |    0 |   10095 |          0.2075 |
| `ssb`.`lineorder` | LO_CustKey       |    17 |   17 |    0 |    9874 |          0.2075 |
| `ssb`.`lineorder` | LO_CommitDateKey |    16 |   16 |    0 |   10775 |          0.1953 |
| `ssb`.`lineorder` | LO_SuppKey       |    16 |   16 |    0 |   11879 |          0.1953 |
+-------------------+------------------+-------+------+------+---------+-----------------+
8 rows in set (0.11 sec)
mysql> select * from information_schema.innodb_buffer_pool_stats\G
*************************** 1. row ***************************
                         POOL_ID: 0
                       POOL_SIZE: 8192
                    FREE_BUFFERS: 1024
                  DATABASE_PAGES: 6175
              OLD_DATABASE_PAGES: 2259
         MODIFIED_DATABASE_PAGES: 0
              PENDING_DECOMPRESS: 0
                   PENDING_READS: 0
               PENDING_FLUSH_LRU: 0
              PENDING_FLUSH_LIST: 0
                PAGES_MADE_YOUNG: 62
            PAGES_NOT_MADE_YOUNG: 2054952
           PAGES_MADE_YOUNG_RATE: 1.0508296469551364
       PAGES_MADE_NOT_YOUNG_RATE: 34829.104591447605
               NUMBER_PAGES_READ: 191834
            NUMBER_PAGES_CREATED: 0
            NUMBER_PAGES_WRITTEN: 1
                 PAGES_READ_RATE: 3246.91106930391
               PAGES_CREATE_RATE: 0
              PAGES_WRITTEN_RATE: 0.01694886527346994
                NUMBER_PAGES_GET: 3777151
                        HIT_RATE: 950
    YOUNG_MAKE_PER_THOUSAND_GETS: 0
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: 544
         NUMBER_PAGES_READ_AHEAD: 0
       NUMBER_READ_AHEAD_EVICTED: 0
                 READ_AHEAD_RATE: 0
         READ_AHEAD_EVICTED_RATE: 0
                    LRU_IO_TOTAL: 186940
                  LRU_IO_CURRENT: 0
                UNCOMPRESS_TOTAL: 0
              UNCOMPRESS_CURRENT: 0
1 row in set (0.00 sec)

And with innodb_old_blocks_time=0:

mysql> select straight_join sum(lo_extendedprice*lo_discount) as revenue
from dim_date join lineorder on lo_orderdatekey = d_datekey
where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25\G
*************************** 1. row ***************************
revenue: 446268068091
1 row in set (12.36 sec)
mysql> select * from information_schema.global_status where variable_name like '%innodb%read%';
+---------------------------------------+----------------+
| VARIABLE_NAME                         | VARIABLE_VALUE |
+---------------------------------------+----------------+
| INNODB_BUFFER_POOL_READ_AHEAD_RND     | 0              |
| INNODB_BUFFER_POOL_READ_AHEAD         | 0              |
| INNODB_BUFFER_POOL_READ_AHEAD_EVICTED | 0              |
| INNODB_BUFFER_POOL_READ_REQUESTS      | 3811806        |
| INNODB_BUFFER_POOL_READS              | 186407         |
| INNODB_DATA_PENDING_READS             | 0              |
| INNODB_DATA_READ                      | 3056275456     |
| INNODB_DATA_READS                     | 186417         |
| INNODB_PAGES_READ                     | 186406         |
| INNODB_ROWS_READ                      | 910844         |
+---------------------------------------+----------------+
10 rows in set (0.00 sec)
mysql> select sq.*, pages / ((@@innodb_buffer_pool_size / 16384)) * 100 pct_buffer_pool
from (
select table_name, index_name, count(*) pages,
sum(is_old='YES') old, count(*) - sum(is_old='YES') hot,
sum(number_records) records
from information_schema.innodb_buffer_page_lru
where  table_name like '%ssb%'
group by 1,2
) sq
order by pct_buffer_pool desc;
+-------------------+-----------------+-------+------+------+---------+-----------------+
| table_name        | index_name      | pages | old  | hot  | records | pct_buffer_pool |
+-------------------+-----------------+-------+------+------+---------+-----------------+
| `ssb`.`lineorder` | GEN_CLUST_INDEX |  6980 | 2563 | 4417 | 1119893 |         85.2051 |
| `ssb`.`lineorder` | LO_OrderDateKey |    47 |   17 |   30 |   30637 |          0.5737 |
| `ssb`.`dim_date`  | PRIMARY         |    12 |    0 |   12 |    1841 |          0.1465 |
+-------------------+-----------------+-------+------+------+---------+-----------------+
3 rows in set (0.12 sec)
mysql> select * from information_schema.innodb_buffer_pool_stats\G
*************************** 1. row ***************************
                         POOL_ID: 0
                       POOL_SIZE: 8192
                    FREE_BUFFERS: 1024
                  DATABASE_PAGES: 7047
              OLD_DATABASE_PAGES: 2581
         MODIFIED_DATABASE_PAGES: 0
              PENDING_DECOMPRESS: 0
                   PENDING_READS: 0
               PENDING_FLUSH_LRU: 0
              PENDING_FLUSH_LIST: 0
                PAGES_MADE_YOUNG: 194023
            PAGES_NOT_MADE_YOUNG: 0
           PAGES_MADE_YOUNG_RATE: 4850.4537386565335
       PAGES_MADE_NOT_YOUNG_RATE: 0
               NUMBER_PAGES_READ: 186422
            NUMBER_PAGES_CREATED: 0
            NUMBER_PAGES_WRITTEN: 1
                 PAGES_READ_RATE: 4653.858653533662
               PAGES_CREATE_RATE: 0
              PAGES_WRITTEN_RATE: 0.02499937501562461
                NUMBER_PAGES_GET: 3811961
                        HIT_RATE: 952
    YOUNG_MAKE_PER_THOUSAND_GETS: 50
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: 0
         NUMBER_PAGES_READ_AHEAD: 0
       NUMBER_READ_AHEAD_EVICTED: 0
                 READ_AHEAD_RATE: 0
         READ_AHEAD_EVICTED_RATE: 0
                    LRU_IO_TOTAL: 186024
                  LRU_IO_CURRENT: 0
                UNCOMPRESS_TOTAL: 0
              UNCOMPRESS_CURRENT: 0
1 row in set (0.00 sec)

Finally, I collected SHOW PROFILES information for the faster join direction (fact -> dimension)
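
The p1 and p2 tables in the query below hold the profiling rows for the two runs. The capture step isn’t shown in the post; here is a hedged sketch of how they could be built from INFORMATION_SCHEMA.PROFILING (query IDs 2 and 5 are hypothetical placeholders):

SET profiling = 1;
-- ... run the query with innodb_old_blocks_time=1000, then note its query id ...
SHOW PROFILES;
CREATE TABLE p1 AS
  SELECT * FROM INFORMATION_SCHEMA.PROFILING WHERE QUERY_ID = 2;

-- ... set innodb_old_blocks_time=0, run the query again, note its query id ...
CREATE TABLE p2 AS
  SELECT * FROM INFORMATION_SCHEMA.PROFILING WHERE QUERY_ID = 5;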

select p1.seq, p1.state state, p1.duration, p2.duration, p1.cpu_user + p1.cpu_system p1_cpu, p2.cpu_user + p2.cpu_system p2_cpu,
p1.context_voluntary + p1.context_involuntary p1_cs, p2.context_voluntary + p2.context_involuntary p2_cs,
p1.block_ops_in + p1.block_ops_out p1_block_ops, p2.block_ops_in + p2.block_ops_out p2_block_ops,
p1.page_faults_major + p1.page_faults_minor p1_pf, p2.page_faults_major + p2.page_faults_minor p2_pf
from p1 join p2 using(seq)
where p1.state = p2.state
order by p1.duration desc;
+-----+----------------------+-----------+----------+-----------+-----------+-------+-------+--------------+--------------+-------+-------+
| seq | state                | duration  | duration | p1_cpu    | p2_cpu    | p1_cs | p2_cs | p1_block_ops | p2_block_ops | p1_pf | p2_pf |
+-----+----------------------+-----------+----------+-----------+-----------+-------+-------+--------------+--------------+-------+-------+
|  12 | Sending data         | 33.764396 | 7.523023 | 40.173893 | 13.027019 |  4979 | 21399 |            0 |            0 |    90 |    90 |
|   5 | Opening tables       |  0.270664 | 0.295955 |  0.025996 |  0.024996 |    34 |    35 |         2056 |         1488 |    48 |    48 |
|   2 | starting             |  0.000230 | 0.000192 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |    29 |    29 |
|   9 | statistics           |  0.000130 | 0.000097 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |    26 |    26 |
|   6 | init                 |  0.000105 | 0.000138 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |    26 |    26 |
|  10 | preparing            |  0.000068 | 0.000064 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |    14 |    14 |
|  16 | freeing items        |  0.000049 | 0.000117 |  0.000000 |  0.001000 |     0 |     0 |            0 |            0 |     3 |     3 |
|   8 | optimizing           |  0.000048 | 0.000048 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     8 |     8 |
|   7 | System lock          |  0.000031 | 0.000031 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     1 |     1 |
|  13 | end                  |  0.000027 | 0.000026 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     1 |     1 |
|   4 | checking permissions |  0.000015 | 0.000014 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     0 |     0 |
|  15 | closing tables       |  0.000015 | 0.000016 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     0 |     0 |
|   3 | checking permissions |  0.000014 | 0.000014 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     0 |     0 |
|  11 | executing            |  0.000013 | 0.000013 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     0 |     0 |
|  14 | query end            |  0.000011 | 0.000012 |  0.000000 |  0.000000 |     0 |     0 |            0 |            0 |     0 |     0 |
+-----+----------------------+-----------+----------+-----------+-----------+-------+-------+--------------+--------------+-------+-------+
15 rows in set (0.00 sec)

 Here are my modified versions of the queries (just to use ANSI JOIN syntax):

-- Q1.1
select sum(lo_extendedprice*lo_discount) as
revenue
from lineorder join  dim_date on lo_orderdatekey = d_datekey
where
d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
-- Q1.2
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder
join  dim_date on lo_orderdatekey = d_datekey
where  d_yearmonthnum = 199401 and lo_discount
between 4 and 6 and lo_quantity between 26 and 35;
-- Q1.3
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder
join  dim_date on lo_orderdatekey = d_datekey
where  d_weeknuminyear = 6
and d_year = 1994
and lo_discount between 5 and 7 and lo_quantity between 26 and 35;
-- Q2.1
select sum(lo_revenue), d_year, p_brand
from lineorder
join dim_date
 on lo_orderdatekey = d_datekey
join part
on lo_partkey = p_partkey join supplier
on lo_suppkey = s_suppkey
where  p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand
order by d_year, p_brand;
-- Q2.2
select sum(lo_revenue), d_year, p_brand
from lineorder
join dim_date
on lo_orderdatekey = d_datekey
join part
on lo_partkey = p_partkey
join supplier
on lo_suppkey = s_suppkey
where  p_brand between 'MFGR#2221' and 'MFGR#2228'
and s_region = 'ASIA'
group by d_year, p_brand
order by d_year, p_brand;
-- Q2.3
select sum(lo_revenue), d_year, p_brand
from lineorder
join dim_date
on lo_orderdatekey = d_datekey
join part
on lo_partkey = p_partkey
join supplier
on lo_suppkey = s_suppkey
where  p_brand= 'MFGR#2239'
and s_region = 'EUROPE'
group by d_year, p_brand
order by d_year, p_brand;
-- Q3.1
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer
join lineorder
on lo_custkey = c_customerkey
join supplier
on lo_suppkey = s_suppkey
join dim_date   on lo_orderdatekey = d_datekey
where c_region = 'ASIA'
and s_region = 'ASIA'
and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;
-- Q3.2
select c_city, s_city, d_year, sum(lo_revenue) as revenue
from customer
join lineorder
on lo_custkey = c_customerkey
join supplier
on lo_suppkey = s_suppkey
join dim_date
on lo_orderdatekey = d_datekey
where c_nation = 'UNITED STATES'
and s_nation = 'UNITED STATES'
and d_year >= 1992
and d_year <= 1997
group by c_city, s_city, d_year
order by d_year asc, revenue desc;
-- Q3.3
select c_city, s_city, d_year, sum(lo_revenue) as revenue
from customer
join lineorder
on lo_custkey = c_customerkey
join supplier   on lo_suppkey = s_suppkey
join dim_date   on lo_orderdatekey = d_datekey
where (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_year >= 1992
and d_year <= 1997
group by c_city, s_city, d_year
order by d_year asc, revenue desc;
-- Q3.4
select c_city, s_city, d_year, sum(lo_revenue)
as revenue
from customer
join lineorder
  on lo_custkey = c_customerkey
join supplier
  on lo_suppkey = s_suppkey
join dim_date
  on lo_orderdatekey = d_datekey
where
(c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;
-- Q4.1
select d_year, c_nation,
sum(lo_revenue - lo_supplycost) as profit
from lineorder
join dim_date
  on lo_orderdatekey = d_datekey
join customer
  on lo_custkey = c_customerkey
join supplier
  on lo_suppkey = s_suppkey
join part
  on lo_partkey = p_partkey
where
c_region = 'AMERICA'
and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1'
or p_mfgr = 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation;
-- Q4.2
select d_year, s_nation, p_category,
sum(lo_revenue - lo_supplycost) as profit
from lineorder
join dim_date
  on lo_orderdatekey = d_datekey
join customer
  on lo_custkey = c_customerkey
join supplier
  on lo_suppkey = s_suppkey
join part
  on lo_partkey = p_partkey
where
c_region = 'AMERICA'
and s_region = 'AMERICA'
and (d_year = 1997 or d_year = 1998)
and (p_mfgr = 'MFGR#1'
or p_mfgr = 'MFGR#2')
group by d_year, s_nation, p_category
order by d_year, s_nation, p_category;
-- Q4.3
select d_year, s_city, p_brand,
sum(lo_revenue - lo_supplycost) as profit
from lineorder
join dim_date
  on lo_orderdatekey = d_datekey
join customer
  on lo_custkey = c_customerkey
join supplier
  on lo_suppkey = s_suppkey
join part
  on lo_partkey = p_partkey
where
s_nation = 'UNITED STATES'
and (d_year = 1997 or d_year = 1998)
and p_category = 'MFGR#14'
group by d_year, s_city, p_brand
order by d_year, s_city, p_brand;

And the schema:

DROP TABLE IF EXISTS customer;
CREATE TABLE IF NOT EXISTS customer
(
    C_CustomerKey int primary key,
    C_Name varchar(25),
    C_Address varchar(25),
    C_City varchar(10),
    C_Nation varchar(15),
    C_Region varchar(12),
    C_Phone varchar(15),
    C_MktSegment varchar(10),
    KEY(C_Name),
    KEY(C_City),
    KEY(C_Region),
    KEY(C_Phone),
    KEY(C_MktSegment)
);
DROP TABLE IF EXISTS part;
CREATE TABLE IF NOT EXISTS part
(
    P_PartKey int primary key,
    P_Name varchar(25),
    P_MFGR varchar(10),
    P_Category varchar(10),
    P_Brand varchar(15),
    P_Colour varchar(15),
    P_Type varchar(25),
    P_Size tinyint,
    P_Container char(10),
    key(P_Name),
    key(P_MFGR),
    key(P_Category),
    key(P_Brand)
);
DROP TABLE IF EXISTS supplier;
CREATE TABLE supplier
(
    S_SuppKey int primary key,
    S_Name char(25),
    S_Address varchar(25),
    S_City char(10),
    S_Nation char(15),
    S_Region char(12),
    S_Phone char(15),
    key(S_City),
    key(S_Name),
    key(S_Phone)
);
DROP TABLE IF EXISTS dim_date;
CREATE TABLE IF NOT EXISTS dim_date
(
    D_DateKey int primary key,
    D_Date char(18),
    D_DayOfWeek char(9),
    D_Month char(9),
    D_Year smallint,
    D_YearMonthNum int,
    D_YearMonth char(7),
    D_DayNumInWeek tinyint,
    D_DayNumInMonth tinyint,
    D_DayNumInYear smallint,
    D_MonthNumInYear tinyint,
    D_WeekNumInYear tinyint,
    D_SellingSeason char(12),
    D_LastDayInWeekFl tinyint,
    D_LastDayInMonthFl tinyint,
    D_HolidayFl tinyint,
    D_WeekDayFl tinyint
);
DROP TABLE IF EXISTS lineorder;
CREATE TABLE IF NOT EXISTS lineorder
(
    LO_OrderKey bigint not null,
    LO_LineNumber tinyint not null,
    LO_CustKey int not null,
    LO_PartKey int not null,
    LO_SuppKey int not null,
    LO_OrderDateKey int not null,
    LO_OrderPriority varchar(15),
    LO_ShipPriority char(1),
    LO_Quantity tinyint,
    LO_ExtendedPrice decimal,
    LO_OrdTotalPrice decimal,
    LO_Discount decimal,
    LO_Revenue decimal,
    LO_SupplyCost decimal,
    LO_Tax tinyint,
    LO_CommitDateKey int not null,
    LO_ShipMode varchar(10),
    KEY(LO_OrderKey, LO_LineNumber),
    KEY(LO_CustKey),
    KEY(LO_SuppKey),
    KEY(LO_PartKey),
    KEY(LO_OrderDateKey),
    KEY(LO_CommitDateKey)
);

The post MySQL 5.6 vs MySQL 5.5 and the Star Schema Benchmark appeared first on MySQL Performance Blog.
