Mar
10
2015

Advanced JSON for MySQL

What is JSON

JSON is an text based, human readable format for transmitting data between systems, for serializing objects and for storing document store data for documents that have different attributes/schema for each document. Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to inter-operate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with. Up until recently, JSON is just a TEXT document. I said up until recently, so what has changed? The biggest thing is that there are new JSON UDF by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDF are up to version 0.0.4. While these new UDF are a welcome edition to the MySQL database, they don’t solve the really tough JSON problems we face.

Searching

The JSON UDF provide a number of functions that make working with JSON easier, including the ability to extract portions of a document, or search a document for a particular key. That being said, you can’t use JSON_EXTRACT() or JSON_SEARCH in the WHERE clause, because it will initiate a dreaded full-table-scan (what MongoDB would call a full collection scan). This is a big problem and common wisdom is that JSON can’t be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

Actually, however, I’ve come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is easily indexed. Now, when you think index you think B-TREE or HASH indexes (or bitmap indexes) but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index where words (tokens) point to documents. While text indexes are great, they aren’t normally usable for JSON. The reason is, MySQL splits words on whitespace and non-alphanumeric characters. A JSON document doesn’t end up being usable when the name of the field (the key) can’t be associated with the value. But what if we transform the JSON? You can “flatten” the JSON down into key/value pairs and use a text index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ Rapid JSON library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:

{
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
	"batters":
		{
			"batter":
				[
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
				]
		},
	"topping":
		[
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }
		]
}

Flattened:

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=donut
name=Cake
ppu=0.55
id=1001
type=Regular
id=1002
type=Chocolate
id=1003
type=Blueberry
id=1004
type=Devil's Food
type=Devil's
type=Food
id=5001
type=None
id=5002
type=Glazed
id=5005
type=Sugar
id=5007
type=Powdered Sugar
type=Powdered
type=Sugar
id=5006
type=Chocolate with Sprinkles
type=Chocolate
type=with
type=Sprinkles
id=5003
type=Chocolate
id=5004
type=Maple
1 row in set (0.00 sec)

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another field (or another table), and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equal sign. We don’t want this as it removes the locality of the keyword and the value. To fix this problem you’ll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added equal (=) to the list of lower case characters as described in the manual. You just have to change two text files, no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.

By the way, I used a UDF, because MySQL FULLTEXT indexes don’t support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:

create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:

alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. Note that you can create a trigger to keep the second column in sync, I let that up to an exercise of the reader, or you can use Flexviews to maintain a copy in a second table automatically.

mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0

Using the index:

mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

The documents I searched for that example are very complex and highly nested. Check out the full matching documents for the query here here

If you want to only index a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and only flatten that.

Aggregating

JSON documents may contain sub-documents as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but it is still a text document. There is no function that can extract ALL of a particular key (like invoice_price) and aggregate the results. So, if you have a document called orders which contains a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDF to aggregate a “total sales” figure from all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, ‘key’). This UDF will extract all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item. This UDF returns each item separated by newline. I created a few stored routines called jsum, jmin, jmax, jcount, and javg. They can process the output of rapid_extract_all and aggregate it. If you want to only RAPID_EXTRACT_ALL from a portion of a document, extract that portion with the MySQL UDF JSON_EXTRACT first, then process that with RAPID_EXTRACT_ALL.

For example:

mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1G
*************************** 1. row ***************************
ids: 888
889
2312
5869
8702
jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:

mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
+-----------------------------------------+
| sum( jsum(json_extract_all(doc,'id')) ) |
+-----------------------------------------+
|                         296615411.00000 |
+-----------------------------------------+
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.

Where to get the tools:
You can find the UDF in the swanhart-tools github repo. I think you will find these tools very useful in working with JSON documents in MySQL.

(This post was originally posted on my personal blog: swanhart.livejournal.com, but is reposed here for wider distribution)

The post Advanced JSON for MySQL appeared first on MySQL Performance Blog.

Written by Justin Swanhart in: aggregation,document store,flatten,fulltext,index,JSON,Justin Swanhart,mongodb,MySQL,Primary,udf,Zend Developer |

Feb
17
2015

‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

I recently got back from FOSDEM, in Brussels, Belgium. While I was there I got to see a great talk by Sveta Smirnova, about her MySQL 5.7 Labs release JSON UDF functions. It is important to note that while the UDF come in a 5.7 release it is absolutely possible to compile and use the UDF with earlier versions of MySQL because the UDF interface has not changed for a long time. However, the UDF should still be considered alpha/preview level of quality and should not be used in production yet! For this example I am using Percona Server 5.6 with the UDF.

That being said, the proof-of-concept that I’m about to present here uses only one JSON function (JSON_EXTRACT) and it has worked well enough in my testing to present my idea here. The JSON functions will probably be GA sometime soon anyway, and this is a useful test of the JSON_EXTRACT function.

The UDF let you parse, search and manipulate JSON data inside of MySQL, bringing MySQL closer to the capabilities of a document store.

Since I am using Percona Server 5.6, I needed to compile and install the UDF. Here are the steps I took to compile the plugin:

$ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
$ cmake -DMYSQL_DIR=/usr/local/mysql .
$ sudo make install
$ sudo cp *.so /usr/local/mysql/lib/plugin

JSON UDF are great, but what’s the problem

The JSON functions work very well for manipulating individual JSON objects, but like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This means the functions are virtually useless for searching through large volumes of JSON data. If you want to use MySQL as a document store, this is going to limit the usefulness in the extreme as the ability to extract key/value pairs from JSON documents is powerful, but without indexing it can’t scale well.

What can be done to index JSON in MySQL for efficient access?

The JSON UDF provides a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.

Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is slight delay before index is populated. This is similar to eventual consistency in a document store.

Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.

What is Flexviews?

Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views are able to be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed fast, it is possible to refresh it frequently and have a low latency index, but not one perfectly in sync with the base table at all times.

The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.

See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql

Demo/POC using materialized view as an index

The first step to creating an incrementally refreshable materialized view with Flexviews, is to create a materialized view change log on all of the tables used in the view. The CREATE_MVLOG($schema, $table) function creates the log and FlexCDC will immediately being to collect changes into it.

mysql> call flexviews.create_mvlog('ssb_json','json');
Query OK, 1 row affected (0.01 sec)

Next, the materialized view name, and refresh type must be registered with the CREATE($schema, $mvname, $refreshtype) function:

mysql> call flexviews.create('ssb_json','json_idx','INCREMENTAL');
Query OK, 0 rows affected (0.00 sec)
-- save the just generated identifier for the view.  You can use GET_ID($schema,$mvname) later.
mysql> set @mvid := last_insert_id();
Query OK, 0 rows affected (0.00 sec)

Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias,$joinclause) function. This example will use only one table, but Flexviews supports joins too.

mysql> call flexviews.add_table(@mvid, 'ssb_json','json','json',null);
Query OK, 1 row affected (0.00 sec)

Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The function ADD_EXPR($mvid, $expression_type, $expression, $alias) is used to add expressions. Note that JSON_EXTRACT returns a TEXT column, so I’ve CAST the function to integer so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', "cast(json_extract(doc,'D_DateKey') as date)", 'D_DateKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'C_CustomerKey') as unsigned)", 'C_CustomerKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'S_SuppKey') as unsigned)", 'S_SuppKey');
Query OK, 1 row affected (0.01 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'P_PartKey') as unsigned)", 'P_PartKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_OrderKey') as unsigned)", 'LO_OrderKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_LineNumber') as unsigned)", 'LO_LineNumber');
Query OK, 1 row affected (0.00 sec)

I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id');
Query OK, 1 row affected (0.00 sec)

Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.

mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_LineNumber", 'LO_LineNumber_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_OrderKey", 'LO_OrderKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"P_PartKey", 'P_PartKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"S_SuppKey", 'S_SuppKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"D_DateKey", 'D_DateKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"C_CustomerKey", 'C_CustomerKey_Idx');
Query OK, 1 row affected (0.00 sec)

Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDF so they are not as fast as built in functions, and I indexed a lot of things (six different indexes are being populated at once) so it takes some time to build the index:

mysql> call flexviews.enable(@mvid);
Query OK, 2 rows affected (35 min 53.17 sec)

After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.

mysql> show tables;
+--------------------+
| Tables_in_ssb_json |
+--------------------+
| json               |
| json_idx           |
| json_idx_delta     |
+--------------------+
3 rows in set (0.00 sec)

Here is the table definition of json_idx, our materialized view. You can see it is indexed:

CREATE TABLE `json_idx` (
  `mview$pk` bigint(20) NOT NULL AUTO_INCREMENT,
  `D_DateKey` date DEFAULT NULL,
  `C_CustomerKey` bigint(21) unsigned DEFAULT NULL,
  `S_SuppKey` bigint(21) unsigned DEFAULT NULL,
  `P_PartKey` bigint(21) unsigned DEFAULT NULL,
  `LO_OrderKey` bigint(21) unsigned DEFAULT NULL,
  `LO_LineNumber` bigint(21) unsigned DEFAULT NULL,
  `id` bigint(20) NOT NULL DEFAULT '0',
  `mview$hash` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`mview$pk`),
  KEY `LO_LineNumber_Idx` (`LO_LineNumber`),
  KEY `LO_OrderKey_Idx` (`LO_OrderKey`),
  KEY `P_PartKey_Idx` (`P_PartKey`),
  KEY `S_SuppKey_Idx` (`S_SuppKey`),
  KEY `D_DateKey_Idx` (`D_DateKey`),
  KEY `C_CustomerKey_Idx` (`C_CustomerKey`),
  KEY `mview$hash_key` (`mview$hash`)
) ENGINE=InnoDB AUTO_INCREMENT=6029221 DEFAULT CHARSET=latin1;

Here are some sample contents. You can see the integer values extracted out of the JSON:

mysql> select * from json_idx limit 10;
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
|        1 | 1996-08-08 |          6748 |         1 |    178778 |       35620 |             2 |  1 | 3151656687 |
|        2 | 1994-05-20 |          5272 |         1 |     52362 |      102790 |             4 |  2 | 2181615425 |
|        3 | 1995-05-04 |         22870 |         1 |    136407 |      146757 |             3 |  3 |  544130577 |
|        4 | 1996-06-16 |         12217 |         1 |    129103 |      151200 |             1 |  4 | 2969697839 |
|        5 | 1992-07-20 |         21757 |         1 |     35243 |      151745 |             1 |  5 | 1438921571 |
|        6 | 1997-08-16 |         18760 |         1 |    150766 |      159232 |             6 |  6 | 3941775529 |
|        7 | 1994-03-04 |           757 |         1 |     15750 |      188902 |             3 |  7 | 2142628088 |
|        8 | 1993-11-04 |         17830 |         1 |    192023 |      253828 |             5 |  8 | 3480147565 |
|        9 | 1993-07-12 |         16933 |         1 |     59997 |      269062 |             5 |  9 | 3572286145 |
|       10 | 1998-06-16 |         26660 |         1 |     30023 |      298272 |             3 | 10 | 1971966244 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
10 rows in set (0.00 sec)

Now, there needs to be an easy way to use this index in a select statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense to access the data. We’ll call this the index view:

mysql> create view json_idx_v as select * from json natural join json_idx;
Query OK, 0 rows affected (0.00 sec)

And just for completeness, here is the contents of a row from our new index view:

mysql> select * from json_idx_v limit 1G
*************************** 1. row ***************************
           id: 1
          doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU     5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU     0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
     mview$pk: 1
    D_DateKey: 1996-08-08
C_CustomerKey: 6748
    S_SuppKey: 1
    P_PartKey: 178778
  LO_OrderKey: 35620
LO_LineNumber: 2
   mview$hash: 3151656687
1 row in set (0.00 sec)

Using the UDF to find a document

The UDF does a full table scan, parsing all six million documents (TWICE!) as it goes along. Unsurprisingly, this is slow:

mysql> select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (54.49 sec)
mysql> explain select * from json
                where json_extract(doc,'LO_OrderKey') = 35620
                  and json_extract(doc,'LO_LineNumber') = 2G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5233236
        Extra: Using where
1 row in set (0.00 sec)

Using the index view to find a document

mysql> select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (0.00 sec)
mysql> explain select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json_idx
         type: ref
possible_keys: LO_LineNumber_Idx,LO_OrderKey_Idx
          key: LO_OrderKey_Idx
      key_len: 9
          ref: const
         rows: 4
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: ssb_json.json_idx.id
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Keeping the index in sync

Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a REFRESH($mvid, $mode, $transaction_id) function.

I am going to remove one document from the table:

mysql> delete from json where id = 10000;
Query OK, 1 row affected (0.01 sec)

Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:

mysql> select * from flexviews.mvlog_f1673fac9814a93508a1c917566ecd4dG
*************************** 1. row ***************************
    dml_type: -1
      uow_id: 113
fv$server_id: 33
      fv$gsn: 1083
          id: 10000
         doc: {"LO_OrderKey":"3359521","LO_LineNumber":"2","LO_CustKey":"10306","LO_PartKey":"77997","LO_SuppKey":"4","LO_OrderDateKey":"19951010","LO_OrderPriority":"2-HIGH","LO_ShipPriority":"0","LO_Quantity":"43","LO_ExtendedPrice":"8492457","LO_OrdTotalPrice":"27032802","LO_Discount":"2","LO_Revenue":"8322607","LO_SupplyCost":"118499","LO_Tax":"4","LO_CommitDateKey":"19951228","LO_ShipMode":"FOB","C_CustomerKey":"10306","C_Name":"Customer#000010306","C_Address":"4UR9tz8","C_City":"ROMANIA  5","C_Nation":"ROMANIA","C_Region":"EUROPE","C_Phone":"29-361-986-3513","C_MktSegment":"BUILDING","S_SuppKey":"4","S_Name":"Supplier#000000004","S_Address":"qGTQJXogS83a7MB","S_City":"MOROCCO  4","S_Nation":"MOROCCO","S_Region":"AFRICA","S_Phone":"25-128-190-5944","D_DateKey":"19951010","D_Date":"Octorber 10, 1995","D_DayOfWeek":"Wednesday","D_Month":"Octorber","D_Year":"1995","D_YearMonthNum":"199510","D_YearMonth":"Oct1995","D_DayNumInWeek":"4","D_DayNumInMonth":"10","D_DayNumInYear":"283","D_MonthNumInYear":"10","D_WeekNumInYear":"41","D_SellingSeason":"Fall","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"77997","P_Name":"burnished olive","P_MFGR":"MFGR#2","P_Category":"MFGR#24","P_Brand":"MFGR#2426","P_Colour":"orchid","P_Type":"MEDIUM PLATED TIN","P_Size":"16","P_Container":"WRAP PKG"}
1 row in set (0.01 sec)

Now we can verify the materialized view is out of date:

mysql> select * from json_idx where id = 10000;
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id    | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
|    10000 | 1995-10-10 |         10306 |         4 |     77997 |     3359521 |             2 | 10000 | 2937185172 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
1 row in set (2.60 sec)

To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes at the same time, but I am going to use COMPUTE mode to show you what ends up in the delta table:

mysql> select * from json_idx_deltaG
*************************** 1. row ***************************
     dml_type: -1
       uow_id: 113
       fv$gsn: 1083
    D_DateKey: 1995-10-10
C_CustomerKey: 10306
    S_SuppKey: 4
    P_PartKey: 77997
  LO_OrderKey: 3359521
LO_LineNumber: 2
           id: 10000
   mview$hash: 2937185172
1 row in set (0.00 sec)

Delta tables are similar to materialized view change log tables, except they contain insertions and deletions to the view contents. In this case, you can see dml_type is -1 and id = 10000, so the row from the view corresponding to the row we deleted, will be deleted when the change is applied.

Finally the change can be applied:

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'APPLY',NULL);
Query OK, 2 rows affected (0.47 sec)
mysql> select * from json_idx where id = 10000; -- note, we didn't index id in the MV
Empty set (2.61 sec)

Finally, it makes sense to try to keep the index in sync as quickly as possible using a MySQL event:

DELIMITER ;;
CREATE EVENT IF NOT EXISTS flexviews.refresh_json_idx
ON SCHEDULE EVERY 1 SECOND
DO
BEGIN
  DECLARE v_got_lock tinyint default 0;
  SELECT GET_LOCK('JSON_IDX_LOCK', 0) INTO v_got_lock;
  IF v_got_lock = 1 THEN
    CALL flexviews.refresh(flexviews.get_id('ssb_json','json_idx'),'BOTH',NULL);
    SELECT RELEASE_LOCK('JSON_IDX_LOCK') INTO @discard;
  END IF;
END;;
DELIMITER ;

So there you have it. A way to index and quickly search through JSON documents and keep the index in sync automatically.

The post ‘Indexing’ JSON documents for efficient MySQL queries over JSON data appeared first on MySQL Performance Blog.

Written by Justin Swanhart in: flexviews,index,JSON,JSON_EXTRACT,Justin Swanhart,MySQL,NOSQL,percona,Primary,Sveta Smirnova,udf,Zend Developer |

May
31
2013

The small improvements of MySQL 5.6: Duplicate Index Detection

Here at the MySQL Performance Blog, we’ve been discussing the several new features that MySQL 5.6 brought: GTID-based replication, InnoDB Fulltext, Memcached integration, a more complete performance schema, online DDL and several other InnoDB and query optimizer improvements. However, I plan to focus on a series of posts on the small but handy improvements – changes and bug corrections – in MySQL 5.6 that can make our lives easier and have passed almost unnoticed by most (not all) DBAs.

Duplicate Index Detection

I commented about this on my last webinar, but did not have time to analyze it in-depth. If you try to do something like this in MySQL 5.5, you will succeed without errors or warnings:

mysql> ALTER TABLE test ADD INDEX (col2);
Query OK, 0 rows affected (0.22 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE test ADD INDEX (col2);
Query OK, 0 rows affected (0.25 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `col2` int(11) DEFAULT NULL,
  `col3` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `col2` (`col2`),
  KEY `col2_2` (`col2`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

In previous versions of MySQL, you can create two indexes with the same columns (in the same order) and the server will not complain.

If we execute the same sentences in MySQL 5.6, the second ALTER will also succeed -and the index will be created-, but we will get a warning (note severity, to be exact):

mysql> ALTER TABLE test ADD INDEX (col2);
Query OK, 0 rows affected (0.36 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE test ADD INDEX (col2);
Query OK, 0 rows affected, 1 warning (0.56 sec)
Records: 0  Duplicates: 0  Warnings: 1
mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1831
Message: Duplicate index 'col2_2' defined on the table 'test.test'. This is deprecated and will be disallowed in a future release.
1 row in set (0.00 sec)

As the message points correctly, this is a human mistake, as it is a waste of resources that could potentially impact our performance, and should be avoided. By the way, a good practice to avoid this is always naming your keys with a consistent pattern. This new behavior was introduced in 5.6.7 with the closing of this bug (although this was initially accepted as a bug as early as 2005!).

The report explains more in detail what the “will be disallowed in a future release” means. In MySQL 5.7 the checks will be stricter: in the default SQL mode, a duplicate index will throw a warning instead of a note. In strict mode, it will throw an error and the second ALTER will fail, preventing the creation of the duplicate index.

Does it mean that tools like pt-duplicate-key-checker will not be necessary for MySQL 5.6? Let’s have a look at the code implementing this feature. The warning will only be thrown if the index has not been created automatically, it is not a foreign key, and it has the exact column definition in the same order. In other words, it checks for duplicate keys, but not redundant ones. What is the difference? Let’s see an example. If we execute:

mysql> ALTER TABLE test ADD INDEX (col2);
Query OK, 0 rows affected (0.34 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE test ADD INDEX (col2, col3);
Query OK, 0 rows affected (0.39 sec)
Records: 0  Duplicates: 0  Warnings: 0

We get no warnings and no errors in 5.6, as the indexes are different. But as you may know, we can use the second index not only for filtering on both columns, but also for filtering by just the first one. Why can’t MySQL enforce this kind of constraints? For many reasons: the first one because it would break 99% of all applications out there that use MySQL, for which we at Percona tend to find redundant indexes. And second, because in some cases, we may need to have what at first seems redundant indexes but effectively they are not -for example, if one of the two indexes was unique or a foreign key.

This is the output of our tool when run on the same table, correctly identifying the redundancy:

$ pt-duplicate-key-checker --tables test.test
# ########################################################################
# test.test
# ########################################################################
# col2 is a left-prefix of col2_2
# Key definitions:
#   KEY `col2` (`col2`),
#   KEY `col2_2` (`col2`,`col3`)
# Column types:
#         `col2` int(11) default null
#         `col3` varchar(200) default null
# To remove this duplicate index, execute:
ALTER TABLE `test`.`test` DROP INDEX `col2`;
# ########################################################################
# Summary of indexes
# ########################################################################
# Size Duplicate Indexes   5
# Total Duplicate Indexes  1
# Total Indexes            3

Additionally, pt-duplicate-key-checker will detect subtle redundancies that are engine-dependent, like redundant suffixes for secondary keys in InnoDB. As some of this redundancies could be necessary, depending on the query optimizer and the MySQL version, we always recommend to check manually the optimizations proposed by Percona Toolkit. The MySQL server, of course, cannot risk to block directly all cases.

A set of MySQL utilities were introduced by Oracle recently, which includes mysqlindexcheck, similar to pt-duplicate-key-checker, but it does not detect all cases. For example:

mysql> alter table test add index redundant (col2, id);
Query OK, 0 rows affected (1.57 sec)
Records: 0  Duplicates: 0  Warnings: 0
$ mysqlindexcheck --server=user:pass@localhost test.test
# Source on localhost: ... connected.
$ pt-duplicate-key-checker --tables test.test
# ########################################################################
# test.test
# ########################################################################
# Key redundant ends with a prefix of the clustered index
# Key definitions:
#   KEY `redundant` (`col2`,`id`)
#   PRIMARY KEY (`id`),
# Column types:
#         `col2` int(11) default null
#         `id` int(11) not null auto_increment
# To shorten this duplicate clustered index, execute:
ALTER TABLE `test`.`test` DROP INDEX `redundant`, ADD INDEX `redundant` (`col2`);
# ########################################################################
# Summary of indexes
# ########################################################################
# Size Duplicate Indexes   9
# Total Duplicate Indexes  1
# Total Indexes            2

By the way, if you want to get more familiar with this and other particularities of the latest MySQL GA release, have a look at our upcoming sessions for the “Moving to 5.6″ training course in America (Austin, San Jose) and Europe (Manchester, Utrecht).

The post The small improvements of MySQL 5.6: Duplicate Index Detection appeared first on MySQL Performance Blog.

Written by Jaime Crespo in: 5.6,5.7,duplicate,Duplicate Index Detection,index,Insight for DBAs,MySQL,MySQL 5.6,mysqlindexcheck,On Ground Training,Percona Toolkit,pt-duplicate-key-checker,training,Zend Developer |

« Newer Posts

Personal blog of Yzmir Ramirez

The Glass is twice as large as it needs to be

Advanced JSON for MySQL

What is JSON

Problems with JSON in MySQL

Searching

Aggregating

‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

JSON UDF are great, but what’s the problem

What can be done to index JSON in MySQL for efficient access?

What is Flexviews?

Demo/POC using materialized view as an index

Using the UDF to find a document

Using the index view to find a document

Keeping the index in sync

The small improvements of MySQL 5.6: Duplicate Index Detection

Archives

Contributors

Personal blog of Yzmir Ramirez

The Glass is twice as large as it needs to be

Advanced JSON for MySQL

What is JSON

Problems with JSON in MySQL

Searching

Aggregating

‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

JSON UDF are great, but what’s the problem

What can be done to index JSON in MySQL for efficient access?

What is Flexviews?

Demo/POC using materialized view as an index

Using the UDF to find a document

Using the index view to find a document

Keeping the index in sync

The small improvements of MySQL 5.6: Duplicate Index Detection

Tag Cloud

Archives

Contributors