Feb
17
2015
--

‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

I recently got back from FOSDEM, in Brussels, Belgium. While I was there I got to see a great talk by Sveta Smirnova, about her MySQL 5.7 Labs release JSON UDF functions. It is important to note that while the UDF come in a 5.7 release it is absolutely possible to compile and use the UDF with earlier versions of MySQL because the UDF interface has not changed for a long time. However, the UDF should still be considered alpha/preview level of quality and should not be used in production yet! For this example I am using Percona Server 5.6 with the UDF.

That being said, the proof-of-concept that I’m about to present here uses only one JSON function (JSON_EXTRACT) and it has worked well enough in my testing to present my idea here. The JSON functions will probably be GA sometime soon anyway, and this is a useful test of the JSON_EXTRACT function.

The UDF let you parse, search and manipulate JSON data inside of MySQL, bringing MySQL closer to the capabilities of a document store.

Since I am using Percona Server 5.6, I needed to compile and install the UDF. Here are the steps I took to compile the plugin:

  1. $ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
  2. $ cmake -DMYSQL_DIR=/usr/local/mysql .
  3. $ sudo make install
  4. $ sudo cp *.so /usr/local/mysql/lib/plugin

JSON UDF are great, but what’s the problem

The JSON functions work very well for manipulating individual JSON objects, but like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This means the functions are virtually useless for searching through large volumes of JSON data.  If you want to use MySQL as a document store, this is going to limit the usefulness in the extreme as the ability to extract key/value pairs from JSON documents is powerful, but without indexing it can’t scale well.

What can be done to index JSON in MySQL for efficient access?

The JSON UDF provides a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.

  1. Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
  2. Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is slight delay before index is populated. This is similar to eventual consistency in a document store.

Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.

What is Flexviews?

Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views are able to be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed fast, it is possible to refresh it frequently and have a low latency index, but not one perfectly in sync with the base table at all times.

The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.

See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql

Demo/POC using materialized view as an index

The first step to creating an incrementally refreshable materialized view with Flexviews, is to create a materialized view change log on all of the tables used in the view. The CREATE_MVLOG($schema, $table) function creates the log and FlexCDC will immediately being to collect changes into it.

mysql> call flexviews.create_mvlog('ssb_json','json');
Query OK, 1 row affected (0.01 sec)

Next, the materialized view name, and refresh type must be registered with the CREATE($schema, $mvname, $refreshtype) function:

mysql> call flexviews.create('ssb_json','json_idx','INCREMENTAL');
Query OK, 0 rows affected (0.00 sec)
-- save the just generated identifier for the view.  You can use GET_ID($schema,$mvname) later.
mysql> set @mvid := last_insert_id();
Query OK, 0 rows affected (0.00 sec)

Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias,$joinclause) function. This example will use only one table, but Flexviews supports joins too.

mysql> call flexviews.add_table(@mvid, 'ssb_json','json','json',null);
Query OK, 1 row affected (0.00 sec)

Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The function ADD_EXPR($mvid, $expression_type, $expression, $alias) is used to add expressions. Note that JSON_EXTRACT returns a TEXT column, so I’ve CAST the function to integer so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', "cast(json_extract(doc,'D_DateKey') as date)", 'D_DateKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'C_CustomerKey') as unsigned)", 'C_CustomerKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'S_SuppKey') as unsigned)", 'S_SuppKey');
Query OK, 1 row affected (0.01 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'P_PartKey') as unsigned)", 'P_PartKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_OrderKey') as unsigned)", 'LO_OrderKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_LineNumber') as unsigned)", 'LO_LineNumber');
Query OK, 1 row affected (0.00 sec)

I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id');
Query OK, 1 row affected (0.00 sec)

Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.

mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_LineNumber", 'LO_LineNumber_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_OrderKey", 'LO_OrderKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"P_PartKey", 'P_PartKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"S_SuppKey", 'S_SuppKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"D_DateKey", 'D_DateKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"C_CustomerKey", 'C_CustomerKey_Idx');
Query OK, 1 row affected (0.00 sec)

Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDF so they are not as fast as built in functions, and I indexed a lot of things (six different indexes are being populated at once) so it takes some time to build the index:

mysql> call flexviews.enable(@mvid);
Query OK, 2 rows affected (35 min 53.17 sec)

After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.

mysql> show tables;
+--------------------+
| Tables_in_ssb_json |
+--------------------+
| json               |
| json_idx           |
| json_idx_delta     |
+--------------------+
3 rows in set (0.00 sec)

Here is the table definition of json_idx, our materialized view. You can see it is indexed:

CREATE TABLE `json_idx` (
  `mview$pk` bigint(20) NOT NULL AUTO_INCREMENT,
  `D_DateKey` date DEFAULT NULL,
  `C_CustomerKey` bigint(21) unsigned DEFAULT NULL,
  `S_SuppKey` bigint(21) unsigned DEFAULT NULL,
  `P_PartKey` bigint(21) unsigned DEFAULT NULL,
  `LO_OrderKey` bigint(21) unsigned DEFAULT NULL,
  `LO_LineNumber` bigint(21) unsigned DEFAULT NULL,
  `id` bigint(20) NOT NULL DEFAULT '0',
  `mview$hash` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`mview$pk`),
  KEY `LO_LineNumber_Idx` (`LO_LineNumber`),
  KEY `LO_OrderKey_Idx` (`LO_OrderKey`),
  KEY `P_PartKey_Idx` (`P_PartKey`),
  KEY `S_SuppKey_Idx` (`S_SuppKey`),
  KEY `D_DateKey_Idx` (`D_DateKey`),
  KEY `C_CustomerKey_Idx` (`C_CustomerKey`),
  KEY `mview$hash_key` (`mview$hash`)
) ENGINE=InnoDB AUTO_INCREMENT=6029221 DEFAULT CHARSET=latin1;

Here are some sample contents. You can see the integer values extracted out of the JSON:

mysql> select * from json_idx limit 10;
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
|        1 | 1996-08-08 |          6748 |         1 |    178778 |       35620 |             2 |  1 | 3151656687 |
|        2 | 1994-05-20 |          5272 |         1 |     52362 |      102790 |             4 |  2 | 2181615425 |
|        3 | 1995-05-04 |         22870 |         1 |    136407 |      146757 |             3 |  3 |  544130577 |
|        4 | 1996-06-16 |         12217 |         1 |    129103 |      151200 |             1 |  4 | 2969697839 |
|        5 | 1992-07-20 |         21757 |         1 |     35243 |      151745 |             1 |  5 | 1438921571 |
|        6 | 1997-08-16 |         18760 |         1 |    150766 |      159232 |             6 |  6 | 3941775529 |
|        7 | 1994-03-04 |           757 |         1 |     15750 |      188902 |             3 |  7 | 2142628088 |
|        8 | 1993-11-04 |         17830 |         1 |    192023 |      253828 |             5 |  8 | 3480147565 |
|        9 | 1993-07-12 |         16933 |         1 |     59997 |      269062 |             5 |  9 | 3572286145 |
|       10 | 1998-06-16 |         26660 |         1 |     30023 |      298272 |             3 | 10 | 1971966244 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
10 rows in set (0.00 sec)

Now, there needs to be an easy way to use this index in a select statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense to access the data. We’ll call this the index view:

mysql> create view json_idx_v as select * from json natural join json_idx;
Query OK, 0 rows affected (0.00 sec)

And just for completeness, here is the contents of a row from our new index view:

mysql> select * from json_idx_v limit 1G
*************************** 1. row ***************************
           id: 1
          doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU     5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU     0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
     mview$pk: 1
    D_DateKey: 1996-08-08
C_CustomerKey: 6748
    S_SuppKey: 1
    P_PartKey: 178778
  LO_OrderKey: 35620
LO_LineNumber: 2
   mview$hash: 3151656687
1 row in set (0.00 sec)

Using the UDF to find a document

The UDF does a full table scan, parsing all six million documents (TWICE!) as it goes along. Unsurprisingly, this is slow:

mysql> select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (54.49 sec)
mysql> explain select * from json
                where json_extract(doc,'LO_OrderKey') = 35620
                  and json_extract(doc,'LO_LineNumber') = 2G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5233236
        Extra: Using where
1 row in set (0.00 sec)

Using the index view to find a document

mysql> select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (0.00 sec)
mysql> explain select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json_idx
         type: ref
possible_keys: LO_LineNumber_Idx,LO_OrderKey_Idx
          key: LO_OrderKey_Idx
      key_len: 9
          ref: const
         rows: 4
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: ssb_json.json_idx.id
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Keeping the index in sync

Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a REFRESH($mvid, $mode, $transaction_id) function.

I am going to remove one document from the table:

mysql> delete from json where id = 10000;
Query OK, 1 row affected (0.01 sec)

Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:

mysql> select * from flexviews.mvlog_f1673fac9814a93508a1c917566ecd4dG
*************************** 1. row ***************************
    dml_type: -1
      uow_id: 113
fv$server_id: 33
      fv$gsn: 1083
          id: 10000
         doc: {"LO_OrderKey":"3359521","LO_LineNumber":"2","LO_CustKey":"10306","LO_PartKey":"77997","LO_SuppKey":"4","LO_OrderDateKey":"19951010","LO_OrderPriority":"2-HIGH","LO_ShipPriority":"0","LO_Quantity":"43","LO_ExtendedPrice":"8492457","LO_OrdTotalPrice":"27032802","LO_Discount":"2","LO_Revenue":"8322607","LO_SupplyCost":"118499","LO_Tax":"4","LO_CommitDateKey":"19951228","LO_ShipMode":"FOB","C_CustomerKey":"10306","C_Name":"Customer#000010306","C_Address":"4UR9tz8","C_City":"ROMANIA  5","C_Nation":"ROMANIA","C_Region":"EUROPE","C_Phone":"29-361-986-3513","C_MktSegment":"BUILDING","S_SuppKey":"4","S_Name":"Supplier#000000004","S_Address":"qGTQJXogS83a7MB","S_City":"MOROCCO  4","S_Nation":"MOROCCO","S_Region":"AFRICA","S_Phone":"25-128-190-5944","D_DateKey":"19951010","D_Date":"Octorber 10, 1995","D_DayOfWeek":"Wednesday","D_Month":"Octorber","D_Year":"1995","D_YearMonthNum":"199510","D_YearMonth":"Oct1995","D_DayNumInWeek":"4","D_DayNumInMonth":"10","D_DayNumInYear":"283","D_MonthNumInYear":"10","D_WeekNumInYear":"41","D_SellingSeason":"Fall","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"77997","P_Name":"burnished olive","P_MFGR":"MFGR#2","P_Category":"MFGR#24","P_Brand":"MFGR#2426","P_Colour":"orchid","P_Type":"MEDIUM PLATED TIN","P_Size":"16","P_Container":"WRAP PKG"}
1 row in set (0.01 sec)

Now we can verify the materialized view is out of date:

mysql> select * from json_idx where id = 10000;
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id    | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
|    10000 | 1995-10-10 |         10306 |         4 |     77997 |     3359521 |             2 | 10000 | 2937185172 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
1 row in set (2.60 sec)

To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes at the same time, but I am going to use COMPUTE mode to show you what ends up in the delta table:

mysql> select * from json_idx_deltaG
*************************** 1. row ***************************
     dml_type: -1
       uow_id: 113
       fv$gsn: 1083
    D_DateKey: 1995-10-10
C_CustomerKey: 10306
    S_SuppKey: 4
    P_PartKey: 77997
  LO_OrderKey: 3359521
LO_LineNumber: 2
           id: 10000
   mview$hash: 2937185172
1 row in set (0.00 sec)

Delta tables are similar to materialized view change log tables, except they contain insertions and deletions to the view contents. In this case, you can see dml_type is -1 and id = 10000, so the row from the view corresponding to the row we deleted, will be deleted when the change is applied.

Finally the change can be applied:

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'APPLY',NULL);
Query OK, 2 rows affected (0.47 sec)
mysql> select * from json_idx where id = 10000; -- note, we didn't index id in the MV
Empty set (2.61 sec)

Finally, it makes sense to try to keep the index in sync as quickly as possible using a MySQL event:

DELIMITER ;;
CREATE EVENT IF NOT EXISTS flexviews.refresh_json_idx
ON SCHEDULE EVERY 1 SECOND
DO
BEGIN
  DECLARE v_got_lock tinyint default 0;
  SELECT GET_LOCK('JSON_IDX_LOCK', 0) INTO v_got_lock;
  IF v_got_lock = 1 THEN
    CALL flexviews.refresh(flexviews.get_id('ssb_json','json_idx'),'BOTH',NULL);
    SELECT RELEASE_LOCK('JSON_IDX_LOCK') INTO @discard;
  END IF;
END;;
DELIMITER ;

So there you have it. A way to index and quickly search through JSON documents and keep the index in sync automatically.

The post ‘Indexing’ JSON documents for efficient MySQL queries over JSON data appeared first on MySQL Performance Blog.

Mar
23
2011
--

Using Flexviews – part one, introduction to materialized views

If you know me, then you probably have heard of Flexviews. If not, then it might not be familiar to you. I’m giving a talk on it at the MySQL 2011 CE, and I figured I should blog about it before then. For those unfamiliar, Flexviews enables you to create and maintain incrementally refreshable materialized views.

You might be asking yourself “what is an incrementally refreshable materialized view?”. If so, then keep reading. This is the first in a multi-part series describing Flexviews.

edit:
You can find part 2 of the series here: http://www.mysqlperformanceblog.com/2011/03/25/using-flexviews-part-two-change-data-capture/


The output of any SQL query is logically an in-memory table
The output of a SQL query, and what we normally think of as a result set is really a virtual table. It has columns and rows, just like a database table, but it is temporary in nature, usually existing only in memory for a short time. This concept extends from relational algebra, upon which SQL is built. All SQL can be broken down into relational algebra, which is convenient because that means there are lots of transformations that can be done to it, all without changing the meaning or the output. The output of all relational algebraic statements is a table, so conceptually so too is the output of a SQL statement.

MySQL even includes a SQL statement that makes this perfectly clear: CREATE TABLE .. AS SELECT (CTAS). The results of the SELECT portion of the statement are stored in a table. Storing the results of a SQL statement into a table (even a temporary table) is called materializing the results.

Views are similar to regular SQL statements.
A view is a SQL statement which acts like a table. When you execute a query on a view, the result set is ephemeral, generated at run-time for consumption and then immediately discarded.

Views are not generally considered to be a performance optimization because:

  • The contents of the result set are computed each time the view is accessed.
  • If multiple statements access the same view repeatedly or concurrently, then this computation is likely to be expensive
  • If the view accesses large quantities of data, then the computation is likely expensive.
  • Views containing grouping, aggregation, sorting, distinct or other conditions must be fully computed and stored in a temporary table before they can be accessed, which is very expensive.

What is a materialized view (MV)?
A materialized view is similar to a regular view, in that it represents the result set of a query, but the contents are stored (materialized!) as a real table.  So a MV is similar to a table created with the CTAS command described above. This similarity is fairly superficial though. While both store results in a table, the MV represents the results of a SQL statement at a specific point in time, essentially a consistent snapshot of the query result. It is not possible to create multiple different tables via CTAS and have them all be transactionally consistent with one another, unless you stop all database write activity.

Materialized views can be used to enhance performance by acting as a cache. Further, the cost of a cache miss is lower because incremental refresh is faster than recomputing the contents from scratch.

  • The contents of the result set are updated periodically, not each time the view is accessed.
  • If multiple statements access the same view repeatedly or concurrently it is not likely to be very expensive.
  • If the view is large, accessing the table will be considerably cheaper
  • You can add indexes to the MV.
  • Since the data is already joined together and pre-aggregated, CPU and memory usage may be reduced compared to computing the results.
  • You can often generate multiple different blocks of content from the summarized data in one or more views

Materialized views must be refreshed.
Refreshing a MV brings it up to date to reflect the changes in the database since the view was either first created or last refreshed, whichever is more recent. More importantly a MV can can be refreshed to a specific point in time, not just “now”. This means that you can maintain multiple MVs and keep them synced to the same point in time.   There are two different methods by which a MV can be refreshed.

The first is the complete refresh method, which rebuilds the entire contents of the view from scratch. This is the less desirable method:

  • During a complete refresh, the view contents are completely recalculated, which could be very expensive
  • On some databases, the contents of the view are not available during complete refresh (not true of Flexviews)
  • During refresh the view may occupy twice as much space (similar to ALTER TABLE)
  • Supports all SQL syntax (like outer join) but can’t be refreshed to a specific point in time.
  • Performs no better than CTAS, but gives a convenient method of creating and refreshing the materialized results

The second is refresh method is incremental refresh. This method updates the view. It usually examines only the rows which have changed since the view was last refreshed.

Incremental refresh has obvious benefits which include:

  • Refreshing large views is orders of magnitude faster than complete refresh.
  • When updating the view, only a subset of the database rows must be examined
  • The rows examined are related to the rows which have changed since the last refresh
  • The view can be refreshed forward to to a specific transactional point in time
  • Multiple views can be rolled forward to the exact same consistent point in time, with respect to committed transactions
  • The processing can be done on a slave dedicated to such work.

And some drawbacks:

  • Not all SQL syntax supported (no outer join), no non-deterministic functions, etc.
  • There is overhead for change-data capture.
  • Some extra storage is used for the changes and deltas.
  • Creating an MV is not as simple as I’d like.

Incremental refresh capabilities imply that Flexviews has some way to capture the changes that happen in the database and then apply those changes to the views.

These capabilities break down into three main categories which will be blogged about in subsequent posts:

  • Change Data Capture – How Flexviews figures out what changed in the database
  • Delta computation – How it uses those changes to compute the differences between the old results and the new results
  • Delta application – How Flexviews uses the deltas to actually make the changes
  • Finally, the last blog post will describe how to use Flexviews for online schema change, or as an ELT* tool and it will cover how to create the materialized views which Flexviews manages.

*ELT – extract, load and transform, a variant of the data integration Extract-Transform-Load process (ETL).

You can find part 2 of the series here: http://www.mysqlperformanceblog.com/2011/03/25/using-flexviews-part-two-change-data-capture/

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com