Jul 28, 2021

MongoDB: Modifying Documents Using Aggregation Pipelines and Update Expressions

Updating documents in MongoDB prior to version 4.2 was quite limited. It was not possible to set a field to the result of a conditional expression, combine fields, or update a field based on the value of another field on the server side. Drawing a parallel to SQL UPDATE statements, for example, it wasn’t possible to do something like the following:

UPDATE t1 SET t1.f1 = t1.f2 WHERE…

It wasn’t possible to use a conditional expression either, something easily achieved in standard SQL:

UPDATE t1 SET t1.f1 = CASE WHEN f2 = 1 THEN 1 WHEN f2 = 2 THEN 5 END WHERE…

If something similar to the examples above was required and the deployment was on 3.4+, $addFields was probably the closest alternative. However, it could not touch the current documents, because the $out output destination could only be a different collection (a sketch of that workaround follows below).
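For illustration, here is a minimal sketch of that pre-4.2 workaround in the mongo shell, borrowing the dbtest.colltest2 collection and the f2/f3 fields from the example later in this post (the target collection name is an assumption):

// Compute the value with $addFields, but $out can only write the result
// to a different collection, leaving the original documents untouched.
db.getSiblingDB("dbtest").colltest2.aggregate([
  { $addFields: { result: { $add: [ "$f2", "$f3" ] } } },
  { $out: "colltest2_computed" }
]);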

With older versions, the only way around it was to open a cursor over an aggregation pipeline and iterate it on the client side, updating each document inside the loop with the proper $set values. It was a tedious task that usually ended up as a full JavaScript script, along the lines of the sketch below.
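A rough sketch of that client-side loop in the mongo shell, again borrowing the colltest2 fields used later in this post:

// Old approach: iterate an aggregation cursor on the client and issue
// one update per document with the value computed in the pipeline.
db.getSiblingDB("dbtest").colltest2.aggregate([
  { $project: { computed: { $add: [ "$f2", "$f3" ] } } }
]).forEach(function (doc) {
  db.getSiblingDB("dbtest").colltest2.update(
    { _id: doc._id },
    { $set: { result: doc.computed } }
  );
});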

From MongoDB 4.2 onwards, it is possible to use an aggregation pipeline to update MongoDB documents conditionally, allowing the update or creation of a field based on the value of another field. This article presents some very common, basic operations which are easily achieved with SQL databases.

Field Expressions in Update Commands (v4.2+)

Updating a field with the value of some other field:

This is similar to the classic example of an SQL command: UPDATE t1 SET t1.f1 = t1.f2 + t1.f3

replset:PRIMARY> db.getSiblingDB("dbtest").colltest2.update({_id:3},[{$set:{result:{$add: [ "$f2", "$f3" ] } }} ]);
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

replset:PRIMARY> db.getSiblingDB("dbtest").colltest2.find({_id:3});
{ "_id" : 3, "f1" : 30, "f2" : 300, "f3" : 3000, "result" : 3300 }

The key point is the “$” on the front of the field names being referenced (“f2” and “f3” in this example). These are the simplest type of field path expression, as they’re called in the MongoDB documentation. You’ve probably seen them in the aggregation pipeline before, but it was only in v4.2 that you could also use them in a normal update command.

Applying “CASE” conditions:

It is now straightforward to set a field’s value conditionally while updating a collection:

replset:PRIMARY> db.getSiblingDB("dbtest").colltest3.find({_id:3});
{ "_id" : 3, "grade" : 8 }


replset:PRIMARY> db.getSiblingDB("dbtest").colltest3.update(
  { _id: 3}, 
  [
    { $set: {result : { 
      $switch: {branches: [
        { case: { $gte: [ "$grade", 7 ] }, then: "PASSED" }, 
        { case: { $lte: [ "$grade", 5 ] }, then: "NOPE" }, 
        { case: { $eq: [ "$grade", 6 ] }, then: "UNDER ANALYSIS" } 
      ] } 
    } } } 
  ] 
)
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })


replset:PRIMARY> db.getSiblingDB("dbtest").colltest3.find({ _id: 3});
{ "_id" : 3, "grade" : 8, "result" : "PASSED" }

Adding new fields to a specific filtered document:

Let’s say that you want to stamp a document with the current date (using the $$NOW variable) as its modification date and add a simple comment field:

replset:PRIMARY> db.getSiblingDB("dbtest").colltest.find({_id:3})
{ "_id" : 3, "description" : "Field 3", "rating" : 2, "updt_date" : ISODate("2021-05-06T22:00:00Z") }

replset:PRIMARY> db.getSiblingDB("dbtest").colltest.update( 
  { _id: 3 }, 
  [ 
    { $set: { "comment": "Comment3", mod_date: "$$NOW"} } 
  ] 
)
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

replset:PRIMARY> db.getSiblingDB("dbtest").colltest.find({_id:3})
{ "_id" : 3, "description" : "Field 3", "rating" : 2, "updt_date" : ISODate("2021-05-06T22:00:00Z"), "comment" : "Comment3", "mod_date" : ISODate("2021-07-05T18:48:44.710Z") }

Updating several documents with the same expression:

You can either use updateMany() to apply the same pipeline to multiple documents:

replset:PRIMARY> db.getSiblingDB("dbtest").colltest3.find({});
{ "_id" : 1, "grade" : 8}
{ "_id" : 2, "grade" : 5}
{ "_id" : 3, "grade" : 8}

replset:PRIMARY> db.getSiblingDB("dbtest").colltest3.updateMany({}, 
  [
    { $set: {result : { $switch: {branches: [{ case: { $gte: [ "$grade", 7 ] }, then: "PASSED" }, { case: { $lte: [ "$grade", 5 ] }, then: "NOPE" }, { case: { $eq: [ "$grade", 6 ] }, then: "UNDER ANALYSIS" } ] } } } } 
  ] 
)
{ "acknowledged" : true, "matchedCount" : 3, "modifiedCount" : 2 }

replset:PRIMARY> db.getSiblingDB("dbtest").colltest3.find({});
{ "_id" : 1, "grade" : 8, "result" : "PASSED" }
{ "_id" : 2, "grade" : 5, "result" : "NOPE" }
{ "_id" : 3, "grade" : 8, "result" : "PASSED" }

Or, if you want to stick with the original db.collection.update() command, use the { multi: true } option. Note that the default is false, which means that only the first matching document will be updated.

replset:PRIMARY> db.getSiblingDB("dbtest").colltest4.update({}, [{ $set: {result : { $switch: {branches: [{ case: { $gte: [ "$grade", 7 ] }, then: "PASSED" }, { case: { $lte: [ "$grade", 5 ] }, then: "NOPE" }, { case: { $eq: [ "$grade", 6 ] }, then: "UNDER ANALYSIS" } ] } } } } ] )
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

replset:PRIMARY> db.getSiblingDB("dbtest").colltest4.find({});
{ "_id" : 1, "grade" : 8, "result" : "PASSED" }
{ "_id" : 2, "grade" : 5 }
{ "_id" : 3, "grade" : 8 }

When specifying {multi:true} the expected outcome is finally achieved:

replset:PRIMARY> db.getSiblingDB("dbtest").colltest4.update({}, [{ $set: {result : { $switch: {branches: [{ case: { $gte: [ "$grade", 7 ] }, then: "PASSED" }, { case: { $lte: [ "$grade", 5 ] }, then: "NOPE" }, { case: { $eq: [ "$grade", 6 ] }, then: "UNDER ANALYSIS" } ] } } } } ],{multi:true} )
WriteResult({ "nMatched" : 3, "nUpserted" : 0, "nModified" : 2 })

replset:PRIMARY> db.getSiblingDB("dbtest").colltest4.find({});
{ "_id" : 1, "grade" : 8, "result" : "PASSED" }
{ "_id" : 2, "grade" : 5, "result" : "NOPE" }
{ "_id" : 3, "grade" : 8, "result" : "PASSED" }

Update by $merge Stage in the Aggregation Pipeline

Prior to version 4.2, directing the result of an aggregation pipeline to a new collection was achieved with $out. Starting with version 4.2, it is possible to use $merge, which is far more flexible: $out replaces the entire target collection, while $merge can insert, merge, or replace individual documents, among other options. You may want to refer to the comparison table described here:

https://docs.mongodb.com/manual/reference/operator/aggregation/merge/#std-label-merge-out-comparison

With MongoDB 4.4 and onwards, it is possible to update a collection directly in the aggregation pipeline through the $merge stage. The trick is to set the output collection to the same collection that is being aggregated. The example below illustrates how to flag the max grade of the student Rafa in math class:

  • Original document
replset:PRIMARY> db.getSiblingDB("dbtest").students2.find({"name": "Rafa","class":"math"})
{ "_id" : ObjectId("6100081e21f08fe0d19bda41"), "name" : "Rafa", "grades" : [ 4, 5, 6, 9 ], "class" : "math" }

  • The aggregation pipeline
replset:PRIMARY> db.getSiblingDB("dbtest").students2.aggregate( [{ $match : { "name": "Rafa","class":"math" } }, {$project:{maxGrade:{$max:"$grades"}}}, {$merge : { into: { db: "dbtest", coll: "students2" }, on: "_id",  whenMatched: "merge", whenNotMatched:"discard"} } ]);

  • Checking the result
replset:PRIMARY> db.getSiblingDB("dbtest").students2.find({"name": "Rafa","class":"math"})
{ "_id" : ObjectId("6100081e21f08fe0d19bda41"), "name" : "Rafa", "grades" : [ 4, 5, 6, 9 ], "class" : "math", "maxGrade" : 9 }

Note that the maxGrade field was merged into the doc, flagging that the max grade achieved by that student in math was 9.

Watch out: behind the scenes, the merge will trigger an update against the same collection. If that update changes the physical location of a document, the update might revisit the same document multiple times or even get into an infinite loop (the Halloween Problem).

The other cool thing is using the $merge stage to work much like the SQL INSERT ... SELECT statement (and this is possible with MongoDB 4.2 and onwards). The example below demonstrates how to fill the collection colltest_reporting with the result of an aggregation run against colltest5.

replset:PRIMARY> db.getSiblingDB("dbtest").colltest5.aggregate( [{ $match : { class: "A" } }, { $group: { _id: "$class",maxGrade: { $max: "$grade" } }},  {$merge : { into: { db: "dbtest", coll: "colltest_reporting" }, on: "_id",  whenMatched: "replace", whenNotMatched: "insert" } } ] );
replset:PRIMARY> db.getSiblingDB("dbtest").colltest_reporting.find()
{ "_id" : "A", "maxGrade" : 8 }

Conclusion

There are plenty of new possibilities that will make a developer’s life easier (especially for developers coming from SQL databases), considering that the aggregation framework provides several operators and many different stages to play with. However, it is important to highlight that the complexity of a pipeline may incur performance degradation (that may be a topic for another blog post). For more information on updates with aggregation pipelines, please refer to the official documentation.

Mar 10, 2015

Advanced JSON for MySQL

What is JSON

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing document-store data where each document can have different attributes/schema. Popular document-store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to interoperate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with in MySQL. Until recently, a JSON document was just a TEXT value. I said until recently, so what has changed? The biggest thing is that there are new JSON UDFs by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDFs are up to version 0.0.4. While these new UDFs are a welcome addition to the MySQL database, they don’t solve the really tough JSON problems we face.

Searching

The JSON UDFs provide a number of functions that make working with JSON easier, including the ability to extract portions of a document or search a document for a particular key. That being said, you can’t use JSON_EXTRACT() or JSON_SEARCH() in the WHERE clause, because it will initiate a dreaded full table scan (what MongoDB would call a full collection scan), as illustrated below. This is a big problem, and common wisdom is that JSON can’t be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.
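For illustration, a hypothetical query of this shape (the table and key names are assumptions, and the exact value format returned by the labs UDF may need extra handling) cannot use any index, because the predicate is computed from the TEXT column for every row:

-- The predicate is the result of a UDF over a TEXT column, so no index
-- can satisfy it and MySQL scans the whole table.
select count(*)
from   json2
where  json_extract(doc, 'type') = 'donut';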

Actually, however, I’ve come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is easily indexed. Now, when you think index you think B-TREE or HASH indexes (or bitmap indexes) but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index where words (tokens) point to documents. While text indexes are great, they aren’t normally usable for JSON. The reason is that MySQL splits words on whitespace and non-alphanumeric characters, so the name of a field (the key) can’t be associated with its value, and a raw JSON document doesn’t end up being usable in the index. But what if we transform the JSON? You can “flatten” the JSON down into key/value pairs and use a text index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the RapidJSON C++ library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:

{
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
	"batters":
		{
			"batter":
				[
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
				]
		},
	"topping":
		[
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }
		]
}

Flattened:

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))\G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=donut
name=Cake
ppu=0.55
id=1001
type=Regular
id=1002
type=Chocolate
id=1003
type=Blueberry
id=1004
type=Devil's Food
type=Devil's
type=Food
id=5001
type=None
id=5002
type=Glazed
id=5005
type=Sugar
id=5007
type=Powdered Sugar
type=Powdered
type=Sugar
id=5006
type=Chocolate with Sprinkles
type=Chocolate
type=with
type=Sprinkles
id=5003
type=Chocolate
id=5004
type=Maple
1 row in set (0.00 sec)

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another field (or another table), and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equal sign. We don’t want this as it removes the locality of the keyword and the value. To fix this problem you’ll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added equal (=) to the list of lower case characters as described in the manual. You just have to change two text files, no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.

By the way, I used a UDF because MySQL FULLTEXT indexes don’t support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin, and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:

create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:

alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. Note that you can create a trigger to keep the second column in sync (one possible shape is sketched after the example below), or you can use Flexviews to maintain a copy in a second table automatically.

mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0
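For ongoing changes, a pair of triggers along these lines could keep the flat column in sync automatically. This is a hypothetical sketch (trigger names are illustrative) and assumes the RAPID_FLATTEN_JSON UDF is installed:

-- Recompute the flattened copy whenever a document is inserted or updated.
CREATE TRIGGER json2_flat_bi BEFORE INSERT ON json2
FOR EACH ROW SET NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);

CREATE TRIGGER json2_flat_bu BEFORE UPDATE ON json2
FOR EACH ROW SET NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);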

Using the index:

mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

The documents I searched for in that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to index only a sub-portion of the document, use the JSON_EXTRACT UDF to extract the portion you want to index, and only flatten that, as in the example below.
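For example, a hypothetical statement like the following (using the labs UDF's key-based JSON_EXTRACT syntax) would flatten and index only the batters sub-document of each row in json2:

-- Flatten just the "batters" portion of each document into the indexed column.
update json2
set    flat = RAPID_FLATTEN_JSON(json_extract(doc, 'batters'));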

Aggregating

JSON documents may contain sub-documents as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but it is still a text document. There is no function that can extract ALL of a particular key (like invoice_price) and aggregate the results. So, if you have a document called orders which contains a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDF to aggregate a “total sales” figure from all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, ‘key’). This UDF will extract all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item. This UDF returns each item separated by newline. I created a few stored routines called jsum, jmin, jmax, jcount, and javg. They can process the output of rapid_extract_all and aggregate it. If you want to only RAPID_EXTRACT_ALL from a portion of a document, extract that portion with the MySQL UDF JSON_EXTRACT first, then process that with RAPID_EXTRACT_ALL.

For example:

mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1\G
*************************** 1. row ***************************
ids: 888
889
2312
5869
8702
jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:

mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
+-----------------------------------------+
| sum( jsum(json_extract_all(doc,'id')) ) |
+-----------------------------------------+
|                         296615411.00000 |
+-----------------------------------------+
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.

Where to get the tools:
You can find the UDF in the swanhart-tools github repo. I think you will find these tools very useful in working with JSON documents in MySQL.

(This post was originally published on my personal blog, swanhart.livejournal.com, but is reposted here for wider distribution.)


May 01, 2014

Parallel Query for MySQL with Shard-Query

While Shard-Query can work over multiple nodes, this blog post focuses on using Shard-Query with a single node. Shard-Query can add parallelism to queries which use partitioned tables, and very large tables can often be partitioned fairly easily. Shard-Query leverages partitioning to add parallelism because each partition can be queried independently. Since MySQL 5.6 supports the PARTITION hint, Shard-Query can parallelize any partitioning method (even subpartitioning) on 5.6, but it is limited to RANGE/LIST partitioning methods on earlier versions.

The output from Shard-Query shown here is from the command-line client, but you can use MySQL Proxy to communicate with Shard-Query too.

In the examples I am going to use the schema from the Star Schema Benchmark.  I generated data for scale factor 10, which means about 6GB of data in the largest table. I am going to show a few different queries, and explain how Shard-Query executes them in parallel.

Here is the DDL for the lineorder table, which I will use for the demo queries:

CREATE TABLE IF NOT EXISTS lineorder
(
 LO_OrderKey bigint not null,
 LO_LineNumber tinyint not null,
 LO_CustKey int not null,
 LO_PartKey int not null,
 LO_SuppKey int not null,
 LO_OrderDateKey int not null,
 LO_OrderPriority varchar(15),
 LO_ShipPriority char(1),
 LO_Quantity tinyint,
 LO_ExtendedPrice decimal,
 LO_OrdTotalPrice decimal,
 LO_Discount decimal,
 LO_Revenue decimal,
 LO_SupplyCost decimal,
 LO_Tax tinyint,
 LO_CommitDateKey int not null,
 LO_ShipMode varchar(10),
 primary key(LO_OrderDateKey,LO_PartKey,LO_SuppKey,LO_Custkey,LO_OrderKey,LO_LineNumber)
) PARTITION BY HASH(LO_OrderDateKey) PARTITIONS 8;

Notice that the lineorder table is partitioned by HASH(LO_OrderDateKey) into 8 partitions.  I used 8 partitions and my test box has 4 cores. It does not hurt to have more partitions than cores. A number of partitions that is two or three times the number of cores generally works best because it keeps each partition small, and smaller partitions are faster to scan. If you have a very large table, a larger number of partitions may be acceptable. Shard-Query will submit a query to Gearman for each partition, and the number of Gearman workers controls the parallelism.

The SQL for the first demo is:

SELECT COUNT(DISTINCT LO_OrderDateKey) FROM lineorder;

Here is the explain from regular MySQL:

mysql> explain select count(distinct LO_OrderDateKey) from lineorder\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 25
          ref: NULL
         rows: 58922188
        Extra: Using index
1 row in set (0.00 sec)

 

So it is basically a full table scan. It takes a long time:

mysql> select count(distinct LO_OrderDateKey) from lineorder;
+---------------------------------+
| count(distinct LO_OrderDateKey) |
+---------------------------------+
|                            2406 |
+---------------------------------+
1 row in set (4 min 48.63 sec)

 

Shard-Query executes this query differently from MySQL. It sends a query to each partition, in parallel like the following queries:

Array
(
    [0] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p0)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [1] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p1)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [2] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p2)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [3] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p3)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [4] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p4)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [5] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p5)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [6] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p6)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    [7] => SELECT LO_OrderDateKey AS expr_2839651562
FROM lineorder  PARTITION(p7)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
)

You will notice that there is one query for each partition. Those queries will be sent to Gearman and executed in parallel by as many Gearman workers as possible (in this case, 4). The output of the queries goes into a coordinator table, and then another query does a final aggregation. That query looks like this:

SELECT COUNT(distinct expr_2839651562) AS `count`
FROM `aggregation_tmp_73522490`

The Shard-Query time:

select count(distinct LO_OrderDateKey) from lineorder;
Array
(
    [count ] => 2406
)
1 rows returned
Exec time: 0.10923719406128

That isn’t a typo, it really is sub-second compared to minutes in regular MySQL.

This is because Shard-Query uses GROUP BY to answer this query, and a loose index scan of the PRIMARY KEY is possible:

mysql> explain partitions SELECT LO_OrderDateKey AS expr_2839651562
    -> FROM lineorder  PARTITION(p7)  AS `lineorder`   WHERE 1=1  AND 1=1  GROUP BY LO_OrderDateKey
    -> \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
   partitions: p7
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 80108
        Extra: Using index for group-by
1 row in set (0.00 sec)

Next another simple query will be tested, first on regular MySQL:

mysql> select count(*) from lineorder;
+----------+
| count(*) |
+----------+
| 59986052 |
+----------+
1 row in set (4 min 8.70 sec)

Again, the EXPLAIN shows a full table scan:

mysql> explain select count(*) from lineorder\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: lineorder
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 25
          ref: NULL
         rows: 58922188
        Extra: Using index
1 row in set (0.00 sec)

Now, Shard-Query can’t do anything special to speed up this query, except to execute it in parallel, similar to the first query:

[0] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1 AND 1=1
[1] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p1) AS `lineorder` WHERE 1=1 AND 1=1
[2] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p2) AS `lineorder` WHERE 1=1 AND 1=1
[3] => SELECT COUNT(*) AS expr_3190753946
FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 AND 1=1
...

The aggregation SQL is similar, but this time the aggregate function is changed to SUM to combine the COUNT from each partition:

SELECT SUM(expr_3190753946) AS ` count `
FROM `aggregation_tmp_51969525`

And the query is quite a bit faster at 140.24 seconds, compared with MySQL’s 248.7-second result:

Array
(
[count ] => 59986052
)
1 rows returned
Exec time: 140.24419403076

Finally, I want to look at a more complex query that uses joins and aggregation.

mysql> explain select d_year, c_nation,  sum(lo_revenue - lo_supplycost) as profit  from lineorder
join dim_date  on lo_orderdatekey = d_datekey
join customer  on lo_custkey = c_customerkey
join supplier  on lo_suppkey = s_suppkey
join part  on lo_partkey = p_partkey
where  c_region = 'AMERICA'  and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1'  or p_mfgr = 'MFGR#2')
group by d_year, c_nation  order by d_year, c_nation;
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+------+---------------------------------+
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                      | rows | Extra                           |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+------+---------------------------------+
|  1 | SIMPLE      | dim_date  | ALL    | PRIMARY       | NULL    | NULL    | NULL                     |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | lineorder | ref    | PRIMARY       | PRIMARY | 4       | ssb.dim_date.D_DateKey   |   89 | NULL                            |
|  1 | SIMPLE      | supplier  | eq_ref | PRIMARY       | PRIMARY | 4       | ssb.lineorder.LO_SuppKey |    1 | Using where                     |
|  1 | SIMPLE      | customer  | eq_ref | PRIMARY       | PRIMARY | 4       | ssb.lineorder.LO_CustKey |    1 | Using where                     |
|  1 | SIMPLE      | part      | eq_ref | PRIMARY       | PRIMARY | 4       | ssb.lineorder.LO_PartKey |    1 | Using where                     |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+------+---------------------------------+
5 rows in set (0.01 sec)

Here is the query on regular MySQL:

mysql> select d_year, c_nation,  sum(lo_revenue - lo_supplycost) as profit  from lineorder  join dim_date  on lo_orderdatekey = d_datekey  join customer  on lo_custkey = c_customerkey  join supplier  on lo_suppkey = s_suppkey  join part  on lo_partkey = p_partkey  where  c_region = 'AMERICA'  and s_region = 'AMERICA'  and (p_mfgr = 'MFGR#1'  or p_mfgr = 'MFGR#2')  group by d_year, c_nation  order by d_year, c_nation;
+--------+---------------+--------------+
| d_year | c_nation      | profit       |
+--------+---------------+--------------+
|   1992 | ARGENTINA     | 102741829748 |
...
|   1998 | UNITED STATES |  61345891337 |
+--------+---------------+--------------+
35 rows in set (11 min 56.79 sec)

Again, Shard-Query splits up the query to run over each partition (I won’t bore you with the details), and it executes the query faster than MySQL, in 343.3 seconds compared to roughly 717 seconds (11 min 56.79 sec):

Array
(
    [d_year] => 1998
    [c_nation] => UNITED STATES
    [profit] => 61345891337
)
35 rows returned
Exec time: 343.29854893684

I hope you see how using Shard-Query can speed up queries without sharding, on just a single server. All you really need to do is add partitioning, as in the sketch below.
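For example, the lineorder fact table from the DDL earlier could be given its hash partitioning with a statement like the one below. This is a minimal sketch: the partition count is illustrative and should be sized against your core count as discussed above, and note that repartitioning rebuilds the table, which can take a while on large tables.

-- Hash-partition the fact table so Shard-Query can scan partitions in parallel.
ALTER TABLE lineorder
  PARTITION BY HASH(LO_OrderDateKey) PARTITIONS 8;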

You can get Shard-Query from GitHub at http://github.com/greenlion/swanhart-tools

Please note: Configure and install Shard-Query as normal, but simply use one node and set the column option (the shard column) to “nocolumn” or false, because you are not required to use a shard column if you are not sharding.

