Jun 13, 2018

Zone Based Sharding in MongoDB


In this blog post, we will discuss how to use zone-based sharding to deploy a sharded MongoDB cluster in a customized manner, so that queries and data are redirected according to geographical groupings. This feature is part of MongoDB's Data Center Awareness, which allows queries to be routed to particular MongoDB deployments based on the physical locations or configurations of the mongod instances.

Before moving on, let’s have an overview of this feature. You might already have some questions about zone-based sharding. Was it recently introduced? If zone-based sharding is something we should use, then what about tag-aware sharding?

MongoDB has supported tag-aware sharding since its early versions. It means tagging a range of shard key values, associating that range with a shard, and redirecting operations to that specific tagged shard. Since version 3.4, tag-aware sharding is referred to as zones. So the only change is the name, which is why the sh.addShardTag(shard, tag) method is still used.

How it works

  1. With the help of a shard key, MongoDB allows you to create zones of sharded data – also known as shard zones.
  2. Each zone can be associated with one or more shards.
  3. Similarly, a shard can be associated with any number of non-conflicting zones.
  4. MongoDB migrates the chunks covered by a zone's range to the shards associated with that zone.
  5. MongoDB routes reads and writes for a zone's range to the shards where that range resides.

Useful for what kind of deployments/applications?

  1. In cases where data needs to be routed to a particular shard due to some hardware configuration restrictions.
  2. Zones can be useful if there is the need to isolate specific data to a particular shard. For example, in the case of GDPR compliance that requires businesses to protect data and privacy for an individual within the EU.
  3. If an application is used across different geographies and you want queries to be routed to the nearest shard for both reads and writes.

Let’s consider a Scenario

Consider the scenario of a school where some students specialize in Biology, but most specialize in Maths, so we have more data for the Maths students compared to the Biology students. In this example, the deployment requires that the Maths students' data is routed to the shard with the better configuration for handling a large amount of data, and that both reads and writes for that data are served by that specific shard. All the Biology students' data will be served by another shard. To implement this, we will add tags that deploy the zones to the shards.

For this scenario we have an environment with:

DB: “school”

Collection: “students”

Fields: “sId”, “subject”, “marks” and so on.

Indexed Fields: “subject” and “sId”

We enable sharding:

sh.enableSharding("school")

And create a shard key on “subject” and “sId”:

sh.shardCollection("school.students", {subject: 1, sId: 1});

We have two shards in our test environment:

shards:

{  "_id" : "shard0000",  "host" : "127.0.0.1:27001",  "state" : 1 }
{  "_id" : "shard0001",  "host" : "127.0.0.1:27002",  "state" : 1 }

Zone Deployment

1) Disable balancer

To prevent migration of the chunks across the cluster, disable the balancer for the “students” collection:

mongos> sh.disableBalancing("school.students")
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

Before proceeding further, make sure the balancer is not running. This is not mandatory, but it is always good practice to make sure no chunk migration takes place while configuring zones:

mongos> sh.isBalancerRunning()
false

2) Add shard to the zone

A zone can be associated with a particular shard in the form of a tag, using sh.addShardTag(), so a tag will be added to each shard. Here we are considering two zones, so the tags “MATHS” and “BIOLOGY” need to be added.

mongos> sh.addShardTag( "shard0000" , "MATHS");
{ "ok" : 1 }
mongos> sh.addShardTag( "shard0001" , "BIOLOGY");
{ "ok" : 1 }

We can see zones are assigned in the form of tags as required against each shard.

mongos> sh.status()
 shards:
        {  "_id" : "shard0000",  "host" : "127.0.0.1:27001",  "state" : 1,  "tags" : [ "MATHS" ] }
        {  "_id" : "shard0001",  "host" : "127.0.0.1:27002",  "state" : 1,  "tags" : [ "BIOLOGY" ] }

3) Define ranges for each zone

Each zone covers one or more ranges of shard key values. Note: each range a zone covers is always inclusive of its lower boundary and exclusive of its upper boundary.

mongos> sh.addTagRange(
	"school.students",
	{ "subject" : "maths", "sId" : MinKey},
	{ "subject" : "maths", "sId" : MaxKey},
	"MATHS"
)
{ "ok" : 1 }
mongos> sh.addTagRange(
	"school.students",
	{ "subject" : "biology", "sId" : MinKey},
	{ "subject" : "biology", "sId" : MaxKey},
"BIOLOGY"
)
{ "ok" : 1 }

4) Enable balancer

Now enable the balancer so that the chunks migrate across the shards as required, and all read and write queries are routed to the appropriate shards.

mongos> sh.enableBalancing("school.students")
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
mongos> sh.isBalancerRunning()
true

Let’s check how documents get routed as per the tags:

We have inserted six documents: four documents with "subject": "maths" and two documents with "subject": "biology".

mongos> db.students.find({"subject":"maths"}).count()
4
mongos> db.students.find({"subject":"biology"}).count()
2
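For reference, the documents were inserted through mongos along these lines. This is a minimal sketch: the sId and marks values shown here are illustrative, not the exact ones used in the test.

mongos> use school
mongos> db.students.insert({ sId: 1, subject: "maths", marks: 71 })
mongos> db.students.insert({ sId: 2, subject: "maths", marks: 72 })
mongos> db.students.insert({ sId: 3, subject: "maths", marks: 73 })
mongos> db.students.insert({ sId: 4, subject: "maths", marks: 74 })
mongos> db.students.insert({ sId: 5, subject: "biology", marks: 80 })
mongos> db.students.insert({ sId: 6, subject: "biology", marks: 85 })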

Checking the shard distribution for the students collection:

mongos> db.students.getShardDistribution()
Shard shard0000 at 127.0.0.1:27003
data : 236B docs : 4 chunks : 4
estimated data per chunk : 59B
estimated docs per chunk : 1
Shard shard0001 at 127.0.0.1:27004
data : 122B docs : 2 chunks : 1
estimated data per chunk : 122B
estimated docs per chunk : 2

So in this test case, all the queries for the students collection have been routed according to the tags, with four documents inserted into shard0000 and two documents inserted into shard0001.

Any queries related to MATHS will be routed to shard0000, and queries related to BIOLOGY will be routed to shard0001. The load is therefore distributed according to each shard's configuration, keeping database performance optimized.

Sharding MongoDB using zones is a great feature. With the help of zones, data can be isolated to specific shards, and if we have any kind of hardware or configuration restrictions on the shards, zones are a possible solution for routing operations accordingly.


May 23, 2018

Deploy a MongoDB Replica Set with Transport Encryption (Part 2)


In this article series, we will talk about the basic high availability architecture of a MongoDB deployment: the MongoDB replica set.

  • Part 1: We introduced basic replica set concepts, how a replica set works and what its main features are
  • Part 2 (this post): We’ll provide a step-by-step guide to configure a three-node replica set
  • Part 3: We’ll talk about how to configure transport encryption between the nodes

In part 1 we introduced and described the main features of a MongoDB replica set. In this post, we are going to present a step-by-step guide to deploy a basic and fully operational three-node replica set. We’ll use just regular members, all with priority = 1: no arbiter, no hidden or delayed nodes.

The environment

Our example environment is three virtual hosts running Ubuntu 16.04 LTS, although the configuration is the same on CentOS or other Linux distributions.

We have installed Percona Server for MongoDB on each node. Hostnames and IPs are:

  • psmdb1 : 192.168.56.101
  • psmdb2 : 192.168.56.102
  • psmdb3 : 192.168.56.103

It is not the goal of this post to provide installation details, but in case you need them you can follow this guide: https://www.percona.com/doc/percona-server-for-mongodb/LATEST/install/index.html. Installing MongoDB from the repository is very easy.

Connectivity

Once we have all the nodes with MongoDB installed, we just need to be sure that each one is accessible by all the others on port 27017, the default port.

Since our members are on the same network we can simply try to test the connectivity between each pair of nodes, connecting the mongo client from one node to each of the others.

psmdb1> mongo --host 192.168.56.102 --port 27017
psmdb1> mongo --host 192.168.56.103 --port 27017
psmdb2> mongo --host 192.168.56.101 --port 27017
psmdb2> mongo --host 192.168.56.103 --port 27017
psmdb3> mongo --host 192.168.56.101 --port 27017
psmdb3> mongo --host 192.168.56.102 --port 27017

If the mongo client is not able to connect, we need to check the network configuration, or to configure or disable the firewall.

Hostnames

Configuring hostnames on our hosts is not mandatory for the replica set; in fact, you can configure the replica set using just the IPs and it works fine. But we need to define the hostnames because they will be very useful when we discuss how to configure internal encryption in Part 3.

We need to ensure that each member is accessible by way of resolvable DNS or hostnames.

Set up each node in the /etc/hosts file:

root@psmdb1:~# cat /etc/hosts
127.0.0.1       localhost
192.168.56.101  psmdb1
192.168.56.102  psmdb2
192.168.56.103  psmdb3

Choose a name for the replica set

We are now close to finalizing the configuration.

Now we have to choose a name for the replica set. We need to choose one and put it in each member’s configuration file. Let’s say we decide to use rs-test.

Put the replica set name into /etc/mongod.conf (the MongoDB configuration file) on each host. Enter the following:

replication:
     replSetName: "rs-test"

Restart the server:

sudo service mongod restart

Remember to do this on all the nodes.

That’s all we need to do to configure the replication at its most basic. There are obviously other configuration parameters we could set, but maybe we’ll talk about them in another post when discussing more advanced features. For this basic deployment we can assume that all the default values are good enough.

Initiate replication

Now we need to connect to one of the nodes. It doesn’t matter which; just choose one of them and launch the mongo shell to connect to the local mongod instance.

Then issue the rs.initiate() command to let the replica set know what its members are.

mongo> rs.initiate( {
      ... _id: "rs-test",
      ... members: [
      ... { _id: 0, host: "psmdb1:27017" },
      ... { _id: 1, host: "psmdb2:27017" },
      ... { _id: 2, host: "psmdb3:27017" }
      ... ] })

After issuing the command, MongoDB initiates the replication process using the default configuration. A PRIMARY node is elected, and all the documents created from now on will be asynchronously replicated to the SECONDARY nodes.

We don’t need to do anything else. The replica set is now working.

We can verify that the replication is working by taking a look at the mongo shell prompt. Once the replica set is up and running the prompt should be like this on the PRIMARY node:

rs-test:PRIMARY>

and like this on the SECONDARY nodes:

rs-test:SECONDARY>

MongoDB lets you know the replica role of the node that you are connected to.
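If you prefer to check the role programmatically rather than reading the prompt, the isMaster command reports it. A quick sketch from the mongo shell (the field names are those returned by MongoDB):

// Run on any member of the replica set
db.isMaster().ismaster    // true only on the PRIMARY
db.isMaster().secondary   // true on a SECONDARY
db.isMaster().primary     // "host:port" of the current PRIMARY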

A couple of useful commands

There are several commands to investigate and to do some administrative tasks on the replica set. Here are a couple of them.

To investigate the replica set configuration you can issue rs.conf() on any node:

rs-test:PRIMARY> rs.conf()
{
 "_id" : "rs-test",
 "version" : 68835,
 "protocolVersion" : NumberLong(1),
 "members" : [
 {
 "_id" : 0,
 "host" : "psmdb1:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {
},
 "slaveDelay" : NumberLong(0),
 "votes" : 1
 },
 {
 "_id" : 1,
 "host" : "psmdb2:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {
},
 "slaveDelay" : NumberLong(0),
 "votes" : 1
 },
 {
 "_id" : 2,
 "host" : "psmdb3:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {
},
 "slaveDelay" : NumberLong(0),
 "votes" : 1
 }
 ],
 "settings" : {
 "chainingAllowed" : true,
 "heartbeatIntervalMillis" : 2000,
 "heartbeatTimeoutSecs" : 10,
 "electionTimeoutMillis" : 10000,
 "catchUpTimeoutMillis" : 60000,
 "getLastErrorModes" : {
},
 "getLastErrorDefaults" : {
 "w" : 1,
 "wtimeout" : 0
 },
 "replicaSetId" : ObjectId("5aa2600d377adb63d28e7f0f")
 }
}

We can see information about the configured nodes, whether arbiter or hidden, the priority, and other details regarding the heartbeat process.

To investigate the replica set status you can issue rs.status() on any node:

rs-test:SECONDARY> rs.status()
{
 "set" : "rs-test",
 "date" : ISODate("2018-05-14T10:16:05.228Z"),
 "myState" : 2,
 "term" : NumberLong(47),
 "syncingTo" : "psmdb3:27017",
 "heartbeatIntervalMillis" : NumberLong(2000),
 "optimes" : {
 "lastCommittedOpTime" : {
 "ts" : Timestamp(1526292954, 1),
 "t" : NumberLong(47)
 },
 "appliedOpTime" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 },
 "durableOpTime" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 }
 },
 "members" : [
 {
 "_id" : 0,
 "name" : "psmdb1:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 392,
 "optime" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 },
 "optimeDate" : ISODate("2018-05-14T10:16:04Z"),
 "syncingTo" : "psmdb3:27017",
 "configVersion" : 68835,
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "psmdb2:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "uptime" : 379,
 "optime" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 },
 "optimeDurable" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 },
 "optimeDate" : ISODate("2018-05-14T10:16:04Z"),
 "optimeDurableDate" : ISODate("2018-05-14T10:16:04Z"),
 "lastHeartbeat" : ISODate("2018-05-14T10:16:04.832Z"),
 "lastHeartbeatRecv" : ISODate("2018-05-14T10:16:03.318Z"),
 "pingMs" : NumberLong(0),
 "electionTime" : Timestamp(1526292592, 1),
 "electionDate" : ISODate("2018-05-14T10:09:52Z"),
 "configVersion" : 68835
 },
 {
 "_id" : 2,
 "name" : "psmdb3:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 378,
 "optime" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 },
 "optimeDurable" : {
 "ts" : Timestamp(1526292964, 1),
 "t" : NumberLong(47)
 },
 "optimeDate" : ISODate("2018-05-14T10:16:04Z"),
 "optimeDurableDate" : ISODate("2018-05-14T10:16:04Z"),
 "lastHeartbeat" : ISODate("2018-05-14T10:16:04.832Z"),
 "lastHeartbeatRecv" : ISODate("2018-05-14T10:16:04.822Z"),
 "pingMs" : NumberLong(0),
 "syncingTo" : "psmdb2:27017",
 "configVersion" : 68835
 }
 ],
 "ok" : 1
}

Here we can see, for example, whether the nodes are reachable and running, and in particular the role each one has at this moment: which is the PRIMARY and which are the SECONDARY nodes.

Test replication

Finally, let’s try to test that the replication process is really working as expected.

Connect to the PRIMARY node and create a sample document:

rs-test:PRIMARY> use test
switched to db test
rs-test:PRIMARY> db.foo.insert( {name:"Bruce", surname:"Dickinson"} )
WriteResult({ "nInserted" : 1 })
rs-test:PRIMARY> db.foo.find().pretty()
{
    "_id" : ObjectId("5ae05ac27e6680071caf94b7")
    "name" : "Bruce"
    "surname" : "Dickinson"
}

Then connect to a SECONDARY node and look for the same document.

Remember that, by default, you can’t read data on a SECONDARY node: reads and writes are allowed only on the PRIMARY. So, if you want to read data on a SECONDARY node, you first need to issue the rs.slaveOk() command. If you don’t do this, you will receive an error.

rs-test:SECONDARY> rs.slaveOk()
rs-test:SECONDARY> show collections
local
foo
rs-test:SECONDARY> db.foo.find().pretty()
{
     "_id" : ObjectId("5ae05ac27e6680071caf94b7")
     "name" : "Bruce"
     "surname" : "Dickinson"
}

As we can see, the SECONDARY node has replicated the creation of the collection foo and the inserted document.

This simple test demonstrates that the replication process is working as expected.

There are more sophisticated features for investigating and troubleshooting a replica set, but discussing them is not in the scope of this post.

In Part 3, we’ll show how to encrypt the internal replication process we have deployed so far.

Read the first post of this series: Deploy a MongoDB Replica Set with Transport Encryption


Sep 16, 2016

How X Plugin Works Under the Hood


In this blog post, we’ll look at what MySQL does under the hood to transform NoSQL requests into SQL (and then store the data in the transactional InnoDB engine) when using the X Plugin.

The X Plugin allows MySQL to function as a document store. We don’t need to define any schema or use the SQL language, while MySQL remains a fully ACID database. Sounds like magic – but we know the only thing that magic does is make planes fly!

Alexander already wrote a blog post exploring how the X Plugin works, with some examples. In this post, I am going to show some more query examples and how they are transformed.

I have enabled the slow query log to see what is actually being executed when I run NoSQL queries.

Creating our first collection

We start the MySQL shell and create our first collection:

$ mysqlsh -u root --py
Creating an X Session to root@localhost:33060
No default schema selected.
[...]
Currently in Python mode. Use sql to switch to SQL mode and execute queries.
mysql-py> db.createCollection("people")

What is a collection in SQL terms? A table. Let’s check what MySQL does by reading the slow query log:

CREATE TABLE `people` (
  `doc` json DEFAULT NULL,
  `_id` varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$._id'))) STORED NOT NULL,
  PRIMARY KEY (`_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

As we correctly guessed, it creates a table with two columns. One is called “doc”, and it stores the JSON document. The second column, named “_id”, is created as a generated column from data extracted from that JSON document. _id is used as the primary key, and if we don’t specify a value, MySQL will choose a random UUID every time we write a document.

So, the basics are clear.

  • It stores everything inside a JSON column.
  • Indexes are created on virtual columns that are generated by extracting data from that JSON. Every time we add a new index, a virtual column will be generated. That means that under the hood, an alter table will run, adding the column and the corresponding index (see the sketch below).
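For illustration, adding an index on, say, the Name field corresponds on the SQL side to something along these lines. This is a hand-written sketch: the column and index names, and the exact statement the plugin generates, may differ.

ALTER TABLE `test`.`people`
  ADD COLUMN `Name` VARCHAR(255)
    GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$.Name'))) VIRTUAL,
  ADD KEY `Name` (`Name`);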

Let’s run getCollections(), which is similar to “SHOW TABLES” in the SQL world:

mysql-py> db.getCollections()
[
]

This is what MySQL actually runs:

SELECT C.table_name AS name, IF(ANY_VALUE(T.table_type)='VIEW', 'VIEW', IF(COUNT(*) = COUNT(CASE WHEN (column_name = 'doc' AND data_type = 'json') THEN 1 ELSE NULL END) + COUNT(CASE WHEN (column_name = '_id' AND generation_expression = 'json_unquote(json_extract(`doc`,''$._id''))') THEN 1 ELSE NULL END) + COUNT(CASE WHEN (column_name != '_id' AND generation_expression RLIKE '^(json_unquote[[.(.]])?json_extract[[.(.]]`doc`,''[[.$.]]([[...]][^[:space:][...]]+)+''[[.).]]{1,2}$') THEN 1 ELSE NULL END), 'COLLECTION', 'TABLE')) AS type FROM information_schema.columns AS C LEFT JOIN information_schema.tables AS T USING (table_name)WHERE C.table_schema = 'test' GROUP BY C.table_name ORDER BY C.table_name;

This time, the query is a bit more complex. It runs a query on information_schema.columns, joined with information_schema.tables, searching for tables that have the expected “doc” and “_id” columns.

Inserting and reading documents

I am going to start adding data to our collection. Let’s add our first document:

mysql-py> db.people.add(
      ...  {
      ...     "Name": "Miguel Angel",
      ...     "Country": "Spain",
      ...     "Age": 33
      ...   }
      ... )

In the background, MySQL inserts a JSON object and auto-assigns a primary key value.

INSERT INTO `test`.`people` (doc) VALUES (JSON_OBJECT('Age',33,'Country','Spain','Name','Miguel Angel','_id','a45c69cd2074e611f11f62bf9ac407d7'));

Ok, this is supposed to be schemaless. So let’s add someone else using different fields:

mysql-py> db.people.add(
      ...  {
      ...     "Name": "Thrall",
      ...     "Race": "Orc",
      ...     "Faction": "Horde"
      ...   }
      ... )

Same as before, MySQL just writes another JSON object (with different fields):

INSERT INTO `test`.`people` (doc) VALUES (JSON_OBJECT('Faction','Horde','Name','Thrall','Race','Orc','_id','7092776c2174e611f11f62bf9ac407d7'));

Now we are going to read the data we have just inserted. First, we are going to find all documents stored in the collection:

mysql-py> db.people.find()

MySQL translates to a simple:

SELECT doc FROM `test`.`people`;

And this is how filters are transformed:

mysql-py> db.people.find("Name = 'Thrall'")

It uses a SELECT with the WHERE clause on data extracted from the JSON object.

SELECT doc FROM `test`.`people` WHERE (JSON_EXTRACT(doc,'$.Name') = 'Thrall');

Updating documents

Thrall decided that he doesn’t want to belong to the Horde anymore. He wants to join the Alliance. We need to update the document:

mysql-py> db.people.modify("Name = 'Thrall'").set("Faction", "Alliance")

MySQL runs an UPDATE, again using a WHERE clause on the data extracted from the JSON. Then, it updates the “Faction”:

UPDATE `test`.`people` SET doc=JSON_SET(doc,'$.Faction','Alliance') WHERE (JSON_EXTRACT(doc,'$.Name') = 'Thrall');

Now I want to remove my own document:

mysql-py> db.people.remove("Name = 'Miguel Angel'");

As you can already imagine, it runs a DELETE, searching for my name on the data extracted from the JSON object:

DELETE FROM `test`.`people` WHERE (JSON_EXTRACT(doc,'$.Name') = 'Miguel Angel');

Summary

The magic that makes our MySQL work like a document-store NoSQL database is:

  • Create a simple InnoDB table with a JSON column.
  • Auto-generate the primary key with UUID values and represent it as a virtual column.
  • All searches are done by extracting data with JSON_EXTRACT, and passing that to the WHERE clause.

I would define the solution as something really clever, simple and clean. Congrats to Oracle!

Aug 01, 2016

Introduction into storage engine troubleshooting: Q & A


In this blog, I will provide answers to the Q & A for the “Introduction into storage engine troubleshooting” webinar.

First, I want to thank everybody for attending the July 14 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: At which isolation level do pt-online-schema-change and pt-archiver copy data from a table?

A: Both tools do not change the server’s default transaction isolation level. Use either REPEATABLE READ or set it in my.cnf.

Q: Can I create an index to optimize a query which has group by A and order by B, both from different tables and A column is from the first table in the two table join?

A: Do you mean a query like SELECT ... FROM a, b GROUP BY a.A ORDER BY b.B? Yes, this is possible:

mysql> explain select A, B, count(*) from a join b on(a.A=b.id) WHERE b.B < 4 GROUP BY a.A, b.B ORDER BY b.B ASC;
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
|  1 | SIMPLE      | b     | range | PRIMARY,B     | B    | 5       | NULL      |   15 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | ref   | A             | A    | 5       | test.b.id |    1 | Using index                                               |
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
2 rows in set (0.00 sec)

Q: Where can I find recommendations on what kind of engine to use for different application types or use cases?

A: Storage engines are always being actively developed, therefore I suggest that you don’t search for generic recommendations. These can be outdated just a few weeks after they are written. Study engines instead. For example, just a few years ago MyISAM was the only engine (among those officially supported) that could work with FULLTEXT indexes and SPATIAL columns. Now InnoDB supports both: FULLTEXT indexes since version 5.6 and GIS features in 5.7. Today I can recommend InnoDB as a general-purpose engine for all installations, and TokuDB for write-heavy workloads when you cannot use high-speed disks.

Alternative storage engines can help to realize specific business needs. For example, CONNECT brings data to your server from many sources, SphinxSE talks to the Sphinx daemon, etc.

Other alternative storage engines increase the speed of certain workloads. Memory, for example, can be a good fit for temporary tables.

Q: Can you please explain how we find the full text of the query when we query the view ‘statements_with_full_table_Scans’?

A: Do you mean a view in the sys schema? Sys schema views take information from the summary_* and digest tables in Performance Schema, therefore they do not contain full queries (only digests). The full text of a query can be found in the events_statements_* tables in Performance Schema. Note that even the events_statements_history_long table can be rewritten very quickly, and you may want to save data from it periodically.
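For example, something along these lines pulls recent full statement texts from Performance Schema (a sketch: the LIKE filter and LIMIT are only illustrative):

SELECT THREAD_ID, SQL_TEXT
FROM performance_schema.events_statements_history_long
WHERE SQL_TEXT LIKE '%my_table%'
ORDER BY TIMER_START DESC
LIMIT 10;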

Q: Hi is TokuDB for the new document protocol?

A: As Alex Rubin showed in his detailed blog post, the new document protocol just converts NoSQL queries into SQL, and is thus not limited to any storage engine. To use documents and collections, a storage engine must support generated columns (which TokuDB currently does not). So support of the X Protocol for TokuDB is limited to relational table access.

Q: Please comment on “read committed” versus “repeatable read.”
Q: Repeatable read holds the cursor on the result set for the client versus read committed where the cursor is updated after a transaction.

A: READ COMMITTED and REPEATABLE READ are transaction isolation levels, whose details are explained here.
I would not correlate locks set on table rows in different transaction isolation modes with the result set. A transaction with isolation level REPEATABLE READ instead creates a snapshot of rows that are accessed by the transaction. Let’s consider a table:

mysql> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.56 sec)
mysql> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.03 sec)
Records: 9  Duplicates: 0  Warnings: 0

Then start the transaction and select a few rows from this table:

mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+------+
4 rows in set (0.04 sec)

Now let’s update another set of rows in another transaction:

mysql2> update ti set f1 = id*2 where id > 5;
Query OK, 4 rows affected (0.06 sec)
Rows matched: 4  Changed: 4  Warnings: 0
mysql2> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
+----+------+
9 rows in set (0.00 sec)

You see that the first four rows – which we accessed in the first transaction – were not modified, and the last four were modified. If InnoDB only saved the cursor (as someone answered above) we would expect to see the same result if we ran a SELECT * ... query in our old transaction, but it actually shows the whole table content before the modification:

mysql1> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |    6 |
|  7 |    7 |
|  8 |    8 |
|  9 |    9 |
+----+------+
9 rows in set (0.00 sec)

So “snapshot” is a better word than “cursor” for the result set. In the case of READ COMMITTED, the first transaction would see the modified rows:

mysql1> drop table ti;
Query OK, 0 rows affected (0.11 sec)
mysql1> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.38 sec)
mysql1> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.04 sec)
Records: 9  Duplicates: 0  Warnings: 0
mysql1> set transaction isolation level read committed;
Query OK, 0 rows affected (0.00 sec)
mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+------+
4 rows in set (0.00 sec)

Let’s update all rows in the table this time:

mysql2> update ti set f1 = id*2;
Query OK, 9 rows affected (0.04 sec)
Rows matched: 9  Changed: 9  Warnings: 0

Now the first transaction sees both the modified rows with id >= 5 (not in the initial result set) and the modified rows with id < 5 (which existed in the initial result set):

mysql1> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
+----+------+
9 rows in set (0.00 sec)

Jun 08, 2016

Using MySQL 5.7 Document Store with Internet of Things (IoT)


In this blog post, I’ll discuss how to use MySQL 5.7 Document Store to track data from Internet of Things (IoT) devices.

Using JSON in MySQL 5.7

In my previous blog post, I looked into the MySQL 5.7.12 Document Store. This is a brand new feature in MySQL 5.7, and many people are asking when they need or want to use the JSON or Document Store interface.

Storing data in JSON may be quite useful in some cases, for example:

  • You already have a JSON (i.e., from external feeds) and need to store it anyway. Using the JSON datatype will be more convenient and more efficient.
  • For the Internet of Things, specifically, when storing events from sensors: some sensors may send only temperature data, some may send temperature, humidity and light (but light information is only recorded during the day), etc. Storing it in JSON format may be more convenient in that you don’t have to declare all possible fields in advance, and do not have to run “alter table” if a new sensor starts sending new types of data.

Internet of Things

In this blog post, I will show an example of storing an event stream from Particle Photon. Last time I created a device to measure light and temperature and stored the results in MySQL. Particle.io provides the ability to use its own MQTT server and publish events with:

Spark.publish("temperature", String(temperature));
Spark.publish("humidity", String(humidity));
Spark.publish("light", String(light));

Then, I wanted to “subscribe” to my events and insert those into MySQL (for further analysis). As we have three different metrics for the same device, we have two basic options (sketched after this list):

  1. Use a field per metric and create something like this: device_id int, temperature double, humidity double, light double
  2. Use a record per metric and have something like this: device_id int, event_name varchar(255), event_data text (please see this Internet of Things, Messaging and MySQL blog post for more details)
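A minimal sketch of what those two relational layouts might look like (the table and column names here are illustrative):

-- Option 1: one column per metric
CREATE TABLE sensor_data_wide (
  device_id   INT,
  temperature DOUBLE,
  humidity    DOUBLE,
  light       DOUBLE
);

-- Option 2: one row per metric
CREATE TABLE sensor_events (
  device_id  INT,
  event_name VARCHAR(255),
  event_data TEXT
);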

The first option above is not flexible. If my device starts measuring the soil temperature, I will have to “alter table add column”.

Option two is better in this regard, but I may significantly increase the table size as I have to store the name as a string for each measurement. In addition, some devices may send more complex metrics (i.e., latitude and longitude).

In this case, using JSON for storing metrics can be a better option, and I’ve also decided to try Document Store as well.

First, we will need to enable X Plugin and setup the NodeJS / connector. Here are the steps required:

  1. Enable X Plugin in MySQL 5.7.12+, which uses a different port (33060 by default)
  2. Download and install NodeJS (>4.2) and mysql-connector-nodejs-1.0.2.tar.gz (follow the Getting Started with Connector/Node.JS guide).
    # node --version
    v4.4.4
    # wget https://dev.mysql.com/get/Downloads/Connector-Nodejs/mysql-connector-nodejs-1.0.2.tar.gz
    # npm install mysql-connector-nodejs-1.0.2.tar.gz

    Please note: on older systems you will probably need to upgrade the nodejs version (follow the Installing Node.js via package manager guide).

Storing Events from Sensors

Particle.io provides you with an API that allows you to subscribe to all public events (“events” are what sensors send). The API is for NodeJS, which is really convenient as we can use NodeJS for MySQL 5.7.12 Document Store as well.

To use the Particle API, install the particle-api-js module:

$ npm install particle-api-js

I’ve created the following NodeJS code to subscribe to all public events, and then add the data (in JSON format) to a document store:

var mysqlx = require('mysqlx');
var Particle = require('particle-api-js');
var particle = new Particle();
var token = '<place your token here>'
var mySession =
mysqlx.getSession({
    host: 'localhost',
    port: 33060,
    dbUser: 'root',
    dbPassword: '<place your pass here>'
});
process.on('SIGINT', function() {
    console.log("Caught interrupt signal. Exiting...");
    process.exit()
});
particle.getEventStream({ auth: token}).then(function(stream) {
                stream.on('event', function(data) {
                                console.log(data);
                                mySession.then(session => {
                                                session.getSchema("iot").getCollection("event_stream")
                                                .add(  data  )
                                                .execute(function (row) {
                                                        // can log something here
                                                }).catch(err => {
                                                        console.log(err);
                                                })
                                                .then( function (notices) {
                                                        console.log("Wrote to MySQL: " + JSON.stringify(notices))
                                                });
                                }).catch(function (err) {
                                              console.log(err);
                                              process.exit();
                                });
                });
}).catch(function (err) {
                                              console.log(err.stack);
                                              process.exit();
});

How it works:

  • particle.getEventStream({ auth: token}) gives me the stream of events. From there I can subscribe to specific event names, or to all public events using the generic name “events”: stream.on(‘event’, function(data).
  • function(data) is a callback function fired when a new event is ready. The event payload is the JSON “data” object. From there I can simply insert it into the document store: .add( data ).execute() will insert the JSON data into the event_stream collection.

One of the reasons I use document store here is I do not have to know what is inside the event data. I do not have to parse it, I simply throw it to MySQL and analyze it later. If the format of data will change in the future, my application will not break.

Inside the data stream

Here is the example of running the above code:

{ data: 'Humid: 49.40 Temp: 25.00 *C Dew: 13.66 *C HeatI: 25.88 *C',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.433Z',
  coreid: '2b0034000947343337373738',
  name: 'log' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["a3058c16-15db-0dab-f349-99c91a00"]}}
{ data: 'null',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.418Z',
  coreid: '50ff72...',
  name: 'registerdev' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["eff0de02-726e-34bd-c443-6ecbccdd"]}}
{ data: '24.900000',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.480Z',
  coreid: '2d0024...',
  name: 'Humid 2' }
{ data: '[{"currentTemp":19.25},{"currentTemp":19.19},{"currentTemp":100.00}]',
  ttl: '60',
  published_at: '2016-05-20T19:30:52.896Z',
  coreid: '2d002c...',
  name: 'getTempData' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["5f1de278-05e0-6193-6e30-0ebd78f7"]}}
{ data: '{"pump":0,"salt":0}',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.491Z',
  coreid: '55ff6...',
  name: 'status' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["d6fcf85f-4cba-fd59-a5ec-2bd78d4e"]}}

(Please note: although the stream is public, I’ve tried to anonymize the results a little.)

As we can see, the “data” field is JSON and has that structure. I could have implemented it as a MySQL table structure (adding published_at, name, TTL and coreid as separate fields). However, I would have to depend on those specific fields and change my application if those fields changed. We also see examples of how the device sends the data back: it can be just a number, a string or another JSON document.

Analyzing the results

Now I can go to MySQL and use SQL (which I’ve used for >15 years) to find out what I’ve collected. First, I want to know how many device names I have:

mysql -A iot
Welcome to the MySQL monitor.  Commands end with ; or g.
Your MySQL connection id is 3289
Server version: 5.7.12 MySQL Community Server (GPL)
Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> select count(distinct json_unquote(doc->'$.name')) from event_stream;
+---------------------------------------------+
| count(distinct json_unquote(doc->'$.name')) |
+---------------------------------------------+
|                                        1887 |
+---------------------------------------------+
1 row in set (5.47 sec)

That is slow! As described in my previous post, I can create a virtual column and index for doc->’$.name’ to make it faster:

mysql> alter table event_stream add column name varchar(255)
    -> generated always as (json_unquote(doc->'$.name')) virtual;
Query OK, 0 rows affected (0.17 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table event_stream add key (name);
Query OK, 0 rows affected (3.47 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> show create table event_stream
*************************** 1. row ***************************
       Table: event_stream
Create Table: CREATE TABLE `event_stream` (
  `doc` json DEFAULT NULL,
  `_id` varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$._id'))) STORED NOT NULL,
  `name` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$.name'))) VIRTUAL,
  UNIQUE KEY `_id` (`_id`),
  KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> select count(distinct name) from event_stream;
+----------------------+
| count(distinct name) |
+----------------------+
|                 1820 |
+----------------------+
1 row in set (0.67 sec)

How many beers left?

Eric Joyce has published a Keg Inventory Counter that uses a Particle Photon device to measure the amount of beer in a keg by 12oz pours. I want to see the average and the lowest amount of beer left per day:

mysql> select date(json_unquote(doc->'$.published_at')) as day,
    ->        avg(json_unquote(doc->'$.data')) as avg_beer_left,
    ->    min(json_unquote(doc->'$.data')) as min_beer_left
    -> from event_stream
    -> where name = 'Beers_left'
    -> group by date(json_unquote(doc->'$.published_at'));
+------------+--------------------+---------------+
| day        | avg_beer_left      | min_beer_left |
+------------+--------------------+---------------+
| 2016-05-13 |  53.21008358996988 | 53.2          |
| 2016-05-18 |  52.89973045822105 | 52.8          |
| 2016-05-19 | 52.669233854792694 | 52.6          |
| 2016-05-20 |  52.60644257702987 | 52.6          |
+------------+--------------------+---------------+
4 rows in set (0.44 sec)

Conclusion

Document Store can be very beneficial if an application is working with a JSON field and does not know or does not care about its structure. In this post, I’ve used the “save to MySQL and analyze later” approach. We can then add virtual fields and indexes if needed.

May 24, 2016

Looking inside the MySQL 5.7 document store


In this blog, we’ll look at the MySQL 5.7 document store feature, and how it is implemented.

Document Store

MySQL 5.7.12 is a major new release, as it contains quite a number of new features:

  1. Document store and “MongoDB” like NoSQL interface to JSON storage
  2. Protocol X / X Plugin, which can be used for asynchronous queries (I will write about it as well)
  3. New MySQL shell

Peter already wrote the document store overview; in this post, I will look deeper into the document store implementation. In my next post, I will demonstrate how to use document store for Internet of Things (IoT) and event logging.

Older MySQL 5.7 versions already have a JSON data type and the ability to create virtual columns that can be indexed. The new document store feature is based on the JSON datatype.

So what is the document store anyway? It is an add-on to a normal MySQL table with a JSON field. Let’s take a deep dive into it and see how it works.

First of all: one can interface with the document store’s collections using the X Plugin (default port: 33060). To do that:

  1. Enable X Plugin and install MySQL shell.
  2. Login to a shell:
    mysqlsh --uri root@localhost
  3. Run commands (JavaScript mode, can be switched to SQL or Python):
    mysqlsh --uri root@localhost
    Creating an X Session to root@localhost:33060
    Enter password:
    No default schema selected.
    Welcome to MySQL Shell 1.0.3 Development Preview
    Copyright (c) 2016, Oracle and/or its affiliates. All rights reserved.
    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.
    Type 'help', 'h' or '?' for help.
    Currently in JavaScript mode. Use sql to switch to SQL mode and execute queries.
    mysql-js> db = session.getSchema('world_x')
    <Schema:world_x>
    mysql-js> db.getCollections()
    {
        "CountryInfo": <Collection:CountryInfo>
    }

Now, how is the document store’s collection different from a normal table? To find out, I’ve connected to a normal MySQL shell:

mysql world_x
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor.  Commands end with ; or g.
Your MySQL connection id is 2396
Server version: 5.7.12 MySQL Community Server (GPL)
Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> show create table CountryInfo
*************************** 1. row ***************************
       Table: CountryInfo
Create Table: CREATE TABLE `CountryInfo` (
  `doc` json DEFAULT NULL,
  `_id` varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$._id'))) STORED NOT NULL,
  PRIMARY KEY (`_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show tables;
+-------------------+
| Tables_in_world_x |
+-------------------+
| City              |
| Country           |
| CountryInfo       |
| CountryLanguage   |
+-------------------+
4 rows in set (0.00 sec)

So the document store is actually an InnoDB table with one JSON field (doc) plus a primary key on _id, which is a generated column.

As we can also see, there are four tables in the world_x database, but db.getCollections() only shows one. So how does MySQL distinguish between a “normal” table and a “document store” table? To find out, we can enable the general query log and see which query is being executed:

$ mysql -e 'set global general_log=1'
$ tail /var/log/general.log
2016-05-17T20:53:12.772114Z  186 Query  SELECT table_name, COUNT(table_name) c FROM information_schema.columns WHERE ((column_name = 'doc' and data_type = 'json') OR (column_name = '_id' and generation_expression = 'json_unquote(json_extract(`doc`,''$._id''))')) AND table_schema = 'world_x' GROUP BY table_name HAVING c = 2
2016-05-17T20:53:12.773834Z  186 Query  SHOW FULL TABLES FROM `world_x`

As you can see, every table that has this specific structure (a doc JSON column and an _id column with that specific generation_expression) is considered to be a JSON store. Now, how does MySQL translate the .find or .add constructs to actual MySQL queries? Let’s run a sample query:

mysql-js> db.getCollection("CountryInfo").find('Name= "United States"').limit(1)
[
    {
        "GNP": 8510700,
        "IndepYear": 1776,
        "Name": "United States",
        "_id": "USA",
        "demographics": {
            "LifeExpectancy": 77.0999984741211,
            "Population": 278357000
        },
        "geography": {
            "Continent": "North America",
            "Region": "North America",
            "SurfaceArea": 9363520
        },
        "government": {
            "GovernmentForm": "Federal Republic",
            "HeadOfState": "George W. Bush",
            "HeadOfState_title": "President"
        }
    }
]
1 document in set (0.02 sec)

and now look at the slow query log again:

2016-05-17T21:02:21.213899Z  186 Query  SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1

We can verify that MySQL translates all document store commands to SQL. That also means that it is 100% transparent to the existing MySQL storage level and will work with other storage engines. Let’s verify that, just for fun:

mysql> alter table CountryInfo engine=MyISAM;
Query OK, 239 rows affected (0.06 sec)
Records: 239  Duplicates: 0  Warnings: 0
mysql-js> db.getCollection("CountryInfo").find('Name= "United States"').limit(1)
[
    {
        "GNP": 8510700,
        "IndepYear": 1776,
        "Name": "United States",
        "_id": "USA",
        "demographics": {
            "LifeExpectancy": 77.0999984741211,
            "Population": 278357000
        },
        "geography": {
            "Continent": "North America",
            "Region": "North America",
            "SurfaceArea": 9363520
        },
        "government": {
            "GovernmentForm": "Federal Republic",
            "HeadOfState": "George W. Bush",
            "HeadOfState_title": "President"
        }
    }
]
1 document in set (0.00 sec)
2016-05-17T21:09:21.074726Z 2399 Query  alter table CountryInfo engine=MyISAM
2016-05-17T21:09:41.037575Z 2399 Quit
2016-05-17T21:09:43.014209Z  186 Query  SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1

Worked fine!

Now, how about the performance? We can simply take the SQL query and run explain:

mysql> explain SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: CountryInfo
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Hmm, it looks like it is not using an index. That’s because there is no index on Name. Can we add one? Sure, we can add a virtual column and then index it:

mysql> alter table CountryInfo add column Name varchar(255)
    -> GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$.Name'))) VIRTUAL;
Query OK, 0 rows affected (0.12 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table CountryInfo add key (Name);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: CountryInfo
   partitions: NULL
         type: ref
possible_keys: name
          key: name
      key_len: 768
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

That is really cool! We have added an index, and now the original query starts using it. Note that we do not have to reference the new field: the MySQL optimizer is smart enough to translate (JSON_EXTRACT(doc,'$.Name') = 'United States') to an index scan on the virtual column.

But please note: JSON attributes are case-sensitive. If you use (doc,'$.name') instead of (doc,'$.Name'), it will not generate an error, but it will simply break the search, and all queries looking for “Name” will return 0 rows.
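A quick way to see the difference is to compare both paths side by side. This is a sketch against the world_x sample data; the wrong-case path simply yields NULL:

SELECT JSON_EXTRACT(doc, '$.Name') AS with_correct_case,
       JSON_EXTRACT(doc, '$.name') AS with_wrong_case
FROM `world_x`.`CountryInfo`
WHERE _id = 'USA';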

Finally, if you looked closely at the output of db.getCollection("CountryInfo").find('Name= "United States"').limit(1), you noticed that the database has outdated info:

"government": {
            "GovernmentForm": "Federal Republic",
            "HeadOfState": "George W. Bush",
            "HeadOfState_title": "President"
        }

Let’s change “George W. Bush” to “Barack Obama” using the .modify clause:

mysql-js> db.CountryInfo.modify("Name = 'United States'").set("government.HeadOfState", "Barack Obama" );
Query OK, 1 item affected (0.02 sec)
mysql-js> db.CountryInfo.find('Name= "United States"')
[
    {
        "GNP": 8510700,
        "IndepYear": 1776,
        "Name": "United States",
        "_id": "USA",
        "demographics": {
            "LifeExpectancy": 77.0999984741211,
            "Population": 278357000
        },
        "geography": {
            "Continent": "North America",
            "Region": "North America",
            "SurfaceArea": 9363520
        },
        "government": {
            "GovernmentForm": "Federal Republic",
            "HeadOfState": "Barack Obama",
            "HeadOfState_title": "President"
        }
    }
]
1 document in set (0.00 sec)

Conclusion

Document store is an interesting concept and a good add-on on top of the existing MySQL JSON feature. Using the new .find/.add/.modify methods instead of the original SQL statements can be convenient in some cases.

Some might ask, “why do you want to use document store and store information in JSON inside the database if it is relational anyway?” Storing data in JSON can be quite useful in some cases, for example:

  • You already have a JSON (i.e., from external feeds) and need to store it anyway. Using the JSON datatype will be more convenient and more efficient.
  • You have a flexible schema, typical for the Internet of Things for example, where some sensors might only send temperature data, some might send temperature/humidity/light (but light information is only recorded during the day), etc. Storing it in the JSON format can be more convenient so that you do not have to declare all possible fields in advance, and do not have to run “alter table” if a new sensor starts sending new types of data.

In the next two blog posts, I will show how to use document store for Internet of Things / event streaming, and how to use X Protocol for asynchronous queries in MySQL.

Mar 10, 2015

Advanced JSON for MySQL

What is JSON

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing documents in a document store where each document can have different attributes/schema. Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to inter-operate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with. Up until recently, JSON was just a TEXT document. I said up until recently, so what has changed? The biggest thing is that there are new JSON UDFs by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDFs are up to version 0.0.4. While these new UDFs are a welcome addition to the MySQL database, they don’t solve the really tough JSON problems we face.

Searching

The JSON UDFs provide a number of functions that make working with JSON easier, including the ability to extract portions of a document, or search a document for a particular key. That being said, you can’t use JSON_EXTRACT() or JSON_SEARCH in the WHERE clause, because it will initiate a dreaded full table scan (what MongoDB would call a full collection scan). This is a big problem, and common wisdom is that JSON can’t be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

Actually, however, I’ve come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is easily indexed. Now, when you think of indexes you think of B-TREE or HASH indexes (or bitmap indexes), but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index where words (tokens) point to documents. While text indexes are great, they aren’t normally usable for JSON. The reason is that MySQL splits words on whitespace and non-alphanumeric characters. A JSON document doesn’t end up being usable when the name of the field (the key) can’t be associated with the value. But what if we transform the JSON? You can “flatten” the JSON down into key/value pairs and use a text index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ RapidJSON library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:

{
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
	"batters":
		{
			"batter":
				[
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
				]
		},
	"topping":
		[
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }
		]
}

Flattened:

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))\G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=donut
name=Cake
ppu=0.55
id=1001
type=Regular
id=1002
type=Chocolate
id=1003
type=Blueberry
id=1004
type=Devil's Food
type=Devil's
type=Food
id=5001
type=None
id=5002
type=Glazed
id=5005
type=Sugar
id=5007
type=Powdered Sugar
type=Powdered
type=Sugar
id=5006
type=Chocolate with Sprinkles
type=Chocolate
type=with
type=Sprinkles
id=5003
type=Chocolate
id=5004
type=Maple
1 row in set (0.00 sec)

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another field (or another table), and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equal sign. We don’t want this as it removes the locality of the keyword and the value. To fix this problem you’ll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added equal (=) to the list of lower case characters as described in the manual. You just have to change two text files, no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.

By the way, I used a UDF, because MySQL FULLTEXT indexes don’t support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:

create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:

alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. Note that you can create a trigger to keep the second column in sync (I leave that largely as an exercise for the reader, but a minimal sketch follows the next snippet), or you can use Flexviews to maintain a copy in a second table automatically.

mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0
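For those who do want the trigger approach, a minimal sketch might look like this. The trigger names are hypothetical, it reuses the RAPID_FLATTEN_JSON UDF described above, and it is meant as a starting point rather than a tested implementation:

DELIMITER //
-- Keep `flat` in sync whenever `doc` is inserted or updated
CREATE TRIGGER json2_flat_bi BEFORE INSERT ON json2
FOR EACH ROW SET NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);
//
CREATE TRIGGER json2_flat_bu BEFORE UPDATE ON json2
FOR EACH ROW SET NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);
//
DELIMITER ;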

Using the index:

mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

The documents I searched for that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to only index a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and only flatten that.

Aggregating

JSON documents may contain sub-documents as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but it is still a text document. There is no function that can extract ALL of a particular key (like invoice_price) and aggregate the results. So, if you have a document called orders which contains a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDF to aggregate a “total sales” figure from all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, 'key'). This UDF will extract all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item. This UDF returns each item separated by a newline. I created a few stored routines called jsum, jmin, jmax, jcount, and javg. They can process the output of RAPID_EXTRACT_ALL and aggregate it. If you want to run RAPID_EXTRACT_ALL on only a portion of a document, extract that portion with the MySQL UDF JSON_EXTRACT first, then process that with RAPID_EXTRACT_ALL.

For example:

mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1\G
*************************** 1. row ***************************
ids: 888
889
2312
5869
8702
jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:

mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
+-----------------------------------------+
| sum( jsum(json_extract_all(doc,'id')) ) |
+-----------------------------------------+
|                         296615411.00000 |
+-----------------------------------------+
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.

Where to get the tools:
You can find the UDF in the swanhart-tools github repo. I think you will find these tools very useful in working with JSON documents in MySQL.

(This post was originally published on my personal blog, swanhart.livejournal.com, but is reposted here for wider distribution.)

