How To Best Use Percona Server Column Compression With Dictionary

Database Compression

column compressionVery often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much data as possible in the buffer pool, and one of them is data compression.

With regular MySQL, to compress InnoDB data you can either use “Barracuda page compression” or “transparent page compression with punch holes”. The use of the ZFS filesystem is another possibility, but it is external to MySQL and doesn’t help with caching. All these solutions are transparent, but often they also have performance and management implications. If you are using Percona Server for MySQL, you have yet another option, “column compression with dictionary“. This feature is certainly not receiving the attention it merits. I think it is really cool—let me show you why.

We all know what compression means, who has not zipped a file before attaching it to an email? Compression removes redundancy from a file. What about the dictionary? A compression dictionary is a way to seed the compressor with expected patterns, in order to improve the compression ratio. Because you can specify a dictionary, the scope of usefulness of column compression with the Percona Server for MySQL feature is greatly increased. In the following sections, we’ll review the impacts of a good dictionary, and devise a way to create a good one without any guessing.

A simple use case

A compression algorithm needs a minimal amount of data in order to achieve a reasonable compression ratio. Typically, if the object is below a few hundred bytes, there is rarely enough data to have repetitive patterns and when the compression header is added, the compressed data can end up larger than the original.

mysql> select length('Hi!'), length(compress('Hi!'));
| length('Hi!') | length(compress('Hi!')) |
|             3 |                      15 |
1 row in set (0.02 sec)

Compressing a string of three bytes results in a binary object of 15 bytes. That’s counter productive.

In order to illustrate the potential of the dictionary, I used this dataset:

It is a set of 100k Jeopardy questions written in JSON. To load the data in MySQL, I created the following table:

mysql> show create table TestColCompression\G
*************************** 1. row ***************************
Table: TestColCompression
Create Table: CREATE TABLE `TestColCompression` (
`question` text NOT NULL,
1 row in set (0.00 sec)

Then, I did some formatting to create insert statements:

zcat JEOPARDY_QUESTIONS1.json.gz | perl -p -e 's/\[\{/\{/g' | perl -p -e 's/\}, \{/\}\n\{/g' | perl -p -e "s/'/''/g" | \
  (while read line; do echo "insert into testColComp (questionJson) values ('$line');"; done )

And I executed the inserts. About 20% of the rows had some formatting issues but nevertheless, I ended up with close to 80k rows:

mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78110
Avg_row_length: 316
Data_length: 24690688
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:16:41
Update_time: 2018-10-26 15:40:34
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
1 row in set (0.00 sec)

The average row length is 316 bytes for a total data size of 23.55MB. The question JSON objects are large enough to matter, but barely large enough for compression. Here are the first five rows:

mysql> select question from TestColCompression limit 5\G
*************************** 1. row ***************************
question: {"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
*************************** 2. row ***************************
question: {"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
*************************** 3. row ***************************
question: {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
*************************** 4. row ***************************
question: {"category": "OLD FOLKS IN THEIR 30s", "air_date": "2009-05-08", "question": "'The district of conservative rep. Patrick McHenry in this state includes Mooresville, a home of NASCAR'", "value": "$800", "answer": "North Carolina", "round": "Jeopardy!", "show_number": "5690"}
*************************** 5. row ***************************
question: {"category": "MOVIES & TV", "air_date": "2009-05-08", "question": "'Tim Robbins played a public TV newsman in "Anchorman: The Legend of" him'", "value": "$800", "answer": "Ron Burgundy", "round": "Jeopardy!", "show_number": "5690"}

Let’s begin by a straight column compression without specifying a dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED;
Query OK, 79976 rows affected (4.25 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
| Table | Op | Msg_type | Msg_text |
| colcomp.TestColCompression | analyze | status | OK |
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78995
Avg_row_length: 259
Data_length: 20496384
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:47:56
Update_time: 2018-10-26 15:47:56
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
1 row in set (0.00 sec)

As expected the data didn’t compress much. The compression ration is 0.82 or if expressed as a percentage, 18%. Since the JSON headers are always the same, and are present in all questions, we should minimally use them for the dictionary. Trying a minimal dictionary made of the headers gives:

mysql> SET @dictionary_data = 'category' 'air_date' 'question' 'value' 'answer' 'round' 'show_number' ;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE COMPRESSION_DICTIONARY simple_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY simple_dictionary;
Query OK, 79976 rows affected (4.72 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
| Table | Op | Msg_type | Msg_text |
| colcomp.TestColCompression | analyze | status | OK |
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78786
Avg_row_length: 246
Data_length: 19447808
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:58:17
Update_time: 2018-10-26 17:58:17
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
1 row in set (0.00 sec)

There is a little progress, we now have a compression ratio of 0.79. Obviously, we could do more but without a tool, we’ll have to guess. A compressor like zlib builds a dictionary as part of its compression effort, could we use that? Yes, but only if we can generate it correctly and access the result. That’s not readily available with the common compressors I know. Fortunately, someone else had the same issue and wrote a compressor able to save its dictionary. Please let me introduce femtozip.

Femtozip to the rescue

The tool, by itself, has no magic algorithm. It is based on zlib, from what I can understand from the code. Anyway, we won’t compress anything with it, we’ll use it to generate a good dictionary. In order to create a dictionary, the tool looks at a set of files and try to see patterns between them. The use of a single big file defeat the purpose. So, I generated one file per question with:

mkdir questions
cd questions
l=1; mysql -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo $line > ${l}; let l=l+1; done)

Then, I used the following command to generate a 1024 bytes dictionary using all the files starting by “1”:

../femtozip/cpp/fzip/src/fzip --model ../questions_1s.mod --build --dictonly --maxdict 1024 1*
Building dictionary...

In about 10s the job was done. I tried with all the 80k files and… I had to kill the process after thirty minutes. Anyway, there are 11111 files starting with “1”, a very decent sample. Our generated dictionary looks like:

cat ../questions_1s.mod
", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "'e", "round": "Jeopardy!", "show_number": "r", "round": "{"cate gory": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "'Jeopardy!", "show_number": "2'", "value": "$1000", "answer": "7", "question": "'The ", "question": "'A'", "value": "$600", "answer": "9", "questi on": "'In ", "question": "'This 3", "question": "'2", "question": "'e'", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"'", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s'", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy !", "show_number": "3", "round": "Jeopardy!", "show_number": "5'", "value": "$200", "answer": "'", "value": "$800", "answer": "'", "value": "$400", "answer": "

With some formatting, I was able to create a dictionary with the above data:

mysql> SET @dictionary_data = '", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "''e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "''Jeopardy!", "show_number": "2''", "value": "$1000", "answer": "7", "question": "''The ", "question": "''A''", "value": "$600", "answer": "9", "question": "''In ", "question": "''This 3", "question": "''2", "question": "''e''", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"''", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s''", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5''", "value": "$200", "answer": "''", "value": "$800", "answer": "''", "value": "$400", "answer": "' ;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
And then, I altered the table to use the new dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY femtozip_dictionary;
Query OK, 79976 rows affected (4.05 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
| Table | Op | Msg_type | Msg_text |
| colcomp.TestColCompression | analyze | status | OK |
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 79861
Avg_row_length: 190
Data_length: 15220736
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:56:09
Update_time: 2018-10-26 17:56:09
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
1 row in set (0.00 sec)

That’s interesting, we are now achieving a ratio of 0.61, a significant improvement. I pushed my luck and tried with a 2048 bytes dictionary. It further reduced the ratio to 0.57 but that was about the best I got. Larger dictionaries didn’t lower the ratio below 0.57. Zlib supports up to 32KB for the dictionary.

So, to recap:

  • column compression without dictionary, ratio of 0.82
  • column compression with simple dictionary, ratio of 0.79
  • column compression with a 1k dictionary from femtozip, ratio of 0.61
  • column compression with a 2k dictionary from femtozip, ratio of 0.57

The above example stores a JSON document in a text column. MySQL 5.7 includes a JSON datatype which behaves a bit differently regarding the dictionary. Delimiting characters like ‘{}’ are removed in the on disk representation of a JSON column. If you have TBs of data in similar tables, you should really consider column compression and a systematic way of determining the dictionary with femtozip. In addition to improve the compression, it is likely to be the less performance impacting solution. Would it be interesting to generate a dictionary from existing data with a command like this one?

CREATE COMPRESSION_DICTIONARY_FROM_DATA A_good_dictionary (2048, select questions from TestColCompression limit 10000);

where the dictionary creation process would implicitly includes steps similar to the ones I did with femtozip.


InnoDB Cluster in a Nutshell Part 3: MySQL Shell

MySQL InnoDB Cluster MySQL Shell

MySQL InnoDB Cluster MySQL ShellWelcome to the third part of this series. I’m glad you’re still reading, as hopefully this means you find this subject interesting at least. Previously we presented the first two components of MySQL InnoDB Cluster: Group Replication and MySQL Router and now we will discuss the last component, MySQL Shell.

MySQL Shell

This is the last component in the cluster and I love it. Oracle have created this tool to centralize cluster management, providing a friendly, command-line based user interface.

The tool can be defined as an advanced MySQL shell, which is much more powerful than the well known MySQL client. With the capacity to work with both relational and document (JSON) data, the tool provides an extended capability to interact with the database from a single place.

MySQL Shell is also able to understand different languages:

  • JavaScript (default) which includes several built-in functions to administer the cluster—create, destroy, restart, etc.—in a very easy way.
  • Python it provides an easy way to write Python code to interact with the database. This is particularly useful for developers who don’t need to have SQL skills or run applications to test code.
  • SQL to work in classic mode to query database as we used to do with the old MySQL client.

A very interesting feature provided with MySQL Shell is the ability to establish different connections to different servers/clusters from within the same shell. There is no need to exit to connect to a different server, just issuing the command \connect will make this happen. As DBA, I find this pretty useful when handling multiple clusters/servers.

Some of the features present in this tool:

  • Capacity to use both Classic and X protocols.
  • Online switch mode to change languages (JavaScript, Python and SQL)
  • Auto-completion of commands using tab, a super expected feature in MySQL client.
  • Colored formatting output that also supports different formats like Table, Tab-separated and Json formats.
  • Batch mode that processes batches of commands allowing also an interactive mode to print output according each line is processed.

Some sample commands

Samples of new tool and execution modes:

#switch modes
#connect to instance
\connect user@host:[port]
#create a cluster (better to handle through variables)
var cluster=dba.createCluster('percona')
#add instances to cluster
#check cluster status
#using another variable
var cluster2=dba.getCluster(‘percona’)
#get cluster structure
#rejoin instance to cluster - needs to be executed locally to the instance
#rejoin instance to cluster - needs to be executed locally to the instance
#recover from lost quorum
#recover from lost quorum
#destroy cluster

Personally, I think this tool is a very good replacement for the classic MySQL client. Sadly, mysql-server installations do not include MySQL shell by default, but it is worth getting used to. I recommend you try it.


We finally reached the end of this series. I hope you have enjoyed this short introduction to what seems to be Oracle’s bid to have a built-in High Availability solution based on InnoDB. It may become a good competitor to Galera-based solutions. Still, there is a long way to go, as the tool was only just released as GA (April 2018). There are a bunch of things that need to be addressed before it becomes consistent enough to be production-ready. In my personal opinion, it is not—yet. Nevertheless, I think it is a great tool that will eventually be a serious player in the HA field as it’s an excellent, flexible and easy to deploy solution.

The post InnoDB Cluster in a Nutshell Part 3: MySQL Shell appeared first on Percona Database Performance Blog.


This Week in Data with Colin Charles 33: Reporting from FOSSASIA 2018 and Azure Announces Database Services for MySQL and PostgreSQL

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Writing to you on the ground from FOSSASIA 2018, where I gave a track introduction yesterday since we have a pretty awesome database track most Saturday, and generally, all MySQL focused on Sunday. There’s even a list of talks by Oracle MySQL’ers (yes, there’s more than just Oracle folk, but for that, you got to get the schedule).

The Percona Live Community Dinner happens again this year during Percona Live 2018, at Pedro’s on 24 April 2018. It starts at 7 pm, and I highly recommend you purchase the $30 ticket. It usually sells out, so don’t wait till it’s too late.

Some big news from a MySQL in the cloud perspective: Announcing general availability of Azure database services for MySQL and PostgreSQL. There’s also a quick guide: Create an Azure Database for MySQL server by using the Azure portal. Note that next comes MariaDB Server too.


Link List

Upcoming appearances

The post This Week in Data with Colin Charles 33: Reporting from FOSSASIA 2018 and Azure Announces Database Services for MySQL and PostgreSQL appeared first on Percona Database Performance Blog.


InnoDB Page Compression: the Good, the Bad and the Ugly

InnoDB Page CompressionIn this blog post, we’ll look at some of the facets of InnoDB page compression.

Somebody recently asked me about the best way to handle JSON data compression in MySQL. I took a quick look at InnoDB page compression and wanted to share my findings.

There is also some great material on this topic that was prepared and presented by Yura Sorokin at Percona Live Europe 2017: Yura also implemented Compressed Columns in Percona Server.

First, the good part.

InnoDB page compression is actually really easy to use and provides a decent compression ratio. To use it, I just ran

CREATE TABLE commententry (...) COMPRESSION="zlib";

 – and that’s all. By the way, for my experiment I used the subset of Reddit comments stored in JSON (described here: Big Dataset: All Reddit Comments – Analyzing with ClickHouse).

This method got me a compressed table of 3.9GB. Compare this to 8.4GB for an uncompressed table and it’s about a 2.15x compression ratio.

Now, the bad part.

As InnoDB page compression uses “hole punching,” the standard Linux utils do not always properly support files created this way. In fact, to see the size “3.9GB” I had to use

du --block-size=1 tablespace_name.ibd

 , as the standard

ls -l tablespace_name.ibd

 shows the wrong size (8.4GB). There is a similar limitation on copying files. The standard way

cp old_file new_file

 may not always work, and to be sure I had to use

cp --sparse=always old_file new_file


Speaking about copying, here’s the ugly part.

The actual time to copy the sparse file was really bad.

On a fairly fast device (a Samsung SM863), copying the sparse file mentioned above in its compressed size of 3.9GB took 52 minutes! That’s shocking, so let me repeat it again: 52 minutes to copy a 3.9GB file on an enterprise SATA SSD.

By comparison, copying regular 8.4GB file takes 9 seconds! Compare 9 sec and 52 mins.

To be fair, the NMVe device (Intel® SSD DC D3600) handles sparse files much better. It took only 12 seconds to copy the same sparse file on this device.

Having considered all this, it is hard to recommend that you use InnoDB page compression for serious production. Well, unless you power your database servers with NVMe storage.

For JSON data, the Compressed Columns in Percona Server for MySQL should work quite well using Dictionary to store JSON keys – give it a try!


Percona Live Europe Featured Talks: NoSQL Best Practices for PostgreSQL with Dmitry Dolgov

Colin Charles

Percona Live EuropeWelcome to another post our series of interview blogs for the upcoming Percona Live Europe 2017 in Dublin. This series highlights a number of talks that will be at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post is with Dmitry Dolgov, a software engineer at Zalando SE. His talk is titled NoSQL Best Practices for PostgreSQL. One of PostgreSQL’s most attractive features is the Jsonb data type. It allows efficient work with semi-structured data without sacrificing strong consistency and ability to use all the power of proven relational technology.  In our conversation, we discussed how to use this NoSQL feature in PostgreSQL:

Percona: How did you get into databases? What do you love about it?  

Dmitry: I grew extremely interested in databases not so long ago, mostly due to the influence of Oleg Bartunov, who is a longtime contributor to PostgreSQL. Initially, I just implemented one patch for the Jsonb data type that was eventually included in the core. After that I couldn’t stop. So I still try to help the PostgreSQL community as much as I can.

What I love is just that: PostgreSQL has an awesome community. And I mean it, there are a lot of people that are excited about databases and possess valuable expertise in this area. My most vivid memory so far about the community was someone asking a question in the hackers mailing list that got answered within minutes – even before I started to type my own reply.

Percona: How can NoSQL Jsonb data type get used effectively with PostgreSQL?

Dmitry: The trick is that you don’t need to do anything supernatural for that. Jsonb is already effective enough right out of the box. But as always there are some limitations, implementation details and tricks (which I’ll show in my talk).

Percona: What do you want attendees to take away from your session? Why should they attend?

Dmitry: The biggest idea behind this talk is to show that we live in interesting times. It’s not that easy to stick with only one data model/data storage. And to mitigate this issue, most modern databases are trying to provide more that one approach. We have to evaluate them each carefully.

Or you can attend if you expect a holy war of PostgreSQL vs. MongoDB vs. MySQL vs. whatever else. But you won’t see anything like that, because we’re all grown up people. ?

Percona: What are you most looking forward to at Percona Live Europe 2017?

Dmitry: I look forward to meeting a lot of interesting people to collaborate with, and to share my own experiences.

Want to find out more about Dmitry and PostgreSQL and the Jsonb data type? Register for Percona Live Europe 2017, and see his talk NoSQL Best Practices for PostgreSQL. Register now to get the best price! Use discount code SeeMeSpeakPLE17 to get 10% off your registration.

Percona Live Open Source Database Conference Europe 2017 in Dublin is the premier European open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, MariaDB, MongoDB, time series database, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference Europe will be September 25-27, 2017 at the Radisson Blu Royal Hotel, Dublin.


IMDb Data in a Graph Database

Graph Database 1

Graph Database 1In this first of its kind, Percona welcomes Dehowe Feng, Software Developer from Bitnine as a guest blogger. In his blog post, Dehowe discusses how viewing imported data from IMDb into a graph database (AgensGraph) lets you quickly see how data nodes relate to each other. This blog echoes a talk given by Bitnine at the Percona Live Open Source Database Conference 2017.

Graphs help illustrate the relationships between entities through nodes, drawing connections between people and objects. Relationships in IMDb are inherently visual. Seeing how things are connected grants us a better understanding of the context underneath. By importing IMDb data as graph data, you simplify the schema can obtain key insights.

In this post, we will examine how importing IMDb into a graph database (in this case, AgensGraph) allows us to look at data relationships in a much more visual way, providing more intuitive insights into the nature of related data.

For install instructions to the importing scripts, go here.

Internet Movie Database (IMDb) owned by is one of the largest movie databases. It contains 4.1 million titles and 7.7 million personalities (

Relational Schema for IMDb

Graph Database 2

Relational Schema of IMDb Info

Picture courtesy of user ofthelit on StackOverflow,

Because IMDb’s file format is not easy to read and parse, rather than implementing the file directly we use an additional step to load it into relational tables. For this project, we used IMDbpy to load relational data into AgensGraph in relational form. The above figure is the relational schema which IMDbpy created. This schema is somewhat complicated, but essentially there are four basic entries: Production, Person, Company and Keyword. Because there are many N-to-N relationships between these entities, the relational schema has more tables than the number of entities. This makes the schema harder to understand. For example, a person can be related to many movies and a movie can have many characters.

Concise Graph Modeling

From there, we developed our own graph schema using Production, Person, Company and Keyword as our nodes (or end data points).

Productions lie at the “center” of the graph, with everything leading to them. Keywords describing Productions, Persons and Companies are credited for their contributions to Productions. Productions are linked to other productions as well.

Graph Database 3

Simplified Graph Database Schema

With the data in graph form, one can easily see the connections between all the nodes. The data can be visualized as a network and querying the data with Cypher allows users to explore the connections between entities.

Compared to the relational schema of IMDb, the graph schema is much simpler to understand. By merging related information for the main entities into nodes, we can access all relevant information to that node through that node, rather than having to match IDs across tables to get the information that we need. If we want to examine how a node relates to another node, we can query its edges to see the connections it forms. Being able to visually “draw a connection” from one node to another helps to illustrate how they are connected.

Furthermore, the labels of the edges describe how the nodes are connected. Edge labels in the IMDb Graph describe what kind of connection is formed, and pertinent information may be stored in attributes in the edges. For example, for the connections ACTOR_IN and ACTRESS_IN, we store role data, such as character name and character id.

Data Migration

To make vertexes’ and edges’ properties we use “views”, which join related tables. The data is migrated into a graph format by querying the relational data using selects and joins into a single table with the necessary information for creating each node.

For example, here is the SQL query used to create the jsonb_keyword view:

CREATE VIEW jsonb_keyword AS
SELECT row_to_json(row(keyword)) AS data
FROM keyword;

We use a view to make importing queries simpler. Once this view is created, its content can be migrated into the graph. After the graph is created, the graph_path is set, and the VLABEL is created, we can use the convenient LOAD keyword to load the JSON values from the relational table into the graph:

LOAD FROM jsonb_keyword AS keywords
CREATE (a:Keyword = data(keywords) );

Note that here LOAD is used to load data in from a relational table, but LOAD can also be used to load data from external sources as well.

Creating edges is a similar process. We load edges from the tables that store id tuples of the between the entities after creating their ELABELs:

LOAD FROM movie_keyword AS rel_key_movie
MATCH (a:Keyword), (b:Production)
WHERE = (rel_key_movie).keyword_id AND = (rel_key_movie).movie_id
CREATE (a)-[:KEYWORD_OF]->(b);

As you can see, AgensGraph is not restricted to the CSV format when importing data. We can import relational data into its graph portion using the LOAD feature and SQL statements to refine our data sets.

How is information stored?

Most of the pertinent information is held in the nodes (vertexes). Nodes are labeled either as Productions, Persons, Companies or Keywords, and their relative information is stored as JSONs. Since IMDB information is constantly updated, many fields for certain entities are left incomplete. Since JSON is semi-structured, if an entity does not have a certain piece of information the field will not exist at all – rather than having a field and marking it as NULL.

We also use nested JSON arrays to store data that may have multiple fields, such as quotes that persons might have said or alternate titles to productions. This makes it possible to store “duplicate” fields in each node.

How can this information be used?

In the graph IMDb database, querying between entities is very easy to learn. Using the Cypher Query Language, a user can find things such as all actors that acted in a certain production, all productions that a person has worked on or all other companies that have worked with a certain company on any production. Graph database strength is the simplicity of visualizing the data. There are many ways you can query a graph database to find what you need!

Find the name of all actors that acted in Night at the Museum:

MATCH (a:Person)-[:ACTOR_IN]->(b:Production)
WHERE title = 'Night at the Museum'


name | title
Asprinio, Stephen | Night at the Museum
Blais, Richard | Night at the Museum
Bougere, Teagle F. | Night at the Museum
Bourdain, Anthony | Night at the Museum
Cherry, Jake | Night at the Museum
Cheng, Paul Chih-Ping | Night at the Museum
(56 rows)

Find all productions that Ben Stiller worked on:

MATCH (a:Person)-[b]->(c:Production)
WHERE = 'Stiller, Ben'


name | label | title
Stiller, Ben | actor_in | The Heartbreak Kid: The Egg Toss
Stiller, Ben | producer_of | The Hardy Men
Stiller, Ben | actor_in | The Heartbreak Kid: Ben & Jerry
Stiller, Ben | producer_of | The Polka King
Stiller, Ben | actor_in | The Heartbreak Kid
Stiller, Ben | actor_in | The Watch
Stiller, Ben | actor_in | The History of 'Walter Mitty'
Stiller, Ben | producer_of | The Making of 'The Pick of Destiny'
Stiller, Ben | actor_in | The Making of 'The Pick of Destiny'
(901 rows)

Find all actresses that worked with Sarah Jessica Parker:

MATCH (a:Person)-[b:ACTRESS_IN]->(c:Production)<-[d:ACTRESS_IN]-(e:Person)
WHERE = 'Parker, Sarah Jessica'


Aaron, Caroline
Aaron, Kelly
Abascal, Nati
Abbott, Diane
Abdul, Paula
(3524 rows)


The most powerful aspects of a graph database are flexibility and visualization capabilities.

In the future, we plan to implement a one-step importing script. Currently, the importing script is two-phased: the first step is to load into relational tables and the second step is to load into the graph. Additionally, AgensGraph has worked with Gephi to release a data import plugin. The Gephi Connector allows for graph visualization and analysis. For more information, please visit and


Services Monitoring with Probabilistic Fault Detection

Services Monitoring

In this blog post, we’ll discuss services monitoring using probabilistic fault detection.

Let’s admit it, the task of monitoring services is one of the most difficult. It is time-consuming, error-prone and difficult to automate. The usual monitoring approach has been pretty straightforward in the last few years: setup a service like Nagios, or pay money to get a cloud-based monitoring tool. Then choose the metrics you are interested in and set the thresholds. This is a manual process that works when you have a small number of services and servers, and you know exactly how they behave and what you should monitor. These days, we have hundred of servers with thousands of services sending us millions of metrics. That is the first problem: the manual approach to configuration doesn’t work.

That is not the only problem. We know that no two servers perform the same because no two servers have exactly the same workload. The thresholds that you setup for one server might not be the correct one for all of the other thousand. There are some approaches to the problem that will make it even worse (like taking averages and setting the thresholds based on those, for example, hoping it will work). Let me tell you a secret: it won’t work. Here we have a second problem: instances of the same type can demonstrate very different behaviors.

The last problem is that new shiny services you company may want to use are announced every week. It is impossible, because of time constraints, to know all of those services well enough to create a perfect monitoring template. In other words: sometimes we are asked to monitor software we don’t completely understand.

In summary, you have thousands of services, some of them you don’t even know how they work, that are sending you million of metrics that mean nothing to you. Now, set the thresholds and enable the pager alert. The nightmare has started. Is there a different approach?

Machine Learning

We have to stop thinking that monitoring is a bunch of config files with thresholds that we copy from one server to another. There are no magic templates that will work. We need to use a different technique that removes us from the process. That template is “machine learning.” As stated in Wikipedia, it is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. In it’s most basic form, it can be used to solve classification problems. For example, open pet photos and identify if it is a cat or a dog. This is a classification problem that both humans and computers can solve, but we are much much slower. The computer has to take the time to learn the patterns, but at some point it will do the classification in no time.

I hope you are starting to see a pattern here. Why do we need to care about monitoring and its configuration if we have computers that can learn patterns and classify things for us?

There are two main ways of doing probabilistic fault detection: Novelty Detection and Outlier Detection.

Novelty Detection

Novelty Detection is easy to visualize and understand. It takes a series of inputs and tries to find anomalies, something that hasn’t been seen before. For example, our credit card company has a function that takes “category, expense, date, hour, country” as arguments and returns an integer so that they can classify and identify all the purchases. Your monthly use of the credit card looks like this:


That is the normal model that defines your use of the credit card. Now, it can be used to detect anomalies.

  • [0] – OK
  • [4] – OK
  • [4] – OK
  • [1] – Anomaly! Operation canceled.

Easy and straightforward. It is simple and very useful in a lot of areas to generate alerts when something anomalous happens. One of the machine learning models that can do this is One-Class Support Vector Machines, but since this is not the kind of fault detection we are looking for I won’t go into details. If you are interested, follow these links:

Outlier Detection

Let’s say we have this data:

[0, 3, 5, 1, -2, 19, 2, 10, -9, 2, 1, 8, 3, 21, -1, 3]

Now we know how to find anomalies, but how do we find outliers? Looking at the numbers above, it seems 21, 19 and -9 could be outliers. But a more exact definition is needed (not just intuition). The most simple and usual way of doing it is the following:

We divide our data into three pieces. One cut will be done at 25%, the second cut at 75%. The number that it is at 25% is called the First Quartile, and the value of the second cut is called the Third Quartile. The IQR or Interquartile Range is the subtraction of the Third Quartile with the First Quartile. Now, an outlier is any number that falls in one of these two categories:

  • If the value is below: (First Quartile) – (1.5 × IQR)
  • If the value is above: (Third Quartile + (1.5 × IQR)

Using Python:

inputs = [0, 3, 5, 1, -2, 19, 2, 10, -9, 2, 1, 8, 3, 21, -1, 3]
Q1 = np.percentile(inputs,25)
Q3 = np.percentile(inputs,75)
step = (Q3-Q1)*1.5
outliers = [x for x in inputs if x < Q1-step or x > Q3+step]
[19, -9, 21]

This looks more like what we need. If we are monitoring a metric, and outliers are detected, then something is happening there that requires investigation. Some of the most used outlier detection models in scikit-learn are:

  • Elliptic Envelope: a robust co-variance estimate that assumes that our data is Gaussian distributed. It will define the shape of the data we have, creating a frontier that delimits the contour. As you probably guessed, it will be elliptical in shape. Don’t worry about the assumption of Gaussian distribution, data can be standardized. More about this later on.


  • Isolation Forest: this is the well-known “forest of random trees,” but applied to outlier detection. This is more suitable when we have many different input metrics. In the example I use below, I just use a single metric, so this model would not work that well.

Therefore, Elliptic Envelope looks like the best option for our proof-of-concept.

For visual reference, this is how the three models look like when they try to shape two data inputs:

Services Monitoring



I haven’t explained the model in detail, but a high level explanation should be enough to understand the problem and the possible solution. Let’s start building a proof-of-concept.

For this test, I got data from our Prometheus setup, where all the time-series monitoring data from our customers is stored. In this particular example, I got numbers from the “Threads Running” metric. Those will be used to train our Elliptical Envelope. It is important to take the following into account:

    • We need to collect enough data so that it captures the correct shape of our baseline performance. For example, usually nighttime hours have less of a workload than during the day (same with weekend days, in some cases).
    • As explained before, it assumes a Gaussian distribution, which means that the data needs to be scaled. I am going to standardize the data so that it has 0 mean and 1 variance. The same standardization needs to be applied to the data we test after the training process, when the monitoring is already in place. That standardization also needs to be applied to each metric individually. This is the formula:
Services Monitoring

With ? as the mean and ? as the standard deviation.

This is the summary of what our proof-of-concept will do:

  • Read Prometheus JSON dump.
  • Separate some data for training, standardizing it first.
  • Separate some data for testing, standardizing it first as well.
  • Make predictions on test data.
  • For those rows identified as outliers, get the original non-standardize data to see the number of threads running.

So, let’s start:

Import the Libraries

import pandas as pd
import numpy as np
import json
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope

Load the Data

All the information is in a JSON output from Prometheus that has the “threads_running” of a particular server. It has one second granularity for the last four weeks. I also converted “timestamps” to a normal “datetime” object so that it is easier to read:

with open('query.json') as data_file:
    data = json.load(data_file)
data = pd.DataFrame(data["data"]["result"][0]["values"])
data[0] = data[0].astype(int)
data[0] = data[0].apply(lambda x: datetime.fromtimestamp(x))

The data looks like this:

DateTime Threads Running
2017-01-19 20:32:44 1
2017-01-19 20:32:45 1
2017-01-19 20:32:46 2


Create the Training and Testing Dataset

First, separate some of the data for use as training:

train_data = data[(data[0] >= "2017-01-22") & (data[0] <= "2017-01-28" )]

Ignore the date column, and just store the metrics:

train_data = train_data.iloc[:,[1]]

Standardize it:

escaler = StandardScaler()
train_data = escaler.fit_transform(train_data)

Now the data looks like this:

Standardized Threads Running

To create the test dataset we need to follow the exact same procedure, only select a different timeframe:

test_original_data = data[(data[0] >= "2017-02-2") & (data[0] <= "2017-02-17" )]
test_data = test_original_data.iloc[:,[1]]
test_data = escaler.transform(test_data)

Train the Model

Let’s create our model with the training data! I am using two parameters here:

  • assume_centered: to specify that our data is already Gaussian distributed.
  • contamination: to specify the ratio of outliers our training data has.
clf = EllipticEnvelope(assume_centered=True,contamination=0)

Search for Outliers

Now that we’ve trained the model and we have our test data, we can ask the model if it finds any outliers. It will return




 for each row. “1” means that the value of threads running is normal and within the boundaries, while “-1” means that the value is an outlier:

predictions = clf.predict(test_data)
outliers = np.where(predictions==-1)

The array “outliers” stores the row numbers where -1 was predicted.

At this point we have three important variables:

  • test_data: standardized testing data.
  • test_original_data: the original test data without modification.
  • outliers: the row numbers where an outlier was detected (-1).

Investigate the Outliers

Since we have the row number where an outlier was detected, now we can just query


 and search for those rows. In this example, I show some random ones:

for indice in outliers[0]:
    if np.random.randn() > 2.5:
        print("{} - {} threads running".format(test_original_data.iloc[indice][0], test_original_data.iloc[indice][1]))
2017-02-03 11:26:03 - 41 threads running
2017-02-03 11:26:40 - 43 threads running
2017-02-03 11:27:50 - 48 threads running
2017-02-03 11:32:07 - 78 threads running
2017-02-03 11:33:25 - 90 threads running
2017-02-12 10:06:58 - 36 threads running
2017-02-12 10:12:11 - 60 threads running
2017-02-12 10:12:30 - 64 threads running

And there we have it! Dates and hours when something really out of the ordinary happened. No need to create a config file for each service, guess thresholds, adjust them … nothing. Just let the model learn, and you get alerts when something unexpected happens. Push all the metrics from your services to these models, and let them do the hard work.


Most companies have similar situations. Companies add new services on hundred of servers, and monitoring is an essential part of the infrastructure. The old method of monolithic config files with some thresholds doesn’t scale, because it needs a lot of manual work with a trial/error approach. The types of techniques explained in this blog post can help us deploy monitoring on hundred of servers, not really caring about the different nuances of each service or workload. It is even possible to start monitoring a service without even knowing anything about it — just let the probabilistic model take care of it.

It is important to clarify that, in my opinion, these fault detection models are not going to be a substitute for software like Nagios. In those areas where a binary test is needed (service is up/down for example), Nagios and other similar services do a good job. Actually, a Nagios check can use the procedure explained here. When there are many metrics to analyze, probabilistic methods can save us from a nightmare.


Elasticsearch Ransomware: Open Source Database Security Part 2

Elasticsearch Ransomware

Elasticsearch RansomwareIn this blog post, we’ll look at a new Elasticsearch ransomware outbreak and what you can do to prevent it happening to you.

Mere weeks after reports of MongoDB servers getting hacked and infected with ransomware, Elasticsearch clusters are experiencing the same difficulties. David Murphy’s blog discussed the situation and the solution for MongoDB servers. In this blog post, we look at how you can prevent ransomware attacks on your Elasticsearch clusters.

First off, what is Elasticsearch? Elasticsearch is an open source distributed index based on Apache Lucene. It provides a full-text search with an HTTP API, using schemaless JSON documents. By its nature, it is also distributed and redundant. Companies use Elasticsearch with logging via the ELK stack and data-gathering software, to assist with data analytics and visualizations. It is also used to back search functionality in a number of popular apps and web services.

In this new scenario, the ransomware completed wiped away the cluster data, and replaced it with the following warning index:


As with the MongoDB situation, this isn’t a flaw in the Elasticsearch software. This vulnerability stems from not correctly using the security settings provided by Elasticsearch. As the PCWorld article sums up:

According to experts, there is no reason to expose Elasticsearch clusters to the internet. In response to these recent attacks, search technologies and distributed systems architect Itamar Syn-Hershko has published a blog post with recommendations for securing Elasticsearch deployments.

The blog post they reference has excellent advice and examples of how to protect your Elasticsearch installations from exploitation. To summarize its advice (from the post itself):

Whatever you do, never expose your cluster nodes to the web directly.

So how do you prevent hackers from getting into your Elasticsearch cluster? Using the advice from Syn-Hershko’s blog, here are some bullet points for shoring up your Elasticsearch security:

  • HTTP-enabled nodes should only listen to private IPs. You can configure what IPs Elasticsearch listens to: localhost, private IPs, public IPs or several of these options.


     control the IP types (manual). Never set Elasticsearch to listen to a public IP or a publicly accessible DNS name.

  • Use proxies to communicate with clients. You should pass any application queries to Elasticsearch through some sort of software that can filter requests, perform audit-logging and password-protect the data. Your client-side javascript shouldn’t talk to Elastic directly, and should only communicate with your server-side software. That software can translate all client-side requests to Elasticsearch DSL, execute the query, and then send the response in a format the clients expect.
  • Don’t use default ports. Once again for clarity: DON’T USE DEFAULT PORTS. You can easily change Elasticsearch’s default ports by modifying the .YML file. The relevant parameters are




  • Disable HTTP if you don’t need it. Only Elasticsearch client nodes should enable HTTP, and your private network applications should be the only ones with access to them. You can completely disable the HTTP module by setting




  • Secure publicly available client nodes. You should protect your Elasticsearch client and any UI it communicates with (such as Kibana and Kopf) behind a VPN. If you choose to allow some nodes access to the public network, use HTTPS and don’t transmit data and credentials as plain-text. You can use plugins like Elastic’s Shield or SearchGuard to secure your cluster.
  • Disable scripting (pre-5.x). Malicious scripts can hack clusters via the Search API. Earlier versions of Elasticscript allowed unsecured scripts to access the software. If you are using an older version (pre-5.x), upgrade to a newer version or disable dynamic scripting completely.

Go to Syn-Hershko’s blog for more details.

This should get you started on correctly protecting yourself against Elasticsearch ransomware (and other security threats). If you want to have someone review your security, please contact us.


Webinar Thursday December 29: JSON in MySQL 5.7


JSON in MySQL 5.7Please join Percona’s Consultant David Ducos on Thursday, December 29, 2016 at 10 am PST/ 1:00 pm EST (UTC-8) as he presents JSON in MySQL 5.7.

Since it was implemented in MySQL 5.7, we can use JSON as a data type. In this webinar, we will review some of the useful functions that have been added to work with JSON.

We will examine and analyze how JSON works internally, and take into account some of the costs related to employing this new technology. 

At the end of the webinar, you will know the answers to the following questions: 

  • What is JSON?
  • Why don’t we keep using VARCHAR?
  • How does it work? 
  • What are the costs?
  • What limitations should we take into account?
  • What are the benefits of using MySQL JSON support?

Register for the webinar here.

JSON in MySQL 5.7David Ducos, Percona Consultant

David studied Computer Science at the National University of La Plata, and has worked as a Database Consultant since 2008. He worked for three years in a worldwide platform of free classifieds, until starting work for Percona in November 2014 as part of the Consulting team.


Using MySQL 5.7 Document Store with Internet of Things (IoT)

MySQL 5.7 Document Store

MySQL 5.7 Document StoreIn this blog post, I’ll discuss how to use MySQL 5.7 Document Store to track data from Internet of Things (IoT) devices.

Using JSON in MySQL 5.7

In my previous blog post, I’ve looked into MySQL 5.7.12 Document Store. This is a brand new feature in MySQL 5.7, and many people are asking when do I need or want to use the JSON or Document Store interface?

Storing data in JSON may be quite useful in some cases, for example:

  • You already have a JSON (i.e., from external feeds) and need to store it anyway. Using the JSON datatype will be more convenient and more efficient.
  • For the Internet of Things, specifically, when storing events from sensors: some sensors may send only temperature data, some may send temperature, humidity and light (but light information is only recorded during the day), etc. Storing it in JSON format may be more convenient in that you don’t have to declare all possible fields in advance, and do not have to run “alter table” if a new sensor starts sending new types of data.

Internet of Things

In this blog post, I will show an example of storing an event stream from Particle Photon. Last time I created a device to measure light and temperature and stored the results in MySQL. provides the ability to use its own MQTT server and publish events with:

Spark.publish("temperature", String(temperature));
Spark.publish("humidity", String(humidity));
Spark.publish("light", String(light));

Then, I wanted to “subscribe” to my events and insert those into MySQL (for further analysis). As we have three different metrics for the same device, we have two basic options:

  1. Use a field per metric and create something like this: device_id int, temperature double, humidity double, light double
  2. Use a record per metric and have something like this: device_id int, event_name varchar(255), event_data text (please see this Internet of Things, Messaging and MySQL blog post for more details)

The first option above is not flexible. If my device starts measuring the soil temperature, I will have to “alter table add column”.

Option two is better in this regard, but I may significantly increase the table size as I have to store the name as a string for each measurement. In addition, some devices may send more complex metrics (i.e., latitude and longitude).

In this case, using JSON for storing metrics can be a better option. In this case, I’ve also decided to try Document Store as well.

First, we will need to enable X Plugin and setup the NodeJS / connector. Here are the steps required:

  1. Enable X Plugin in MySQL 5.7.12+, which uses a different port (33060 by default)
  2. Download and install NodeJS (>4.2) and mysql-connector-nodejs-1.0.2.tar.gz (follow the Getting Started with Connector/Node.JS guide).
    # node --version
    # wget
    # npm install mysql-connector-nodejs-1.0.2.tar.gz

    Please note: on older systems you will probably need to upgrade the nodejs version (follow the Installing Node.js via package manager guide).

Storing Events from Sensors provides you with an API that allows you to subscribe to all public events (“events” are what sensors send). The API is for NodeJS, which is really convenient as we can use NodeJS for MySQL 5.7.12 Document Store as well.

To use the Particle API, install the particle-api-js module:

$ npm install particle-api-js

I’ve created the following NodeJS code to subscribe to all public events, and then add the data (in JSON format) to a document store:

var mysqlx = require('mysqlx');
var Particle = require('particle-api-js');
var particle = new Particle();
var token = '<place your token here>'
var mySession =
    host: 'localhost',
    port: 33060,
    dbUser: 'root',
    dbPassword: '<place your pass here>'
process.on('SIGINT', function() {
    console.log("Caught interrupt signal. Exiting...");
particle.getEventStream({ auth: token}).then(function(stream) {
                stream.on('event', function(data) {
                                mySession.then(session => {
                                                .add(  data  )
                                                .execute(function (row) {
                                                        // can log something here
                                                }).catch(err => {
                                                .then( function (notices) {
                                                        console.log("Wrote to MySQL: " + JSON.stringify(notices))
                                }).catch(function (err) {
}).catch(function (err) {

How it works:

  • particle.getEventStream({ auth: token}) gives me the stream of events. From there I can subscribe to specific event names, or to all public events using the generic name “events”: stream.on(‘event’, function(data).
  • function(data) is a callback function fired when a new event is ready. The event has JSON type “data.” From there I can simply insert it to a document store: .add( data ).execute() will insert the JSON data into the event_stream document store.

One of the reasons I use document store here is I do not have to know what is inside the event data. I do not have to parse it, I simply throw it to MySQL and analyze it later. If the format of data will change in the future, my application will not break.

Inside the data stream

Here is the example of running the above code:

{ data: 'Humid: 49.40 Temp: 25.00 *C Dew: 13.66 *C HeatI: 25.88 *C',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.433Z',
  coreid: '2b0034000947343337373738',
  name: 'log' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["a3058c16-15db-0dab-f349-99c91a00"]}}
{ data: 'null',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.418Z',
  coreid: '50ff72...',
  name: 'registerdev' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["eff0de02-726e-34bd-c443-6ecbccdd"]}}
{ data: '24.900000',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.480Z',
  coreid: '2d0024...',
  name: 'Humid 2' }
{ data: '[{"currentTemp":19.25},{"currentTemp":19.19},{"currentTemp":100.00}]',
  ttl: '60',
  published_at: '2016-05-20T19:30:52.896Z',
  coreid: '2d002c...',
  name: 'getTempData' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["5f1de278-05e0-6193-6e30-0ebd78f7"]}}
{ data: '{"pump":0,"salt":0}',
  ttl: '60',
  published_at: '2016-05-20T19:30:51.491Z',
  coreid: '55ff6...',
  name: 'status' }
Wrote to MySQL: {"_state":{"rows_affected":1,"doc_ids":["d6fcf85f-4cba-fd59-a5ec-2bd78d4e"]}}

(Please note: although the stream is public, I’ve tried to anonymize the results a little.)

As we can see the “data” is JSON and has that structure. I could have implemented it as a MySQL table structure (adding published_at, name, TTL and coreid as separate fields). However, I would have to depend on those specific fields and change my application if those fields changed. We also see examples of how the device sends the data back: it can be just a number, a string or another JSON.

Analyzing the results

Now I can go to MySQL and use SQL (which I’ve used for >15 years) to find out what I’ve collected. First, I want to know how many device names I have:

mysql -A iot
Welcome to the MySQL monitor.  Commands end with ; or g.
Your MySQL connection id is 3289
Server version: 5.7.12 MySQL Community Server (GPL)
Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> select count(distinct json_unquote(doc->'$.name')) from event_stream;
| count(distinct json_unquote(doc->'$.name')) |
|                                        1887 |
1 row in set (5.47 sec)

That is slow! As described in my previous post, I can create a virtual column and index for doc->’$.name’ to make it faster:

mysql> alter table event_stream add column name varchar(255)
    -> generated always as (json_unquote(doc->'$.name')) virtual;
Query OK, 0 rows affected (0.17 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table event_stream add key (name);
Query OK, 0 rows affected (3.47 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> show create table event_stream
*************************** 1. row ***************************
       Table: event_stream
Create Table: CREATE TABLE `event_stream` (
  `doc` json DEFAULT NULL,
  `_id` varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$._id'))) STORED NOT NULL,
  `name` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$.name'))) VIRTUAL,
  UNIQUE KEY `_id` (`_id`),
  KEY `name` (`name`)
1 row in set (0.00 sec)
mysql> select count(distinct name) from event_stream;
| count(distinct name) |
|                 1820 |
1 row in set (0.67 sec)

How many beers left?

Eric Joyce has published a Keg Inventory Counter that uses a Particle Proton device to measure the amount of beer in a keg by 12oz pours. I want to see what was the average and the lowest amount of beer per day:

mysql> select date(json_unquote(doc->'$.published_at')) as day,
    ->        avg(json_unquote(doc->'$.data')) as avg_beer_left,
    ->    min(json_unquote(doc->'$.data')) as min_beer_left
    -> from event_stream
    -> where name = 'Beers_left'
    -> group by date(json_unquote(doc->'$.published_at'));
| day        | avg_beer_left      | min_beer_left |
| 2016-05-13 |  53.21008358996988 | 53.2          |
| 2016-05-18 |  52.89973045822105 | 52.8          |
| 2016-05-19 | 52.669233854792694 | 52.6          |
| 2016-05-20 |  52.60644257702987 | 52.6          |
4 rows in set (0.44 sec)


UDocument Store can be very beneficial if an application is working with a JSON field and does not know or does not care about its structure. In this post, I’ve used the “save to MySQL and analyze later” approach here. We can then add virtual fields and add indexes if needed.

Powered by WordPress | Theme: Aeros 2.0 by