Oct
05
2021
--

Using New JSON Syntax in PostgreSQL 14 – Update


As previously mentioned here on the blog, PostgreSQL 14 was just released. Need a primer? Umair Shahid recently put together a great overview, PostgreSQL 14 – Performance, Security, Usability, and Observability.

I recently did a deep-dive into using JSON from within PostgreSQL in a two-part blog series titled Storing and Using JSON Within PostgreSQL. While all the information from those two posts is still relevant with the release of PostgreSQL 14 (PG14), a quick update is in order to show off some improvements.

Keeping this brief:  Prior to PG14, pulling fields out of JSON looked something like the following:

movie_json_test=# select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_jsonb  where (jsonb_column->>'title')::varchar = 'Avengers: Endgame (2019)';
          title           | rating |  imdb_id  
--------------------------+--------+-----------
 Avengers: Endgame (2019) |        | tt4154796
(1 row)

While this worked, it is not terribly intuitive for developers who may be used to interacting with JSON in code.  Starting in PostgreSQL 14, you can use a much more natural and user-friendly subscript syntax to access data.

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   
----------------------------+--------+-------------
 "Avengers: Endgame (2019)" | null   | "tt4154796"
(1 row)

Not only can you access top-level attributes, but you can also access nested arrays and lists as well:

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id, jsonb_column['cast'][0]  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   |                                            jsonb_column                                             
----------------------------+--------+-------------+-----------------------------------------------------------------------------------------------------
 "Avengers: Endgame (2019)" | null   | "tt4154796" | {"id": "nm0000375", "name": "Robert Downey Jr.", "character": "Tony Stark /              Iron Man"}
(1 row)

Overall, this syntax will still make use of all the same indexes, JSON functions, etc., that were previously available.  While this is a relatively minor change, it should make things much more natural for developers and hopefully drive more adoption and usage.
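Subscripting is not read-only, either. Here is a quick sketch (against the same movies_jsonb table used above) of an in-place update using the new syntax; note that the assigned value must be valid jsonb, so strings need JSON quoting:

```sql
-- PG14 subscript syntax also works on the left-hand side of SET.
-- The assigned value is jsonb, so '8.4' is a JSON number and the
-- title comparison needs a JSON-quoted string.
update movies_jsonb
set jsonb_column['imdb_rating'] = '8.4'
where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
```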


Sep
08
2021
--

Storing and Using JSON Within PostgreSQL Part Two


We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly.  Most of the functionality we talked about in part one works well when you have a simple JSON format.  However, in real life, our documents may get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.
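For instance, one simple option (a sketch using the same movies_jsonb table and data as above) is jsonb_array_elements, which expands the nested cast array into one row per element:

```sql
-- Expand the cast array into one row per actor for a single movie.
select c->>'name'      as actor,
       c->>'character' as character
from movies_jsonb,
     jsonb_array_elements(jsonb_column->'cast') as c
where imdb_id = 'tt2395427';
```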

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the  following:

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}'

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."'

However, what if we also need to pull out the character from the movie?  For our needs of getting a full list of actors and characters who were in this particular movie, we can use jsonb_to_recordset (similar to MySQL’s json_table function we covered in the MySQL part of this series).

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character     
-----------+--------------------------------+-------------+-----------+----------------------+------------------
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10; 
ERROR:  cannot call jsonb_to_recordset on a non-array

When we look at the results, you can see that the function expects an array, and several of our movies have no cast (or a NULL value in the json).

movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof 
-----------+--------------
 tt7620156 | 
 tt0109524 | array
 tt0534466 | array
 tt0111091 | 
 tt4167726 | array
 tt0638383 | array
 tt6346314 | array
 tt5877808 | array
 tt4098782 | 
 tt0365100 | array
(10 rows)

You can work around this NULL issue in a couple of different ways.  The easiest is by converting to text, then back into JSON.  For example:

select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in part one how I harped on types and ensuring you cast to different data types when you needed to?  This is another example.  In this case, the JSON null is first extracted as text (which yields a SQL NULL), and casting that back to jsonb gives a NULL that jsonb_to_recordset expands to zero rows instead of raising an error.
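You can see this behavior directly with a couple of one-liners (a small sketch, independent of our test tables):

```sql
-- ->> on a JSON null returns SQL NULL text:
select ('{"cast": null}'::jsonb)->>'cast';

-- jsonb_to_recordset over a SQL NULL returns zero rows instead of
-- erroring the way it does on a jsonb null or other non-array:
select * from jsonb_to_recordset(NULL::jsonb) as t(id text, name text);
```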

While this is the easiest way, let’s show some other interesting ways to work around this to highlight some of the other functions, indexes, etc., we learned in part one.  Because we know that we have NULL values for some cast entries, we could check and filter out the values where the type is not empty or null.  For example, here is a simple check if the text version of our jsonb_column->’cast’ is not equal to null.

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
 tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
 tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
 tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
 tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
 tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
 
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
 Planning Time: 0.064 ms
 Execution Time: 84.692 ms
(10 rows)

This is not terribly fast, but it does work.  It is, however, basically working around the functionality built into the JSON functions.  To speed our query up, we can, of course, index this column; however, some of the data in our movie cast list is just too large:

movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191
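One possible workaround for the row-size limit (a sketch, not used in the tests below) is to index a hash of the value rather than the raw text:

```sql
-- Index an md5 hash of the cast text instead of the raw value,
-- which sidesteps the "index row requires 10704 bytes" error.
-- Queries must then compare md5() hashes to use this index.
create index gen_func_index_cast_md5
    on movies_json_generated (md5(jsonb_column->>'cast'));
```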

Of course, we can work around this as well.  But sometimes, it’s best to pivot.  Let’s look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we could create either an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored; 
ALTER TABLE


movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
 tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
 tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
 tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
 tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
 tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)


movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
 Planning Time: 0.068 ms
 Execution Time: 30.572 ms
(10 rows)

You can see there are several options here for dealing with the nulls, some way easier (and cleaner) than others.  I want to highlight some of the challenges this brings up with using unstructured data within a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that you can evolve what you store over time.  Let’s use the above example.  Let’s say that for years, every movie that is fed into your system has a full cast listing of characters and actors.  Then, one day, the feed you get your data from allows movies without a cast listing.  Your application will still work, your load scripts will still work.  But every once in a while, your users will get a weird error, or you will see small flashes in your logs (if you are logging these).  But 99.9% of queries are fine.  It is these transient issues that drive people bonkers.

Looking at this problem slightly differently, what if you start adding data or changing the order of certain items?  Back to our original JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

What if we wanted to add another piece of data to the cast records? Let’s say we want to add a thumbs up, or thumbs down vote to each character, so something like this:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100,
      "character": "Bruce Banner"
    },

Now we run into another challenge.  If we inject the new fields into the middle of the JSON instead of at the end, our call to jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text) will cause issues in the application.  We had to define the structure of the record set: the first field returned is id, then the actor's name, and then the character.  If we changed this record, the character returned in the SQL would be 10000, not Bruce Banner.  Now, to make additions, you would need to start adding at the end, like:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100
    },

MySQL mitigates this by allowing you to define specific paths in its equivalent function: json_table(json_column, '$.cast[*]' columns(V_name varchar(200) path '$.name', V_character varchar(200) path '$.character')).  This allows you to define exactly which fields you want, not just the first X ones.  That said, there are other limitations in MySQL's json_table method.  This is a trade-off.  You need to be aware, again, that the order and structure of your JSON matter greatly if you are going to use built-in database functions to query this data.
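PostgreSQL can get close to that path-based behavior with jsonb_path_query (available since PostgreSQL 12). Here is a hedged sketch against our movies_jsonb test table, pulling fields by key name rather than by position in the record set:

```sql
-- Each cast entry is selected by JSON path, then fields are pulled
-- out by key name, so the order of keys inside an entry does not matter.
select c->>'name' as name, c->>'character' as character
from movies_jsonb,
     jsonb_path_query(jsonb_column, '$.cast[*]') as c
where imdb_id = 'tt2395427';
```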

To JSON or Not?  Database Design

While PostgreSQL’s JSON features are pretty in-depth (there are dozens of other functions I did not show), there are similar challenges I see when using these compared to a more classic database design. JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool to offer additional flexibility.  If you treat PostgreSQL or any database simply as a dumping ground for data, bad things happen.  It would help to think about how JSON columns can augment your current database designs, not replace them.

Let me give you a simple example of some of the dangers of simply forgoing database design in favor of just dumping in JSON.  While I think we have highlighted some of the complexity and setup challenges with different methods of storing JSON in PostgreSQL, I wanted to take this a step further and compare the performance of different database designs and their impact on performance and scalability.  For this, I built three separate designs and schemas for housing the same data:

A.)  Minimal Database Design -> single table, single column.  “The dump.”  With only a GIN index.

B.)  Single Table -> Generated columns and Expression indexes where needed.

C.)  Old School Normalized Tables, with JSON stored for easy use

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

Query                                           | A) No GIN IDX | A) With GIN IDX | B) Generated Columns | B) Expression Indexes | C) Normalized
------------------------------------------------|---------------|-----------------|----------------------|-----------------------|--------------
Simple query for movie title (random 10 titles) | 800ms         | 0.3ms           | 0.2ms                | 0.2ms                 | 0.2ms
Select top 10 movies                            | 841ms         | 831ms           | 0.9ms                | 0.9ms                 | 0.3ms
Select all movies for an actor                  | 1009.0ms      | 228ms           | 5588.0ms(2)          | 5588.0ms(2)           | 0.8ms
Count movies for a given actor                  | 5071ms        | 5075ms          | 5544ms               | NA                    | 1ms
  1. Each test was repeated 100 times, and the average results were listed.  Min/Max is available as well. 
  2. The nested json for our “cast info” was not easily indexed, there are some things we could do to mitigate this, but it is a bit more complicated.

Interestingly, there are a few cases where we have problems with using only generated columns/expression indexes, generally because of nested data within the JSON.  We could do some tricks, like adding a GIN index and query the JSON to find the element or creating some funky nested generated column with an expression index.  I left these out for now.  Also, I could go deeper into the tests, testing inserts, updates, and, of course, assembling JSON from a normalized table, but I didn’t.  My goal here is merely to make you think about your design.  Choosing one way to implement and use JSON may be ok under certain use cases, but you may find yourself querying or aggregating the data, and things break.
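One of those GIN-based tricks, sketched briefly (assuming the single-table jsonb design): a jsonb_path_ops GIN index that accelerates containment (@>) searches into the nested cast array.

```sql
-- jsonb_path_ops GIN indexes are smaller than the default jsonb_ops
-- opclass and support the @> containment operator used here.
create index movies_jsonb_cast_path_idx
    on movies_jsonb using gin (jsonb_column jsonb_path_ops);

select jsonb_column->>'title', jsonb_column->>'imdb_rating'
from movies_jsonb
where jsonb_column @> '{"cast": [{"name": "Robert Downey Jr."}]}';
```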

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways. jsonb_to_recordset is the easiest way I found to pull out the data I need.  However, using this function or others is very susceptible to the structure of the data in your document (data types matter!).
  • JSON data whose format changes (elements added or removed) may make some functions difficult, if not impossible, to use.
  • JSON within a well-built, designed application and database can be wonderful and offer many great benefits.  JSON just dumped into databases won’t scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!

Sep
03
2021
--

Storing and Using JSON Within PostgreSQL Part One


Continuing our series on storing JSON directly within your database, we are now looking at the JSON functionality built into PostgreSQL.  You can see the first two posts in this series exploring MySQL’s JSON functionality with Storing JSON in Your Databases: Tips and Tricks For MySQL Part One and Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two.  I used the exact same table structures and datasets as I did within my MySQL tests.  You can get the instructions to follow along at home here on GitHub.

PostgreSQL has two JSON data types available to you: first JSON, and second JSONB.  The docs highlight the differences pretty well:

[Table: json vs. jsonb comparison from the PostgreSQL documentation]

Basically, JSONB is stored decomposed, making it more efficient for the database to interact with, while the JSON type is stored as an exact text copy of what was input.  This means formatting, white space, etc., will be lost with JSONB.  That said, the suggestion is to generally use JSONB.  The trade-off being slightly slower inserts vs. faster overall read performance (on top of the storage differences mentioned above).
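A quick way to see this difference (a minimal sketch, no tables required): casting the same text to each type shows json preserving the input verbatim while jsonb normalizes it.

```sql
-- json keeps the text exactly as entered, whitespace and duplicate key included:
select ' {"b":2,  "a":1, "a":3} '::json;

-- jsonb decomposes and normalizes: whitespace dropped, keys ordered,
-- and only the last duplicate key kept, yielding {"a": 3, "b": 2}
select ' {"b":2,  "a":1, "a":3} '::jsonb;
```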

In terms of performance, how much of a difference is there?  That is tricky, as it will depend heavily on the size and complexity of the JSON document being stored.  The larger, more complex the document, the larger the difference you could see.  But to illustrate an example, we will use our movie JSON to select the same data from both JSON and JSONB.

I created the following tables and loaded them with movie-related JSON:

create table movies_json (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	json_column json not null
);

create unique index movies_json_imdb_idx on movies_json(imdb_id);


create table movies_jsonb (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	jsonb_column jsonb not null
);

create unique index movies_jsonb_imdb_idx on movies_jsonb(imdb_id);

Querying the JSON column results in slower retrieval:

movie_json_test=# explain (verbose true, analyze true) select * from movies_json where json_column->>'title' = 'Avengers: Endgame (2019)';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..55775.85 rows=1880 width=1059) (actual time=694.047..2516.879 rows=1 loops=1)
   Output: ai_myid, imdb_id, json_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_json  (cost=0.00..54587.85 rows=783 width=1059) (actual time=1905.511..2512.010 rows=0 loops=3)
         Output: ai_myid, imdb_id, json_column
         Filter: ((movies_json.json_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=2511.276..2511.277 rows=0 loops=1
         Worker 1:  actual time=2511.322..2511.322 rows=0 loops=1
 Planning Time: 0.166 ms
 Execution Time: 2516.897 ms
(12 rows)

While the JSONB column is faster (3x):

movie_json_test=# explain (verbose true, analyze true) select * from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54116.60 rows=1873 width=1025) (actual time=723.324..726.914 rows=1 loops=1)
   Output: ai_myid, imdb_id, jsonb_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52929.30 rows=780 width=1025) (actual time=548.982..721.730 rows=0 loops=3)
         Output: ai_myid, imdb_id, jsonb_column
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=720.995..720.995 rows=0 loops=1
         Worker 1:  actual time=202.751..720.994 rows=1 loops=1
 Planning Time: 0.038 ms
 Execution Time: 726.933 ms
(12 rows)

To ensure this is not a single data point, I wrote a small script to run this test (with this title and other random titles).  Over 100 runs we had pretty consistent results (times in seconds):

               JSON      JSONB
Average Time:  2.5492    0.7470
Min Time:      2.5297    0.7298
Max Time:      2.5654    0.7828

As stated, inserts/updates will lose some of that performance, so keep that in mind.  It is, however, easier to interact with and update data in a JSONB column.  For instance, some functions are only available for the JSONB data type, like the jsonb_set function, which allows you to update your JSON in place.

movie_json_test=# explain (verbose true, analyze true) update movies_jsonb set jsonb_column= jsonb_set(jsonb_column, '{imdb_rating}', '9') where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Update on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=1928.691..1928.692 rows=0 loops=1)
   ->  Seq Scan on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=577.386..1928.246 rows=1 loops=1)
         Output: ai_myid, imdb_id, jsonb_set(jsonb_column, '{imdb_rating}'::text[], '9'::jsonb, true), ctid
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 375358
 Planning Time: 0.040 ms
 Execution Time: 1928.718 ms
(7 rows)

I have seen references in several places to the insert performance difference between JSON and JSONB, but I have not seen any concrete numbers.  I decided to run a few tests:

  • Inserting 10K records into a JSON column (via insert into select from):  72.851 ms
  • Inserting 10K records into a JSONB column (via insert into select from):  754.045 ms
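The test itself was nothing fancy; roughly something like this sketch (assuming the two tables defined earlier, with the invalid rows already loaded into movies_json):

```sql
-- Time a 10K-row insert into the jsonb table, sourcing rows from the
-- json table and casting on the way in; \timing is a psql meta-command.
\timing on
insert into movies_jsonb (imdb_id, jsonb_column)
select imdb_id, json_column::jsonb from movies_json limit 10000;
```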

If you are doing a heavy insert workload, the difference is significant enough to take note and plan for it.  

Another distinct advantage for JSONB is the ability to use GIN indexes over the JSON document.  GIN indexes are designed and optimized for text searches, a focus that lends itself well to JSON.  That said, the syntax may be a bit more complicated to make use of GIN indexes.

Here is how you create a GIN index. 

movie_json_test=# CREATE INDEX movies_jsonb_gin_index ON movies_jsonb USING gin (jsonb_column);
CREATE INDEX

You can see the original query is still not using this index:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=716.059..719.346 rows=1 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=544.197..714.577 rows=0 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=202.768..713.907 rows=1 loops=1
         Worker 1:  actual time=713.890..713.891 rows=0 loops=1
 Planning Time: 0.041 ms
 Execution Time: 719.365 ms
(12 rows)

Why not?  The ->> operator returns the value as text, and the GIN index is built over the jsonb document rather than over extracted text values, so the query cannot use it.  There are a couple of ways to work around this.  The first is to use the @@ operator, which evaluates a JSON path predicate against the document:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title == "Avengers: Endgame (2019)"';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=72.29..221.83 rows=38 width=64) (actual time=0.179..0.180 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..72.28 rows=38 width=0) (actual time=0.068..0.068 rows=2 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
 Planning Time: 0.145 ms
 Execution Time: 0.199 ms
(8 rows)

This works.  You can also use @>, which checks whether the JSON document contains the given key/value entries.

movie_json_test=> explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "title" : "Avengers: Endgame (2019)"}';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=44.29..193.94 rows=38 width=64) (actual time=0.199..0.200 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..44.28 rows=38 width=0) (actual time=0.091..0.091 rows=1 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
 Planning Time: 0.147 ms
 Execution Time: 0.214 ms
(8 rows)

Not only did this use the index, it is now several thousand times faster overall (0.199 ms vs. 719 ms).  The PostgreSQL documentation offers a fairly detailed overview of the different options to interact with JSON, especially when it comes to the functions and operators available to you (note the table in the PG 12 docs giving an overview of the operators is a little more readable). It’s important to note that not all operators and functions will work with GIN indexes.  For example, using ‘like_regex’:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title like_regex "^Avengers*"';
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..53546.87 rows=38 width=64) (actual time=218.550..795.063 rows=9 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52543.07 rows=16 width=64) (actual time=251.866..790.098 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: (movies_jsonb.jsonb_column @@ '($."title" like_regex "^Avengers*")'::jsonpath)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=311.403..789.402 rows=3 loops=1
         Worker 1:  actual time=225.825..789.408 rows=1 loops=1
 Planning Time: 0.204 ms
 Execution Time: 795.087 ms
(12 rows)

In addition to GIN indexes, you can create hash or btree indexes directly on your JSON columns; however, these are only useful if you are comparing or searching on the entire JSON document.  For anything more granular, you want an expression index (often referred to as a functional index in other databases).
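To illustrate the point (these index names are hypothetical, not from the original test setup), a whole-document index like the following only helps when the predicate compares the complete document, e.g. where jsonb_column = '{...}'::jsonb, and does nothing for lookups on individual keys:

-- whole-document indexes: useful only for full-document equality/ordering
create index movies_jsonb_whole_btree on movies_jsonb using btree (jsonb_column);
create index movies_jsonb_whole_hash on movies_jsonb using hash (jsonb_column);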

Generated Columns and Expression Indexes

While GIN indexes work for some use cases, you will often find it easier and more convenient to use actual columns or the normal functions you are familiar with.  The simplest use of an expression index is to pull out and index a field you will use often.  Let’s say we will often want to search for a title.  We can create an index on the title extracted from the JSON in our previous example.

movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text));
CREATE INDEX

This creates a btree index on the jsonb_column ->> 'title' expression, allowing us to search for a title.

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=66.97..6421.19 rows=1877 width=64) (actual time=0.138..0.140 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..66.50 rows=1877 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
 Planning Time: 0.160 ms
 Execution Time: 0.159 ms
(8 rows)

Originally, when we searched for movies with the title ‘Avengers: Endgame (2019)’ there was no usable index (unless you modified the SQL to make use of the GIN index).  Now, with this new index in place, you can see our original query is no longer doing a sequential scan and is instead using the btree index.  If we also want the index to be used for LIKE 'string%' partial matches, the index needs something extra.  Note: this is not something you can do by default using only the GIN index (although you can look into the pg_trgm extension for that).  Here we will create the same index with the operator class “text_pattern_ops”, which is designed to work with LIKE and regex comparisons.

Before:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=371.463..723.743 rows=8 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=288.053..718.957 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=204.176..718.288 rows=2 loops=1
         Worker 1:  actual time=288.637..718.299 rows=3 loops=1
 Planning Time: 0.130 ms
 Execution Time: 723.762 ms
(12 rows)

After:

movie_json_test=# drop index movie_jsonb_title_index;
DROP INDEX
movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text) text_pattern_ops);
CREATE INDEX
movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=71.66..6425.88 rows=1877 width=64) (actual time=0.195..0.498 rows=8 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..71.19 rows=1877 width=0) (actual time=0.015..0.015 rows=8 loops=1)
         Index Cond: (((movies_jsonb.jsonb_column ->> 'title'::text) ~>=~ 'Avengers'::text) AND ((movies_jsonb.jsonb_column ->> 'title'::text) ~<~ 'Avengert'::text))
 Planning Time: 0.168 ms
 Execution Time: 0.519 ms
(8 rows)

Now you can also perform a similar operation, but store the results as a queryable generated column instead:

movie_json_test=> \d movies_jsonb
                                         Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                    Default                    
--------------+------------------------+-----------+----------+-----------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

movie_json_test=> alter table movies_jsonb add column title varchar(255) generated always as (((jsonb_column ->> 'title')::text)) stored;
ALTER TABLE


movie_json_test=> \d movies_jsonb
                                                  Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                             Default                             
--------------+------------------------+-----------+----------+-----------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
 title        | character varying(255) |           |          | generated always as (((jsonb_column ->> 'title'::text))) stored
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

Now we have the option to add an index on the title column, select it without the JSON operators, etc.  Note: generated columns cannot be updated directly; their value is always recomputed from the JSON column.
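For example, a direct update of the generated column is rejected. The exact message may vary by version, but you should see something along these lines:

movie_json_test=> update movies_jsonb set title = 'New Title' where ai_myid = 1;
ERROR:  column "title" can only be updated to DEFAULT
DETAIL:  Column "title" is a generated column.

To change the value, update the underlying jsonb_column and let PostgreSQL recompute the generated column.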

We can create our entire movie_json table with extra columns and expression indexes and make it look a bit more like a normal table:

create table movies_json_generated (
	ai_myid serial primary key, 
	imdb_id varchar(255) generated always as (jsonb_column ->> 'imdb_id') stored,
	title varchar(255) generated always as (jsonb_column ->> 'title') stored,
    imdb_rating decimal(5,2) generated always as ((jsonb_column  ->> 'imdb_rating')::numeric) stored,
	overview text generated always as (jsonb_column ->> 'overview') stored,
	director jsonb generated always as ((jsonb_column ->> 'director')::json) stored,
	country varchar(100) generated always as (jsonb_column ->> 'country') stored,
	jsonb_column jsonb,
	json_column json
);

create unique index gen_imdb_idx on movies_json_generated(imdb_id);
create index gen_title_idx on movies_json_generated(title);
create index gen_func_title_index on movies_json_generated (((json_column ->> 'title')::varchar));	
CREATE INDEX Gen_gin_index ON movies_json_generated USING gin (jsonb_column);

You may have noticed we had to explicitly cast the columns to a specific data type.  This is one of the challenges, or at least one of the more difficult things to get used to, with JSON in both MySQL and PostgreSQL.

It’s worth remembering that a JSON document does not have explicit data types, so you will often find that some functions or indexes may not work as expected because you are comparing data of two different types (and some operations will return different results, e.g. sorting ASCII vs. numeric).  For example, let’s try to add to our IMDB rating:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1.1))) ;
ERROR:  operator does not exist: text + numeric
LINE 1: ...imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1.1))) ;
                                                              ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

Here the imdb_rating is returned as text, so you cannot add a number to it.  Logically, then, you want to cast the value to numeric to allow the addition.

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1))) ;
ERROR:  function jsonb_set(jsonb, unknown, numeric) does not exist
LINE 1: ...pdate  movies_jsonb_generated_1 set jsonb_column= jsonb_set(...
                                                             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Again we run into a type issue.  This time, the jsonb_set function cannot accept a numeric value (nor will PostgreSQL implicitly convert it) where the function expects a JSONB value.  So let’s cast the result to JSONB:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) ;
ERROR:  cannot cast type numeric to jsonb
LINE 1: ...((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) ...

Here we find we can not cast a numeric directly to JSONB.  So we have to convert the numeric to text, then to JSONB.

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::text::jsonb)) ;
UPDATE 100000
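As an aside (this is an alternative, not what the original test used): the to_jsonb() function can replace the ::text::jsonb double cast and reads a little more naturally:

-- to_jsonb() converts any SQL value directly to jsonb
update movies_jsonb_generated_1 set jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}', to_jsonb((jsonb_column ->> 'imdb_rating')::numeric + 1));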

For some developers, this may prove a bit tricky to get the hang of.  The same applies to operations that technically work but are slow or silently produce the wrong data: when comparing, sorting, or filtering, you may not pick up the index you expect, and you may end up comparing the ASCII representation of numeric values.  Below you can see us explicitly convert certain columns in order to compare them. You can also see that when you generate a column or create an expression index you need to do the same:

movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where (jsonb_column->>'imdb_rating')::numeric > 8 order by (jsonb_column->>'imdb_rating')::numeric desc limit 5;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=65291.78..65292.36 rows=5 width=128) (actual time=1055.064..1058.740 rows=5 loops=1)
   ->  Gather Merge  (cost=65291.78..77449.06 rows=104198 width=128) (actual time=1055.062..1058.738 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=64291.76..64422.00 rows=52099 width=128) (actual time=1053.392..1053.393 rows=4 loops=3)
               Sort Key: (((jsonb_column ->> 'imdb_rating'::text))::numeric) DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..63426.41 rows=52099 width=128) (actual time=0.993..1052.275 rows=2688 loops=3)
                     Filter: (((jsonb_column ->> 'imdb_rating'::text))::numeric > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.061 ms
 Execution Time: 1058.769 ms
(14 rows)
 
movie_json_test=> \d movies_json_generated
                                                     Table "public.movies_json_generated"
    Column    |          Type          | Collation | Nullable |                                    Default                                    
--------------+------------------------+-----------+----------+-------------------------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_json_generated_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | generated always as (((json_column ->> 'imdb_id'::text))) stored
 title        | character varying(255) |           |          | generated always as (((json_column ->> 'title'::text))) stored
 imdb_rating  | numeric(5,2)           |           |          | generated always as (((json_column ->> 'imdb_rating'::text)::numeric)) stored
 overview     | text                   |           |          | generated always as (json_column ->> 'overview'::text) stored
 director     | jsonb                  |           |          | generated always as (((json_column ->> 'director'::text)::json)) stored
 country      | character varying(100) |           |          | generated always as (((json_column ->> 'country'::text))) stored
 jsonb_column | jsonb                  |           |          | 
 json_column  | json                   |           |          | 
Indexes:
    "movies_json_generated_pkey" PRIMARY KEY, btree (ai_myid)
    "gen_func_title_index" btree (((json_column ->> 'title'::text)::character varying))
    "gen_gin_index" gin (jsonb_column)
    "gen_imdb_idx" UNIQUE, btree (imdb_id)
    "gen_title_idx" btree (title)
 
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=62548.12..62548.70 rows=5 width=102) (actual time=112.458..114.704 rows=5 loops=1)
   ->  Gather Merge  (cost=62548.12..63277.80 rows=6254 width=102) (actual time=112.457..114.702 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=61548.09..61555.91 rows=3127 width=102) (actual time=110.807..110.808 rows=4 loops=3)
               Sort Key: imdb_rating DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..61496.16 rows=3127 width=102) (actual time=0.128..109.939 rows=2688 loops=3)
                     Filter: (imdb_rating > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.146 ms
 Execution Time: 114.729 ms
(14 rows)

movie_json_test=> create index test_index_imdb_rating on movies_json_generated (imdb_rating);
CREATE INDEX
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..19.41 rows=5 width=102) (actual time=0.094..0.134 rows=5 loops=1)
   ->  Index Scan Backward using test_index_imdb_rating on movies_json_generated  (cost=0.42..28521.09 rows=7510 width=102) (actual time=0.093..0.132 rows=5 loops=1)
         Index Cond: (imdb_rating > '8'::numeric)
 Planning Time: 0.136 ms
 Execution Time: 0.152 ms
(5 rows)

A quick side note:  there will be a performance difference for inserts/updates on JSON/JSONB columns that have one or more generated columns or expression indexes.  Because of how JSONB is stored, updates on a JSONB column with several generated columns/expression indexes ran 3-4x (or more) faster in my tests than the same setup on a JSON column.  That said, the more generated columns and expression indexes you have on a table, the more they can impact performance… it is a trade-off.

  • A table with a single generated column, took 7650ms to load 100K rows.
  • A table with 6 generated columns, took 8576ms to load 100K rows.

While this is a very simple setup with just a single JSONB column and the generated columns, you can see a slight overhead.  This held true with the same 100K records being updated, updating only one column in the table with six generated columns took 4.7 seconds vs. 3.6 seconds for a single generated column (a ~25% difference).  Again, not an in-depth test, but it illustrates that adding a ton of expression indexes or generated columns will not be without some cost.

Recap & What’s Next

A few quick takeaways:

  • Using JSONB is probably going to be your best option in most use cases
  • Be very careful with type conversions and with making assumptions about the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits

Now you have the basics of JSON in PostgreSQL.  In the next post, we will review some of the more advanced options, do a deeper dive on performance, and explore a normalized schema vs. one that is heavily JSON.

Aug
20
2021
--

Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two

Storing JSON MySQL

JSON Structure, The Next Generation (Performance, Scale, and Fun)

Now you know the basic JSON functionality available to you, as covered in part one of this series.  Let’s look deeper into performance, scale, and oddities that may occur.

You can do almost anything you could do in SQL with standard data types and objects, but you may run into specific problems and limitations.  Here is an example of that.  Let’s say I want to get the top 10 rated movies of all time.

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating' desc limit 10;

ERROR 1038 (HY001): Out of sort memory, consider increasing server sort buffer size

In this case, the space required to sort the data set is much larger than the available sort buffer.  Selecting this data from a normalized structure would not cause this (nor would generated columns).

mysql> select @@sort_buffer_size;
+--------------------+
| @@sort_buffer_size |
+--------------------+
|             262144 |
+--------------------+
1 row in set (0.00 sec)

mysql> set @@sort_buffer_size=26214400;
Query OK, 0 rows affected (0.00 sec)

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating' desc limit 10;
+----------------------------------------------------+--------+-----------+
| title                                              | rating | imdb_id   |
+----------------------------------------------------+--------+-----------+
| The Mountain II (2016)                             | 9.9    | tt5813916 |
| Toxic Temptation (2016)                            | 9.9    | tt4621978 |
| 1985 (2016)                                        | 9.9    | tt5932546 |
| Jag har din rygg (2015)                            | 9.8    | tt3689312 |
| My Head Hurts (2000)                               | 9.8    | tt1346290 |
| Boz: Colors and Shapes (Video 2006)                | 9.8    | tt0876256 |
| Foreclosed (Video 2010)                            | 9.8    | tt1648984 |
| Nocturne in Black (2016)                           | 9.8    | tt4536608 |
| Monrad & Rislund: Det store triumftog (Video 2004) | 9.8    | tt0425266 |
| Questione di Sguardi (2014)                        | 9.8    | tt4423586 |
+----------------------------------------------------+--------+-----------+
10 rows in set, 65535 warnings (0.78 sec)

So we solved the issue of not having enough space to sort the dataset, but did you notice the time?  Almost one second.  No website you work on nowadays can succeed if all your database queries take 0.5-1 second to return.  Yes, you can put a cache in front of these queries, and you probably will.  A better cache solves it all, right?  Well, not really.  In this case, you can speed things up drastically with a few small but important improvements.  Before moving on, did you notice the warnings?  Let’s ignore those for a moment.  (They are caused by the null values in the rating column, but they are not impacting performance.)

Let’s look at generated columns.   Generated columns allow you to create columns computed on the fly from other columns. (Note: you can also use functional indexes, which I will cover later.)  In the case of JSON, we can pull values out of our document and store them, read-only, in a column that can be indexed (we all know indexing is good, right!).  Let’s do this!  I am going to create a table with generated columns for several JSON fields, then load the data via an insert from our already-loaded movies table.

mysql> create table movies_json_generated (
     ->         ai_myid int AUTO_INCREMENT primary key,
     ->         imdb_id varchar(255) generated always as (`json_column` ->> '$.imdb_id'),
     ->         title varchar(255) generated always as (`json_column` ->> '$.title'),
     ->         imdb_rating decimal(5,2) generated always as (`json_column` ->> '$.imdb_rating'),
     ->         overview text generated always as (`json_column` ->> '$.overview'),
     ->         director json generated always as (`json_column` ->> '$.director'),
     ->         cast json generated always as (`json_column` ->> '$.cast'),
     ->         json_column json
     -> ) engine = innodb;
 Query OK, 0 rows affected (0.04 sec)

 mysql> create unique index imdb_idx on movies_json_generated(imdb_id);
 Query OK, 0 rows affected (0.04 sec)
 Records: 0  Duplicates: 0  Warnings: 0

 mysql> insert into movies_json_generated (json_column ) select json_column from  movies_json;
 ERROR 1366 (HY000): Incorrect decimal value: 'null' for column 'imdb_rating' at row 1
 mysql> Drop Table movies_json_generated;
 Query OK, 0 rows affected (0.04 sec)

I wanted to take a moment and dive into what I have seen as a common issue. As you can see, the table was created without problem, but it failed when inserting data from our JSON objects.  The reason? I am storing the  IMDB rating in a decimal(5,2) field, but the JSON reference has ‘null’ (quoted) as text. Type conversions when working with JSON and MySQL can be a bit tricky for some people.  Remember those warnings above?  They were type conversion warnings.  Easy to ignore as they did not appear to cause an immediate issue (but they did).  Type conversions and character sets can be a bit of a problem when accessing JSON data. 
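When you are unsure what type a value actually holds inside the document, MySQL’s JSON_TYPE() function will tell you.  A quick sketch against the same table (the results depend on your data, so none are shown here):

-- inspect the underlying JSON type of each extracted value
select json_extract(json_column, '$.imdb_rating') as raw_value,
       json_type(json_extract(json_column, '$.imdb_rating')) as value_type
from movies_json limit 5;

JSON_TYPE distinguishes, among others, DOUBLE, INTEGER, STRING, and the JSON NULL literal, which is exactly the distinction that tripped up the insert above.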

Allow me a slightly divergent rant:   an inconsistent or rapidly changing JSON format/schema in your application can make using standard database functions difficult (if not impossible).

I often hear people talk about structure and the lack of flexibility as a massive drawback of relational databases. JSON is flexible; people love flexibility. The issue is that flexibility comes at a cost, and you will pay that cost somewhere.  If you validate and maintain a structure in your database, your code must fit that structure, and changes to the structure may be slow (database migrations).  If you use a flexible database schema, then you need to validate in your code.  That validation may be simple now, but as you grow, the more iterations or changes to your data, the more sprawling the code to check and validate it becomes. Whether you want to admit it or not, throwing unstructured data into the wasteland that is a document is a recipe for problems. When you store just a dump of JSON in your database, that data can be complex to access or manipulate without writing code, and you lose access to that validation.

In this case, the type conversion is pretty straightforward, and you can solve the issue in a couple of different ways. Ultimately it is about ensuring the JSON value is converted correctly. Here I will use the json_value function to unquote the null:

mysql> create table movies_json_generated (
    ->    ai_myid int AUTO_INCREMENT primary key,
    ->    imdb_id varchar(255) generated always as (`json_column` ->> '$.imdb_id')  ,
    ->    title varchar(255) generated always as (`json_column` ->> '$.title') ,
    ->    imdb_rating decimal(5,2) generated always as (json_value(json_column, '$.imdb_rating')) ,
    ->    overview text generated always as (`json_column` ->> '$.overview') ,
    ->    director json generated always as (`json_column` ->> '$.director') ,
    ->    cast json generated always as (`json_column` ->> '$.cast') ,
    ->    json_column json
    -> ) engine = innodb;

 mysql> 
 mysql> create unique index imdb_idx on movies_json_generated(imdb_id);
 Query OK, 0 rows affected (0.01 sec)
 Records: 0  Duplicates: 0  Warnings: 0

 mysql> insert into movies_json_generated (json_column ) select json_column from  movies_json;
 Query OK, 375359 rows affected (40.26 sec)
 Records: 375359  Duplicates: 0  Warnings: 0

Now let’s compare searching for a movie using first the imdb_id from the JSON document and then from the generated table using the column we indexed:

mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt2395427';
 +--------------------------------+-------------------------------+
 | json_column->>'$.title'        | json_column->>'$.imdb_rating' |
 +--------------------------------+-------------------------------+
 | Avengers: Age of Ultron (2015) | 7.5                           |
 +--------------------------------+-------------------------------+
 1 row in set (0.86 sec)

 mysql> select title, imdb_rating from movies_json_generated where imdb_id='tt2395427';
 +--------------------------------+-------------+
 | title                          | imdb_rating |
 +--------------------------------+-------------+
 | Avengers: Age of Ultron (2015) |        7.50 |
 +--------------------------------+-------------+
 1 row in set (0.01 sec)

Great!  Let’s go back to our example using IMDB rating to get the top 10 movies.  To make this faster, we will need to first create an index on the generated column.

mysql> create index idx_rating on movies_json_generated ( imdb_rating );
 Query OK, 0 rows affected (1.45 sec)
 Records: 0  Duplicates: 0  Warnings: 0

With that out of the way, let’s get the top 10 list:

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 10;
 +--------------------------------------------------+--------+-----------+
 | title                                            | rating | imdb_id   |
 +--------------------------------------------------+--------+-----------+
 | Advent (IV) (2016)                               | 10.0   | tt6129028 |
 | 311 Live: 3/11 Day 2006 (2006)                   | 10.0   | tt0872240 |
 | Light Study (2013)                               | 10.0   | tt3130306 |
 | Future Boyfriend (2016)                          | 10.0   | tt3955652 |
 | Cory in the House: All Star Edition (Video 2007) | 10.0   | tt2402070 |
 | 1985 (2016)                                      | 9.9    | tt5932546 |
 | Toxic Temptation (2016)                          | 9.9    | tt4621978 |
 | The Mountain II (2016)                           | 9.9    | tt5813916 |
 | Questione di Sguardi (2014)                      | 9.8    | tt4423586 |
 | Foreclosed (Video 2010)                          | 9.8    | tt1648984 |
 +--------------------------------------------------+--------+-----------+
 10 rows in set (0.01 sec)

A very nice drop from 0.78 seconds to 0.01 seconds!  But wait… why is the data different?  Ahhh, glad you noticed!  As discussed above, pulling data out of JSON often requires a type conversion.  By default, values extracted from JSON with ->> come out as text, not numeric, so the original query sorted on the string value rather than the number (oops).  You can get the same results from the raw JSON by forcing the type conversion:

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json_generated where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating'*1 desc limit 10;
+--------------------------------------------------+--------+-----------+
| title                                            | rating | imdb_id   |
+--------------------------------------------------+--------+-----------+
| 311 Live: 3/11 Day 2006 (2006)                   | 10.0   | tt0872240 |
| Advent (IV) (2016)                               | 10.0   | tt6129028 |
| Cory in the House: All Star Edition (Video 2007) | 10.0   | tt2402070 |
| Light Study (2013)                               | 10.0   | tt3130306 |
| Future Boyfriend (2016)                          | 10.0   | tt3955652 |
| Toxic Temptation (2016)                          | 9.9    | tt4621978 |
| The Mountain II (2016)                           | 9.9    | tt5813916 |
| 1985 (2016)                                      | 9.9    | tt5932546 |
| Nocturne in Black (2016)                         | 9.8    | tt4536608 |
| My Head Hurts (2000)                             | 9.8    | tt1346290 |
+--------------------------------------------------+--------+-----------+
10 rows in set, 65535 warnings (0.89 sec)

Alternatively, you can use: cast(json_value(json_column, '$.imdb_rating') as float)

So not only can you significantly speed up performance by indexing generated columns, you can also ensure that you are getting consistent types and the expected results. The documentation has a detailed section on ordering and grouping by JSON values. Generated columns also improve most of the other queries we showed above.

Functional Indexes Without Generated Columns

Generated columns work well enough, but for our queries (in MySQL 8.0.13 or later), we can create indexes directly on the JSON functions we call regularly and forgo the generated columns altogether.

mysql>  create index title_idx on movies_json ((json_value(json_column,'$.title')));
Query OK, 0 rows affected (2.90 sec)
Records: 0  Duplicates: 0  Warnings: 0

This works well enough if you are matching an exact value; if you need a LIKE or wildcard search, the functional index won’t be used. As you explore functional indexes with JSON, be mindful that there are character set and collation restrictions, as well as restrictions on which functions can be used, so your mileage will vary. I will avoid a deep dive here, but you can review the documentation at https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts and https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html#json-column-indirect-index. In my testing, the more complicated or nested the JSON was, the more problems I ran into. For this reason, I stuck with generated columns for my examples above.
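One caveat worth showing: for the optimizer to even consider a functional index, the expression in the WHERE clause must match the indexed expression exactly. A quick way to check is to compare EXPLAIN plans; this is a sketch using the title_idx created above, and I am not reproducing the plan output here:

```sql
-- Exact match on the identical indexed expression: eligible to use title_idx
explain select json_column->>'$.imdb_rating'
from movies_json
where json_value(json_column, '$.title') = 'Avengers: Endgame (2019)';

-- Wildcard search on the same expression: in my testing this fell back
-- to a full table scan
explain select json_column->>'$.imdb_rating'
from movies_json
where json_value(json_column, '$.title') like 'Avengers%';
```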

Multi-Valued Indexes

As of 8.0.17, MySQL also supports multi-valued indexes: a secondary index that allows you to index an array.  This is helpful if you have an array within your JSON (it does not appear to support arrays of characters at this time; attempting it returned the helpful “This version of MySQL doesn’t yet support ‘CAST-ing data to array of char/binary BLOBs’”).  This can help with straightforward JSON designs, but as the JSON became more nested and complex, I ran into problems.
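For completeness, here is the documented shape of a multi-valued index. Note this is a hypothetical sketch: this dataset has no numeric array, so $.genre_ids is an invented field used only to illustrate the CAST … ARRAY syntax and the MEMBER OF lookup that can use such an index:

```sql
-- Hypothetical numeric array field $.genre_ids
alter table movies_json
  add index genre_ids_idx ((cast(json_column->'$.genre_ids' as unsigned array)));

-- MEMBER OF (as well as JSON_CONTAINS/JSON_OVERLAPS) can use the
-- multi-valued index
select json_column->>'$.title'
from movies_json
where 12 member of (json_column->'$.genre_ids');
```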

Performance Summary

This is super unscientific, but these query run times held over multiple iterations. Look at the differences speed-wise:

Query                                                                | Access JSON directly (s) | Generated column (s)
---------------------------------------------------------------------+--------------------------+---------------------
Simple search via IMDB ID                                            | 0.75                     | 0.01
Search for Avengers-titled movies                                    | 0.76                     | 0.01
Updating a single value within the JSON, searching via IMDB or title | 0.80                     | 0.01
Find top 10 movies of all time                                       | 0.89                     | 0.01
Characters played by Robert Downey Jr. in the Avengers movies        | 0.74                     | 0.01

More Performance Needed? Normalizing Data for Query Patterns

So far, we have done a lot of simple interactions and were able to speed up access to an acceptable level.  But not everything fits within the available toolbox.  When searching for movies or ratings for a specific cast member (show me all the Avengers movies Robert Downey Jr. played in and the characters), we used an index on the generated title column to reduce the number of JSON documents we had to fully process to get the character he played.  See below:

mysql> select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns( 
    ->    V_name varchar(200) path '$.name',
    ->    V_character varchar(200) path '$.character')
    ->    ) t where t.V_name like 'Robert Downey Jr.%'  and title like 'Avengers%';
+--------------------------------+-------------+-------------------+------------------------------------+
| title                          | imdb_rating | V_name            | V_character                        |
+--------------------------------+-------------+-------------------+------------------------------------+
| Avengers: Age of Ultron (2015) |        7.50 | Robert Downey Jr. | Tony Stark                         |
| Avengers: Endgame (2019)       |        9.00 | Robert Downey Jr. | Tony Stark /              Iron Man |
| Avengers: Infinity War (2018)  |        NULL | Robert Downey Jr. | Tony Stark /              Iron Man |
+--------------------------------+-------------+-------------------+------------------------------------+
3 rows in set (0.00 sec)

mysql> explain select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns(     V_name varchar(200) path '$.name',    V_character varchar(200) path '$.character')    ) t where t.V_name like 'Robert Downey Jr.%'  and title like 'Avengers%';
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key       | key_len | ref  | rows | filtered | Extra                                                    |
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | movies_json_generated | NULL       | range | title_idx     | title_idx | 1023    | NULL |    8 |   100.00 | Using where                                              |
|  1 | SIMPLE      | t                     | NULL       | ALL   | NULL          | NULL      | NULL    | NULL |    2 |    50.00 | Table function: json_table; Using temporary; Using where |
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

But let’s assume you needed to get a list of all the characters he played in his career (I will truncate the full result set).

mysql> select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns( 
    ->    V_name varchar(200) path '$.name',
    ->    V_character varchar(200) path '$.character')
    ->    ) t where t.V_name like 'Robert Downey Jr.%';


+-------------------------------------------------------------------------------------------------------------+-------------+-------------------+------------------------------------------------------------------------------------------------+
| title                                                                                                       | imdb_rating | V_name            | V_character                                                                                    |
+-------------------------------------------------------------------------------------------------------------+-------------+-------------------+------------------------------------------------------------------------------------------------+
| The 65th Annual Academy Awards (1993)                                                                       |        NULL | Robert Downey Jr. | Himself - Presenter                                                                            |
| Sherlock Holmes: A Game of Shadows (2011)                                                                   |        7.50 | Robert Downey Jr. | Sherlock Holmes                                                                                |
| Due Date (2010)                                                                                             |        6.60 | Robert Downey Jr. | Peter Highman                                                                                  |
| Eros (2004)                                                                                                 |        6.00 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")                                                           |
| The EE British Academy Film Awards (2015)                                                                   |        7.40 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough                                                         |
| "Saturday Night Live" John Lithgow/Mr. Mister (TV Episode 1985)                                             |        NULL | Robert Downey Jr. | Bruce Winston /              Rudy Randolph III /              Various       (as Robert Downey) |
+-------------------------------------------------------------------------------------------------------------+-------------+-------------------+------------------------------------------------------------------------------------------------+
213 rows in set (7.14 sec)



mysql> explain select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns(     V_name varchar(200) path '$.name',    V_character varchar(200) path '$.character')    ) t where t.V_name like 'Robert Downey Jr.%';
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
| id | select_type | table                 | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                                                    |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | movies_json_generated | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 358174 |   100.00 | NULL                                                     |
|  1 | SIMPLE      | t                     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |      2 |    50.00 | Table function: json_table; Using temporary; Using where |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

Now our access pattern needs to start at a deeper nested element within the JSON document.  Here you are faced with a few options, but almost all of them lead to creating and maintaining a new table to get at that level of data, whether via a trigger, code that breaks out the data as it is loaded, or a batch job (or caching this evil slow query).

I ended up creating the following “Database Schema” for it:

create table movies_normalized_meta (
        ai_myid int AUTO_INCREMENT primary key,
        imdb_id varchar(32),
        title varchar(255),
        imdb_rating decimal(5,2),
        json_column json
) engine = innodb;

create unique index imdb_id_idx  on movies_normalized_meta (imdb_id);
create index rating_idx  on movies_normalized_meta (imdb_rating);

create table movies_normalized_actors (
        ai_actor_id int auto_increment primary key,
        actor_id varchar(50),
        actor_name varchar(500)
        ) engine = innodb;
create index actor_id_idx  on movies_normalized_actors (actor_id);
create index actor_name_idx  on movies_normalized_actors (actor_name);

create table movies_normalized_cast (
        ai_actor_id int,
        ai_myid int,
        actor_character varchar(500)
        ) engine = innodb;

create index cast_id_idx  on movies_normalized_cast (ai_actor_id,ai_myid);
create index cast_id2_idx  on movies_normalized_cast (ai_myid);
create index cast_character_idx  on movies_normalized_cast (actor_character);
create unique index u_cast_idx  on movies_normalized_cast (ai_myid,ai_actor_id,actor_character);

On loading the JSON into MySQL, I added an actor table that has a row for each new unique actor, as well as a cast table that records the movie, the actor, and the name of the character they played in the movie (note I could optimize the structure a bit, but that’s for another day).  This gives me a ton of flexibility in reporting and a major performance boost.  Now, to get all the movies Robert Downey Jr. was in and the characters he played, I can do it via:

mysql> select title, imdb_rating, actor_character from movies_normalized_meta a, movies_normalized_cast b,  movies_normalized_actors c where a.ai_myid=b.ai_myid and b.ai_actor_id = c.ai_actor_id and actor_name='Robert Downey Jr.';


+-------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------+
| title                                                                                                       | imdb_rating | actor_character                                                                                |
+-------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------+
| The 65th Annual Academy Awards (1993)                                                                       |        NULL | Himself - Presenter                                                                            |
| Sherlock Holmes: A Game of Shadows (2011)                                                                   |        7.50 | Sherlock Holmes                                                                                |
| Due Date (2010)                                                                                             |        6.60 | Peter Highman                                                                                  |
| Eros (2004)                                                                                                 |        6.00 | Nick Penrose (segment "Equilibrium")                                                           |
| Saturday Night Live in the '80s: Lost & Found (2005)                                                        |        7.20 | Various       (archive footage)                                                                |
| "Saturday Night Live" John Lithgow/Mr. Mister (TV Episode 1985)                                             |        NULL | Bruce Winston /              Rudy Randolph III /              Various       (as Robert Downey) |
+-------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------+
213 rows in set (0.01 sec)
mysql> explain select title, imdb_rating, actor_character from movies_normalized_meta a, movies_normalized_cast b,  movies_normalized_actors c where a.ai_myid=b.ai_myid and b.ai_actor_id = c.ai_actor_id and actor_name='Robert Downey Jr.';
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
| id | select_type | table | partitions | type   | possible_keys                       | key            | key_len | ref                           | rows | filtered | Extra                 |
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
|  1 | SIMPLE      | c     | NULL       | ref    | PRIMARY,actor_name_idx              | actor_name_idx | 2003    | const                         |  213 |   100.00 | Using index           |
|  1 | SIMPLE      | b     | NULL       | ref    | u_cast_idx,cast_id_idx,cast_id2_idx | cast_id_idx    | 5       | movie_json_test.c.ai_actor_id |    2 |   100.00 | Using index condition |
|  1 | SIMPLE      | a     | NULL       | eq_ref | PRIMARY                             | PRIMARY        | 4       | movie_json_test.b.ai_myid     |    1 |   100.00 | NULL                  |
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
3 rows in set, 1 warning (0.00 sec)

Not only is this 7 seconds faster than accessing the same data via the JSON functions, I can also use these tables to easily do things like find all the movies that had two or three specific actors together, play six degrees of separation, and so on, which would be a challenge if we only accessed the JSON directly.
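The load-time breakout described above can be sketched as a small helper that splits one movie document into rows for the three tables. This is a sketch only: the actor rows are deduplicated by their IMDB actor id, and the real loader would still need to translate the imdb_id/actor id values into the auto-increment ids used by the schema.

```python
import json

def extract_normalized_rows(json_text):
    # Split one movie document into rows for the meta, actors, and cast tables.
    doc = json.loads(json_text)
    meta_row = (doc["imdb_id"], doc["title"], doc.get("imdb_rating"))
    actor_rows = {}  # actor imdb id -> name (deduplicated)
    cast_rows = []   # (actor imdb id, movie imdb id, character)
    for member in doc.get("cast") or []:
        actor_rows[member["id"]] = member["name"]
        cast_rows.append((member["id"], doc["imdb_id"], member.get("character")))
    return meta_row, actor_rows, cast_rows
```

Each returned tuple maps directly onto an INSERT into movies_normalized_meta, movies_normalized_actors, and movies_normalized_cast, respectively.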

Take-Aways and Lessons Learned

First, think before you store your data.  Understanding what you are storing, why you are storing it, and how you will access it is paramount. How you will access and use your data has a profound impact on the optimal database setup, design, and usage.  If you only access top-level properties in your document, a simple design relying on MySQL’s built-in JSON functions may be totally OK.  But if you want to dive deeper into the data and start pulling out subsets of data or correlating different documents, then to ensure performance and scalability you may end up reducing your flexibility and building (hold your breath) an actual database schema that is normalized and everything.

Second, MySQL’s JSON functions get the job done for most basic use cases.  You can get what you need, but you need to understand the implementation, and what is available and what is not.  As I repeat these same processes with PostgreSQL and MongoDB, you will see where some functions are better than others.

Finally,  don’t fear schemas!  Seriously, structure is good.  No matter how you access your data and use it, you will assume some structure (it all depends on where you want to enforce that structure).  

Next up in this series, I will dive into JSON with PostgreSQL followed by JSON and MongoDB. Stay tuned!

Aug
19
2021
--

Storing JSON in Your Databases: Tips and Tricks For MySQL Part One

MySQL JSON Databases

Database architecture and design are becoming an increasingly lost art. With new technologies and the push toward faster development cycles, people continue to take shortcuts, often to the detriment of long-term performance, scalability, and security. Designing how your application stores, accesses, and processes data is so fundamentally important that it can not be overlooked. I want people to understand that early design choices can have a profound impact on their applications. To that end, I will be exploring database design principles and practices over the next several months. I am starting with every developer’s favorite data format: JSON!

It seems that almost every database over the last few years has introduced various degrees of support for storing and interacting with JSON objects directly. While these features are designed to help application developers write code faster, the implementations tend to vary wildly and can cause some, well, weirdness. Over the next few weeks/months, I will show you some methods, mistakes, and common ways developers store JSON. Just because you can use a database’s native JSON support does not always mean you should! I hope to show you which approaches work best for which use cases.

For part one of this series, I am going to focus on MySQL. MySQL’s implementation of the JSON data type was introduced back in 5.7 (late 2015/early 2016 timeframe). Since then, a few minor enhancements have made the implementation a bit more livable. The current iteration, MySQL 8, offers a fully functional implementation of JSON functions and features. Let me show you some examples of how to store and interact with your JSON documents within MySQL.

Setup

For all my tests, I wanted a reasonable amount of data to test the performance implications of certain functions.  I opted to use the metadata JSON from http://movienet.site/, about 2.3GB of individual JSON files (one per movie).

I wrote a small python script to load and iterate through the JSON files and load them into MySQL.
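That script boiled down to a loop like the following sketch. The table name matches the schema defined below; the cursor is assumed to be a DB-API cursor from a driver such as mysql-connector-python, and the path pattern is hypothetical:

```python
import glob
import json

def movie_params(json_text):
    # Pull the imdb_id out of the document so it can live in its own
    # indexed column alongside the raw JSON.
    doc = json.loads(json_text)
    return (doc["imdb_id"], json_text)

def load_movies(cursor, path_pattern):
    # Iterate over the per-movie JSON files and insert each one.
    for path in sorted(glob.glob(path_pattern)):
        with open(path, encoding="utf-8") as f:
            cursor.execute(
                "insert into movies_json (imdb_id, json_column) values (%s, %s)",
                movie_params(f.read()),
            )
```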


I will walk through the examples and show you how I have seen many developers use MySQL to interact with JSON and point out why some of them may be incorrect or cause you issues you may not be aware of.  I will also show you a few other features you may want to look into and explore and offer some design advice.  Let us start with the following simple table definition:

create table movies_json (
   ai_myid int AUTO_INCREMENT primary key,
   imdb_id varchar(255),
   json_column json
) engine = innodb;

create unique index imdb_idx on movies_json(imdb_id);

 

{
  "imdb_id": "tt8408760",
  "tmdb_id": null,
  "douban_id": null,
  "title": "Rubes (2019)",
  "genres": [
    "Short",
    "Comedy",
    "Horror"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "7 min",
      "description": ""
    }
  ],
  "imdb_rating": null,
  "director": [
    {
      "id": "nm3216042",
      "name": "Nathan Alan Bunker"
    }
  ],
  "writer": null,
  "cast": [
    {
      "id": "nm1899908",
      "name": "Brendan Jennings",
      "character": "Milton"
    },
    {
      "id": "nm2384265",
      "name": "Ben Begley",
      "character": "Paul"
    },
    {
      "id": "nm2287013",
      "name": "Jerry Marr",
      "character": "Professor Henson"
    },
    {
      "id": "nm7529700",
      "name": "Allene Prince",
      "character": "Margaret"
    }
  ],
  "overview": null,
  "storyline": "Two disgruntled teachers use a Rube Goldberg machine to exact revenge on the people who have wronged them.",
  "plot": null,
  "synopsis": null
}

Above, you can see an example of the JSON format for a single movie.

101: Simple JSON Interactions in MySQL

Yes, a single column in a table with a key or two.  Each row stores one of the movies from the downloaded JSON files.  There is an auto_increment key and the IMDB ID, which I extracted from the JSON during the load. This is a straightforward design requiring minimal effort.  However, this design also means you generally rely on MySQL merely as storage for your data.  Provided you are accessing everything by the imdb_id key, you can fetch and update your JSON from your application easily with:

select json_column from movies_json where imdb_id = 'tt4154796';

update movies_json set json_column = '<new JSON>' where imdb_id = 'tt4154796';

Eventually, however, you will want to search within your JSON or return just a portion of the JSON document. For example, let’s say you only want to find the title and IMDB rating for a specified movie. You can do this with built-in functionality:

mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt2395427';
+--------------------------------+-------------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' |
+--------------------------------+-------------------------------+
| Avengers: Age of Ultron (2015) | 7.5                           |
+--------------------------------+-------------------------------+
1 row in set (0.77 sec)


mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt4154796';
+--------------------------+-------------------------------+
| json_column->>'$.title'  | json_column->>'$.imdb_rating' |
+--------------------------+-------------------------------+
| Avengers: Endgame (2019) | null                          |
+--------------------------+-------------------------------+
1 row in set (0.75 sec)

Here you can see that we can interact with data inside the JSON column just as we would with standard data via SQL, using the special syntax ->>'$.key'.   Notice that Avengers: Endgame has a rating of null!  That is no good; it was a much better movie than that.  Instead of updating and storing the entire JSON document again, MySQL provides the JSON_SET function to set an element within a document.

mysql>  update movies_json 
       set json_column = JSON_SET(json_column, "$.imdb_rating", 9) 
       where json_column->>'$.imdb_id'='tt4154796';
 Query OK, 1 row affected (0.93 sec)
 Rows matched: 1  Changed: 1  Warnings: 0

 mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt4154796';
 +--------------------------+-------------------------------+
 | json_column->>'$.title'  | json_column->>'$.imdb_rating' |
 +--------------------------+-------------------------------+
 | Avengers: Endgame (2019) | 9                             |
 +--------------------------+-------------------------------+
 1 row in set (0.80 sec)

We have now fixed the missing rating for Endgame!  But we may not know the IMDB ID when we are searching. Just as with standard data types, you can use data from within your document in a where clause.  In this case, we will look for all movies whose title starts with “Avengers”.

mysql> select json_column->>'$.title', json_column->>'$.imdb_rating', json_column->>'$.imdb_id' from movies_json 
where json_column->>'$.title' like 'Avengers%';

+------------------------------------------------------------+-------------------------------+---------------------------+
| json_column->>'$.title'                                    | json_column->>'$.imdb_rating' | json_column->>'$.imdb_id' |
+------------------------------------------------------------+-------------------------------+---------------------------+
| Avengers: Endgame (2019)                                   | 9.0                           | tt4154796                 |
| Avengers Confidential: Black Widow & Punisher (Video 2014) | 5.8                           | tt3482378                 |
| Avengers of Justice: Farce Wars (2018)                     | null                          | tt6172666                 |
| Avengers: Age of Ultron (2015)                             | 7.5                           | tt2395427                 |
| Avengers: Infinity War (2018)                              | null                          | tt4154756                 |
| Avengers Grimm: Time Wars (Video 2018)                     | null                          | tt8159584                 |
| Avengers Assemble! (TV Series 2010– )                      | null                          | tt1779952                 |
| Avengers Grimm (Video 2015)                                | 2.8                           | tt4296026                 |
+------------------------------------------------------------+-------------------------------+---------------------------+
8 rows in set (0.74 sec)

Using json_column->>'$.title' in the where clause got us a nice list of Avengers-titled movies and TV shows.  But, as you can see from this query, we got more than just the blockbuster Avengers movies.  Let’s say you want to refine this a bit more and find just the Avengers movies with Robert Downey Jr. in the cast.  This is a bit more difficult, honestly, because the format of our JSON documents uses an array for cast members.

Here is what the JSON looks like:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"
    },
<snip>
You can access arrays in a JSON document by referencing the specific index of the element you want (i.e. [0].name); however, if you don’t know which element contains the data you are looking for, you need to search for it.  MySQL has the function json_search to help with this (there are other functions, such as json_contains, as well).  json_search looks for a provided value and returns its location if found, or null if not:

mysql> select json_column->>'$.title', 
json_column->>'$.imdb_rating', 
json_column->>'$.imdb_id' from movies_json  
where json_column->>'$.title' like 'Avengers%' and json_search(json_column->>'$.cast', 'one','Robert Downey Jr.', NULL,'$[*].name' ) is not null;

+--------------------------------+-------------------------------+---------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' | json_column->>'$.imdb_id' |
+--------------------------------+-------------------------------+---------------------------+
| Avengers: Endgame (2019)       | 9                             | tt4154796                 |
| Avengers: Age of Ultron (2015) | 7.5                           | tt2395427                 |
| Avengers: Infinity War (2018)  | null                          | tt4154756                 |
+--------------------------------+-------------------------------+---------------------------+
3 rows in set (0.79 sec)

You will notice that I used the parameter ‘one’, which returns the first match.  You can also use ‘all’ to return every match.  In case you are curious about what json_search actually returns, here is the output:

mysql>  select json_column->>'$.title' as title,
        json_search(json_column->>'$.cast', 'one','Robert Downey Jr.') as search_output
    from movies_json  where json_column->>'$.title' like 'Avengers%' 
and json_search(json_column->>'$.cast', 'one','Robert Downey Jr.', NULL,'$[*].name' ) is not null;

+--------------------------------+---------------+
| title                          | search_output |
+--------------------------------+---------------+
| Avengers: Endgame (2019)       | "$[0].name"   |
| Avengers: Age of Ultron (2015) | "$[0].name"   |
| Avengers: Infinity War (2018)  | "$[0].name"   |
+--------------------------------+---------------+
3 rows in set (0.72 sec)

You can see it returns the position and the property that contains the value.  This output is useful for a variety of reasons.  One is if you need to find which index value contains that particular text.  In the example of searching for Robert Downey Jr. movies, we can use this index information to return the character he played in each movie.  The first way I have seen this done requires a bit of unholy wrangling, but:

mysql> select json_column->>'$.title' as title, 
    json_column->>'$.imdb_rating' as Rating, 
    json_column->>'$.imdb_id' as IMDB_ID,
json_extract(json_column->>'$.cast',concat(substr(json_unquote(json_search(json_column->>'$.cast', 'one','Robert Downey Jr.')),1,
    -> locate('.',json_unquote(json_search(json_column->>'$.cast', 'one','Robert Downey Jr.')))),'character')) as Char_played
    from movies_json  where json_column->>'$.title' like 'Avengers%' and json_search(json_column->>'$.cast', 'one','Robert Downey Jr.') is not null;


+--------------------------------+--------+-----------+--------------------------------------+
| title                          | Rating | IMDB_ID   | Char_played                          |
+--------------------------------+--------+-----------+--------------------------------------+
| Avengers: Endgame (2019)       | 9      | tt4154796 | "Tony Stark /              Iron Man" |
| Avengers: Age of Ultron (2015) | 7.5    | tt2395427 | "Tony Stark"                         |
| Avengers: Infinity War (2018)  | null   | tt4154756 | "Tony Stark /              Iron Man" |
+--------------------------------+--------+-----------+--------------------------------------+
3 rows in set (0.68 sec)

Here I am finding where in the document Robert Downey Jr. is listed, then extracting the index and using that with the json_extract function to pull out the value of “$[0].character” instead of “$[0].name”.  While this works, it is ugly.  MySQL provides an alternative by way of json_table.

mysql> select json_column->>'$.title',  json_column->>'$.imdb_rating', t.* from movies_json, json_table(json_column, '$.cast[*]' columns( 
       V_name varchar(200) path '$.name',
       V_character varchar(200) path '$.character')
       ) t where t.V_name like 'Robert Downey Jr.%' and json_column->>'$.title' like 'Avengers%';


+--------------------------------+-------------------------------+-------------------+------------------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' | V_name            | V_character                        |
+--------------------------------+-------------------------------+-------------------+------------------------------------+
| Avengers: Endgame (2019)       | 9                             | Robert Downey Jr. | Tony Stark /              Iron Man |
| Avengers: Age of Ultron (2015) | 7.5                           | Robert Downey Jr. | Tony Stark                         |
| Avengers: Infinity War (2018)  | null                          | Robert Downey Jr. | Tony Stark /              Iron Man |
+--------------------------------+-------------------------------+-------------------+------------------------------------+
3 rows in set (0.74 sec)

Basically, json_table takes an array and turns it into a table object, allowing you to join and query it.  You can also use this to list all the characters an actor played in any film in his career. Stay tuned for part two of this series, when we will show you some easier and faster ways to use JSON from inside MySQL.
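Before moving on: to make the json_table mechanics concrete, here is a minimal Python sketch (the data below is illustrative, modeled on the movie documents above, not pulled from the dataset) of what json_table does conceptually: flattening an array of objects into rows that can then be filtered and joined.

```python
import json

# A trimmed movie document shaped like the ones in this dataset (illustrative data)
doc = json.loads("""
{
  "title": "Avengers: Age of Ultron (2015)",
  "imdb_rating": 7.5,
  "cast": [
    {"id": "nm0000375", "name": "Robert Downey Jr.", "character": "Tony Stark"},
    {"id": "nm1165110", "name": "Chris Hemsworth", "character": "Thor"}
  ]
}
""")

def json_table(document, array_key, columns):
    """Flatten an array of objects into rows, the way MySQL's json_table exposes it to SQL."""
    rows = []
    for element in document.get(array_key, []):
        rows.append(tuple(element.get(col) for col in columns))
    return rows

rows = json_table(doc, "cast", ("name", "character"))
# Each row can now be filtered (WHERE) or joined like a regular table
downey = [r for r in rows if r[0].startswith("Robert Downey Jr.")]
print(downey)  # [('Robert Downey Jr.', 'Tony Stark')]
```

Once the array is rows, the character lookup that needed the locate/substr wrangling above becomes a simple filter.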

Dec
02
2020
--

Fylamynt raises $6.5M for its cloud workflow automation platform

Fylamynt, a new service that helps businesses automate their cloud workflows, today announced both the official launch of its platform as well as a $6.5 million seed round. The funding round was led by Google’s AI-focused Gradient Ventures fund. Mango Capital and Point72 Ventures also participated.

At first glance, the idea behind Fylamynt may sound familiar. Workflow automation has become a pretty competitive space, after all, and the service helps developers connect their various cloud tools to create repeatable workflows. We’re not talking about your standard IFTTT- or Zapier -like integrations between SaaS products, though. The focus of Fylamynt is squarely on building infrastructure workflows. While that may sound familiar, too, with tools like Ansible and Terraform automating a lot of that already, Fylamynt sits on top of those and integrates with them.

Image Credits: Fylamynt

“Some time ago, we used to do Bash and scripting — and then [ … ] came Chef and Puppet in 2006, 2007. SaltStack, as well. Then Terraform and Ansible,” Fylamynt co-founder and CEO Pradeep Padala told me. “They have all done an extremely good job of making it easier to simplify infrastructure operations so you don’t have to write low-level code. You can write a slightly higher-level language. We are not replacing that. What we are doing is connecting that code.”

So if you have a Terraform template, an Ansible playbook and maybe a Python script, you can now use Fylamynt to connect those. In the end, Fylamynt becomes the orchestration engine to run all of your infrastructure code — and then allows you to connect all of that to the likes of DataDog, Splunk, PagerDuty, Slack and ServiceNow.

Image Credits: Fylamynt

The service currently connects to Terraform, Ansible, Datadog, Jira, Slack, Instance, CloudWatch, CloudFormation and your Kubernetes clusters. The company notes that some of the standard use cases for its service are automated remediation, governance and compliance, as well as cost and performance management.

The company is already working with a number of design partners, including Snowflake.

Fylamynt CEO Padala has quite a bit of experience in the infrastructure space. He co-founded ContainerX, an early container-management platform, which later sold to Cisco. Before starting ContainerX, he was at VMware and DOCOMO Labs. His co-founders, VP of Engineering Xiaoyun Zhu and CTO David Lee, also have deep expertise in building out cloud infrastructure and operating it.

“If you look at any company — any company building a product — let’s say a SaaS product, and they want to run their operations, infrastructure operations very efficiently,” Padala said. “But there are always challenges. You need a lot of people, it takes time. So what is the bottleneck? If you ask that question and dig deeper, you’ll find that there is one bottleneck for automation: that’s code. Someone has to write code to automate. Everything revolves around that.”

Fylamynt aims to take the effort out of that by allowing developers to either write Python and JSON to automate their workflows (think “infrastructure as code” but for workflows) or to use Fylamynt’s visual no-code drag-and-drop tool. As Padala noted, this gives developers a lot of flexibility in how they want to use the service. If you never want to see the Fylamynt UI, you can go about your merry coding ways, but chances are the UI will allow you to get everything done as well.

One area the team is currently focusing on — and will use the new funding for — is building out its analytics capabilities that can help developers debug their workflows. The service already provides log and audit trails, but the plan is to expand its AI capabilities to also recommend the right workflows based on the alerts you are getting.

“The eventual goal is to help people automate any service and connect any code. That’s the holy grail. And AI is an enabler in that,” Padala said.

Gradient Ventures partner Muzzammil “MZ” Zaveri echoed this. “Fylamynt is at the intersection of applied AI and workflow automation,” he said. “We’re excited to support the Fylamynt team in this uniquely positioned product with a deep bench of integrations and a nonprescriptive builder approach. The vision of automating every part of a cloud workflow is just the beginning.”

The team, which now includes about 20 employees, plans to use the new round of funding, which closed in September, to focus on its R&D, build out its product and expand its go-to-market team. On the product side, that specifically means building more connectors.

The company offers both a free plan as well as enterprise pricing and its platform is now generally available.

Jun
18
2019
--

MongoDB gets a data lake, new security features and more

MongoDB is hosting its developer conference today and, unsurprisingly, the company has quite a few announcements to make. Some are straightforward, like the launch of MongoDB 4.2 with some important new security features, while others, like the launch of the company’s Atlas Data Lake, point the company beyond its core database product.

“Our new offerings radically expand the ways developers can use MongoDB to better work with data,” said Dev Ittycheria, the CEO and president of MongoDB. “We strive to help developers be more productive and remove infrastructure headaches — with additional features along with adjunct capabilities like full-text search and data lake. IDC predicts that by 2025 global data will reach 175 Zettabytes and 49% of it will reside in the public cloud. It’s our mission to give developers better ways to work with data wherever it resides, including in public and private clouds.”

The highlight of today’s set of announcements is probably the launch of MongoDB Atlas Data Lake. Atlas Data Lake allows users to query data stored on AWS S3 using the MongoDB Query Language, no matter the format, including JSON, BSON, CSV, TSV, Parquet and Avro. To get started, users only need to point the service at their existing S3 buckets. They don’t have to manage servers or other infrastructure. Support for Data Lake on Google Cloud Storage and Azure Storage is in the works and will launch in the future.

Also new is Full-Text Search, which gives users access to advanced text search features based on the open-source Apache Lucene 8.

In addition, MongoDB is also now starting to bring together Realm, the mobile database product it acquired earlier this year, and the rest of its product lineup. Using the Realm brand, Mongo is merging its serverless platform, MongoDB Stitch, and Realm’s mobile database and synchronization platform. Realm’s synchronization protocol will now connect to MongoDB Atlas’ cloud database, while Realm Sync will allow developers to bring this data to their applications. 

“By combining Realm’s wildly popular mobile database and synchronization platform with the strengths of Stitch, we will eliminate a lot of work for developers by making it natural and easy to work with data at every layer of the stack, and to seamlessly move data between devices at the edge to the core backend,”  explained Eliot Horowitz, CTO and co-founder of MongoDB.

As for the latest release of MongoDB, the highlight of the release is a set of new security features. With this release, Mongo is implementing client-side Field Level Encryption. Traditionally, database security has always relied on server-side trust. This typically leaves the data accessible to administrators, even if they don’t have client access. If an attacker breaches the server, that’s almost automatically a catastrophic event.

With this new security model, Mongo is shifting access to the client and to the local drivers. It provides multiple encryption options; for developers to make use of this, they will use a new “encrypt” JSON schema attribute.

This ensures that all application code can generally run unmodified, and even the admins won’t get access to the database or its logs and backups unless they get client access rights themselves. Because the logic resides in the drivers, the encryption is also handled totally separate from the actual database.
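MongoDB’s documentation describes the exact schema format; purely as an illustrative sketch (the field name and key reference below are hypothetical placeholders, not taken from the announcement), a schema using the new attribute might look something like this:

```json
{
  "bsonType": "object",
  "properties": {
    "ssn": {
      "encrypt": {
        "bsonType": "string",
        "algorithm": "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic",
        "keyId": "/keyAltName"
      }
    }
  }
}
```

The driver consults this schema and encrypts the marked fields before the data ever leaves the client.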

Other new features in MongoDB 4.2 include support for distributed transactions and the ability to manage MongoDB deployments from a single Kubernetes control plane.

Apr
02
2019
--

How to handle dark data compliance risk at your company

Slack and other consumer-grade productivity tools have been taking off in workplaces large and small — and data governance hasn’t caught up.

Whether it’s litigation, compliance with regulations like GDPR or concerns about data breaches, legal teams need to account for new types of employee communication. And that’s hard when work is happening across the latest messaging apps and SaaS products, which make data searchability and accessibility more complex.

Here’s a quick look at the problem, followed by our suggestions for best practices at your company.

Problems

The increasing frequency of reported data breaches and expanding jurisdiction of new privacy laws are prompting conversations about dark data and risks at companies of all sizes, even small startups. Data risk discussions necessarily include the risk of a data breach, as well as preservation of data. Just two weeks ago it was reported that Jared Kushner used WhatsApp for official communications and screenshots of those messages for preservation, which commentators say complies with record keeping laws but raises questions about potential admissibility as evidence.

Dec
07
2018
--

MySQL 8 and The FRM Drop… How To Recover Table DDL

MySQL 8 frm drop recover ddl

… or what I should keep in mind in case of disaster

Retrieving and maintaining the definition of all tables in a database, in SQL format, is a best practice that we should all adopt. Keeping that under version control is another best practice to bear in mind.

While doing that may seem redundant, it can become a lifesaver in several situations: from the need to review what has historically changed in a table, to knowing who changed what and why… to when you need to recover your data and your beloved MySQL instance won’t start…

But let’s be honest, only a few of us do the right thing, and even fewer keep that information up to date. Given that’s the case, what can we do when we need to discover or recover a table structure?

From the beginning, MySQL has used some external files to describe its internal structure.

For instance, if I have a schema named windmills and a table named wmillAUTOINC1, on the file system I will see this:

-rw-r-----. 1 mysql mysql     8838 Mar 14 2018 wmillAUTOINC1.frm
-rw-r-----. 1 mysql mysql   131072 Mar 14 2018 wmillAUTOINC1.ibd

The ibd file contains the data, while the frm file contains the structure information.

Putting aside ANY discussion about whether this is safe, whether it’s transactional, and more… when we’ve experienced a major crash and data corruption, this approach has been helpful. Being able to read from the frm file was the easiest way to get the information we needed.
Simple tools like DBSake made the task quite trivial, and allowed us to script the table definition when we needed to run long, complex, tedious data recoveries:

[root@master1 windmills]# /opt/tools/dbsake frmdump wmillAUTOINC1.frm
--
-- Table structure for table `wmillAUTOINC1`
-- Created with MySQL Version 5.7.20
--
CREATE TABLE `wmillAUTOINC1` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `millid` smallint(6) NOT NULL,
  `kwatts_s` int(11) NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) COLLATE utf8_bin NOT NULL,
  `active` tinyint(2) NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC;

Of course, if the frm file was also corrupt, then we could try to get the information from the ibdata dictionary. If that is corrupted too (trust me I’ve seen all of these situations) … well a last resource was hoping the customer has a recent table definition stored somewhere, but as mentioned before, we are not so diligent, are we?

Now, though, in MySQL 8 we no longer have FRM files; they were dropped. Even more interesting, we no longer have the same dictionary: most of the things that we knew have changed, including the dictionary location. So what can be done?

Well, Oracle has moved the FRM information—and more—to what is called Serialized Dictionary Information (SDI). The SDI is written INSIDE the ibd file and represents a redundant copy of the information contained in the data dictionary.

The SDI is updated/modified by DDL operations on tables that reside in that tablespace. That is: if you normally have one file per table, that file will contain ONLY the SDI for that table; but if you have multiple tables in a tablespace, the SDI information will refer to ALL of them.

To extract this information from the IBD files, Oracle provides a utility called ibd2sdi. This application parses the SDI information and reports a JSON file that can be easily manipulated to extract and build the table definition.

One exception is partitioned tables. The SDI information is contained ONLY in the first partition, and if you drop that partition, the SDI moves to the next one. I will show that later.

But let’s see how it works. In the next examples I will look for the table’s name, attributes, and datatype starting from the dictionary tables.

To obtain the info I will do this:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/mysql.ibd |jq  '.[]?|.[]?|.dd_object?|("------------------------------------"?,"TABLE NAME = ",.name?,"****",(.columns?|.[]?|(.name?,.column_type_utf8?)))'

The result will be something like:

"------------------------------------"
"TABLE NAME = "
"tables"
"****"
"id"
"bigint(20) unsigned"
"schema_id"
"bigint(20) unsigned"
"name"
"varchar(64)"
"type"
"enum('BASE TABLE','VIEW','SYSTEM VIEW')"
"engine"
"varchar(64)"
"mysql_version_id"
"int(10) unsigned"
"row_format"
"enum('Fixed','Dynamic','Compressed','Redundant','Compact','Paged')"
"collation_id"
"bigint(20) unsigned"
"comment"
"varchar(2048)"
<snip>
"------------------------------------"
"TABLE NAME = "
"tablespaces"
"****"
"id"
"bigint(20) unsigned"
"name"
"varchar(259)"
"options"
"mediumtext"
"se_private_data"
"mediumtext"
"comment"
"varchar(2048)"
"engine"
"varchar(64)"
"DB_TRX_ID"
""
"DB_ROLL_PTR"
""

I cut the output for brevity, but if you run the above command yourself you’ll be able to see that this retrieves the information for ALL the tables residing in the IBD.

The other thing I hope you noticed is that I am NOT parsing ibdata, but mysql.ibd. Why? Because the dictionary was moved out of ibdata and now lives in mysql.ibd.

Look what happens if I try to parse ibdata:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/ibdata1 |jq '.'
[INFO] ibd2sdi: SDI is empty.

Be very careful here not to mess up your mysql.ibd file.

Now what can I do to get information about my wmillAUTOINC1 table in MySQL8?

That is quite simple:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINC.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1068,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINC",
        "mysql_version_id": 80011,
        "created": 20180925095853,
        "last_altered": 20180925095853,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [
          {
            "name": "id",
            "type": 9,
            "is_nullable": false,
            "is_zerofill": false,
            "is_unsigned": false,
            "is_auto_increment": true,
            "is_virtual": false,
            "hidden": 1,
            "ordinal_position": 1,
            "char_length": 11,
            "numeric_precision": 19,
            "numeric_scale": 0,
            "numeric_scale_null": false,
            "datetime_precision": 0,
            "datetime_precision_null": 1,
            "has_no_default": false,
            "default_value_null": false,
            "srs_id_null": true,
            "srs_id": 0,
            "default_value": "AAAAAAAAAAA=",
            "default_value_utf8_null": true,
            "default_value_utf8": "",
            "default_option": "",
            "update_option": "",
            "comment": "",
            "generation_expression": "",
            "generation_expression_utf8": "",
            "options": "interval_count=0;",
            "se_private_data": "table_id=1838;",
            "column_key": 2,
            "column_type_utf8": "bigint(11)",
            "elements": [],
            "collation_id": 83,
            "is_explicit_collation": false
          },
<SNIP>
        "indexes": [
          {
            "name": "PRIMARY",
            "hidden": false,
            "is_generated": false,
            "ordinal_position": 1,
            "comment": "",
            "options": "flags=0;",
            "se_private_data": "id=2261;root=4;space_id=775;table_id=1838;trx_id=6585972;",
            "type": 1,
            "algorithm": 2,
            "is_algorithm_explicit": false,
            "is_visible": true,
            "engine": "InnoDB",
<Snip>
        ],
        "foreign_keys": [],
        "partitions": [],
        "collation_id": 83
      }
    }
  },
  {
    "type": 2,
    "id": 780,
    "object": {
      "mysqld_version_id": 80011,
      "dd_version": 80011,
      "sdi_version": 1,
      "dd_object_type": "Tablespace",
      "dd_object": {
        "name": "windmills/wmillAUTOINC",
        "comment": "",
        "options": "",
        "se_private_data": "flags=16417;id=775;server_version=80011;space_version=1;",
        "engine": "InnoDB",
        "files": [
          {
            "ordinal_position": 1,
            "filename": "./windmills/wmillAUTOINC.ibd",
            "se_private_data": "id=775;"
          }
        ]
      }
    }
  }
]

The JSON contains:

  • A section describing the DB object at a high level
  • An array of columns and related information
  • An array of indexes
  • Partition information (not here, but in the next example)
  • Tablespace information

That is a lot more detail than we had in the FRM, and it is quite relevant and interesting information as well.

Once you have extracted the SDI, any JSON parser or script can generate the SQL DDL from it.
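As a minimal sketch of that idea, assuming the ibd2sdi JSON shape shown above (a real script would also need to handle indexes, collations, defaults, and hidden columns), a few lines of Python can walk the dd_object and emit a rough CREATE TABLE skeleton:

```python
import json

# Minimal excerpt of ibd2sdi output, shaped like the dump above (illustrative)
sdi = json.loads("""
["ibd2sdi",
 {"type": 1, "id": 1068,
  "object": {"dd_object_type": "Table",
             "dd_object": {"name": "wmillAUTOINC",
                           "columns": [
                             {"name": "id", "column_type_utf8": "bigint(11)", "is_nullable": false},
                             {"name": "uuid", "column_type_utf8": "char(36)", "is_nullable": false}
                           ]}}}]
""")

def rough_ddl(sdi_doc):
    """Build a rough CREATE TABLE skeleton from SDI JSON (column types only)."""
    for entry in sdi_doc:
        # The first array element is the literal string "ibd2sdi"; skip non-dicts
        if isinstance(entry, dict) and entry.get("object", {}).get("dd_object_type") == "Table":
            table = entry["object"]["dd_object"]
            cols = ",\\n".join(
                "  `%s` %s%s" % (c["name"], c["column_type_utf8"],
                                 "" if c.get("is_nullable") else " NOT NULL")
                for c in table["columns"])
            return "CREATE TABLE `%s` (\\n%s\\n);" % (table["name"], cols)

print(rough_ddl(sdi))
```

This is only a starting point, but it shows how directly the SDI maps back to a table definition.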

I mentioned partitions, so let’s look at them a bit more closely, given that they can be tricky.

As mentioned, the SDI information is present ONLY in the first partition; all other partitions hold ONLY the tablespace information. Given that, the first thing to do is to identify which partition is the first… OR simply try to access all partitions and, when you are able to get the details, extract them.

The process is the same:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170301.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1460,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINCPART",
        "mysql_version_id": 80013,
        "created": 20181125110300,
        "last_altered": 20181125110300,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [<snip>
    	  "schema_ref": "windmills",
        "se_private_id": 18446744073709552000,
        "engine": "InnoDB",
        "last_checked_for_upgrade_version_id": 80013,
        "comment": "",
        "se_private_data": "autoinc=31080;version=2;",
        "row_format": 2,
        "partition_type": 7,
        "partition_expression": "to_days(`date`)",
        "partition_expression_utf8": "to_days(`date`)",
        "default_partitioning": 1,
        "subpartition_type": 0,
        "subpartition_expression": "",
        "subpartition_expression_utf8": "",
        "default_subpartitioning": 0,
       ],
<snip>
        "foreign_keys": [],
        "partitions": [
          {
            "name": "PT20170301",
            "parent_partition_id": 18446744073709552000,
            "number": 0,
            "se_private_id": 1847,
            "description_utf8": "736754",
            "engine": "InnoDB",
            "comment": "",
            "options": "",
            "se_private_data": "autoinc=0;version=0;",
            "values": [
              {
                "max_value": false,
                "null_value": false,
                "list_num": 0,
                "column_num": 0,
                "value_utf8": "736754"
              }
            ],

The difference, as you can see, is that the section related to partitions and sub partitions will be filled with all the details you might need to recreate the partitions.

We will have:

  • Partition type
  • Partition expression
  • Partition values
  • …more

Same for sub partitions.

Now again see what happens if I parse the second partition:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 2,
    "id": 790,
    "object": {
      "mysqld_version_id": 80011,
      "dd_version": 80011,
      "sdi_version": 1,
      "dd_object_type": "Tablespace",
      "dd_object": {
        "name": "windmills/wmillAUTOINCPART#P#PT20170401",
        "comment": "",
        "options": "",
        "se_private_data": "flags=16417;id=785;server_version=80011;space_version=1;",
        "engine": "InnoDB",
        "files": [
          {
            "ordinal_position": 1,
            "filename": "./windmills/wmillAUTOINCPART#P#PT20170401.ibd",
            "se_private_data": "id=785;"
          }
        ]
      }
    }
  }
]

I will get only the information about the tablespace, not the table.

As promised let me show you now what happens if I delete the first partition, and the second partition becomes the first:

(root@localhost) [windmills]>alter table wmillAUTOINCPART drop partition PT20170301;
Query OK, 0 rows affected (1.84 sec)
Records: 0  Duplicates: 0  Warnings: 0
[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd |jq '.'|more
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1461,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINCPART",
        "mysql_version_id": 80013,
        "created": 20181129130834,
        "last_altered": 20181129130834,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [
          {
            "name": "id",
            "type": 9,
            "is_nullable": false,
            "is_zerofill": false,
            "is_unsigned": false,
            "is_auto_increment": true,
            "is_virtual": false,
            "hidden": 1,
            "ordinal_position": 1,

As I mentioned before, each DDL operation updates the SDI, and here we go: I now have all the information on what is NOW the FIRST partition. Please compare the value of the “created” attribute from the first time I queried the other partition with the one I get now:

/opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170301.ibd
       "created": 20181125110300,
/opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd
       "created": 20181129130834,

To be clear, the second “created” timestamp (on PT20170401) is from NOW, i.e. from when I dropped the other partition (PT20170301).

Conclusions

In the end, this solution is definitely more powerful than the FRM files. It will allow us to parse the file and identify the table definition more easily, providing us with much more detail and information.

The problems will arise if and when the IBD file becomes corrupt.

As the manual states: “For InnoDB, an SDI record requires a single index page, which is 16KB in size by default. However, SDI data is compressed to reduce the storage footprint.”

This means that each table has its own page (one SDI record per table), which in turn means that in case of IBD corruption I should (likely) still be able to read those pages. Unless I have bad (very bad) luck.

I still wonder how the size of an IBD file affects SDI retrieval, but given that I have not tried it yet, I will have to let you know.

As an aside, I am working on a script to facilitate the generation of the SQL; it’s not yet ready, but you can find it here

Last note, but keep this in mind! It is stated in the manual, though in a hidden place and in small letters:
“DDL operations take longer due to writing to storage, undo logs, and redo logs instead of .frm files.”

References

https://stedolan.github.io/jq/

https://dev.mysql.com/doc/refman/8.0/en/ibd2sdi.html

https://dev.mysql.com/doc/refman/8.0/en/serialized-dictionary-information.html

https://dev.mysql.com/doc/refman/8.0/en/data-dictionary-limitations.html


Photo by chuttersnap on Unsplash

Nov
01
2018
--

How To Best Use Percona Server Column Compression With Dictionary

Database Compression

Very often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much data as possible in the buffer pool, and one of them is data compression.

With regular MySQL, to compress InnoDB data you can either use “Barracuda page compression” or “transparent page compression with punch holes”. The use of the ZFS filesystem is another possibility, but it is external to MySQL and doesn’t help with caching. All these solutions are transparent, but often they also have performance and management implications. If you are using Percona Server for MySQL, you have yet another option, “column compression with dictionary“. This feature is certainly not receiving the attention it merits. I think it is really cool—let me show you why.

We all know what compression means; who has not zipped a file before attaching it to an email? Compression removes redundancy from a file. What about the dictionary? A compression dictionary is a way to seed the compressor with expected patterns in order to improve the compression ratio. Because you can specify a dictionary, the scope of usefulness of the Percona Server for MySQL column compression feature is greatly increased. In the following sections, we'll review the impact of a good dictionary, and devise a way to create a good one without any guessing.
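The effect of seeding is easy to reproduce outside MySQL with Python's zlib, the same underlying library. The document and dictionary bytes below are illustrative assumptions shaped like the Jeopardy data used later in this post, not output from the Percona feature itself:

```python
import zlib

# Hypothetical JSON row, similar in shape to the Jeopardy data used below.
doc = (b'{"category": "HISTORY", "air_date": "2004-12-31", '
       b'"question": "...", "value": "$200", "answer": "...", '
       b'"round": "Jeopardy!", "show_number": "4680"}')

# Seed the compressor with the repetitive JSON keys as a dictionary.
zdict = (b'"category": "air_date": "question": '
         b'"value": "answer": "round": "show_number": ')

plain = zlib.compress(doc)

c = zlib.compressobj(zdict=zdict)
seeded = c.compress(doc) + c.flush()

# The dictionary lets the first occurrence of each key become a
# back-reference instead of literal bytes, so the seeded output is smaller.
print(len(doc), len(plain), len(seeded))
```

Decompression must be seeded with the exact same dictionary (`zlib.decompressobj(zdict=zdict)`), which is why the server stores the dictionary alongside the compressed columns.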

A simple use case

A compression algorithm needs a minimal amount of data in order to achieve a reasonable compression ratio. Typically, if the object is below a few hundred bytes, there is rarely enough data to contain repetitive patterns, and once the compression header is added, the compressed data can end up larger than the original.

mysql> select length('Hi!'), length(compress('Hi!'));
+---------------+-------------------------+
| length('Hi!') | length(compress('Hi!')) |
+---------------+-------------------------+
|             3 |                      15 |
+---------------+-------------------------+
1 row in set (0.02 sec)

Compressing a string of three bytes results in a binary object of 15 bytes. That's counterproductive.
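The same overhead can be seen with raw zlib; note that MySQL's COMPRESS() additionally prepends a 4-byte uncompressed-length prefix on top of the zlib framing, which is where the 15 bytes above come from:

```python
import zlib

payload = b'Hi!'
compressed = zlib.compress(payload)

# zlib adds a 2-byte header and a 4-byte checksum, so a tiny payload
# comes out larger than it went in. MySQL's COMPRESS() adds a further
# 4-byte length prefix on top of this framing.
print(len(payload), len(compressed))
```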

In order to illustrate the potential of the dictionary, I used this dataset:

http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz

It is a set of 100k Jeopardy questions written in JSON. To load the data in MySQL, I created the following table:

mysql> show create table TestColCompression\G
*************************** 1. row ***************************
Table: TestColCompression
Create Table: CREATE TABLE `TestColCompression` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question` text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=79977 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Then, I did some formatting to create insert statements:

zcat JEOPARDY_QUESTIONS1.json.gz | perl -p -e 's/\[\{/\{/g' | perl -p -e 's/\}, \{/\}\n\{/g' | perl -p -e "s/'/''/g" | \
  (while read line; do echo "insert into TestColCompression (question) values ('$line');"; done )
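As a hedged alternative to the perl pipeline, parsing the dump as actual JSON and letting a small helper do the quoting would likely avoid most escaping-related losses. The function name and the gzip usage in the comment are my own sketch, not from the original workflow:

```python
import json

def to_insert(row, table='TestColCompression', column='question'):
    """Render one parsed JSON object as a SQL INSERT statement,
    doubling single quotes so the literal is valid SQL."""
    doc = json.dumps(row).replace("'", "''")
    return f"insert into {table} ({column}) values ('{doc}');"

# With the real dump, the statements could be produced with:
#   import gzip
#   rows = json.load(gzip.open('JEOPARDY_QUESTIONS1.json.gz', 'rt'))
#   statements = [to_insert(r) for r in rows]
example = {"category": "HISTORY", "answer": "Copernicus' theory"}
print(to_insert(example))
```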

And I executed the inserts. About 20% of the rows had some formatting issues but nevertheless, I ended up with close to 80k rows:

mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78110
Avg_row_length: 316
Data_length: 24690688
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:16:41
Update_time: 2018-10-26 15:40:34
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

The average row length is 316 bytes for a total data size of 23.55MB. The question JSON objects are large enough to matter, but barely large enough for compression. Here are the first five rows:

mysql> select question from TestColCompression limit 5\G
*************************** 1. row ***************************
question: {"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
*************************** 2. row ***************************
question: {"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
*************************** 3. row ***************************
question: {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
*************************** 4. row ***************************
question: {"category": "OLD FOLKS IN THEIR 30s", "air_date": "2009-05-08", "question": "'The district of conservative rep. Patrick McHenry in this state includes Mooresville, a home of NASCAR'", "value": "$800", "answer": "North Carolina", "round": "Jeopardy!", "show_number": "5690"}
*************************** 5. row ***************************
question: {"category": "MOVIES & TV", "air_date": "2009-05-08", "question": "'Tim Robbins played a public TV newsman in "Anchorman: The Legend of" him'", "value": "$800", "answer": "Ron Burgundy", "round": "Jeopardy!", "show_number": "5690"}

Let’s begin with a straight column compression, without specifying a dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED;
Query OK, 79976 rows affected (4.25 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78995
Avg_row_length: 259
Data_length: 20496384
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:47:56
Update_time: 2018-10-26 15:47:56
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

As expected, the data didn’t compress much. The compression ratio is 0.82 or, expressed as a percentage saving, 18%. Since the JSON keys are always the same, and are present in all questions, we should at a minimum use them for the dictionary. Trying a minimal dictionary made of just the keys gives:

mysql> SET @dictionary_data = 'category' 'air_date' 'question' 'value' 'answer' 'round' 'show_number' ;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE COMPRESSION_DICTIONARY simple_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY simple_dictionary;
Query OK, 79976 rows affected (4.72 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78786
Avg_row_length: 246
Data_length: 19447808
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:58:17
Update_time: 2018-10-26 17:58:17
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

There is a little progress: we now have a compression ratio of 0.79. Obviously, we could do more, but without a tool we would have to guess. A compressor like zlib builds a dictionary as part of its compression effort; could we use that? Yes, but only if we can generate it correctly and access the result. That’s not readily available with the common compressors I know of. Fortunately, someone else had the same issue and wrote a compressor able to save its dictionary. Let me introduce femtozip.

Femtozip to the rescue

The tool, by itself, has no magic algorithm; it is based on zlib, from what I can understand from the code. Anyway, we won’t compress anything with it, we’ll use it to generate a good dictionary. To create a dictionary, the tool looks at a set of files and tries to find patterns between them. Using a single big file would defeat the purpose. So, I generated one file per question with:

mkdir questions
cd questions
l=1; mysql -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo $line > ${l}; let l=l+1; done)

Then, I used the following command to generate a 1024-byte dictionary using all the files whose names start with “1”:

../femtozip/cpp/fzip/src/fzip --model ../questions_1s.mod --build --dictonly --maxdict 1024 1*
Building dictionary...

The job was done in about 10s. I tried with all 80k files and… I had to kill the process after thirty minutes. Anyway, there are 11111 files starting with “1”, a very decent sample. The generated dictionary looks like this:
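Femtozip's real algorithm is more sophisticated, but purely to illustrate the idea, here is a naive sketch that ranks fixed-size substrings by frequency across a set of samples and packs the most common ones into a dictionary. The function name, the sample rows, and all parameters are my own assumptions:

```python
import zlib
from collections import Counter

def build_dictionary(samples, max_len=1024, ngram=16):
    # Count fixed-size substrings across all samples (stride 4 keeps it
    # cheap), then concatenate the most frequent ones up to max_len bytes.
    counts = Counter()
    for s in samples:
        for i in range(0, len(s) - ngram + 1, 4):
            counts[s[i:i + ngram]] += 1
    out = bytearray()
    for chunk, _ in counts.most_common():
        if chunk not in out:          # skip substrings already covered
            out += chunk
        if len(out) >= max_len:
            break
    return bytes(out[:max_len])

samples = [
    b'{"category": "HISTORY", "air_date": "2004-12-31", "value": "$200"}',
    b'{"category": "MOVIES", "air_date": "2009-05-08", "value": "$800"}',
]
zdict = build_dictionary(samples)

# A new, similar document compresses better with the derived dictionary.
doc = b'{"category": "SCIENCE", "air_date": "1998-01-01", "value": "$400"}'
c = zlib.compressobj(zdict=zdict)
seeded = c.compress(doc) + c.flush()
print(len(zlib.compress(doc)), len(seeded))
```

The key property, shared with femtozip's output, is that the dictionary is built from many samples rather than one big file, so it captures the patterns shared *between* rows.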

cat ../questions_1s.mod
", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "'e", "round": "Jeopardy!", "show_number": "r", "round": "{"cate gory": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "'Jeopardy!", "show_number": "2'", "value": "$1000", "answer": "7", "question": "'The ", "question": "'A'", "value": "$600", "answer": "9", "questi on": "'In ", "question": "'This 3", "question": "'2", "question": "'e'", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"'", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s'", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy !", "show_number": "3", "round": "Jeopardy!", "show_number": "5'", "value": "$200", "answer": "'", "value": "$800", "answer": "'", "value": "$400", "answer": "

With some formatting, I was able to create a dictionary with the above data:

mysql> SET @dictionary_data = '", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "''e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "''Jeopardy!", "show_number": "2''", "value": "$1000", "answer": "7", "question": "''The ", "question": "''A''", "value": "$600", "answer": "9", "question": "''In ", "question": "''This 3", "question": "''2", "question": "''e''", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"''", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s''", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5''", "value": "$200", "answer": "''", "value": "$800", "answer": "''", "value": "$400", "answer": "' ;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
And then, I altered the table to use the new dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY femtozip_dictionary;
Query OK, 79976 rows affected (4.05 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 79861
Avg_row_length: 190
Data_length: 15220736
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:56:09
Update_time: 2018-10-26 17:56:09
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

That’s interesting: we are now achieving a ratio of 0.61, a significant improvement. I pushed my luck and tried a 2048-byte dictionary. It further reduced the ratio to 0.57, but that was about the best I got; larger dictionaries didn’t lower the ratio below 0.57. Zlib supports dictionaries of up to 32KB.

So, to recap:

  • column compression without dictionary, ratio of 0.82
  • column compression with simple dictionary, ratio of 0.79
  • column compression with a 1k dictionary from femtozip, ratio of 0.61
  • column compression with a 2k dictionary from femtozip, ratio of 0.57

The above example stores a JSON document in a text column. MySQL 5.7 includes a JSON datatype which behaves a bit differently with regard to the dictionary: delimiting characters like ‘{}’ are removed from the on-disk representation of a JSON column. If you have TBs of data in similar tables, you should really consider column compression and a systematic way of determining the dictionary with femtozip. In addition to improving the compression, it is likely the least performance-impacting solution. Wouldn’t it be interesting to generate a dictionary from existing data with a command like this one?

CREATE COMPRESSION_DICTIONARY_FROM_DATA A_good_dictionary (2048, select questions from TestColCompression limit 10000);

where the dictionary creation process would implicitly include steps similar to the ones I performed with femtozip.
