Apr 03, 2023

MongoDB: How To Convert BSON Documents to a Human-readable Format


Binary JavaScript Object Notation (BSON) is a binary-encoded serialization of JSON-like documents. JSON is easier to understand as it is human-readable, but compared to BSON, it supports fewer data types. BSON has been extended to add some optional non-JSON-native data types, like dates and binary data.

MongoDB stores data in BSON format both internally and over the network. It is also the format used for the output files generated by mongodump. To read the content of a BSON document, you have to convert it to a human-readable format like JSON.

In this blog post, you will learn how to convert a BSON document to JSON using several methods: bsondump, mongoexport, Python, and Bash.

BSON to JSON with bsondump

The bsondump utility converts BSON files into human-readable formats, including JSON. For example, bsondump is useful for reading the output files generated by mongodump. The bsondump tool is part of the MongoDB Database Tools package.

Run bsondump from the system command line:

bsondump --outFile=collection.json collection.bson

It will create a JSON file (collection.json) from an existing BSON file (collection.bson), like the ones created after backing up your database.

BSON to JSON with mongoexport

mongoexport is a command-line tool that produces a JSON or CSV export of data stored in a MongoDB instance. The mongoexport tool is part of the MongoDB Database Tools package.

Run mongoexport from the command line:

mongoexport --collection=employees --db=company --out=employees.json --pretty

To connect to a local MongoDB instance running on port 27017, you do not have to specify the host or port. If you need to connect to a different host or a remote instance, check the Connect to a MongoDB Instance section in the documentation for more information.

The --pretty option will pretty-print the content of the JSON file.

BSON to JSON with Python

If you’re a Python developer, there are two ways to read a BSON document and convert it to JSON.

  1. Using the bson module from PyMongo

from bson import decode_all
from bson.json_util import dumps

with open('./data.bson','rb') as f:
    data = decode_all(f.read())

with open("./data.json", "w") as outfile:
    outfile.write(dumps(data, indent=2))

This is what the script is doing:

    1. Import the decode_all and dumps methods from the bson module.
    2. Open the file to read the content and decode the data.
    3. Create a JSON file, and write the JSON document created from the data of the BSON file.

The script works with BSON files generated by mongodump. Before running the script, you must install PyMongo:

pip install pymongo


  2. Connecting to the database and querying the data with PyMongo, the Python driver for MongoDB.
from pymongo import MongoClient
from bson.json_util import dumps

uri = "mongodb://username:password@host:port/"
client = MongoClient(uri)

db = client.company
employees = db.employees

cursor = employees.find()
list_cur = list(cursor)

json_data = dumps(list_cur, indent = 2)

with open('data.json', 'w') as file:
    file.write(json_data)

This is what the script is doing:

    1. Import the MongoClient class from the pymongo library, and the dumps method from the bson module.
    2. Establish the connection to the database.
    3. Set the database (e.g., company) and the collection (e.g., employees) you want to query.
    4. Retrieve the documents in the collection with the find() method and create a list with the result. If you don’t pass any parameter to this method, the result will be similar to SELECT * in MySQL.
    5. Create a JSON object by calling the dumps method. The indent=2 parameter tells dumps() to pretty-print the JSON object.
    6. Write the content of the json_data variable to the data.json file.

Before running the script, you must install PyMongo:

pip install pymongo


BSON to JSON with Bash

I asked the AI at phind.com to tell me how to convert a BSON file to JSON, and one of the solutions that it showed me was to create a Bash script in the directory where the BSON files are.

#!/bin/bash
# Collect every BSON file in the current directory
declare -a bson_files
bson_files=( "$PWD"/*.bson )

# Convert each BSON file to JSON with bsondump
for file in "${bson_files[@]}"
do
    bsondump "$file" --outFile="$file.json"
done

The script lists all the BSON files in the current directory and saves the result in an array, then loops through the array and converts every BSON file to a JSON file using bsondump.

To run the script:

  1. Add execution permission to the script:
    chmod +x bson_to_json.sh

  2. Execute this command in the command line:
./bson_to_json.sh

Conclusion

If you want to read the content of a BSON document, you can use bsondump and mongoexport to convert a BSON document to a human-readable format like JSON. These tools are part of the MongoDB Database Tools.

If you’re a developer, you can use the MongoDB driver of your programming language of choice and query the data to analyze the content of the collections in your database. For Python, you can install PyMongo, connect to the database, query the data, and use the bson module to save the content as a JSON document.

There are other solutions like online tools and methods provided by other programming languages, but here you learned some ways to do it.

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

 

Download Percona Distribution for MongoDB Today!

Dec 15, 2022

JSON and Relational Databases – Part One


JSON data and relational databases have traditionally been diametric opposites. One is very free form with minimal rules on the formatting and the other is very strict and formal. So many folks were surprised when relational databases started adding a JSON data type to their products.

In this multi-part blog series, we will start with an overview of using JSON in a relational database. Then we will dig deeper into the issues raised to see what, if any, progress has been made in resolving the possible conflicts and issues. It is obvious that neither RDBMSs nor JSON are going away, but can they coexist?

Why JSON?

JSON is popular. Very popular. It is the interchange format of choice for many developers, is easy to parse, and is ridiculously flexible. For some, the ability to encapsulate a JSON document in a row is a necessity. Others have found that JSON allows some schema flexibility or ‘fudge factor’ in situations where database schema changes are expensive.

Unstructured data has a lot of appeal in environments where defining the data itself, especially at the beginning of a project, is hard to do. Or, there are too many known unknowns. This approach to never-ending incrementalism is not optimal for database operations.

The Achilles’ heel of unstructured data is that it is hard to apply rigor to the data. Making a key/value pair a required item, or of a certain data type or format, is not part of the JSON standard. And it is easy for a team to decide to record electronic mail addresses with inconsistent keys such as e-mail, EMail, or email, as there is no ‘native’ enforcement mechanism.

Why not JSON?

Structured data has a lot going for it and the relational model works well with most business models. JSON breaks normalization rules.

In the structured world of relational databases, all columns are carefully thought out, typed, and arranged. So the idea of a free-form column of unstructured data does not sit well with long-time database practices.

With a traditional relational database, you know that a column of integers is going to be nothing but integers. This makes searching that column of integers very easy if it is indexed. And if you try to put in a string to that column of integers the server will refuse to accept that row of data. It is a lot less expensive to keep out bad data than having to go back later and clean it up.
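
As a small, hypothetical illustration (the table and values are invented here), with MySQL's default strict SQL mode the server simply refuses the bad row:

CREATE TABLE readings (
    sensor_id INT NOT NULL,
    reading   INT NOT NULL
);

-- An integer in an integer column is accepted
INSERT INTO readings VALUES (1, 42);

-- A string in an integer column is rejected in strict mode ("Incorrect integer value")
INSERT INTO readings VALUES (1, 'broken');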

But altering tables to add or qualify columns can be slow, expensive, and unsettling. The alter table process can be thought of as the server making a copy of the existing table (goodbye memory and/or disk during the process), making the change on the new version of the table, and then loading the data into the new table. The actual process has a lot more intricacies than that and of course, larger tables take more time to alter. Yup, the free-form JSON stuff looks a lot more attractive now, doesn’t it?

But you do not need a JSON data type!

Before JSON data types were introduced, it was possible to save a JSON document in a text string. It was very simple to put the entire document in a column and be done with it. Sadly this approach made it hard to search for data embedded in the document. Regular Expression searches are not exactly speedy. Changing one part of the JSON document meant the whole document had to be rewritten. And usually, you had to pull the entire document out of the table and parse it in an application. This approach works but not well.

JSON with relational

So all but the most devoted to the relational model can see the benefits of having a JSON column, or two, in a row of a table. In MySQL, you get one gigabyte of space for each JSON column – nothing to be sneezed at.

Both the PostgreSQL and the MySQL implementations of JSON data types provide some extra benefits. First, you do not have to update the entire document when you make minor changes – just the changed bits. The server has many functions to extract data efficiently on the server itself.
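
For example, here is a minimal sketch (the products table and its attributes column are hypothetical) of changing and reading a single attribute in MySQL, with rough PostgreSQL equivalents in the comments:

-- Update only the "color" attribute instead of rewriting the whole document
UPDATE products
   SET attributes = JSON_SET(attributes, '$.color', 'blue')
 WHERE id = 42;

-- Extract a single attribute on the server side
SELECT JSON_EXTRACT(attributes, '$.color') FROM products WHERE id = 42;

-- Rough PostgreSQL equivalents:
--   UPDATE products SET attributes = jsonb_set(attributes, '{color}', '"blue"') WHERE id = 42;
--   SELECT attributes ->> 'color' FROM products WHERE id = 42;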

There is a performance hit when searching JSON data (see the part on indexing below), but the extra flexibility is handy. For example, look at how MySQL stores dual passwords in the mysql.user table.

{"additional_password": "$A$005$;H7\u001b\u001b\u0006<`qFRU\tNRxT Z\u0003Ya/iej8Az8LoXGTv.dtf9K3RdJuaLFtXZHBs3/DntG2"}

In this case, you have a temporary second password on a user account, so you do not need a column in the table to take care of this data. This information is around only for the accounts that need it.

JSON_TABLE() is a powerful command! Sadly, PostgreSQL 15 withdrew its JSON_TABLE() at the last minute, but hopefully, it will be in version 16. Simply stated, JSON_TABLE() temporarily converts the unstructured JSON data into structured data so it can be processed with Structured Query Language (SQL). This allows you to use CTEs, window functions, or aggregate functions on that data.

mysql> select country_name, IndyYear from countryinfo,
json_table(doc,"$" columns (country_name char(20) path "$.Name",
IndyYear int path "$.IndepYear")) as stuff
where IndyYear > 1992;
+----------------+----------+
| country_name   | IndyYear |
+----------------+----------+
| Czech Republic |     1993 |
| Eritrea        |     1993 |
| Palau          |     1994 |
| Slovakia       |     1993 |
+----------------+----------+
4 rows in set, 67 warnings (0.00 sec)

Generated columns and indexes

MySQL has a different approach to indexing JSON data that provides two choices. If you need to extract values from a JSON key/value pair into its own column then you can use generated columns.

ALTER TABLE car ADD car_type VARCHAR(30)
AS (JSON_UNQUOTE(car->"$.type")) STORED;

Then this column can be indexed for quick searches.

The second MySQL option is multi-valued indexes. These are secondary indexes defined on a column that stores an array of values. Usually, there is a one-to-one relationship between an index entry and a row of data, but multi-valued indexes have a one-to-many relationship that is intended for indexing JSON arrays, as sketched below.
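
Here is a hedged sketch of a multi-valued index (the books table and its tags array are hypothetical, mirroring the PostgreSQL example that follows); the optimizer can then use the index for MEMBER OF() lookups against the array:

CREATE TABLE books (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    data JSON
);

-- One index entry per array element, cast to a typed array
ALTER TABLE books ADD INDEX tags_idx ( (CAST(data->'$.tags' AS CHAR(64) ARRAY)) );

-- A query that can use the multi-valued index
SELECT id FROM books WHERE 'databases' MEMBER OF (data->'$.tags');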

With PostgreSQL, you can simply create a GIN index on the values of a specified key in the JSON document.

CREATE INDEX books_tags_idx ON books USING gin ((data->'tags'));

Rigor

MySQL implemented a subset of the JSON-Schema.org proposal to ensure that your JSON data is what you want before it is inserted into the database. This way you can require certain fields, as well as check data types and data ranges. The JSON_SCHEMA_VALID() function checks an incoming JSON document against a schema to see if it complies. Used in conjunction with a constraint check, you can keep bad or incomplete data out of your JSON column.
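
A minimal sketch of that pattern (the table, column, and schema here are hypothetical):

CREATE TABLE contacts (
    id  INT AUTO_INCREMENT PRIMARY KEY,
    doc JSON,
    -- Reject documents that are missing "name" or that store it as a non-string
    CONSTRAINT doc_matches_schema CHECK (
        JSON_SCHEMA_VALID(
            '{"type": "object",
              "required": ["name"],
              "properties": {"name": {"type": "string"}}}',
            doc)
    )
);

INSERT INTO contacts (doc) VALUES ('{"name": "Ada"}');              -- accepted
INSERT INTO contacts (doc) VALUES ('{"email": "ada@example.com"}'); -- fails the constraint check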

Conclusion

JSON will remain popular for data interchange for a long time, and relational databases will need to adapt to its usage. The old relational arguments about abstaining from using JSON will fade as efforts like the one from JSON-Schema.org provide the ability to ensure the completeness of data. Altering tables will never be completely instantaneous, even though features like MySQL’s instant ADD COLUMN have proven useful.

And the need to add or alter columns in a table will always be around. Being able to get some flexibility with a JSON column could provide some relief in shops where continuous operation with zero downtime is the goal, and could be enough of a trade-off when using the relational model.

Oct 05, 2021

Using New JSON Syntax in PostgreSQL 14 – Update


As previously mentioned here on the blog, PostgreSQL 14 was just released. Need a primer? Umair Shahid recently put together a great overview, PostgreSQL 14 – Performance, Security, Usability, and Observability.

I recently did a deep dive into using JSON from within PostgreSQL in a two-part blog series titled Storing and Using JSON Within PostgreSQL. While all the information from those two blogs is still relevant with the release of PostgreSQL 14 (PG14), we need to provide a quick update to show off some improvements.

Keeping this brief:  Prior to PG14, pulling fields out of JSON looked something like the following:

movie_json_test=# select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_jsonb  where (jsonb_column->>'title')::varchar = 'Avengers: Endgame (2019)';
          title           | rating |  imdb_id  
--------------------------+--------+-----------
 Avengers: Endgame (2019) |        | tt4154796
(1 row)

While this worked, it is not terribly intuitive for developers who may be used to interacting with JSON in code. Starting in PG14, you can use a much more natural and user-friendly subscript syntax to access data.

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   
----------------------------+--------+-------------
 "Avengers: Endgame (2019)" | null   | "tt4154796"
(1 row)

Not only can you access top-level attributes, but you can also access nested arrays and lists as well:

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id, jsonb_column['cast'][0]  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   |                                            jsonb_column                                             
----------------------------+--------+-------------+-----------------------------------------------------------------------------------------------------
 "Avengers: Endgame (2019)" | null   | "tt4154796" | {"id": "nm0000375", "name": "Robert Downey Jr.", "character": "Tony Stark /              Iron Man"}
(1 row)

Overall, this syntax will still make use of all the same indexes, JSON functions, etc., that were previously available. While this is a relatively minor change, it should make things much more natural for developers and hopefully drive more adoption and usage.
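
The subscript syntax also works for writes. Here is a minimal sketch against the same movies_jsonb table used above; note that the assigned value must itself be valid JSON, hence the quoting:

-- Update a single key with subscript syntax (PostgreSQL 14+)
UPDATE movies_jsonb
   SET jsonb_column['imdb_rating'] = '9'
 WHERE jsonb_column['title'] = '"Avengers: Endgame (2019)"';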

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Sep 08, 2021

Storing and Using JSON Within PostgreSQL Part Two


We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly.  Most of the functionality we talked about in part one works well when you have a simple JSON format.  However, in real life, our documents may get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the  following:

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}'

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."'

However, what if we also need to pull out the character from the movie?  For our needs of getting a full list of actors and characters who were in this particular movie, we can use jsonb_to_recordset (similar to MySQL’s json_table function we covered in the MySQL part of this series).

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character     
-----------+--------------------------------+-------------+-----------+----------------------+------------------
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10; 
ERROR:  cannot call jsonb_to_recordset on a non-array

When we look at the results, you can see that the function expects an array, and several of our movies have no cast (or a NULL value in the json).

movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof 
-----------+--------------
 tt7620156 | 
 tt0109524 | array
 tt0534466 | array
 tt0111091 | 
 tt4167726 | array
 tt0638383 | array
 tt6346314 | array
 tt5877808 | array
 tt4098782 | 
 tt0365100 | array
(10 rows)

You can work around this “Null” issue in a couple of different ways.  The easiest is by converting to text, then back into JSON.  For example:

select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in part one how I harped on types and ensuring you cast to different data types when you needed to?  This is another example.  In this case, the ->> operator returns the cast entry as text, so a JSON null comes back as a SQL NULL; casting that back to JSONB hands the function a NULL it can quietly skip instead of a JSON null it cannot expand.

While this is the easiest way, let’s show some other interesting ways to work around this to highlight some of the other functions, indexes, etc., we learned in part one.  Because we know that we have NULL values for some cast entries, we could check and filter out the values where the type is not empty or null.  For example, here is a simple check if the text version of our jsonb_column->’cast’ is not equal to null.

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
 tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
 tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
 tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
 tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
 tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
 
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
 Planning Time: 0.064 ms
 Execution Time: 84.692 ms
(10 rows)

This is not terribly fast, but it does work. This is basically working around the functionality built into the JSON functions, however.  To speed our query up, we can, of course, index this column; however, some of the data in our movie cast list is just too large:

movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191

Of course, we can work around this as well.  But sometimes, it’s best to pivot.  Let’s look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we could create either an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored; 
ALTER TABLE


movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
 tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
 tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
 tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
 tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
 tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)


movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
 Planning Time: 0.068 ms
 Execution Time: 30.572 ms
(10 rows)

You can see there are several options here for dealing with the nulls, some way easier (and cleaner) than others.  I want to highlight some of the challenges this brings up with using unstructured data within a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that you can evolve what you store over time.  Let’s use the above example.  Let’s say that for years, every movie that is fed into your system has a full cast listing of characters and actors.  Then, one day, the feed you get your data from allows movies without a cast listing.  Your application will still work, your load scripts will still work.  But every once in a while, your users will get a weird error, or you will see small flashes in your logs (if you are logging these).  But 99.9% of queries are fine.  It is these transient issues that drive people bonkers.

Looking at this problem slightly differently, what if you start adding data or changing the order of certain items?  Back to our original JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

What if we wanted to add another piece of data to the cast records? Let’s say we want to add a thumbs up, or thumbs down vote to each character, so something like this:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100,
      "character": "Bruce Banner"
    },

Now we run into another challenge.  If we inject the new fields into the middle of the JSON, instead of at the end, our call to jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text) will cause issues in the application.  We had to define the structure of the recordset; the first field returned is id, then the actor name, and then the character.  If we changed this record, the character returned in the SQL would be 10000, not Bruce Banner.  Now, to make additions, you would need to start adding at the end, like:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100
    },

MySQL mitigates this by allowing you to define specific paths in its equivalent function: json_table(json_column, '$.cast[*]' columns (V_name varchar(200) path '$.name', V_character varchar(200) path '$.character')). This allows you to define exactly which fields you want, not just the first X ones. That said, there are other limitations in the json_table method MySQL uses.  This is a trade-off.  You need to be aware, again, that the order and structure of your JSON matter greatly if you are going to use built-in database functions to query this data.

To JSON or Not?  Database Design

While PostgreSQL’s JSON features are pretty in-depth (there are dozens of other functions I did not show), there are similar challenges I see when using these compared to a more classic database design. JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool to offer additional flexibility.  If you treat PostgreSQL or any database simply as a dumping ground for data, bad things happen.  It would help to think about how JSON columns can augment your current database designs, not replace them.

Let me give you a simple example of some of the dangers of simply forgoing database design in favor of just dumping in JSON.  While I think we have highlighted some of the complexity and setup challenges with different methods of storing JSON in PostgreSQL, I wanted to take this a step further and compare the performance of different database designs and their impact on performance and scalability.  For this, I built three separate designs and schemas for housing the same data:

A.)  Minimal Database Design -> single table, single column.  “The dump.”  With only a GIN index.

B.)  Single Table -> Generated columns and Expression indexes where needed.

C.)  Old School Normalized Tables, with JSON stored for easy use

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

                                                   | A.) Minimal, no GIN idx | A.) Minimal, with GIN idx | B.) Single table, generated columns | B.) Single table, expression indexes | C.) Normalized
 --------------------------------------------------+-------------------------+---------------------------+-------------------------------------+--------------------------------------+----------------
 Simple query for movie title (random 10 titles)   | 800ms                   | 0.3ms                     | 0.2ms                               | 0.2ms                                | 0.2ms
 Select top 10 movies                              | 841ms                   | 831ms                     | 0.9ms                               | 0.9ms                                | 0.3ms
 Select all movies for an actor                    | 1009.0ms                | 228ms                     | 5588.0ms (2)                        | 5588.0ms (2)                         | 0.8ms
 Count movies for a given actor                    | 5071ms                  | 5075ms                    | 5544ms                              | NA                                   | 1ms
  1. Each test was repeated 100 times, and the average results were listed.  Min/Max is available as well. 
  2. The nested JSON for our “cast info” was not easily indexed; there are some things we could do to mitigate this, but it is a bit more complicated.

Interestingly, there are a few cases where we have problems with using only generated columns/expression indexes, generally because of nested data within the JSON.  We could do some tricks, like adding a GIN index and querying the JSON to find the element, or creating some funky nested generated column with an expression index.  I left these out for now.  Also, I could go deeper into the tests, testing inserts, updates, and, of course, assembling JSON from a normalized table, but I didn’t.  My goal here is merely to make you think about your design.  Choosing one way to implement and use JSON may be OK under certain use cases, but you may find yourself querying or aggregating the data, and things break.

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways. jsonb_to_recordset is the easiest way I found to pull out the data I need.  However, using this function or others is very susceptible to the data structure in your document (data types matter!).
  • JSON data whose format changes ( elements added or removed ) may make using some functions difficult, if not impossible, to use.
  • JSON within a well-built, designed application and database can be wonderful and offer many great benefits.  JSON just dumped into databases won’t scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!

Sep 03, 2021

Storing and Using JSON Within PostgreSQL Part One


Continuing our series on storing JSON directly within your database, we are now looking at the JSON functionality built into PostgreSQL.  You can see the first two posts in this series exploring MySQL’s JSON functionality with Storing JSON in Your Databases: Tips and Tricks For MySQL Part One and Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two.  I used the exact same table structures and datasets as I did within my MySQL tests.  You can get the instructions to follow along at home here on GitHub.

PostgreSQL has two JSON data types available to you: first JSON, and second JSONB.  The docs highlight the differences pretty well (see the comparison table in the PostgreSQL documentation).


Basically, JSONB is stored decomposed, making it more efficient for the database to interact with, while the JSON type is stored as an exact text copy of what was input.  This means formatting, white space, etc., will be lost with JSONB.  That said, the suggestion is to generally use JSONB.  The trade-off is slightly slower inserts vs. faster overall read performance (on top of the storage differences mentioned above).
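
A quick, hypothetical illustration of that difference: casting the same string to each type shows that json keeps the input verbatim, while jsonb drops the extra whitespace, keeps only the last duplicate key, and does not preserve key order.

SELECT '{"b": 2,   "a": 1, "a": 3}'::json  AS as_json,
       '{"b": 2,   "a": 1, "a": 3}'::jsonb AS as_jsonb;
-- as_json  returns the text exactly as written
-- as_jsonb returns {"a": 3, "b": 2}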

In terms of performance, how much of a difference is there?  That is tricky, as it will depend heavily on the size and complexity of the JSON document being stored.  The larger, more complex the document, the larger the difference you could see.  But to illustrate an example, we will use our movie JSON to select the same data from both JSON and JSONB.

I created the following tables and loaded them with movie-related JSON:

create table movies_json (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	json_column json not null
);

create unique index movies_json_imdb_idx on movies_json(imdb_id);


create table movies_jsonb (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	jsonb_column jsonb not null
);

create unique index movies_jsonb_imdb_idx on movies_jsonb(imdb_id);

Querying the JSON column results in slower retrieval:

movie_json_test=# explain (verbose true, analyze true) select * from movies_json where json_column->>'title' = 'Avengers: Endgame (2019)';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..55775.85 rows=1880 width=1059) (actual time=694.047..2516.879 rows=1 loops=1)
   Output: ai_myid, imdb_id, json_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_json  (cost=0.00..54587.85 rows=783 width=1059) (actual time=1905.511..2512.010 rows=0 loops=3)
         Output: ai_myid, imdb_id, json_column
         Filter: ((movies_json.json_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=2511.276..2511.277 rows=0 loops=1
         Worker 1:  actual time=2511.322..2511.322 rows=0 loops=1
 Planning Time: 0.166 ms
 Execution Time: 2516.897 ms
(12 rows)

While the JSONB column is faster (3x):

movie_json_test=# explain (verbose true, analyze true) select * from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54116.60 rows=1873 width=1025) (actual time=723.324..726.914 rows=1 loops=1)
   Output: ai_myid, imdb_id, jsonb_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52929.30 rows=780 width=1025) (actual time=548.982..721.730 rows=0 loops=3)
         Output: ai_myid, imdb_id, jsonb_column
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=720.995..720.995 rows=0 loops=1
         Worker 1:  actual time=202.751..720.994 rows=1 loops=1
 Planning Time: 0.038 ms
 Execution Time: 726.933 ms
(12 rows)

To ensure this is not a single data point I wrote a small script to run this test (with this title and other random titles).  Over 100 runs we had pretty consistent results:

              | JSON (seconds)     | JSONB (seconds)
 -------------+--------------------+--------------------
 Average Time | 2.5492             | 0.747
 Min Time     | 2.5297428970225155 | 0.7297536049736664
 Max Time     | 2.56536191503983   | 0.7827945239841938

As stated, inserts/updates will lose some of that performance, so keep that in mind.  It is, however, easier to interact with and update data in a JSONB data type.  For instance, some functions are only available for JSONB data types, like the jsonb_set function, which allows you to update your JSON in place.

movie_json_test=# explain (verbose true, analyze true) update movies_jsonb set jsonb_column= jsonb_set(jsonb_column, '{imdb_rating}', '9') where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Update on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=1928.691..1928.692 rows=0 loops=1)
   ->  Seq Scan on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=577.386..1928.246 rows=1 loops=1)
         Output: ai_myid, imdb_id, jsonb_set(jsonb_column, '{imdb_rating}'::text[], '9'::jsonb, true), ctid
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 375358
 Planning Time: 0.040 ms
 Execution Time: 1928.718 ms
(7 rows)

I have seen references in several places to the insert performance difference between JSON and JSONB, but I have not seen any concrete numbers.  I decided to run a few tests:

  • Inserting 10K records into a JSON column (via insert into select from):  72.851 ms
  • Inserting 10K records into a JSONB column (via insert into select from):  754.045 ms

If you are doing a heavy insert workload, the difference is significant enough to take note and plan for it.  

Another distinct advantage of JSONB is the ability to use GIN indexes over the JSON document.  GIN (Generalized Inverted) indexes are designed and optimized for composite values that contain many component items, such as full-text search and JSONB documents, so they lend themselves well to JSON.  That said, the syntax to make use of GIN indexes may be a bit more complicated.

Here is how you create a GIN index. 

movie_json_test=# CREATE INDEX movies_jsonb_gin_index ON movies_jsonb USING gin (jsonb_column);
CREATE INDEX

You can see the original query is still not using this index:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=716.059..719.346 rows=1 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=544.197..714.577 rows=0 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=202.768..713.907 rows=1 loops=1
         Worker 1:  actual time=713.890..713.891 rows=0 loops=1
 Planning Time: 0.041 ms
 Execution Time: 719.365 ms
(12 rows)

Why not?  The ->> operator returns the value as text, so the conversion from JSON to text causes some issues.  There are a couple of ways to work around this.  The first is to use the @@ operator, which evaluates a JSON path predicate against the document (only the first item of the result is considered):

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title == "Avengers: Endgame (2019)"';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=72.29..221.83 rows=38 width=64) (actual time=0.179..0.180 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..72.28 rows=38 width=0) (actual time=0.068..0.068 rows=2 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
 Planning Time: 0.145 ms
 Execution Time: 0.199 ms
(8 rows)

This works.  You can also use the @> operator, which checks whether the JSON value on the left contains the JSON value on the right.

movie_json_test=> explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "title" : "Avengers: Endgame (2019)"}';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=44.29..193.94 rows=38 width=64) (actual time=0.199..0.200 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..44.28 rows=38 width=0) (actual time=0.091..0.091 rows=1 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
 Planning Time: 0.147 ms
 Execution Time: 0.214 ms
(8 rows)

Not only did this use the index, it is now several orders of magnitude faster overall.  The PostgreSQL documentation offers a fairly detailed overview of the different options for interacting with JSON, especially when it comes to the functions and operators available to you (note: the table in the PG 12 docs giving an overview of the operators is a little more readable).  It’s important to note that not all operators and functions will work with GIN indexes.  For example, using ‘like_regex’:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title like_regex "^Avengers*"';
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..53546.87 rows=38 width=64) (actual time=218.550..795.063 rows=9 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52543.07 rows=16 width=64) (actual time=251.866..790.098 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: (movies_jsonb.jsonb_column @@ '($."title" like_regex "^Avengers*")'::jsonpath)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=311.403..789.402 rows=3 loops=1
         Worker 1:  actual time=225.825..789.408 rows=1 loops=1
 Planning Time: 0.204 ms
 Execution Time: 795.087 ms
(12 rows)

In addition to GIN indexes, you can create hash or B-tree indexes on your JSON columns; however, they are only useful if you are comparing or searching the entire JSON document.  Instead, you want to use an expression index (often referred to as a functional index in other databases).

Generated Columns and Expression Indexes

While the GIN indexes work for some use cases, you often will find it easier and more convenient to use actual columns or some of the normal functions you are familiar with.  The simplest use of expression indexes is to simply pull out and index a field you will use often.  Let’s say we will often want to search for a title.  We can create an index on the extracted JSON from our previous example.

movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text));
CREATE INDEX

This creates a B-tree index on the jsonb_column ->> 'title' expression, allowing us to search for a title.

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=66.97..6421.19 rows=1877 width=64) (actual time=0.138..0.140 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..66.50 rows=1877 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
 Planning Time: 0.160 ms
 Execution Time: 0.159 ms
(8 rows)

Originally, when we searched for movies with the ‘Avengers: Endgame (2019)’ title, there was no index (unless you modified the SQL to make use of the GIN index).  Now, with this new index, you can see our original query is no longer doing a sequential scan but is instead using the B-tree index.  If we want the index to also be used for LIKE 'string%' statements to help with partial matching, we need something a bit extra added to the index.  Note: this is not something you can do by default using only the GIN index (however, you can potentially look into the pg_trgm extension for this).  Here we will create the same index with the operator class "text_pattern_ops", which is designed to work with LIKE and regex comparisons.

Before:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=371.463..723.743 rows=8 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=288.053..718.957 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=204.176..718.288 rows=2 loops=1
         Worker 1:  actual time=288.637..718.299 rows=3 loops=1
 Planning Time: 0.130 ms
 Execution Time: 723.762 ms
(12 rows)

After:

movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text) text_pattern_ops);
CREATE INDEX
movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=71.66..6425.88 rows=1877 width=64) (actual time=0.195..0.498 rows=8 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..71.19 rows=1877 width=0) (actual time=0.015..0.015 rows=8 loops=1)
         Index Cond: (((movies_jsonb.jsonb_column ->> 'title'::text) ~>=~ 'Avengers'::text) AND ((movies_jsonb.jsonb_column ->> 'title'::text) ~<~ 'Avengert'::text))
 Planning Time: 0.168 ms
 Execution Time: 0.519 ms

Now you can also perform a similar operation, but store the results as a queryable generated column instead:

movie_json_test=> \d movies_jsonb
                                         Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                    Default                    
--------------+------------------------+-----------+----------+-----------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

movie_json_test=> alter table movies_jsonb add column title varchar(255) generated always as (((jsonb_column ->> 'title')::text)) stored;
ALTER TABLE


movie_json_test=> \d movies_jsonb
                                                  Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                             Default                             
--------------+------------------------+-----------+----------+-----------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
 title        | character varying(255) |           |          | generated always as (((jsonb_column ->> 'title'::text))) stored
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

Now we have the option to add an index on the title column, select it without the JSON formatting, etc.  Note: generated columns cannot be updated directly.
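For example, a minimal sketch (the index name is illustrative; varchar_pattern_ops is used so a LIKE 'prefix%' filter can use the index, mirroring the text_pattern_ops expression index above):

create index movies_jsonb_title_gen_idx on movies_jsonb (title varchar_pattern_ops);

select title, jsonb_column ->> 'imdb_rating' as rating
from movies_jsonb
where title like 'Avengers%';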

We can create our entire movie_json table with extra columns and expression indexes and make it look a bit more like a normal table:

create table movies_json_generated (
	ai_myid serial primary key, 
	imdb_id varchar(255) generated always as (jsonb_column ->> 'imdb_id') stored,
	title varchar(255) generated always as (jsonb_column ->> 'title') stored,
    imdb_rating decimal(5,2) generated always as ((jsonb_column  ->> 'imdb_rating')::numeric) stored,
	overview text generated always as (jsonb_column ->> 'overview') stored,
	director jsonb generated always as ((jsonb_column ->> 'director')::json) stored,
	country varchar(100) generated always as (jsonb_column ->> 'country') stored,
	jsonb_column jsonb,
	json_column json
);

create unique index gen_imdb_idx on movies_json_generated(imdb_id);
create index gen_title_idx on movies_json_generated(title);
create index gen_func_title_index on movies_json_generated (((json_column ->> 'title')::varchar));	
CREATE INDEX Gen_gin_index ON movies_json_generated USING gin (jsonb_column);

You may have noticed we had to explicitly cast the columns to a specific data type.  This is one of the challenges, or at least one of the more difficult things to get used to, when working with JSON in both MySQL and PostgreSQL.

It’s worth remembering that a JSON document does not carry explicit data types, so you will often find that some functions or indexes do not work as expected because you are comparing data of two different types (and some operations will return different results, e.g. sorting ASCII vs. numeric).  For example, let’s add 1 to our IMDB rating:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1))) ;
ERROR:  operator does not exist: text + numeric
LINE 1: ...imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1))) ;
                                                              ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

Here the imdb_rating that comes back is text, not numeric, so you cannot add 1 to it.  So logically, you want to cast the value to numeric to allow the addition.

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1))) ;
ERROR:  function jsonb_set(jsonb, unknown, numeric) does not exist
LINE 1: ...pdate  movies_jsonb_generated_1 set jsonb_column= jsonb_set(...
                                                             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Again we run into a type issue.  This time we cannot pass a numeric value into jsonb_set (nor is it implicitly converted) because the function expects a JSONB value.  So let’s cast the result to JSONB:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) ;
ERROR:  cannot cast type numeric to jsonb
LINE 1: ...((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) …

Here we find we cannot cast a numeric directly to JSONB; we have to convert the numeric to text first, then to JSONB.

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::text::jsonb)) ;
UPDATE 100000
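As a side note, another route to the same result (assuming PostgreSQL 9.5 or later, where to_jsonb() is available) is to let to_jsonb() build the JSONB number directly instead of going through text:

update movies_jsonb_generated_1 set jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}', to_jsonb(((jsonb_column ->> 'imdb_rating')::numeric) + 1));

Note that rows where imdb_rating is missing would pass NULL to jsonb_set (which then returns NULL), so in practice you may want to filter those rows out with a WHERE clause.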

For some developers, this may prove a bit tricky to get the hang of.  It also extends to things that work but are just slow, or that produce the wrong data.  For instance, when comparing, sorting, or filtering, you may or may not pick up the index you are looking for, and you may end up comparing numerics by their text (ASCII) value.  Below you can see us explicitly convert certain columns in order to compare them; when you generate a column or create an expression index you need to do the same:

movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where (jsonb_column->>'imdb_rating')::numeric > 8 order by (jsonb_column->>'imdb_rating')::numeric desc limit 5;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=65291.78..65292.36 rows=5 width=128) (actual time=1055.064..1058.740 rows=5 loops=1)
   ->  Gather Merge  (cost=65291.78..77449.06 rows=104198 width=128) (actual time=1055.062..1058.738 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=64291.76..64422.00 rows=52099 width=128) (actual time=1053.392..1053.393 rows=4 loops=3)
               Sort Key: (((jsonb_column ->> 'imdb_rating'::text))::numeric) DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..63426.41 rows=52099 width=128) (actual time=0.993..1052.275 rows=2688 loops=3)
                     Filter: (((jsonb_column ->> 'imdb_rating'::text))::numeric > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.061 ms
 Execution Time: 1058.769 ms
(14 rows)
 
movie_json_test=> \d movies_json_generated
                                                     Table "public.movies_json_generated"
    Column    |          Type          | Collation | Nullable |                                    Default                                    
--------------+------------------------+-----------+----------+-------------------------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_json_generated_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | generated always as (((json_column ->> 'imdb_id'::text))) stored
 title        | character varying(255) |           |          | generated always as (((json_column ->> 'title'::text))) stored
 imdb_rating  | numeric(5,2)           |           |          | generated always as (((json_column ->> 'imdb_rating'::text)::numeric)) stored
 overview     | text                   |           |          | generated always as (json_column ->> 'overview'::text) stored
 director     | jsonb                  |           |          | generated always as (((json_column ->> 'director'::text)::json)) stored
 country      | character varying(100) |           |          | generated always as (((json_column ->> 'country'::text))) stored
 jsonb_column | jsonb                  |           |          | 
 json_column  | json                   |           |          | 
Indexes:
    "movies_json_generated_pkey" PRIMARY KEY, btree (ai_myid)
    "gen_func_title_index" btree (((json_column ->> 'title'::text)::character varying))
    "gen_gin_index" gin (jsonb_column)
    "gen_imdb_idx" UNIQUE, btree (imdb_id)
    "gen_title_idx" btree (title)
 
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=62548.12..62548.70 rows=5 width=102) (actual time=112.458..114.704 rows=5 loops=1)
   ->  Gather Merge  (cost=62548.12..63277.80 rows=6254 width=102) (actual time=112.457..114.702 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=61548.09..61555.91 rows=3127 width=102) (actual time=110.807..110.808 rows=4 loops=3)
               Sort Key: imdb_rating DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..61496.16 rows=3127 width=102) (actual time=0.128..109.939 rows=2688 loops=3)
                     Filter: (imdb_rating > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.146 ms
 Execution Time: 114.729 ms
(14 rows)

movie_json_test=> create index test_index_imdb_rating on movies_json_generated (imdb_rating);
CREATE INDEX
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..19.41 rows=5 width=102) (actual time=0.094..0.134 rows=5 loops=1)
   ->  Index Scan Backward using test_index_imdb_rating on movies_json_generated  (cost=0.42..28521.09 rows=7510 width=102) (actual time=0.093..0.132 rows=5 loops=1)
         Index Cond: (imdb_rating > '8'::numeric)
 Planning Time: 0.136 ms
 Execution Time: 0.152 ms
(5 rows)

A quick side note:  there will be a performance difference for inserts/updates on JSON/JSONB columns that feed one or more generated columns or expression indexes.  Because of how JSONB is stored, updates on a JSONB column backing several generated columns/expression indexes were 3-4x (or more) faster in my tests than the same updates on a JSON column with the same setup.  That said, the more generated columns and expression indexes you have on a table, the more they can impact write performance… this is a trade-off.

  • A table with a single generated column took 7,650 ms to load 100K rows.
  • A table with six generated columns took 8,576 ms to load 100K rows.

While this is a very simple setup with just a single JSONB column and the generated columns, you can see a slight overhead.  This held true when updating the same 100K records: updating only one column in the table with six generated columns took 4.7 seconds vs. 3.6 seconds for the table with a single generated column (a ~25% difference).  Again, not an in-depth test, but it illustrates that adding a ton of expression indexes or generated columns will not be without some cost.
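For reference, a comparison like this can be reproduced with nothing more than psql's \timing and a bulk update of the JSONB column. A sketch (movies_jsonb_generated_1 is the table used above; movies_jsonb_generated_6 is a hypothetical variant with six generated columns):

\timing on
-- one generated column derived from jsonb_column
update movies_jsonb_generated_1 set jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}', '9');
-- six generated columns derived from jsonb_column (hypothetical table)
update movies_jsonb_generated_6 set jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}', '9');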

Recap & What’s Next

A few quick takeaways:

  • Using JSONB is probably going to be your best option in most use cases
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits

Now you have the basics of JSON in PostgreSQL.  In the next post, we will review some of the more advanced options, do a deeper dive on performance, and explore a normalized schema vs. one that is heavily JSON.

Aug
20
2021
--

Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two


JSON Structure, The Next Generation (Performance, Scale, and Fun)

Now you know the basic JSON functionality available to you, as covered in part one of this series.  Let’s look deeper into performance, scale, and oddities that may occur.

You can do almost anything you could do in SQL with standard data types and objects, but you may run into specific problems and limitations.  Here is an example of that.  Let’s say I want to get the top 10 rated movies of all time.

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating' desc limit 10;

ERROR 1038 (HY001): Out of sort memory, consider increasing server sort buffer size

In this case, the size required to sort the data set is much larger than the available sort size.  Selecting this data from a normalized structure would not cause this (nor would generated columns).

mysql> select @@sort_buffer_size;
+--------------------+
| @@sort_buffer_size |
+--------------------+
|             262144 |
+--------------------+
1 row in set (0.00 sec)

mysql> set @@sort_buffer_size=26214400;
Query OK, 0 rows affected (0.00 sec)

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating' desc limit 10;
+----------------------------------------------------+--------+-----------+
| title                                              | rating | imdb_id   |
+----------------------------------------------------+--------+-----------+
| The Mountain II (2016)                             | 9.9    | tt5813916 |
| Toxic Temptation (2016)                            | 9.9    | tt4621978 |
| 1985 (2016)                                        | 9.9    | tt5932546 |
| Jag har din rygg (2015)                            | 9.8    | tt3689312 |
| My Head Hurts (2000)                               | 9.8    | tt1346290 |
| Boz: Colors and Shapes (Video 2006)                | 9.8    | tt0876256 |
| Foreclosed (Video 2010)                            | 9.8    | tt1648984 |
| Nocturne in Black (2016)                           | 9.8    | tt4536608 |
| Monrad & Rislund: Det store triumftog (Video 2004) | 9.8    | tt0425266 |
| Questione di Sguardi (2014)                        | 9.8    | tt4423586 |
+----------------------------------------------------+--------+-----------+
10 rows in set, 65535 warnings (0.78 sec)

So we solved the issue of not having enough space to sort the dataset, but did you notice the time?  Almost 1 second.  Any website you are working on nowadays cannot succeed if all your database queries take 0.5-1 second to return.  Yes, you can put a cache in front of these and probably will.  A better cache solves it all, right?  Well, not really.  In this case, you can speed things up drastically with a few small but important improvements.  Before moving on, notice the warnings?  Let’s ignore those for a moment (they are caused by the null values in the rating column, but they are not impacting performance).

Let’s look at generated columns.  Generated columns allow you to create columns whose values are computed on the fly from other columns or expressions over them. Note that you can also use functional indexes, which I will cover later.  In the case of JSON, we can pull values out of our document and store them, read-only, in a column that can be indexed (we all know indexing is good, right!).  Let’s do this!  I am going to create a table with generated columns for various fields, then load the data via an insert from our already loaded movies table.

mysql> create table movies_json_generated (
     ->         ai_myid int AUTO_INCREMENT primary key,
     ->         imdb_id varchar(255) generated always as (`json_column` ->> '$.imdb_id'),
     ->         title varchar(255) generated always as (`json_column` ->> '$.title'),
     ->         imdb_rating decimal(5,2) generated always as (`json_column` ->> '$.imdb_rating'),
     ->         overview text generated always as (`json_column` ->> '$.overview'),
     ->         director json generated always as (`json_column` ->> '$.director'),
     ->         cast json generated always as (`json_column` ->> '$.cast'),
     ->         json_column json
     -> ) engine = innodb;
 Query OK, 0 rows affected (0.04 sec)

 mysql> create unique index imdb_idx on movies_json_generated(imdb_id);
 Query OK, 0 rows affected (0.04 sec)
 Records: 0  Duplicates: 0  Warnings: 0

 mysql> insert into movies_json_generated (json_column ) select json_column from  movies_json;
 ERROR 1366 (HY000): Incorrect decimal value: 'null' for column 'imdb_rating' at row 1
 mysql> Drop Table movies_json_generated;
 Query OK, 0 rows affected (0.04 sec)

I wanted to take a moment and dive into what I have seen as a common issue. As you can see, the table was created without a problem, but it failed when inserting data from our JSON objects.  The reason?  I am storing the IMDB rating in a decimal(5,2) field, but the JSON reference has ‘null’ (quoted) as text. Type conversions when working with JSON and MySQL can be a bit tricky for some people.  Remember those warnings above?  They were type conversion warnings.  Easy to ignore, as they did not appear to cause an immediate issue (but they did).  Type conversions and character sets can be a bit of a problem when accessing JSON data.

Allow me a slightly divergent rant:  if your application has an inconsistent or rapidly changing JSON format/schema, it can make using standard database functions difficult (if not impossible).

I often hear people talk about structure and the lack of flexibility as a massive drawback for relational databases. JSON is flexible; people love flexibility. The issue is that flexibility comes at a cost, and you will pay that cost somewhere.  If you are validating and maintaining a structure in your database, your code must fit the structure, and changes to the structure may be slow (database migrations).  If you use a flexible database schema, then you need to validate in your code.  That validation may be simple now, but as you grow, the more iterations or changes to your data there are, the more sprawling the code to check and validate it becomes. Whether you want to admit it or not, throwing unstructured data into the wasteland that is a document is a recipe for problems. When you store just a dump of JSON in your database, that data is sometimes difficult to access or manipulate without writing code and having access to that validation.

In this case, the type conversion is pretty straightforward and you can solve this issue in a couple of different ways. Ultimately it is about ensuring the JSON value is converted correctly. I will use the json_value function to unquote the null.

mysql> create table movies_json_generated (
    ->    ai_myid int AUTO_INCREMENT primary key,
    ->    imdb_id varchar(255) generated always as (`json_column` ->> '$.imdb_id')  ,
    ->    title varchar(255) generated always as (`json_column` ->> '$.title') ,
    ->    imdb_rating decimal(5,2) generated always as (json_value(json_column, '$.imdb_rating')) ,
    ->    overview text generated always as (`json_column` ->> '$.overview') ,
    ->    director json generated always as (`json_column` ->> '$.director') ,
    ->    cast json generated always as (`json_column` ->> '$.cast') ,
    ->    json_column json
    -> ) engine = innodb;

 mysql> 
 mysql> create unique index imdb_idx on movies_json_generated(imdb_id);
 Query OK, 0 rows affected (0.01 sec)
 Records: 0  Duplicates: 0  Warnings: 0

 mysql> insert into movies_json_generated (json_column ) select json_column from  movies_json;
 Query OK, 375359 rows affected (40.26 sec)
 Records: 375359  Duplicates: 0  Warnings: 0

Now let’s compare searching for a movie using first the imdb_id from the JSON document and then from the generated table using the column we indexed:

mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt2395427';
 +--------------------------------+-------------------------------+
 | json_column->>'$.title'        | json_column->>'$.imdb_rating' |
 +--------------------------------+-------------------------------+
 | Avengers: Age of Ultron (2015) | 7.5                           |
 +--------------------------------+-------------------------------+
 1 row in set (0.86 sec)

 mysql> select title, imdb_rating from movies_json_generated where imdb_id='tt2395427';
 +--------------------------------+-------------+
 | title                          | imdb_rating |
 +--------------------------------+-------------+
 | Avengers: Age of Ultron (2015) |        7.50 |
 +--------------------------------+-------------+
 1 row in set (0.01 sec)

Great!  Let’s go back to our example using IMDB rating to get the top 10 movies.  To make this faster, we will need to first create an index on the generated column.

mysql> create index idx_rating on movies_json_generated ( imdb_rating );
 Query OK, 0 rows affected (1.45 sec)
 Records: 0  Duplicates: 0  Warnings: 0

With that out of the way, let’s get the top 10 list:

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 10;
 +--------------------------------------------------+--------+-----------+
 | title                                            | rating | imdb_id   |
 +--------------------------------------------------+--------+-----------+
 | Advent (IV) (2016)                               | 10.0   | tt6129028 |
 | 311 Live: 3/11 Day 2006 (2006)                   | 10.0   | tt0872240 |
 | Light Study (2013)                               | 10.0   | tt3130306 |
 | Future Boyfriend (2016)                          | 10.0   | tt3955652 |
 | Cory in the House: All Star Edition (Video 2007) | 10.0   | tt2402070 |
 | 1985 (2016)                                      | 9.9    | tt5932546 |
 | Toxic Temptation (2016)                          | 9.9    | tt4621978 |
 | The Mountain II (2016)                           | 9.9    | tt5813916 |
 | Questione di Sguardi (2014)                      | 9.8    | tt4423586 |
 | Foreclosed (Video 2010)                          | 9.8    | tt1648984 |
 +--------------------------------------------------+--------+-----------+
 10 rows in set (0.01 sec)

A very nice drop, from 0.78 seconds to 0.01 seconds!  But wait… why is the data different?  Ah, glad you noticed!  As discussed above, pulling data out of JSON often requires some type of conversion.  By default, values coming out of JSON are treated as text, not numeric, so our original query was sorting on the ASCII value (oops).  You can get the same (correct) results from the JSON-only query by forcing the type conversion:

mysql> select json_column->>'$.title' as title,   json_column->>'$.imdb_rating' as rating,   json_column->>'$.imdb_id' as imdb_id  from movies_json_generated where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating'*1 desc limit 10;
+--------------------------------------------------+--------+-----------+
| title                                            | rating | imdb_id   |
+--------------------------------------------------+--------+-----------+
| 311 Live: 3/11 Day 2006 (2006)                   | 10.0   | tt0872240 |
| Advent (IV) (2016)                               | 10.0   | tt6129028 |
| Cory in the House: All Star Edition (Video 2007) | 10.0   | tt2402070 |
| Light Study (2013)                               | 10.0   | tt3130306 |
| Future Boyfriend (2016)                          | 10.0   | tt3955652 |
| Toxic Temptation (2016)                          | 9.9    | tt4621978 |
| The Mountain II (2016)                           | 9.9    | tt5813916 |
| 1985 (2016)                                      | 9.9    | tt5932546 |
| Nocturne in Black (2016)                         | 9.8    | tt4536608 |
| My Head Hurts (2000)                             | 9.8    | tt1346290 |
+--------------------------------------------------+--------+-----------+
10 rows in set, 65535 warnings (0.89 sec)

Alternatively, you can use:  cast(json_value(json_column,'$.imdb_rating') as float)
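For example, here is a sketch of the same top-10 query using that cast (it still scans every document; it simply fixes the typing so the ordering is numeric):

select json_column->>'$.title' as title, json_column->>'$.imdb_rating' as rating, json_column->>'$.imdb_id' as imdb_id from movies_json_generated where cast(json_value(json_column,'$.imdb_rating') as float) > 8 order by cast(json_value(json_column,'$.imdb_rating') as float) desc limit 10;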

So not only can you significantly speed up performance by indexing generated columns, you can also ensure that you are getting consistent types and the expected results.  The documentation has a detailed section on ordering and grouping by JSON values. Generated columns also improve most of the other queries we showed above.

Functional Indexes Without Generated Columns

Generated columns work well enough, but for our queries (in MySQL 8.0.13 or later) we can create indexes on the JSON functions we call regularly and forgo the generated columns altogether.

mysql>  create index title_idx on movies_json ((json_value(json_column,'$.title')));
Query OK, 0 rows affected (2.90 sec)
Records: 0  Duplicates: 0  Warnings: 0

This works well if you are looking for an exact match; if you need to use a LIKE or a wildcard, the functional index won’t be used. As you explore using functional indexes with JSON, be mindful that there are some character set and collation restrictions, as well as restrictions on which functions can be used, so your mileage will vary.  I will avoid a deep dive here, but you can review the documentation here https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts and here: https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html#json-column-indirect-index.  In my testing, the more complicated or nested the JSON was, the more problems I ran into.  For this reason, I stuck with generated columns for my examples above.
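For instance, with the title_idx functional index above in place, an exact-match lookup that repeats the same json_value() expression should be able to use it (a sketch; the title string is just an example):

select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_value(json_column,'$.title') = 'Avengers: Endgame (2019)';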

Multi-Valued Indexes

As of 8.0.17, MySQL also supports multi-valued indexes, a secondary index that allows you to index an array.  This is helpful if you have an array within your JSON (it does not appear to support arrays of characters at this time; you get the helpful “This version of MySQL doesn’t yet support ‘CAST-ing data to array of char/binary BLOBs’”).  This can help with straightforward JSON designs, but I ran into problems as the JSON became more nested and complex.
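To illustrate the syntax (a generic sketch following the pattern in the MySQL documentation rather than our movie dataset, and assuming 8.0.17 or later): you cast a JSON array to a typed array inside the index definition and then query it with MEMBER OF or JSON_CONTAINS.

create table customers (
    id int auto_increment primary key,
    custinfo json,
    index zips ((cast(custinfo->'$.zip_codes' as unsigned array)))
);

select * from customers where 94507 member of (custinfo->'$.zip_codes');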

Performance Summary

Super unscientific, but these query run times held over multiple iterations; look at the differences speed-wise:

Query                                                                 | Access JSON directly (seconds) | Generated column (seconds)
----------------------------------------------------------------------+--------------------------------+---------------------------
Simple search via IMDB ID                                             | 0.75                           | 0.01
Search for Avengers-titled movies                                     | 0.76                           | 0.01
Updating a single value within the JSON, searching via IMDB or title  | 0.80                           | 0.01
Find top 10 movies of all time                                        | 0.89                           | 0.01
Characters played by Robert Downey Jr. in the Avengers movies         | 0.74                           | 0.01

More Performance Needed? Normalizing Data for Query Patterns

So far we have done a lot of simple interactions and were able to speed up access to an acceptable level.  But not everything fits within the available toolbox.  When searching for movies or ratings for a specific cast member (show me all the Avengers movies Robert Downey Jr. played in and the characters), we used an index on the title generated column to reduce the number of JSON documents we had to fully process to get the character he played.  See below:

mysql> select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns( 
    ->    V_name varchar(200) path '$.name',
    ->    V_character varchar(200) path '$.character')
    ->    ) t where t.V_name like 'Robert Downey Jr.%'  and title like 'Avengers%';
+--------------------------------+-------------+-------------------+------------------------------------+
| title                          | imdb_rating | V_name            | V_character                        |
+--------------------------------+-------------+-------------------+------------------------------------+
| Avengers: Age of Ultron (2015) |        7.50 | Robert Downey Jr. | Tony Stark                         |
| Avengers: Endgame (2019)       |        9.00 | Robert Downey Jr. | Tony Stark /              Iron Man |
| Avengers: Infinity War (2018)  |        NULL | Robert Downey Jr. | Tony Stark /              Iron Man |
+--------------------------------+-------------+-------------------+------------------------------------+
3 rows in set (0.00 sec)

mysql> explain select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns(     V_name varchar(200) path '$.name',    V_character varchar(200) path '$.character')    ) t where t.V_name like 'Robert Downey Jr.%'  and title like 'Avengers%';
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key       | key_len | ref  | rows | filtered | Extra                                                    |
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | movies_json_generated | NULL       | range | title_idx     | title_idx | 1023    | NULL |    8 |   100.00 | Using where                                              |
|  1 | SIMPLE      | t                     | NULL       | ALL   | NULL          | NULL      | NULL    | NULL |    2 |    50.00 | Table function: json_table; Using temporary; Using where |
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

But let’s assume you needed to get a list of all the characters he played in his career (I will truncate the full result set).

mysql> select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns( 
    ->    V_name varchar(200) path '$.name',
    ->    V_character varchar(200) path '$.character')
    ->    ) t where t.V_name like 'Robert Downey Jr.%';


+-------------------------------------------------------------------------------------------------------------+-------------+-------------------+------------------------------------------------------------------------------------------------+
| title                                                                                                       | imdb_rating | V_name            | V_character                                                                                    |
+-------------------------------------------------------------------------------------------------------------+-------------+-------------------+------------------------------------------------------------------------------------------------+
| The 65th Annual Academy Awards (1993)                                                                       |        NULL | Robert Downey Jr. | Himself - Presenter                                                                            |
| Sherlock Holmes: A Game of Shadows (2011)                                                                   |        7.50 | Robert Downey Jr. | Sherlock Holmes                                                                                |
| Due Date (2010)                                                                                             |        6.60 | Robert Downey Jr. | Peter Highman                                                                                  |
| Eros (2004)                                                                                                 |        6.00 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")                                                           |
| The EE British Academy Film Awards (2015)                                                                   |        7.40 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough                                                         |
| "Saturday Night Live" John Lithgow/Mr. Mister (TV Episode 1985)                                             |        NULL | Robert Downey Jr. | Bruce Winston /              Rudy Randolph III /              Various       (as Robert Downey) |
+-------------------------------------------------------------------------------------------------------------+-------------+-------------------+------------------------------------------------------------------------------------------------+
213 rows in set (7.14 sec)



mysql> explain select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns(     V_name varchar(200) path '$.name',    V_character varchar(200) path '$.character')    ) t where t.V_name like 'Robert Downey Jr.%';
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
| id | select_type | table                 | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                                                    |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | movies_json_generated | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 358174 |   100.00 | NULL                                                     |
|  1 | SIMPLE      | t                     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |      2 |    50.00 | Table function: json_table; Using temporary; Using where |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

Now our access pattern needs to start at a more deeply nested element within the JSON document.  Here you are faced with a few options, but almost all of them lead to creating and maintaining a new table to get at that level of data, either via a trigger, code that breaks out the data as it is loaded, or a batch job (or caching this evil slow query).

I ended up creating the following “Database Schema” for it:

create table movies_normalized_meta (
        ai_myid int AUTO_INCREMENT primary key,
        imdb_id varchar(32),
        title varchar(255),
        imdb_rating decimal(5,2),
        json_column json
) engine = innodb;

create unique index imdb_id_idx  on movies_normalized_meta (imdb_id);
create index rating_idx  on movies_normalized_meta (imdb_rating);

create table movies_normalized_actors (
        ai_actor_id int auto_increment primary key,
        actor_id varchar(50),
        actor_name varchar(500)
        ) engine = innodb;
create index actor_id_idx  on movies_normalized_actors (actor_id);
create index actor_name_idx  on movies_normalized_actors (actor_name);

create table movies_normalized_cast (
        ai_actor_id int,
        ai_myid int,
        actor_character varchar(500)
        ) engine = innodb;

create index cast_id_idx  on movies_normalized_cast (ai_actor_id,ai_myid);
create index cast_id2_idx  on movies_normalized_cast (ai_myid);
create index cast_character_idx  on movies_normalized_cast (actor_character);
create unique index u_cast_idx  on movies_normalized_cast (ai_myid,ai_actor_id,actor_character);

On loading the JSON into MySQL, I added an actor table that has a row for each new unique actor, as well as a cast table that has the movie, the actor, and the name of the character they played in the movie (note I could optimize the structure a bit, but that’s for another day).  This gives me a ton of flexibility in reporting and a major performance boost.  Now, to get all the movies Robert Downey Jr. was in and the characters he played, I can do it via:

mysql> select title, imdb_rating, actor_character from movies_normalized_meta a, movies_normalized_cast b,  movies_normalized_actors c where a.ai_myid=b.ai_myid and b.ai_actor_id = c.ai_actor_id and actor_name='Robert Downey Jr.';


+-------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------+
| title                                                                                                       | imdb_rating | actor_character                                                                                |
+-------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------+
| The 65th Annual Academy Awards (1993)                                                                       |        NULL | Himself - Presenter                                                                            |
| Sherlock Holmes: A Game of Shadows (2011)                                                                   |        7.50 | Sherlock Holmes                                                                                |
| Due Date (2010)                                                                                             |        6.60 | Peter Highman                                                                                  |
| Eros (2004)                                                                                                 |        6.00 | Nick Penrose (segment "Equilibrium")     
 | Saturday Night Live in the '80s: Lost & Found (2005)                                                        |        7.20 | Various       (archive footage)                                                                |
| "Saturday Night Live" John Lithgow/Mr. Mister (TV Episode 1985)                                             |        NULL | Bruce Winston /              Rudy Randolph III /              Various       (as Robert Downey) |
+-------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------+
213 rows in set (0.01 sec)
                                           

      mysql> explain select title, imdb_rating, actor_character from movies_normalized_meta a, movies_normalized_cast b,  movies_normalized_actors c where a.ai_myid=b.ai_myid and b.ai_actor_id = c.ai_actor_id and actor_name='Robert Downey Jr.';
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
| id | select_type | table | partitions | type   | possible_keys                       | key            | key_len | ref                           | rows | filtered | Extra                 |
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
|  1 | SIMPLE      | c     | NULL       | ref    | PRIMARY,actor_name_idx              | actor_name_idx | 2003    | const                         |  213 |   100.00 | Using index           |
|  1 | SIMPLE      | b     | NULL       | ref    | u_cast_idx,cast_id_idx,cast_id2_idx | cast_id_idx    | 5       | movie_json_test.c.ai_actor_id |    2 |   100.00 | Using index condition |
|  1 | SIMPLE      | a     | NULL       | eq_ref | PRIMARY                             | PRIMARY        | 4       | movie_json_test.b.ai_myid     |    1 |   100.00 | NULL                  |
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
3 rows in set, 1 warning (0.00 sec)

Not only is this 7 seconds faster than accessing the same data via the JSON functions, I can also use these tables to easily do things like finding all the movies that had two or three specific actors together, playing six degrees of separation, etc., which would be a challenge when accessing only the JSON directly.
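For instance, a co-starring search becomes a plain self-join against the cast table (a sketch; the actor names are just examples):

select distinct m.title, m.imdb_rating
from movies_normalized_meta m
join movies_normalized_cast c1 on c1.ai_myid = m.ai_myid
join movies_normalized_actors a1 on a1.ai_actor_id = c1.ai_actor_id
join movies_normalized_cast c2 on c2.ai_myid = m.ai_myid
join movies_normalized_actors a2 on a2.ai_actor_id = c2.ai_actor_id
where a1.actor_name = 'Robert Downey Jr.' and a2.actor_name = 'Chris Evans';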

Take-Aways and Lessons Learned

First, think before you store your data.  Understanding what you are storing, why you are storing it, and how you will access it is paramount. How you will access and use your data has a profound impact on the optimal database setup, design, and usage.  If you only access top-level properties in your document, a simple design relying only on MySQL’s built-in JSON functions may be totally ok.  But as you want to dive deeper into the data and start pulling out subsets of data or correlating different documents, then to ensure performance and scalability you may end up reducing your flexibility and building (hold your breath) an actual database schema that is normalized and everything.

Second, MySQL’s JSON functions get the job done for most basic use cases.  You can get what you need, but you need to understand the implementations and what is available and what is not.  As I perform these same processes with PostgreSQL and MongoDB you will see where some functions are better than others.   

Finally,  don’t fear schemas!  Seriously, structure is good.  No matter how you access your data and use it, you will assume some structure (it all depends on where you want to enforce that structure).  

Next up in this series, I will dive into JSON with PostgreSQL followed by JSON and MongoDB. Stay tuned!

Aug
19
2021
--

Storing JSON in Your Databases: Tips and Tricks For MySQL Part One


Database architecture and design are becoming an increasingly lost art. With new technologies and the push towards faster development cycles, people continue to take shortcuts, often to the detriment of long-term performance, scalability, and security. Designing how your application stores, accesses, and processes data is so fundamentally important, it can not be overlooked. I want people to understand that early design choices can have a profound impact on their applications. To that end, I will be exploring database design principles and practices over the next several months. I am starting with every developer’s favorite data format: JSON!

It seems that almost every database over the last few years has introduced various degrees of support for storing and interacting with JSON objects directly. While these features are designed to make it easier for application developers to write code faster, the implementations tend to vary wildly from database to database and can cause some, well, weirdness. Over the next few weeks/months, I will show you some methods, mistakes, and common ways developers store JSON. Just because you can use a database’s native JSON support does not always mean you should! I hope to show you which ones work best for which use cases.

For part one of this series, I am going to focus on MySQL. MySQL’s implementation of the JSON data type was introduced back in 5.7 (late 2015/early 2016 timeframe). Since then, a few minor enhancements have made the implementation a bit more liveable. The current iteration, MySQL 8, offers a fully functional implementation of JSON functions and features. Let me show you some examples of how to store and interact with your JSON documents within MySQL.

Setup

For all my tests, I wanted a reasonable amount of data to test the performance implications of certain functions.  I opted to use the metadata JSON from http://movienet.site/, about 2.3GB of individual JSON files (one per movie).

I wrote a small python script to load and iterate through the JSON files and load them into MySQL.


I will walk through the examples and show you how I have seen many developers use MySQL to interact with JSON and point out why some of them may be incorrect or cause you issues you may not be aware of.  I will also show you a few other features you may want to look into and explore and offer some design advice.  Let us start with the following simple table definition:

create table movies_json (
   ai_myid int AUTO_INCREMENT primary key,
   imdb_id varchar(255),
   json_column json
) engine = innodb;

create unique index imdb_idx on movies_json(imdb_id);

 

{
  "imdb_id": "tt8408760",
  "tmdb_id": null,
  "douban_id": null,
  "title": "Rubes (2019)",
  "genres": [
    "Short",
    "Comedy",
    "Horror"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "7 min",
      "description": ""
    }
  ],
  "imdb_rating": null,
  "director": [
    {
      "id": "nm3216042",
      "name": "Nathan Alan Bunker"
    }
  ],
  "writer": null,
  "cast": [
    {
      "id": "nm1899908",
      "name": "Brendan Jennings",
      "character": "Milton"
    },
    {
      "id": "nm2384265",
      "name": "Ben Begley",
      "character": "Paul"
    },
    {
      "id": "nm2287013",
      "name": "Jerry Marr",
      "character": "Professor Henson"
    },
    {
      "id": "nm7529700",
      "name": "Allene Prince",
      "character": "Margaret"
    }
  ],
  "overview": null,
  "storyline": "Two disgruntled teachers use a Rube Goldberg machine to exact revenge on the people who have wronged them.",
  "plot": null,
  "synopsis": null
}

Above you can see an example of the JSON format for one movie.

101: Simple JSON Interactions in MySQL

Yes, a single column in a table with a key or two.  Each row stores one of the movies from the downloaded JSON files.  There is an auto_increment key and the IMDB ID that I extracted from the JSON during the load. This is a straightforward design with minimal effort.  However, this design also means you generally rely on MySQL as merely the storage for your data.  Provided you are accessing everything by the imdb_id key, you can easily fetch and update your JSON from your application with:

select json_column from movies_json where imdb_id = 'tt4154796';

update movies_json set json_column = '<new JSON>' where imdb_id = 'tt4154796';

Eventually, however, you will want to search within your JSON or return just a portion of the JSON document. For example, let’s say you only want to find the title and IMDB rating for a specified movie. You can do this with built-in functionality:

mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt2395427';
+--------------------------------+-------------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' |
+--------------------------------+-------------------------------+
| Avengers: Age of Ultron (2015) | 7.5                           |
+--------------------------------+-------------------------------+
1 row in set (0.77 sec)


mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt4154796';
+--------------------------+-------------------------------+
| json_column->>'$.title'  | json_column->>'$.imdb_rating' |
+--------------------------+-------------------------------+
| Avengers: Endgame (2019) | null                          |
+--------------------------+-------------------------------+
1 row in set (0.75 sec)

Here you can see we can interact with data inside the JSON column just like we would with standard data via SQL, using the special ->>'$.key' syntax.  You can see Avengers: Endgame has a rating of null!  That is no good, and it was a much better movie than that.  Instead of updating and storing the entire JSON document again, MySQL provides a JSON_SET function to set an element within a document.

mysql>  update movies_json 
       set json_column = JSON_SET(json_column, "$.imdb_rating", 9) 
       where json_column->>'$.imdb_id'='tt4154796';
 Query OK, 1 row affected (0.93 sec)
 Rows matched: 1  Changed: 1  Warnings: 0

 mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt4154796';
 +--------------------------+-------------------------------+
 | json_column->>'$.title'  | json_column->>'$.imdb_rating' |
 +--------------------------+-------------------------------+
 | Avengers: Endgame (2019) | 9                             |
 +--------------------------+-------------------------------+
 1 row in set (0.80 sec)

We now have fixed the missing rating for Endgame!  But we may not know the IMDB ID when we are searching. Just like working with standard data types, you can use data from within your document in a where clause.  In this case, we will look for all movies that start with “Avengers”.

mysql> select json_column->>'$.title', json_column->>'$.imdb_rating', json_column->>'$.imdb_id' from movies_json 
where json_column->>'$.title' like 'Avengers%';

+------------------------------------------------------------+-------------------------------+---------------------------+
| json_column->>'$.title'                                    | json_column->>'$.imdb_rating' | json_column->>'$.imdb_id' |
+------------------------------------------------------------+-------------------------------+---------------------------+
| Avengers: Endgame (2019)                                   | 9.0                           | tt4154796                 |
| Avengers Confidential: Black Widow & Punisher (Video 2014) | 5.8                           | tt3482378                 |
| Avengers of Justice: Farce Wars (2018)                     | null                          | tt6172666                 |
| Avengers: Age of Ultron (2015)                             | 7.5                           | tt2395427                 |
| Avengers: Infinity War (2018)                              | null                          | tt4154756                 |
| Avengers Grimm: Time Wars (Video 2018)                     | null                          | tt8159584                 |
| Avengers Assemble! (TV Series 2010– )                      | null                          | tt1779952                 |
| Avengers Grimm (Video 2015)                                | 2.8                           | tt4296026                 |
+------------------------------------------------------------+-------------------------------+---------------------------+
8 rows in set (0.74 sec)

Using json_column->>'$.title' in the where clause got us a nice list of Avengers-titled movies and TV shows.  But, as you can see from this query, we got more than just the blockbuster Avengers movies.  Let’s say you want to refine this a bit more and find just Avengers movies with Robert Downey Jr. in the cast.  This is a bit more difficult, honestly, because the format of our JSON documents uses an array for cast members.

Here is what the JSON looks like:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

 

You can access arrays in a JSON document by referencing the specific index of the element you want (e.g. $[0].name); however, if you don’t know which element contains the data you are looking for, you need to search for it.  MySQL has the function json_search to help with this (there are other functions such as json_contains as well).  json_search searches for a provided value and will return the location if found and null if not found:

mysql> select json_column->>'$.title', 
json_column->>'$.imdb_rating', 
json_column->>'$.imdb_id' from movies_json  
where json_column->>'$.title' like 'Avengers%' and json_search(json_column->>'$.cast', 'one','Robert Downey Jr.', NULL,'$[*].name' ) is not null;

+--------------------------------+-------------------------------+---------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' | json_column->>'$.imdb_id' |
+--------------------------------+-------------------------------+---------------------------+
| Avengers: Endgame (2019)       | 9                             | tt4154796                 |
| Avengers: Age of Ultron (2015) | 7.5                           | tt2395427                 |
| Avengers: Infinity War (2018)  | null                          | tt4154756                 |
+--------------------------------+-------------------------------+---------------------------+
3 rows in set (0.79 sec)

You will notice that I used the parameter ‘one’; this finds the first match.  You can also use ‘all’ to return every value matched.  In case you are curious as to what json_search actually returns, here is the output:

mysql>  select json_column->>'$.title' as title,
        json_search(json_column->>'$.cast', 'one','Robert Downey Jr.') as search_output
    from movies_json  where json_column->>'$.title' like 'Avengers%' 
and json_search(json_column->>'$.cast', 'one','Robert Downey Jr.', NULL,'$[*].name' ) is not null;

+--------------------------------+---------------+
| title                          | search_output |
+--------------------------------+---------------+
| Avengers: Endgame (2019)       | "$[0].name"   |
| Avengers: Age of Ultron (2015) | "$[0].name"   |
| Avengers: Infinity War (2018)  | "$[0].name"   |
+--------------------------------+---------------+
3 rows in set (0.72 sec)

You can see it returns the position and the property that contains the value.  This output is useful for a variety of reasons.  One is if you need to find which index value contains that particular text.  In the example of searching for Robert Downey Jr. movies, we can use this index information to return the character he played in each movie.  The first way I have seen this done requires a bit of unholy wrangling:

mysql> select json_column->>'$.title' as title, 
    json_column->>'$.imdb_rating' as Rating, 
    json_column->>'$.imdb_id' as IMDB_ID,
json_extract(json_column->>'$.cast',concat(substr(json_unquote(json_search(json_column->>'$.cast', 'one','Robert Downey Jr.')),1,
    -> locate('.',json_unquote(json_search(json_column->>'$.cast', 'one','Robert Downey Jr.')))),'character')) as Char_played
    from movies_json  where json_column->>'$.title' like 'Avengers%' and json_search(json_column->>'$.cast', 'one','Robert Downey Jr.') is not null;


+--------------------------------+--------+-----------+--------------------------------------+
| title                          | Rating | IMDB_ID   | Char_played                          |
+--------------------------------+--------+-----------+--------------------------------------+
| Avengers: Endgame (2019)       | 9      | tt4154796 | "Tony Stark /              Iron Man" |
| Avengers: Age of Ultron (2015) | 7.5    | tt2395427 | "Tony Stark"                         |
| Avengers: Infinity War (2018)  | null   | tt4154756 | "Tony Stark /              Iron Man" |
+--------------------------------+--------+-----------+--------------------------------------+
3 rows in set (0.68 sec)

Here I am finding where in the document Robert Downey Jr. is listed, then extracting the index and using that with the json_extract function to pull out the value of “$[0].character” instead of “$[0].name”.  While this works, it is ugly.  MySQL provides an alternative by using json_table.

mysql> select json_column->>'$.title',  json_column->>'$.imdb_rating', t.* from movies_json, json_table(json_column, '$.cast[*]' columns( 
       V_name varchar(200) path '$.name',
       V_character varchar(200) path '$.character')
       ) t where t.V_name like 'Robert Downey Jr.%' and json_column->>'$.title' like 'Avengers%';


+--------------------------------+-------------------------------+-------------------+------------------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' | V_name            | V_character                        |
+--------------------------------+-------------------------------+-------------------+------------------------------------+
| Avengers: Endgame (2019)       | 9                             | Robert Downey Jr. | Tony Stark /              Iron Man |
| Avengers: Age of Ultron (2015) | 7.5                           | Robert Downey Jr. | Tony Stark                         |
| Avengers: Infinity War (2018)  | null                          | Robert Downey Jr. | Tony Stark /              Iron Man |
+--------------------------------+-------------------------------+-------------------+------------------------------------+
3 rows in set (0.74 sec)

Basically, json_table takes an array and turns it into a table object, allowing you to join and query it. You can also use it to list every character an actor has played across his entire career, as sketched below. Stay tuned for part two of this series, where we will show you some easier and faster ways to use JSON from inside MySQL.
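
As a rough sketch of that career-wide listing, assuming the same movies_json table, you can drop the title filter so json_table unrolls every cast array the actor appears in:

mysql> select json_column->>'$.title' as title, t.V_character from movies_json, json_table(json_column, '$.cast[*]' columns(
       V_name varchar(200) path '$.name',
       V_character varchar(200) path '$.character')
       ) t where t.V_name like 'Robert Downey Jr.%';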

Dec
02
2020
--

Fylamynt raises $6.5M for its cloud workflow automation platform

Fylamynt, a new service that helps businesses automate their cloud workflows, today announced both the official launch of its platform as well as a $6.5 million seed round. The funding round was led by Google’s AI-focused Gradient Ventures fund. Mango Capital and Point72 Ventures also participated.

At first glance, the idea behind Fylamynt may sound familiar. Workflow automation has become a pretty competitive space, after all, and the service helps developers connect their various cloud tools to create repeatable workflows. We’re not talking about your standard IFTTT- or Zapier-like integrations between SaaS products, though. The focus of Fylamynt is squarely on building infrastructure workflows. While that may sound familiar, too, with tools like Ansible and Terraform automating a lot of that already, Fylamynt sits on top of those and integrates with them.

Image Credits: Fylamynt

“Some time ago, we used to do Bash and scripting — and then [ … ] came Chef and Puppet in 2006, 2007. SaltStack, as well. Then Terraform and Ansible,” Fylamynt co-founder and CEO Pradeep Padala told me. “They have all done an extremely good job of making it easier to simplify infrastructure operations so you don’t have to write low-level code. You can write a slightly higher-level language. We are not replacing that. What we are doing is connecting that code.”

So if you have a Terraform template, an Ansible playbook and maybe a Python script, you can now use Fylamynt to connect those. In the end, Fylamynt becomes the orchestration engine to run all of your infrastructure code, and then allows you to connect all of that to the likes of Datadog, Splunk, PagerDuty, Slack and ServiceNow.

Image Credits: Fylamynt

The service currently connects to Terraform, Ansible, Datadog, Jira, Slack, Instance, CloudWatch, CloudFormation and your Kubernetes clusters. The company notes that some of the standard use cases for its service are automated remediation, governance and compliance, as well as cost and performance management.

The company is already working with a number of design partners, including Snowflake.

Fylamynt CEO Padala has quite a bit of experience in the infrastructure space. He co-founded ContainerX, an early container-management platform, which later sold to Cisco. Before starting ContainerX, he was at VMWare and DOCOMO Labs. His co-founders, VP of Engineering Xiaoyun Zhu and CTO David Lee, also have deep expertise in building out cloud infrastructure and operating it.

“If you look at any company — any company building a product — let’s say a SaaS product, and they want to run their operations, infrastructure operations very efficiently,” Padala said. “But there are always challenges. You need a lot of people, it takes time. So what is the bottleneck? If you ask that question and dig deeper, you’ll find that there is one bottleneck for automation: that’s code. Someone has to write code to automate. Everything revolves around that.”

Fylamynt aims to take the effort out of that by allowing developers to either write Python and JSON to automate their workflows (think “infrastructure as code” but for workflows) or to use Fylamynt’s visual no-code drag-and-drop tool. As Padala noted, this gives developers a lot of flexibility in how they want to use the service. If you never want to see the Fylamynt UI, you can go about your merry coding ways, but chances are the UI will allow you to get everything done as well.

One area the team is currently focusing on — and will use the new funding for — is building out its analytics capabilities that can help developers debug their workflows. The service already provides log and audit trails, but the plan is to expand its AI capabilities to also recommend the right workflows based on the alerts you are getting.

“The eventual goal is to help people automate any service and connect any code. That’s the holy grail. And AI is an enabler in that,” Padala said.

Gradient Ventures partner Muzzammil “MZ” Zaveri echoed this. “Fylamynt is at the intersection of applied AI and workflow automation,” he said. “We’re excited to support the Fylamynt team in this uniquely positioned product with a deep bench of integrations and a nonprescriptive builder approach. The vision of automating every part of a cloud workflow is just the beginning.”

The team, which now includes about 20 employees, plans to use the new round of funding, which closed in September, to focus on its R&D, build out its product and expand its go-to-market team. On the product side, that specifically means building more connectors.

The company offers both a free plan as well as enterprise pricing and its platform is now generally available.

Jun
18
2019
--

MongoDB gets a data lake, new security features and more

MongoDB is hosting its developer conference today and, unsurprisingly, the company has quite a few announcements to make. Some are straightforward, like the launch of MongoDB 4.2 with some important new security features, while others, like the launch of the company’s Atlas Data Lake, point the company beyond its core database product.

“Our new offerings radically expand the ways developers can use MongoDB to better work with data,” said Dev Ittycheria, the CEO and president of MongoDB. “We strive to help developers be more productive and remove infrastructure headaches — with additional features along with adjunct capabilities like full-text search and data lake. IDC predicts that by 2025 global data will reach 175 Zettabytes and 49% of it will reside in the public cloud. It’s our mission to give developers better ways to work with data wherever it resides, including in public and private clouds.”

The highlight of today’s set of announcements is probably the launch of MongoDB Atlas Data Lake. Atlas Data Lake allows users to query data stored in AWS S3 using the MongoDB Query Language, no matter its format, including JSON, BSON, CSV, TSV, Parquet and Avro. To get started, users only need to point the service at their existing S3 buckets. They don’t have to manage servers or other infrastructure. Support for Data Lake on Google Cloud Storage and Azure Storage is in the works and will launch in the future.

Also new is Full-Text Search, which gives users access to advanced text search features based on the open-source Apache Lucene 8.

In addition, MongoDB is also now starting to bring together Realm, the mobile database product it acquired earlier this year, and the rest of its product lineup. Using the Realm brand, Mongo is merging its serverless platform, MongoDB Stitch, and Realm’s mobile database and synchronization platform. Realm’s synchronization protocol will now connect to MongoDB Atlas’ cloud database, while Realm Sync will allow developers to bring this data to their applications. 

“By combining Realm’s wildly popular mobile database and synchronization platform with the strengths of Stitch, we will eliminate a lot of work for developers by making it natural and easy to work with data at every layer of the stack, and to seamlessly move data between devices at the edge to the core backend,”  explained Eliot Horowitz, CTO and co-founder of MongoDB.

As for the latest release of MongoDB, the highlight of the release is a set of new security features. With this release, Mongo is implementing client-side Field Level Encryption. Traditionally, database security has always relied on server-side trust. This typically leaves the data accessible to administrators, even if they don’t have client access. If an attacker breaches the server, that’s almost automatically a catastrophic event.

With this new security model, Mongo is shifting access to the client and to the local drivers. It provides multiple encryption options; to make use of them, developers will use a new “encrypt” JSON schema attribute.

This ensures that all application code can generally run unmodified, and even the admins won’t get access to the database or its logs and backups unless they get client access rights themselves. Because the logic resides in the drivers, the encryption is also handled entirely separately from the actual database.

Other new features in MongoDB 4.2 include support for distributed transactions and the ability to manage MongoDB deployments from a single Kubernetes control plane.

Apr
02
2019
--

How to handle dark data compliance risk at your company

Slack and other consumer-grade productivity tools have been taking off in workplaces large and small — and data governance hasn’t caught up.

Whether it’s litigation, compliance with regulations like GDPR or concerns about data breaches, legal teams need to account for new types of employee communication. And that’s hard when work is happening across the latest messaging apps and SaaS products, which make data searchability and accessibility more complex.

Here’s a quick look at the problem, followed by our suggestions for best practices at your company.

Problems

The increasing frequency of reported data breaches and the expanding jurisdiction of new privacy laws are prompting conversations about dark data and risks at companies of all sizes, even small startups. Data risk discussions necessarily include the risk of a data breach, as well as preservation of data. Just two weeks ago it was reported that Jared Kushner used WhatsApp for official communications and took screenshots of those messages for preservation, which commentators say complies with record-keeping laws but raises questions about potential admissibility as evidence.
