Oct 07, 2021

Using the Range and the New Multirange Data Type in PostgreSQL 14


Don’t miss the great overview of PostgreSQL 14 that Umair Shahid recently put together, PostgreSQL 14 – Performance, Security, Usability, and Observability!

Range data types have been in PostgreSQL for some time now; however, PostgreSQL 14 introduces multirange data types. Before we get to those, let’s cover the basics: what the heck kind of voodoo black magic is a range data type, and why would you ever want to use one? Then we can dive into the multirange variety.

Range data types store a beginning and an end value.  The stored data could be an integer or a timestamp, and, if you are interested, you can also create your own custom range types.  Why would you use this?  I can think of a few interesting use cases, but the most generic one is keeping track of state changes and durations.  For example:

  • This machine was running from X time to Y time.
  • This room was occupied between X and Y.
  • This sale/price is active within these time frames only.
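
To make this concrete, here is a minimal sketch (with invented times): a range is a single value holding both bounds, and you can query it directly.

-- A tsrange literal bundles the start and end into one value.
select '[2021-10-01 06:00, 2021-10-01 10:00]'::tsrange;

-- Does this range contain a given point in time?
select '[2021-10-01 06:00, 2021-10-01 10:00]'::tsrange @> '2021-10-01 07:35'::timestamp;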

Range data has been stored for ages without a specific “range” data type.  Let’s show you how we used to do it and how range and multi-range types help us.

For our example, let us assume you want to track the days and times the HOSS wears his Cleveland Browns hat vs. his elephant hat (practical, right?).


Using Classic SQL to Interact with Ranges

Here is how you would classically do something like this:

create table matts_hats_1 ( 
id serial primary key, 
matts_hat varchar(100),
start_ts timestamp, 
end_ts timestamp
);

insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Browns Hat', '2021-10-01 6:00', '2021-10-01 10:00');
insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Elephant Hat', '2021-10-01 10:00', '2021-10-01 12:00');
insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Browns Hat', '2021-10-01 14:00', '2021-10-01 20:00');
insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Elephant Hat', '2021-10-01 22:00', '2021-10-01 23:00');

You can insert which hat I am wearing and when I start and stop wearing that hat. To see what hat I am wearing at any one time, I would use:

yonk=# select id, matts_hat from matts_hats_1 where start_ts <= '2021-10-01 7:35'::timestamp and end_ts >= '2021-10-01 7:35'::timestamp;
 id | matts_hat  
----+------------
  1 | Browns Hat

To see what hat I am wearing from 7 am to 11 am, I would need to make a couple of adjustments.  First, I may have started wearing the hat before 7 am and potentially kept wearing it after 11 am.  To find these, I need to look at start times before 7 am with end times after 7 am, and start times before 11 am with end times after 11 am.  (Strictly speaking, a hat worn entirely inside the 7 am to 11 am window would need yet another condition, which only makes the query messier.)  That looks like this:

yonk=# select id, matts_hat from matts_hats_1 where (start_ts <= '2021-10-01 07:00'::timestamp and end_ts >= '2021-10-01 07:00'::timestamp) or (start_ts <= '2021-10-01 11:00'::timestamp and end_ts >= '2021-10-01 11:00'::timestamp);
 id |  matts_hat   
----+--------------
  1 | Browns Hat
  2 | Elephant Hat
(2 rows)

The issue with this approach is that it quickly gets complicated, and it defeats proper indexing.  Watch what happens when we try to add an index:

yonk=# create index matts_hats_1_idx on matts_hats_1 (start_ts, end_ts);
CREATE INDEX

yonk=# explain select id, matts_hat from matts_hats_1 where (start_ts <= '2021-10-01 07:00'::timestamp and end_ts >= '2021-10-01 07:00'::timestamp) or (start_ts <= '2021-10-01 11:00'::timestamp and end_ts >= '2021-10-01 11:00'::timestamp);
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on matts_hats_1  (cost=0.00..1.08 rows=1 width=222)
   Filter: (((start_ts <= '2021-10-01 07:00:00'::timestamp without time zone) AND (end_ts >= '2021-10-01 07:00:00'::timestamp without time zone)) OR ((start_ts <= '2021-10-01 11:00:00'::timestamp without time zone) AND (end_ts >= '2021-10-01 11:00:00'::timestamp without time zone)))
(2 rows)

Introduction to Ranges in PostgreSQL

This is where the “range” data types come into play.  Let’s take a look at this data and query using a range.

create table matts_hats_2 ( 
id serial primary key, 
matts_hat varchar(100),
hat_ts tsrange
);

insert into matts_hats_2 (matts_hat,hat_ts) values ('Browns Hat', '[2021-10-01 6:00, 2021-10-01 10:00]');
insert into matts_hats_2 (matts_hat,hat_ts) values ('Elephant Hat', '[2021-10-01 10:00, 2021-10-01 12:00]');
insert into matts_hats_2 (matts_hat,hat_ts) values ('Browns Hat', '[2021-10-01 14:00, 2021-10-01 20:00]');
insert into matts_hats_2 (matts_hat,hat_ts) values ('Elephant Hat', '[2021-10-01 22:00, 2021-10-01 23:00]');

Out of the box, you can create ranges over integers, numerics, timestamps, and dates, and if you need another subtype (like a float) you can define a custom range type (check the docs for more info).  You can also put constraints on ranges, for example to prevent data from overlapping, but I won’t cover that in depth in this blog.
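
Neither is needed for the rest of this post, but for the curious, here is a hedged sketch of both: a custom range type over float8 (the example comes straight from the PostgreSQL docs) and an exclusion constraint that rejects overlapping ranges.

-- A custom range type over float8 (example from the PostgreSQL docs).
create type floatrange as range (
    subtype = float8,
    subtype_diff = float8mi
);

-- A table whose exclusion constraint rejects any overlapping hat_ts ranges.
create table matts_hats_no_overlap (
    id serial primary key,
    matts_hat varchar(100),
    hat_ts tsrange,
    exclude using gist (hat_ts with &&)
);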

Now if we wanted to find what hat I was wearing at 7:35 am like above, I would find that with the following:

yonk=# select id, matts_hat from matts_hats_2 where hat_ts @> '2021-10-01 7:35'::timestamp;
 id | matts_hat  
----+------------
  1 | Browns Hat

Notice the different operator when checking the range?  Instead of “=”, I am using “@>” (the containment operator).  PostgreSQL has a set of operators for interacting with ranges. In this case, @> checks whether the range on the left (hat_ts) contains the value ‘2021-10-01 7:35’.  The operators I use most in this blog are:

@> , <@   Checks whether an element or range is contained in / contains the other value
&&        Do the ranges overlap
+         Creates a union of two ranges
-         Removes one range from another

There are other operators as well, so check out the docs for the complete list.
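
Here’s a quick, hedged illustration of those operators in action (integer ranges keep the examples short; expected results are in the comments):

select '[1,10]'::int4range @> 5;                     -- true: the range contains the element
select '[2,4]'::int4range <@ '[1,10]'::int4range;    -- true: the range is contained by the other
select '[1,5]'::int4range && '[4,9]'::int4range;     -- true: the ranges overlap
select '[1,5]'::int4range + '[4,9]'::int4range;      -- [1,10): the union of the two ranges
select '[1,9]'::int4range - '[5,9]'::int4range;      -- [1,5):  one range removed from the other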

Now to get the hats I was wearing between 7 am and 11 am with a tsrange data type, I would issue the following command:

yonk=# select id, matts_hat from matts_hats_2 where hat_ts && '[2021-10-01 7:00, 2021-10-01 11:00]';
  id |  matts_hat   
 ----+--------------
   1 | Browns Hat
   2 | Elephant Hat

Notice how much cleaner that looks.  In this case, we are checking whether the 7 am-11 am range overlaps the hat_ts field.  Indexing also works with range data types via GiST.

CREATE INDEX matts_hats_2_idx ON matts_hats_2 USING GIST (hat_ts);
  
yonk=# explain select id, matts_hat from matts_hats_2 where hat_ts && '["2021-10-01 7:00","2021-10-01 11:00"]';
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Bitmap Heap Scan on matts_hats_2  (cost=4.17..14.51 rows=3 width=15)
   Recheck Cond: (hat_ts && '["2021-10-01 07:00:00","2021-10-01 11:00:00"]'::tsrange)
   ->  Bitmap Index Scan on matts_hats_2_idx  (cost=0.00..4.17 rows=3 width=0)
         Index Cond: (hat_ts && '["2021-10-01 07:00:00","2021-10-01 11:00:00"]'::tsrange)

I can modify the ranges as well, adding or removing time from what is already stored.  Let’s go ahead and remove 30 minutes from my hat-wearing time.

yonk=# update matts_hats_2 set hat_ts = hat_ts - '[2021-10-01 14:00,2021-10-01 14:30]' where id = 3;
UPDATE 1

yonk=# select * from matts_hats_2 where id = 3;
 id | matts_hat  |                    hat_ts                     
----+------------+-----------------------------------------------
  3 | Browns Hat | ("2021-10-01 14:30:00","2021-10-01 20:00:00"]
(1 row)

This method works as long as you are shrinking or extending the range.  However, removing or adding ranges gets tricky (at least before PostgreSQL 14) when removing the middle of a range or adding a non-contiguous hat-wearing time.

yonk=# update matts_hats_2 set hat_ts = hat_ts - '[2021-10-01 15:00,2021-10-01 15:30]' where id = 3;
ERROR:  result of range difference would not be contiguous
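
The classic workaround is to split the row at the gap. A hedged sketch of what the failed update above would have to become: shrink the existing row so it ends where the gap starts, and insert a second row for the remainder.

-- Pre-PG14 workaround: split the row at the gap.
update matts_hats_2
   set hat_ts = '(2021-10-01 14:30, 2021-10-01 15:00)'
 where id = 3;

insert into matts_hats_2 (matts_hat, hat_ts)
   values ('Browns Hat', '(2021-10-01 15:30, 2021-10-01 20:00]');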

Introduction to Multi-Ranges in PostgreSQL 14

Before PostgreSQL 14, you would need to add new records to represent multiple non-contiguous blocks, as sketched above.  However, in PG14 you have the option of using a multirange type.  Let’s see how this works:

create table matts_hats_3 ( 
id serial primary key, 
matts_hat varchar(100),
hat_date date,
hat_ts tsmultirange
);

insert into matts_hats_3 (matts_hat,hat_date,hat_ts) values ('Browns Hat',   '2021-10-01', '{[2021-10-01 6:00, 2021-10-01 10:00],[2021-10-01 14:00, 2021-10-01 20:00]}');
insert into matts_hats_3 (matts_hat,hat_date,hat_ts) values ('Elephant Hat', '2021-10-01', '{[2021-10-01 10:00, 2021-10-01 12:00],[2021-10-01 22:00, 2021-10-01 23:00]}');

You can see I consolidated all my time wearing each hat into one record for each.  I can run the same queries I used for the range examples here with the same results.  Now, however, I can add or remove additional non-contiguous ranges into that record:

yonk=# select * from matts_hats_3 where id = 1;
 id | matts_hat  |  hat_date  |                                            hat_ts                                             
----+------------+------------+-----------------------------------------------------------------------------------------------
  1 | Browns Hat | 2021-10-01 | {["2021-10-01 06:00:00","2021-10-01 10:00:00"],["2021-10-01 14:00:00","2021-10-01 20:00:00"]}
(1 row)

yonk=# update matts_hats_3 set hat_ts = hat_ts - '{[2021-10-01 15:00,2021-10-01 15:30]}' where id = 1;
UPDATE 1
yonk=# select * from matts_hats_3 where id = 1;
 id | matts_hat  |  hat_date  |                                                                   hat_ts                                                                    
----+------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------
  1 | Browns Hat | 2021-10-01 | {["2021-10-01 06:00:00","2021-10-01 10:00:00"],["2021-10-01 14:00:00","2021-10-01 15:00:00"),("2021-10-01 15:30:00","2021-10-01 20:00:00"]}
(1 row)

Removing a block of time from the middle of the range now creates a third range.  This functionality can be an incredibly efficient and powerful way of storing and seeing when something is active or not.  This can also be useful for auditing and determining what conditions were active during a certain period.
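
Adding a non-contiguous block works the same way, just with the + operator instead. A quick sketch:

-- Tack a separate, non-contiguous block of hat-wearing time onto the same row.
update matts_hats_3
   set hat_ts = hat_ts + '{[2021-10-01 21:00, 2021-10-01 21:30]}'
 where id = 1;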

Word of caution:  Just because you can do something does not mean you should.  I have not been able to find any hard limit on the number of distinct ranges you can store in a single field.  In fact, I was able to add over 100K distinct ranges into one field without PostgreSQL complaining.  That said, the performance of modifying or using that many different ranges is potentially very limiting; this was especially telling when adding or splitting ranges.

More 101 on Using Ranges in PostgreSQL

You may have noticed the ranges are bracketed with “[ ]”; these brackets have a special meaning in PostgreSQL.  They specify whether the range includes the lower/upper bound value when evaluating the data.   Here is a quick primer:

Symbol   Description
(        Exclusive lower bound; think of this like a >  (will not include the specified lower value)
)        Exclusive upper bound; think of this like a <  (will not include the specified upper value)
[        Inclusive lower bound; think of this like a >= (will include the specified lower value)
]        Inclusive upper bound; think of this like a <= (will include the specified upper value)
{}       Reserved for multirange data types

You can visualize this if you load data into a test table:

create table test_range (
   int_range int4range,
   int_mrange int4multirange,
   ts_range tsrange,
   ts_mrange tsmultirange,
   my_desc varchar(100)
 );

insert into test_range values ('[1,10]','{[1,4],[6,10]}','[2021-10-01 10:30,2021-10-01 22:30]','{[2021-10-01 10:30,2021-10-01 12:30],[2021-10-01 16:30,2021-10-01 22:30]}','Inclusive upper/ Inclusive lower');
insert into test_range values ('(1,10]','{(1,4],(6,10]}','(2021-10-01 10:30,2021-10-01 22:30]','{(2021-10-01 10:30,2021-10-01 12:30],(2021-10-01 16:30,2021-10-01 22:30]}','Inclusive upper/ Exclusive lower');
insert into test_range values ('(1,10)','{(1,4),(6,10)}','(2021-10-01 10:30,2021-10-01 22:30)','{(2021-10-01 10:30,2021-10-01 12:30),(2021-10-01 16:30,2021-10-01 22:30)}','Exclusive upper/ Exclusive lower');
insert into test_range values ('[1,10)','{[1,4),[6,10)}','[2021-10-01 10:30,2021-10-01 22:30)','{[2021-10-01 10:30,2021-10-01 12:30),[2021-10-01 16:30,2021-10-01 22:30)}','Exclusive upper/ Inclusive lower');

yonk=#  select * from test_range;
 int_range |   int_mrange   |                   ts_range                    |                                           ts_mrange                                           |             my_desc              
-----------+----------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------+----------------------------------
 [1,11)    | {[1,5),[6,11)} | ["2021-10-01 10:30:00","2021-10-01 22:30:00"] | {["2021-10-01 10:30:00","2021-10-01 12:30:00"],["2021-10-01 16:30:00","2021-10-01 22:30:00"]} | Inclusive upper/ Inclusive lower
 [2,11)    | {[2,5),[7,11)} | ("2021-10-01 10:30:00","2021-10-01 22:30:00"] | {("2021-10-01 10:30:00","2021-10-01 12:30:00"],("2021-10-01 16:30:00","2021-10-01 22:30:00"]} | Inclusive upper/ Exclusive lower
 [2,10)    | {[2,4),[7,10)} | ("2021-10-01 10:30:00","2021-10-01 22:30:00") | {("2021-10-01 10:30:00","2021-10-01 12:30:00"),("2021-10-01 16:30:00","2021-10-01 22:30:00")} | Exclusive upper/ Exclusive lower
 [1,10)    | {[1,4),[6,10)} | ["2021-10-01 10:30:00","2021-10-01 22:30:00") | {["2021-10-01 10:30:00","2021-10-01 12:30:00"),["2021-10-01 16:30:00","2021-10-01 22:30:00")} | Exclusive upper/ Inclusive lower
(4 rows)

Final Thoughts

The range data type was already a unique and powerful feature, and the new multirange type’s ability to store multiple non-contiguous ranges gives developers yet another tool in their toolbox to get even more out of PostgreSQL.

As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.


Oct 05, 2021

Using New JSON Syntax in PostgreSQL 14 – Update


As previously mentioned here on the blog, PostgreSQL 14 was just released. Need a primer? Umair Shahid recently put together a great overview, PostgreSQL 14 – Performance, Security, Usability, and Observability.

I recently did a deep dive into using JSON from within PostgreSQL in a two-part blog series, Storing and Using JSON Within PostgreSQL. While all the information from those two posts is still relevant with the release of PostgreSQL 14 (PG14), a quick update is in order to show off some improvements.

Keeping this brief:  Prior to PG14, pulling fields out of JSON looked something like the following:

movie_json_test=# select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_jsonb  where (jsonb_column->>'title')::varchar = 'Avengers: Endgame (2019)';
          title           | rating |  imdb_id  
--------------------------+--------+-----------
 Avengers: Endgame (2019) |        | tt4154796
(1 row)

While this worked, it is not terribly intuitive for developers who may be used to interacting with JSON in code.   Starting in PostgreSQL 14, you can use a much more natural and user-friendly subscript syntax to access data.

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   
----------------------------+--------+-------------
 "Avengers: Endgame (2019)" | null   | "tt4154796"
(1 row)

Not only can you access top-level attributes, but you can also access nested arrays and lists as well:

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id, jsonb_column['cast'][0]  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   |                                            jsonb_column                                             
----------------------------+--------+-------------+-----------------------------------------------------------------------------------------------------
 "Avengers: Endgame (2019)" | null   | "tt4154796" | {"id": "nm0000375", "name": "Robert Downey Jr.", "character": "Tony Stark /              Iron Man"}
(1 row)

Overall, this syntax still makes use of all the same indexes, JSON functions, etc., that were previously available.  While this is a relatively minor change, it should make things much more natural for developers and hopefully drive more adoption and usage.
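
The subscript syntax works for writes as well. A hedged sketch against the same movies_jsonb table (the new rating value is invented):

-- PG14 jsonb subscripting in an UPDATE: assign straight through the subscript.
-- The assigned value must be valid jsonb ('9.0' parses as a JSON number).
update movies_jsonb
   set jsonb_column['imdb_rating'] = '9.0'
 where jsonb_column['title'] = '"Avengers: Endgame (2019)"';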


Sep 08, 2021

Storing and Using JSON Within PostgreSQL Part Two


We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly.  Most of the functionality we talked about in part one works well when you have a simple JSON format.  However, in real life, our documents may get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the  following:

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}'

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."'

However, what if we also need to pull out the character from the movie?  To get a full list of the actors and characters who were in this particular movie, we can use jsonb_to_recordset (similar to MySQL’s json_table function, which we covered in the MySQL part of this series).

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character     
-----------+--------------------------------+-------------+-----------+----------------------+------------------
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10; 
ERROR:  cannot call jsonb_to_recordset on a non-array

Looking at the results, you can see why: the function expects an array, and several of our movies have no cast (a null value in the JSON).

movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof 
-----------+--------------
 tt7620156 | 
 tt0109524 | array
 tt0534466 | array
 tt0111091 | 
 tt4167726 | array
 tt0638383 | array
 tt6346314 | array
 tt5877808 | array
 tt4098782 | 
 tt0365100 | array
(10 rows)

You can work around this “Null” issue in a couple of different ways.  The easiest is by converting to text, then back into JSON.  For example:

select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in part one how I harped on types and on casting to the right data type when needed?  This is another example.  Here, ->> pulls the JSON null out as an SQL NULL, and casting that NULL back to jsonb leaves an SQL NULL, which jsonb_to_recordset quietly treats as zero rows instead of raising an error.
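
Two one-liners make the difference visible (a hedged illustration; expected results in the comments):

select * from jsonb_to_recordset('null'::jsonb) as t(id text);
-- ERROR:  cannot call jsonb_to_recordset on a non-array

select * from jsonb_to_recordset(null::jsonb) as t(id text);
-- returns zero rows instead of an error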

While that is the easiest way, let’s look at some other ways to work around this, to highlight some of the other functions, indexes, etc., we learned about in part one.  Because we know that we have NULL values for some cast entries, we could filter out the rows where the cast is null.  For example, here is a simple check that the text version of jsonb_column->'cast' is not equal to 'null'.

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
 tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
 tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
 tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
 tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
 tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
 
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
 Planning Time: 0.064 ms
 Execution Time: 84.692 ms
(10 rows)

This is not terribly fast, but it does work. It is, however, basically working around the functionality built into the JSON functions.  To speed our query up, we can, of course, index this column; however, some of the data in our movie cast lists is just too large:

movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191

Of course, we can work around this as well.  But sometimes, it’s best to pivot.  Let’s look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we could create either an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored; 
ALTER TABLE
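
For completeness, the expression-index flavor of the same idea might look like the sketch below (the index name is mine, and the WHERE clause has to repeat the exact indexed expression for the planner to consider it):

-- Index the null-check expression itself instead of storing a new column.
create index gen_cast_null_idx
    on movies_json_generated (((jsonb_column ->> 'cast') is null));

-- Queries would then filter on the same expression, e.g.:
--   ... where ((jsonb_column ->> 'cast') is null) = false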


movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
 tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
 tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
 tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
 tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
 tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)


movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
 Planning Time: 0.068 ms
 Execution Time: 30.572 ms
(10 rows)

You can see there are several options here for dealing with the nulls, some way easier (and cleaner) than others.  I want to highlight some of the challenges this brings up with using unstructured data within a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that you can evolve what you store over time.  Let’s use the above example.  Let’s say that for years, every movie that is fed into your system has a full cast listing of characters and actors.  Then, one day, the feed you get your data from allows movies without a cast listing.  Your application will still work, your load scripts will still work.  But every once in a while, your users will get a weird error, or you will see small flashes in your logs (if you are logging these).  But 99.9% of queries are fine.  It is these transient issues that drive people bonkers.

Looking at this problem slightly differently, what if you start adding data or changing the order of certain items?  Back to our original JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

What if we wanted to add another piece of data to the cast records? Let’s say we want to add a thumbs up, or thumbs down vote to each character, so something like this:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100,
      "character": "Bruce Banner"
    },

Now we run into a subtler point.  You might expect that injecting the new fields into the middle of the JSON, instead of at the end, would break our call to jsonb_to_recordset(jsonb_column->'cast') as t(id text, name varchar(100), character text).  In PostgreSQL it does not: the columns we define for the recordset are matched to the JSON keys by name, not by position, so character still returns Bruce Banner rather than 10000.  What does break the mapping is renaming a key or changing a value’s type; a renamed field silently comes back as NULL rather than raising an error.  So, as far as PostgreSQL is concerned, either layout works equally well:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100
    },

For explicit control, MySQL’s equivalent json_table function has you define specific paths: json_table(json_column, '$.cast[*]' columns (v_name varchar(200) path '$.name', v_character varchar(200) path '$.character')). This lets you map exactly the fields you want.  That said, there are other limitations in the json_table method MySQL uses, so there is a trade-off either way.  The point stands: the structure of your JSON, its key names and value types, matters greatly if you are going to use built-in database functions to query this data.
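
PostgreSQL can do something similar with its jsonpath support; a hedged sketch that pulls one field by path, regardless of where it sits in each cast entry:

-- jsonb_path_query (PG12+) extracts values by path from each matching document.
select jsonb_path_query(jsonb_column, '$.cast[*].character') as character_name
  from movies_json_generated
 where imdb_id = 'tt2395427';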

To JSON or Not?  Database Design

While PostgreSQL’s JSON features are pretty in-depth (there are dozens of other functions I did not show), there are similar challenges I see when using these compared to a more classic database design. JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool to offer additional flexibility.  If you treat PostgreSQL or any database simply as a dumping ground for data, bad things happen.  It would help to think about how JSON columns can augment your current database designs, not replace them.

Let me give you a simple example of some of the dangers of simply forgoing database design in favor of just dumping in JSON.  While I think we have highlighted some of the complexity and setup challenges with different methods of storing JSON in PostgreSQL, I wanted to take this a step further and compare the performance of different database designs and their impact on performance and scalability.  For this, I built three separate designs and schemas for housing the same data:

A.)  Minimal Database Design -> single table, single column.  “The dump.”  With only a GIN index.

B.)  Single Table -> Generated columns and Expression indexes where needed.

C.)  Old School Normalized Tables, with JSON stored for easy use

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

Test                                       A.) Minimal, no GIN idx   A.) Minimal, GIN idx   B.) Generated columns   B.) Expression indexes   C.) Normalized
Simple query for movie title (random 10)   800ms                     0.3ms                  0.2ms                   0.2ms                    0.2ms
Select top 10 movies                       841ms                     831ms                  0.9ms                   0.9ms                    0.3ms
Select all movies for an actor             1009.0ms                  228ms                  5588.0ms(2)             5588.0ms(2)              0.8ms
Count movies for a given actor             5071ms                    5075ms                 5544ms                  NA                       1ms
  1. Each test was repeated 100 times, and the average results were listed.  Min/Max is available as well. 
  2. The nested JSON for our “cast info” was not easily indexed; there are some things we could do to mitigate this, but it is a bit more complicated.

Interestingly, there are a few cases where we have problems with using only generated columns/expression indexes, generally because of nested data within the JSON.  We could do some tricks, like adding a GIN index and query the JSON to find the element or creating some funky nested generated column with an expression index.  I left these out for now.  Also, I could go deeper into the tests, testing inserts, updates, and, of course, assembling JSON from a normalized table, but I didn’t.  My goal here is merely to make you think about your design.  Choosing one way to implement and use JSON may be ok under certain use cases, but you may find yourself querying or aggregating the data, and things break.

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways; jsonb_to_recordset is the easiest way I found to pull out the data I need.  However, this function and others like it are very sensitive to the structure and data types in your document (data types matter!).
  • JSON data whose format changes ( elements added or removed ) may make using some functions difficult, if not impossible, to use.
  • JSON within a well-built, designed application and database can be wonderful and offer many great benefits.  JSON just dumped into databases won’t scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!

Sep 03, 2021

Storing and Using JSON Within PostgreSQL Part One


Continuing our series on storing JSON directly within your database, we are now looking at the JSON functionality built into PostgreSQL.  You can see the first two posts in this series exploring MySQL’s JSON functionality with Storing JSON in Your Databases: Tips and Tricks For MySQL Part One and Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two.  I used the exact same table structures and datasets as I did within my MySQL tests.  You can get the instructions to follow along at home here on GitHub.

PostgreSQL has two JSON data types available to you: JSON and JSONB.  The docs highlight the differences pretty well:

(See the JSON vs. JSONB comparison table in the PostgreSQL docs.)

Basically, JSONB is stored decomposed, making it more efficient for the database to interact with, while the JSON type is stored as an exact text copy of what was input.  This means formatting, white space, etc., will be lost with JSONB.  That said, the general suggestion is to use JSONB.  The trade-off is slightly slower writes in exchange for faster overall read performance (on top of the storage differences mentioned above).
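
A quick illustration of that text-copy vs. decomposed distinction (expected output in the comments):

select '{"b": 1,  "a": 2, "a": 3}'::json;
-- {"b": 1,  "a": 2, "a": 3}     exact input text kept, duplicate key and all

select '{"b": 1,  "a": 2, "a": 3}'::jsonb;
-- {"a": 3, "b": 1}              whitespace gone, keys sorted, last duplicate wins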

In terms of performance, how much of a difference is there?  That is tricky, as it will depend heavily on the size and complexity of the JSON document being stored.  The larger, more complex the document, the larger the difference you could see.  But to illustrate an example, we will use our movie JSON to select the same data from both JSON and JSONB.

I created the following tables and loaded them with movie-related JSON:

create table movies_json (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	json_column json not null
);

create unique index movies_json_imdb_idx on movies_json(imdb_id);


create table movies_jsonb (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	jsonb_column jsonb not null
);

create unique index movies_jsonb_imdb_idx on movies_jsonb(imdb_id);

Querying the JSON column results in slower retrieval:

movie_json_test=# explain (verbose true, analyze true) select * from movies_json where json_column->>'title' = 'Avengers: Endgame (2019)';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..55775.85 rows=1880 width=1059) (actual time=694.047..2516.879 rows=1 loops=1)
   Output: ai_myid, imdb_id, json_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_json  (cost=0.00..54587.85 rows=783 width=1059) (actual time=1905.511..2512.010 rows=0 loops=3)
         Output: ai_myid, imdb_id, json_column
         Filter: ((movies_json.json_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=2511.276..2511.277 rows=0 loops=1
         Worker 1:  actual time=2511.322..2511.322 rows=0 loops=1
 Planning Time: 0.166 ms
 Execution Time: 2516.897 ms
(12 rows)

While the JSONB column is faster (3x):

movie_json_test=# explain (verbose true, analyze true) select * from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54116.60 rows=1873 width=1025) (actual time=723.324..726.914 rows=1 loops=1)
   Output: ai_myid, imdb_id, jsonb_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52929.30 rows=780 width=1025) (actual time=548.982..721.730 rows=0 loops=3)
         Output: ai_myid, imdb_id, jsonb_column
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=720.995..720.995 rows=0 loops=1
         Worker 1:  actual time=202.751..720.994 rows=1 loops=1
 Planning Time: 0.038 ms
 Execution Time: 726.933 ms
(12 rows)

To ensure this was not a single data point, I wrote a small script to run this test (with this title and other random titles).  Over 100 runs, we had pretty consistent results:

              JSON       JSONB
Average Time: 2.5492 s   0.7470 s
Min Time:     2.5297 s   0.7298 s
Max Time:     2.5654 s   0.7828 s

As stated, inserts and updates will lose some of that performance, so keep that in mind.  It is, however, easier to interact with and update data in a JSONB column.  For instance, some functions are only available for the JSONB data type, like JSONB_SET, which allows you to update your JSON in place.

movie_json_test=# explain (verbose true, analyze true) update movies_jsonb set jsonb_column= jsonb_set(jsonb_column, '{imdb_rating}', '9') where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Update on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=1928.691..1928.692 rows=0 loops=1)
   ->  Seq Scan on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=577.386..1928.246 rows=1 loops=1)
         Output: ai_myid, imdb_id, jsonb_set(jsonb_column, '{imdb_rating}'::text[], '9'::jsonb, true), ctid
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 375358
 Planning Time: 0.040 ms
 Execution Time: 1928.718 ms
(7 rows)

I have seen references in several places to the insert performance difference between JSON and JSONB, but I have not seen any concrete numbers.  I decided to run a few tests:

  • Inserting 10K records into a JSON column (via insert into select from):  72.851 ms
  • Inserting 10K records into a JSONB column (via insert into select from):  754.045 ms

If you are doing a heavy insert workload, the difference is significant enough to take note and plan for it.  

Another distinct advantage of JSONB is the ability to use GIN indexes over the JSON document.  GIN indexes are designed and optimized for text searches, a focus that lends itself well to JSON.  That said, the syntax needed to make use of a GIN index can be a bit more complicated.

Here is how you create a GIN index. 

movie_json_test=# CREATE INDEX movies_jsonb_gin_index ON movies_jsonb USING gin (jsonb_column);
CREATE INDEX

You can see the original query is still not using this index:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=716.059..719.346 rows=1 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=544.197..714.577 rows=0 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=202.768..713.907 rows=1 loops=1
         Worker 1:  actual time=713.890..713.891 rows=0 loops=1
 Planning Time: 0.041 ms
 Execution Time: 719.365 ms
(12 rows)

Why not?  The ->> operator returns the value as text, and that conversion from JSON to text gets in the way of the GIN index.  There are a couple of ways to work around this.  The first is to use @@, which evaluates a jsonpath predicate (only the first item of the result is considered):

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title == "Avengers: Endgame (2019)"';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=72.29..221.83 rows=38 width=64) (actual time=0.179..0.180 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..72.28 rows=38 width=0) (actual time=0.068..0.068 rows=2 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
 Planning Time: 0.145 ms
 Execution Time: 0.199 ms
(8 rows)

This works.  You can also use @>, which checks whether the JSON value on the left contains the entries on the right.

movie_json_test=> explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "title" : "Avengers: Endgame (2019)"}';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=44.29..193.94 rows=38 width=64) (actual time=0.199..0.200 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..44.28 rows=38 width=0) (actual time=0.091..0.091 rows=1 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
 Planning Time: 0.147 ms
 Execution Time: 0.214 ms
(8 rows)

Not only did this use the index, it is also dramatically faster overall.  The PostgreSQL documentation offers a fairly detailed overview of the different options for interacting with JSON, especially when it comes to the functions and operators available to you (note that the operator overview table in the PG 12 docs is a little more readable). It’s important to note that not all operators and functions will work with GIN indexes.  For example, using ‘like_regex’:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title like_regex "^Avengers*"';
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..53546.87 rows=38 width=64) (actual time=218.550..795.063 rows=9 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52543.07 rows=16 width=64) (actual time=251.866..790.098 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: (movies_jsonb.jsonb_column @@ '($."title" like_regex "^Avengers*")'::jsonpath)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=311.403..789.402 rows=3 loops=1
         Worker 1:  actual time=225.825..789.408 rows=1 loops=1
 Planning Time: 0.204 ms
 Execution Time: 795.087 ms
(12 rows)

In addition to GIN indexes, you can create hash or B-tree indexes on your JSON columns; however, those are only useful if you are comparing or searching the entire JSON document.  Instead, you want to use an expression index (often referred to as a functional index in other databases).

Generated Columns and Expression Indexes

While GIN indexes work for some use cases, you will often find it easier and more convenient to use actual columns or some of the normal functions you are familiar with.  The simplest use of an expression index is to pull out and index a field you will use often.  Let’s say we will often want to search for a title.  We can create an index on the extracted JSON from our previous example.

movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text));
CREATE INDEX

This creates a btree index on the jsonb_column ->> 'title' expression, allowing us to search by title.

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=66.97..6421.19 rows=1877 width=64) (actual time=0.138..0.140 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..66.50 rows=1877 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
 Planning Time: 0.160 ms
 Execution Time: 0.159 ms
(8 rows)

Originally, when we searched for movies with the ‘Avengers: Endgame (2019)’ title, there was no index (unless you modified the SQL to make use of the GIN index).  Now, with this new index, you can see our original query is no longer doing a sequential scan and is instead using the btree index.  If we also want the index to be used for ‘like string%’ statements, to help with partial matching, we need something extra added to the index.  Note: this is not something you can do by default using only the GIN index (although you can look into the pg_trgm extension for this).  Here we will recreate the same index with the operator class “text_pattern_ops”, which is designed to work with LIKE and regex patterns.

Before:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=371.463..723.743 rows=8 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=288.053..718.957 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=204.176..718.288 rows=2 loops=1
         Worker 1:  actual time=288.637..718.299 rows=3 loops=1
 Planning Time: 0.130 ms
 Execution Time: 723.762 ms
(12 rows)

After:

movie_json_test=# drop index movie_jsonb_title_index;
DROP INDEX
movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text) text_pattern_ops);
CREATE INDEX
movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=71.66..6425.88 rows=1877 width=64) (actual time=0.195..0.498 rows=8 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..71.19 rows=1877 width=0) (actual time=0.015..0.015 rows=8 loops=1)
         Index Cond: (((movies_jsonb.jsonb_column ->> 'title'::text) ~>=~ 'Avengers'::text) AND ((movies_jsonb.jsonb_column ->> 'title'::text) ~<~ 'Avengert'::text))
 Planning Time: 0.168 ms
 Execution Time: 0.519 ms

Now you can also perform a similar operation, but store the results as a queryable generated column instead:

movie_json_test=> \d movies_jsonb
                                         Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                    Default                    
--------------+------------------------+-----------+----------+-----------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

movie_json_test=> alter table movies_jsonb add column title varchar(255) generated always as (((jsonb_column ->> 'title')::text)) stored;
ALTER TABLE


movie_json_test=> \d movies_jsonb
                                                  Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                             Default                             
--------------+------------------------+-----------+----------+-----------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
 title        | character varying(255) |           |          | generated always as (((jsonb_column ->> 'title'::text))) stored
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

Now we have the option to add an index on the title column, select it without the JSON operators, etc.  Note: generated columns cannot be written to directly; their values are always computed from the JSON document.
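
For example, a plain btree index on the new title column, and a query that no longer needs any JSON operators (the index name is illustrative):

create index movies_jsonb_title_gen_idx on movies_jsonb (title);

-- The generated column behaves like any other column in queries.
select ai_myid, title from movies_jsonb where title = 'Avengers: Endgame (2019)';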

We can create our entire movie_json table with extra columns and expression indexes and make it look a bit more like a normal table:

create table movies_json_generated (
    ai_myid serial primary key,
    imdb_id varchar(255) generated always as (jsonb_column ->> 'imdb_id') stored,
    title varchar(255) generated always as (jsonb_column ->> 'title') stored,
    imdb_rating decimal(5,2) generated always as ((jsonb_column ->> 'imdb_rating')::numeric) stored,
    overview text generated always as (jsonb_column ->> 'overview') stored,
    director jsonb generated always as ((jsonb_column ->> 'director')::json) stored,
    country varchar(100) generated always as (jsonb_column ->> 'country') stored,
    jsonb_column jsonb,
    json_column json
);

create unique index gen_imdb_idx on movies_json_generated(imdb_id);
create index gen_title_idx on movies_json_generated(title);
create index gen_func_title_index on movies_json_generated (((json_column ->> 'title')::varchar));	
create index gen_gin_index on movies_json_generated using gin (jsonb_column);
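
With the GIN index in place (it uses the default jsonb_ops operator class), containment queries can use it as well.  For example (the country value here is just an illustration of whatever your documents contain):

-- The @> containment operator is one of the operators gen_gin_index supports.
select ai_myid, title from movies_json_generated where jsonb_column @> '{"country": "USA"}';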

You may have noticed we had to explicitly cast the columns to a specific data type.  This is one of the challenges, or at least one of the more difficult things to get used to, when working with JSON in both MySQL and PostgreSQL.

It’s worth remembering that a JSON document does not carry explicit data types, so you will often find that some functions or indexes do not work as expected because you are comparing data of two different types (and some operations will return different results, e.g. sorting text vs. numeric values).  For example, let’s try to add 1 to our IMDB rating:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1))) ;
ERROR:  operator does not exist: text + numeric
LINE 1: ...imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1))) ;
                                                              ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

Here the imdb_rating returned by the ->> operator is text, not numeric, so you cannot add 1 to it.  Logically, then, you want to cast the value to numeric to allow the addition:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1))) ;
ERROR:  function jsonb_set(jsonb, unknown, numeric) does not exist
LINE 1: ...pdate  movies_jsonb_generated_1 set jsonb_column= jsonb_set(...
                                                             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Again we run into a type issue.  This time, the jsonb_set function will not accept a numeric value (nor will it implicitly convert one) where it expects a JSONB value.  So let’s cast the result to JSONB:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) ;
ERROR:  cannot cast type numeric to jsonb
LINE 1: ...((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) ;

Here we find we cannot cast a numeric directly to JSONB, so we have to convert the numeric to text first and then to JSONB:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::text::jsonb)) ;
UPDATE 100000
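
As an aside, the to_jsonb() function can collapse the text-then-JSONB double cast into a single step; this should be equivalent to the statement above:

update  movies_jsonb_generated_1 set jsonb_column = jsonb_set(jsonb_column::jsonb, '{imdb_rating}', to_jsonb(((jsonb_column ->> 'imdb_rating')::numeric) + 1));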

For some developers, this may prove a bit tricky to get the hang of.  It also extends to operations that work but are slow, or that quietly produce the wrong data: when comparing, sorting, or filtering, you may or may not pick up the index you are looking for, and you may end up comparing numbers by their text representation, as shown in the short illustration below.  After that, you can see us explicitly convert certain columns to compare them; note that when you generate a column or create an expression index, you need to do the same.
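
First, the pitfall in miniature (a standalone illustration, not using the movie data):

select '10' > '9';   -- false: compared as text, where '1' sorts before '9'
select 10 > 9;       -- true: compared as numbers

With that in mind, here are the query plans against the movie data: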

movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where (jsonb_column->>'imdb_rating')::numeric > 8 order by (jsonb_column->>'imdb_rating')::numeric desc limit 5;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=65291.78..65292.36 rows=5 width=128) (actual time=1055.064..1058.740 rows=5 loops=1)
   ->  Gather Merge  (cost=65291.78..77449.06 rows=104198 width=128) (actual time=1055.062..1058.738 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=64291.76..64422.00 rows=52099 width=128) (actual time=1053.392..1053.393 rows=4 loops=3)
               Sort Key: (((jsonb_column ->> 'imdb_rating'::text))::numeric) DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..63426.41 rows=52099 width=128) (actual time=0.993..1052.275 rows=2688 loops=3)
                     Filter: (((jsonb_column ->> 'imdb_rating'::text))::numeric > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.061 ms
 Execution Time: 1058.769 ms
(14 rows)
 
movie_json_test=> \d movies_json_generated
                                                     Table "public.movies_json_generated"
    Column    |          Type          | Collation | Nullable |                                    Default                                    
--------------+------------------------+-----------+----------+-------------------------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_json_generated_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | generated always as (((json_column ->> 'imdb_id'::text))) stored
 title        | character varying(255) |           |          | generated always as (((json_column ->> 'title'::text))) stored
 imdb_rating  | numeric(5,2)           |           |          | generated always as (((json_column ->> 'imdb_rating'::text)::numeric)) stored
 overview     | text                   |           |          | generated always as (json_column ->> 'overview'::text) stored
 director     | jsonb                  |           |          | generated always as (((json_column ->> 'director'::text)::json)) stored
 country      | character varying(100) |           |          | generated always as (((json_column ->> 'country'::text))) stored
 jsonb_column | jsonb                  |           |          | 
 json_column  | json                   |           |          | 
Indexes:
    "movies_json_generated_pkey" PRIMARY KEY, btree (ai_myid)
    "gen_func_title_index" btree (((json_column ->> 'title'::text)::character varying))
    "gen_gin_index" gin (jsonb_column)
    "gen_imdb_idx" UNIQUE, btree (imdb_id)
    "gen_title_idx" btree (title)
 
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=62548.12..62548.70 rows=5 width=102) (actual time=112.458..114.704 rows=5 loops=1)
   ->  Gather Merge  (cost=62548.12..63277.80 rows=6254 width=102) (actual time=112.457..114.702 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=61548.09..61555.91 rows=3127 width=102) (actual time=110.807..110.808 rows=4 loops=3)
               Sort Key: imdb_rating DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..61496.16 rows=3127 width=102) (actual time=0.128..109.939 rows=2688 loops=3)
                     Filter: (imdb_rating > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.146 ms
 Execution Time: 114.729 ms
(14 rows)

movie_json_test=> create index test_index_imdb_rating on movies_json_generated (imdb_rating);
CREATE INDEX
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..19.41 rows=5 width=102) (actual time=0.094..0.134 rows=5 loops=1)
   ->  Index Scan Backward using test_index_imdb_rating on movies_json_generated  (cost=0.42..28521.09 rows=7510 width=102) (actual time=0.093..0.132 rows=5 loops=1)
         Index Cond: (imdb_rating > '8'::numeric)
 Planning Time: 0.136 ms
 Execution Time: 0.152 ms
(5 rows)

A quick side note: there will be a performance difference for inserts/updates on JSON/JSONB columns that have one or more generated columns or expression indexes attached.  Because of how JSONB is stored and parsed, JSONB comes out ahead when an update has to recompute many generated columns/expression indexes: in my tests, updates on a JSONB column with several generated columns/expression indexes were 3-4x or more faster than on a JSON column with the same setup.  That said, the more generated columns and expression indexes you have on a table, the more each write costs; it is a trade-off.

  • A table with a single generated column took 7,650ms to load 100K rows.
  • A table with six generated columns took 8,576ms to load 100K rows.

While this is a very simple setup with just a single JSONB column and the generated columns, you can already see a slight overhead.  The same held true when updating the same 100K records: updating only one column took 4.7 seconds on the table with six generated columns vs. 3.6 seconds on the table with a single generated column (a ~25% difference).  Again, this is not an in-depth test, but it illustrates that adding a ton of expression indexes or generated columns will not be without some cost.
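
If you want to reproduce this kind of comparison yourself, here is a minimal sketch of the approach (the table name is illustrative; this is not the exact harness used for the numbers above):

\timing on

-- Variant with a single generated column; build a second variant with six
-- generated columns over the same jsonb_column and compare the load times.
create table movies_one_gen (
    ai_myid serial primary key,
    title varchar(255) generated always as (jsonb_column ->> 'title') stored,
    jsonb_column jsonb
);

insert into movies_one_gen (jsonb_column) select jsonb_column from movies_jsonb limit 100000;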

Recap & What’s Next

A few quick takeaways:

  • Using JSONB is probably going to be your best option in most use cases
  • Be very careful with type conversions and with assumptions about the data within your JSON/JSONB columns; you may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits

Now you have the basics of JSON in PostgreSQL.  In the next post, we will review some of the more advanced options, take a deeper dive into performance, and explore a normalized schema -vs- one that is heavily JSON.
