Sep
16
2021
--

Repoint Replica Servers in MySQL/Percona Server for MySQL 8.0

Repoint Replica Servers in MySQL

When doing migrations or failovers in MySQL, there is usually a need to do a topology change and repoint replica servers to obtain replication data from a different server.

For example, given servers {A, B, and C} and the following topology:

MySQL Topology

If you need to repoint C to be a replica of B, i.e.:

repoint mysql

You can follow these steps:

Note: log_replica_updates should be enabled on the soon-to-be primary as it is a prerequisite for chain replication.

Note: It is assumed that both replicas only stream from Server A and there are no conflicting replication filters in place that might break replication later on.
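
A quick way to confirm the prerequisite on B before starting (a sketch; the variable name assumes MySQL 8.0.26 or newer, older releases use log_slave_updates):

-- Run on B, the soon-to-be intermediate primary
SELECT @@global.log_bin, @@global.log_replica_updates;
-- Both should return 1; otherwise, enable them in the configuration and restart B.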

If Using File/Position-Based Replication:

1) Stop B and C

STOP REPLICA;

2) If replicas are multi-threaded, correct MTS gaps and make them single-threaded until all changes are applied. To do so, execute the following commands on BOTH nodes:

START REPLICA UNTIL SQL_AFTER_MTS_GAPS;
SHOW REPLICA STATUS\G -- repeat this until you see "Replica_SQL_Running: No"
STOP REPLICA;
SELECT @@global.replica_parallel_workers; -- take note to restore later
SET GLOBAL replica_parallel_workers=0; -- disable MTS during the operations

3) Then check which node is more up to date by looking at Relay_Source_Log_File and Exec_Source_Log_Pos. Run on BOTH nodes:

SHOW REPLICA STATUS\G
# Take note of Relay_Source_Log_File/Exec_Source_Log_Pos from the most up-to-date node.

4) Sync the replicas with UNTIL. Run on the most delayed node, using the values noted above:

START REPLICA UNTIL SOURCE_LOG_FILE='<Relay_Source_Log_File>', SOURCE_LOG_POS=<Exec_Source_Log_Pos>;
SHOW REPLICA STATUS\G -- repeat this until you see "Replica_SQL_Running: No"

5) If you followed the above steps, at this point both replicas should have the exact same data set and be in sync at the same point in time.
# Double-check that both replicas are stopped and have the same coordinates, as doing topology changes while replication is running or with different coordinates can cause inconsistencies:

SHOW REPLICA STATUS\G
# Replica_IO_Running must be “NO” in both replicas
# Replica_SQL_Running must be “NO” in both replicas
# Relay_Source_Log_File must match in both replicas
# Exec_Source_Log_Pos must match in both replicas

6) Get current coordinates from B (new intermediate primary). Execute on B:

SHOW MASTER STATUS \G
# Take note of File and Position

7) Repoint C to B. Execute on C with the coordinates from the previous step:

CHANGE REPLICATION SOURCE TO SOURCE_HOST='<ip-address-of-B>', SOURCE_LOG_FILE='<File>', SOURCE_LOG_POS=<Position>;
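
For illustration only, with made-up values, the command could look like this (replication user, password, and port are kept from the existing configuration unless you specify them):

CHANGE REPLICATION SOURCE TO SOURCE_HOST='10.0.0.2', SOURCE_LOG_FILE='binlog.000042', SOURCE_LOG_POS=157;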

8) If you had disabled MTS, re-enable it here for both B and C:

SET GLOBAL replica_parallel_workers=X; -- see output of step 2 for correct value

9) Restart replication normally. Run on both nodes:

START REPLICA;

If Using GTID-Based Replication:

1) Stop B and C:
STOP REPLICA;

2) If replicas are multi-threaded, correct MTS gaps and make them single-threaded until all changes are applied. Run on BOTH nodes:

START REPLICA UNTIL SQL_AFTER_MTS_GAPS;
SHOW REPLICA STATUS\G -- repeat this until you see "Replica_SQL_Running: No"
STOP REPLICA;
SELECT @@global.replica_parallel_workers; -- take note to restore later
SET GLOBAL replica_parallel_workers=0; -- disable MTS during the operations

3) Then check which is the node that is more up to date by looking at sequence numbers in Executed_Gtid_Set. Run on BOTH nodes:
SHOW REPLICA STATUS\G
# Take note of the Executed_Gtid_Set with the largest sequence number. If there is a mismatch in the GTID sets, it means there were either local writes or writes coming from some other server. In that case, you should check data consistency between the servers, for example with pt-table-checksum. Then you need to fix the GTID differences by either rebuilding the replica from scratch or fixing the errant transactions as explained in this other blog post.
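
To compare the two GTID sets quickly, the built-in GTID functions can help. A sketch, replacing the placeholders with the Executed_Gtid_Set values collected above:

SELECT GTID_SUBTRACT('<Executed_Gtid_Set_of_C>', '<Executed_Gtid_Set_of_B>') AS only_on_C;
SELECT GTID_SUBTRACT('<Executed_Gtid_Set_of_B>', '<Executed_Gtid_Set_of_C>') AS only_on_B;
-- An empty result in both directions means neither replica has transactions the other one is missing.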

4) Bring all nodes up to the same point in time. Run on the node with the smallest GTID sequence number:

START REPLICA UNTIL SQL_AFTER_GTIDS='<Executed_Gtid_Set>';
SHOW REPLICA STATUS\G -- repeat this until you see "Replica_SQL_Running: No"

5) If you followed the above steps, at this point both replicas should have the exact same data set and be in sync at the same point in time.
# Double-check that both replicas are stopped and have the same coordinates, as doing topology changes while replication is running or with different coordinates can cause inconsistencies:

SHOW REPLICA STATUS\G
# Replica_IO_Running must be “NO” in both replicas
# Replica_SQL_Running must be “NO” in both replicas
# Executed_Gtid_Set must match in both replicas

6) Now both replicas have identical data, so you can repoint C to replicate from B. Run on C:

CHANGE REPLICATION SOURCE TO SOURCE_HOST='<ip-address-of-B>';
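
If the replica was not already configured with GTID auto-positioning, it can be enabled in the same statement (a sketch; otherwise the existing setting is kept):

CHANGE REPLICATION SOURCE TO SOURCE_HOST='<ip-address-of-B>', SOURCE_AUTO_POSITION=1;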

7) If you had disabled MTS, re-enable it here for both B and C:

SET GLOBAL replica_parallel_workers=X; -- see output of step 2 for correct value

8) Restart replication normally. Run on both nodes:

START REPLICA;

Doing the opposite change, going from chain replication (A->B->C) back to one primary with two replicas, is simpler:

If Using File/Position-Based Replication:

1) Stop replication on B and make sure B is not receiving any write activity:

STOP REPLICA;

2) Check current binary log position on B:

SHOW MASTER STATUS \G

3) On C, check replication status until C catches up with B. Run on C:

SHOW REPLICA STATUS \G

# For C to have caught up with B, the following conditions should be met:
# “File” from B in step 2) should match Relay_Source_Log_File from 3)
# “Position” from B in step 2) should match Exec_Source_Log_Pos from 3)

# After catchup, both servers will be in sync with the same data set.

4) Check the current replication coordinates on B:

SHOW REPLICA STATUS \G
# Write down Relay_Source_Log_File and Exec_Source_Log_Pos from B, as we will be using these coordinates on C

5) Repoint C to replicate from A. The file and position used should be the ones taken from B in the last step:

CHANGE REPLICATION SOURCE TO SOURCE_HOST='<ip-address-of-A>', SOURCE_LOG_FILE='<File>', SOURCE_LOG_POS=<Position>;

6) Restart replication normally. Run on both nodes:

START REPLICA;

If Using GTID-Based Replication:

1) Stop replication on B and make sure B is not receiving any write activity:

STOP REPLICA;

2) Check the current binary log position and Executed_Gtid_Set on B:

SHOW MASTER STATUS \G

3) On C, check replication status until C catches up with B. Run on C:

SHOW REPLICA STATUS \G

# For C to have caught up with B, the following conditions should be met:
# Executed_Gtid_Set from B in step 2) should match Executed_Gtid_Set from 3)
# After catchup, both servers will be in sync with the same data set.

4) Repoint C to replicate from A:

CHANGE REPLICATION SOURCE TO SOURCE_HOST='<ip-address-of-A>';

5) Restart replication normally. Run on both nodes:

START REPLICA;

Conclusion:

Doing topology changes might seem hard at first, but with the above procedure, it should be easy and error-free! If you do not want to use the manual approach, you can consider tools like Orchestrator, which allows for automatic failovers and promotions.

Sep
13
2021
--

MySQL/ZFS in the Cloud, Leveraging Ephemeral Storage

MySQL/ZFS in the cloud

Here’s a second post focusing on the performance of MySQL on ZFS in cloud environments. In the first post, MySQL/ZFS Performance Update, we compared the performance of ZFS and ext4. This time we’ll look at the benefits of using ephemeral storage devices. These devices, called ephemeral in AWS, local in Google Cloud, and temporary in Azure, are provided directly by the virtualization host. They are not network-attached and are not IO throttled, at least compared to regular storage. Not only can they handle a high number of IOPS, but their IO latency is also very low. For simplicity, we’ll call these devices local ephemeral. They can be quite large: Azure Lsv2, Google Cloud n2, and AWS i3 instance types offer TBs of fast NVMe local ephemeral storage.

The main drawback of local ephemeral devices is the loss of all the data if the VM is terminated. For that reason, the usage of local ephemeral devices is limited with databases like MySQL. Typical use cases are temporary reporting servers and Percona XtraDB Cluster (PXC)/Galera cluster nodes. PXC is a bit of a wild case here: the well polished and automated full state transfer of Galera overcomes the issue caused by having to reload the dataset when a cluster node is recycled. Because of data compression, much more data can be stored on an ephemeral device. Actually, our TPCC dataset fits on the 75GB of temporary storage when compressed. Under such circumstances, the TPCC performance is stellar as shown below.

TPCC Transaction Rate ZFS

TPCC results using ZFS on an ephemeral device

On the local ephemeral device, the TPCC transaction rate is much higher, hovering close to 200 per minute. The ZFS results on the regular SSD Premium are included as a reference. The transaction rate during the last hour was around 50 per minute. Essentially, with the use of the local ephemeral device, the load goes from IO-bound to CPU-bound.

Of course, it is not always possible to only use ephemeral devices. We’ll now explore a use case for an ephemeral device, as a caching device for the filesystem, using the ZFS L2ARC.

What is the ZFS L2ARC?

Like all filesystems, ZFS has a memory cache, called the ARC, to prevent disk IOPs from retrieving frequently used pieces of data. The ZFS ARC has a few additional tricks up its sleeve. First, when data compression is used on the filesystem, the compressed form is stored in the ARC. This helps store more data. The second ZFS trick is the ability to connect the ARC LRU eviction to a fast storage device, the L2ARC. L2 stands for “Level 2”, a bit like the leveled caches of CPUs.

Essentially, the ZFS ARC is a level 1 cache, and records evicted from it can be inserted into a level 2 cache, the L2ARC. For the L2ARC to be efficient, the device used must have a low latency and be able to perform a high number of IOPs. Those are characteristics of cloud ephemeral devices.

Configuration for the L2ARC

The ZFS L2ARC has many tunables, and many of these have been inherited from the recent past when flash devices were much slower for writes than for reads. So, let’s start at the beginning; here is how we add an L2ARC using the local ephemeral device, /dev/sdb, to the ZFS pool bench:

# zpool add bench cache /dev/sdb

Then, the cache device appears in the zpool:

# zpool status
       pool: bench
      state: ONLINE
     config:
         NAME      STATE  READ WRITE CKSUM
         bench     ONLINE        0      0      0
           sdc     ONLINE        0      0      0
         cache
           sdb     ONLINE        0      0      0

Once the L2ARC is created, if we want data in it, we must start storing data in the ARC with:

# zfs set primarycache=all bench/data

This is all that is needed to get data flowing to the L2ARC, but the default parameters controlling the L2ARC have conservative values and it can be quite slow to warm up the L2ARC. In order to improve the L2ARC performance, I modified the following kernel module parameters:

l2arc_headroom=4
l2arc_write_boost=134217728
l2arc_write_max=67108864
zfs_arc_max=4294967296

Essentially, I am boosting the ingestion rate of the L2ARC. I am also slightly increasing the size of the ARC because the pointers to the L2ARC data are kept in the ARC. If you don’t use a large enough ARC, you won’t be able to add data to the L2ARC. That ceiling frustrated me a few times until I realized the l2_hdr_size entry in /proc/spl/kstat/zfs/arcstats represents data stored in the metadata section of the ARC. The ARC must be large enough to accommodate the L2ARC pointers.
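
On Linux with OpenZFS, these values can be applied at runtime through the module parameter files, and the L2ARC counters can be watched in arcstats. A quick sketch, assuming the zfs module is loaded:

# echo 4 > /sys/module/zfs/parameters/l2arc_headroom
# echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
# echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
# echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# grep ^l2_ /proc/spl/kstat/zfs/arcstats    # l2_size, l2_hdr_size, l2_hits, l2_misses, ...

To keep the settings across reboots, they would also go in a modprobe configuration file such as /etc/modprobe.d/zfs.conf.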

L2ARC Impacts on TPCC Results

So, what happens to the TPCC transaction rate when we add an L2ARC? Since the dataset is copied over every time, the L2ARC is fully warm at the beginning of a run. The figure below shows the ZFS results with and without an L2ARC in front of Azure SSD Premium storage.

TPCC performance on ZFS with a L2ARC

TPCC performance on ZFS with a L2ARC

The difference is almost incredible. Since the whole compressed dataset fits into the L2ARC, the behavior is somewhat similar to the direct use of the local ephemeral device. Actually, since the write load is now sent to the SSD premium storage, the performance is even higher. However, after 4000s, the performance starts to degrade.

From what I found, this is caused by the thread feeding the L2ARC (l2arc_feed). As pages are updated by the TPCC workload, they are eventually flushed at a high rate to the storage. The L2ARC feed thread has to scan the ARC LRU to find suitable records before they are evicted. This thread then writes them to the local ephemeral device and updates the pointers in the ARC. Even if the write latency of the local ephemeral device is low, it is significant, and it greatly limits the amount of work a single feed thread can do. Ideally, ZFS should be able to use more than a single L2ARC feed thread.

In the event you end up in such a situation with a degraded L2ARC, you can refresh it when the write load goes down. Just run the following command when activity is low:

# tar c /var/lib/mysql/data > /dev/null

It is important to keep in mind that a read-intensive or a moderately write-intensive workload will not degrade as much over time as the TPCC benchmark used here. Essentially, if a replica with one or even a few (2 or 3) replication threads can keep up with the write load, the ZFS L2ARC feed thread will also be able to keep up.

Comparison with bcache

The ZFS L2ARC is not the only option to use a local ephemeral device as a read cache; there are other options like bcache and flashcache. Since bcache is now part of the Linux kernel, we’ll focus on it.

bcache is used as an ext4 read cache extension. Its content is uncompressed, unlike the L2ARC. The dataset is much larger than the size of the local ephemeral device, so the impact is expected to be smaller.
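
For reference, a minimal bcache setup could look like the following (a sketch, assuming /dev/sdb is the local ephemeral device, /dev/sdc is the regular SSD, and bcache-tools is installed):

# make-bcache -C /dev/sdb -B /dev/sdc    # creating cache and backing devices together attaches them automatically
# mkfs.ext4 /dev/bcache0                 # the combined device appears as /dev/bcache0
# mount /dev/bcache0 /var/lib/mysql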

Comparison of the TPCC transaction rate between bcache and L2ARC

As we can see in the above figure, that is exactly what we observe. The transaction rate with bcache is lower than with the L2ARC because less data is cached. The L2ARC yielded more than twice as many transactions over the 2h period as bcache did. However, bcache is not without merit; it did help ext4 increase its performance by about 43%.

How to Recreate L2ARC if Missing

By nature, local ephemeral devices are… ephemeral. When a virtual machine is restarted, it could end up on a different host. In such a case, the L2ARC data on the local ephemeral device is lost. Since it is only a read cache, it doesn’t prevent ZFS from starting but you get a pool status similar to this:

# zpool status
  pool: bench
 state: ONLINE
status: One or more devices could not be opened.  Sufficient replicas exist for
    	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://zfsonlinux.org/msg/ZFS-8000-2Q
  scan: none requested
config:

        NAME          	STATE 	READ WRITE CKSUM
    	bench        	ONLINE   	0 	0 	0
      	sdc         	ONLINE   	0 	0 	0
    	cache
      	/dev/sdb        UNAVAIL  	0 	0 	0  cannot open

In such a case, the L2ARC can easily be fixed with:

# zpool remove bench /dev/sdb
# zpool add bench cache /dev/sdb

These commands should be called from a startup script to ensure the L2ARC is sane after a restart.

Conclusion

In this post, we have explored the great potential of local ephemeral devices. These devices are a means to improve MySQL performance and reduce the costs of cloud hosting. Either used directly or as a caching device, ZFS data compression and architecture allow nearly triple the number of TPCC transactions to be executed over a two-hour period.

There are still a few ZFS related topics I’d like to cover in the near future. Those posts may not be in that order but the topics are: “Comparison with InnoDB compression”, “Comparison with BTRFS”, “ZFS tuning for MySQL”. If some of these titles raise your interest, stay tuned.

Sep
10
2021
--

Finding Undetected Jumbo Chunks in MongoDB

Jumbo Chunks in MongoDB

I recently came across an interesting case of performance issues during balancing in a MongoDB cluster. Digging through the logs, it became clear the problem was related to chunk moves taking a long time.

As we know, the default maximum chunk size is 64 MB. So these migrations are supposed to be very fast on most of the hardware in use nowadays.

In this case, there were several chunks way above that limit being moved around. How can this happen? Shouldn’t those chunks have been marked as Jumbo?

Recap on Chunk Moves

For starters, let’s review what we know about chunk moves.

MongoDB does not move a chunk if the number of documents in it is greater than 1.3 times the result of dividing the configured chunk size (default 64 MB) by the average document size.

If we have a collection with an average document size of 2 KB, a chunk with more than 42598 (65536 / 2 * 1.3) documents won’t be balanced. This might be an issue for collections with non-uniform document sizes, as the estimation won’t be very good in that case.

Also, if a chunk grows past the maximum size, it is flagged as Jumbo. This means the balancer will not try to move it. You can check out the following article for more information about dealing with jumbo chunks.
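
Both settings are easy to inspect from a mongos. A quick sketch (the chunksize document only exists if the default was ever changed):

// Configured chunk size in MB; null means the 64 MB default is in use
db.getSiblingDB("config").settings.findOne({ _id: "chunksize" })
// Chunks already flagged as jumbo
db.getSiblingDB("config").chunks.find({ jumbo: true }).count()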

Finally, we must keep in mind that chunk moves take a metadata lock at a certain part of the process, so write operations are blocked momentarily near the end of a chunk move.

The autoSplitter in Action

Before MongoDB 4.2, the autoSplitter process, responsible for ensuring chunks do not exceed the maximum size, ran on the mongos router. The router process keeps some statistics about the operations it performs, and these statistics are consulted to make chunk-splitting decisions.

The problem is that in a production environment, there are usually many of these processes deployed. An individual mongos router only has a partial idea of what’s happening with a particular chunk of data.

If multiple mongos routers change data on the same chunk, it could grow beyond the maximum size and still go unnoticed.

There are many reported issues related to this topic (for example, see SERVER-44088, SERVER-13806, and SERVER-16715).

Frequently restarting mongos processes could also lead to the same problem, because restarts reset the statistics used to make chunk-splitting decisions.

Because of these issues, the autoSplitter process was changed to run on the primary member of each shard’s replica set. This happened in MongoDB version 4.2 (see SERVER-9287 for more details).

The process should better prevent jumbo chunks now, as each shard has a more accurate view of what data it contains (and how it is being changed).

The MongoDB version, in this case, was 3.6, and the autoSplitter was not doing as good a job as expected.

Finding Undetected Jumbo Chunks in MongoDB

For starters, I came across a script to print a summary of the chunk statistics here. We can build upon it to also display information about Jumbo chunks and generate the commands required to manually split any chunks exceeding the maximum size.

var allChunkInfo = function(ns){
    var chunks = db.getSiblingDB("config").chunks.find({"ns" : ns}).sort({min:1}).noCursorTimeout(); //this will return all chunks for the ns ordered by min
    var totalChunks = 0;
    var totalJumbo = 0;
    var totalSize = 0;
    var totalEmpty = 0;
 
    chunks.forEach( 
        function printChunkInfo(chunk) { 
          var db1 = db.getSiblingDB(chunk.ns.split(".")[0]) // get the database we will be running the command against later
          var key = db.getSiblingDB("config").collections.findOne({_id:chunk.ns}).key; // will need this for the dataSize call
         
          // dataSize returns the info we need on the data; setting the estimate option to use document counts would be less resource-intensive
          var dataSizeResult = db1.runCommand({datasize:chunk.ns, keyPattern:key, min:chunk.min, max:chunk.max, estimate:false});
         
          if(dataSizeResult.size > 67108864) {
            totalJumbo++;
            print('sh.splitFind("' + chunk.ns.toString() + '", ' + JSON.stringify(chunk.min) + ')' + ' // '+  chunk.shard + '    ' + Math.round(dataSizeResult.size/1024/1024) + ' MB    ' + dataSizeResult.numObjects );
          }
          totalSize += dataSizeResult.size;
          totalChunks++;
          
          if (dataSizeResult.size == 0) { totalEmpty++ }; //count empty chunks for summary
          }
    )
    print("***********Summary Chunk Information***********");
    print("Total Chunks: "+totalChunks);
    print("Total Jumbo Chunks: "+totalJumbo);
    print("Average Chunk Size (Mbytes): "+(totalSize/totalChunks/1024/1024));
    print("Empty Chunks: "+totalEmpty);
    print("Average Chunk Size (non-empty): "+(totalSize/(totalChunks-totalEmpty)/1024/1024));
}

The script has to be called from a mongos router as follows:

mongos> allChunkInfo("db.test_col")

And it will print any commands to perform chunk splits as required:

sh.splitFind("db.test_col", {"_id":"jhxT2neuI5fB4o4KBIASK1"}) // shard-1    222 MB    7970
sh.splitFind("db.test_col", {"_id":"zrAESqSZjnpnMI23oh5JZD"}) // shard-2    226 MB    7988
sh.splitFind("db.test_col", {"_id":"SgkCkfSDrY789e9nD4crk9"}) // shard-1    218 MB    7986
sh.splitFind("db.test_col", {"_id":"X5MKEH4j32OhmAhY7LGPMm"}) // shard-1    238 MB    8338
...
***********Summary Chunk Information***********
Total Chunks: 5047
Total Jumbo Chunks: 120
Average Chunk Size (Mbytes): 19.29779934868946
Empty Chunks: 1107
Average Chunk Size (non-empty): 24.719795257064895

You can see the script prints the commands to split the jumbo chunks, along with the shard where each one lives, the size of the chunk, and the number of documents on it. At the end, summary information is also displayed.

The estimate: true option in the dataSize command causes it to use the average document size of the collection for the estimation. This is not very accurate if the document size varies a lot, but it runs faster and is less resource-intensive. Consider setting that option if you know your document size is more or less uniform.
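
If you decide to use the estimate, the only change needed in the script above is flipping the flag in the dataSize call:

var dataSizeResult = db1.runCommand({datasize:chunk.ns, keyPattern:key, min:chunk.min, max:chunk.max, estimate:true});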

All that is left is to run the commands generated by the script, and the jumbo chunks should be gone. The balancer will then move chunks during the next balancing window as required to even out the data. Consider scheduling the balancing window outside the peak workload hours to reduce the impact.

Conclusion

Even though MongoDB 3.6 has been EOL for some time now, many people are still running it, mostly due to outdated drivers on the application side that are not compatible with newer releases. Nowadays, the autoSplitter should be doing a better job of keeping chunk size at bay.

If you are still running a MongoDB version older than 4.2, it would be a good idea to review your chunk stats from time to time to avoid some nasty surprises.

Sep
08
2021
--

Storing and Using JSON Within PostgreSQL Part Two

Storing and Using JSON Within PostgreSQL 2

We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly.  Most of the functionality we talked about in part one works well when you have a simple JSON format.  However, in real life, our documents may get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the  following:

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}'

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."'

However, what if we also need to pull out the character from the movie?  For our needs of getting a full list of actors and characters who were in this particular movie, we can use jsonb_to_recordset (similar to MySQL’s json_table function we covered in the MySQL part of this series).

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character     
-----------+--------------------------------+-------------+-----------+----------------------+------------------
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10; 
ERROR:  cannot call jsonb_to_recordset on a non-array

When we look at the results, you can see that the function expects an array, and several of our movies have no cast (or a NULL value in the json).

movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof 
-----------+--------------
 tt7620156 | 
 tt0109524 | array
 tt0534466 | array
 tt0111091 | 
 tt4167726 | array
 tt0638383 | array
 tt6346314 | array
 tt5877808 | array
 tt4098782 | 
 tt0365100 | array
(10 rows)

You can work around this “Null” issue in a couple of different ways.  The easiest is by converting to text, then back into JSON.  For example:

select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in part one how I harped on types and ensuring you cast to different data types when you needed to?  This is another example.  In this case, first taking the null as text, then taking the empty string, and then converting to a JSON object with a null inside.

While this is the easiest way, let’s show some other interesting ways to work around this to highlight some of the other functions, indexes, etc., we learned in part one.  Because we know that we have NULL values for some cast entries, we could check and filter out the values where the type is not empty or null.  For example, here is a simple check that the text version of our jsonb_column->'cast' is not equal to null:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
 tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
 tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
 tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
 tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
 tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
 
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
 Planning Time: 0.064 ms
 Execution Time: 84.692 ms
(10 rows)

This is not terribly fast, but it does work. This is basically working around the functionality built into the JSON functions, however.  To speed our query up, we can, of course, index this column; however, some of the data in our movie cast list is just too large:

movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191

Of course, we can work around this as well.  But sometimes, it’s best to pivot.  Let’s look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we could create either an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored; 
ALTER TABLE


movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
 tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
 tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
 tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
 tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
 tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)


movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
 Planning Time: 0.068 ms
 Execution Time: 30.572 ms
(10 rows)
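
For completeness, the expression-index alternative mentioned above could look like this (a sketch; the index name is made up, and the expression mirrors the NULL check used by the generated column):

create index movies_json_gen_cast_is_null_idx on movies_json_generated ((((jsonb_column->>'cast')::text) is null));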

You can see there are several options here for dealing with the nulls, some way easier (and cleaner) than others.  I want to highlight some of the challenges this brings up with using unstructured data within a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that you can evolve what you store over time.  Let’s use the above example.  Let’s say that for years, every movie that is fed into your system has a full cast listing of characters and actors.  Then, one day, the feed you get your data from allows movies without a cast listing.  Your application will still work, your load scripts will still work.  But every once in a while, your users will get a weird error, or you will see small flashes in your logs (if you are logging these).  But 99.9% of queries are fine.  It is these transient issues that drive people bonkers.

Looking at this problem slightly differently, what if you start adding data or changing the order of certain items?  Back to our original JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

What if we wanted to add another piece of data to the cast records? Let’s say we want to add a thumbs-up or thumbs-down vote to each character, so something like this:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100,
      "character": "Bruce Banner"
    },

Now we run into another challenge.  If we inject the new fields into the middle of the JSON, instead of at the end, our call to jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text) will cause issues in the application.  We had to define the structure of the recordset; the first field returned is id, then the actor’s name, and then the character.  If we changed this record, the character returned in the SQL would be 10000, not Bruce Banner.  Now, to make additions, you would need to start adding at the end, like:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100
    },

MySQL mitigates this by allowing you to define specific paths in its equivalent function: json_table(json_column, '$.cast[*]' columns( V_name varchar(200) path '$.name', V_character varchar(200) path '$.character')). This allows you to define exactly which fields you want, not just the first X ones. That said, there are other limitations in the json_table method MySQL uses.  There is a trade-off here.  You need to be aware again that the order and structure of your JSON matter greatly if you are going to use built-in database functions to query this data.
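
On the PostgreSQL side, another way to pull fields from the nested array by key name, rather than through a fixed record definition, is to expand the array with jsonb_array_elements and extract each field explicitly. A sketch against the same table used above:

select imdb_id, title,
       elem->>'id' as id,
       elem->>'name' as name,
       elem->>'character' as character
from movies_json_generated,
     jsonb_array_elements(jsonb_column->'cast') as elem
where jsonb_typeof(jsonb_column->'cast') = 'array'
  and elem->>'name' like 'Robert Downey%'
limit 5;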

To JSON or Not?  Database Design

While PostgreSQL’s JSON features are pretty in-depth (there are dozens of other functions I did not show), there are similar challenges I see when using these compared to a more classic database design. JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool to offer additional flexibility.  If you treat PostgreSQL or any database simply as a dumping ground for data, bad things happen.  It would help to think about how JSON columns can augment your current database designs, not replace them.

Let me give you a simple example of some of the dangers of simply forgoing database design in favor of just dumping in JSON.  While I think we have highlighted some of the complexity and setup challenges with different methods of storing JSON in PostgreSQL, I wanted to take this a step further and compare the performance of different database designs and their impact on performance and scalability.  For this, I built three separate designs and schemas for housing the same data:

A.)  Minimal Database Design -> single table, single column.  “The dump.”  With only a GIN index.

B.)  Single Table -> Generated columns and Expression indexes where needed.

C.)  Old School Normalized Tables, with JSON stored for easy use

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

Query | A.) Minimal design, no GIN idx | A.) Minimal design, with GIN idx | B.) Single table w/ generated columns | B.) Single table w/ expression indexes | C.) Normalized
Simple query for movie title (random 10 titles) | 800ms | 0.3ms | 0.2ms | 0.2ms | 0.2ms
Select top 10 movies | 841ms | 831ms | 0.9ms | 0.9ms | 0.3ms
Select all movies for an actor | 1009.0ms | 228ms | 5588.0ms(2) | 5588.0ms(2) | 0.8ms
Count movies for a given actor | 5071ms | 5075ms | 5544ms | NA | 1ms
  1. Each test was repeated 100 times, and the average results were listed.  Min/Max is available as well. 
  2. The nested JSON for our “cast info” was not easily indexed; there are some things we could do to mitigate this, but it is a bit more complicated.

Interestingly, there are a few cases where we have problems with using only generated columns/expression indexes, generally because of nested data within the JSON.  We could do some tricks, like adding a GIN index and querying the JSON to find the element, or creating some funky nested generated column with an expression index.  I left these out for now.  Also, I could go deeper into the tests, testing inserts, updates, and, of course, assembling JSON from a normalized table, but I didn’t.  My goal here is merely to make you think about your design.  Choosing one way to implement and use JSON may be ok under certain use cases, but you may find yourself querying or aggregating the data, and things break.

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways. jsonb_to_recordset is the easiest way I found to pull out the data I need.  However, using this function or others is very susceptible to the data structure in your document (data types matter!).
  • JSON data whose format changes ( elements added or removed ) may make using some functions difficult, if not impossible, to use.
  • JSON within a well-built, designed application and database can be wonderful and offer many great benefits.  JSON just dumped into databases won’t scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!

Sep
03
2021
--

Installing PostgreSQL using Docker

install postgresql with docker

Following the series of blog posts started by Peter Zaitsev with Installing MySQL with Docker, about deploying Docker containers running open source databases, in this article I’ll demonstrate how to install PostgreSQL using Docker.

Before proceeding, it is important to remind you of Peter’s warning from his article, which applies here as well: 

“The following instructions are designed to get a test instance running quickly and easily; you do not want to use these for production deployments.” 

We intend to show how to bring up PostgreSQL instances in Docker containers. You will also find some basic Docker management operations and example snippets, all based on the official PostgreSQL image from Docker Hub.

Installing

Starting with the installation, the following example shows how to bring up a Docker container with the latest PostgreSQL release from Docker Hub. You must provide the password for the postgres user when launching the container; otherwise, the container will not be created:

$ docker run --name dockerPostgreSQL \
-e POSTGRES_PASSWORD=secret \
-d postgres:latest

Once you start the container, you can check if it successfully started with docker inspect:

$ docker inspect -f '{{.State.Running}}' dockerPostgreSQL
true

If not, you can try starting it manually (you will see how to do it in the Managing the PostgreSQL Containers section).

Getting Connected to the Containerized Instance

The snippets below show how to connect to your containerized PostgreSQL instance.

  • Directly from docker:
$ docker exec -it dockerPostgreSQL psql --user postgres
postgres=# \conninfo

You are connected to database "postgres" as user "postgres" via socket in "/var/run/postgresql" at port "5432".

  • Or from a PostgreSQL client connecting to its IP address, which you can obtain with the command docker inspect:
$ docker inspect -f '{{.NetworkSettings.IPAddress}}' dockerPostgreSQL
172.16.0.12

$ psql --user postgres --port 5432 --host 172.16.0.12
postgres=# \conninfo
You are connected to database "postgres" as user "postgres" on host "172.16.0.12" at port "5432".
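
If you prefer connecting through the host rather than the container IP, you can publish the port when creating the container (a sketch; the container name and host port are arbitrary):

$ docker run --name dockerPostgreSQLPub -e POSTGRES_PASSWORD=secret -p 5432:5432 -d postgres:latest
$ psql --user postgres --host 127.0.0.1 --port 5432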

Managing the PostgreSQL Containers

The command below will show you how to list all active containers and their respective statuses:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                     NAMES
7f88656c4864        postgres:11         "docker-entrypoint..."   4 minutes ago       Up 4 minutes        0.0.0.0:50432->5432/tcp   dockerPostgreSQL11
8aba8609dabc        postgres:11.5       "docker-entrypoint..."   21 minutes ago      Up 21 minutes       5432/tcp                  dockerPostgreSQL115
15b9e0b789dd        postgres:latest     "docker-entrypoint..."   32 minutes ago      Up 32 minutes       5432/tcp                  dockerPostgreSQL

To stop one of the containers, do:

$ docker stop dockerPostgreSQL

To start a container:

$ docker start dockerPostgreSQL

Finally, to remove a container, first stop it and then:

$ docker rm dockerPostgreSQL

Customizing PostgreSQL settings in the container

Let’s say you want to define different parameters for the PostgreSQL server running on Docker. You can create a custom postgresql.conf configuration file to replace the default one used with Docker:

$ docker run --name dockerPostgreSQL11 -p 50433:5432 -v "$PWD/my-postgres.conf":/etc/postgresql/postgresql.conf -e POSTGRES_PASSWORD=secret -d postgres:11 -c 'config_file=/etc/postgresql/postgresql.conf'

Another example of how to pass startup arguments is changing the directory that stores the WAL files:

$ docker run --name dockerPostgreSQL11 -p 50433:5432 -v "$PWD/my-postgres.conf":/etc/postgresql/postgresql.conf -e POSTGRES_PASSWORD=secret -e POSTGRES_INITDB_WALDIR=/backup/wal -d postgres:11 -c 'config_file=/etc/postgresql/postgresql.conf'
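
To confirm the server picked up the custom file, you can check from inside the container (a sketch):

$ docker exec -it dockerPostgreSQL11 psql --user postgres -c 'SHOW config_file;'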

Running Different PostgreSQL Versions in Different Containers

You can launch multiple PostgreSQL containers, each running a different PostgreSQL version. In the example below, you will find how to start up a PostgreSQL 10.5 container:

$ docker run --name dockerPostgreSQL105 -p 50433:5432 -e POSTGRES_PASSWORD=secret -d postgres:10.5
7f51187d32f339688c2f450ecfda6b7552e21a93c52f365e75d36238f5905017

Here is how you could launch a PostgreSQL 11 container:

$ docker run --name dockerPostgreSQL11 -p 50432:5432 -e POSTGRES_PASSWORD=secret -d postgres:11
7f88656c4864864953a5491705ac7ca882882df618623b1e5664eabefb662733

After that, by reissuing the status command, you will see all of the already created ones:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                     NAMES
e06e637ae090        postgres:10.5       "docker-entrypoint..."   4 seconds ago       Up 4 seconds        0.0.0.0:50433->5432/tcp   dockerPostgreSQL105
7f88656c4864        postgres:11         "docker-entrypoint..."   About an hour ago   Up About an hour    0.0.0.0:50432->5432/tcp   dockerPostgreSQL11
15b9e0b789dd        postgres:latest     "docker-entrypoint..."   2 hours ago         Up 2 hours          5432/tcp                  dockerPostgreSQL

You will notice that both 10.5 and 11 were created with different port mappings. Inside their containers, both are listening on 5432, but they are externally mapped to different ports. Therefore, you will be able to access all of them externally.

You will find even more details on the PostgreSQL Docker Hub page.

Sep
03
2021
--

Storing and Using JSON Within PostgreSQL Part One

Storing JSON PostgreSQL

Continuing our series on storing JSON directly within your database, we are now looking at the JSON functionality built into PostgreSQL.  You can see the first two posts in this series exploring MySQL’s JSON functionality with Storing JSON in Your Databases: Tips and Tricks For MySQL Part One and Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two.  I used the exact same table structures and datasets as I did within my MySQL tests.  You can get the instructions to follow along at home here on GitHub.

PostgreSQL has two JSON data types available to you: JSON and JSONB.  The docs highlight the differences pretty well:

JSON vs. JSONB comparison from the PostgreSQL docs

Basically, JSONB is stored decomposed, making it more efficient for the database to interact with, while the JSON type is stored as an exact text copy of what was input.  This means formatting, white space, etc., will be lost with JSONB.  That said, the suggestion is to generally use JSONB.  The trade-off is slightly slower inserts vs. faster overall read performance (on top of the storage differences mentioned above).

In terms of performance, how much of a difference is there?  That is tricky, as it will depend heavily on the size and complexity of the JSON document being stored.  The larger, more complex the document, the larger the difference you could see.  But to illustrate an example, we will use our movie JSON to select the same data from both JSON and JSONB.

I created the following tables and loaded them with movie-related JSON:

create table movies_json (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	json_column json not null
);

create unique index movies_json_imdb_idx on movies_json(imdb_id);


create table movies_jsonb (
	ai_myid serial primary key, 
	imdb_id varchar(255),
	jsonb_column jsonb not null
);

create unique index movies_jsonb_imdb_idx on movies_jsonb(imdb_id);

Querying the JSON column results in slower retrieval:

movie_json_test=# explain (verbose true, analyze true) select * from movies_json where json_column->>'title' = 'Avengers: Endgame (2019)';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..55775.85 rows=1880 width=1059) (actual time=694.047..2516.879 rows=1 loops=1)
   Output: ai_myid, imdb_id, json_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_json  (cost=0.00..54587.85 rows=783 width=1059) (actual time=1905.511..2512.010 rows=0 loops=3)
         Output: ai_myid, imdb_id, json_column
         Filter: ((movies_json.json_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=2511.276..2511.277 rows=0 loops=1
         Worker 1:  actual time=2511.322..2511.322 rows=0 loops=1
 Planning Time: 0.166 ms
 Execution Time: 2516.897 ms
(12 rows)

While the JSONB column is faster (3x):

movie_json_test=# explain (verbose true, analyze true) select * from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54116.60 rows=1873 width=1025) (actual time=723.324..726.914 rows=1 loops=1)
   Output: ai_myid, imdb_id, jsonb_column
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52929.30 rows=780 width=1025) (actual time=548.982..721.730 rows=0 loops=3)
         Output: ai_myid, imdb_id, jsonb_column
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=720.995..720.995 rows=0 loops=1
         Worker 1:  actual time=202.751..720.994 rows=1 loops=1
 Planning Time: 0.038 ms
 Execution Time: 726.933 ms
(12 rows)

To ensure this is not a single data point, I wrote a small script to run this test (with this title and other random titles).  Over 100 runs, we had pretty consistent results:

Result (seconds) | JSON | JSONB
Average time | 2.5492 | 0.747
Min time | 2.5297428970225155 | 0.7297536049736664
Max time | 2.56536191503983 | 0.7827945239841938

As stated, inserts/updates will lose some of that performance, so keep that in mind.  It is, however, easier to interact with and update data in a JSONB column.  For instance, some functions are only available for the JSONB data type, like the JSONB_SET function, which allows you to update your JSON in place.

movie_json_test=# explain (verbose true, analyze true) update movies_jsonb set jsonb_column= jsonb_set(jsonb_column, '{imdb_rating}', '9') where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Update on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=1928.691..1928.692 rows=0 loops=1)
   ->  Seq Scan on public.movies_jsonb  (cost=0.00..56211.80 rows=1873 width=52) (actual time=577.386..1928.246 rows=1 loops=1)
         Output: ai_myid, imdb_id, jsonb_set(jsonb_column, '{imdb_rating}'::text[], '9'::jsonb, true), ctid
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 375358
 Planning Time: 0.040 ms
 Execution Time: 1928.718 ms
(7 rows)

I have seen references in several places to the insert performance difference between JSON and JSONB, but I have not seen many concrete numbers.  I decided to run a few tests (the general shape of the test is sketched after the results below):

  • Inserting 10K records into a JSON column (via insert into select from):  72.851 ms
  • Inserting 10K records into a JSONB column (via insert into select from):  754.045 ms

If you are doing a heavy insert workload, the difference is significant enough to take note and plan for it.  
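For reference, the test was essentially of the following shape; the staging table names here are hypothetical, and the timings above come from psql's \timing:

\timing on
-- copy 10K documents into an empty staging table of each type
insert into movies_json_insert_test (json_column)   select json_column  from movies_json  limit 10000;
insert into movies_jsonb_insert_test (jsonb_column) select jsonb_column from movies_jsonb limit 10000;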

Another distinct advantage of JSONB is the ability to use GIN indexes over the JSON document.  GIN (Generalized Inverted Index) indexes are designed and optimized for searching within composite values such as documents, which lends itself well to JSON.  That said, the syntax needed to actually make use of a GIN index can be a bit more complicated.

Here is how you create a GIN index. 

movie_json_test=# CREATE INDEX movies_jsonb_gin_index ON movies_jsonb USING gin (jsonb_column);
CREATE INDEX

You can see the original query is still not using this index:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=716.059..719.346 rows=1 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=544.197..714.577 rows=0 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
         Rows Removed by Filter: 125119
         Worker 0:  actual time=202.768..713.907 rows=1 loops=1
         Worker 1:  actual time=713.890..713.891 rows=0 loops=1
 Planning Time: 0.041 ms
 Execution Time: 719.365 ms
(12 rows)

Why not?  The ->> operator returns the value as text, so the conversion from JSONB to text gets in the way of the GIN index.  There are a couple of ways to work around this.  The first is to use the @@ operator, which evaluates a jsonpath predicate against the document (only the first item of the result is considered):

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title == "Avengers: Endgame (2019)"';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=72.29..221.83 rows=38 width=64) (actual time=0.179..0.180 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..72.28 rows=38 width=0) (actual time=0.068..0.068 rows=2 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @@ '($."title" == "Avengers: Endgame (2019)")'::jsonpath)
 Planning Time: 0.145 ms
 Execution Time: 0.199 ms
(8 rows)

This works.  You can also use the @> containment operator, which checks whether the JSON value contains the specified entries.

movie_json_test=> explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "title" : "Avengers: Endgame (2019)"}';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=44.29..193.94 rows=38 width=64) (actual time=0.199..0.200 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movies_jsonb_gin_index  (cost=0.00..44.28 rows=38 width=0) (actual time=0.091..0.091 rows=1 loops=1)
         Index Cond: (movies_jsonb.jsonb_column @> '{"title": "Avengers: Endgame (2019)"}'::jsonb)
 Planning Time: 0.147 ms
 Execution Time: 0.214 ms
(8 rows)

Not only did this use the index, it is also far faster overall (0.199 ms here vs. over 700 ms for the sequential scan).  The PostgreSQL documentation offers a fairly detailed overview of the different options to interact with JSON, especially when it comes to the functions and operators available to you (note: the table in the PG 12 docs giving an overview of the operators is a little more readable).  It is important to note that not all operators and functions will work with GIN indexes.  For example, using ‘like_regex’:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.title like_regex "^Avengers*"';
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..53546.87 rows=38 width=64) (actual time=218.550..795.063 rows=9 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52543.07 rows=16 width=64) (actual time=251.866..790.098 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: (movies_jsonb.jsonb_column @@ '($."title" like_regex "^Avengers*")'::jsonpath)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=311.403..789.402 rows=3 loops=1
         Worker 1:  actual time=225.825..789.408 rows=1 loops=1
 Planning Time: 0.204 ms
 Execution Time: 795.087 ms
(12 rows)

In addition to GIN indexes, you can create hash or btree indexes on your JSON columns; however, they are only useful if you are comparing or searching the entire JSON document.  Instead, you want to use an expression index (often referred to as a functional index in other databases).

Generated Columns and Expression Indexes

While GIN indexes work for some use cases, you will often find it easier and more convenient to use actual columns or some of the normal functions you are familiar with.  The simplest use of an expression index is to pull out and index a field you will use often.  Let’s say we will often want to search for a title.  We can create an index on the extracted JSON from our previous example.

movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text));
CREATE INDEX

This creates a btree index on the jsonb_column ->> 'title' expression, allowing us to search for a title.

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' = 'Avengers: Endgame (2019)';
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=66.97..6421.19 rows=1877 width=64) (actual time=0.138..0.140 rows=1 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Recheck Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..66.50 rows=1877 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: ((movies_jsonb.jsonb_column ->> 'title'::text) = 'Avengers: Endgame (2019)'::text)
 Planning Time: 0.160 ms
 Execution Time: 0.159 ms
(8 rows)

Originally, when we searched for movies with the ‘Avengers: Endgame (2019)’ title there was no index to use (unless you modified the SQL to make use of the GIN index).  Now, with this new index, our original query no longer does a sequential scan and instead uses the btree index.  If we also want the index to be used for ‘like string%’ statements, to help with partial matching, we need something a bit extra added to the index.  Note: this is not something you can do by default using only the GIN index (however, you can look into the pg_trgm extension for this; see the sketch after the example below).  Here we will create the same index with the operator class “text_pattern_ops”, which is designed to work with LIKE and regex patterns.

Before:

movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..54125.60 rows=1877 width=64) (actual time=371.463..723.743 rows=8 loops=1)
   Output: ((jsonb_column ->> 'title'::text)), ((jsonb_column ->> 'imdb_rating'::text))
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.movies_jsonb  (cost=0.00..52937.90 rows=782 width=64) (actual time=288.053..718.957 rows=3 loops=3)
         Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
         Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
         Rows Removed by Filter: 125117
         Worker 0:  actual time=204.176..718.288 rows=2 loops=1
         Worker 1:  actual time=288.637..718.299 rows=3 loops=1
 Planning Time: 0.130 ms
 Execution Time: 723.762 ms
(12 rows)

After (dropping the original expression index and recreating it with the text_pattern_ops operator class):

movie_json_test=# create index movie_jsonb_title_index on movies_jsonb (((jsonb_column ->> 'title')::text) text_pattern_ops);
CREATE INDEX
movie_json_test=# explain (verbose true, analyze true) select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column->>'title' like 'Avengers%';
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.movies_jsonb  (cost=71.66..6425.88 rows=1877 width=64) (actual time=0.195..0.498 rows=8 loops=1)
   Output: (jsonb_column ->> 'title'::text), (jsonb_column ->> 'imdb_rating'::text)
   Filter: ((movies_jsonb.jsonb_column ->> 'title'::text) ~~ 'Avengers%'::text)
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on movie_jsonb_title_index  (cost=0.00..71.19 rows=1877 width=0) (actual time=0.015..0.015 rows=8 loops=1)
         Index Cond: (((movies_jsonb.jsonb_column ->> 'title'::text) ~>=~ 'Avengers'::text) AND ((movies_jsonb.jsonb_column ->> 'title'::text) ~<~ 'Avengert'::text))
 Planning Time: 0.168 ms
 Execution Time: 0.519 ms
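
As an aside, if you want index support for more general pattern matching (including '%string%' searches), the pg_trgm route mentioned above is worth exploring. A minimal sketch, assuming the pg_trgm extension is available and using a hypothetical index name, would look like this:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- trigram GIN index over the extracted title text
CREATE INDEX movie_jsonb_title_trgm_idx
    ON movies_jsonb USING gin ((jsonb_column ->> 'title') gin_trgm_ops);
-- queries like this can then use the trigram index
SELECT jsonb_column ->> 'title' FROM movies_jsonb WHERE jsonb_column ->> 'title' LIKE '%Endgame%';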

Now you can also perform a similar operation, but store the results as a queryable generated column instead:

movie_json_test=> \d movies_jsonb
                                         Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                    Default                    
--------------+------------------------+-----------+----------+-----------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

movie_json_test=> alter table movies_jsonb add column title varchar(255) generated always as (((jsonb_column ->> 'title')::text)) stored;
ALTER TABLE


movie_json_test=> \d movies_jsonb
                                                  Table "public.movies_jsonb"
    Column    |          Type          | Collation | Nullable |                             Default                             
--------------+------------------------+-----------+----------+-----------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_jsonb_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | 
 jsonb_column | jsonb                  |           | not null | 
 title        | character varying(255) |           |          | generated always as (((jsonb_column ->> 'title'::text))) stored
Indexes:
    "movies_jsonb_pkey" PRIMARY KEY, btree (ai_myid)
    "movies_jsonb_imdb_idx" UNIQUE, btree (imdb_id)

Now we have the option to add an index on the title column, select it without any JSON operators, etc.  Note: generated columns cannot be updated directly; their values are always derived from the JSON.
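For example (the index name is hypothetical), indexing and querying the generated column works just like any regular column:

-- index the generated title column directly
create index movies_jsonb_gen_title_idx on movies_jsonb (title);
-- and query it without any JSON operators
select ai_myid, title from movies_jsonb where title = 'Avengers: Endgame (2019)';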

We can create our entire movie_json table with extra columns and expression indexes and make it look a bit more like a normal table:

create table movies_json_generated (
	ai_myid serial primary key, 
	imdb_id varchar(255) generated always as (jsonb_column ->> 'imdb_id') stored,
	title varchar(255) generated always as (jsonb_column ->> 'title') stored,
    imdb_rating decimal(5,2) generated always as ((jsonb_column  ->> 'imdb_rating')::numeric) stored,
	overview text generated always as (jsonb_column ->> 'overview') stored,
	director jsonb generated always as ((jsonb_column ->> 'director')::json) stored,
	country varchar(100) generated always as (jsonb_column ->> 'country') stored,
	jsonb_column jsonb,
	json_column json
);

create unique index gen_imdb_idx on movies_json_generated(imdb_id);
create index gen_title_idx on movies_json_generated(title);
create index gen_func_title_index on movies_json_generated (((json_column ->> 'title')::varchar));	
CREATE INDEX Gen_gin_index ON movies_json_generated USING gin (jsonb_column);

You may have noticed we had to explicitly cast the columns to a specific data type.  This is one of the challenges, or at least one of the more difficult things to get used to, when working with JSON in both MySQL and PostgreSQL.

It’s worth remembering that a JSON document does not have explicit data types, so you will often find that some functions or indexes may not work as expected because you are comparing data of two different types (and some operations will return different results, e.g. sorting ASCII vs. numeric).  For example, let’s try to add 1 to our IMDB rating:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1.1))) ;
ERROR:  operator does not exist: text + numeric
LINE 1: ...imdb_rating}',(((((jsonb_column ->>'imdb_rating')))+1))) ;
                                                              ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

Here the imdb_rating that is returned is text, not numeric, so you cannot add 1 to it.  Logically, then, you want to cast the value to numeric to allow the addition.

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1))) ;
ERROR:  function jsonb_set(jsonb, unknown, numeric) does not exist
LINE 1: ...pdate  movies_jsonb_generated_1 set jsonb_column= jsonb_set(...
                                                             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Again we run into a type issue.  This time the jsonb_set function cannot accept a numeric value (nor does it implicitly convert one) where it expects a JSONB value.  So let’s cast the result to JSONB:

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::jsonb)) ;
ERROR:  cannot cast type numeric to jsonb
LINE 1: ...((((jsonb_column ->>'imdb_rating')::numeric))+1.1)::jsonb)) …

Here we find we cannot cast a numeric directly to JSONB.  So we have to convert the numeric to text, and then to JSONB.

movie_json_test=> update  movies_jsonb_generated_1 set jsonb_column= jsonb_set(jsonb_column::jsonb, '{imdb_rating}',(((((jsonb_column ->>'imdb_rating')::numeric))+1)::text::jsonb)) ;
UPDATE 100000

For some developers, this may prove a bit tricky to get the hang of.  It also extends to things that technically work but may be slow or produce unexpected results.  For instance, when comparing, sorting, or filtering, you may or may not pick up the index you are looking for, and you may end up comparing the ASCII representation of numeric values.  Below you can see us explicitly convert certain columns in order to compare them.  You can also see that when you generate a column or create an expression index you need to do the same:

movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where (jsonb_column->>'imdb_rating')::numeric > 8 order by (jsonb_column->>'imdb_rating')::numeric desc limit 5;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=65291.78..65292.36 rows=5 width=128) (actual time=1055.064..1058.740 rows=5 loops=1)
   ->  Gather Merge  (cost=65291.78..77449.06 rows=104198 width=128) (actual time=1055.062..1058.738 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=64291.76..64422.00 rows=52099 width=128) (actual time=1053.392..1053.393 rows=4 loops=3)
               Sort Key: (((jsonb_column ->> 'imdb_rating'::text))::numeric) DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..63426.41 rows=52099 width=128) (actual time=0.993..1052.275 rows=2688 loops=3)
                     Filter: (((jsonb_column ->> 'imdb_rating'::text))::numeric > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.061 ms
 Execution Time: 1058.769 ms
(14 rows)
 
movie_json_test=> \d movies_json_generated
                                                     Table "public.movies_json_generated"
    Column    |          Type          | Collation | Nullable |                                    Default                                    
--------------+------------------------+-----------+----------+-------------------------------------------------------------------------------
 ai_myid      | integer                |           | not null | nextval('movies_json_generated_ai_myid_seq'::regclass)
 imdb_id      | character varying(255) |           |          | generated always as (((json_column ->> 'imdb_id'::text))) stored
 title        | character varying(255) |           |          | generated always as (((json_column ->> 'title'::text))) stored
 imdb_rating  | numeric(5,2)           |           |          | generated always as (((json_column ->> 'imdb_rating'::text)::numeric)) stored
 overview     | text                   |           |          | generated always as (json_column ->> 'overview'::text) stored
 director     | jsonb                  |           |          | generated always as (((json_column ->> 'director'::text)::json)) stored
 country      | character varying(100) |           |          | generated always as (((json_column ->> 'country'::text))) stored
 jsonb_column | jsonb                  |           |          | 
 json_column  | json                   |           |          | 
Indexes:
    "movies_json_generated_pkey" PRIMARY KEY, btree (ai_myid)
    "gen_func_title_index" btree (((json_column ->> 'title'::text)::character varying))
    "gen_gin_index" gin (jsonb_column)
    "gen_imdb_idx" UNIQUE, btree (imdb_id)
    "gen_title_idx" btree (title)
 
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=62548.12..62548.70 rows=5 width=102) (actual time=112.458..114.704 rows=5 loops=1)
   ->  Gather Merge  (cost=62548.12..63277.80 rows=6254 width=102) (actual time=112.457..114.702 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=61548.09..61555.91 rows=3127 width=102) (actual time=110.807..110.808 rows=4 loops=3)
               Sort Key: imdb_rating DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on movies_json_generated  (cost=0.00..61496.16 rows=3127 width=102) (actual time=0.128..109.939 rows=2688 loops=3)
                     Filter: (imdb_rating > '8'::numeric)
                     Rows Removed by Filter: 122432
 Planning Time: 0.146 ms
 Execution Time: 114.729 ms
(14 rows)

movie_json_test=> create index test_index_imdb_rating on movies_json_generated (imdb_rating);
CREATE INDEX
movie_json_test=> explain analyze select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 5;
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..19.41 rows=5 width=102) (actual time=0.094..0.134 rows=5 loops=1)
   ->  Index Scan Backward using test_index_imdb_rating on movies_json_generated  (cost=0.42..28521.09 rows=7510 width=102) (actual time=0.093..0.132 rows=5 loops=1)
         Index Cond: (imdb_rating > '8'::numeric)
 Planning Time: 0.136 ms
 Execution Time: 0.152 ms
(5 rows)

A quick side note:  there will be a performance difference for inserts/updates on JSON/JSONB columns when one or more generated columns or expression indexes are involved.  Because of how JSONB is stored, it performs noticeably better than JSON when the column being updated feeds a lot of generated columns/expression indexes; in my tests, I saw a 3-4x or more performance improvement doing updates on a JSONB column with several generated columns/expression indexes vs. a JSON column with the same setup.  That said, the more generated columns and expression indexes you have on a table, the more they can impact performance overall… it is a trade-off.

  • A table with a single generated column took 7650 ms to load 100K rows.
  • A table with six generated columns took 8576 ms to load 100K rows.

While this is a very simple setup, with just a single JSONB column plus the generated columns, you can see a slight overhead.  The same held true for updates of the same 100K records: updating only one column in the table with six generated columns took 4.7 seconds vs. 3.6 seconds for the table with a single generated column (a ~25% difference).  Again, not an in-depth test, but it illustrates that adding a ton of expression indexes or generated columns will not be without some cost.
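The update comparison above was along these lines; the two table names are hypothetical copies of the schema with one and six generated columns respectively, and \timing provides the numbers:

\timing on
-- same 100K rows updated in each table; only the generated-column count differs
update movies_generated_one set jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}', '9');
update movies_generated_six set jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}', '9');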

Recap & What’s Next

A few quick takeaways:

  • Using JSONB is probably going to be your best option in most use cases
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits

Now you have the basics of JSON in PostgreSQL.  In the next post, we will review some of the more advanced options and do a deeper dive on performance, exploring a normalized schema vs. one that is heavily JSON.

Sep
02
2021
--

Introducing xbcloud: Exponential Backoff Feature in Percona XtraBackup

xbcloud Percona XtraBackup

Storing your data locally can impose security and availability risks. Major cloud providers have object storage services available to allow you to upload and distribute data across different regions using various retention and restore policies.

Percona XtraBackup delivers the xbcloud binary – an auxiliary tool to allow users to upload backups to different cloud providers directly.

Today we are glad to announce the introduction of the Exponential Backoff feature to xbcloud.

In short, this new feature will allow your backup upload/download to work better with unstable network connections by retrying each chunk and adding an exponential wait time in between retries, increasing the chances of completion in case of an unstable connection or network glitch.

This new functionality is available on today’s release of Percona XtraBackup 8.0.26 and will be available in Percona XtraBackup 2.4.24.

How it Works – in General

Whenever a chunk upload or download fails to complete, xbcloud checks the reason for the failure. It can be either a CURL/HTTP error or a client-specific error. If the error is listed as retriable (more about that later in this post), xbcloud will back off/sleep for a certain amount of time before trying again. It will retry the same chunk 10 times before aborting the whole process; 10 is the default retry amount and can be configured via the --max-retries parameter.

How it Works – Backoff Algorithm

Network glitches/instabilities usually happen for a short period of time. To make the xbcloud tool more reliable and increase the chances of a backup upload/download completing during those instabilities, we pause for a certain period of time before retrying the same chunk. The algorithm chosen is known as exponential backoff.

In the case of a retry, we calculate a power of two using the number of retries already done for that specific chunk as the exponent. Since xbcloud performs multiple asynchronous requests in parallel, we also add a random number of milliseconds, between 1 and 1000, to each chunk's wait. This avoids all asynchronous requests backing off for the same amount of time and retrying at once, which could cause network congestion.

The backoff time keeps increasing as the same chunk keeps failing to upload/download. Taking the default --max-retries of 10 as an example, the last backoff would be around 17 minutes (2^10 = 1024 seconds).

To overcome this, we have implemented the --max-backoff parameter. It defines the maximum time the program can sleep between chunk retries, in milliseconds, and defaults to 300000 (5 minutes).
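Putting those pieces together, the wait time before retry number n (starting at 1) has roughly the following shape; this is a sketch inferred from the description above and the example log later in this post, not the literal implementation:

sleep_ms = min( max_backoff, 2^n * 1000 + random(1, 1000) )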

How it Works – Retriable Errors

We have a set of errors for which we know the operation should be retried. For CURL, we retry on:

CURLE_GOT_NOTHING
CURLE_OPERATION_TIMEDOUT
CURLE_RECV_ERROR
CURLE_SEND_ERROR
CURLE_SEND_FAIL_REWIND
CURLE_PARTIAL_FILE
CURLE_SSL_CONNECT_ERROR

For HTTP, we retry the operation in case of the following status codes:

503
500
504
408

Each cloud provider might return a different CURL or HTTP error depending on the issue. To let users extend this list without relying on us shipping a new version of xbcloud, we created a mechanism for doing exactly that.

One can add new errors by setting --curl-retriable-errors / --http-retriable-errors respectively.

On top of that, we have enhanced the error handling when using --verbose output to show which error caused xbcloud to fail and which parameter a user would have to add to retry on that error. Here is one example:

210701 14:34:23 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Server returned nothing (no headers, no data)
210701 14:34:23 /work/pxb/ins/8.0/bin/xbcloud: Curl error (52) Server returned nothing (no headers, no data) is not configured as retriable. You can allow it by adding --curl-retriable-errors=52 parameter

Both options accept a comma-separated list of error codes.
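For example, a hypothetical invocation that streams a backup to S3 and combines these options could look like the following (bucket, endpoint, and backup name are illustrative; credentials can be passed via options or environment variables):

xtrabackup --backup --stream=xbstream --target-dir=/tmp | \
xbcloud put --storage=s3 \
  --s3-endpoint='s3.amazonaws.com' \
  --s3-bucket='my-backup-bucket' \
  --parallel=8 \
  --max-retries=10 \
  --max-backoff=300000 \
  --curl-retriable-errors=52 \
  full_backup_2021_09_02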

Example

Below is one example of xbcloud exponential backoff in practice, used with --max-retries=5 --max-backoff=10000:

210702 10:07:05 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Server returned nothing (no headers, no data)
210702 10:07:05 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 2384 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [1]

. . .

210702 10:07:23 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Server returned nothing (no headers, no data)
210702 10:07:23 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 4387 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [2]

. . .

210702 10:07:52 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Failed sending data to the peer
210702 10:07:52 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 8691 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [3]

. . .

210702 10:08:47 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Failed sending data to the peer
210702 10:08:47 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 10000 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [4]

. . .

210702 10:10:12 /work/pxb/ins/8.0/bin/xbcloud: successfully uploaded chunk: backup3/xtrabackup_logfile.00000000000000000006, size: 8388660

Let’s analyze the log snippet above:

  1. Chunk xtrabackup_logfile.00000000000000000006 failed to upload for the first time (the [1] above) and the process slept for 2384 milliseconds.
  2. Then the same chunk failed for the second time (as seen by the number within the brackets), roughly doubling the sleep time.
  3. When the chunk failed for the third time, the sleep time kept increasing exponentially, to around 8 seconds.
  4. On the fourth failure, the exponential wait would have grown to around 16 seconds; however, since we used --max-backoff=10000, the sleep time between retries is capped, and the program waited 10 seconds before trying the same chunk again.
  5. In the end, the chunk xtrabackup_logfile.00000000000000000006 was successfully uploaded.

Summary

Best practices recommend distributing your backups to different locations, and cloud providers have dedicated services for this purpose. Using xbcloud alongside Percona XtraBackup gives you the tools to meet this requirement for MySQL backups. On the other hand, we know that network connectivity can be unstable at the worst times. The new version of xbcloud won’t stop you from completing your backups, as it is more resilient to those instabilities and offers a variety of options to tune the network transfer.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure, open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Download Percona Distribution for MySQL Today

Aug
31
2021
--

My Favorite Percona Monitoring and Management Additional Dashboards

Percona Monitoring and Management Dashboards

Percona Monitoring and Management (PMM) has dashboards that cover a lot of ground, yet PMM’s superpowers come from the fact that you do not need to stick to the dashboards included with the product! You can also easily install additional dashboards provided by the community, as well as implement your own.

In this blog post, we will cover some of the additional dashboards which I find particularly helpful.

Node Processes Dashboard

Node Processes Dashboard

Get insights into the processes on the system to better understand resource usage by your database server vs. everything else running on the host.  Unexpected resource-hogging processes are quite a common cause of downtime and performance issues.  More information is available in the Understanding Processes on your Linux Host blog post.

MySQL Memory Usage Details

MySQL Memory Usage Details

Ever wondered where MySQL memory usage comes from? This dashboard can shed light on that dark corner, showing the top global memory consumers as well as which users and client hosts contribute to memory usage.  More details are in the Understanding MySQL Memory Usage with Performance Schema blog post.

MySQL Query Performance Troubleshooting

MySQL Query Performance Troubleshooting

Want to understand which queries are responsible for CPU, Disk, Memory, or Network Usage and get some other advanced MySQL Query Troubleshooting tools? Check out this dashboard.  Read more about it in the  MySQL Query Performance Troubleshooting blog post.

RED Method for MySQL Dashboard

RED Method for MySQL Dashboard

Want to apply the RED (Rate-Errors-Duration)  method to MySQL?  Check out this dashboard, and check out RED Method for MySQL Performance Analyses for more details.

OK, so let’s say you’re convinced and want to get those dashboards into your PMM installation, but manual installation does not excite you.  Here is how you can use custom dashboard provisioning to install all of them:

curl -LJOs https://github.com/Percona-Lab/pmm-dashboards/raw/main/misc/import-dashboard-grafana-cloud.sh --output import-dashboard-grafana-cloud.sh
curl -LJOs https://github.com/Percona-Lab/pmm-dashboards/raw/main/misc/cleanup-dash.py --output cleanup-dash.py

chmod a+x import-dashboard-grafana-cloud.sh
chmod a+x cleanup-dash.py

./import-dashboard-grafana-cloud.sh -s <PMM_SERVER_IP> -u admin:<ADMIN_PASSWORD> -f Custom -d 13266 -d 12630 -d 12470 -d 14239

Note:  Node Processes and MySQL Memory Usage Details dashboards also require additional configuration on the client-side. Check out the blog posts mentioned for specifics.

Enjoy!

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

Download Percona Monitoring and Management Today

Aug
31
2021
--

Installing MongoDB With Docker

installing MongoDB with Docker

Following the series of blogs written with the intention of describing basic operations involving Docker and open source databases, in this article I will demonstrate how to proceed with installing MongoDB with Docker.

The first one, written by Peter Zaitsev, was Installing MySQL with Docker.

Before proceeding, it is important to repeat the warning from Peter’s article:

“The following instructions are designed to get a test instance running quickly and easily; you do not want to use these for production deployments”. 

Also, a full description of Docker is not the objective of this article; I assume prior knowledge of Docker, and also that you already have it installed and configured so you can move on.

Docker quickly deploys standalone MongoDB containers. If you are looking for fast deployments of both replica sets and shards, I suggest looking at the mlaunch tool.

Peter mentioned, in his article, how MySQL has two different “official” repositories, and with MongoDB, it’s the same: MongoDB has one repository maintained by MongoDB and another maintained by Docker. I wrote this article based on the Docker-maintained repository.

Installing the Latest Version of MongoDB

The following snippet is one example of how to initiate a container of the latest MongoDB version from the Docker repository. 

docker run --name mongodb_dockerhub \
                -e MONGO_INITDB_ROOT_USERNAME=admin \
                -e MONGO_INITDB_ROOT_PASSWORD=secret \
                -d mongo:latest

Now, if you want to check the container status right after creating it:

docker ps

CONTAINER ID        IMAGE COMMAND                  CREATED STATUS PORTS               NAMES
fdef23c0f32a        mongo:latest "docker-entrypoint..."   4 seconds ago Up 4 seconds 27017/tcp           mongodb_dockerhub

Connecting to MongoDB Server Docker Container

With the container up and running, you will notice that no extra steps or dependency installations were required apart from the Docker binaries themselves. Now it is time to access the container’s MongoDB shell and issue a basic command like “show dbs”. 

docker exec -it mongodb_dockerhub mongo -u admin -p secret
MongoDB shell version v4.2.5
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("89c3fb4a-a8ed-4724-8363-09d65024e458") }
MongoDB server version: 4.2.5
Welcome to the MongoDB shell.

> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB

> exit
bye

It is also possible to connect to the containerized MongoDB using a mongo shell on the host. The docker ps output includes a PORTS field that shows the container’s port mapping, and then it is a simple connection using that port.

In the example below, we connect to mongodb_dockerhub36. It is running on port 27017 inside the container but mapped to port 27018 on the host:

docker ps

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                      NAMES
60ffc759fab9        mongo:3.6           "docker-entrypoint..."   20 seconds ago      Up 19 seconds       0.0.0.0:27018->27017/tcp   mongodb_dockerhub36

Hence, the mongo shell connection string is executed against the external IP and port 27018:

mongo admin -u admin --host 172.16.0.10 --port 27018 -psecret

Managing MongoDB Server in Docker Container

The following commands will demonstrate basic management operations when managing a MongoDB container. 

  • Starting the container and checking the Status
docker start mongodb_dockerhub

docker ps

CONTAINER ID        IMAGE COMMAND                  CREATED STATUS PORTS               NAMES
fdef23c0f32a        mongo:latest "docker-entrypoint..."   47 minutes ago Up 2 seconds 27017/tcp           mongodb_dockerhub

  • Stopping the container and checking the Status
docker stop mongodb_dockerhub

docker ps

CONTAINER ID        IMAGE COMMAND

  • Taking a look at the MongoDB log entries
docker logs mongodb_dockerhub
...
about to fork child process, waiting until server is ready for connections.
forked process: 25
2020-04-10T14:04:49.808+0000 I  CONTROL [main] ***** SERVER RESTARTED *****
2020-04-10T14:04:49.812+0000 I  CONTROL [main] Automatically disabling TLS 1.0, to f
...

Passing Command Line Options to MongoDB Server in Docker Container

It is also possible to define the instance’s parameters when launching the MongoDB container. In the example below, I will show how to set the WiredTiger cache size.

docker run --name mongodb_dockerhub \
                -e MONGO_INITDB_ROOT_USERNAME=admin -e MONGO_INITDB_ROOT_PASSWORD=secret \
                -d mongo:latest --wiredTigerCacheSizeGB 1.0
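
If you want to double-check that the option was applied, one way (a quick sketch relying on the WiredTiger section of serverStatus) is to query the configured cache size from inside the container; with a 1 GB cache it should report a value close to 1073741824 bytes:

docker exec -it mongodb_dockerhub mongo admin -u admin -p secret \
  --eval 'db.serverStatus().wiredTiger.cache["maximum bytes configured"]'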

Running Different MongoDB Server Versions in Docker

Another possibility is running two MongoDB containers in parallel under different versions. The snippet below describes how to build that scenario:

  • Launching a 4.0 container
docker run --name mongodb_dockerhub40 \
 -e MONGO_INITDB_ROOT_USERNAME=admin -e MONGO_INITDB_ROOT_PASSWORD=secret \
 -d mongo:4.0

  • Launching a 3.6 container
docker run --name mongodb_dockerhub36 \
 -e MONGO_INITDB_ROOT_USERNAME=admin -e MONGO_INITDB_ROOT_PASSWORD=secret \
 -d mongo:3.6

  • Checking the container’s status
docker ps

CONTAINER ID        IMAGE COMMAND                  CREATED STATUS PORTS               NAMES
e9f480497f2d        mongo:3.6 "docker-entrypoint..."   32 seconds ago Up 32 seconds 27017/tcp           mongodb_dockerhub36
3a799f8a907c        mongo:4.0 "docker-entrypoint..."   41 seconds ago Up 41 seconds 27017/tcp           mongodb_dockerhub40

If for some reason you need both containers running simultaneously and want to access them externally, use different port mappings. In the example below, both MongoDB containers listen on port 27017 inside their containers; nevertheless, I am setting a different external port mapping for each one.

  • MongoDB 4.0 mapping the port 27017
docker run --name mongodb_dockerhub40 \
 -p 27017:27017 \
 -e MONGO_INITDB_ROOT_USERNAME=admin -e MONGO_INITDB_ROOT_PASSWORD=secret \
 -d mongo:4.0

  • MongoDB 3.6 mapping the port 27018
docker run --name mongodb_dockerhub36 \
 -p 27018:27017 \
 -e MONGO_INITDB_ROOT_USERNAME=admin -e MONGO_INITDB_ROOT_PASSWORD=secret \
 -d mongo:3.6

  • Checking both Container’s status
docker ps

CONTAINER ID        IMAGE COMMAND                  CREATED STATUS PORTS                      NAMES
78a79e5606ae        mongo:4.0 "docker-entrypoint..."   3 seconds ago Up 2 seconds 0.0.0.0:27017->27017/tcp   mongodb_dockerhub40
60ffc759fab9        mongo:3.6 "docker-entrypoint..."   20 seconds ago Up 19 seconds 0.0.0.0:27018->27017/tcp   mongodb_dockerhub36

There are a lot of extra details on Docker’s MongoDB Hub page, and you will find more options to use in your MongoDB Docker deployments. Enjoy it!

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

Download Percona Distribution for MongoDB Today!

Aug
24
2021
--

Resharding in MongoDB 5.0

Resharding in MongoDB 5.0

MongoDB 5.0 has been released with a bunch of new features. An important one is the capability to completely redefine the shard key of a collection in a sharded cluster. Resharding was a feature frequently requested by customers. In MongoDB 4.4 only refining the shard key was available; from now on, you can change your shard key using different fields.

When you use sharding in MongoDB and you fail to choose a good shard key for a collection, you can face issues like:

  • Unbalanced distribution of the data
  • Having jumbo chunks that cannot be split and cannot be migrated

These issues can make your cluster very slow and leave some of the shards extremely overloaded. The cluster can also become really unmanageable, and crashes could happen at some point.

The Problem of Changing a Shard Key

I had a customer running a 5-shard cluster with more than 50TB of data. They had a few collections with sub-optimal shard keys: one of the shards held more than 75% of the data and workload, and there were thousands of jumbo chunks around, some of which reached 200GB in size; really a lot. The cluster was terribly slow, chunk migration attempts were frequent and failed most of the time, and the application was unresponsive much of the time due to timeouts. Manual splits and migrations of jumbo chunks were impossible. The only solution was to create a different shard key for those collections, but this wasn’t possible with that MongoDB version. The only suggested way to change a shard key was to drop and recreate the collection; remember that you also have to dump and reload the data. If the dataset is large, this requires you to stop your application for an impressive amount of time. That was the case.

In order to avoid stopping the application, we deployed a new, empty cluster with a few more shards as well. We created all the empty collections in the new cluster with the optimal shard key and then moved the data out of the old cluster with custom scripts, one piece at a time, using boundaries on indexed timestamp fields. The solution required the customer to deploy some application changes: the application was instructed to read/write from a different cluster depending on whether the affected data had already been migrated into the new cluster or not. The entire migration process took weeks to complete. In the end, it worked, and the customer now has a balanced cluster with no jumbo chunks.

Anyway, the solution came at some cost in terms of new machines, development effort, and time.

Fortunately, MongoDB 5.0 introduced the reshardCollection command. From now on changing a shard key should be less expensive than in the past. That’s awesome.

In this article, we’ll take a look at how resharding in MongoDB 5.0 works.

Internals

The way resharding works is simple. When you issue the reshardCollection command, a new implicit empty collection is created by MongoDB with the new shard key defined, and the data will be moved from the existing collection chunk by chunk. There is a two-second lock required by the balancer to determine the new data distribution, but during the copy phase, the application can continue to read and write the collection with no issues.

Due to this copy phase, the larger the collection is, the more time the resharding takes.

Prerequisites for Resharding

There are some requirements you need to validate before starting the resharding, and some could be expensive, in certain cases.

  • Disk space: be sure to have at least 1.2x the size of the collection you’re going to reshard. If the collection is 1TB, you need at least 1.2TB free space in your disk.
  • I/O capacity should be below 50%
  • CPU load should be below 80%

The resources available must be evaluated on each shard.

If you fail to provide these resources, the database could run out of space or the resharding could take longer than expected.

Another important requirement is that you will need to deploy changes to your application queries. To let the database work at its best during the resharding process, the queries must use filters on both the current shard key and the new shard key. Only at the end of the resharding process may you drop the filters on the old shard key from your queries.

This requirement is very important and in some cases it could be a little expensive. Temporarily changing the application code is sometimes a hard task.
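As an illustration of what that means in practice (a sketch only, using the keys from the test below, where the current shard key is { age: 1 } and the new one is { name: 1 }; the values are made up), an update issued while resharding is in progress would need to filter on both fields:

db.testcoll.updateOne(
  { age: 35, name: "7gbk73hf42g" },   // filter on both the current and the new shard key
  { $set: { city: "new city" } }
)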

There are also other limitations you need to consider:

  • Resharding is not permitted while an index build is running
  • Some queries will return an error if you don’t include both the current and the new shard keys: deleteOne(), findAndModify(), updateOne()… check the manual for the full list
  • You can reshard only one collection at a time
  • You cannot use addShard(), removeShard(), dropDatabase(), db.createCollection() while resharding is running
  • The new shard key cannot have a uniqueness constraint
  • Resharding a collection with a uniqueness constraint is not supported

Let’s Test Resharding in MongoDB

To test the new feature, I created a sharded cluster using AWS t2.large instances with 2 CPUs, 8 GB of RAM, and EBS volumes. I deployed a two-shard cluster using just a single member for each replica set.

Let’s create a sharded collection and insert some sample data.

At the beginning, we create the shard key on the age field. To create the random data, we use some helper functions and a JavaScript loop to insert one million documents.

[direct: mongos] testdb> sh.shardCollection("testdb.testcoll", { age: 1} )
{
  collectionsharded: 'testdb.testcoll',
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1628588848, i: 25 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1628588848, i: 21 })
}

[direct: mongos] testdb> function getRandomInt(min, max) {
...     return Math.floor(Math.random() * (max - min + 1)) + min;
... }
[Function: getRandomInt]

[direct: mongos] testdb> var day = 1000 * 60 * 60 * 24;

[direct: mongos] testdb> function getRandomDate() {
... return new Date(Date.now() - (Math.floor(Math.random() * day)))
... }
[Function: getRandomDate]

[direct: mongos] testdb> function getRandomString() {
... return (Math.random()+1).toString(36).substring(2)
... }
[Function: getRandomString]

[direct: mongos] testdb> for (var i=1; i<=1000000; i++) {
... db.testcoll.insert(
..... {
....... uid: i,
....... name: getRandomString(),
....... date_created: getRandomDate(),
....... age: getRandomInt(1,100),
....... address: getRandomString(),
....... city: getRandomString(),
....... country: getRandomString()
....... }
..... )
... }

Now we have our initial collection. Let’s take a look at the distribution of the chunks and the sharding status of the cluster using sh.status():

{
  database: {
    _id: 'testdb',
    primary: 'sh1-rs',
    partitioned: true,
    version: {
    uuid: UUID("efda5b69-1ffc-42c3-abed-ad20e08e321d"),
    timestamp: Timestamp({ t: 1628527701, i: 21 }),
    lastMod: 1
  }
},
collections: {
  'testdb.testcoll': {
    shardKey: { age: 1 },
    unique: false,
    balancing: true,
    chunkMetadata: [
      { shard: 'sh0-rs', nChunks: 2 },
      { shard: 'sh1-rs', nChunks: 3 }
    ],
    chunks: [
      { min: { age: MinKey() }, max: { age: 1 }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 0 }) },
      { min: { age: 1 }, max: { age: 42 }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 4, i: 0 }) },
      { min: { age: 42 }, max: { age: 72 }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 2 }) },
      { min: { age: 72 }, max: { age: 100 }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 3 }) },
      { min: { age: 100 }, max: { age: MaxKey() }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 5, i: 1 }) }
    ],
    tags: []
    }
  }
}

With the current shard key on the age field we’ve got 5 chunks in total, 2 on sh0-rs and 3 on sh1-rs. The total size of the collection is around 200MB (uncompressed).

It’s time to test the resharding. Let’s try to change the shard key to { name: 1 }. The following is the right syntax where we provide the namespace to reshard and the new shard key:

[direct: mongos] testdb> db.adminCommand({
... reshardCollection: "testdb.testcoll",
... key: { name: 1 }
... })

From another connection we can monitor the status of the resharding:

[direct: mongos] testdb> db.getSiblingDB("admin").aggregate([ 
{ $currentOp: { allUsers: true, localOps: false } }, 
{ $match: { type: "op", "originatingCommand.reshardCollection": "testdb.testcoll" } }])
[
  {
    shard: 'sh0-rs',
    type: 'op',
    desc: 'ReshardingRecipientService f86967fe-36a9-4836-8972-1061a61230df',
    op: 'command',
    ns: 'testdb.testcoll',
    originatingCommand: {
      reshardCollection: 'testdb.testcoll',
      key: { name: 1 },
      unique: false,
      collation: { locale: 'simple' }
    },
    totalOperationTimeElapsedSecs: Long("145"),
    remainingOperationTimeEstimatedSecs: Long("173"),
    approxDocumentsToCopy: Long("624135"),
    documentsCopied: Long("569868"),
    approxBytesToCopy: Long("95279851"),
    bytesCopied: Long("86996201"),
    totalCopyTimeElapsedSecs: Long("145"),
    oplogEntriesFetched: Long("0"),
    oplogEntriesApplied: Long("0"),
    totalApplyTimeElapsedSecs: Long("0"),
    recipientState: 'cloning',
    opStatus: 'running'
  },
  {
    shard: 'sh0-rs',
    type: 'op',
    desc: 'ReshardingDonorService f86967fe-36a9-4836-8972-1061a61230df',
    op: 'command',
    ns: 'testdb.testcoll',
    originatingCommand: {
      reshardCollection: 'testdb.testcoll',
      key: { name: 1 },
      unique: false,
      collation: { locale: 'simple' }
    },
    totalOperationTimeElapsedSecs: Long("145"),
    remainingOperationTimeEstimatedSecs: Long("173"),
    countWritesDuringCriticalSection: Long("0"),
    totalCriticalSectionTimeElapsedSecs: Long("0"),
    donorState: 'donating-initial-data',
    opStatus: 'running'
  }, 
  {
    shard: 'sh1-rs',
    type: 'op',
    desc: 'ReshardingDonorService f86967fe-36a9-4836-8972-1061a61230df',
    op: 'command',
    ns: 'testdb.testcoll',
    originatingCommand: {
      reshardCollection: 'testdb.testcoll',
      key: { name: 1 },
      unique: false,
      collation: { locale: 'simple' }
    },
    totalOperationTimeElapsedSecs: Long("145"),
    remainingOperationTimeEstimatedSecs: Long("276"),
    countWritesDuringCriticalSection: Long("0"),
    totalCriticalSectionTimeElapsedSecs: Long("0"),
    donorState: 'donating-initial-data', 
    opStatus: 'running'
  },
  {
    shard: 'sh1-rs',
    type: 'op',
    desc: 'ReshardingRecipientService f86967fe-36a9-4836-8972-1061a61230df',
    op: 'command',
    ns: 'testdb.testcoll',
    originatingCommand: {
      reshardCollection: 'testdb.testcoll',
      key: { name: 1 },
      unique: false,
      collation: { locale: 'simple' }
    },
    totalOperationTimeElapsedSecs: Long("145"),
    remainingOperationTimeEstimatedSecs: Long("276"),
    approxDocumentsToCopy: Long("624135"),
    documentsCopied: Long("430132"),
    approxBytesToCopy: Long("95279851"),
    bytesCopied: Long("65663031"),
    totalCopyTimeElapsedSecs: Long("145"),
    oplogEntriesFetched: Long("0"),
    oplogEntriesApplied: Long("0"),
    totalApplyTimeElapsedSecs: Long("0"),
    recipientState: 'cloning',
    opStatus: 'running'
  }
]

The totalOperationTimeElapsedSecs field shows the time elapsed so far, and remainingOperationTimeEstimatedSecs shows the estimated time remaining to complete the operation.

While resharding is running we are allowed to write into the collection. For example, I tried the following and it worked:

[direct: mongos] testdb> db.testcoll.insert({ name: "new doc" })
{
  acknowledged: true,
  insertedIds: { '0': ObjectId("61125fa53916a70b7fdd8f6c") }
}

The full resharding of the collection took around 5 minutes. Let’s now see the new status of the sharding.

{
  database: {
    _id: 'testdb',
    primary: 'sh1-rs',
    partitioned: true,
    version: {
      uuid: UUID("efda5b69-1ffc-42c3-abed-ad20e08e321d"),
      timestamp: Timestamp({ t: 1628527701, i: 21 }),
      lastMod: 1
    }
  },
  collections: {
    'testdb.testcoll': {
    shardKey: { name: 1 },
    unique: false,
    balancing: true,
    chunkMetadata: [
      { shard: 'sh0-rs', nChunks: 3 },
      { shard: 'sh1-rs', nChunks: 3 }
    ],
    chunks: [
      { min: { name: MinKey() }, max: { name: '7gbk73hf42g' }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 6, i: 0 }) },
      { min: { name: '7gbk73hf42g' }, max: { name: 'cv1oqoetetf' }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 1 }) },
      { min: { name: 'cv1oqoetetf' }, max: { name: 'kuk1p3z8kc' }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 2, i: 2 }) },
      { min: { name: 'kuk1p3z8kc' }, max: { name: 'ppq4ubqwywh' }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 2, i: 3 }) },
      { min: { name: 'ppq4ubqwywh' }, max: { name: 'zsods6qhqp' }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 1, i: 3 }) },
      { min: { name: 'zsods6qhqp' }, max: { name: MaxKey() }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 6, i: 1 }) }
    ],
    tags: []
    }
  }
}

We can see the shard key has been changed, and the chunks are now evenly distributed.

[direct: mongos] testdb> db.testcoll.getShardDistribution()
Shard sh1-rs at sh1-rs/172.30.0.108:27017
{
  data: '92.72MiB',
  docs: 636938,
  chunks: 3,
  'estimated data per chunk': '30.9MiB',
  'estimated docs per chunk': 212312
}
---
Shard sh0-rs at sh0-rs/172.30.0.221:27017
{
  data: '82.96MiB',
  docs: 569869,
  chunks: 3,
  'estimated data per chunk': '27.65MiB',
  'estimated docs per chunk': 189956
}
---
Totals
{
  data: '175.69MiB',
  docs: 1206807,
  chunks: 6,
  'Shard sh1-rs': [
    '52.77 % data',
    '52.77 % docs in cluster',
    '152B avg obj size on shard'
  ],
  'Shard sh0-rs': [
    '47.22 % data',
    '47.22 % docs in cluster',  
    '152B avg obj size on shard'
  ]
}

Good. It worked.

Out of curiosity: while the resharding is in progress, the temporary collection being populated is also visible. Just run sh.status():

    collections: {
      'testdb.system.resharding.defda80d-ea5f-4e52-9a7f-9b3e2ed8ddf1': {
        shardKey: { name: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: [
          { shard: 'sh0-rs', nChunks: 3 },
          { shard: 'sh1-rs', nChunks: 3 }
        ],
        chunks: [
          { min: { name: MinKey() }, max: { name: '9bnfw8n23l' }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 2, i: 0 }) },
          { min: { name: '9bnfw8n23l' }, max: { name: 'etd3ho9vfwg' }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 2, i: 1 }) },
          { min: { name: 'etd3ho9vfwg' }, max: { name: 'kmht0br01l' }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 1, i: 2 }) },
          { min: { name: 'kmht0br01l' }, max: { name: 'sljcb1s70g' }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 1, i: 3 }) },
          { min: { name: 'sljcb1s70g' }, max: { name: 'zzi3ijuw2f' }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 1, i: 4 }) },
          { min: { name: 'zzi3ijuw2f' }, max: { name: MaxKey() }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 1, i: 5 }) }
        ],
        tags: []
      },
      'testdb.testcoll': {
        shardKey: { age: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: [
          { shard: 'sh0-rs', nChunks: 2 },
          { shard: 'sh1-rs', nChunks: 3 }
        ],
        chunks: [
          { min: { age: MinKey() }, max: { age: 1 }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 0 }) },
          { min: { age: 1 }, max: { age: 42 }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 4, i: 0 }) },
          { min: { age: 42 }, max: { age: 72 }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 2 }) },
          { min: { age: 72 }, max: { age: 100 }, 'on shard': 'sh1-rs', 'last modified': Timestamp({ t: 5, i: 3 }) },
          { min: { age: 100 }, max: { age: MaxKey() }, 'on shard': 'sh0-rs', 'last modified': Timestamp({ t: 5, i: 1 }) }
        ],
        tags: []
      }
    }
  }

testdb.system.resharding.defda80d-ea5f-4e52-9a7f-9b3e2ed8ddf1 is the temporary collection created by the resharding operation: once the resharding commits, its contents replace the original testdb.testcoll.
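
If you prefer not to scan the full sh.status() output, a quick alternative (just a sketch based on the namespace pattern shown above) is to look the temporary collection up directly in the sharding catalog:

[direct: mongos] testdb> db.getSiblingDB("config").collections.find(
  // the temporary namespace follows the <db>.system.resharding.<uuid> pattern
  { _id: /^testdb\.system\.resharding\./ },
  // return only the namespace and the new shard key
  { _id: 1, key: 1 }
)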

Some Graphs from Percona Monitoring and Management

I deployed Percona Monitoring and Management (PMM) to monitor the cluster and evaluate the impact of the resharding operation. As mentioned, I used a small test environment with no other workload at all, so what we see in the following graphs is just the impact of the resharding.

The cluster has one config server and two shards. Each replica set consists of a single member running on a dedicated virtual host in AWS EC2.

Let’s see the CPU usage:

[PMM graph: CPU usage of the three nodes]

The node in the middle is the config server. The data-bearing nodes saw quite a relevant increase in CPU usage, while the config server was not impacted at all.

Let’s see the Disk utilization in terms of IOPS and latency:

[PMM graph: Disk utilization (IOPS and latency)]

As with the CPU, the config server was not impacted. The shards, instead, increased their IOPS as expected, since a new collection was being created. Disk latency spiked to more than 20 ms at the beginning and at the end of the process.

MongoDB also showed increased latency for read operations.

The WiredTiger storage engine processed more transactions and showed increased cache activity, as expected.

Other metrics increased as well, but they are not included here. This behavior was expected and is normal for a resharding operation.

Resharding is quite expensive, even for a small collection, particularly in terms of CPU utilization. In any case, more tests are needed with larger collections, in the GB/TB range.

Remember that the resharding time depends on the size of the collection. Be aware of that, and expect a performance decrease on your servers for some time when you plan a resharding.
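
If a running resharding turns out to be heavier than your cluster can tolerate, MongoDB 5.0 also lets you abort it at any point before it commits, for example:

[direct: mongos] testdb> sh.abortReshardCollection("testdb.testcoll")

In that case, the original collection and its shard key are left untouched.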

Conclusions

Resharding in MongoDB 5.0 is an amazing new feature, but it comes with some additional costs and limitations. Indeed, resharding requires resources in terms of disk space, I/O, and CPU. Be aware of that and plan your resharding operations carefully.

As a best practice, I suggest spending as much time as needed on the shard key choice the first time you decide to shard a collection. If you choose the optimal shard key from the beginning, you won’t need resharding at all. A good schema design is always the best approach when working with MongoDB.

I also suggest continuously monitoring the chunk distribution in your collections. As soon as you notice that the shard key is no longer working well and the distribution is uneven and getting worse, use reshardCollection as soon as you can, while the collection is not too large.
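
As a sketch of such a check (assuming MongoDB 5.0, where config.chunks references collections by UUID rather than by namespace), you can count the chunks owned by each shard directly from the config database:

[direct: mongos] testdb> const coll = db.getSiblingDB("config").collections.findOne({ _id: "testdb.testcoll" })
[direct: mongos] testdb> db.getSiblingDB("config").chunks.aggregate([
  // chunks belonging to our collection only, matched by collection UUID
  { $match: { uuid: coll.uuid } },
  // count how many chunks each shard owns
  { $group: { _id: "$shard", nChunks: { $sum: 1 } } },
  { $sort: { _id: 1 } }
])

db.collection.getShardDistribution(), shown earlier, provides the same information together with the data size per shard.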

Another suggestion is to run resharding only during a maintenance window, because the performance impact on your cluster could be significant in the case of very large collections.

Further reading:

https://docs.mongodb.com/manual/core/sharding-reshard-a-collection/

https://docs.mongodb.com/v5.0/core/sharding-choose-a-shard-key/

https://docs.mongodb.com/manual/reference/command/refineCollectionShardKey/

https://www.percona.com/blog/mongodb-5-0-is-coming-in-hot-what-do-database-experts-across-the-community-think/

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

Download Percona Distribution for MongoDB Today!
