Oct
14
2021
--

Custom Percona Monitoring and Management Metrics in MySQL and PostgreSQL

mysql postgresql custom metrics

A few weeks ago we did a live stream talking about Percona Monitoring and Management (PMM) and showcased some of the fun things we were doing at the OSS Summit.  During the live stream, we tried to enable some custom queries to track the number of comments being added to our movie database example.  We ran into a bit of a problem live and did not get it to work. As a result, I wanted to follow up, show you all how to add your own custom metrics to PMM, and point out some gotchas to avoid when building them.

Custom metrics are defined in files deployed on each host you are monitoring (not on the PMM server itself).  You can add custom metrics by navigating to one of the following directories:

  • For MySQL:  /usr/local/percona/pmm2/collectors/custom-queries/mysql
  • For PostgreSQL:  /usr/local/percona/pmm2/collectors/custom-queries/postgresql
  • For MongoDB:  This feature is not yet available – stay tuned!

You will notice the following directories under each directory:

  • high-resolution/  – every 5 seconds
  • medium-resolution/ – every 10 seconds
  • low-resolution/ – every 60 seconds

Note that you can change the frequency of the default metric collections up or down in the PMM settings.  It would be ideal if, in the future, we added a resolution setting directly in the YML file, but for now it is a universal setting:

Percona Monitoring and Management metric collections

In each directory you will find an example .yml file with a format like the following:

mysql_oss_demo: 
  query: "select count(1) as comment_cnt from movie_json_test.movies_normalized_user_comments;"
  metrics: 
    - comment_cnt: 
        usage: "GAUGE" 
        description: "count of the number of comments coming in"

Our error during the live stream was that we forgot to qualify the table with the database in our query (i.e., database_name.table_name), and a bug prevented us from seeing the error in the log files.  There is no setting for the database in the YML file, so take note.

This will create a metric named mysql_oss_demo_comment_cnt in whatever resolution you specify.  Each YML file executes separately with its own connection; this is important to understand, because if you deploy lots of custom query files you will see a steady number of additional connections (something you will want to consider if you are doing custom collections).  Alternatively, you can add multiple queries and metrics to the same file, but they are executed sequentially.  If the entire YML file cannot be completed within the defined resolution (i.e., finished within five seconds for high resolution), the data will not be stored, but the query will continue to run.  This can lead to a query pile-up if you are not careful.  For instance, the above query generally takes 1-2 seconds to return the count, so I placed it in the medium bucket.  As I added load to the system, the query time backed up.

You can see the slowdown.  You need to be careful here and choose the appropriate resolution.  Moving this over to the low resolution solved the issue for me.

That said, query response time is dynamic based on the conditions of your server.  Because these queries will run to completion (and in parallel if the run time is longer than the resolution time), you should consider limiting the query time in MySQL and PostgreSQL to prevent too many queries from piling up.

In MySQL you can use the MAX_EXECUTION_TIME optimizer hint (note that the value is in milliseconds):

mysql>  select /*+ MAX_EXECUTION_TIME(4) */  count(1) as comment_cnt from movie_json_test.movies_normalized_user_comments ;
ERROR 1317 (70100): Query execution was interrupted

And on PostgreSQL you can use:

SET statement_timeout = '4s'; 
select count(1) as comment_cnt from movies_normalized_user_comments ;
ERROR:  canceling statement due to statement timeout

By forcing a timeout you can protect yourself.  That said, these are “errors” so you may see errors in the error log.
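If you would rather not embed a timeout in every custom query, another option on the PostgreSQL side is to set it once for the monitoring user so that it applies to every custom query executed over that connection.  A minimal sketch, assuming the agent connects as a dedicated role (here hypothetically named pmm_agent):

-- Apply a four-second statement timeout to everything this role runs.
ALTER ROLE pmm_agent SET statement_timeout = '4s';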

You can check the system logs (syslog or messages) for errors with your custom queries (note that as of PMM 2.0.21, errors were not making it into these logs because of a potential bug).  If the data is being collected and everything is set up correctly, head over to the default Grafana explorer or the “Advanced Data Exploration” dashboard in PMM.  Look for your metric, and you should be able to see the data graphed out:

Advanced Data Exploration PMM

In the above screenshot, you will notice some pretty big gaps in the data (in green).  These gaps were caused by our query taking longer than the resolution bucket.  You can see when we moved to 60-second resolution (in orange), the graphs filled in.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

Download Percona Monitoring and Management Today

Oct
12
2021
--

Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL

Linux HugePages PostgreSQL

Often users come to us with incidents of database crashes due to OOM Killer. The Out Of Memory killer terminates PostgreSQL processes and remains the top reason for most of the PostgreSQL database crashes reported to us. There could be multiple reasons why a host machine could run out of memory, and the most common problems are:

  1. Poorly tuned memory on the host machine.
  2. A high value of work_mem is specified globally (at instance level). Users often underestimate the multiplying effect for such blanket decisions.
  3. A high number of connections. Users ignore the fact that even a non-active connection can hold a good amount of memory.
  4. Other programs co-hosted on the same machine consuming resources.

Even though we often assist in tuning both host machines and databases, we do not always take the time to explain how and why HugePages are important and justify it with data. Thanks to repeated probing by my friend and colleague Fernando, I couldn’t resist doing so this time.

The Problem

Let me explain the problem with a testable and repeatable case. This might be helpful if anyone wants to test the case in their own way.

Test Environment

The test machine is equipped with 40 CPU cores (80 vCPUs) and 192 GB of installed memory. I don’t want to overload this server with too many connections, so only 80 connections are used for the test. Yes, just 80 connections, which is a very realistic number we could expect in many environments. Transparent HugePages (THP) is disabled. I don’t want to divert the topic by explaining why it is not a good idea to have THP on a database server, but I promise to cover that in another blog post.

In order to have relatively persistent connections, just like the ones from application-side poolers (or even from external connection poolers), PgBouncer is used to keep all 80 connections persistent throughout the tests.  The following is the PgBouncer configuration used:

[databases]
sbtest2 = host=localhost port=5432 dbname=sbtest2

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = /tmp/pgbouncer.log
pidfile = /tmp/pgbouncer.pid
admin_users = postgres
default_pool_size=100
min_pool_size=80
server_lifetime=432000

As we can see, the server_lifetime parameter is set to a high value so that the connections from the pooler to PostgreSQL are not destroyed. The following PostgreSQL parameter modifications are incorporated to mimic some of the common customer environment settings:

logging_collector = 'on'
max_connections = '1000'
work_mem = '32MB'
checkpoint_timeout = '30min'
checkpoint_completion_target = '0.92'
shared_buffers = '138GB'
shared_preload_libraries = 'pg_stat_statements'

The test load is created using sysbench

sysbench /usr/share/sysbench/oltp_point_select.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-port=6432 --pgsql-db=sbtest2 --pgsql-user=postgres --pgsql-password=vagrant --threads=80 --report-interval=1 --tables=100 --table-size=37000000 prepare

and then

sysbench /usr/share/sysbench/oltp_point_select.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-port=6432 --pgsql-db=sbtest2 --pgsql-user=postgres --pgsql-password=vagrant --threads=80 --report-interval=1 --time=86400  --tables=80 --table-size=37000000  run

The prepare stage puts a write load on the server, and the run stage puts a read-only load on it.

I am not attempting to explain the theory and concepts behind HugePages here, but rather to concentrate on the impact analysis. Please refer to the LWN article Five-Level Page Tables and Andres Freund’s blog post Measuring the Memory Overhead of a Postgres Connection for some of the underlying concepts.

Test Observations

During the test, memory consumption was checked using the Linux free utility. When making use of the regular pool of memory pages, consumption started at a really low value but kept steadily increasing (please see the screenshot below). The “available” memory was depleted at a faster rate.

Available Memory

Towards the end, swap activity also started, which is captured in the vmstat output below:

swap activity

Information from /proc/meminfo reveals that the Total Page Table size has grown from the initial 45 MB to more than 25 GB.

This is not just memory wastage; this is a huge overhead impacting the overall execution of the program and operating system. This size is the total of the lower-level page table entries of the 80+ PostgreSQL processes.

The same can be verified by checking each PostgreSQL Process. Following is a sample

So the total page table size (25 GB) should be approximately this value * 80 (connections). Since this synthetic benchmark sends an almost identical workload through all connections, all of the individual processes have values very close to the one captured above.

The following shell line can be used for checking the Pss (Proportional set size). Since PostgreSQL uses Linux shared memory, focusing on Rss won’t be meaningful.

for PID in $(pgrep "postgres|postmaster") ; do awk '/Pss/ {PSS+=$2} END{getline cmd < "/proc/'$PID'/cmdline"; sub("\0", " ", cmd);printf "%.0f --> %s (%s)\n", PSS, cmd, '$PID'}' /proc/$PID/smaps ; done|sort -n

Without Pss information, there is no easy method to understand the memory responsibility per process.

In a typical database system where we have a considerable DML load, the background processes of PostgreSQL, such as the checkpointer, background writer, or autovacuum workers, will be touching more pages in the shared memory. The corresponding Pss will be higher for those processes.

Postgres Process

This should explain why the checkpointer, background writer, or even the postmaster often becomes the usual victim/target of the OOM Killer: as we can see above, they carry the biggest responsibility for the shared memory.

After several hours of execution, the individual sessions had touched more shared memory pages. As a consequence, the per-process Pss values were rearranged: the checkpointer is responsible for less, as other sessions now share the responsibility.

However, checkpointer retains the highest share.

Even though it is not important for this test, it is worth mentioning that this kind of load pattern is specific to synthetic benchmarking, because every session does pretty much the same job. It is not a good approximation of a typical application load, where we usually see the checkpointer and background writer carry the major responsibility.

The Solution: Enable HugePages

The solution for such bloated page tables and associated problems is to make use of HugePages instead. We can figure out how much memory should be allocated to HugePages by checking the VmPeak of the postmaster process. For example, if 4357 is the PID of the postmaster:

grep ^VmPeak /proc/4357/status

This gives the amount of memory required in KB:

VmPeak: 148392404 kB

This much needs to fit into huge pages. Converting this value into 2 MB pages:

postgres=# select 148392404/1024/2;
?column?
----------
    72457
(1 row)

Specify this value for vm.nr_hugepages in /etc/sysctl.conf, for example:

vm.nr_hugepages = 72457

Now shut down the PostgreSQL instance and execute:

sysctl -p

We can then verify whether the requested number of huge pages was created:

grep ^Huge /proc/meminfo
HugePages_Total:   72457
HugePages_Free:    72457
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        148391936 kB

If we start up PostgreSQL at this stage, we can see that HugePages_Rsvd is allocated:

$ grep ^Huge /proc/meminfo
HugePages_Total:   72457
HugePages_Free:    70919
HugePages_Rsvd:    70833
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        148391936 kB

If everything is fine, I prefer to make sure that PostgreSQL always uses HugePages, because I would rather have PostgreSQL fail to start than run into problems or crashes later.

postgres=# ALTER SYSTEM SET huge_pages = on;

The above change needs a restart of the PostgreSQL instance.
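After the restart, a quick sanity check from psql confirms the configured value (this only reports the setting; whether the pages were actually reserved is visible in /proc/meminfo, as shown above):

postgres=# SHOW huge_pages;
 huge_pages
------------
 on
(1 row)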

Tests with HugePages “ON”

The HugePages are created in advance, even before PostgreSQL starts; PostgreSQL just allocates them and uses them. So there won’t even be a noticeable change in the free output before and after the startup. PostgreSQL allocates its shared memory into these HugePages if they are available, and shared_buffers is the biggest occupant of this shared memory.

HugePages PostgreSQL

The first output of free -h in the above screenshot was generated before the PostgreSQL startup and the second one after. As we can see, there is no noticeable change.

I ran the same test for several hours and there wasn’t any change; the only noticeable difference, even after many hours of running, is the shift of “free” memory to filesystem cache, which is expected and what we want to achieve. The total “available” memory remained pretty much constant, as we can see in the following screenshot.

Available Memory PostgreSQL

The total Page Table size remained pretty much the same:

As we can see, the difference is huge: just 61 MB with HugePages instead of 25+ GB previously. Pss per session was also reduced drastically:

Pss per session

The biggest advantage I could observe is that the checkpointer and background writer are no longer accountable for several GB of RAM.

Instead, they are accountable only for a few MB of consumption. Obviously, they won’t be candidate victims for the OOM Killer anymore.

Conclusion

In this blog post, we discussed how Linux Huge Pages can potentially save the database server from OOM Killers and associated crashes. We could see two improvements:

  1. Overall memory consumption was reduced by a big margin. Without the HugePages, the server almost ran out of memory (available memory completely depleted, and swapping activity started). However, once we switched to HugePages, 38-39GB remained as Available/Linux filesystem cache. This is a huge saving.
  2. With HugePages enabled, PostgreSQL background processes are not accountable for a large amount of shared memory, so they won’t easily become candidate victims/targets for the OOM Killer.

These improvements can potentially save the system if it is on the brink of OOM condition, but I don’t want to make any claim that this will protect the database from all OOM conditions forever.

HugePages (hugetlbfs) originally landed in the Linux kernel in 2002 to address the requirements of database systems that need to address a large amount of memory. I could see that the design goals are still valid.

There are other additional indirect benefits of using HugePages:

  1. HugePages never get swapped out. When PostgreSQL shared buffers are in HugePages, it can produce more consistent and predictable performance.  I shall discuss that in another article.
  2. Linux uses a multi-level page lookup method. HugePages are implemented using direct pointers to pages from the middle layer (a 2MB huge page would be found directly at the PMD level, with no intervening PTE page). The address translation becomes considerably simpler. Since this is a high-frequency operation in a database server with a large amount of memory, the gains are multiplied.

Note: The HugePages discussed in this blog post are about fixed size (2MB) huge pages.

Additionally, as a side note, I want to mention that there have been a lot of improvements in Transparent HugePages (THP) over the years, which allow applications to use HugePages without any code modification. THP is often considered a replacement for regular HugePages (hugetlbfs) for a generic workload. However, the use of THP is discouraged on database systems, as it can lead to memory fragmentation and increased delays. I want to cover that topic in another post and will just mention that these are not PostgreSQL-specific problems but affect every database system.  For example:

  1. Oracle recommends disabling THP. Reference link
  2. MongoDB Recommends disabling THP. Reference link
  3. “THP is known to cause performance degradation with PostgreSQL for some users on some Linux versions.” Reference link.

As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Oct
07
2021
--

Using the Range and the New Multirange Data Type in PostgreSQL 14

Multirange Data Type in PostgreSQL 14

Don’t miss the great overview of PostgreSQL 14 that Umair Shahid recently put together, PostgreSQL 14 – Performance, Security, Usability, and Observability!

Range data types have been in PostgreSQL for some time now; however, PostgreSQL 14 introduces multirange data types.  Before we cover those, let’s go over the basics of what the heck kind of voodoo black magic a range data type is and why you would ever want to use one; then we will dive into the multirange variety.

Range data types store a beginning and end value.  The stored data could be an integer or a DateTime (timestamp). Note: if you are interested, you can also create your own custom range types as well.  Why would you use this?  I can think of a few interesting use cases, but the most generic one is keeping track of state changes and durations.  For example:

  • This machine was running from X time to Y time.
  • This room was occupied between X and Y.
  • This sale/price is active within these time frames only.

Range data has been stored for ages without a specific “range” data type.  Let’s show you how we used to do it and how range and multi-range types help us.

For our example, let us assume you want to track the days and times the HOSS wears his Cleveland Browns hat -vs- his elephant hat (practical, right?).

Matt Yonkovit Hats

Using Classic SQL to Interact with Ranges

Here is how you would classically do something like this:

create table matts_hats_1 ( 
id serial primary key, 
matts_hat varchar(100),
start_ts timestamp, 
end_ts timestamp
);

insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Browns Hat', '2021-10-01 6:00', '2021-10-01 10:00');
insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Elephant Hat', '2021-10-01 10:00', '2021-10-01 12:00');
insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Browns Hat', '2021-10-01 14:00', '2021-10-01 20:00');
insert into matts_hats_1 (matts_hat,start_ts,end_ts) values ('Elephant Hat', '2021-10-01 22:00', '2021-10-01 23:00');

You can insert which hat I am wearing and when I start and stop wearing that hat. To see what hat I am wearing at any one time, I would use:

yonk=# select id, matts_hat from matts_hats_1 where start_ts <= '2021-10-01 7:35'::timestamp and end_ts >= '2021-10-01 7:35'::timestamp;
 id | matts_hat  
----+------------
  1 | Browns Hat

To see which hat I am wearing from 7 am to 11 am, I need to make a couple of adjustments.  First, I may have started wearing the hat before 7 am and potentially kept wearing it after 11 am.  To find these, I need to look for start times before 7 am with end times after 7 am, and start times before 11 am with end times after 11 am.  That looks like this:

yonk=# select id, matts_hat from matts_hats_1 where (start_ts <= '2021-10-01 07:00'::timestamp and end_ts >= '2021-10-01 07:00'::timestamp) or (start_ts <= '2021-10-01 11:00'::timestamp and end_ts >= '2021-10-01 11:00'::timestamp);
 id |  matts_hat   
----+--------------
  1 | Browns Hat
  2 | Elephant Hat
(2 rows)

The issue with this approach is that it quickly gets pretty complicated, and it defeats proper index usage.  See what happens as we try to add an index:

yonk=# create index matts_hats_1_idx on matts_hats_1 (start_ts, end_ts);
 CREATE INDEX
 
 yonk=# explain select id, matts_hat from matts_hats_1 where (start_ts <= '2021-10-01 07:00'::timestamp and end_ts >= '2021-10-01 07:00'::timestamp) or (start_ts <= '2021-10-01 11:00'::timestamp and end_ts >= '2021-10-01 11:00'::timestamp);
                                                           QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------
  Seq Scan on matts_hats_1  (cost=0.00..1.08 rows=1 width=222)
    Filter: (((start_ts <= '2021-10-01 07:00:00'::timestamp without time zone) AND (end_ts >= '2021-10-01 07:00:00'::timestamp without time zone)) OR ((start_ts <= '2021-10-01 11:00:00'::timestamp without time zone) AND (end_ts >= '2021-10-01 11:00:00'::timestamp without time zone)))
 (2 rows)

Introduction to Ranges in PostgreSQL

This is where the “range” data types come into play.  Let’s take a look at this data and query using a range.

create table matts_hats_2 ( 
id serial primary key, 
matts_hat varchar(100),
hat_ts tsrange
);

insert into matts_hats_2 (matts_hat,hat_ts) values ('Browns Hat', '[2021-10-01 6:00, 2021-10-01 10:00]');
insert into matts_hats_2 (matts_hat,hat_ts) values ('Elephant Hat', '[2021-10-01 10:00, 2021-10-01 12:00]');
insert into matts_hats_2 (matts_hat,hat_ts) values ('Browns Hat', '[2021-10-01 14:00, 2021-10-01 20:00]');
insert into matts_hats_2 (matts_hat,hat_ts) values ('Elephant Hat', '[2021-10-01 22:00, 2021-10-01 23:00]');

Out of the box, you can create ranges with integers, numerics, timestamps, or dates, and if you need other range types (over a float, for example) you can create custom ones (check the docs for more info).  You can also put constraints on the ranges to prevent data from overlapping or enforce certain rules, but I won’t cover that in depth in this blog.
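For the curious, a minimal sketch of such a constraint on the table above would be an exclusion constraint over the range column.  Once it is in place, PostgreSQL rejects any row whose hat_ts overlaps an existing row; note that the sample data above would itself trip it, because the two ranges meeting at 10:00 both include that boundary point.

alter table matts_hats_2
  add constraint matts_hats_2_no_overlap
  exclude using gist (hat_ts with &&);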

Now if we wanted to find what hat I was wearing at 7:35 am like above, I would find that with the following:

yonk=# select id, matts_hat from matts_hats_2 where hat_ts @> '2021-10-01 7:35'::timestamp;
 id | matts_hat  
----+------------
  1 | Browns Hat

Notice the different operator when checking for the range?  Instead of “=” I am using the “@>” (the containment operator).  PostgreSQL has a set of operators that are used when interacting with ranges. In this case, the @> is checking if the range to the left ( hat_ts ) contains the value ‘2021-10-01 7:35’.  The most common operators which I use in this blog are:

  • @> , <@ – See if an element or range is contained by, or contains, the other value
  • && – Do the ranges overlap
  • + – Creates a union of two ranges
  • - – Removes one range from another

There are other operators as well, so check out the docs for the complete list.

Now to get the hats I was wearing between 7 am and 11 am with a tsrange data type, I would issue the following command:

yonk=# select id, matts_hat from matts_hats_2 where hat_ts && '[2021-10-01 7:00, 2021-10-01 11:00]';
  id |  matts_hat   
 ----+--------------
   1 | Browns Hat
   2 | Elephant Hat

Notice how much cleaner that looks.  In this case, we are checking whether the range 7 am-11 am overlaps with the hat_ts field.  Indexing will also work with range data types via GiST:

CREATE INDEX matts_hats_2_idx ON matts_hats_2 USING GIST (hat_ts);
  
yonk=# explain select id, matts_hat from matts_hats_2 where hat_ts && '["2021-10-01 7:00","2021-10-01 11:00"]';
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Bitmap Heap Scan on matts_hats_2  (cost=4.17..14.51 rows=3 width=15)
   Recheck Cond: (hat_ts && '["2021-10-01 07:00:00","2021-10-01 11:00:00"]'::tsrange)
   ->  Bitmap Index Scan on matts_hats_2_idx  (cost=0.00..4.17 rows=3 width=0)
         Index Cond: (hat_ts && '["2021-10-01 07:00:00","2021-10-01 11:00:00"]'::tsrange)

I can modify the ranges as well, adding or removing time from what is already stored.  Let’s go ahead and remove 30 minutes from my hat-wearing time.

yonk=# update matts_hats_2 set hat_ts = hat_ts - '[2021-10-01 14:00,2021-10-01 14:30]' where id = 3;
UPDATE 1

yonk=# select * from matts_hats_2 where id = 3;
 id | matts_hat  |                    hat_ts                     
----+------------+-----------------------------------------------
  3 | Browns Hat | ("2021-10-01 14:30:00","2021-10-01 20:00:00"]
(1 row)

This method works as long as you are shrinking or extending the range.  However, removing or adding ranges gets tricky (at least before PostgreSQL 14) when removing the middle of a range or adding a non-contiguous hat-wearing time.

yonk=# update matts_hats_2 set hat_ts = hat_ts - '[2021-10-01 15:00,2021-10-01 15:30]' where id = 3;
ERROR:  result of range difference would not be contiguous

Introduction to Multi-Ranges in PostgreSQL 14

Before PostgreSQL 14, you would need to add new records to facilitate having multiple non-contiguous blocks.  However, in PG14 you have the option to use a “Multi Range” type.  Let’s see how this works:

create table matts_hats_3 ( 
id serial primary key, 
matts_hat varchar(100),
hat_date date,
hat_ts tsmultirange
);

insert into matts_hats_3 (matts_hat,hat_date,hat_ts) values ('Browns Hat',   '2021-10-01', '{[2021-10-01 6:00, 2021-10-01 10:00],[2021-10-01 14:00, 2021-10-01 20:00]}');
insert into matts_hats_3 (matts_hat,hat_date,hat_ts) values ('Elephant Hat', '2021-10-01', '{[2021-10-01 10:00, 2021-10-01 12:00],[2021-10-01 22:00, 2021-10-01 23:00]}');

You can see I consolidated all my time wearing each hat into one record for each.  I can run the same queries I used for the range examples here with the same results.  Now, however, I can add or remove additional non-contiguous ranges into that record:

yonk=# select * from matts_hats_3 where id = 1;
 id | matts_hat  |  hat_date  |                                            hat_ts                                             
----+------------+------------+-----------------------------------------------------------------------------------------------
  1 | Browns Hat | 2021-10-01 | {["2021-10-01 06:00:00","2021-10-01 10:00:00"],["2021-10-01 14:00:00","2021-10-01 20:00:00"]}
(1 row)

yonk=# update matts_hats_3 set hat_ts = hat_ts - '{[2021-10-01 15:00,2021-10-01 15:30]}' where id = 1;
UPDATE 1
yonk=# select * from matts_hats_3 where id = 1;
 id | matts_hat  |  hat_date  |                                                                   hat_ts                                                                    
----+------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------
  1 | Browns Hat | 2021-10-01 | {["2021-10-01 06:00:00","2021-10-01 10:00:00"],["2021-10-01 14:00:00","2021-10-01 15:00:00"),("2021-10-01 15:30:00","2021-10-01 20:00:00"]}
(1 row)

Removing a block of time from the middle of the range now creates a third range.  This functionality can be an incredibly efficient and powerful way of storing and seeing when something is active or not.  This can also be useful for auditing and determining what conditions were active during a certain period.
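Adding another non-contiguous block works the same way, just with the union operator instead of the difference operator.  A small sketch against the same row (the extra time block is arbitrary):

yonk=# update matts_hats_3 set hat_ts = hat_ts + '{[2021-10-01 21:00,2021-10-01 21:30]}' where id = 1;
UPDATE 1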

Word of caution:  Just because you can do something does not mean you should.  I have not been able to find any hard limit on the number of distinct ranges you can store in a single field.  In fact, I was able to add over 100K distinct ranges into one field without PostgreSQL complaining.  That said, the performance of modifying or using that many different ranges is potentially very limiting; this was especially telling when adding or splitting ranges.

More 101 on Using Ranges in PostgreSQL

You may have noticed the ranges are bracketed with “[]“; these have a special meaning in PostgreSQL.  They specify whether the range should include the lower/upper bounds value when evaluating the data.   Here is a quick primer:

  • ( – Exclusive lower bound; think of this like a > (it will not include the specified lower value)
  • ) – Exclusive upper bound; think of this like a < (it will not include the specified upper value)
  • [ – Inclusive lower bound; think of this like a >= (it will include the specified lower value)
  • ] – Inclusive upper bound; think of this like a <= (it will include the specified upper value)
  • {} – Reserved for multirange data types

You can visualize this if you load data into a test table:

create table test_range (
   int_range int4range,
   int_mrange int4multirange,
   ts_range tsrange,
   ts_mrange tsmultirange,
   my_desc varchar(100)
 );

insert into test_range values ('[1,10]','{[1,4],[6,10]}','[2021-10-01 10:30,2021-10-01 22:30]','{[2021-10-01 10:30,2021-10-01 12:30],[2021-10-01 16:30,2021-10-01 22:30]}','Inclusive upper/ Inclusive lower');
insert into test_range values ('(1,10]','{(1,4],(6,10]}','(2021-10-01 10:30,2021-10-01 22:30]','{(2021-10-01 10:30,2021-10-01 12:30],(2021-10-01 16:30,2021-10-01 22:30]}','Inclusive upper/ Exclusive lower');
insert into test_range values ('(1,10)','{(1,4),(6,10)}','(2021-10-01 10:30,2021-10-01 22:30)','{(2021-10-01 10:30,2021-10-01 12:30),(2021-10-01 16:30,2021-10-01 22:30)}','Exclusive upper/ Exclusive lower');
insert into test_range values ('[1,10)','{[1,4),[6,10)}','[2021-10-01 10:30,2021-10-01 22:30)','{[2021-10-01 10:30,2021-10-01 12:30),[2021-10-01 16:30,2021-10-01 22:30)}','Exclusive upper/ Inclusive lower');

yonk=#  select * from test_range;
 int_range |   int_mrange   |                   ts_range                    |                                           ts_mrange                                           |             my_desc              
-----------+----------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------+----------------------------------
 [1,11)    | {[1,5),[6,11)} | ["2021-10-01 10:30:00","2021-10-01 22:30:00"] | {["2021-10-01 10:30:00","2021-10-01 12:30:00"],["2021-10-01 16:30:00","2021-10-01 22:30:00"]} | Inclusive upper/ Inclusive lower
 [2,11)    | {[2,5),[7,11)} | ("2021-10-01 10:30:00","2021-10-01 22:30:00"] | {("2021-10-01 10:30:00","2021-10-01 12:30:00"],("2021-10-01 16:30:00","2021-10-01 22:30:00"]} | Inclusive upper/ Exclusive lower
 [2,10)    | {[2,4),[7,10)} | ("2021-10-01 10:30:00","2021-10-01 22:30:00") | {("2021-10-01 10:30:00","2021-10-01 12:30:00"),("2021-10-01 16:30:00","2021-10-01 22:30:00")} | Exclusive upper/ Exclusive lower
 [1,10)    | {[1,4),[6,10)} | ["2021-10-01 10:30:00","2021-10-01 22:30:00") | {["2021-10-01 10:30:00","2021-10-01 12:30:00"),["2021-10-01 16:30:00","2021-10-01 22:30:00")} | Exclusive upper/ Inclusive lower
(4 rows)

Final Thoughts

The range data type was already a unique and powerful feature, and the new multirange type’s ability to store multiple non-contiguous ranges in a single field gives developers yet another tool in their toolbox to get even more out of PostgreSQL.

As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Oct
05
2021
--

Using New JSON Syntax in PostgreSQL 14 – Update

JSON Syntax PostgreSQL 14

As previously mentioned here on the blog, PostgreSQL 14 was just released. Need a primer? Umair Shahid recently put together a great overview, PostgreSQL 14 – Performance, Security, Usability, and Observability.

I recently did a deep dive into using JSON from within PostgreSQL in a two-part blog series titled Storing and Using JSON Within PostgreSQL. While all the information from those two posts is still relevant with the release of PostgreSQL 14 (PG14), we need to provide a quick update to show off some improvements.

Keeping this brief:  Prior to PG14, pulling fields out of JSON looked something like the following:

movie_json_test=# select jsonb_column->>'title' as title,   jsonb_column->>'imdb_rating' as rating,   jsonb_column->>'imdb_id' as imdb_id  from movies_jsonb  where (jsonb_column->>'title')::varchar = 'Avengers: Endgame (2019)';
          title           | rating |  imdb_id  
--------------------------+--------+-----------
 Avengers: Endgame (2019) |        | tt4154796
(1 row)

While this worked, it is not terribly intuitive for developers who may be used to interacting with JSON in code.   Starting in PostgreSQL 14, you can use a much more natural and user-friendly subscript syntax to access data:

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   
----------------------------+--------+-------------
 "Avengers: Endgame (2019)" | null   | "tt4154796"
(1 row)

Not only can you access top-level attributes, you can also access nested arrays and lists:

movie_json_test=# select jsonb_column['title'] as title,   jsonb_column['imdb_rating'] as rating,   jsonb_column['imdb_id'] as imdb_id, jsonb_column['cast'][0]  from movies_jsonb  where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
           title            | rating |   imdb_id   |                                            jsonb_column                                             
----------------------------+--------+-------------+-----------------------------------------------------------------------------------------------------
 "Avengers: Endgame (2019)" | null   | "tt4154796" | {"id": "nm0000375", "name": "Robert Downey Jr.", "character": "Tony Stark /              Iron Man"}
(1 row)
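Subscripts also work on the left-hand side of an assignment, so updates read just as naturally. A quick sketch against the same table (the rating value is made up; the right-hand side must be valid jsonb, so a bare number works here, while string values need the extra double quotes):

movie_json_test=# update movies_jsonb set jsonb_column['imdb_rating'] = '8.4' where jsonb_column['title'] = '"Avengers: Endgame (2019)"';
UPDATE 1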

Overall, this syntax still makes use of all the same indexes, JSON functions, etc., that were previously available.  While this is a relatively minor change, it should make things much more natural for developers and hopefully drive more adoption and usage.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Oct
01
2021
--

Join Community MeetUps for MySQL, PostgreSQL, MongoDB, or PMM

Join Percona Community MeetUps

Percona serves several communities, including MySQL, PostgreSQL, and MongoDB.  The Percona Community Team is committed to promoting content and technical advice on open source database technologies and needs. Hence, we organize monthly MeetUps as a series of regular one-hour live streams every Wednesday at 11 am EST / 5:00 pm CEST / 8:30 pm IST with Matt Yonkovit, the Head of Open Source Strategy (HOSS) at Percona, in collaboration with one or two guests from our engineering teams or external speakers. Every month, we discuss:

  • MySQL on the 1st week
  • PostgreSQL on the 2nd week
  • MongoDB on the 3rd week
  • Monitoring and Observability via Percona Monitoring and Management (PMM) on the 4th week

Past Events

Each MeetUp is a great opportunity for experts to share their experience and for users to ask questions and get an answer right away. Check our community website for full videos and transcripts of past events:

Community MeetUp for MySQL, Sept 8th, 2021

Marcos Albe and Fernando Laudares Camargos

The best ways you REALLY make your database faster: Query optimization, indexes, the art of database design, hiding columns, and more.

Community MeetUp for PostgreSQL, Sept 15th, 2021

Jobin Augustine and Ibrar Ahmed

Discuss the reasons behind the massive migrations to PostgreSQL and learn to secure your data and database.

Community MeetUp for MongoDB, Sept 22nd, 2021

Vinodh Krishnaswamy and Ege Güneş

The most common issues people run into, when (or when not) to shard MongoDB, and guidelines for migrating your MongoDB cluster to Kubernetes.

Community MeetUp for PMM, Sept 29th, 2021

Percona sponsored the Open Source Summit in Seattle, and the MeetUp for PMM was a one-hour live broadcast from our booth to talk about open source databases, observability, and monitoring MySQL, MongoDB, and PostgreSQL with PMM.

Upcoming Events

There are more MeetUps coming up! Each event is either a free online talk or a live broadcast from an open source event, where we look for tough database problems.

Community MeetUp for MySQL, Oct 6th

Wayne Leutwyler and Vaibhav Upadhyay

October 6th at 11 am EST or 5:00 pm CEST / 8:30 pm IST

Community MeetUp for PostgreSQL, Oct 13th

Charly Batista

October 13th at 11 am EST or 5:00 pm CEST / 8:30 pm IST

How to Participate

Users: Attendance is absolutely free; join us on YouTube and Twitch with live chat on Discord to ask questions. Check out new events at the Percona Community page! Add events to your Google Calendar to track all upcoming Community MeetUps for MySQL, PostgreSQL, MongoDB, and PMM.

Experts/Speakers: A Community MeetUp is more of a free discussion than an official lecture, and it is very easy to contribute as a speaker. We try to make it convenient by saving you time and effort, and speakers receive the gratitude of the Community Team. All you need to do is contact the Community Team at community-team@percona.com.

 

As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Sep
30
2021
--

PostgreSQL 14 – Performance, Security, Usability, and Observability

Percona PostgreSQL 14

Today saw the launch of PostgreSQL 14. For me, the important area to focus on is what the community has done to improve performance for heavy transactional workloads, along with the additional support for distributed data.

The amount of data that enterprises have to process continues to grow exponentially. Being able to scale up and out to support these workloads on PostgreSQL 14 through better handling of concurrent connections and additional features for query parallelism makes a lot of sense for performance, while the expansion of logical replication will also help.

Performance Enhancements

Server Side Enhancements

Reducing B-tree index bloat

Frequently updated indexes tend to have dead tuples that cause index bloat. Typically, these tuples are removed only when a vacuum is run. Between vacuums, as the page gets filled up, an update or insert will cause a page split – something that is not reversible. This split would occur even though dead tuples inside the existing page could have been removed, making room for additional tuples.

PostgreSQL 14 implements the improvement where dead tuples are detected and removed even between vacuums, allowing for a reduced number of page splits thereby reducing index bloat.

Eagerly delete B-tree pages

To reduce overhead from B-trees, the vacuum system has been enhanced to eagerly remove deleted pages. Previously, it took two vacuum cycles to do so, with the first one marking the page as deleted and the second one actually freeing up that space.

Enhancements for Specialized Use Cases

Pipeline mode for libpq

High latency connections with frequent write operations can slow down client performance as libpq waits for each transaction to be successful before sending the next one. With PostgreSQL 14, ‘pipeline mode’ has been introduced to libpq allowing the client to send multiple transactions at the same time, potentially giving a tremendous boost to performance. What’s more – because this is a client-side feature, PostgreSQL 14’s libpq can even be used with older versions of the PostgreSQL server.

Replicating in-progress transactions

Logical replication has been expanded to allow streaming in-progress transactions to subscribers. Previously, large transactions were written to disk until the transaction completed before being replicated to the subscriber. By allowing in-progress transactions to be streamed, users gain significant performance benefits along with more confidence in their distributed workloads.
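On the subscriber side this is opt-in via the new streaming option. A minimal sketch (the connection string, publication, and subscription names are placeholders):

CREATE SUBSCRIPTION big_tx_sub
    CONNECTION 'host=primary.example.com dbname=appdb user=replicator'
    PUBLICATION app_pub
    WITH (streaming = on);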

Add LZ4 compression to TOAST

PostgreSQL 14 adds support for LZ4 compression for TOAST, a system used to efficiently store large data. LZ4 is a lossless compression algorithm that focuses on the speed of compression and decompression. LZ4 can be configured at the column as well as the system level. Previously, the only option was pglz compression – which is fast but has a small compression ratio.
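In SQL this surfaces as a per-column COMPRESSION clause plus a new default_toast_compression setting. A small sketch (the table and column names are made up, and the server must be built with LZ4 support):

-- Use LZ4 for a single column...
CREATE TABLE documents (
    id   bigserial PRIMARY KEY,
    body text COMPRESSION lz4
);

-- ...or make it the default for newly created columns.
SET default_toast_compression = 'lz4';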

Enhancements for Distributed Workloads

The PostgreSQL foreign data wrapper – postgres_fdw – has been instrumental in easing the burdens of handling distributed workloads. With PostgreSQL 14, there are two major improvements in the FDW for such transactions: query parallelism can now be leveraged for parallel table scans on foreign tables, and bulk inserts of data are now allowed on foreign tables.

Both these improvements further cement the increasing ability of PostgreSQL to scale horizontally and natively handle distributed databases.

Improvement to SQL parallelism

PostgreSQL’s support for query parallelism allows the system to engage multiple CPU cores in multiple threads to execute queries in parallel, thereby drastically improving performance. PostgreSQL 14 brings more refinement to this system by adding support for RETURN QUERY and REFRESH MATERIALIZED VIEW to execute queries in parallel. Improvements have also been rolled out to the performance of parallel sequential scans and nested loop joins.

Security

SCRAM as default authentication

SCRAM-SHA-256 authentication was introduced in PostgreSQL 10 and has now been made the default in PostgreSQL 14. The previous default MD5 authentication has had some weaknesses that have been exploited in the past. SCRAM is much more powerful, and it allows for easier regulatory compliance for data security.
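If you are upgrading an existing cluster, it is worth confirming the setting and re-setting old passwords so the stored verifiers actually use SCRAM rather than MD5. A quick sketch (app_user is a placeholder role):

-- New PostgreSQL 14 clusters default to scram-sha-256.
SHOW password_encryption;

-- Re-set the password so the stored verifier is SCRAM instead of MD5.
ALTER ROLE app_user PASSWORD 'a-new-strong-password';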

Predefined roles

Two predefined roles have been added in PostgreSQL 14 – pg_read_all_data and pg_write_all_data. The former makes it convenient to grant read-only access for a user to all tables, views, and schemas in the database. This role will have read access by default to any new tables that are created. The latter makes it convenient to create super-user-styled privileges, which means that one needs to be quite careful when using it!
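Granting these works like any other role membership. A minimal sketch with placeholder role names:

-- Read-only access to all tables, views, and sequences in the database.
GRANT pg_read_all_data TO reporting_user;

-- Write access to all data; handle with care.
GRANT pg_write_all_data TO etl_user;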

Convenience for Application Developers

Access JSON using subscripts

From an application development perspective, I have always found support for JSON in PostgreSQL very interesting. PostgreSQL has supported this unstructured data form since version 9.2, but it has had a unique syntax for retrieving data. In version 14, support for subscripts has been added, making it easier for developers to retrieve JSON data using a commonly recognized syntax.

Multirange types

PostgreSQL has had Range types since version 9.2. PostgreSQL 14 now introduces ‘multirange’ support which allows for non-contiguous ranges, helping developers write simpler queries for complex sequences. A simple example of practical use would be specifying the ranges of time a meeting room is booked through the day.

OUT parameters in stored procedures

Stored procedures were added in PostgreSQL 11, giving developers transactional control in a block of code. PostgreSQL 14 implements OUT parameters, allowing developers to return data using multiple parameters in their stored procedures. This feature will be familiar to Oracle developers and is a welcome addition for folks trying to migrate from Oracle to PostgreSQL.
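A small sketch of what this looks like (order_items is a hypothetical table; note that a plain SQL CALL still needs placeholder arguments for the OUT parameters):

CREATE PROCEDURE order_totals(IN  p_order_id  bigint,
                              OUT item_count  bigint,
                              OUT order_total numeric)
LANGUAGE plpgsql
AS $$
BEGIN
    SELECT count(*), sum(price)
      INTO item_count, order_total
      FROM order_items
     WHERE order_id = p_order_id;
END;
$$;

-- From psql, pass NULL for each OUT parameter; the procedure fills them in.
CALL order_totals(1001, NULL, NULL);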

Observability

Observability is one of the biggest buzzwords of 2021, as developers want more insight into how their applications are performing over time. PostgreSQL 14 adds more features to help with monitoring, with one of the biggest changes being the move of the query hash system from pg_stat_statements to the core database. This allows a query to be tracked using a single ID across several PostgreSQL subsystems and logging functions. This version also adds new features to track the progress of COPY, WAL activity, and replication slot statistics.
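One practical consequence is that the same query identifier now shows up in pg_stat_activity and in log lines, not just in pg_stat_statements. A quick sketch (compute_query_id must be enabled, which happens automatically when pg_stat_statements is loaded):

-- Correlate live activity with pg_stat_statements via query_id.
SELECT pid, query_id, state, left(query, 60) AS query
  FROM pg_stat_activity
 WHERE query_id IS NOT NULL;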

The above is a small subset of the more than 200 features and enhancements that make up the PostgreSQL 14 release. Taken altogether, this version update will help PostgreSQL continue to lead the open source database sector.

As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Sep
27
2021
--

Percona and PostgreSQL: Better Together

Percona PostgreSQL

The future of technology adoption belongs to developers.

The future of application deployment belongs in the cloud.

The future of databases belongs to open source.

That’s essentially the summary of Percona’s innovation strategy and that’s the reason I decided to join Percona – I believe the company can provide me with the perfect platform for my vision of the future of PostgreSQL.

So What Does That Mean, Really?

Gone are the days when deals were closed by executives on a golf course. That was Sales-Led Growth. The days of inbound leads based on content are also numbered. That is Marketing-Led Growth.

We are living in an age where tech decision-making is increasingly bottom-up. Where end-users get to decide which technology will suit their needs best. Where the only way to ensure a future for your PostgreSQL offerings is to delight the end users – the developers.

That’s Product-Led Growth.

PostgreSQL has been rapidly gaining in popularity. The Stack Overflow Developer Survey ranked it as the most wanted database, and DB-Engines declared it the DBMS of the Year 2020. DB-Engines shows steady growth in the popularity of PostgreSQL, outpacing any other database out there. It is ACID compliant, secure, fast, reliable, with a liberal license, and backed by a vibrant community.

PostgreSQL is the World’s Most Advanced Open Source Relational Database.

Being a leader in open source databases, I believe Percona is well-positioned to drive a Product Led Growth strategy for PostgreSQL. We want to offer PostgreSQL that is freely available, fully supported, and certified. We want to offer you the flexibility of using it yourself, using it with our help, or letting us completely handle your database.

Our plan is to focus on the end-users, make it easy for developers to build applications using our technology, delight with our user experience, and deliver value with our expertise to all businesses – large or small.

Data is exciting. Data has the ability to tell stories. Data is here to stay. 

Data is increasingly stored in open source databases. 

PostgreSQL is the most wanted database by developers. 

That is why I believe: Percona and PostgreSQL – Better Together.

In our latest white paper, we explore how customers can optimize their PostgreSQL databases, tune them for performance, and reduce complexity while achieving the functionality and security your business needs to succeed.

Download to Learn Why Percona and PostgreSQL are Better Together

Sep
08
2021
--

Real-time database platform SingleStore raises $80M more, now at a $940M valuation

Organizations are swimming in data these days, and so solutions to help manage and use that data in more efficient ways will continue to see a lot of attention and business. In the latest development, SingleStore — which provides a platform to enterprises to help them integrate, monitor and query their data as a single entity, regardless of whether that data is stored in multiple repositories — is announcing another $80 million in funding, money that it will be using to continue investing in its platform, hiring more talent and overall business expansion. Sources close to the company tell us that the company’s valuation has grown to $940 million.

The round, a Series F, is being led by Insight Partners, with new investor Hewlett Packard Enterprise, and previous backers Khosla Ventures, Dell Technologies Capital, Rev IV, Glynn Capital and GV (formerly Google Ventures) also participating. The startup has to date raised $264 million, including most recently an $80 million Series E last December, just on the heels of rebranding from MemSQL.

The fact that there are three major strategic investors in this Series F — HPE, Dell and Google — may say something about the traction that SingleStore is seeing, but so too do its numbers: 300%+ increase in new customer acquisition for its cloud service and 150%+ year-over-year growth in cloud.

Raj Verma, SingleStore’s CEO, said in an interview that its cloud revenues have grown by 150% year over year and now account for some 40% of all revenues (up from 10% a year ago). New customer numbers, meanwhile, have grown by over 300%.

“The flywheel is now turning around,” Verma said. “We didn’t need this money. We’ve barely touched our Series E. But I think there has been a general sentiment among our board and management that we are now ready for the prime time. We think SingleStore is one of the best-kept secrets in the database market. Now we want to aggressively be an option for people looking for a platform for intensive data applications or if they want to consolidate databases to one from three, five or seven repositories. We are where the world is going: real-time insights.”

Database management, and the need for more efficient and cost-effective tools to handle it, is an ever-growing priority, one that got a definite fillip in the last 18 months as COVID-19 pushed people into more remote working environments. That also means SingleStore is not without competitors, with others in the same space including Amazon, Microsoft, Snowflake, PostgreSQL, MySQL, Redis and more. Others like Firebolt are tackling the challenges of handling large, disparate data repositories from another angle. (Some of these, I should point out, are also partners: SingleStore works with data stored on AWS, Microsoft Azure, Google Cloud Platform and Red Hat, and Verma describes those who do compute work as “not database companies; they are using their database capabilities for consumption for cloud compute.”)

But the company has carved a place for itself with enterprises and has thousands now on its books, including GE, IEX Cloud, Go Guardian, Palo Alto Networks, EOG Resources and SiriusXM + Pandora.

“SingleStore’s first-of-a-kind cloud database is unmatched in speed, scale, and simplicity by anything in the market,” said Lonne Jaffe, managing director at Insight Partners, in a statement. “SingleStore’s differentiated technology allows customers to unify real-time transactions and analytics in a single database.” Vinod Khosla from Khosla Ventures added that “SingleStore is able to reduce data sprawl, run anywhere, and run faster with a single database, replacing legacy databases with the modern cloud.”

Sep
08
2021
--

Storing and Using JSON Within PostgreSQL Part Two

Storing and Using JSON Within PostgreSQL 2

We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper into some things we may want to use regularly.  Most of the functionality we talked about in part one works well when you have a simple JSON format.  However, in real life, our documents may get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the  following:

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}'

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."'

However, what if we also need to pull out the character from the movie?  To get a full list of actors and the characters they played in this particular movie, we can use jsonb_to_recordset (similar to MySQL’s json_table function, which we covered in the MySQL part of this series).

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character     
-----------+--------------------------------+-------------+-----------+----------------------+------------------
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10; 
ERROR:  cannot call jsonb_to_recordset on a non-array

Looking at the results, you can see that the function expects an array, and several of our movies have no cast (or a null value in the JSON).

movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof 
-----------+--------------
 tt7620156 | 
 tt0109524 | array
 tt0534466 | array
 tt0111091 | 
 tt4167726 | array
 tt0638383 | array
 tt6346314 | array
 tt5877808 | array
 tt4098782 | 
 tt0365100 | array
(10 rows)

You can work around this “Null” issue in a couple of different ways.  The easiest is by converting to text, then back into JSON.  For example:

select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in part one how I harped on types and on casting to the correct data type when needed?  This is another example.  Here, the ->> operator returns the JSON null as a SQL NULL, and casting that back to jsonb hands jsonb_to_recordset a NULL rather than a scalar JSON null, so no error is raised.
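
A tiny illustration of the distinction (no table required; the literal below is a hypothetical one-key document):

select jsonb_typeof('{"cast": null}'::jsonb -> 'cast')  as arrow_result,      -- 'null' (a JSON null)
       ('{"cast": null}'::jsonb ->> 'cast') is null     as double_arrow_null; -- true   (a SQL NULL)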

While this is the easiest way, let’s look at a few other ways to work around the problem, which will also highlight some of the functions, indexes, etc., we learned about in part one.  Because we know some cast entries hold a JSON null, we can check for and filter out those rows.  For example, here is a simple check that jsonb_column->'cast' is not the JSON null:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
 tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
 tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
 tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
 tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
 tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
 
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
 Planning Time: 0.064 ms
 Execution Time: 84.692 ms
(10 rows)

This is not terribly fast, but it does work; we are essentially working around the behavior built into the JSON functions.  To speed the query up we could, of course, index this column, but some of the cast data is simply too large to fit in an index entry:

movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191

Of course, we can work around this as well.  But sometimes, it’s best to pivot.  Let’s look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we could create either an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored; 
ALTER TABLE


movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
 tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
 tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
 tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
 tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
 tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)


movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
 Planning Time: 0.068 ms
 Execution Time: 30.572 ms
(10 rows)
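
As a reference point for the expression-index alternative mentioned above, here is a sketch (the index name is my own) of a partial index that only contains rows whose cast is a real array; whether the planner chooses it over a sequential scan will depend on how selective the predicate is:

create index movies_cast_present_idx on movies_json_generated (imdb_id)
  where (jsonb_column->'cast') is not null
    and (jsonb_column->'cast') != 'null'::jsonb;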

You can see there are several options for dealing with the nulls, some far easier (and cleaner) than others.  I bring this up to highlight some of the challenges that come with storing unstructured data inside a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that what you store can evolve over time.  Let’s use the above example.  Say that for years every movie fed into your system came with a full cast listing of actors and characters.  Then, one day, the feed you get your data from starts allowing movies without a cast listing.  Your application will still work, and your load scripts will still work.  But every once in a while a user will hit a strange error, or an error will flash by in your logs (if you are logging these).  99.9% of queries are fine, and it is these transient issues that drive people bonkers.

Looking at this problem slightly differently, what if you start adding data or changing the order of certain items?  Back to our original JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"

What if we wanted to add another piece of data to the cast records? Let’s say we want to add a thumbs-up and thumbs-down vote count to each character, so something like this:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100,
      "character": "Bruce Banner"
    },

Now we run into another challenge, although it is a subtler one than it first appears.  jsonb_to_recordset matches the keys in each JSON object to the column names you list in the AS clause (id, name, character), so new keys with no matching column are simply ignored; adding these fields in the middle of the object will not push Bruce Banner out of the character column.  What will bite you is renaming a key (say, character becomes role) or changing a value to a type that no longer casts cleanly: the column quietly starts coming back NULL, or the query errors, until you update every recordset definition in your application.  If you do evolve the documents, it is still easiest to keep the new fields grouped at the end so the records stay readable and consistent:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100
    },

MySQL handles the same problem by having you define explicit paths in its equivalent function: json_table(json_column, '$.cast[*]' columns( V_name varchar(200) path '$.name', V_character varchar(200) path '$.character')).  That lets you state exactly which fields you want and where they live, although MySQL’s json_table has its own limitations.  Either way there is a trade-off: the structure and key names of your JSON matter greatly if you are going to use built-in database functions to query this data.
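
If you would rather not maintain a column definition list at all, one PostgreSQL-side option is to expand the array with jsonb_array_elements and pull fields out by key, so extra or reordered keys in each cast entry are simply ignored.  A sketch using the same movie as before (the JSON-null caveat from earlier still applies if you drop the imdb_id filter):

select imdb_id,
       c->>'id'        as id,
       c->>'name'      as name,
       c->>'character' as "character"
  from movies_json_generated, jsonb_array_elements(jsonb_column->'cast') as c
 where imdb_id = 'tt2395427';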

To JSON or Not?  Database Design

While PostgreSQL’s JSON features are pretty deep (there are dozens of other functions I did not show), using them brings many of the same challenges you see with any loosely designed schema.  JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool for adding flexibility.  If you treat PostgreSQL, or any database, simply as a dumping ground for data, bad things happen.  Instead, think about how JSON columns can augment your current database designs, not replace them.

Let me give you a simple example of the dangers of forgoing database design in favor of just dumping in JSON.  We have already highlighted some of the complexity and setup challenges of the different ways to store JSON in PostgreSQL, so I wanted to take this a step further and compare different database designs and their impact on performance and scalability.  For this, I built three separate designs and schemas housing the same data:

A.)  Minimal Database Design -> single table, single column.  “The dump.”  With only a GIN index.

B.)  Single Table -> Generated columns and Expression indexes where needed.

C.)  Old School Normalized Tables, with JSON stored for easy use

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

 Query                                            | A.) No GIN IDX | A.) With GIN IDX | B.) Generated Columns | B.) Expression Indexes | C.) Normalized
--------------------------------------------------+----------------+------------------+-----------------------+------------------------+---------------
 Simple query for movie title (random 10 titles)  | 800ms          | 0.3ms            | 0.2ms                 | 0.2ms                  | 0.2ms
 Select top 10 movies                              | 841ms          | 831ms            | 0.9ms                 | 0.9ms                  | 0.3ms
 Select all movies for an actor                    | 1009.0ms       | 228ms            | 5588.0ms(2)           | 5588.0ms(2)            | 0.8ms
 Count movies for a given actor                    | 5071ms         | 5075ms           | 5544ms                | NA                     | 1ms
  1. Each test was repeated 100 times, and the average results were listed.  Min/Max is available as well. 
  2. The nested JSON for our cast info was not easily indexed; there are things we could do to mitigate this, but they are more complicated.

Interestingly, there are a few cases where using only generated columns/expression indexes causes problems, generally because of nested data within the JSON.  We could use some tricks, like adding a GIN index and querying the JSON to find the element, or creating some funky nested generated column with an expression index, but I left those out for now.  I could also go deeper into the tests (inserts, updates, and, of course, assembling the JSON from the normalized tables), but I didn’t.  My goal here is simply to make you think about your design.  Choosing one way to implement and use JSON may be fine for certain use cases, but you may later find yourself querying or aggregating the data in ways that break.

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways; jsonb_to_recordset is the easiest way I found to get the data I need.  However, this function and its relatives are very sensitive to the structure and data types within your documents (data types matter!).
  • JSON whose format changes over time (elements added, renamed, or removed) can make some functions difficult, if not impossible, to use.
  • JSON within a well-built, designed application and database can be wonderful and offer many great benefits.  JSON just dumped into databases won’t scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!

Sep
03
2021
--

Installing PostgreSQL using Docker

install postgresql with docker

Following the series of blog posts that Peter Zaitsev started with Installing MySQL with Docker, on deploying Docker containers running open source databases, in this article I’ll demonstrate how to install PostgreSQL using Docker.

Before proceeding, it is important to remind you of Peter’s warning from his article, which applies here as well: 

“The following instructions are designed to get a test instance running quickly and easily; you do not want to use these for production deployments.” 

We intend to show how to bring up PostgreSQL instances in Docker containers.  You will also find some basic Docker management operations and example snippets, all based on the official PostgreSQL image from Docker Hub.

Installing

Starting with the installation, the following example shows how to bring up a Docker container running the latest PostgreSQL release from Docker Hub. You must provide a password for the postgres user when launching the container; otherwise, the database will not be initialized and the container will exit with an error:

$ docker run --name dockerPostgreSQL \
-e POSTGRES_PASSWORD=secret \
-d postgres:latest

Once you start the container, you can check if it successfully started with docker inspect:

$ docker inspect -f '{{.State.Running}}' dockerPostgreSQL
true

If not, you can try starting it manually (you will see how to do that in the Managing the PostgreSQL Containers section below).
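
You can also check the container logs to see why it failed to start:

$ docker logs dockerPostgreSQL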

Getting Connected to the Containerized Instance

The snippet below will show you how to connect to your PostgreSQL containerized instance.

  • Directly from docker:
$ docker exec -it dockerPostgreSQL psql --user postgres
postgres=# \conninfo

You are connected to database "postgres" as user "postgres" via socket in "/var/run/postgresql" at port "5432".

  • Or from a PostgreSQL client connecting to its IP address, which you can obtain with the command docker inspect:
$ docker inspect -f '{{.NetworkSettings.IPAddress}}' dockerPostgreSQL
172.16.0.12

$ psql --user postgres --port 5432 --host 172.16.0.12
postgres=# \conninfo
You are connected to database "postgres" as user "postgres" on host "172.16.0.12" at port "5432".

Managing the PostgreSQL Containers

The command below will show you how to list all active containers and their respective statuses:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                     NAMES
7f88656c4864        postgres:11         "docker-entrypoint..."   4 minutes ago       Up 4 minutes        0.0.0.0:50432->5432/tcp   dockerPostgreSQL11
8aba8609dabc        postgres:11.5       "docker-entrypoint..."   21 minutes ago      Up 21 minutes       5432/tcp                  dockerPostgreSQL115
15b9e0b789dd        postgres:latest     "docker-entrypoint..."   32 minutes ago      Up 32 minutes       5432/tcp                  dockerPostgreSQL

To stop one of the containers, do:

$ docker stop dockerPostgreSQL

To start a container:

$ docker start dockerPostgreSQL

Finally, to remove a container, first stop it and then:

$ docker rm dockerPostgreSQL
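
One note on data: the official image keeps its data directory at /var/lib/postgresql/data inside the container. If you want the data to survive container removal and be easy to reuse, mount a named volume over it (pgdata below is an arbitrary name), for example:

$ docker run --name dockerPostgreSQL \
-e POSTGRES_PASSWORD=secret \
-v pgdata:/var/lib/postgresql/data \
-d postgres:latest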

Customizing PostgreSQL settings in the container

Let’s say you want to define different parameters for the PostgreSQL server running on Docker. You can create a custom postgresql.conf configuration file, mount it into the container, and point the server at it with a config_file argument (mounting the file alone is not enough; the server keeps using the default file under its data directory unless told otherwise):

$ docker run --name dockerPostgreSQL11 -p 50433:5432 -v "$PWD/my-postgres.conf":/etc/postgresql/postgresql.conf -e POSTGRES_PASSWORD=secret -d postgres:11 -c 'config_file=/etc/postgresql/postgresql.conf'

Another example is using an environment variable to change the directory that stores the WAL files:

$ docker run --name dockerPostgreSQL11 -p 50433:5432 -v "$PWD/my-postgres.conf":/etc/postgresql/postgresql.conf -e POSTGRES_PASSWORD=secret -e POSTGRES_INITDB_WALDIR=/backup/wal -d postgres:11 -c 'config_file=/etc/postgresql/postgresql.conf'

Running Different PostgreSQL Versions in Different Containers

You can launch multiple PostgreSQL containers, each running a different PostgreSQL version. The example below shows how to start a PostgreSQL 10.5 container:

$ docker run --name dockerPostgreSQL105 -p 50433:5432 -e POSTGRES_PASSWORD=secret -d postgres:10.5
7f51187d32f339688c2f450ecfda6b7552e21a93c52f365e75d36238f5905017

Here is how you could launch a PostgreSQL 11 container:

$ docker run --name dockerPostgreSQL11 -p 50432:5432 -e POSTGRES_PASSWORD=secret -d postgres:11
7f88656c4864864953a5491705ac7ca882882df618623b1e5664eabefb662733

After that, by reissuing the status command, you will see all of the containers created so far:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                     NAMES
e06e637ae090        postgres:10.5       "docker-entrypoint..."   4 seconds ago       Up 4 seconds        0.0.0.0:50433->5432/tcp   dockerPostgreSQL105
7f88656c4864        postgres:11         "docker-entrypoint..."   About an hour ago   Up About an hour    0.0.0.0:50432->5432/tcp   dockerPostgreSQL11
15b9e0b789dd        postgres:latest     "docker-entrypoint..."   2 hours ago         Up 2 hours          5432/tcp                  dockerPostgreSQL

You will notice that the 10.5 and 11 containers were created with different port mappings. Inside each container PostgreSQL listens on 5432, but the containers are mapped to different external ports, so you can reach all of them from the host.
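
Because those ports are published on the host, you can also connect from the host through the mapped port instead of the container IP, for example:

$ psql --host 127.0.0.1 --port 50433 --user postgres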

You will find even more details on the official postgres page on Docker Hub.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!
