An Illustration of PostgreSQL Bloat

I have been working with Postgres now for many years, most recently as a consultant at Percona for companies with new implementations of Postgres as a result of migrating from Oracle or some other legacy database engine. In some cases, these are Fortune 100 companies with many talented people working for them. However, not all databases work the same and one of the most common observations I make when reviewing a Postgres environment for a client is the amount of table bloat, index bloat, and lack of understanding of its impact on performance – and how to address it.

I wrote a blog about this topic a few years ago and never gave it much thought after its publication. However, with the large number of companies moving to Postgres for obvious reasons, and the lack of true Postgres database management skills needed to support fairly large databases, I thought I would rewrite this blog and bring it back to life with some clarity to help one understand bloat and why it happens.

What causes bloat?

In PostgreSQL, the culprit is Multi-Version Concurrency Control, commonly referred to as MVCC.

MVCC ensures that a transaction against a database will return only data that’s been committed, in a snapshot, even if other processes are trying to modify that data.

Imagine a database with millions of rows in a table. Anytime you update or delete a row, Postgres has to keep track of that row based on a transaction ID. For example, you may be running a long query with transaction ID 100 while someone named John just updated the same table using transaction ID 101. At this point, since you are still inside transaction 100, which is older than 101, the changes made by John in transaction 101 are not relevant or visible to your query. You are in your own personal bubble of the data as it was before it changed. Any new query from you or anyone else with a transaction ID greater than 101 will see the changes made by John in transaction 101. Once no transactions with IDs at or below 101 remain in play, the row version you saw in transaction 100 will no longer be needed by the database and will be considered dead, but not gone. Hence, bloat!

At a high level, vacuuming is used to free up dead rows in a table so they can be reused. It also helps you avoid transaction ID wraparound.
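Before we dig into the mechanics, note that you can see how much dead-row cleanup a table is waiting on without any extensions, via the cumulative statistics views (a quick sketch; run it in any database):

```sql
-- Live vs. dead row estimates, and when vacuum last ran, per user table.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC;
```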

Let’s go through a few steps to illustrate how all of this takes place.

In order for Postgres to know which transaction data should be in the result set of your query, the snapshot makes note of transaction information.

Essentially, if your transaction ID is 100, you will only see data from all transaction IDs leading up to 100. As stated above, you will not see data from transaction ID 101 or greater.
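You can watch the transaction counter advance yourself. A small sketch (on versions before PostgreSQL 13, the equivalent function is txid_current()):

```sql
-- Each statement here runs in its own implicit transaction,
-- so each call is assigned a new, higher transaction ID.
SELECT pg_current_xact_id();
SELECT pg_current_xact_id();  -- a larger value than the first call
```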

Setting up an example

Let’s start by creating a simple table for our example, called percona:

percona=# CREATE TABLE percona ( col1 int );

percona=# INSERT INTO percona values (1);

percona=# INSERT INTO percona values (2);

percona=# INSERT INTO percona values (3);

percona=# INSERT INTO percona values (4);

percona=# INSERT INTO percona values (5);

You can wrap multiple inserts into a single transaction with BEGIN and COMMIT:

percona=# BEGIN;

percona=*# INSERT INTO percona SELECT generate_series(6,10);

percona=*# COMMIT;

Here we can see the 10 rows we inserted into the table, along with some hidden system columns:

percona=# SELECT xmin, xmax, * FROM percona;
   xmin   | xmax | col1
 69099597 |    0 |    1
 69099609 |    0 |    2
 69099627 |    0 |    3
 69099655 |    0 |    4
 69099662 |    0 |    5
 69099778 |    0 |    6
 69099778 |    0 |    7
 69099778 |    0 |    8
 69099778 |    0 |    9
 69099778 |    0 |   10
(10 rows)

As you can see, values one through five (in the col1 column) have unique transaction IDs (represented in the xmin column) because they were the result of individual INSERT statements, made one after the other. The rows with values of six through 10 share the same transaction ID of 69099778; they were all part of the one transaction we created with the BEGIN and COMMIT statements.

At this point, you may be asking yourself what this has to do with vacuum or autovacuum. We will get there. First, you need to understand the transaction ID logic, and seeing it laid out as above makes the rest much easier to follow.

How does the table bloat?

In Postgres, the heap is a file containing a list of variable-sized records, in no particular order. Each row’s location is identified by the page it sits in and its position within that page (a Postgres page is 8kB in size); this (page, item) pointer to the row’s location is called the CTID.

To view the heap without having to read the raw data from the file, we need to create the following extension inside of our database:

CREATE EXTENSION pageinspect;

Now we can inspect the heap for our newly created table and rows:

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));

 lp | lp_off | lp_flags | lp_len |  t_xmin  | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
  1 |   8160 |        1 |     28 | 69099597 |      0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |      0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |      0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |      0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |   8032 |        1 |     28 | 69099662 |      0 |        0 | (0,5)  |           1 |       2304 |     24 |        |       | \x05000000
  6 |   8000 |        1 |     28 | 69099778 |      0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   7968 |        1 |     28 | 69099778 |      0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7936 |        1 |     28 | 69099778 |      0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7904 |        1 |     28 | 69099778 |      0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7872 |        1 |     28 | 69099778 |      0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
(10 rows)

The table above shows 10 entries with a few columns:

  • lp is the ID of the row/tuple
  • t_xmin is the transaction ID
  • t_ctid is the pointer
  • t_data is the actual data

Currently, the pointer for each row is pointing to itself as determined by the form (page,tupleid). Pretty straightforward.
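You don’t actually need pageinspect just to see the pointers; ctid, xmin, and xmax are hidden system columns that can be selected from any table:

```sql
-- The hidden system columns are available in a plain query.
SELECT ctid, xmin, xmax, col1 FROM percona;
-- At this point each row's ctid, (0,1) through (0,10), points to itself,
-- matching the heap dump above.
```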

Now, let’s perform a few updates on a specific row. Let’s change the value of five to 20, then to 30, and finally back to five.

percona=# UPDATE percona SET col1 = 20 WHERE col1 = 5;

percona=# UPDATE percona SET col1 = 30 WHERE col1 = 20;

percona=# UPDATE percona SET col1 = 5 WHERE col1 = 30;

These three changes took place under three different transactions.

What does this mean?  We changed the values for a column three times but never added or deleted any rows. So we should still have 10 rows, right?

percona=# SELECT COUNT(*) FROM percona;
 count
    10
(1 row)

Looks as expected. But wait! Let’s look at the heap now. The real data on disk.

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));

 lp | lp_off | lp_flags | lp_len |  t_xmin  |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
  1 |   8160 |        1 |     28 | 69099597 |        0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |        0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |        0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |        0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |   8032 |        1 |     28 | 69099662 | 69103876 |        0 | (0,11) |       16385 |       1280 |     24 |        |       | \x05000000
  6 |   8000 |        1 |     28 | 69099778 |        0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   7968 |        1 |     28 | 69099778 |        0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7936 |        1 |     28 | 69099778 |        0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7904 |        1 |     28 | 69099778 |        0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7872 |        1 |     28 | 69099778 |        0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
 11 |   7840 |        1 |     28 | 69103876 | 69103916 |        0 | (0,12) |       49153 |       9472 |     24 |        |       | \x14000000
 12 |   7808 |        1 |     28 | 69103916 | 69103962 |        0 | (0,13) |       49153 |       9472 |     24 |        |       | \x1e000000
 13 |   7776 |        1 |     28 | 69103962 |        0 |        0 | (0,13) |       32769 |      10496 |     24 |        |       | \x05000000
(13 rows)

We have 13 rows, not 10. What the heck just happened?

Let’s examine our three separate update transactions (69103876, 69103916, 69103962) to see what’s happening with the heap:

t_xmin (69103876)

  • UPDATE percona SET col1 = 20 WHERE col1 = 5;
  • Logically DELETE tuple ID 5
  • Physically INSERT tuple ID 11
  • UPDATE tuple ID 5 pointer (t_ctid) to point to tuple ID 11

Tuple ID 5 becomes a dead row once its t_xmax is set to 69103876, the ID of the transaction that updated it.

t_xmin (69103916)

  • UPDATE percona SET col1 = 30 WHERE col1 = 20;
  • Logically DELETE tuple ID 11
  • Physically INSERT tuple ID 12
  • UPDATE tuple ID 11 pointer (t_ctid) to point to tuple ID 12

Once again, tuple ID 11 becomes a dead row once its t_xmax is set to 69103916, the ID of the transaction that updated it.

t_xmin (69103962)

  • UPDATE percona SET col1 = 5 WHERE col1 = 30;
  • Logically DELETE tuple ID 12
  • Physically INSERT tuple ID 13
  • UPDATE tuple ID 12 pointer (t_ctid) to point to tuple ID 13

Tuple ID 13 is live and visible to other transactions. It has no t_xmax and the t_ctid (0,13) points to itself.

The key takeaway is that we have not added or deleted any rows in our table. We still see 10 in the count, yet our heap has grown to 13 tuples as a result of the three update transactions.

At a very high level, this is how PostgreSQL implements MVCC and why we have table bloat in our heap. In essence, every change to data results in a new row version reflecting the latest state of the data. The old versions need to be cleaned up or reused for efficiency.
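If you would rather get a bloat summary than read raw pages, the contrib extension pgstattuple reports dead-tuple counts and percentages directly. A sketch (note that it scans the whole table, so use it with care on large relations):

```sql
CREATE EXTENSION pgstattuple;

-- tuple_count, dead_tuple_count, and dead_tuple_percent
-- quantify the bloat we just created with the three updates.
SELECT tuple_count, dead_tuple_count, dead_tuple_percent
FROM   pgstattuple('percona');
```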

Vacuuming the table

The way to deal with the table bloat is to vacuum the table:

percona=# vacuum percona;

Now, let’s examine the heap again:

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));

lp | lp_off | lp_flags | lp_len |  t_xmin  | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
  1 |   8160 |        1 |     28 | 69099597 |      0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |      0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |      0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |      0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |     13 |        2 |      0 |          |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     28 | 69099778 |      0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   8000 |        1 |     28 | 69099778 |      0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7968 |        1 |     28 | 69099778 |      0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7936 |        1 |     28 | 69099778 |      0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7904 |        1 |     28 | 69099778 |      0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
 11 |      0 |        0 |      0 |          |        |          |        |             |            |        |        |       |
 12 |      0 |        0 |      0 |          |        |          |        |             |            |        |        |       |
 13 |   7872 |        1 |     28 | 69103962 |      0 |        0 | (0,13) |       32769 |      10496 |     24 |        |       | \x05000000
(13 rows)

After vacuuming the table, rows five, 11, and 12 are now free to be used again.

So let’s insert another row, with the value of 11 and see what happens:

percona=# INSERT INTO percona values (11);

Let’s examine the heap once more:

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));

 lp | lp_off | lp_flags | lp_len |  t_xmin  | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
  1 |   8160 |        1 |     28 | 69099597 |      0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |      0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |      0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |      0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |     13 |        2 |      0 |          |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     28 | 69099778 |      0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   8000 |        1 |     28 | 69099778 |      0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7968 |        1 |     28 | 69099778 |      0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7936 |        1 |     28 | 69099778 |      0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7904 |        1 |     28 | 69099778 |      0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
 11 |   7840 |        1 |     28 | 69750201 |      0 |        0 | (0,11) |           1 |       2048 |     24 |        |       | \x0b000000
 12 |      0 |        0 |      0 |          |        |          |        |             |            |        |        |       |
 13 |   7872 |        1 |     28 | 69103962 |      0 |        0 | (0,13) |       32769 |      10496 |     24 |        |       | \x05000000
(13 rows)

Our new tuple (with transaction ID 69750201) reused tuple 11, and now the tuple 11 pointer (0,11) is pointing to itself.

As you can see, the heap did not grow to accommodate the new row. It reused an open slot for the new row, made available when we vacuumed the table, freeing up dead rows (rows that are no longer visible to any transaction).

And there you have it. A step-by-step illustration of how bloat occurs in PostgreSQL!


Introducing Performance Improvement of Window Functions in PostgreSQL 15

When working with databases, there are always some projects oriented to performing analytics and reporting tasks over the information stored in the database. Usually, these tasks leverage window functions to do calculations “across a set of table rows that are somehow related to the current row,” as is described in the documentation.

There are several built-in window functions available in PostgreSQL. In the latest release, PostgreSQL 15, performance improvements were added for the rank(), row_number(), and count() functions. First, let’s review what these functions do.

The window functions

As mentioned above, window functions let us perform calculations on a set of table rows related to the current one. The “set of table rows” is usually identified as a “partition” defined by a column or columns. As we read in the documentation, the named functions behave as follows:

rank () → bigint

Returns the rank of the current row, with gaps; that is, the row_number of the first row in its peer group.

row_number () → bigint

Returns the current row number within its partition, counting from 1.

count ( * ) → bigint

Computes the number of input rows.

Consider the following table and data:

                                          Table "public.employee"
    Column      |           Type           | Collation | Nullable |                 Default
emp_id          | integer                  |           | not null | nextval('employee_emp_id_seq'::regclass)
emp_name        | character varying(20)    |           | not null |
emp_country     | character varying(35)    |           | not null |
emp_salary      | integer                  |           | not null |
date_of_joining | timestamp with time zone |           | not null | now()
    "employee_pkey" PRIMARY KEY, btree (emp_id)

demo=# SELECT * FROM employee ;
emp_id  |   emp_name   | emp_country | emp_salary |    date_of_joining
      1 | KELIO        | Japon       |       2000 | 2021-10-26 00:00:00+00
      2 | JEAN-VINCENT | Canada      |       6500 | 2021-01-22 00:00:00+00
      3 | JUNO         | Japon       |       4000 | 2021-02-27 00:00:00+00
      4 | GUY-EMMANUEL | Salvador    |       2000 | 2020-07-27 00:00:00+00
      5 | WALI         | Japon       |       7000 | 2021-01-31 00:00:00+00
      6 | HENRI-PAUL   | Canada      |       4500 | 2021-08-19 00:00:00+00
      7 | MUHAMED      | France      |       5000 | 2021-07-20 00:00:00+00
      8 | MUHITTIN     | Madagascar  |       2500 | 2021-12-31 00:00:00+00
      9 | DEVLIN       | Madagascar  |       7000 | 2022-04-03 00:00:00+00
     10 | JOSUE        | Salvador    |       5500 | 2020-09-25 00:00:00+00
(10 rows)


We can use the rank() function to get the rank of employees (id) per country based on their salary. Look at the next example.

demo=# SELECT emp_id, emp_salary, emp_country,
         rank() OVER (PARTITION BY emp_country ORDER BY emp_salary DESC)
       FROM employee;

emp_id  | emp_salary | emp_country | rank
      2 |       6500 | Canada      |    1
      6 |       4500 | Canada      |    2
      7 |       5000 | France      |    1
      5 |       7000 | Japon       |    1
      3 |       4000 | Japon       |    2
      1 |       2000 | Japon       |    3
      9 |       7000 | Madagascar  |    1
      8 |       2500 | Madagascar  |    2
     10 |       5500 | Salvador    |    1
      4 |       2000 | Salvador    |    2
(10 rows)


In the next example, the row_number() function gets a sorted list of employees’ names per country and their relative numeric position.

demo=# SELECT emp_id, emp_name, emp_country,
         row_number() OVER (PARTITION BY emp_country ORDER BY emp_name)
       FROM employee;

emp_id  |   emp_name   | emp_country | row_number
      6 | HENRI-PAUL   | Canada      |          1
      2 | JEAN-VINCENT | Canada      |          2
      7 | MUHAMED      | France      |          1
      3 | JUNO         | Japon       |          1
      1 | KELIO        | Japon       |          2
      5 | WALI         | Japon       |          3
      9 | DEVLIN       | Madagascar  |          1
      8 | MUHITTIN     | Madagascar  |          2
      4 | GUY-EMMANUEL | Salvador    |          1
     10 | JOSUE        | Salvador    |          2
(10 rows)


The count() function is an “old known” tool used by almost everyone with access to a SQL engine. This function is part of the aggregate functions list but can act as a window function when an OVER clause follows the call. So we can use it to see how many employees share the same salary as a given employee.

demo=# SELECT emp_name, emp_salary, emp_country,
         count(*) OVER (PARTITION BY emp_salary)
       FROM employee;

  emp_name   | emp_salary | emp_country | count
KELIO        |       2000 | Japon       |     2
GUY-EMMANUEL |       2000 | Salvador    |     2
MUHITTIN     |       2500 | Madagascar  |     1
JUNO         |       4000 | Japon       |     1
HENRI-PAUL   |       4500 | Canada      |     1
MUHAMED      |       5000 | France      |     1
JOSUE        |       5500 | Salvador    |     1
JEAN-VINCENT |       6500 | Canada      |     1
WALI         |       7000 | Japon       |     2
DEVLIN       |       7000 | Madagascar  |     2
(10 rows)
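The difference from the plain aggregate form is that an OVER clause keeps one output row per input row instead of collapsing each group. A quick comparison against the same employee table:

```sql
-- Aggregate: the groups collapse to one row per distinct salary.
SELECT emp_salary, count(*) FROM employee GROUP BY emp_salary;

-- Window: all employee rows are kept, each annotated with its group's count.
SELECT emp_name, emp_salary, count(*) OVER (PARTITION BY emp_salary)
FROM   employee;
```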

Window functions in PostgreSQL 15

Now that we have refreshed what the window functions are, let’s consider what the PostgreSQL 15 release notes say:

Improve the performance of window functions that use row_number(), rank(), and count() (David Rowley)

Accordingly, if we are users of the window functions and move from a previous version to version 15, we should see an improvement in the performance of our workload. Let’s test it.

Laboratory case

To test the performance of the window functions, I created three instances of PostgreSQL, (a) version 13, (b) version 14, and (c) version 15. 

I used the same public.employee table used in the previous examples, and I loaded it with 10K rows. Then I executed the same queries we saw before for the window functions. I got the output from an EXPLAIN (ANALYZE) command which executes the query, and we can see the timing for the specific window function.
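The exact commands are not reproduced here, but judging by the Sort Key in the plans, the rank() measurement was presumably of this shape (row_number() and count() measured the same way):

```sql
-- EXPLAIN (ANALYZE) executes the query and reports per-node timings,
-- including the WindowAgg node we are interested in.
EXPLAIN (ANALYZE)
SELECT emp_id, emp_salary, emp_country,
       rank() OVER (PARTITION BY emp_country ORDER BY emp_salary DESC)
FROM   employee;
```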

The shape of the EXPLAIN (ANALYZE) plan was the same for each version of PostgreSQL; only the timings differed.



                                                      QUERY PLAN
WindowAgg  (cost=1288.49..1488.49 rows=10000 width=25) (actual time=11.946..18.732 rows=10000 loops=1)
  ->  Sort  (cost=1288.49..1313.49 rows=10000 width=17) (actual time=11.928..12.803 rows=10000 loops=1)
        Sort Key: emp_country, emp_salary DESC
        Sort Method: quicksort  Memory: 980kB
        ->  Seq Scan on employee  (cost=0.00..180.00 rows=10000 width=17) (actual time=0.008..2.402 rows=10000 loops=1)
Planning Time: 0.143 ms
Execution Time: 19.268 ms
(7 rows)


                                                      QUERY PLAN
WindowAgg  (cost=844.39..1044.39 rows=10000 width=25) (actual time=12.585..20.921 rows=10000 loops=1)
  ->  Sort  (cost=844.39..869.39 rows=10000 width=17) (actual time=12.560..13.545 rows=10000 loops=1)
        Sort Key: emp_country, emp_salary DESC
        Sort Method: quicksort  Memory: 1020kB
        ->  Seq Scan on employee  (cost=0.00..180.00 rows=10000 width=17) (actual time=0.011..1.741 rows=10000 loops=1)
Planning Time: 0.449 ms
Execution Time: 21.407 ms
(7 rows)


                                                      QUERY PLAN
WindowAgg  (cost=844.39..1044.39 rows=10000 width=25) (actual time=18.949..28.619 rows=10000 loops=1)
  ->  Sort  (cost=844.39..869.39 rows=10000 width=17) (actual time=18.896..19.998 rows=10000 loops=1)
        Sort Key: emp_country, emp_salary DESC
        Sort Method: quicksort  Memory: 1020kB
        ->  Seq Scan on employee  (cost=0.00..180.00 rows=10000 width=17) (actual time=0.011..1.228 rows=10000 loops=1)
Planning Time: 0.460 ms
Execution Time: 29.111 ms
(7 rows)

We can easily see from the WindowAgg node that the total time was smaller in PG15 than in the other two. The performance improvement is clear here.

To verify this is consistent, I got the Total Time for the WindowAgg node from 500 executions and plotted the next graph.

[Graph: WindowAgg total time over 500 executions of the rank() query on PostgreSQL 13, 14, and 15]

We can see the timing from PG15 is better than the other versions. I also added a trend line: PG13 performed the “worst,” and while PG14 showed a better trend, PG15 was the best.

I did the same exercise for the row_number() and count() functions. 


[Graph: WindowAgg total time over 500 executions of the row_number() and count() queries on PostgreSQL 13, 14, and 15]


I also collected the results from 500 executions for these last two functions. In both cases, the difference was smaller than for rank(), but PG15 still showed better results, as seen in the trend line.

The performance improvements for the rank(), row_number(), and count() window functions introduced in PostgreSQL 15 will let all those analytic and reporting projects process their data faster than in previous versions. As always, every environment has its own characteristics and challenges, but as we saw in these quick tests, the new version delivers the improvement out of the box.


PostgreSQL for MySQL DBAs Episode 8: Materialized Views

Episode 8 in the PostgreSQL for MySQL DBAs series covers materialized views. MySQL has had views for many years, and they are very useful, but it has never had materialized views. Those who use materialized views appreciate their utility, and here we will cover how they are used. This episode starts with regular, non-materialized views, to help those who are not used to them, and then moves on to materialized views so you can see the difference.

Quick recap: Why views?

Views are often used to hide underlying column and table names from users. This obfuscation works well, as a user may have permission to query the view while being denied access to the base tables. Having users standardize on the view to get just the columns they need simplifies things: the view hides the joining process, collapsing multiple tables into a single virtual table, and becomes a ‘common’ shorthand for required data. It can also encapsulate the details of your table structure, which might change as your application evolves, behind a consistent interface.

Views are great places for aggregate values like SUM(), AVG(), etc., to keep the calculations separate from the ‘pure’ data. They take up little space as compared to a table. Views can be built on top of other views, and views can be used like a table.

A basic view

Let’s start with a simple table and corresponding simple data.

test=# create table base (a int, b int, c int);
test=# insert into base values (1,2,3),(4,5,6),(7,8,9);

Our first view draws values from the above table.

test=# create view v1 as SELECT a, b, c*4 from base;


The view returns the values of columns a and b, plus the value of c multiplied by four.

test=# select * from v1;
a | b | ?column?
1 | 2 | 12
4 | 5 | 24
7 | 8 | 36
(3 rows)

I can use this view like a table by using a WHERE clause.

test=# select * from v1 where a > 6;
a | b | ?column?
7 | 8 | 36
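As mentioned in the recap, views are also a natural home for aggregates, keeping the calculations separate from the ‘pure’ data. A sketch against the same base table (the view name is just an example):

```sql
-- An aggregate view over the base table; the totals are
-- computed fresh each time the view is queried.
CREATE VIEW v_totals AS
SELECT sum(a) AS sum_a, avg(b) AS avg_b, max(c) AS max_c
FROM   base;

SELECT * FROM v_totals;
```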

Materialized views

A materialized view is a static snapshot of the data, the two key words being ‘static’ and ‘snapshot’. If many users frequently ask for the same data, you can use a materialized view as a cache. You may be familiar with stock quotation services that only show updated values every five or so minutes; this can take a major load off the database instance as it handles requests against the base tables.

test=# create materialized view v2 as SELECT a, b, c*4 from base;
test=# select * from v2 where a > 6;
a | b | ?column?
7 | 8 | 36
(1 row)

The keyword MATERIALIZED is added to the view definition.  In the above example, the materialized view uses the same logic as the previous non-materialized view.  This caches the values for later use.

But be careful.  This materialized view is a snapshot of a particular point in time.  Why be careful?  Consider the following situation where we update a value in the table.

test=# update base set b=99 where a = 7;
test=# select * from v2 where a > 6;
a | b | ?column?
7 | 8 | 36
(1 row)

The value for b is updated in the base table; however, the materialized view is not updated.

But it is easy to ‘update’ the materialized view with the REFRESH keyword.

test=# refresh materialized view v2;
test=# select * from v2 where a > 6;
a | b | ?column?
7 | 99 | 36
(1 row)

Materialized views are a great way to cache answers but you have to be aware that the data may be out of date.
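One refinement worth knowing: a plain REFRESH locks out readers while it runs. If the materialized view has at least one unique index, you can refresh it without blocking concurrent SELECTs. A sketch (the index name is illustrative):

```sql
-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
CREATE UNIQUE INDEX v2_a_idx ON v2 (a);

-- Readers can keep querying v2 while this runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY v2;
```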

Need materialized views for MySQL today?

MySQL and MariaDB do not have materialized views.  You could file a feature request with those vendors, but you will not see materialized views implemented in the next release or so.

Another option is EdgeDB, which has materialized views along with a columnar storage engine that is compatible with both MySQL and MariaDB.  This builds on the previous work of Justin Swanhart in the areas of materialized views and the WARP storage engine.


Now you know how materialized views work and how they differ from regular views.

The past videos for PostgreSQL for MySQL Database Administrators (DBA) can be found here: episode one, episode two, episode three, episode four, episode five, episode six, and episode seven.


PostgreSQL for MySQL DBAs Episode 7: Vacuuming Tables

Experienced MySQL DBAs exploring PostgreSQL are often shocked to learn that they will need to vacuum their tables. This does not mean using a Hoover-brand device or letting a Roomba-like robot loose on the server. Vacuuming is a necessary part of maintenance on a PostgreSQL server, which is why you need to watch Episode 7 for the details on vacuuming.

The documentation states that using VACUUM reclaims storage occupied by dead tuples (A tuple is PostgreSQL’s internal representation of a row in a table). Tuples that are deleted or obsoleted by an update are not physically removed from their table, so they remain present until a VACUUM is done. And it is necessary to do VACUUM periodically, especially on frequently-updated tables.


Let’s create some sample data that leaves old tuples hanging around, cluttering up our server.

test=# create table foo (id int, value int);
test=# insert into foo values (1,1);

And add some data, updating the same record repeatedly.

test=# update foo set value=2 where id =1;
test=# update foo set value=3 where id =1;
test=# update foo set value=4 where id =1;

And we can check to see the dead tuples.

test=# select relname, n_dead_tup from pg_stat_all_tables where relname = 'foo';
relname | n_dead_tup
foo | 3
(1 row)

Now, we can vacuum.

test=# VACUUM foo;
test=# select relname, n_dead_tup from pg_stat_all_tables where relname = 'foo';
relname | n_dead_tup
foo | 0
(1 row)

Why vacuum?

PostgreSQL’s VACUUM command has to process each table on a regular basis for several reasons:

  • To recover or reuse disk space occupied by updated or deleted rows.
  • To update data statistics used by the PostgreSQL query planner.
  • To update the visibility map, which speeds up index-only scans.
  • To protect against loss of very old data due to transaction ID wraparound.

The latter two bullet points will be new to MySQL DBAs.  Vacuum maintains a visibility map for each table to keep track of which pages contain only tuples that are visible to all active transactions. This lets vacuum skip those pages, which speeds up vacuuming. It also allows PostgreSQL to answer some queries from the index alone (the corresponding indexes get vacuumed too), without touching the underlying table; these are the index-only scans, which do not need to dive into the table data to answer the query.

Wraparound of transaction IDs is something to be worried about.  Postgres has a 32-bit transaction ID counter that wraps around to zero and starts again. Compare this to MySQL running out of AUTO_INCREMENT values and simply refusing to add more rows until corrected. When the wrap happens, old transactions that were in the past now appear to be in the future, and their output becomes invisible. This is catastrophic data loss! Worst of all, your data is safe, but you have ZERO access to it.

To avoid this, it is necessary to vacuum every table in every database at least once every two billion transactions.
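A hedged way to see how close each database is to that limit is to check the age of its oldest unfrozen transaction ID:

```sql
-- age() counts transactions since datfrozenxid; vacuum must freeze
-- old tuples well before this approaches ~2 billion.
SELECT datname, age(datfrozenxid) AS xid_age
FROM   pg_database
ORDER  BY age(datfrozenxid) DESC;
```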


PostgreSQL has an optional but highly recommended feature called autovacuum, whose purpose is to automate the execution of VACUUM and ANALYZE commands.  Most recent versions have this turned on but double-check as detailed below.  Autovacuum is highly configurable and keeps you from having to remember to run it.

test=# SHOW autovacuum;
 autovacuum
------------
 on
(1 row)

The past videos for PostgreSQL for MySQL Database Administrators (DBA) can be found here: episode one, episode two, episode three, episode four, episode five, and episode six.


PostgreSQL for MySQL DBAs Episode 6 – Explaining EXPLAIN (And an Answer to a Bonus Question)


The differences between MySQL and PostgreSQL are often trivial, but occasionally there are stark differences.  A MySQL DBA wanting to optimize a query on a PostgreSQL server will hopefully have some experience with using EXPLAIN.  For the uninitiated, the keyword EXPLAIN is prepended to a query to reveal what the server plans to do to return the data requested by that query.  The implementations of the two versions of EXPLAIN are very different.  Episode six of the PostgreSQL for MySQL DBA series covers EXPLAIN.

So what is different? PostgreSQL adds XML and YAML output format options beyond the traditional and JSON formats found in MySQL.  The PostgreSQL output looks, at least to me, like the TREE output of MySQL. But since the two databases are ‘mechanically’ different, you need to learn how to interpret the output PostgreSQL provides.

The following example provides details such as the mechanism the server will use to get the data, the start-up cost, the overall cost, the number of rows to be returned, and the name of the key (if any) used. Refer to the video for details.

Index Only Scan using t1_pkey on t1 (cost=0.29..4.31 rows=1 width=4)
Index Cond: (id = 101)
(2 rows)

Please refer to the video to see the differences and for a quick introduction to PostgreSQL’s EXPLAIN.

Quiz Answer

I added a ‘bonus quiz question’ to the presentation and video.


The bonus quiz question from the video

And the first person to respond was Jack T:

You said earlier that a ‘Seq Scan’ is a full-table scan and it is taking 15.54ms to execute just that scan. In theory, if you add an index on postal_code, then that changes to an ‘Index Scan’ and the execution time should decrease. In MySQL, this sub-query pattern is recognized as a ‘semi-join’ and is executed as a JOIN. Does PGSQL have similar optimizations for rewriting?

This is where experience with one database helps you master another.  Generally, adding an index will speed up a query.  But one of the big things to remember about PostgreSQL is that it has different ways of doing things.

Let’s rerun the EXPLAIN and thankfully the numbers match from my test machine.

dvdrental=# EXPLAIN SELECT * FROM customer WHERE address_id IN (SELECT address_id FROM address WHERE postal_code = '52137');
Nested Loop (cost=0.28..32.14 rows=2 width=70)
-> Seq Scan on address (cost=0.00..15.54 rows=2 width=4)
Filter: ((postal_code)::text = '52137'::text)
-> Index Scan using idx_fk_address_id on customer (cost=0.28..8.29 rows=1 width=70)
Index Cond: (address_id = address.address_id)

(5 rows)

Then we can create an index on the postal_code column.

dvdrental=# CREATE INDEX quiz_answer_1 ON address (postal_code);

So, we rerun explain and peek at the results.

dvdrental=# EXPLAIN SELECT * FROM customer WHERE address_id IN (SELECT address_id FROM address WHERE postal_code = '52137');
Nested Loop (cost=4.57..25.92 rows=2 width=70)
-> Bitmap Heap Scan on address (cost=4.29..9.32 rows=2 width=4)
Recheck Cond: ((postal_code)::text = '52137'::text)
-> Bitmap Index Scan on quiz_answer_1 (cost=0.00..4.29 rows=2 width=0)
Index Cond: ((postal_code)::text = '52137'::text)
-> Index Scan using idx_fk_address_id on customer (cost=0.28..8.29 rows=1 width=70)
Index Cond: (address_id = address.address_id)

(7 rows)

The results are interesting. Note that the cost of the index scan on idx_fk_address_id stays the same, as the new index does not affect the customer table. But the new index does bring the scan on the address table down from 15.54 to 9.32, and the nested loop cost drops from 32.14 to 25.92. The optimization is a bitmap scan.

From the PostgreSQL manual – Here the planner has decided to use a two-step plan: the bottom plan node visits an index to find the locations of rows matching the index condition, and then the upper plan node actually fetches those rows from the table itself. Fetching the rows separately is much more expensive than sequentially reading them, but because not all the pages of the table have to be visited, this is still cheaper than a sequential scan. (The reason for using two levels of plan is that the upper plan node sorts the row locations identified by the index into physical order before reading them, so as to minimize the costs of the separate fetches. The “bitmap” mentioned in the node names is the mechanism that does the sorting.)

So the index does speed the query up but with a much different optimization than what MySQL would use.

Next episode — Vacuuming Tables

Stay tuned!

The past videos for PostgreSQL for MySQL Database Administrators (DBA) can be found here: episode one, episode two, episode three, episode four, and episode five.


Parallel Commits for Transactions Using postgres_fdw on PostgreSQL 15

Transactions Using postgres_fdw on PostgreSQL 15

Consuming data from a remote foreign database into our local database just like another local table is a great feature that adds an excellent level of flexibility to the design of a database solution. 

This option is present in a number of database products. Oracle (DBLINK), MySQL (FEDERATED Engine), and obviously, PostgreSQL (dblink, postgres_fdw) support it. PostgreSQL supports multiple foreign data wrappers. But in this blog, we will review a specific feature added for postgres_fdw in the new version 15.

How postgres_fdw works

Let’s say that you have a local and a remote PostgreSQL server. The second one has a table that your application using the local server needs to access. You might think of any of the following solutions:

  1. Configure your application to be able to connect both servers.
  2. Set up some background jobs to perform a copy every now and then.
  3. Configure logical replication between the two servers so the local table syncs up with the remote one.

Each of the above has its own advantages and disadvantages. For the first one, you might need to configure your network and grant access privileges to both DB servers. All this might add complexity and open the door to security issues. The second option adds extra load on both servers, and the data will not be up-to-date in real time, so some read inconsistencies might happen. The third one actually could be a good option, but if you are planning to MODIFY the synced data, then you might face some collisions.  

Then the solution might be to rely on a Foreign Data Wrapper. From the official documentation, the following steps are required to access data from a remote PostgreSQL server:

  1. Install the postgres_fdw extension using CREATE EXTENSION.
  2. Create a foreign server object, using CREATE SERVER, to represent each remote database you want to connect to. Specify connection information, except user and password, as options of the server object.
  3. Create a user mapping, using CREATE USER MAPPING, for each database user you want to allow to access each foreign server. Specify the remote user name and password to use as user and password options of the user mapping.
  4. Create a foreign table, using CREATE FOREIGN TABLE or IMPORT FOREIGN SCHEMA, for each remote table you want to access. The columns of the foreign table must match the referenced remote table. You can, however, use table and/or column names different from the remote tables, if you specify the correct remote names as options of the foreign table object.
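The steps above can be sketched as follows (the server name, connection details, and table definition here are hypothetical):

```sql
-- 1. Install the extension.
CREATE EXTENSION postgres_fdw;

-- 2. Define the remote server (connection info, but no credentials).
CREATE SERVER remote_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', port '5432', dbname 'appdb');

-- 3. Map a local user to the remote credentials.
CREATE USER MAPPING FOR local_user
    SERVER remote_server
    OPTIONS (user 'remote_user', password 'secret');

-- 4. Declare the foreign table; columns must match the remote table.
CREATE FOREIGN TABLE remote_accounts (
    id      integer,
    balance numeric
) SERVER remote_server
  OPTIONS (schema_name 'public', table_name 'accounts');
```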

You will end with something like this:

Foreign Data Wrapper PostgreSQL

So now, with the proper privileges, your application can access the data from the remote table just like another local table; you can SELECT from it or even execute DML statements on it.

What has changed on PG15?

The described functionality has been present in PostgreSQL for a while; the new enhancement arriving in version 15 is the parallel transaction commit when using postgres_fdw. This is particularly useful when your solution design relies on having multiple remote servers, such as a distributed PostgreSQL database. 

From the Transaction Management documentation, we know that “during a query that references any remote tables on a foreign server, postgres_fdw opens a transaction on the remote server if one is not already open corresponding to the current local transaction”. In case our local transaction references multiple remote tables from different foreign servers, a (sub)transaction is opened on each remote server.

Up to version 14, when there are multiple (sub)transactions tied to a local transaction and the transaction is committed locally, postgres_fdw commits each remote (sub)transaction one by one (sequentially). This behavior is still the default on version 15, but now we have the option to enable the parallel_commit option for each foreign server tied to a local transaction, and in that case, when the local transaction commits all the remote (sub)transactions will commit at once (parallel). 

Laboratory case

I built a laboratory to test the sequential and the parallel remote (sub)transactions commit and catch the differences. 

I decided to use the pgbench tool for the benchmark, with its option to create partitions for the pgbench_accounts table. I adapted these partitions to be foreign tables, just as described in a very interesting Percona blog about sharding on PostgreSQL, which is worth a look. Finally, I created a partitioned table with 70 partitions distributed over seven remote servers, each one tied to a postgres_fdw foreign server. 

The next diagram gives some insight.

postgres_fdw foreign server

Pgbench execution

To test the new feature I tried two different loads with the parallel_commit option disabled (default) and then enabled:

  • I ran the “normal” pgbench workload. It accesses one row per transaction, which means every local transaction opened only one remote (sub)transaction.
  • I ran a pgbench “custom” script to retrieve 100 rows per transaction. In this case, every local transaction opened multiple remote (sub)transactions.

Running the regular pgbench load against the PG15 server was nothing more than:

pgbench --progress-timestamp -c 16 -j 4 -T 3600 -P 2 \
  -h pg15 -U pgbench pgbench

After getting the results, I enabled the parallel_commit option on each foreign server as follows:

ALTER SERVER remote_server<N> OPTIONS (ADD parallel_commit 'true');

And repeated the pgbench load. The results are as follows.

pgbench load

We see no big difference between the load with the parallel option set OFF or ON. This was expected since, in the regular pgbench load, every local transaction opened only one remote transaction.

For the second test scenario, I used a custom pgbench script to fetch 100 rows in each transaction. It was as follows:

$ cat pgbench_custom.sql
\set v1 random(1, 100000 * :scale)
\set v2 :v1 + 100

  SELECT abalance FROM pgbench_accounts
  WHERE aid BETWEEN :v1 AND :v2;

Before running the load, I disabled the parallel_commit option on the foreign servers by executing the following for each one:

ALTER SERVER remote_server<N> OPTIONS (DROP parallel_commit);

And then I ran the custom load with NO parallel commit as follows:

pgbench --progress-timestamp -c 8 -j 4 -T 3600 -P 2 \
  -h pg15 -U pgbench -f pgbench_custom.sql pgbench

Then I enabled it on each foreign server with the command used before and ran the custom load again. The results are as follows.

foreign server

Now we see a clear improvement with the parallel_commit option enabled. The number of TPS was consistently higher during the test when the option was enabled, showing how performance can be improved.

In general, the TPS dropped compared with the regular pgbench load. The details are out of scope for this blog post; the goal here was to get an idea of how performance might be improved using the new parallel_commit option under similar conditions. 


The capabilities a PostgreSQL database gets with postgres_fdw have been known for a while, and multiple system designs have taken advantage of them. 

The addition of the new parallel_commit option when using postgres_fdw in PostgreSQL 15 brings a very good opportunity to improve the performance of our databases if they are designed as distributed systems and our local transactions would be tied to multiple remote (sub)transactions. Every solution and system design requires its own analysis. This blog post only intends to show the new option, and how we can deliver a higher throughput under the same conditions just by enabling the new parallel_commit option. 


Introducing PostgreSQL 15: Working with DISTINCT

Introducing PostgreSQL 15: Working with DISTINCT

Well, it’s that time of the year when once again we have a look at the newest version of PostgreSQL.

As tradition dictates, here at Percona, the team is given a list of features to write about. Mine happened to be about a very basic and, I might add, important function i.e. SELECT DISTINCT.

Before getting into the details I’d like to mention a couple of caveats regarding how the results were derived for this blog:

  1. The tables are pretty small and of a simple architecture.
  2. Because this demonstration was performed upon a relatively low-powered system, the real metrics have the potential of being significantly greater than what is demonstrated.

For those new to postgres, and the ANSI SQL standard for that matter, the SELECT DISTINCT statement eliminates duplicate rows from the result by matching specified expressions.

For example, given the following table:

table t_ex;
 c1 | c2
  2 | B
  4 | C
  6 | A
  2 | C
  4 | B
  6 | B
  2 | A
  4 | B
  6 | C
  2 | C

This SQL statement returns one record for each UNIQUE value found in column “c1”, in SORTED order:

select distinct on(c1) * from t_ex;

Notice, as indicated by column “c2”, that c1 uniqueness returns one row per value; without an ORDER BY clause, which row is chosen is not guaranteed:

c1 | c2
  2 | B
  4 | B
  6 | B

This SQL statement returns one record for each UNIQUE value found in column “c2”:

select distinct on(c2) * from t_ex;
c1 | c2
  6 | A
  2 | B
  4 | C

And finally, of course, returning uniqueness for the entire row:

select distinct * from t_ex;
c1 | c2
  2 | A
  6 | B
  4 | C
  2 | B
  6 | A
  2 | C
  4 | B
  6 | C

So what’s this special new enhancement of DISTINCT you ask? The answer is that it’s been parallelized!

In the past, only a single CPU/process was used to count the number of distinct records. However, in postgres version 15 one can now break up the task of counting by running a number of workers in parallel, each assigned to a separate CPU process. There are a number of runtime parameters controlling this behavior, but the one we’ll focus on is max_parallel_workers_per_gather.
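As a sketch, the worker count can be raised per session before running the query (the value here is illustrative):

```sql
-- Allow up to four parallel workers per Gather node for this session,
-- then inspect the plan PostgreSQL chooses for a parallel DISTINCT.
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM t10;
```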

Let’s generate some metrics!

In order to demonstrate this improved performance three tables were created, without indexes, and populated with approximately 5,000,000 records. Notice the number of columns for each table i.e. one, five, and 10 respectively:

Table "public.t1"
 Column |  Type   | Collation | Nullable | Default
 c1     | integer |           |          |
Table "public.t5"
 Column |         Type          | Collation | Nullable | Default
 c1     | integer               |           |          |
 c2     | integer               |           |          |
 c3     | integer               |           |          |
 c4     | integer               |           |          |
 c5     | character varying(40) |           |          |
Table "public.t10"
 Column |         Type          | Collation | Nullable | Default
 c1     | integer               |           |          |
 c2     | integer               |           |          |
 c3     | integer               |           |          |
 c4     | integer               |           |          |
 c5     | character varying(40) |           |          |
 c6     | integer               |           |          |
 c7     | integer               |           |          |
 c8     | integer               |           |          |
 c9     | integer               |           |          |
 c10    | integer               |           |          |
insert into t1 select generate_series(1,500);
insert into t5
select   generate_series(1,500)
insert into t10
select   generate_series(1,500)
List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  |
 public | t1   | table | postgres | permanent   | heap          | 173 MB |
 public | t10  | table | postgres | permanent   | heap          | 522 MB |
 public | t5   | table | postgres | permanent   | heap          | 404 MB |
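The INSERT statements above are truncated; one hedged way to produce a comparable data set for t1 (5,000,000 rows spread over 500 distinct values, matching the rows=500 estimates in the plans) would be the following, with t5 and t10 populated similarly plus their extra columns:

```sql
-- Hypothetical reconstruction: 5,000,000 rows, values cycling 0..499.
INSERT INTO t1
SELECT i % 500
FROM   generate_series(1, 5000000) AS s(i);
```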

The next step is to copy the aforementioned data dumps into the following versions of postgres:


The postgres binaries were compiled from the source and data clusters were created on the same low-powered hardware using the default, and untuned, runtime configuration values.

Once populated, the following bash script was executed to generate the results:

for v in 96 10 11 12 13 14 15
do
    # run the explain analyze 5X in order to derive consistent numbers
    for u in $(seq 1 5)
    do
        echo "--- explain analyze: pg${v}, ${u}X ---"
        # append (>>) so the five runs per version are all kept
        psql -p 100$v db01 -c "explain analyze select distinct on (c1) * from t1" >> $v.explain.txt
        psql -p 100$v db01 -c "explain analyze select distinct * from t5" >> $v.explain.txt
        psql -p 100$v db01 -c "explain analyze select distinct * from t10" >> $v.explain.txt
    done
done

And here are the results: One can see that the larger the tables become the greater the performance gains that can be achieved.


[Results table: execution times in ms for 1 column (t1), 5 columns (t5), and 10 columns (t10) across the tested PostgreSQL versions]
One of the more interesting aspects of the investigation was reviewing the query plans between the different versions of postgres. For example, the query plan for a single column DISTINCT was actually quite similar, ignoring the superior execution time of course, between the postgres 9.6 and 15 plans respectively.

 Unique  (cost=765185.42..790185.42 rows=500 width=4) (actual time=2456.805..3381.230 rows=500 loops=1)
   ->  Sort  (cost=765185.42..777685.42 rows=5000000 width=4) (actual time=2456.804..3163.600 rows=5000000 loops=1)
         Sort Key: c1
         Sort Method: external merge  Disk: 68432kB
         ->  Seq Scan on t1  (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.055..291.523 rows=5000000 loops=1)
 Planning time: 0.161 ms
 Execution time: 3381.662 ms

 Unique  (cost=557992.61..582992.61 rows=500 width=4) (actual time=946.556..1411.421 rows=500 loops=1)
   ->  Sort  (cost=557992.61..570492.61 rows=5000000 width=4) (actual time=946.554..1223.289 rows=5000000 loops=1)
         Sort Key: c1
         Sort Method: external merge  Disk: 58720kB
         ->  Seq Scan on t1  (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.038..259.329 rows=5000000 loops=1)
 Planning Time: 0.229 ms
   Functions: 1
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.150 ms, Inlining 31.332 ms, Optimization 6.746 ms, Emission 6.847 ms, Total 45.074 ms
 Execution Time: 1438.683 ms

The real difference showed up when the number of DISTINCT columns was increased, as demonstrated by querying table t10. One can see parallelization in action!

 Unique  (cost=1119650.30..1257425.30 rows=501000 width=73) (actual time=14257.801..20024.271 rows=50601 loops=1)
   ->  Sort  (cost=1119650.30..1132175.30 rows=5010000 width=73) (actual time=14257.800..19118.145 rows=5010000 loops=1)
         Sort Key: c1, c2, c3, c4, c5, c6, c7, c8, c9, c10
         Sort Method: external merge  Disk: 421232kB
         ->  Seq Scan on t10  (cost=0.00..116900.00 rows=5010000 width=73) (actual time=0.073..419.701 rows=5010000 loops=1)
 Planning time: 0.352 ms
 Execution time: 20025.956 ms
-------------------------------------------------------------------------------------------
 HashAggregate  (cost=699692.77..730144.18 rows=501000 width=73) (actual time=1212.779..1232.667 rows=50601 loops=1)
   Group Key: c1, c2, c3, c4, c5, c6, c7, c8, c9, c10
   Planned Partitions: 16  Batches: 17  Memory Usage: 8373kB  Disk Usage: 2976kB
   ->  Gather  (cost=394624.22..552837.15 rows=1002000 width=73) (actual time=1071.280..1141.814 rows=151803 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  HashAggregate  (cost=393624.22..451637.15 rows=501000 width=73) (actual time=1064.261..1122.628 rows=50601 loops=3)
               Group Key: c1, c2, c3, c4, c5, c6, c7, c8, c9, c10
               Planned Partitions: 16  Batches: 17  Memory Usage: 8373kB  Disk Usage: 15176kB
               Worker 0:  Batches: 17  Memory Usage: 8373kB  Disk Usage: 18464kB
               Worker 1:  Batches: 17  Memory Usage: 8373kB  Disk Usage: 19464kB
               ->  Parallel Seq Scan on t10  (cost=0.00..87675.00 rows=2087500 width=73) (actual time=0.072..159.083 rows=1670000 loops=3)
 Planning Time: 0.286 ms
   Functions: 31
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 3.510 ms, Inlining 123.698 ms, Optimization 200.805 ms, Emission 149.608 ms, Total 477.621 ms
 Execution Time: 1244.556 ms

INCREASING THE PERFORMANCE: Performance enhancements were made by updating the postgres runtime parameter max_parallel_workers_per_gather. The default value in a newly initialized cluster is 2. As the table below indicates, it quickly became an issue of diminishing returns due to the restricted capabilities of the testing hardware itself.



[Results chart: execution times for 1 column (t1), 5 columns (t5), and 10 columns (t10) as max_parallel_workers_per_gather increases]

ABOUT INDEXES: Performance improvements were not realized when indexes were applied as demonstrated in this query plan.

PG15, TABLE T10(10 DISTINCT columns), and max_parallel_workers_per_gather=4:

QUERY PLAN                                                                                 
 Unique  (cost=0.43..251344.40 rows=501000 width=73) (actual time=0.060..1240.729 rows=50601 loops=1)
   ->  Index Only Scan using t10_c1_c2_c3_c4_c5_c6_c7_c8_c9_c10_idx on t10  (cost=0.43..126094.40 rows=5010000 width=73) (actual time=0.058..710.780 rows=5010000 loops=1)
         Heap Fetches: 582675
 Planning Time: 0.596 ms
   Functions: 1
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.262 ms, Inlining 0.000 ms, Optimization 0.122 ms, Emission 2.295 ms, Total 2.679 ms
 Execution Time: 1249.391 ms

CONCLUDING THOUGHTS: Running DISTINCT across multiple CPUs is a big advance in performance capabilities. But keep in mind the risk of diminishing returns as you increase max_parallel_workers_per_gather and approach your hardware’s limitations. And as the investigation showed, under normal circumstances, the query planner might decide to use indexes instead of running parallel workers. One way to get around this is to consider disabling runtime parameters such as enable_indexonlyscan and enable_indexscan. Finally, don’t forget to run EXPLAIN ANALYZE in order to understand what’s going on.


PostgreSQL Replication and Conflicts

PostgreSQL Replication and Conflicts

Replication sits at the core of database high availability (HA). While on-disk data can be replicated through various methods, whether hardware or software, disk, file, or block-based solutions, without an online copy of the database, HA setup cannot be achieved.

This blog digs into the challenges and conflicts that impact PostgreSQL replication in more detail, but before we do that, let’s understand some key concepts and the evolution of replication in PostgreSQL.

Evolution of replication in PostgreSQL

It comes as no surprise that the PostgreSQL community over the years has invested in building solid replication features that help deliver both HA and DR. Following is a timeline of how the replication core evolved over the years starting with the Write-Ahead Logging (WAL) and finally discussing synchronous streaming replication.

Write-Ahead Logging (WAL)

Write-Ahead Logging (WAL) was added to PostgreSQL in 2001 as part of the 7.1 release. That was a major step in ensuring durability as it ensured the “D” in ACID compliance. Beyond durability, this also improved performance and response times for transactions by reducing the need to write the same page multiple times.

Replication features

It was quite a few years after the introduction of WAL that the first major replication feature, warm standby, was implemented in PostgreSQL.

Warm standby or warm standby log shipping

The warm standby feature was introduced in the 8.2 release in 2006. This feature brought in the concept of continuously archiving and continuously recovering.

On standby, the server would be continuously recovering. This means that although the server is online, it will not be able to accept any connections. All it does is process the WAL file segments that it receives from the primary server’s continuously archiving process.

This was asynchronous replication. Multiple WAL records were packaged into a file and shipped out to the standby. Data loss was possible in case of a primary server crash, as some WAL records may not have been shipped.

So this was PostgreSQL’s first step toward an HA solution!

Warm to hot: Streaming replication

With the 9.0 release in 2010, streaming replication and hot standbys were introduced. Two new postmaster processes were added; walsender and walreceiver. This was essentially the same concept as continuously archiving and continuously recovering, but implemented through background processes. However, this enabled the standby to now honor incoming connections and run read-only queries.

These background processes were also responsible for flushing WAL files to disk on the primary and standby servers.

Two functions for monitoring replication were added as part of this release; pg_last_xlog_receive_location and pg_last_xlog_replay_location.

However, the replication remained asynchronous and data loss was possible in case of a primary server crash.

Synchronous streaming replication

The potential for data loss was eliminated with the 9.1 release in 2011 with synchronous streaming replication. This ensured the strict data integrity guarantees that are suitable for an HA system. The synchronous mechanism could be made asynchronous for a transaction by issuing “SET LOCAL synchronous_commit TO OFF”.
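As a sketch, opting a single transaction out of synchronous commit (via the synchronous_commit parameter) looks like this; the accounts table is hypothetical:

```sql
BEGIN;
-- Only this transaction commits without waiting for the synchronous standby.
SET LOCAL synchronous_commit TO OFF;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;
```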

Additionally, this release introduced the concepts of quorum, priorities, connected, and active standbys.

Setting up streaming replication

Setting up streaming replication is actually rather simple. It’s a straightforward five-step process:

  • Create a user for replication. We’ll be using rep_user for this example.
-- Let’s create rep_user with a replication role in the primary database
-- (the password below is a placeholder).
CREATE USER rep_user WITH REPLICATION ENCRYPTED PASSWORD 'secret';

  • Add replication user to pg_hba.conf on the primary server.
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     rep_user   <ip address range>      <auth_method>

  • Reload the pg_hba.conf on the primary server either through “pg_ctl reload”, SQL function pg_reload_conf(), or through “kill -HUP”.
  • Take a backup of the primary database using pg_basebackup with “-R” option to write the replication information in the new data directory.
  • Start the standby server.

And voilà, streaming replication is enabled. The PostgreSQL official documentation explains the streaming replication configuration in more detail here.

Replication views and configurations

In terms of configuring replication, the process becomes a little tricky as one has to configure the primary, sending, and standby servers optimally to meet the performance and uptime requirements.

A sending server is any PostgreSQL server in a replication setup that is shipping WAL segments to standby servers. This may be the primary server or an intermediate standby server in cascade replication configuration.


There are about 20 different configuration options for primary, sending, and standby servers. These configurations are for WAL, vacuum, connection, promotion, replication delays, and replication feedback.


pg_stat_replication and pg_stat_wal_receiver views provide information about replication on the primary and standby servers respectively. 

On the primary server, pg_stat_replication contains one row per walsender process connected to a standby. Each row for a given standby also shows the application and client connection information along with the state of replication, the write, flush, and replay LSNs, and the lag intervals. For the primary server itself, it shows the sent LSN.

On the standby server, pg_stat_wal_receiver exposes very similar connection and LSN information as pg_stat_replication, but additionally shows the replication slot in use, if any. It only has a single row, for the walreceiver process.
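A minimal sketch of querying the primary-side view (column names from pg_stat_replication):

```sql
-- One row per walsender, i.e., per connected standby.
SELECT application_name, state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM   pg_stat_replication;
```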

Replication conflicts

Replication conflicts only occur on the standby servers. There are no conflicts on the primary server. The conflicts occur on standby as the walreceiver process initiates the WAL redo operation. WAL redo operation requires exclusive locks, which may be held by any long-running queries or transactions. The redo operation waits until a timeout is reached, after which any backends holding locks that prevent WAL processing are terminated. And these terminations are classified as replication conflicts. The nature of these conflicts can be very different.

PostgreSQL has pg_stat_database_conflicts view that tells us what type of conflicts and how many have occurred.

We can conveniently ignore the pg_stat_database view as it only has a single BigInt field for maintaining conflicts count. It doesn’t really provide any insights into the nature of these conflicts.


On a standby server, pg_stat_database_conflicts has one row per database showing database-wide query cancellation stats due to recovery. It’s a fairly simple view with datid, datname, and five bigints for counting different types of conflicts:

  • Tablespace,
  • Lock,
  • Snapshot,
  • Bufferpin, and
  • Deadlock.
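On a standby, these per-database counters can be read directly; a sketch:

```sql
-- One row per database; each column counts cancelled queries by cause.
SELECT datname,
       confl_tablespace, confl_lock, confl_snapshot,
       confl_bufferpin, confl_deadlock
FROM   pg_stat_database_conflicts;
```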

Tablespace conflict

Tablespace conflicts are likely to occur if a tablespace listed in temp_tablespaces is dropped. temp_tablespaces is used for temporary tables, indexes, and sort operations. 

These types of conflicts are somewhat uncommon, as such operations will be few and far between.

Lock conflict

Lock conflicts may occur when you query a table for contents on the standby and the table is remotely altered on the primary.

This will lead to locking contention as the walreceiver process wants to make changes by acquiring an ACCESS EXCLUSIVE lock on an index/relation. However, another query is holding ACCESS SHARE lock.

Snapshot conflict

Snapshot conflicts are generated when a backend process tries to access rows on the standby server that have already been vacuumed away on the primary.

Bufferpin conflict

These are mostly caused by HOT chain pruning or, in version 14, perhaps by duplicate version churn for B-tree indexes, as these operations require an exclusive lock on a page.

These types of conflicts are not that common.

Deadlock conflict

These are conflicts due to query cancellations occurring because of deadlocks on standby.

Other conflicts

There may be other types of conflicts, but those aren’t captured by pg_stat_database_conflicts view. For example, one may experience a database conflict where a database is dropped on the primary but standby may have active connections using it.

Survey: Issues faced with PostgreSQL replication

Based on a survey conducted amongst friends and colleagues, more than 50% of respondents have experienced issues with PostgreSQL replication.

The types of issues and conflicts experienced also give us insight into what to focus on while configuring replication.

If you wish to share your experience with PostgreSQL replication with me, then you may do so using this Google form.


There will be more blogs coming from me discussing how some of these replication conflicts occur and how the standby server reacts internally. The key is to get a better under-the-hood perspective so that we can improve the replication configuration to avoid such conflicts, leading to improved performance, better uptime, and simpler debugging.


Percona Operator Volume Expansion Without Downtime

There are several ways to manage storage in Percona Kubernetes Operators: Persistent Volume (PV), hostPath, ephemeral storage, etc. Using PVs, which are provisioned by the Operator through Storage Classes and Persistent Volume Claims, is the most popular choice among our users. And one of the most popular questions is how to scale Operator storage that is based on PVs. To make PV resizing easier, the volume expansion feature was introduced as an alpha feature in Kubernetes 1.8 and eventually became a stable feature in 1.24.

 In this blog post, I will show you how to easily increase the storage size in Percona Operators using this feature without any database downtime and explain what to do if your storage class doesn’t support volume expansion. 

Scale-up persistent volume claim (PVC) by volume expansion

  • You can only resize a PVC if the storage class of the PVC has allowVolumeExpansion: true set.
kubectl describe sc <storage class name> | grep allowVolumeExpansion

  • Ensure that the delete PVC finalizer (delete-pxc-pvc/delete-psmdb-pvc) is not set in the custom resource, otherwise, all cluster data can be lost. 
  • Please refer to the volume expansion documentation for the in-tree volume types that support volume expansion.
  • If the underlying Storage Driver can only support offline expansion, users of the PVC must take down their Pod before expansion can succeed. Please refer to the documentation of your storage provider to understand which modes of volume expansion are supported. 

Please note that expanding EBS volumes is a time-consuming operation. Also, there is a per-volume quota of one modification every six hours. 
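As a quick pre-flight check, you can list every storage class along with its expansion capability in one command (the column names here are chosen for illustration):

```shell
# List all storage classes with whether they allow volume expansion
kubectl get sc -o custom-columns=NAME:.metadata.name,EXPANSION:.allowVolumeExpansion
```

Any class showing `<none>` or `false` in the EXPANSION column will need the data-transfer approach described later instead.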

Percona Operator for MongoDB/Percona Operator for MySQL

Under the hood of Percona Operator for MongoDB and Percona Operator for MySQL based on Percona XtraDB Cluster, StatefulSets are used, so the volume expansion task for the Operator comes down to resizing the corresponding StatefulSet persistent volumes.

Resizing a PV by changing the custom resource or StatefulSet (does not work)

Only a limited set of StatefulSet fields can be modified after creation. Altering the storage size in the Operator custom resource or the StatefulSet leads to the error below:

StatefulSet.apps \"my-cluster-name-rs0\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 
'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden"

Expansion storage by modifying persistent volume claim (PVC)

1. Change PVC size: 

kubectl patch pvc <pvc-name>  -n <pvc-namespace> -p '{ "spec": { "resources": { "requests": { "storage": "NEW STORAGE VALUE" }}}}'

2. After the process is finished, you can see the below message in the PVC description. 

kubectl describe pvc <pvc-name>
Normal  FileSystemResizeSuccessful  3s    kubelet  MountVolume.NodeExpandVolume succeeded for volume "pvc-7ed0ba5c-cc79-42d4-a4b3-xxxxxxxxxxxx"

If you see the following event in the PVC description (and it does not change), you need to restart the Pods to finish resizing (see further notes).

FileSystemResizePending   True    Mon, 01 Jan 0001 00:00:00 +0000   Thu, 23 Jun 2022 19:24:50 +0200 Waiting for user to (re-)start a pod to finish 
file system resize of volume on node.

3. Once the PVC size has been changed, you can verify that the related objects have changed as well.

kubectl get pvc

kubectl get pv

kubectl exec <pod-name> -- lsblk

4. Update storage size in corresponding operator custom resource (Percona Operator for MongoDB or Percona Operator for MySQL based on Percona XtraDB Cluster).

So now we have the scaled storage but the old storage values in the corresponding StatefulSets. As we saw above, we can apply the custom resource with new storage values, but the storage size field can’t be changed on StatefulSet objects. That is why we need to recreate the StatefulSet with the new values. To avoid downtime, we delete the StatefulSet without deleting the Pods (--cascade=orphan flag).

kubectl delete sts <statefulset-name> --cascade=orphan

5. Apply cr.yml.

kubectl apply -f deploy/cr.yml

6. Delete StatefulSet pods one by one (optional).

According to the Kubernetes official documentation:

If your storage provider supports online expansion then no Pod restart should be necessary for volume expansion to finish.

Percona Operator for PostgreSQL (pg operator)

In contrast to Percona Operator for MongoDB and Percona Operator for MySQL based on Percona XtraDB Cluster, which use a partially modifiable StatefulSet object, our Percona Operator for PostgreSQL uses Deployments instead of StatefulSets for cluster objects. A Deployment is a mutable object and can be changed not only at cluster start but also on a running cluster. That is why a change of storage size in the custom resource is applied directly to the running cluster. Changes in the custom resource not only resize the volume but also initiate a Pod restart.

Moreover, in order to maintain a safe cluster configuration, the Operator keeps the primary and replica volume sizes exactly equal. Hence, if the size of the primary is changed, the size of the replicas is changed too, and both primary and replica Pods are restarted. If you instead modify the replica size first, only the replica PVCs are expanded and the replica Pods are recreated. If you then increase the primary size to the same value as the replicas, only the primary PVC is scaled (the replica PVCs already have the necessary value), but both replica and primary Pods are restarted.

So you can end up in a situation where both primary and replica PVCs have changed size even though only the storage size for the primary was increased in the custom resource.

You need to be careful and keep your custom resource file up-to-date. 

Resizing persistent volume claim (PVC) by transferring data to new PVC

Volume expansion is a great Kubernetes feature that provides an easy and effective way to resize a PVC. However, some storage classes do not support PVC volume expansion. For these classes, you must create a new PVC and move the content to it. You can do this with the following steps:

1. Configure PVC size in custom resource for the corresponding operator (Percona Operator for MongoDB or Percona Operator for MySQL based on Percona XtraDB Cluster) and apply it. 

kubectl apply -f deploy/cr.yaml

2. Delete StatefulSet. Like in the previous part, we need to recreate StatefulSet in order to apply storage changes. 

kubectl delete sts <statefulset-name> --cascade=orphan

As a result, the Pods stay up, and the Operator recreates the StatefulSet with the new volume size.

3. Scale up the cluster (optional).

Changing the storage size requires terminating the Pods, which can induce performance degradation. To preserve performance during the operation, we can scale up the cluster using these instructions: Percona Operator for MongoDB or Percona Operator for MySQL based on Percona XtraDB Cluster. As we have already changed the StatefulSet, the new Pods will be provisioned with increased volumes.

4. Reprovision the Pods one by one to change the storage.

This is the step where underlying storage is going to be changed for the database Pods.

Delete the PVC of the Pod that you are going to reprovision.

kubectl delete pvc <pvc-name>

The PVC will not be deleted right away as there is a Pod using it. To proceed, delete the Pod:

kubectl delete pod <pod-name>

The Pod will be deleted along with its PVC. The StatefulSet controller will notice that the Pod is gone and will recreate it along with a new, expanded PVC.

Check PVC size:

kubectl get pvc

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mongod-data-minimal-cluster-rs1-1   Bound    pvc-e9b494fb-f201-44b9-9493-ea6faa903ddc   4Gi        RWO            standard       3m3s
mongod-data-minimal-cluster-rs1-2   Bound    pvc-6236c7e1-9670-49a8-9928-0dd24708588c   3Gi        RWO            standard       144m

The CAPACITY column shows that the first PVC has been increased.

Once the Pod is up, the data is synced from the other nodes. The data transfer speed depends on the amount of data in the cluster and on cluster utilization; as a consequence, synchronization might take a while. Please wait until the node is fully up and running and the sync is finished, and only then proceed to the next Pod.

5. Scale down the cluster and clean up unnecessary PVCs.


We are constantly introducing new features in the Percona Kubernetes Operators. One of them is automated volume expansion, which will be implemented in a future release. We also plan to add the possibility to change the storage size directly in a custom resource, which would make it possible for the Private DBaaS feature in Percona Monitoring and Management (PMM) to detect volume consumption and automatically scale the cluster. If you have any suggestions on the topic and are willing to collaborate with us, please submit an issue to the Community Forum or JIRA.


Enabling and Enforcing SSL/TLS for PostgreSQL Connections

SSL/TLS for PostgreSQL Connections

Enabling SSL in PostgreSQL is very straightforward. In just three steps we can make sure the connections to it are more secure, using in-transit encryption via SSL/TLS:

  1. Make sure we have the server certificate and key files available
  2. Enable the SSL configuration (ssl = on)
  3. Make sure the pg_hba.conf file rules are updated accordingly

In this blog post, we are going to go through these steps, and we’ll also see how we can check and validate the connections are indeed using the safer SSL protocol.

What is SSL/TLS?

SSL (Secure Sockets Layer) is an encryption protocol designed to make network communications between two nodes secure. Without some form of network encryption, any third party that can examine network packets has access to the data sent between client and server (in this case, the PostgreSQL data, which means users, passwords, and even SQL statements). TLS (Transport Layer Security) is the more modern iteration of the protocol, and even though SSL itself is deprecated, the name is still commonly used. For all intents and purposes, we use the two terms as aliases in this blog.

The PostgreSQL documentation pages offer us some more insight in this respect. If needed, consult the Secure TCP/IP Connections with SSL and SSL Support entries for more information.

Trying to enable SSL without Cert/Key Files

Let’s now see what happens when we try to enable SSL without having the needed certificate and key files in place:

postgres=# alter system set ssl=on;
postgres=# select pg_reload_conf();
(1 row)

We don’t see any errors, but are we really using SSL? If we check the error log, we’ll indeed see the errors:

2022-06-23 20:43:54.713 UTC [5284] LOG:  received SIGHUP, reloading configuration files
2022-06-23 20:43:54.714 UTC [5284] LOG:  parameter "ssl" changed to "on"
2022-06-23 20:43:54.715 UTC [5284] LOG:  could not load server certificate file "server.crt": No such file or directory
2022-06-23 20:43:54.715 UTC [5284] LOG:  SSL configuration was not reloaded

Creating certificates

So, we first need to create the aforementioned files. If you don’t already have valid certificate and key files, a quick one-liner for this is the following openssl command (it’s not the focus here to delve too much into this part of the process):

[root@node0 ~]# cd /var/lib/pgsql/14/data
[root@node0 data]# openssl req -nodes -new -x509 -keyout server.key -out server.crt -subj '/C=US/L=NYC/O=Percona/CN=postgres'
Generating a 2048 bit RSA private key
writing new private key to 'server.key'

We have changed the current working directory to the PostgreSQL data directory since we were in a RHEL-based system. If you are on a Debian-based one, you should store the files in /etc/ssl/certs/ and /etc/ssl/private/ or define/check ssl_cert_file and ssl_key_file PostgreSQL configuration variables, respectively. Also, make sure the postgres user owns them, and they are only readable to it:

[root@node0 data]# chmod 400 server.{crt,key}
[root@node0 data]# chown postgres:postgres server.{crt,key}
[root@node0 data]# ll server.{crt,key}
-r--------. 1 postgres postgres 1212 Jun 23 20:49 server.crt
-r--------. 1 postgres postgres 1704 Jun 23 20:49 server.key

Enabling SSL/TLS

Now we can enable SSL and reload the configuration again; this time with no errors shown:

postgres=# alter system set ssl=on;
postgres=# select pg_reload_conf();
(1 row)

2022-06-23 20:52:05.823 UTC [5284] LOG:  received SIGHUP, reloading configuration files
2022-06-23 20:52:05.823 UTC [5284] LOG:  parameter "ssl" changed to "on"
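To double-check that the setting is now active and which files are being used, we can query pg_settings (or simply run SHOW ssl):

```sql
-- Confirm SSL is on and which certificate/key files are in use
SELECT name, setting
FROM pg_settings
WHERE name IN ('ssl', 'ssl_cert_file', 'ssl_key_file');
```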

So far, we have enabled SSL, but unless we modify the pg_hba.conf file these settings won’t apply to any users (at least not in a forceful manner). This is the first step that can give us a false sense of security, so let’s go ahead and see how to fix it.

Enforcing SSL/TLS

As mentioned, the pg_hba.conf file is where we can tune which connections are going to be required to use SSL. We can instruct PostgreSQL to enforce this by using the “hostssl” keyword instead of the plain “host” one. Note that you can see some connections starting to use SSL at this point because the plain “host” keyword will allow for connections that want to use SSL to use it. However, this is not enforcing SSL to be used (i.e.: if the client doesn’t want to use SSL, PostgreSQL will not deny the connection).

Let’s imagine this is the pg_hba.conf file we have been using so far:

local   all             all                       peer
host    all             all       scram-sha-256
host    all             all    ::1/128            scram-sha-256
host    all             all          md5
host    replication     all   md5

And we want to enforce SSL connections from all remote users (and also include remote replication connections):

local    all            all                        peer
host     all            all       scram-sha-256
host     all            all     ::1/128            scram-sha-256
hostssl  all            all          md5
hostssl  replication    all   md5

Again, this is not enough if we are adamant about really enforcing connections to use SSL. We have to call pg_reload_conf() once more to make sure they are loaded into PostgreSQL itself:

postgres=# select pg_reload_conf();
(1 row)

At this point, new remote non-SSL connections will be denied:

[root@node1 ~]# psql "host= sslmode=disable"
psql: error: connection to server at "", port 5432 failed: FATAL:  no pg_hba.conf entry for host "", user "postgres", database "postgres", no encryption

So, can we finally say we are fully secure now? No, not yet! Connections that were already established are not forced to use SSL until they reconnect.

Checking for connections using SSL/TLS

We can check for connections using SSL with the following query:

postgres=# select pg_ssl.pid, pg_ssl.ssl, pg_ssl.version,
           pg_sa.backend_type, pg_sa.usename, pg_sa.client_addr
           from pg_stat_ssl pg_ssl
           join pg_stat_activity pg_sa
             on pg_ssl.pid = pg_sa.pid;
 pid  | ssl | version |  backend_type  | usename  |  client_addr
------+-----+---------+----------------+----------+--------------
 5547 | f   |         | walsender      | postgres |
 5549 | f   |         | client backend | postgres |
 5556 | f   |         | client backend | postgres |
(3 rows)

In this case, the replication connection (walsender) is not yet using SSL, and neither are the two other connected clients, so we need to force them to reconnect. As always, we recommend that you try all these steps in a testing environment first, and that when it’s time to do it in production you do them in a properly established maintenance window (no matter how trivial the steps seem to be).

To force the replication connection to use SSL, one can either restart the service on the replica or use pg_terminate_backend (which sends the SIGTERM signal to the process and is safe to use in this context). In this case, we are using pg_terminate_backend on the primary itself, but it can also be used on the replica, provided we use the correct PID.

postgres=# select pg_terminate_backend(5547);
(1 row)

After that, we should see the new replica connection correctly using the SSL/TLS protocol:

postgres=# select pg_ssl.pid, pg_ssl.ssl, pg_ssl.version,
           pg_sa.backend_type, pg_sa.usename, pg_sa.client_addr
           from pg_stat_ssl pg_ssl
           join pg_stat_activity pg_sa
             on pg_ssl.pid = pg_sa.pid;
 pid  | ssl | version |  backend_type  | usename  |  client_addr
------+-----+---------+----------------+----------+--------------
 5557 | t   | TLSv1.2 | walsender      | postgres |
 5549 | f   |         | client backend | postgres |
 5556 | f   |         | client backend | postgres |
(3 rows)

PID 5549 is our own connection, so that’s an easy fix:

postgres=# select pg_backend_pid();
(1 row)

The connection from 5556 would be the remaining one to check if we need to enforce SSL on all connections. On the client side, we can use \conninfo to check information about our current connection:

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" on host "" at port "5432".
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
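If you want to clear out every remaining non-SSL client session in one go, instead of terminating PIDs one by one, a query along these lines works. This is a sketch; review which backends it matches in your environment before running it:

```sql
-- Terminate all non-SSL client backends except our own session
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_ssl
JOIN pg_stat_activity USING (pid)
WHERE NOT ssl
  AND backend_type = 'client backend'
  AND pid <> pg_backend_pid();
```

The backend_type filter keeps background workers and replication connections out of the result; handle the walsender separately as shown above.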

Disabling SSL/TLS

If you want to disable SSL instead, be sure not to lose your client connection after you set ssl=off and make changes to the pg_hba.conf file; otherwise, you may be locked out if you don’t have any accounts using the plain “host” access method, and your only way out would be to restart the service. To be safe, first edit and reload the pg_hba.conf file to include entries with “host”, and only then fully disable SSL (ssl=off).


Enabling SSL/TLS for in-transit connection encryption is easy, but there are some pitfalls to be aware of when it comes to enforcing its usage. Simply enabling the configuration for it is not enough for it to be enforced, even if by default some connections may prefer using SSL when it’s available. If you need to ensure that all connections use SSL, edit the pg_hba.conf file accordingly and make sure it’s loaded. Remember that “hostssl” entries are the ones that force this behavior.

We can use tcpdump and Wireshark to check whether connections are indeed being encrypted. But that’s a topic for another blog…
