Aug
05
2022
--

An Illustration of PostgreSQL Bloat

An Illustration of PostgreSQL Bloat

I have been working with Postgres now for many years, most recently as a consultant at Percona for companies with new implementations of Postgres as a result of migrating from Oracle or some other legacy database engine. In some cases, these are Fortune 100 companies with many talented people working for them. However, not all databases work the same and one of the most common observations I make when reviewing a Postgres environment for a client is the amount of table bloat, index bloat, and lack of understanding of its impact on performance – and how to address it.

I wrote a blog about this topic a few years ago and never gave it much thought after its publication. However, with the large number of companies moving to Postgres for obvious reasons, and the lack of true Postgres database management skills needed to support fairly large databases, I thought I would rewrite this blog and bring it back to life with some clarity to help one understand bloat and why it happens.

What causes bloat?

In PostgreSQL, the culprit is Multi-Version Concurrency Control, commonly referred to as MVCC.

MVCC ensures that a transaction against a database will return only data that’s been committed, in a snapshot, even if other processes are trying to modify that data.

Imagine a database with millions of rows in a table. Anytime you update or delete a row, Postgres has to keep track of that row based on a transaction ID. For example, you may be running a long query with transaction ID 100 while someone named John just updated the same table using transaction ID 101. At this point, since you are still inside of transaction 100, which is older than 101 as far as your query is concerned, the changes made by John in transaction ID 101 are not relevant or visible to your query. You are in your own personal bubble of data back before the data has changed. Any new queries from you or anyone else with a transaction ID greater than 101 will see the changes made by John in transaction 101. After all, new transaction IDs are greater than 101, meaning no other transactions are currently in play with IDs less than 101 the data you saw in transaction ID 100 will no longer be needed by the database and will be considered dead but not gone. Hence, bloat!

At a high level, vacuuming is used to free up dead rows in a table so they can be reused. It also helps you avoid transaction ID wraparound.

Let’s go through a few steps to illustrate how all this takes place

In order for Postgres to know which transaction data should be in the result set of your query, the snapshot makes note of transaction information.

Essentially, if your transaction ID is 100, you will only see data from all transaction IDs leading up to 100. As stated above, you will not see data from transaction ID 101 or greater.

Setting up an example

Let’s start by creating a simple table for our example, called percona:

percona=# CREATE TABLE percona ( col1 int );
CREATE TABLE


percona=# INSERT INTO percona values (1);
INSERT 0 1


percona=# INSERT INTO percona values (2);
INSERT 0 1


percona=# INSERT INTO percona values (3);
INSERT 0 1


percona=# INSERT INTO percona values (4);
INSERT 0 1


percona=# INSERT INTO percona values (5);
INSERT 0 1

You can wrap multiple inserts into a single transaction with BEGIN and COMMIT:

percona=# BEGIN;
BEGIN


percona=*# INSERT INTO percona SELECT generate_series(6,10);
INSERT 0 5

percona=*# COMMIT;
COMMIT

Here we can see the 10 rows we inserted into the table, along with some hidden system columns:

percona=# SELECT xmin, xmax, * FROM percona;
   xmin   | xmax | col1
----------+------+------
 69099597 |    0 |    1
 69099609 |    0 |    2
 69099627 |    0 |    3
 69099655 |    0 |    4
 69099662 |    0 |    5
 69099778 |    0 |    6
 69099778 |    0 |    7
 69099778 |    0 |    8
 69099778 |    0 |    9
 69099778 |    0 |   10
(10 rows)

As you can see, values one through five (in the col1 column) have unique transaction IDs (represented in the xmin column)—they were the result of individual INSERT statements, made one after the other. The rows with values of six through 10 share the same transaction ID of 6909978; they were all part of the one transaction we created with the BEGIN and COMMIT statements.

At this point, you may be asking yourself what this has to do with vacuum or autovacuum. We will get there. First, you need to know about the transaction id logic and visually see it for a better understanding as shown above.

How does the table bloat?

In Postgres, the heap is a file containing a list of variable-sized records, in no particular order, that points to the location of a row within a page. (A Postgres page is 8k in size). The pointer to the location is called the CTID.

To view the heap without having to read the raw data from the file, we need to create the following extension inside of our database:

CREATE extension pageinspect;

Now we can inspect the heap for our newly created table and rows:

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));


 lp | lp_off | lp_flags | lp_len |  t_xmin  | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
----+--------+----------+--------+----------+--------+----------+--------+-------------+------------+--------+--------+-------+------------
  1 |   8160 |        1 |     28 | 69099597 |      0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |      0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |      0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |      0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |   8032 |        1 |     28 | 69099662 |      0 |        0 | (0,5)  |           1 |       2304 |     24 |        |       | \x05000000
  6 |   8000 |        1 |     28 | 69099778 |      0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   7968 |        1 |     28 | 69099778 |      0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7936 |        1 |     28 | 69099778 |      0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7904 |        1 |     28 | 69099778 |      0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7872 |        1 |     28 | 69099778 |      0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
(10 rows)

The table above shows 10 entries with a few columns:

  • lp is the ID of the row/tuple
  • t_xmin is the transaction ID
  • t_ctid is the pointer
  • t_data is the actual data

Currently, the pointer for each row is pointing to itself as determined by the form (page,tupleid). Pretty straightforward.

Now, let’s perform a few updates on a specific row. Let’s change the value of five to 20, then to 30, and finally back to five.

percona=# UPDATE percona SET col1 = 20 WHERE col1 = 5;
UPDATE 1


percona=# UPDATE percona SET col1 = 30 WHERE col1 = 20;
UPDATE 1


percona=# UPDATE percona SET col1 = 5 WHERE col1 = 30;
UPDATE 1

These three changes took place under three different transactions.

What does this mean?  We changed the values for a column three times but never added or deleted any rows. So we should still have 10 rows, right?

percona=# SELECT COUNT(*) FROM percona;
 count
-------
    10
(1 row)

Looks as expected. But wait! Let’s look at the heap now. The real data on disk.

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));


 lp | lp_off | lp_flags | lp_len |  t_xmin  |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
----+--------+----------+--------+----------+----------+----------+--------+-------------+------------+--------+--------+-------+------------
  1 |   8160 |        1 |     28 | 69099597 |        0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |        0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |        0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |        0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |   8032 |        1 |     28 | 69099662 | 69103876 |        0 | (0,11) |       16385 |       1280 |     24 |        |       | \x05000000
  6 |   8000 |        1 |     28 | 69099778 |        0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   7968 |        1 |     28 | 69099778 |        0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7936 |        1 |     28 | 69099778 |        0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7904 |        1 |     28 | 69099778 |        0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7872 |        1 |     28 | 69099778 |        0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
 11 |   7840 |        1 |     28 | 69103876 | 69103916 |        0 | (0,12) |       49153 |       9472 |     24 |        |       | \x14000000
 12 |   7808 |        1 |     28 | 69103916 | 69103962 |        0 | (0,13) |       49153 |       9472 |     24 |        |       | \x1e000000
 13 |   7776 |        1 |     28 | 69103962 |        0 |        0 | (0,13) |       32769 |      10496 |     24 |        |       | \x05000000
(13 rows)

We have 13 rows, not 10. What the heck just happened?

Let’s examine our three separate update transactions (69103876, 69103916, 69103962) to see what’s happening with the heap:

t_xmin (691103876)

  • UPDATE percona SET col1 = 20 WHERE col1 = 5;
  • Logically DELETE tuple ID 5
  • Physically INSERT tuple ID 11
  • UPDATE tuple ID 5 pointer (t_tcid) to point to tuple ID 11

Tuple ID 5 becomes a dead row when its t_xmax gets set to the new transaction ID initiated by transaction 691103876.

t_xmin (69103916)

  • UPDATE percona SET col1 = 30 WHERE col1 = 20;
  • Logically DELETE tuple ID 11
  • Physically INSERT tuple ID 12
  • UPDATE tuple ID 11 pointer (t_tcid) to point to tuple ID 12

Once again, Tuple ID 11 becomes a dead row when its t_xmax set to the new transaction ID initiated by transaction  69103916.

t_xmin (69103962)

  • UPDATE percona SET col1 = 5 WHERE col1 = 30;
  • Logically DELETE tuple ID 12
  • Physically INSERT tuple ID 13
  • UPDATE tuple ID 12 pointer (t_tcid) to point to tuple ID 13

Tuple ID 13 is live and visible to other transactions. It has no t_xmax and the t_ctid (0,13) points to itself.

The key takeaway from this is that we have not added or deleted rows in our table. We still see 10 in the count, but our heap has increased to 13 by an additional three transactions being executed.

At a very high level, this is how PostgreSQL implements MVCC and why we have table bloat in our heap. In essence, changes to data results in a new row reflecting the latest state of the data. The old rows need to be cleaned or reused for efficiency.

Vacuuming the table

The way to deal with the table bloat is to vacuum the table:

percona=# vacuum percona;
VACUUM

Now, let’s examine the heap again:

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));

lp | lp_off | lp_flags | lp_len |  t_xmin  | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
----+--------+----------+--------+----------+--------+----------+--------+-------------+------------+--------+--------+-------+------------
  1 |   8160 |        1 |     28 | 69099597 |      0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |      0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |      0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |      0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |     13 |        2 |      0 |          |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     28 | 69099778 |      0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   8000 |        1 |     28 | 69099778 |      0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7968 |        1 |     28 | 69099778 |      0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7936 |        1 |     28 | 69099778 |      0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7904 |        1 |     28 | 69099778 |      0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
 11 |      0 |        0 |      0 |          |        |          |        |             |            |        |        |       |
 12 |      0 |        0 |      0 |          |        |          |        |             |            |        |        |       |
 13 |   7872 |        1 |     28 | 69103962 |      0 |        0 | (0,13) |       32769 |      10496 |     24 |        |       | \x05000000
(13 rows)

After vacuuming the table, rows five, 11, and 12 are now free to be used again.

So let’s insert another row, with the value of 11 and see what happens:

percona=# INSERT INTO percona values (11);
INSERT 0 1

Let’s examine the heap once more:

percona=# SELECT * FROM heap_page_items(get_raw_page('percona',0));

 lp | lp_off | lp_flags | lp_len |  t_xmin  | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |   t_data
----+--------+----------+--------+----------+--------+----------+--------+-------------+------------+--------+--------+-------+------------
  1 |   8160 |        1 |     28 | 69099597 |      0 |        0 | (0,1)  |           1 |       2304 |     24 |        |       | \x01000000
  2 |   8128 |        1 |     28 | 69099609 |      0 |        0 | (0,2)  |           1 |       2304 |     24 |        |       | \x02000000
  3 |   8096 |        1 |     28 | 69099627 |      0 |        0 | (0,3)  |           1 |       2304 |     24 |        |       | \x03000000
  4 |   8064 |        1 |     28 | 69099655 |      0 |        0 | (0,4)  |           1 |       2304 |     24 |        |       | \x04000000
  5 |     13 |        2 |      0 |          |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     28 | 69099778 |      0 |        0 | (0,6)  |           1 |       2304 |     24 |        |       | \x06000000
  7 |   8000 |        1 |     28 | 69099778 |      0 |        0 | (0,7)  |           1 |       2304 |     24 |        |       | \x07000000
  8 |   7968 |        1 |     28 | 69099778 |      0 |        0 | (0,8)  |           1 |       2304 |     24 |        |       | \x08000000
  9 |   7936 |        1 |     28 | 69099778 |      0 |        0 | (0,9)  |           1 |       2304 |     24 |        |       | \x09000000
 10 |   7904 |        1 |     28 | 69099778 |      0 |        0 | (0,10) |           1 |       2304 |     24 |        |       | \x0a000000
 11 |   7840 |        1 |     28 | 69750201 |      0 |        0 | (0,11) |           1 |       2048 |     24 |        |       | \x0b000000
 12 |      0 |        0 |      0 |          |        |          |        |             |            |        |        |       |
 13 |   7872 |        1 |     28 | 69103962 |      0 |        0 | (0,13) |       32769 |      10496 |     24 |        |       | \x05000000
(13 rows)

Our new tuple (with transaction ID 69750201) reused tuple 11, and now the tuple 11 pointer (0,11) is pointing to itself.

As you can see, the heap did not grow to accommodate the new row. It reused an open block for the new row that was made available when we vacuumed the table freeing up dead rows (rows that will no longer be visible in a transaction).

And there you have it. A step-by-step illustration of how bloat occurs in PostgreSQL!

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com