Apr 19, 2022

PostgreSQL 14 B-Tree Index: Reduced Bloat with Bottom-Up Deletion


Concurrent access to data within PostgreSQL is managed with the Multiversion Concurrency Control (MVCC) model. A data snapshot is maintained for each SQL statement so that it always sees consistent data, even if other transactions are modifying that data concurrently. This leads to managing multiple versions of the same row when it has been modified by one or more transactions. From a user perspective there might be only a single row of data, but internally PostgreSQL may be maintaining one or more versions of it.

Whether a row version is visible to a transaction is maintained with the row data in the heap. To optimize the fetching of visibility information, PostgreSQL also maintains a “_vm” relation fork (the visibility map) that keeps track of which pages contain only tuples known to be visible to all active transactions.
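
As a side note, the “pg_visibility” contrib module lets you read these visibility map bits directly. A minimal sketch, assuming a table named “foo” like the one created later in this post:

CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- Visibility map bits for the first page (block 0) of table "foo":
-- all_visible means every tuple on the page is visible to all transactions.
SELECT all_visible, all_frozen FROM pg_visibility_map('foo', 0);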

Dead versions that are no longer visible to any transaction are cleaned up by the vacuum process. Until that happens, the index and heap pages may contain a significant number of dead tuples; how many really depends on the nature of your workload, and for a very update-intensive one it can be a huge number!
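
To get a feel for how many dead tuples a workload is generating, the per-table statistics views expose counters for this. A quick, approximate look (these are statistics-collector numbers, not exact counts):

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC;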

It may seem innocuous at first sight, but this accumulation of dead index tuples creates a cascading effect that leads to significant performance degradation. After the deduplication work in PostgreSQL 13, the next logical step was to prevent btree index expansion by reducing page splits.

Physical Data Storage

PostgreSQL maintains data in fixed-size storage units called pages. The size of a page is defined during the PostgreSQL server compilation process; the default is 8 kB, but it can be changed to a higher value. Changing the page size complicates things, though, as other tools may require recompilation or reconfiguration as well.
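
You can check the page size of a running server without consulting its build options; for example:

SHOW block_size;
-- or, equivalently:
SELECT current_setting('block_size');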

Each table and index is stored as an array of pages. When data is inserted into a table, it is written to a page with enough free space; otherwise, a new page is created.

Indexes, however, are a little different. The first page of an index is a meta page that contains control information about the index, and there can also be special pages that maintain index-related information. In the case of a btree index, the data must be sorted based on the index columns and the heap tuple ID (the physical location of the tuple within the table). Therefore, insertions and updates must happen on the correct page to maintain the sort order. If a page does not have enough space for the incoming tuple, a new page must be created and some items from the overflowing page moved to it. Parent pages of these leaf pages are split recursively if needed.
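
For example, the meta page can be examined with the “pageinspect” extension that is set up later in this post. A small sketch against the “foo_col1_idx” index created below:

-- Control information from the B-Tree meta page (block 0 of the index):
SELECT * FROM bt_metap('foo_col1_idx');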

Avoiding Page Splits

B-Tree index page splits occur when new tuples, or new non-HOT tuple versions, have to be added to the index. HOT is an abbreviation for “heap only tuple”. In basic terms, it is a way of removing dead rows on a given heap page (defragmentation) and thereby making space for new rows without creating new index entries. By avoiding or delaying page splits, we can avoid or slow down index expansion and therefore reduce the bloat. Now that is exciting!
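
As an aside, you can see how often a table benefits from HOT through the per-table statistics counters. A minimal sketch, using the “foo” table created later in this post:

-- n_tup_hot_upd counts updates that did not require new index entries.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM   pg_stat_user_tables
WHERE  relname = 'foo';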

While there isn’t much that can be done for brand-new tuples, updates can be managed such that obsolete versions of logically unchanged index tuples (i.e., tuples whose indexed columns are unchanged) are incrementally removed to maintain free space for the new version. This process is aided by the planner, which provides an “index unchanged” hint to the index method; the hint is true if none of the index columns are changed as a result of the update.

Bottom-up deletion is performed during an index operation when a “version churn page split” is anticipated (i.e., the “index unchanged” hint is true). Obsolete versions of logically unchanged index tuples are removed, making space on the page for the newer version and potentially avoiding a page split.

Bottom-Up Delete in Action

To see the actual benefit of this approach, let us dig a little deeper into the B-Tree index and compare btree index sizes between PostgreSQL versions 13 and 14. For a more detailed inspection of the index data I’ll be using the “pageinspect” extension, available in the contrib modules, which allows us to see the low-level page contents of indexes or tables.

Let’s start by creating the pageinspect extension. You may need to install the contrib modules or, if you are building from source, run make install for them before proceeding.

CREATE EXTENSION IF NOT EXISTS pageinspect;

 

Let’s now create a table “foo” with two columns, create two indexes (one of them a covering index), and analyze the table as well.

DROP TABLE IF EXISTS foo;
CREATE TABLE foo WITH (autovacuum_enabled = false) AS
  (SELECT GENERATE_SERIES(1, 1000) AS col1,
          SUBSTR(MD5(RANDOM()::TEXT), 0, 25) AS value);
CREATE INDEX ON foo(col1);
CREATE INDEX ON foo(col1) INCLUDE(value);
ANALYZE foo;

 

It is time to inspect the number of pages and tuples, and the relation sizes, for the “foo” table.

SELECT  relname
        , relkind
        , relpages
        , reltuples
        , PG_SIZE_PRETTY(PG_RELATION_SIZE(oid))
FROM    pg_class
WHERE   relname LIKE '%foo%'
ORDER
BY      relkind DESC;

      relname       | relkind | relpages | reltuples | pg_size_pretty 
--------------------+---------+----------+-----------+----------------
 foo                | r       |        8 |      1000 | 64 kB
 foo_col1_idx       | i       |        5 |      1000 | 40 kB
 foo_col1_value_idx | i       |        9 |      1000 | 72 kB
(3 rows)

 

Both 14.1 and 13.5 give the exact same output for the above query.

Disable sequential and bitmap scans so that the queries in this example are forced to use index scans.

SET enable_seqscan = false;
SET enable_bitmapscan = false;

 

Create four new versions of each tuple by running four full-table updates.

UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';

 

Each of the above queries updates 1,000 rows. “ANALYZE” the table to ensure our statistics are accurate, and then review the number of pages, tuples, and relation sizes for the “foo” table again.

ANALYZE foo;

SELECT  relname
        , relkind
        , relpages
        , reltuples
        , PG_SIZE_PRETTY(PG_RELATION_SIZE(oid))
FROM    pg_class
WHERE   relname LIKE '%foo%'
ORDER
BY      relkind DESC;

--PostgreSQL 14.1
      relname       | relkind | relpages | reltuples | pg_size_pretty 
--------------------+---------+----------+-----------+----------------
 foo                | r       |        8 |      1000 | 288 kB
 foo_col1_idx       | i       |        5 |      1000 | 88 kB
 foo_col1_value_idx | i       |        9 |      1000 | 216 kB
(3 rows)


--PostgreSQL 13.5
      relname       | relkind | relpages | reltuples | pg_size_pretty 
--------------------+---------+----------+-----------+----------------
 foo                | r       |        8 |      1000 | 288 kB
 foo_col1_idx       | i       |        5 |      1000 | 104 kB
 foo_col1_value_idx | i       |        9 |      1000 | 360 kB
(3 rows)

 

The table size has increased by the same amount in both versions; however, the indexes in 14.1 are significantly smaller than in 13.5. Great, but just to be sure, let’s inspect the page contents to understand what has happened behind the scenes.

Reviewing the contents of the first index page (not the meta page) clearly shows how bottom-up deletion keeps index sizes small.

 

SELECT  itemoffset
        , ctid
        , itemlen
        , nulls
        , vars
        , dead
        , htid
FROM    bt_page_items('foo_col1_value_idx', 1)
LIMIT   15;

--PostgreSQL 14.1
 itemoffset |  ctid   | itemlen | nulls | vars | dead |  htid   
------------+---------+---------+-------+------+------+---------
          1 | (7,1)   |      16 | f     | f    |      | 
          2 | (7,181) |      40 | f     | t    | f    | (7,181)
          3 | (7,225) |      48 | f     | t    | f    | (7,225)
          4 | (7,182) |      40 | f     | t    | f    | (7,182)
          5 | (7,226) |      48 | f     | t    | f    | (7,226)
          6 | (7,183) |      40 | f     | t    | f    | (7,183)
          7 | (7,227) |      48 | f     | t    | f    | (7,227)
          8 | (7,184) |      40 | f     | t    | f    | (7,184)
          9 | (7,228) |      48 | f     | t    | f    | (7,228)
         10 | (7,185) |      40 | f     | t    | f    | (7,185)
         11 | (7,229) |      48 | f     | t    | f    | (7,229)
         12 | (7,186) |      40 | f     | t    | f    | (7,186)
         13 | (7,230) |      48 | f     | t    | f    | (7,230)
         14 | (7,187) |      40 | f     | t    | f    | (7,187)
         15 | (7,231) |      48 | f     | t    | f    | (7,231)
(15 rows)


--PostgreSQL 13.5
 itemoffset |  ctid   | itemlen | nulls | vars | dead |  htid   
------------+---------+---------+-------+------+------+---------
          1 | (0,1)   |      16 | f     | f    |      | 
          2 | (0,1)   |      40 | f     | t    | f    | (0,1)
          3 | (7,49)  |      40 | f     | t    | f    | (7,49)
          4 | (7,137) |      40 | f     | t    | f    | (7,137)
          5 | (7,181) |      40 | f     | t    | f    | (7,181)
          6 | (7,225) |      48 | f     | t    | f    | (7,225)
          7 | (0,2)   |      40 | f     | t    | f    | (0,2)
          8 | (7,50)  |      40 | f     | t    | f    | (7,50)
          9 | (7,138) |      40 | f     | t    | f    | (7,138)
         10 | (7,182) |      40 | f     | t    | f    | (7,182)
         11 | (7,226) |      48 | f     | t    | f    | (7,226)
         12 | (0,3)   |      40 | f     | t    | f    | (0,3)
         13 | (7,51)  |      40 | f     | t    | f    | (7,51)
         14 | (7,139) |      40 | f     | t    | f    | (7,139)
         15 | (7,183) |      40 | f     | t    | f    | (7,183)
(15 rows)

Looking at itemoffsets 2 to 3 for 14.1, and 2 to 6 for 13.5, tells us the entire story. 13.5 is carrying the entire set of tuple versions, whereas 14.1 has cleaned out the dead tuples to make room. With fewer versions there are fewer pages, resulting in less bloat and a smaller index size.
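
If you would rather see per-page summary numbers than individual items, pageinspect also provides “bt_page_stats”. A quick sketch for the same leaf page:

SELECT blkno, live_items, dead_items, avg_item_size, free_size
FROM   bt_page_stats('foo_col1_value_idx', 1);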

Conclusion

The reduction in index size due to bottom-up deletion is a huge plus in PostgreSQL version 14. Btree indexes also have a mechanism whereby plain index scans set the LP_DEAD flag (bitmap index scans do not set it); once it is set, the space can be reclaimed without waiting for a vacuum, but that targets a completely different class of dead tuples. In the long run, the bottom-up deletion strategy helps significantly reduce a specific class of duplicates: obsolete versions of logically unchanged index tuples. It not only reduces the load on vacuum but also keeps indexes healthier, leading to faster access. So if you have an update-heavy workload, there will definitely be savings in resource utilization and costs, along with better performance.

Dec 19, 2014

Store UUID in an optimized way

A few years ago Peter Zaitsev, in a post titled “To UUID or not to UUID,” wrote: “There is timestamp based part in UUID which has similar properties to auto_increment and which could be used to have values generated at same point in time physically local in BTREE index.”

For this post I’ve rearranged the timestamp part of the UUID (Universally Unique Identifier) and run some benchmarks.

Many people store UUIDs as CHAR(36) and use them as row identity values (PRIMARY KEY) because they are unique across every table, every database, and every server, and allow easy merging of records from different databases. But here comes the catch: using a UUID as a PRIMARY KEY causes the problems described below.

Problems with UUID

  • A UUID has 36 characters, which makes it bulky.
  • InnoDB stores data in PRIMARY KEY order, and all secondary keys also contain the PRIMARY KEY. So having a UUID as the PRIMARY KEY makes the indexes bigger, and they may no longer fit in memory.
  • Inserts are random, and the data is scattered.

Despite these problems, people still prefer UUIDs because they are unique across every table and can be generated anywhere. In this blog, I will explain how to store a UUID in an efficient way by rearranging its timestamp part.

Structure of UUID

MySQL uses UUID version 1, which is a 128-bit number represented as a utf8 string of five hexadecimal numbers:

  • The first three numbers are generated from a timestamp.
  • The fourth number preserves temporal uniqueness in case the timestamp value loses monotonicity (for example, due to daylight saving time).
  • The fifth number is an IEEE 802 node number that provides spatial uniqueness. A random number is substituted if the latter is not available (for example, because the host computer has no Ethernet card, or because we do not know how to find the hardware address of an interface on the host operating system). In this case, spatial uniqueness cannot be guaranteed. Nevertheless, a collision should have very low probability.

The timestamp is mapped as follows: when the timestamp has the (60-bit) hexadecimal value 1d8eebc58e0a7d7, the parts of the UUID are set to 58e0a7d7-eebc-11d8-9669-0800200c9a66. The 1 before the most significant digits of the timestamp (in 11d8) indicates the UUID version; for time-based UUIDs this is 1.

The fourth and fifth parts are mostly constant when the UUIDs are generated from a single server, while the first three numbers are based on the timestamp and so will be monotonically increasing. Let’s rearrange the sequence to make the UUID closer to sequential; this makes inserts, and lookups of recent data, faster. The dashes (‘-’) carry no information, so let’s remove them as well:
58e0a7d7-eebc-11d8-9669-0800200c9a66 => 11d8eebc58e0a7d796690800200c9a66
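
To make the mapping concrete, here is the same rearrangement done with plain string functions; it is the same substring shuffle that the ordered_uuid function in the conclusions performs before UNHEX:

SET @uuid := '58e0a7d7-eebc-11d8-9669-0800200c9a66';
SELECT CONCAT(SUBSTR(@uuid, 15, 4), SUBSTR(@uuid, 10, 4),
              SUBSTR(@uuid, 1, 8), SUBSTR(@uuid, 20, 4),
              SUBSTR(@uuid, 25)) AS rearranged;
-- => 11d8eebc58e0a7d796690800200c9a66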

Benchmarking

I created three tables:

  • events_uuid – UUID binary(16) as PRIMARY KEY
  • events_int – an additional BIGINT auto-increment column as the PRIMARY KEY, with a secondary index on the UUID column
  • events_uuid_ordered – rearranged UUID binary(16) as PRIMARY KEY

I created three stored procedures, each of which inserts 25K random rows at a time into its respective table. Three more stored procedures call the random-insert procedures in a loop, and also record the time taken to insert each 25K batch and the data and index sizes after each loop. In total, I inserted 25M records.
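
The original procedures are not reproduced here, but a minimal sketch of one batch-insert procedure (the procedure name and column choices are mine, for illustration) might look like this:

DELIMITER //
CREATE PROCEDURE insert_uuid_ordered_batch(IN p_rows INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < p_rows DO
    -- ordered_uuid() is the rearranging function shown in the conclusions
    INSERT INTO events_uuid_ordered (id, unit_id)
    VALUES (ordered_uuid(uuid()), UNHEX(REPLACE(uuid(), '-', '')));
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

CALL insert_uuid_ordered_batch(25000);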

    • Data Size (chart: horizontal axis – number of inserts × 25,000; vertical axis – data size in MB). The data size for the UUID table is larger than for the other two tables.
    • Index Size (chart: horizontal axis – number of inserts × 25,000; vertical axis – index size in MB).
    • Total Size (chart: horizontal axis – number of inserts × 25,000; vertical axis – total size in MB).
    • Time Taken (chart: horizontal axis – number of inserts × 25,000; vertical axis – time taken in seconds).

For the table with the UUID as PRIMARY KEY, you can see that as the table grows, the time taken to insert rows increases almost linearly, whereas for the other tables it stays almost constant.

The UUID table is almost 50% bigger than the ordered-UUID table and 30% bigger than the table with a BIGINT PRIMARY KEY. Comparing the ordered-UUID table with the BIGINT table, the insert times and sizes are almost the same, though they may vary slightly depending on the index structure.

root@localhost:~# ls -lhtr /media/data/test/ | grep ibd
-rw-rw---- 1 mysql mysql  13G Jul 24 15:53 events_uuid_ordered.ibd
-rw-rw---- 1 mysql mysql  20G Jul 25 02:27 events_uuid.ibd
-rw-rw---- 1 mysql mysql  15G Jul 25 07:59 events_int.ibd

Table Structure

#1 events_int
CREATE TABLE `events_int` ( 
`count` bigint(20) NOT NULL AUTO_INCREMENT, 
`id` binary(16) NOT NULL, 
`unit_id` binary(16) DEFAULT NULL, 
`event` int(11) DEFAULT NULL, 
`ref_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, 
`campaign_id` binary(16) DEFAULT '', 
`unique_id` binary(16) DEFAULT NULL, 
`user_agent` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL, 
`city` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL, 
`country` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL, 
`demand_partner_id` binary(16) DEFAULT NULL, 
`publisher_id` binary(16) DEFAULT NULL, 
`site_id` binary(16) DEFAULT NULL, 
`page_id` binary(16) DEFAULT NULL, 
`action_at` datetime DEFAULT NULL, 
`impression` smallint(6) DEFAULT NULL, 
`click` smallint(6) DEFAULT NULL, 
`sold_impression` smallint(6) DEFAULT NULL, 
`price` decimal(15,7) DEFAULT '0.0000000', 
`actioned_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', 
`unique_ads` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, 
`notification_url` text COLLATE utf8_unicode_ci, 
PRIMARY KEY (`count`), 
KEY `id` (`id`), 
KEY `index_events_on_actioned_at` (`actioned_at`), 
KEY `index_events_unit_demand_partner` (`unit_id`,`demand_partner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
#2 events_uuid
CREATE TABLE `events_uuid` ( 
`id` binary(16) NOT NULL, 
`unit_id` binary(16) DEFAULT NULL,
~
~
PRIMARY KEY (`id`), 
KEY `index_events_on_actioned_at` (`actioned_at`), 
KEY `index_events_unit_demand_partner` (`unit_id`,`demand_partner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
#3 events_uuid_ordered
CREATE TABLE `events_uuid_ordered` (  
`id` binary(16) NOT NULL,  
`unit_id` binary(16) DEFAULT NULL,
~
~
PRIMARY KEY (`id`),  
KEY `index_events_on_actioned_at` (`actioned_at`),  
KEY `index_events_unit_demand_partner` (`unit_id`,`demand_partner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Conclusions

 

    • Create a function to rearrange the UUID fields and use it:
DELIMITER //
CREATE DEFINER=`root`@`localhost` FUNCTION `ordered_uuid`(uuid BINARY(36))
RETURNS binary(16) DETERMINISTIC
RETURN UNHEX(CONCAT(SUBSTR(uuid, 15, 4),SUBSTR(uuid, 10, 4),SUBSTR(uuid, 1, 8),SUBSTR(uuid, 20, 4),SUBSTR(uuid, 25)));
//
DELIMITER ;

Inserts

INSERT INTO events_uuid_ordered VALUES (ordered_uuid(uuid()),'1','M',....);

Selects

SELECT HEX(id), is_active, ... FROM events_uuid_ordered;
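
If you ever need the original textual UUID back, the shuffle is reversible. A sketch of the inverse mapping (not from the original post; the column is the binary(16) “id” from the DDL above):

SELECT LOWER(CONCAT_WS('-',
         SUBSTR(HEX(id), 9, 8),
         SUBSTR(HEX(id), 5, 4),
         SUBSTR(HEX(id), 1, 4),
         SUBSTR(HEX(id), 17, 4),
         SUBSTR(HEX(id), 21))) AS original_uuid
FROM events_uuid_ordered LIMIT 1;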

    • Define the UUID column as binary(16), as binary does not have any character set.

 

