Jul 07, 2022

PostgreSQL Replication and Conflicts

Replication sits at the core of database high availability (HA). On-disk data can be replicated through various hardware- or software-based methods, whether disk-, file-, or block-based, but without an online copy of the database, an HA setup cannot be achieved.

This blog digs into the challenges and conflicts that affect PostgreSQL replication, but before we do that, let's review some key concepts and the evolution of replication in PostgreSQL.

Evolution of replication in PostgreSQL

It comes as no surprise that the PostgreSQL community has invested over the years in building solid replication features that help deliver both HA and disaster recovery (DR). What follows is a timeline of how the replication core evolved, starting with Write-Ahead Logging (WAL) and ending with synchronous streaming replication.

Write-Ahead Logging (WAL)

Write-Ahead Logging (WAL) was added to PostgreSQL in 2001 as part of the 7.1 release. That was a major step toward durability, the "D" in ACID compliance. Beyond durability, it also improved performance and transaction response times by reducing the need to write the same page multiple times.

Replication features

It was quite a few years after the introduction of WAL that the first major replication feature, warm standby, was implemented in PostgreSQL.

Warm standby or warm standby log shipping

The warm standby feature was introduced in the 8.2 release in 2006. It brought in the concepts of continuous archiving and continuous recovery.

The standby server runs in continuous recovery mode. This means that although the server is online, it cannot accept any connections; all it does is apply the WAL file segments it receives from the primary server's continuous archiving process.

This was asynchronous replication: multiple WAL records were packaged into a file and shipped out to the standby. Data loss was possible because, in the event of a primary server crash, some WAL records may not yet have been shipped.

So this was PostgreSQL’s first step toward an HA solution!

Warm to hot: Streaming replication

With the 9.0 release in 2010, streaming replication and hot standby were introduced. Two new postmaster processes were added: walsender and walreceiver. This was essentially the same concept as continuous archiving and continuous recovery, but implemented through background processes that stream WAL records. More importantly, this enabled the standby to honor incoming connections and run read-only queries.

These background processes were also responsible for flushing WAL files to disk on the primary and standby servers.

Two functions for monitoring replication were added as part of this release: pg_last_xlog_receive_location and pg_last_xlog_replay_location.
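
These functions were later renamed along with the rest of the "xlog" terminology, which became "wal" in PostgreSQL 10. On a modern standby, the equivalent check looks like this:

-- On a standby running PostgreSQL 10 or later
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();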

However, the replication remained asynchronous and data loss was possible in case of a primary server crash.

Synchronous streaming replication

The potential for data loss was eliminated with the 9.1 release in 2011 through synchronous streaming replication. This provided the strict data integrity guarantees that are suitable for an HA system. The mechanism can still be made asynchronous for an individual transaction by issuing "SET LOCAL synchronous_commit TO OFF" within it.
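
As a minimal sketch of how that looks in practice (the accounts table here is hypothetical), a single transaction can opt out of waiting for the synchronous standby:

BEGIN;
-- Only this transaction commits without waiting for the synchronous standby
SET LOCAL synchronous_commit TO OFF;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- hypothetical table
COMMIT;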

Additionally, this release introduced the concepts of quorum, priorities, connected, and active standbys.
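
The list of synchronous standbys and their priorities is controlled on the primary through synchronous_standby_names. Below is a sketch using ALTER SYSTEM; the standby names are placeholders, and note that the FIRST and ANY forms shown arrived later, in 9.6 and 10 respectively:

-- Priority-based: the first listed connected standby acts as the synchronous one
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)';
-- Quorum-based: any one of the listed standbys must confirm (PostgreSQL 10+)
-- ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)';
SELECT pg_reload_conf();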

Setting up streaming replication

Setting up streaming replication is actually rather simple. It’s a straightforward five-step process:

  • Create a user for replication. We’ll be using rep_user for this example.
-- Let’s create rep_user with a replication role in the primary database.
CREATE USER rep_user REPLICATION;

  • Add replication user to pg_hba.conf on the primary server.
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     rep_user   <ip address range>      <auth_method>

  • Reload the pg_hba.conf on the primary server either through “pg_ctl reload”, SQL function pg_reload_conf(), or through “kill -HUP”.
  • Take a backup of the primary database using pg_basebackup with the "-R" option to write the replication information into the new data directory (see the example after this list).
  • Start the standby server.
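
For the backup step, a typical pg_basebackup invocation looks something like the following; the host and data directory are placeholders for your environment:

# Run on the standby host against an empty data directory
pg_basebackup -h <primary host> -U rep_user -D /var/lib/postgresql/data -R -X stream -P
# -R writes the replication settings (primary_conninfo, plus standby.signal on 12+) into the data directory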

And voilà, streaming replication is enabled. The official PostgreSQL documentation explains the streaming replication configuration in more detail.

Replication views and configurations

In terms of configuring replication, the process becomes a little tricky as one has to configure the primary, sending, and standby servers optimally to meet the performance and uptime requirements.

A sending server is any PostgreSQL server in a replication setup that is shipping WAL segments to standby servers. This may be the primary server or an intermediate standby server in a cascading replication configuration.

Configuration

There are about 20 different configuration options for primary, sending, and standby servers. These configurations are for WAL, vacuum, connection, promotion, replication delays, and replication feedback.

Views

The pg_stat_replication and pg_stat_wal_receiver views provide information about replication on the primary and standby servers respectively.

On the primary server, pg_stat_replication contains one row per walsender process connected to a standby. Each row shows the application and client connection information for that standby, along with the replication state and the write, flush, and replay LSNs and lag intervals reported back by it. It also shows the LSN sent so far by the primary server.

On the standby server, pg_stat_wal_receiver exposes very similar connection and LSN information to pg_stat_replication, and additionally shows the replication slot in use, if any. It only has a single row, for the walreceiver process.
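
As a quick sketch of using these views (column names vary slightly across versions; the ones below are valid on recent releases):

-- On the primary: one row per connected standby
SELECT application_name, client_addr, state, sync_state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM   pg_stat_replication;

-- On the standby: a single row for the walreceiver process
SELECT status, slot_name, latest_end_lsn
FROM   pg_stat_wal_receiver;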

Replication conflicts

Replication conflicts occur only on standby servers; there are no conflicts on the primary server. They arise on the standby when the startup process replays the WAL that the walreceiver has received: the redo operation may require exclusive locks or buffer access that long-running queries or transactions on the standby are holding. Redo waits until a timeout is reached, after which any backends holding locks that prevent WAL replay are terminated. These terminations are classified as replication conflicts, and their nature can vary considerably.
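
The timeout mentioned above is governed by standby-side settings. The sketch below shows the relevant parameters with illustrative values only; tune them to your workload:

-- Run on the standby; the values here are only examples
ALTER SYSTEM SET max_standby_streaming_delay = '30s';  -- how long replay waits on conflicting queries (streamed WAL)
ALTER SYSTEM SET max_standby_archive_delay = '30s';    -- the same, for WAL restored from the archive
ALTER SYSTEM SET hot_standby_feedback = 'on';          -- ask the primary not to vacuum rows the standby still needs
SELECT pg_reload_conf();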

PostgreSQL has the pg_stat_database_conflicts view that tells us what types of conflicts have occurred and how many.

We can conveniently ignore the pg_stat_database view here, as it only has a single bigint column for the total conflict count and doesn't really provide any insight into the nature of these conflicts.

pg_stat_database_conflicts

On a standby server, pg_stat_database_conflicts has one row per database showing database-wide query cancellation stats due to recovery. It's a fairly simple view with datid, datname, and five bigint counters for the different types of conflicts (queried in the example after this list):

  • Tablespace,
  • Lock,
  • Snapshot,
  • Bufferpin, and
  • Deadlock.
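
These counters correspond to the confl_* columns of the view; a simple query on the standby shows them per database:

-- Run on the standby server
SELECT datname,
       confl_tablespace,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
FROM   pg_stat_database_conflicts;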

Tablespace conflict

Tablespace conflicts are likely to occur if a tablespace listed in temp_tablespaces is dropped. temp_tablespaces is used for temporary tables, temporary indexes, and sort operations.

These types of conflicts are somewhat uncommon, as such operations are few and far between.

Lock conflict

Lock conflicts may occur when you are querying a table on the standby while that same table is altered on the primary.

This leads to lock contention: the startup process applying the WAL needs an ACCESS EXCLUSIVE lock on the index or relation, while another query on the standby is still holding an ACCESS SHARE lock on it.

Snapshot conflict

Snapshot conflicts are generated when a backend process tries to access rows on the standby server that have already been vacuumed away on the primary.

Bufferpin conflict

These are mostly caused by HOT chain pruning or, in version 14, by the version-churn cleanup (bottom-up deletion) in btree indexes, as these operations require an exclusive lock on a buffer page.

These types of conflicts are not that common.

Deadlock conflict

These are conflicts due to query cancellations occurring because of deadlocks on standby.

Other conflicts

There may be other types of conflicts, but those aren't captured by the pg_stat_database_conflicts view. For example, one may experience a database conflict where a database is dropped on the primary while the standby still has active connections using it.

Survey: Issues faced with PostgreSQL replication

Based on a survey conducted amongst friends and colleagues, I see that more than 50% of the respondents have experienced issues with PostgreSQL replication.

The type of issues and conflicts experienced also provide us with an insight into what to focus on while configuring the replication.

If you wish to share your experience with PostgreSQL replication with me, then you may do so using this Google form.

Conclusion

There will be more blogs coming from me in which I'll discuss how some of these replication conflicts occur and how the standby server reacts internally. The key really is to get a better under-the-hood perspective so that we can improve the replication configuration to avoid such conflicts, leading to improved performance, better uptime, and simpler debugging.

Apr 19, 2022

PostgreSQL 14 B-Tree Index: Reduced Bloat with Bottom-Up Deletion


Concurrent access to data within PostgreSQL is managed with the Multiversion Concurrency Control (MVCC) model. Data snapshots are maintained for each SQL statement so that they always get consistent data, even if other transactions are modifying it concurrently. This leads to managing multiple versions of the same row when the row has been modified by one or more transactions. From a user perspective, there might only be a single row of data, but internally PostgreSQL may be maintaining one or more versions of that row.

Whether a row version is visible to a transaction is maintained with the row data in the heap. To optimize the fetching of visibility information PostgreSQL also maintains a “_vm” relation fork that keeps track of which pages contain only the tuples that are known to be visible to all active transactions.

Dead versions that are no longer visible to any transaction are cleaned up by the vacuum process. Until that happens, the index and heap pages may contain a significant number of dead tuples (this really depends on the nature of your workload). For a very update-intensive workload, this could be a huge number!
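
As a rough way to gauge this on your own tables, the cumulative statistics views expose live and dead tuple estimates, for example:

-- Approximate live vs. dead tuple counts per table, as estimated by the statistics collector
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC;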

It may seem innocuous at first sight, but this accumulation of dead index tuples creates a cascading effect that leads to significant performance degradation. After the deduplication work in PostgreSQL 13, the next logical step was to prevent btree index expansion by reducing page splits.

Physical Data Storage

PostgreSQL maintains data in fixed-sized storage units called pages. The size of a page is defined during the PostgreSQL server compilation process. The default page size is 8 kB, but it can be changed to a higher value. Changing the page size complicates things, though, as other tools may require recompilation or reconfiguration as well.
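
You can check the page (block) size that a running server was built with, for example:

-- Reports the compile-time block size; 8192 bytes by default
SHOW block_size;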

Each table and index is stored in an array of pages. When data is inserted in a table, the data is written to a page having enough free space. Otherwise, a new page is created.

Indexes however are a little different. The first page in an index is a meta page that contains control information about the index. There can also be special pages that maintain index-related information. In the case of a btree index, the data must be sorted based on the index columns and heap tuple ID (the physical location of the tuple within the table). Therefore insertion and updates must happen on the correct page to maintain the sorting order. If the page does not have enough space for the incoming tuple a new page must be created, and some items from the overflowing page are moved to the new page. Parent pages of these leaf pages are split recursively if needed.

Avoiding Page Splits

B-Tree index page splits occur when new tuples or new non-HOT tuple versions are to be added to the index. HOT is an abbreviation for “heap only tuple”. In basic terms, it is a way of removing dead rows on a given page (defragmentation) and thereby making space for new rows. By avoiding or delaying page splits, we can avoid or slow down index expansion and therefore reduce the bloat. Now that is exciting!

While there isn’t much that can be done for new tuples, updates can be managed such that obsolete versions of logically unchanged index tuples (i.e. unchanged index columns) can be incrementally removed to maintain free space for the new version. This process is aided by the planner which provides a hint, “index unchanged” to the index method. This is true if none of the index columns are changed as a result of this update.

The bottom-up deletion is done during an index operation when a “version churn page split” is anticipated (the “index unchanged” hint is true). Obsolete versions of logically unchanged index tuples are removed making space on the page for the newer version. This approach potentially avoids a page split.

Bottom-Up Delete in Action

To see the actual benefit of this approach let us dig a little deeper into the B-Tree index. We are going to compare the btree index sizes between PostgreSQL versions 13 and 14. For a more detailed inspection of index data, I’ll be using the “pageinspect” extension that is available in the contrib modules. The “pageinspect” extension allows us to see the low-level page contents of indexes or tables.

Let's start by creating the pageinspect extension. You may need to install the contrib modules first, or if you are building from source, run make install for them before proceeding.

CREATE EXTENSION IF NOT EXISTS pageinspect;


Let's now create a table "foo" with two columns, create two indexes (one of them a covering index), and analyze the table as well.

DROP TABLE IF EXISTS foo;
CREATE TABLE foo WITH (autovacuum_enabled = false) AS (SELECT GENERATE_SERIES(1, 1000) AS col1, SUBSTR(MD5(RANDOM()::TEXT), 0, 25) AS value);
CREATE INDEX ON foo(col1);
CREATE INDEX ON foo(col1) INCLUDE(value);
ANALYZE foo;


It is time to inspect a number of pages, tuples, and relation sizes for the “foo” table.

SELECT  relname
        , relkind
        , relpages
        , reltuples
        , PG_SIZE_PRETTY(PG_RELATION_SIZE(oid))
FROM    pg_class
WHERE   relname LIKE '%foo%'
ORDER
BY      relkind DESC;

      relname       | relkind | relpages | reltuples | pg_size_pretty 
--------------------+---------+----------+-----------+----------------
 foo                | r       |        8 |      1000 | 64 kB
 foo_col1_idx       | i       |        5 |      1000 | 40 kB
 foo_col1_value_idx | i       |        9 |      1000 | 72 kB
(3 rows)


Both 14.1 and 13.5 give the exact same output for the above query.

Disable sequential and bitmap scans so that the queries in this example use index scans.

SET enable_seqscan = false;
SET enable_bitmapscan = false;


Update every row four times to create four new versions of each tuple.

UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';


Each of the above queries updates 1,000 rows. "ANALYZE" the table to ensure our statistics are accurate, and then review the number of pages, tuples, and relation sizes for the "foo" table again.

ANALYZE foo;

SELECT  relname
        , relkind
        , relpages
        , reltuples
        , PG_SIZE_PRETTY(PG_RELATION_SIZE(oid))
FROM    pg_class
WHERE   relname LIKE '%foo%'
ORDER
BY      relkind DESC;

--PostgreSQL 14.1
      relname       | relkind | relpages | reltuples | pg_size_pretty 
--------------------+---------+----------+-----------+----------------
 foo                | r       |        8 |      1000 | 288 kB
 foo_col1_idx       | i       |        5 |      1000 | 88 kB
 foo_col1_value_idx | i       |        9 |      1000 | 216 kB
(3 rows)


--PostgreSQL 13.5
      relname       | relkind | relpages | reltuples | pg_size_pretty 
--------------------+---------+----------+-----------+----------------
 foo                | r       |        8 |      1000 | 288 kB
 foo_col1_idx       | i       |        5 |      1000 | 104 kB
 foo_col1_value_idx | i       |        9 |      1000 | 360 kB
(3 rows)


The table size has increased by the same amount in both versions; however, the indexes in 14.1 are significantly smaller than in 13.5. Great, but just to be sure, let's inspect the page contents to understand what has happened behind the scenes.

Reviewing the contents of the first index page (not the meta page) clearly shows how the bottom-up deletion is keeping index sizes small.


SELECT  itemoffset
        , ctid
        , itemlen
        , nulls
        , vars
        , dead
        , htid
FROM    bt_page_items('foo_col1_value_idx', 1)
LIMIT   15;

PostgreSQL 14.1
 itemoffset |  ctid   | itemlen | nulls | vars | dead |  htid   
------------+---------+---------+-------+------+------+---------
          1 | (7,1)   |      16 | f     | f    |      | 
          2 | (7,181) |      40 | f     | t    | f    | (7,181)
          3 | (7,225) |      48 | f     | t    | f    | (7,225)
          4 | (7,182) |      40 | f     | t    | f    | (7,182)
          5 | (7,226) |      48 | f     | t    | f    | (7,226)
          6 | (7,183) |      40 | f     | t    | f    | (7,183)
          7 | (7,227) |      48 | f     | t    | f    | (7,227)
          8 | (7,184) |      40 | f     | t    | f    | (7,184)
          9 | (7,228) |      48 | f     | t    | f    | (7,228)
         10 | (7,185) |      40 | f     | t    | f    | (7,185)
         11 | (7,229) |      48 | f     | t    | f    | (7,229)
         12 | (7,186) |      40 | f     | t    | f    | (7,186)
         13 | (7,230) |      48 | f     | t    | f    | (7,230)
         14 | (7,187) |      40 | f     | t    | f    | (7,187)
         15 | (7,231) |      48 | f     | t    | f    | (7,231)
(15 rows)


PostgreSQL 13.5
 itemoffset |  ctid   | itemlen | nulls | vars | dead |  htid   
------------+---------+---------+-------+------+------+---------
          1 | (0,1)   |      16 | f     | f    |      | 
          2 | (0,1)   |      40 | f     | t    | f    | (0,1)
          3 | (7,49)  |      40 | f     | t    | f    | (7,49)
          4 | (7,137) |      40 | f     | t    | f    | (7,137)
          5 | (7,181) |      40 | f     | t    | f    | (7,181)
          6 | (7,225) |      48 | f     | t    | f    | (7,225)
          7 | (0,2)   |      40 | f     | t    | f    | (0,2)
          8 | (7,50)  |      40 | f     | t    | f    | (7,50)
          9 | (7,138) |      40 | f     | t    | f    | (7,138)
         10 | (7,182) |      40 | f     | t    | f    | (7,182)
         11 | (7,226) |      48 | f     | t    | f    | (7,226)
         12 | (0,3)   |      40 | f     | t    | f    | (0,3)
         13 | (7,51)  |      40 | f     | t    | f    | (7,51)
         14 | (7,139) |      40 | f     | t    | f    | (7,139)
         15 | (7,183) |      40 | f     | t    | f    | (7,183)
(15 rows)

Looking at "itemoffset" 2 to 3 for 14.1 and 2 to 6 for 13.5 tells us the entire story. 13.5 is carrying the entire set of tuple versions, whereas 14.1 has cleaned out the obsolete ones to make room. With fewer versions there are fewer pages, resulting in less bloat and a smaller index size.

Conclusion

The reduction in index size due to bottom-up deletion is a huge plus in PostgreSQL version 14. Btree indexes already have a mechanism where plain index scans set the LP_DEAD flag (it is not set by bitmap index scans); once set, that space can be reclaimed without waiting for vacuum. However, that targets a completely different class of dead tuples. In the long run, the bottom-up deletion strategy helps significantly reduce a specific class of duplicates. It not only reduces the load on vacuum but also helps keep indexes healthier, leading to faster access. So if you have an update-heavy workload, there will definitely be savings in resource utilization and costs, along with better performance.
