Oct 12, 2021

Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL


Users often come to us with incidents of database crashes caused by the OOM killer. The Out Of Memory killer terminates PostgreSQL processes and remains the top reason for most of the PostgreSQL database crashes reported to us. There are multiple reasons why a host machine can run out of memory; the most common ones are:

  1. Poorly tuned memory on the host machine.
  2. A high value of work_mem specified globally (at the instance level). Users often underestimate the multiplying effect of such blanket decisions (a rough estimate is sketched just after this list).
  3. A high number of connections. Users ignore the fact that even an idle connection can hold a good amount of memory.
  4. Other programs co-hosted on the same machine consuming resources.
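
For a quick back-of-the-envelope check of that multiplying effect, something like the following can be used. It only estimates the worst case in which every allowed connection uses one full work_mem allocation; real queries may use several work_mem-sized work areas (one per sort or hash node), so treat it as a sketch rather than an exact figure.

psql -Atc "SELECT current_setting('max_connections')::int
                  * pg_size_bytes(current_setting('work_mem'))
                  / 1024 / 1024 / 1024.0 AS worst_case_gb;"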

Even though we regularly assist in tuning both host machines and databases, we do not always take the time to explain how and why HugePages are important and to justify it with data. Thanks to repeated probing by my friend and colleague Fernando, I couldn’t resist doing so this time.

The Problem

Let me explain the problem with a testable and repeatable case. This might be helpful if anyone wants to test the case in their own way.

Test Environment

The test machine is equipped with 40 CPU cores (80 vCPUs) and 192 GB of installed memory. I don’t want to overload this server with too many connections, so only 80 connections are used for the test. Yes, just 80 connections, which is a very realistic number we could expect in almost any environment. Transparent HugePages (THP) is disabled. I don’t want to divert the topic by explaining why having THP on a database server is not a good idea, but I promise to cover that in another blog post.

In order to have relatively persistent connections, just like the ones from application-side poolers (or even from external connection poolers), pgBouncer is used to keep all 80 connections persistent throughout the tests. The following is the pgBouncer configuration used:

[databases]
sbtest2 = host=localhost port=5432 dbname=sbtest2

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = /tmp/pgbouncer.log
pidfile = /tmp/pgbouncer.pid
admin_users = postgres
default_pool_size=100
min_pool_size=80
server_lifetime=432000

As we can see, the server_lifetime parameter is set to a high value so that the connections from the pooler to PostgreSQL are not destroyed. The following PostgreSQL parameter modifications are applied to mimic some of the common customer environment settings:

logging_collector = 'on'
max_connections = '1000'
work_mem = '32MB'
checkpoint_timeout = '30min'
checkpoint_completion_target = '0.92'
shared_buffers = '138GB'
shared_preload_libraries = 'pg_stat_statements'

The test load is created using sysbench

sysbench /usr/share/sysbench/oltp_point_select.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-port=6432 --pgsql-db=sbtest2 --pgsql-user=postgres --pgsql-password=vagrant --threads=80 --report-interval=1 --tables=100 --table-size=37000000 prepare

and then

sysbench /usr/share/sysbench/oltp_point_select.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-port=6432 --pgsql-db=sbtest2 --pgsql-user=postgres --pgsql-password=vagrant --threads=80 --report-interval=1 --time=86400  --tables=80 --table-size=37000000  run

The first (prepare) stage puts a write load on the server, and the second one a read-only load.

I am not attempting to explain the theory and concepts behind HugePages here, but to concentrate on the impact analysis. Please refer to the LWN article Five-Level Page Tables and Andres Freund’s blog post Measuring the Memory Overhead of a Postgres Connection for some of the underlying concepts.

Test Observations

During the test, memory consumption was checked using the Linux free utility. When using the regular pool of memory pages, consumption started at a really low value but increased steadily (please see the screenshot below). The “available” memory was depleted at a faster rate.

Available Memory

Towards the end, swap activity also started. It is captured in the vmstat output below:

swap activity

Information from /proc/meminfo reveals that the total page table size has grown to more than 25 GB from the initial 45 MB.

This is not just memory wastage; it is a huge overhead impacting the overall execution of the program and the operating system. This size is the total of the lower-level page table entries of the 80+ PostgreSQL processes.
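
If you want to watch this growth yourself during a similar test, a simple loop over /proc/meminfo is enough; the 10-second interval below is arbitrary:

watch -n 10 'grep -E "^(PageTables|AnonPages|MemAvailable)" /proc/meminfo'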

The same can be verified by checking each PostgreSQL process. Following is a sample:

So the total page table size (25 GB) should be approximately this value * 80 (connections). Since this synthetic benchmark sends an almost identical workload through all connections, all individual processes have values very close to the one captured above.
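
As a side note, the per-process page table size can also be read directly from the VmPTE field of each backend’s /proc status file (the process-matching pattern is the same one used in the Pss one-liner below):

for PID in $(pgrep "postgres|postmaster"); do
  grep -H ^VmPTE /proc/$PID/status
done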

The following shell line can be used for checking the Pss (Proportional set size). Since PostgreSQL uses Linux shared memory, focusing on Rss won’t be meaningful.

for PID in $(pgrep "postgres|postmaster") ; do awk '/Pss/ {PSS+=$2} END{getline cmd < "/proc/'$PID'/cmdline"; sub("\0", " ", cmd);printf "%.0f --> %s (%s)\n", PSS, cmd, '$PID'}' /proc/$PID/smaps ; done|sort -n

Without Pss information, there is no easy method to understand the memory responsibility per process.

In a typical database system where we have a considerable DML load, the background processes of PostgreSQL such as the checkpointer, background writer, or autovacuum workers will be touching more pages in the shared memory. The corresponding Pss will be higher for those processes.

Postgres Process

This should explain why the checkpointer, background writer, or even the postmaster often become the usual victims/targets of the OOM killer. As we can see above, they carry the biggest responsibility for the shared memory.

After several hours of execution, the individual sessions had touched more shared memory pages. As a consequence, the per-process Pss values were rearranged: the checkpointer is responsible for less, as other sessions shared the responsibility.

However, the checkpointer retains the highest share.

Even though it is not important for this test, it is worth mentioning that this kind of load pattern is specific to synthetic benchmarking, because every session does pretty much the same job. That is not a good approximation of a typical application load, where we usually see the checkpointer and background writer carry the major responsibility.

The Solution: Enable HugePages

The solution for such bloated page tables and associated problems is to make use of HugePages instead. We can figure out how much memory should be allocated to HugePages by checking the VmPeak of the postmaster process. For example, if 4357 is the PID of the postmaster:

grep ^VmPeak /proc/4357/status

This gives the amount of memory required in KB:

VmPeak: 148392404 kB

This much memory needs to fit into huge pages. Converting this value into 2 MB pages:

postgres=# select 148392404/1024/2;
?column?
----------
    72457
(1 row)

Specify this value for vm.nr_hugepages in /etc/sysctl.conf, for example:

vm.nr_hugepages = 72457

Now shut down the PostgreSQL instance and execute:

sysctl -p

We can then verify whether the requested number of huge pages has been created:

grep ^Huge /proc/meminfo
HugePages_Total:   72457
HugePages_Free:    72457
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        148391936 kB

If we start up PostgreSQL at this stage, we can see that HugePages_Rsvd gets allocated.

$ grep ^Huge /proc/meminfo
HugePages_Total:   72457
HugePages_Free:    70919
HugePages_Rsvd:    70833
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        148391936 kB

If everything is fine, I prefer to make sure that PostgreSQL always uses HugePages, because I would rather have PostgreSQL fail to start than run into problems/crashes later.

postgres=# ALTER SYSTEM SET huge_pages = on;

The above change needs a restart of the PostgreSQL instance.
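
A quick way to apply and confirm the change could look like the following; the service name is a placeholder and depends on your distribution and packaging:

sudo systemctl restart postgresql
psql -c "SHOW huge_pages;"
grep ^Huge /proc/meminfo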

Tests with HugePages “ON”

The HugePages are created in advance, even before the PostgreSQL startup. PostgreSQL just allocates and uses them. So there won’t be any noticeable change in the free output before and after the startup. PostgreSQL allocates its shared memory into these HugePages if they are already available. PostgreSQL’s shared_buffers is the biggest occupant of this shared memory.

HugePages PostgreSQL

The first output of free -h in the above screenshot is generated before the PostgreSQL startup and the second one after the PostgreSQL startup. As we can see, there is no noticeable change.

I ran the same test for several hours and there wasn’t any change; the only noticeable change, even after many hours of running, is the shift of “free” memory to the filesystem cache, which is expected and what we want to achieve. The total “available” memory remained pretty much constant, as we can see in the following screenshot.

Available Memory PostgreSQL

The total page tables size remained pretty much the same:

As we can see, the difference is huge: just 61 MB with HugePages instead of the 25+ GB seen previously. The Pss per session was also reduced drastically:

Pss per session

The biggest advantage I could observe is that the checkpointer and background writer are no longer accountable for several GBs of RAM; instead, they are accountable only for a few MBs of consumption. Obviously, they won’t be candidate victims for the OOM killer anymore.

Conclusion

In this blog post, we discussed how Linux Huge Pages can potentially save the database server from OOM Killers and associated crashes. We could see two improvements:

  1. Overall memory consumption was reduced by a big margin. Without HugePages, the server almost ran out of memory (the available memory was completely depleted and swapping activity started). However, once we switched to HugePages, 38-39 GB remained as available memory/Linux filesystem cache. This is a huge saving.
  2. With HugePages enabled, PostgreSQL background processes are not accountable for a large amount of shared memory. So they won’t easily become candidate victims/targets for the OOM killer.

These improvements can potentially save the system if it is on the brink of OOM condition, but I don’t want to make any claim that this will protect the database from all OOM conditions forever.

HugePages (hugetlbfs) originally landed in the Linux kernel in 2002 to address the requirements of database systems that need to map a large amount of memory. I could see that the design goals are still valid.

There are other additional indirect benefits of using HugePages:

  1. HugePages never get swapped out. When PostgreSQL shared buffers are in HugePages, the system can produce more consistent and predictable performance. I shall discuss that in another article.
  2. Linux uses a multi-level page lookup method. HugePages are implemented using direct pointers to pages from the middle layer (a 2MB huge page would be found directly at the PMD level, with no intervening PTE page). The address translation becomes considerably simpler. Since this is a high-frequency operation in a database server with a large amount of memory, the gains are multiplied.

Note: The HugePages discussed in this blog post are about fixed size (2MB) huge pages.

Additionally, as a side note, I want to mention that there have been a lot of improvements in Transparent HugePages (THP) over the years which allow applications to use huge pages without any code modification. THP is often considered a replacement for regular HugePages (hugetlbfs) for generic workloads. However, the usage of THP is discouraged on database systems, as it can lead to memory fragmentation and increased delays. I want to cover that topic in another post and will just mention here that these are not PostgreSQL-specific problems, but affect every database system. For example:

  1. Oracle recommends disabling THP. Reference link
  2. MongoDB recommends disabling THP. Reference link
  3. “THP is known to cause performance degradation with PostgreSQL for some users on some Linux versions.” Reference link.


Aug 10, 2021

Impact of Network and Cursor on Query Performance of PostgreSQL


Many times, we see PostgreSQL users getting confused about the query/statement duration reported in PostgreSQL logs. Other PostgreSQL tools like pgBadger present the same data based on the log file, which further increases the confusion. Knowing the full impact of network-related overhead and cursors is important not only to alleviate the confusion but also to get the best performance.

One might ask, “Why discuss network overhead and cursors specifically?” Well, they are the hidden costs that come during and after the query execution. Once the query execution starts and there is some data to be given to the client, these are the factors that mainly affect performance. The important point to keep in mind is that since all of this happens outside the query execution, the corresponding information will not be available through EXPLAIN (ANALYZE).

Impact of Network

For the demonstration, I am using a query based on pgBench tables.

select a.bid, b.cnt from pgbench_accounts a,
   (select bid,count(*) cnt from pgbench_accounts group by bid) b
where b.bid > a.bid;

There is nothing significant about this query; it is just a random query that takes some time to execute in the database.

In order to capture the Explain Analyze output, auto_explain is used. Settings are made to capture all statements which take more than 250ms.

ALTER SYSTEM SET auto_explain.log_min_duration = 250

A few other settings to capture additional details with EXPLAIN ANALYZE are:

ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers=on;
ALTER SYSTEM SET auto_explain.log_timing=on;
ALTER SYSTEM SET auto_explain.log_verbose=on;
ALTER SYSTEM SET auto_explain.log_triggers=on;
ALTER SYSTEM SET auto_explain.log_wal=on;

In order to illustrate the difference, the same query will be executed from:

1. The database host server itself

2. An application host server connected over the network

Queries Returning a Large Number of Rows

Case 1. Executing on the Database Host Itself

Following are a few lines from the PostgreSQL logs generated by auto_explain:

2021-08-02 03:27:56.347 UTC [25537] LOG:  duration: 1591.784 ms  plan:
        Query Text: select a.bid, b.cnt from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid;
        Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=119.739..1069.924 rows=1000000 loops=1)
          Output: a.bid, b.cnt
          Join Filter: (b.bid > a.bid)
          Rows Removed by Join Filter: 1500000
          ...

As we can see, the outer nested loop of the query completed in 1069.924 ms, returning 1 million records, but the total duration of the query is reported as 1591.784 ms. What could account for the difference?

A plain EXPLAIN ANALYZE shows that the planning time is sub-millisecond for this simple query, where data comes from a single table without any index. So planning time shouldn’t be the reason.
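
If you want to reproduce that check, a direct EXPLAIN (ANALYZE, BUFFERS) run of the same query reports the Planning Time and Execution Time separately (the database name is whichever one holds the pgbench tables):

psql -c "EXPLAIN (ANALYZE, BUFFERS)
         SELECT a.bid, b.cnt
         FROM pgbench_accounts a,
              (SELECT bid, count(*) cnt FROM pgbench_accounts GROUP BY bid) b
         WHERE b.bid > a.bid;"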

Case 2. Executing from a Remote Application Host

Again the information from the PostgreSQL log looks different:

2021-08-02 04:08:58.955 UTC [25617] LOG:  duration: 6568.659 ms  plan:
        Query Text: select a.bid, b.cnt from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid;
        Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=140.644..1069.153 rows=1000000 loops=1)
          Output: a.bid, b.cnt
          Join Filter: (b.bid > a.bid)
          Rows Removed by Join Filter: 1500000
          ...

As we can see, the statement duration jumped to 6568.659 ms, even though the actual execution of the query remained pretty much the same at 1069.153 ms. That’s a huge difference. What could be the reason?

Queries Returning a Smaller Number of Rows

The above-mentioned query can be slightly modified to return only the max values. The modified test query may look like this:

select max(a.bid), max(b.cnt) from pgbench_accounts a,
   (select bid,count(*) cnt from pgbench_accounts group by bid) b
where b.bid > a.bid ;

The query plan and time don’t change much, other than an additional aggregate node. Even if there were a change in the plan, it would not be relevant here, because we are considering only the time difference between the outer node of the query execution and the duration reported by PostgreSQL.

Case 1: Executing on the Database Host Itself

2021-08-03 06:58:14.364 UTC [28129] LOG:  duration: 1011.143 ms  plan:
        Query Text: select max(a.bid), max(b.cnt) from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid ;
        Aggregate  (cost=67187.12..67187.13 rows=1 width=12) (actual time=1010.914..1011.109 rows=1 loops=1)
          Output: max(a.bid), max(b.cnt)
          Buffers: shared hit=12622 read=3786
          ->  Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=135.635..908.315 rows=1000000 loops=1)
                Output: a.bid, b.cnt
                Join Filter: (b.bid > a.bid)
                Rows Removed by Join Filter: 1500000
                Buffers: shared hit=12622 read=3786

As we can see, there is not much difference between the completion of the query (1011.109 ms) and the reported duration (1011.143 ms). The observations so far indicate that the extra time is consumed when a lot of rows are returned.

Case 2: Executing the Statement from the Remote Host

2021-08-03 06:55:37.221 UTC [28111] LOG:  duration: 1193.387 ms  plan:
        Query Text: select max(a.bid), max(b.cnt) from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid ;
        Aggregate  (cost=67187.12..67187.13 rows=1 width=12) (actual time=1193.139..1193.340 rows=1 loops=1)
          Output: max(a.bid), max(b.cnt)
          Buffers: shared hit=11598 read=4810
          ->  Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=124.508..1067.409 rows=1000000 loops=1)
                Output: a.bid, b.cnt
                Join Filter: (b.bid > a.bid)
                Rows Removed by Join Filter: 1500000
                Buffers: shared hit=11598 read=4810

Again, there is not much difference: 1193.340 vs. 1193.387 ms. Overall, I feel safe to conclude from the results that if the data transfer is minimal, having the application server on a different host machine doesn’t make much difference; meanwhile, the impact is huge if a lot of result transfer is involved.

Analyzing the Wait Events

Luckily, newer versions of PostgreSQL provide us with an excellent way to monitor the “wait event” information in the pg_stat_activity view for each session/connection.

At Percona Support, we use a script from the pg_gather snippet for gathering performance information, including the wait events, by collecting multiple samples. The script collects samples of wait events at 10 ms intervals, so there will be 2000 samples in 20 seconds. Apart from the UI, the gathered information can also be analyzed using backend queries.
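
A crude manual approximation of the same idea, in case you don’t have pg_gather at hand, is to repeatedly sample pg_stat_activity for the backend in question; the sketch below samples roughly every 100 ms (the psql invocation overhead makes it much coarser than pg_gather, and the PID is a placeholder):

PID=25617   # backend PID under investigation
for i in $(seq 1 200); do
  psql -Atc "SELECT COALESCE(wait_event, 'CPU') FROM pg_stat_activity WHERE pid = $PID;"
  sleep 0.1
done | sort | uniq -c | sort -rn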

The following is what I could see about PID: 25617 (the case of returning a large number of rows to remote host).

postgres=# select pid,wait_event,count(*) from pg_pid_wait where pid=25617 group by 1,2 order by 3 desc;
  pid  |  wait_event  | count 
-------+--------------+-------
 25617 | ClientWrite  |   286
 25617 |              |    75
 25617 | DataFileRead |     3
(3 rows)

The session is spending most of its time on the “ClientWrite” wait event. As per the PostgreSQL documentation:

ClientWrite: Waiting to write data to the client.

It is the time spent writing the data to the client. A NULL wait_event indicates CPU utilization.

Impact of Cursors

Typically, after a query execution, the resulting data needs to be processed by the application. Cursors are used for holding the result of queries and processing them. The impact on query performance is mainly decided by where the cursor resides: on the PostgreSQL server side or on the client side. The location of the cursor matters mainly when the query is issued from a separate application host, so I am testing only that case.

Client-Side Cursors

Generally, this is the case with most of the PostgreSQL clients and applications. The data is retrieved fully to the database client end and then processed one by one.

Here is a simple python snippet (to mimic the application connection) for testing the same. (Only the relevant lines are copied.)

import time

import psycopg2

# connectionString is defined elsewhere in the full script
conn = psycopg2.connect(connectionString)
cur = conn.cursor()
cur.itersize = 2000
cur.execute("select a.bid, b.cnt from pgbench_accounts a, (select bid,count(*) cnt from pgbench_accounts group by bid) b where b.bid > a.bid")
row = cur.fetchone()
while row is not None:
    print(row)
    time.sleep(0.001)  # simulate per-row processing on the application side
    row = cur.fetchone()

conn.close()

As we can see, itersize is specified so that only that many records are fetched at a time for processing, and there is a 1-millisecond delay in each iteration of the loop using fetchone().

But none of this affects the server-side query performance, because the entire result set is already cached on the client side. The query time and duration reported are similar to those of executing from a remote application host with a large number of rows. As expected, the impact of the network is clearly visible:

2021-08-03 17:39:17.119 UTC [29180] LOG:  duration: 5447.793 ms  plan:
        Query Text: select a.bid, b.cnt from pgbench_accounts a, (select bid,count(*) cnt from pgbench_accounts group by bid) b where b.bid > a.bid
        Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=130.919..1240.727 rows=1000000 loops=1)
          Output: a.bid, b.cnt
          Join Filter: (b.bid > a.bid)
          Rows Removed by Join Filter: 1500000
          Buffers: shared hit=1647 read=14761
          ->  Seq Scan on public.pgbench_accounts a  (cost=0.00..13197.00 rows=500000 width=4) (actual time=0.086..183.503 rows=500000 loops=1)
                Output: a.aid, a.bid, a.abalance, a.filler
                Buffers: shared hit=864 read=7333

Server-Side Cursors

The way the cursor is created and used changes totally if we have a named cursor that stays on the server side; the statement cur = conn.cursor() is modified to include a name, like cur = conn.cursor('curname').

As the psycopg2 (the Python connector for PostgreSQL) documentation says:

“Psycopg wraps the database server-side cursor in named cursors. A named cursor is created using the cursor() method specifying the name parameter”

Surprisingly, the PostgreSQL log does not give any more information about the query even though auto_explain is configured. No duration information either. There is only a single line of info:

2021-08-03 18:02:45.184 UTC [29249] LOG:  duration: 224.501 ms  statement: FETCH FORWARD 1 FROM "curname"

PostgreSQL cursors support various FETCH options with custom size specifications. Please refer to the documentation for FETCH for more details. It is up to the language driver/connector to wrap this functionality into corresponding functions.
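
The same mechanics can also be reproduced at the SQL level without any driver involved; the following psql sketch declares a cursor (the cursor name and fetch size are arbitrary) and fetches one batch from it:

psql <<'SQL'
BEGIN;
DECLARE curname CURSOR FOR
  SELECT a.bid, b.cnt
  FROM pgbench_accounts a,
       (SELECT bid, count(*) cnt FROM pgbench_accounts GROUP BY bid) b
  WHERE b.bid > a.bid;
FETCH FORWARD 20000 FROM curname;
COMMIT;
SQL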

The Python driver for PostgreSQL, psycopg2, wraps the functionality to run FETCH with a custom batch size specified as follows:

cur.fetchmany(size=20000)

Which produces a PostgreSQL log entry like:

2021-08-05 05:13:30.728 UTC [32374] LOG:  duration: 262.931 ms  statement: FETCH FORWARD 20000 FROM "curname"

As we expected, the fetch size is increased.

But the important point to note here is: even though the application iterating over a server-side cursor took a very long time (running into several minutes), there is almost zero information about the query or the session in the PostgreSQL logs.

This could be the last thing someone would expect.

Wait Event Analysis

Again, wait event analysis comes in handy to understand what’s happening. Out of the 2000 wait event samples collected by the pg_gather script, the wait events look like this:

pid  |  wait_event  | count 
-------+--------------+-------
 30115 | ClientRead   |  1754
 30115 |   (CPU)      |   245
 30115 | DataFileRead |     1

The time is spent on the “ClientRead” wait event. This means that the server side is waiting for the client to send the next request. So a slow network between the application server and the database server can adversely affect server-side cursors. But no information about the statement will be available in the PostgreSQL logs.

Conclusion

In this blog post, I tried to assess the impact of transferring a large amount of data from a Database Host to an Application Host.

PLEASE NOTE: The numbers discussed in this blog post, in terms of the number of rows and the times recorded, don’t have any absolute relevance, and they might change from system to system with many environmental factors.

The discussion is more about the areas of impact, what we should expect, and how to analyze it as an end-user, rather than about any absolute numbers.

  1. Always try to specify the minimum number of columns in the query. Avoid “SELECT *” or columns that are not used on the application side, to prevent unnecessary data transfer.
  2. Avoid fetching a large number of rows to the application at a time. If required, use proper paging with LIMIT and OFFSET.
  3. Avoid server-side cursors wherever possible. PostgreSQL reports only the duration of the first fetch, so the actual query performance may go unnoticed and badly performing queries could be hiding behind it.
  4. Client-side cursors and data processing time won’t affect the query performance at the PostgreSQL server-side.
  5. Wait event analysis is very handy in understanding where the server is spending time.


Jun 24, 2021

Understanding pg_repack: What Can Go Wrong – and How to Avoid It


pg_repack is one of the oldest and most widely used extension projects for PostgreSQL. It is so popular that even DBaaS providers couldn’t avoid it. It is a “power tool” in the hands of a DBA for dealing with bloated/fragmented tables, and I can’t imagine a serious production deployment without it these days. It magically replaces a bloated, fragmented table with a fresh, fully packed table without holding an exclusive lock on the table during its processing**. This extension has made PostgreSQL’s built-in commands like VACUUM FULL and CLUSTER almost obsolete.

**Momentary AccessExclusive locks are required. Please see the discussion below.

But unfortunately, regular and repeated usage of pg_repack has increased everyone’s comfort level with it (myself included). As a consequence, it is often recommended as a solution for everything, as a panacea. I recently started coming across cases like a scheduled job running pg_repack on every table. This rang an alarm bell and started to feel like the overuse of antibiotics. So I thought of writing about how pg_repack works and how it can affect your system while it is running, for regular users who don’t want to do a detailed study, because a better understanding might help them make an informed decision on where it should be used.

pg_repack has a lot of functionalities (options). In this blog post, my intention is to discuss only how it does the basic “repack”ing of a fragmented/bloated table.

Pre-Checks and Ensuring Sanity

This stage ensures that pg_repack is installed, the extension is available in the database, we are running pg_repack as a superuser, and so on. It creates a provision for cleaning up temporary objects created by pg_repack. It collects metadata about the table and associated objects like indexes, TOAST, triggers, etc., because tracking the table and its associated objects is very important. Objects like invalid indexes, conflicting triggers, etc. are checked. We are not going to discuss this in detail.
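
A quick manual version of the same sanity check can be run before invoking the tool (a sketch; the extension still has to be created in each database where it will be used):

pg_repack --version
psql -c "SELECT name, default_version, installed_version
         FROM pg_available_extensions WHERE name = 'pg_repack';"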

The actual processing starts with obtaining an advisory lock on the table’s OID, to make sure no other pg_repack is working on the table. This is not so much a concern with full-table repacks, but mainly ensures that index-only repacks don’t interfere with each other or with a full-table repack. If another pg_repack is in progress, the repack attempt may get a message as follows and exit:

ERROR: Another pg_repack command may be running on the table. Please try again later.

Create temporary objects while holding AccessExclusive lock on the table

Yes, correct: pg_repack needs the heavyweight AccessExclusive lock, but only temporarily. pg_repack attempts to gain an ACCESS EXCLUSIVE lock on the table by executing a statement as follows:

LOCK TABLE <tablename> IN ACCESS EXCLUSIVE MODE

An ACCESS EXCLUSIVE lock requires that no other sessions are accessing the table, not even with a SELECT query. pg_repack waits for a “wait-time” (60 seconds by default), which can be changed using the optional parameter --wait-timeout. Once this wait-time is over, pg_repack starts trying to cancel the conflicting statements. Users may see messages as follows for each of the attempts:

WARNING: canceling conflicted backends
WARNING: canceling conflicted backends
...

So the point to be noted is: please avoid running pg_repack when there is a lot of concurrent activity on the table, because new sessions are still allowed to take conflicting locks like ACCESS SHARE concurrently, and a session waiting for an ACCESS EXCLUSIVE lock may need to wait indefinitely. We should select a proper low-activity time window for pg_repack to work in.

These cancel attempts will continue for another round of “wait-timeout” seconds. If, even after this second round of wait-timeout, the AccessExclusive lock is not obtained, pg_repack escalates to terminating every conflicting session. So pg_repack will terminate sessions if the total wait time exceeds double the “wait-time”. This could be problematic for application connections that get abruptly terminated, and can lead to outages if the application layer does not handle it gracefully.

pg_repack may emit a message as follows:

WARNING: terminating conflicted backends

And there will be PostgreSQL log entries as follows for each of the sessions that is terminated:

2021-06-17 06:28:46.317 UTC [2761] FATAL: terminating connection due to administrator command
2021-06-17 06:28:46.317 UTC [2758] FATAL: terminating connection due to administrator command

So the point to be noted is: pg_repack can terminate sessions if they stand in the way of obtaining the AccessExclusive lock, which can lead to an unexpected outage or misbehavior of the application.

If double the wait-time is crossed, pg_repack may just proceed to terminate the sessions. But again, this behavior can be controlled using the parameter --no-kill-backend. If this parameter is specified, pg_repack will respect all concurrent sessions and cancel itself instead of attempting to cancel or terminate other sessions. In that case, pg_repack may emit a message as follows:

WARNING: timed out, do not cancel conflicting backends
INFO: Skipping repack public.t1 due to timeout

I believe this is more desirable in production systems. So the point to be noted is:

Always remember to specify --no-kill-backend whenever you deal with critical systems.
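
For example, a more conservative invocation might look like this (the database and table names are placeholders; --wait-timeout and --no-kill-backend are the options discussed above):

pg_repack --dbname=mydb --table=public.t1 --wait-timeout=300 --no-kill-backend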

While it is holding the AccessExclusive lock, pg_repack creates all the temporary objects, including the substitute table with the same structure.

It creates a primary key TYPE as per the original table. For example, if we are repacking a table with a primary key of a single field “id”, the primary key type definition would look like:

CREATE TYPE repack.pk_16423 AS (id integer)

This TYPE definition is useful because there can be composite keys. The following is an example of it dealing with a table that has a composite key:

CREATE TYPE repack.pk_16824 AS (id integer, id1 integer, id2 timestamp without time zone);

Then it proceeds to create a “log” table to capture all the data changes (CDC: change data capture) during the pg_repack operation. This table has a primary key of BIGINT data type, a column of the primary key TYPE of the original table created in the step above, and a “row” column of the row type of the table we are repacking, which can hold the entire tuple of that table. This log table definition is simple because of the TYPE definition and the “row” type; here is an example:

CREATE TABLE repack.log_16423 (id bigserial PRIMARY KEY, pk repack.pk_16423, row public.t1)

Now pg_repack creates a trigger on the table to be repacked so that whenever there is DML on the table, corresponding information needs to be captured to the log table created above. This is done using an AFTER INSERT OR DELETE OR UPDATE trigger. For example:

CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.t1 FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('INSERT INTO repack.log_16423(pk, row) VALUES( CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.id)::repack.pk_16423) END, $2)')

Since pg_repack is holding the AccessExclusive lock at this stage, there won’t be any concurrent DML yet; that is going to change in the following stages.

pg_repack plays an important trick at this stage, before releasing the AccessExclusive lock, as the comment in the source code says:

/* While we are still holding an AccessExclusive lock on the table, submit
* the request for an AccessShare lock asynchronously from conn2.
* We want to submit this query in conn2 while connection's
* transaction still holds its lock, so that no DDL may sneak in
* between the time that connection commits and conn2 gets its lock.
*/

Yes, pg_repack uses a second connection to the database and sends an AccessShare lock request through it. This prevents any DDL from sneaking in during the switch to the AccessShare lock: the moment the main connection commits, the AccessShare lock will be granted to the second connection. So pg_repack uses two database connections to carry out its work.

But still, there is a chance that some DDL can interfere. So pg_repack kills any concurrent DDL against the table by default.

Once this stage is complete, pg_repack can proceed to release the AccessExclusive lock on the first connection by COMMITting the transaction, so that the AccessShare lock request by the second connection is granted. This COMMIT is special in that the log table and the trigger on the table are committed and become visible to the entire system from this point onwards.

Copying rows/tuples to a temporary table

Copying the tuples from the table to a new table is performed with SERIALIZABLE isolation, because there shouldn’t be any inconsistency between the data getting into the log table and the temporary, substitute table which pg_repack is going to create:

BEGIN ISOLATION LEVEL SERIALIZABLE

Since the AccessExclusive lock is removed, concurrent sessions can proceed with their DMLs and Select queries. Only DDLs will be blocked. So we can say that the table is available for transactions and queries.

Prior to the data copy, the log table is emptied, because pg_repack only needs the log data captured from the start of the data copy:

DELETE FROM repack.log_16423

Again, pg_repack will attempt to kill any session which might be waiting to do a DDL, and takes an AccessShare lock on the table. Since the other connection is already holding an AccessShare lock, this can be obtained without much problem.

At this stage, pg_repack creates a substitute table that is going to replace the original table, with the exact same structure as the original but without any data. It is a CTAS statement, something like:

CREATE TABLE repack.table_16423... AS SELECT col1,col2,col3... FROM ONLY public.t1 WITH NO DATA

followed by the data copy:

INSERT INTO repack.table_16423 SELECT col1,col2,co3... FROM ONLY public.t1

Once the data copy is over, the transaction is COMMITted. This completes the heaviest stage of repacking in terms of load and WAL generation.

Indexes and keys are then created on this temporary, substitute table.

Apply the CDC log to a temporary table

Please remember that pg_repack’s main connection is not holding any lock on the table at this stage (other than the second connection’s AccessShare lock), so there is nothing blocking transactions (DML). Depending on the time the data copy in the previous stage took, and the concurrent transactions during the data copy, there could be a lot of CDC (change data capture) entries in the log table. These need to be copied to the new temporary/substitute table.

This logic is implemented as a C function in pg_repack; you may refer to the source code of repack_apply. It reads all the data from the log table and processes the INSERTs, UPDATEs, and DELETEs. In order to speed up the repeated operations, prepared statements are used. Finally, all the processed data is deleted from the log table.

Swapping the original table with a temporary table

This is performed by the second connection, because it already holds an AccessShare lock; it escalates that lock to an AccessExclusive lock:

LOCK TABLE public.t1 IN ACCESS EXCLUSIVE MODE

The CDC apply will be performed once again (The same “repack_apply”) while holding the AccessExclusive lock on the table. So if there is any new entry that appears in the log table, that also will be processed.

The original table and the temporary table are then swapped quickly by executing the repack_swap function:

SELECT repack.repack_swap('16423');

This is the most beautiful and powerful part of pg_repack, implemented in the C function repack_swap. Not just the tables are swapped: ownership, associated TOAST (table and index), indexes, and dependencies are also swapped. OIDs are swapped so that the OID of the table remains the same even after pg_repack. The swapping work completes with a COMMIT.
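
This is easy to verify yourself: note the table’s OID before the repack and check it again afterward (a small sketch using the table name from this post’s examples):

psql -Atc "SELECT 'public.t1'::regclass::oid;"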

Final cleanup

pg_repack uses its built-in C function repack_drop for cleaning up all the temporary objects. To prevent any concurrent session from acquiring a lock on the table that could block the cleanup, an AccessExclusive lock is obtained before the cleanup. This is the third time an AccessExclusive lock is placed on the table.

Summary

pg_repack is one of the most powerful, popular, and useful extensions. We encourage its usage wherever applicable, with proper supervision, but please avoid overuse. As I tried to explain, we should expect a good amount of data movement from the original table to the temporary table, trigger writes into the log table, data copy from the log table to the temporary table, etc. So we should also expect higher WAL generation. Considering all the implications, pg_repack should be performed during a low-activity time window to avoid undesirable consequences.
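
If you want to quantify that WAL overhead on your own system, one rough approach is to note the WAL position just before the repack and compare it right after (a sketch; the database/table names are placeholders, and any concurrent activity will also be counted):

BEFORE=$(psql -Atc "SELECT pg_current_wal_lsn();")
pg_repack --dbname=mydb --table=public.t1 --no-kill-backend
psql -Atc "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '$BEFORE'));"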

Some of the important points to reiterate for end-users are:

  1. pg_repack needs to acquire the heavyweight AccessExclusive lock multiple times, but only temporarily.
  2. In a high-concurrency situation, it can be almost impossible to get an AccessExclusive lock.
  3. pg_repack, by default, will attempt to cancel the conflicting statements if it is not able to gain the AccessExclusive lock within the wait-time.
  4. It may proceed to terminate sessions if the total wait exceeds double the wait-time. This could lead to undesirable outcomes and outages.
  5. The defaults of pg_repack may not be good for critical systems. Use the --no-kill-backend option to make it gentler.
  6. No DDL is allowed against a table that is undergoing pg_repack, and any session that attempts one might get killed.

Jun 11, 2021

PostgreSQL HA with Patroni: Your Turn to Test Failure Scenarios


A couple of weeks ago, Jobin and I did a short presentation during Percona Live Online bearing a similar title to the one of this post: “PostgreSQL HA With Patroni: Looking at Failure Scenarios and How the Cluster Recovers From Them”. We deployed a 3-node PostgreSQL environment with some recycled hardware we had lying around and set about “breaking” it in different ways: by unplugging network and power cables, killing main processes, and attempting to saturate the processors. All of this while continuously writing and reading data from PostgreSQL. The idea was to see how Patroni would handle the failures and manage the cluster to continue delivering service. It was a fun demo!

We promised a follow-up post explaining how we set up the environment, so you could give it a try yourselves, and this is it. We hope you also have fun attempting to reproduce our small experiment, but mostly that you use it as an opportunity to learn how a PostgreSQL HA environment managed by Patroni works in practice: there is nothing like a hands-on lab for this!

Initial Setup

We recycled three 10-year-old Intel Atom mini-computers for our experiment, but you could use virtual machines instead: even though you will miss the excitement of unplugging real cables, this can still be simulated with a VM. We installed the server version of Ubuntu 20.04 and configured the machines to know “each other” by hostname; here’s what the hosts file of the first node looked like:

$ cat /etc/hosts
127.0.0.1 localhost node1
192.168.1.11 node1
192.168.1.12 node2
192.168.1.13 node3

etcd

Patroni supports a myriad of systems for its Distributed Configuration Store, but etcd remains a popular choice. We installed the version available from the Ubuntu repository on all three nodes:

sudo apt-get install etcd

It is necessary to initialize the etcd cluster from one of the nodes and we did that from node1 using the following configuration file:

$ cat /etc/default/etcd
ETCD_NAME=node1
ETCD_INITIAL_CLUSTER="node1=http://192.168.1.11:2380"
ETCD_INITIAL_CLUSTER_TOKEN="devops_token"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.11:2380"
ETCD_DATA_DIR="/var/lib/etcd/postgresql"
ETCD_LISTEN_PEER_URLS="http://192.168.1.11:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.1.11:2379,http://localhost:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.11:2379"

Note how ETCD_INITIAL_CLUSTER_STATE is defined with “new”.

We then restarted the service:

sudo systemctl restart etcd

We can then move on to install etcd on node2. The configuration file follows the same structure as that of node1, except that we are adding node2 to an existing cluster so we should indicate the other node(s):

ETCD_NAME=node2
ETCD_INITIAL_CLUSTER="node1=http://192.168.1.11:2380,node2=http://192.168.1.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="devops_token"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.12:2380"
ETCD_DATA_DIR="/var/lib/etcd/postgresql"
ETCD_LISTEN_PEER_URLS="http://192.168.1.12:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.1.12:2379,http://localhost:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.12:2379"

Before we restart the service, we need to formally add node2 to the etcd cluster by running the following command on node1:

sudo etcdctl member add node2 http://192.168.1.12:2380

We can then restart the etcd service on node2:

sudo systemctl restart etcd

The configuration file for node3 looks like this:

ETCD_NAME=node3
ETCD_INITIAL_CLUSTER="node1=http://192.168.1.11:2380,node2=http://192.168.1.12:2380,node3=http://192.168.1.13:2380"
ETCD_INITIAL_CLUSTER_TOKEN="devops_token"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.13:2380"
ETCD_DATA_DIR="/var/lib/etcd/postgresql"
ETCD_LISTEN_PEER_URLS="http://192.168.1.13:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.1.13:2379,http://localhost:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.13:2379"

Remember we need to add node3 to the cluster by running the following command on node1:

sudo etcdctl member add node3 http://192.168.1.13:2380

before we can restart the service on node3:

sudo systemctl restart etcd

We can verify the cluster state to confirm it has been deployed successfully by running the following command from any of the nodes:

$ sudo etcdctl member list
2ed43136d81039b4: name=node3 peerURLs=http://192.168.1.13:2380 clientURLs=http://192.168.1.13:2379 isLeader=false
d571a1ada5a5afcf: name=node1 peerURLs=http://192.168.1.11:2380 clientURLs=http://192.168.1.11:2379 isLeader=true
ecec6c549ebb23bc: name=node2 peerURLs=http://192.168.1.12:2380 clientURLs=http://192.168.1.12:2379 isLeader=false

As we can see above, node1 is the leader at this point, which is expected since the etcd cluster has been bootstrapped from it. If you get a different result, check for etcd entries logged to /var/log/syslog on each node.

Watchdog

Quoting Patroni’s manual:

Watchdog devices are software or hardware mechanisms that will reset the whole system when they do not get a keepalive heartbeat within a specified timeframe. This adds an additional layer of fail safe in case usual Patroni split-brain protection mechanisms fail.

While the use of a watchdog mechanism with Patroni is optional, you shouldn’t really consider deploying a PostgreSQL HA environment in production without it.

For our tests, we used the standard software implementation for watchdog that is shipped with Ubuntu 20.04, a module called softdog. Here’s the procedure we used in all three nodes to configure the module to load:

sudo sh -c 'echo "softdog" >> /etc/modules'

Patroni will be the component interacting with the watchdog device. Since Patroni is run by the postgres user, we need to either set the permissions of the watchdog device open enough so the postgres user can write to it or make the device owned by postgres itself, which we consider a safer approach (as it is more restrictive):

sudo sh -c 'echo "KERNEL==\"watchdog\", OWNER=\"postgres\", GROUP=\"postgres\"" >> /etc/udev/rules.d/61-watchdog.rules'

These two steps looked like all that would be required for the watchdog to work, but to our surprise, the softdog module wasn’t loaded after restarting the servers. After spending quite some time digging around, we found that the module was blacklisted by default and a stray file with such a directive was still lingering around:

$ grep blacklist /lib/modprobe.d/* /etc/modprobe.d/* |grep softdog
/lib/modprobe.d/blacklist_linux_5.4.0-72-generic.conf:blacklist softdog

Editing that file in each of the nodes to remove the line above and restarting the servers did the trick:

$ lsmod | grep softdog
softdog                16384  0

$ ls -l /dev/watchdog*
crw-rw---- 1 postgres postgres  10, 130 May 21 21:30 /dev/watchdog
crw------- 1 root     root     245,   0 May 21 21:30 /dev/watchdog0

PostgreSQL

Percona Distribution for PostgreSQL can be easily installed from the Percona Repository in a few easy steps:

sudo apt-get update -y; sudo apt-get install -y wget gnupg2 lsb-release curl
wget https://repo.percona.com/apt/percona-release_latest.generic_all.deb
sudo dpkg -i percona-release_latest.generic_all.deb
sudo apt-get update
sudo percona-release setup ppg-12
sudo apt-get install percona-postgresql-12

An important concept to understand in a PostgreSQL HA environment like this one is that PostgreSQL should not be started automatically by systemd during the server initialization: we should leave it to Patroni to fully manage it, including the process of starting and stopping the server. Thus, we should disable the service:

sudo systemctl disable postgresql

For our tests, we want to start with a fresh new PostgreSQL setup and let Patroni bootstrap the cluster, so we stop the server and remove the data directory that has been created as part of the PostgreSQL installation:

sudo systemctl stop postgresql
sudo rm -fr /var/lib/postgresql/12/main

These steps should be repeated in nodes 2 and 3 as well.

Patroni

The Percona Repository also includes a package for Patroni so with it already configured in the nodes we can install Patroni with a simple:

sudo apt-get install percona-patroni

Here’s the configuration file we have used for node1:

$ cat /etc/patroni/config.yml
scope: stampede
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd:
  host: node1:2379

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
#    master_start_timeout: 300
#    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        logging_collector: 'on'
        max_wal_senders: 5
        max_replication_slots: 5
        wal_log_hints: "on"
        #archive_mode: "on"
        #archive_timeout: 600
        #archive_command: "cp -f %p /home/postgres/archived/%f"
        #recovery_conf:
        #restore_command: cp /home/postgres/archived/%f %p

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  - host replication replicator 192.168.1.1/24 md5
  - host replication replicator 127.0.0.1/32 trust
  - host all all 192.168.1.1/24 md5
  - host all all 0.0.0.0/0 md5
#  - hostssl all all 0.0.0.0/0 md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
# post_init: /usr/local/bin/setup_cluster.sh
  # Some additional users users which needs to be created after initializing new cluster
  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: "/var/lib/postgresql/12/main"
  bin_dir: "/usr/lib/postgresql/12/bin"
#  config_dir:
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: vagrant
    superuser:
      username: postgres
      password: vagrant
  parameters:
    unix_socket_directories: '/var/run/postgresql'

watchdog:
  mode: required # Allowed values: off, automatic, required
  device: /dev/watchdog
  safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

With the configuration file in place, and now that we already have the etcd cluster up, all that is required is to restart the Patroni service:

sudo systemctl restart patroni

When Patroni starts, it will take care of initializing PostgreSQL (because the service is not currently running and the data directory is empty) following the directives in the bootstrap section of Patroni’s configuration file. If everything went according to the plan, you should be able to connect to PostgreSQL using the credentials in the configuration file (password is vagrant):

$ psql -U postgres
psql (12.6 (Ubuntu 2:12.6-2.focal))
Type "help" for help.

postgres=#

Repeat the operation to install Patroni on nodes 2 and 3: the only difference is that you will need to replace the references to node1 in the configuration file (there are four of them) with the respective node name.

You can also check the state of the Patroni cluster we just created with:

$ sudo patronictl -c /etc/patroni/config.yml list
+----------+--------+-------+--------+---------+----+-----------+
| Cluster  | Member |  Host |  Role  |  State  | TL | Lag in MB |
+----------+--------+-------+--------+---------+----+-----------+
| stampede | node1  | node1 | Leader | running |  2 |           |
| stampede | node2  | node2 |        | running |  2 |         0 |
| stampede | node3  | node3 |        | running |  2 |         0 |
+----------+--------+-------+--------+---------+----+-----------+

node1 started the Patroni cluster so it was automatically made the leader – and thus the primary/master PostgreSQL server. Nodes 2 and 3 are configured as read replicas (as the hot_standby option was enabled in Patroni’s configuration file).

HAProxy

A common implementation of high availability in a PostgreSQL environment makes use of a proxy: instead of connecting directly to the database server, the application will be connecting to the proxy instead, which will forward the request to PostgreSQL. When HAproxy is used for this, it is also possible to route read requests to one or more replicas, for load balancing. However, this is not a transparent process: the application needs to be aware of this and split read-only from read-write traffic itself. With HAproxy, this is done by providing two different ports for the application to connect. We opted for the following setup:

  • Writes → 5000
  • Reads → 5001

HAproxy can be installed as an independent server (and you can have as many as you want) but it can also be installed on the application server or the database server itself – it is a light enough service. For our tests, we planned on using our own Linux workstations (which also run Ubuntu 20.04) to simulate application traffic so we installed HAproxy on them:

sudo apt-get install haproxy

With the software installed, we modified the main configuration file as follows:

$ cat /etc/haproxy/haproxy.cfg
global
    maxconn 100

defaults
    log    global
    mode    tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

listen standbys
    balance roundrobin
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

Note there are two sections: primary, using port 5000, and standbys, using port 5001. All three nodes are included in both sections: that’s because they are all potential candidates to be either primary or secondary. For HAproxy to know which role each node currently has, it will send an HTTP request to port 8008 of the node: Patroni will answer. Patroni provides built-in REST API support for health check monitoring that integrates perfectly with HAproxy for this:

$ curl -s http://node1:8008
{"state": "running", "postmaster_start_time": "2021-05-24 14:50:11.707 UTC", "role": "master", "server_version": 120006, "cluster_unlocked": false, "xlog": {"location": 25615248}, "timeline": 1, "database_system_identifier": "6965869170583425899", "patroni": {"version": "1.6.4", "scope": "stampede"}}

We configured the standbys group to balance read-requests in a round-robin fashion, so each connection request (or reconnection) will alternate between the available replicas. We can test this in practice; let’s first save the postgres user password in a file to facilitate the process:

echo "localhost:5000:postgres:postgres:vagrant" > ~/.pgpass
echo "localhost:5001:postgres:postgres:vagrant" >> ~/.pgpass
chmod 0600 ~/.pgpass

We can then execute two read-requests to verify the round-robin mechanism is working as intended:

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"
 192.168.1.13

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"
 192.168.1.12

as well as test the writer access:

$ psql -Upostgres -hlocalhost -p5000 -t -c "select inet_server_addr()"
 192.168.1.11

You can also check the state of HAproxy by visiting http://localhost:7000/ on your browser.

Workload

To best simulate a production environment to test our failure scenarios, we wanted to have continuous reads and writes to the database. We could have used a benchmark tool such as Sysbench or Pgbench but we were more interested in observing the switch of source server upon a server failure than load itself. Jobin wrote a simple Python script that is perfect for this, HAtester. As was the case with HAproxy, we run the script from our Linux workstation. Since it is a Python script, you need to have a PostgreSQL driver for Python installed to execute it:

sudo apt-get install python3-psycopg2
curl -LO https://raw.githubusercontent.com/jobinau/pgscripts/main/patroni/HAtester.py
chmod +x HAtester.py

Edit the script with the credentials to access the PostgreSQL servers (through HAproxy) if you are using different settings from ours. The only requirement for it to work is to have the target table created beforehand, so first connect to the postgres database (unless you are using a different target) in the Primary and run:

CREATE TABLE HATEST (TM TIMESTAMP);

You can then start two different sessions:

  1. One for writes:

    ./HAtester.py 5000
  2. One for reads:
    ./HAtester.py 5001

The idea is to observe what happens with database traffic when the environment experiences a failure; that is, how HAproxy will route reads and writes as Patroni adjusts the PostgreSQL cluster. You can continuously monitor Patroni from the point of view of the nodes by opening a session in each of them and running the following command:

sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

To facilitate observability and better follow the changes in real-time, we used the terminal multiplexer Tmux to visualize all 5 sessions on the same screen:

  • On the left side, we have one session open for each of the 3 nodes, continuously running:

    sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

    It’s better to have the Patroni view for each node independently because when you start the failure tests you will lose connection to a part of the cluster.

  • On the right side, we are executing the HAtester.py script from our workstation:
    • Sending writes through port 5000:

      ./HAtester.py 5000
    • and reads through port 5001:

      ./HAtester.py 5001

A couple of notes on the execution of the HAtester.py script:

  • Pressing Ctrl+C will break the connection but the script will reconnect, this time to a different replica (in the case of reads) due to having the Standbys group on HAproxy configured with round-robin balancing.
  • When a switchover or failover takes place and the nodes are re-arranged in the cluster, you may temporarily see writes sent to a node that used to be a replica and was just promoted to primary, and reads sent to a node that used to be the primary and was demoted to secondary: that’s a limitation of the HAtester.py script, but “by design”; we favored faster reconnections and minimal checks on the node’s role for demonstration purposes. On a production application, this part ought to be implemented differently.

Testing Failure Scenarios

The fun part starts now! We leave it to you to test and play around to see what happens with the PostgreSQL cluster in practice following a failure. We leave as suggestions the tests we did in our presentation. For each failure scenario, observe how the cluster re-adjusts itself and the impact on read and write traffic.

1) Loss of Network Communication

  • Unplug the network cable from one of the nodes (or simulate this condition in your VM):
    • First from a replica
    • Then from the primary
  • Unplug the network cable from one replica and the primary at the same time:
    • Does Patroni experience a split-brain situation?

2) Power Outage

  • Unplug the power cable from the primary
  • Wait until the cluster is re-adjusted then plug the power cable back and start the node

3) SEGFAULT

Simulate an OOM/crash by killing the postmaster process in one of the nodes with kill -9.
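
One way to do this, assuming the data directory from the Patroni configuration above (the first line of postmaster.pid holds the postmaster PID):

sudo kill -9 $(sudo head -1 /var/lib/postgresql/12/main/postmaster.pid)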

4) Killing Patroni

Remember that Patroni is managing PostgreSQL. What happens if the Patroni process (and not PostgreSQL) is killed?

5) CPU Saturation

Simulate CPU saturation with a benchmark tool such as Sysbench, for example:

sysbench cpu --threads=10 --time=0 run

This one is a bit tricky, as the reads and writes are each single-threaded operations. You may need to decrease the priority of the HAtester.py processes with renice, and possibly increase that of Sysbench’s.
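
For example, something along these lines; the niceness values are arbitrary and only meant as a starting point:

sudo renice -n 10 -p $(pgrep -f HAtester.py)
sudo renice -n -5 -p $(pgrep -x sysbench)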

6) Manual Switchover

Patroni facilitates changes in the PostgreSQL hierarchy. Switchover operations can be scheduled, the command below is interactive and will prompt you with options:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover

Alternatively, you can be specific and tell Patroni exactly what to do:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover --master node1 --candidate node2 --force


We hope you had fun with this hands-on lab! If you have questions or comments, leave us a note in the comments section below!
