Jun
24
2024
--

Upgrading to the New Etcd Version From 3.3 for Patroni

We have been promoting and using Patroni as the best high availability framework for PostgreSQL, and Etcd was the preferred/recommended DCS for the Patroni cluster. Both Patroni and Etcd have been part of the PostgreSQL distribution from Percona for years now. But one area where we were stuck was the Etcd version, and we continued to […]

Jun
03
2024
--

Controlling Resource Consumption on a PostgreSQL Server Using Linux cgroup2

Multi-tenancy/co-hosting is always challenging. Running multiple PG instances could help to reduce the internal contention points (scalability issues) within PostgreSQL. However, the load caused by one of the tenants could affect other tenants, which is generally referred to as the “Noisy Neighbor” effect. Luckily, Linux allows users to control the resources consumed by each program […]

Apr
30
2024
--

LDAP Authentication in PgBouncer Through PAM

There are many cases where external connection poolers like pgBouncer become unavoidable despite the costs and complexities associated with them. PgBouncer is one of the most popular external connection poolers for PostgreSQL. It is thin and lightweight, so it doesn’t have built-in authentication features like LDAP, which is essential for many enterprises. Luckily, pgBouncer has […]

Dec
29
2023
--

Human Factors Behind Incidents: Why Settings Like “idle_session_timeout” Can Be a Bad Idea

idle_session_timeout

PostgreSQL 14 introduced the parameter idle_session_timeout, and, unfortunately, many DBAs jumped to start using it without understanding, or by ignoring, the consequences. In a short span of time, it has become one of the most misused parameters in many PostgreSQL installations. There is nothing wrong with idle_session_timeout from a technical perspective; even without this parameter, I know frustrated DBAs who run scheduled shell scripts to keep terminating old idle sessions. The new parameter just made it easy for them to nuke all idle connections. The problem is the collateral damage it can cause.

Note for novice DBAs: There are human and organizational factors for almost every incident, rather than just technical reasons. Trying to do a quick/dirty fix by changing parameters will have far-reaching implications. Not everything can be addressed on the database side. This is more evident in connection-related incidents. Blame games and bad decisions following one incident/outage can lead to further bad decisions and further problems.

For example, many such incidents will have a common root cause: something like an unexpected, sudden connection explosion — a big bang of connections, with hundreds or thousands of idle connections. The RCA investigation may end with only the technical side of the problem, such as loose limits that invited disaster. But the right questions to be addressed would be: Why do DBAs end up setting such loose limits? What are the factors that influenced and forced DBAs to do so? Unfortunately, these questions are addressed very rarely in many organizations.

Common factors contributing to connection-related outages

1. Unrestricted connection settings: Leaving connection settings, such as max_connections and per-user limits, wide open exposes the system to malicious Denial-of-Service (DoS) attacks. These attacks can rapidly consume available connections, leaving legitimate users unable to connect and disrupting system functionality.

2. Inadequate application-side pooling: Suboptimal application-side pooling can lead to opening a large number of connections, and they may remain open and unused, consuming valuable resources. These connections can accumulate, eventually exhausting server resources and causing the system to become unresponsive.

3. Unexpected connection leaks: Unanticipated connection leaks can also arise from programming errors or faulty application logic. These leaks can quickly deplete the available connection pool, hindering system performance and potentially leading to crashes.

4. Human factor: In an ideal case, there must be stringent restrictions and enforcement on the load and number of connections that can be onboarded to the database. Scrutiny should be as tight as airline security.

The problem generally begins as a conflict of interests. App-Dev often demands unrestricted, uncontrolled access to the database. A “NO” from the Ops/DBAs at the right time may save the entire system from outages. However, gatekeeping is not an easy job. Pressure builds up, and DBAs give up in most cases and allow settings that the database server cannot accommodate. The first decision goes wrong here. Obviously, an incident is expected down the line.

5. Organizational factor of wrong decisions: A surprising set of wrong decisions generally happens after the investigation that follows a major incident. The investigation could be pointing fingers at an excessive number of connections, and it’s common for Ops/DBAs to become responsible for managing the “idle” connections, because “it is a database problem”. I would say that entirely the wrong person takes charge.

The point forgotten is that connections are owned by clients but maintained by the database. It’s like a bank account: the customer owns it, but the bank maintains it. We shouldn’t be terminating or closing connections from the database side. Instead of investigating who owns them and addressing the problem at its root, DBAs often go for shortcuts like idle_session_timeout or a script to terminate all “idle” connections. The boundaries of ownership are violated, inviting the next level of problems.

There are a lot of misunderstandings about “idle” connections among the user community. We should remember that a connection will be “idle” immediately after it is established. It becomes “active” when there is an SQL statement to execute and goes back to “idle” when the processing is over. Most of the connections from any connection pooler will also be “idle” most of the time. So, there is nothing wrong with a connection being “idle”. The point I want to make is that a connection appearing as “idle” doesn’t make it a candidate for termination/killing. This may sound too obvious for an experienced user, but the reality is that I have ended up sitting on emergency calls to explain this.

What matters is the number of connections. The right question to ask in an RCA call is, “Who is responsible for creating the large number of connections that are idle most of the time?”

Common symptoms of poor connection management

  1. Gradual server performance degradation due to the high number of connections
  2. Often, the server becomes unresponsive and/or crashes
  3. The server is running out of memory, and OOM killer terminates PostgreSQL
  4. The server is running out of available connections
  5. Applications reporting connection failures and unexpected connection drops/crashes

Problems of backend termination

I often get the question, “What is the problem with terminating connections from backends?”. In layman’s terms, it is like a bank closing your account and taking all the money in it. That can come as a big surprise to the database client/application. Most applications are not designed and tested to handle such surprises. Even psql gets taken by surprise:

postgres=> \l+
FATAL:  terminating connection due to idle-session timeout
FATAL:  server conn crashed?
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

So unless the application is designed, developed, and tested for handling unexpected connection drops, it won’t be able to handle this. The majority of applications just exit with error reporting. In critical systems, this causes serious outages for business.

Despite being tailored for PostgreSQL, pgBouncer is not immune to sudden connection terminations:

2023-11-14 16:17:21.967 UTC [4630] WARNING S-0x10b7940: postgres/authuser@127.0.0.1:5432 got packet 'E' from server when not linked
2023-11-14 16:17:21.967 UTC [4630] LOG S-0x10b7940: postgres/authuser@127.0.0.1:5432 closing because: server conn crashed? (age=218s)
2023-11-14 16:17:22.927 UTC [4630] LOG stats: 0 xacts/s, 0 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 0 B/s, out 11 B/s, xact 0 us, query 0 us, wait 12 us
2023-11-14 16:17:40.533 UTC [4630] LOG S-0x10b7bf8: postgres/pgbtestapp@[::1]:5432 closing because: server conn crashed? (age=236s)
2023-11-14 16:17:40.533 UTC [4630] LOG C-0x10af130: postgres/pgbtestapp@[::1]:39470 closing because: server conn crashed? (age=28s)
2023-11-14 16:17:40.533 UTC [4630] WARNING C-0x10af130: postgres/pgbtestapp@[::1]:39470 pooler error: server conn crashed?
2023-11-14 16:18:06.370 UTC [4630] LOG S-0x10b7bf8: postgres/authuser@127.0.0.1:5432 new connection to server (from 127.0.0.1:46632)

Even if we ignore all other problems, the connection pooler will end up re-establishing pretty much every connection in the pool. This completely destroys the very purpose of the connection pooler.

It is not that the PostgreSQL community is not aware of this possible abuse of parameters. This has been discussed in the community, and the implications on the pooler are known and well documented.

However, users still overlook the implications, and serious mess-ups happen.

From the discussions in the PostgreSQL community, the only strong argument in favor of this parameter is one-off users who directly log in to the database to execute custom statements. They may accidentally leave their sessions open for a long time, which causes damage.

There were discussions about making the documentation more explicit, along the lines of: consider setting this for specific users instead of as a server default. Client connections managed by connection poolers or initiated indirectly, like those made by a remote server through postgres_fdw, should probably be excluded from this timeout.

What can we do about it without causing any serious consequences?

Obviously, DBAs may have questions like: what if user accounts / application connections are really abusive? As mentioned above, the solution is not just technical; understanding the system and implementing appropriate strategies is important.

Segregation of database accounts and role-based settings

Accounts used by individual users for interactive logins need to be differentiated from the accounts used by the application (the service accounts) in terms of user/role-level settings. PostgreSQL allows us to have parameter settings at different levels and scopes. More stringent restrictions should be placed on the interactive login accounts. It is very much OK to have settings like idle_session_timeout for those accounts, preferably not exceeding 5 minutes. Most importantly, idle_in_transaction_session_timeout should also not exceed a few seconds.
On the other hand, I would recommend NOT using idle_session_timeout for service accounts. But using idle_in_transaction_session_timeout of a couple of minutes is acceptable, as it helps us find problematic application logic.
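
For illustration, the role-level settings could look like the following sketch (the role names are hypothetical; adjust the timeouts to your own policies):

-- Interactive login account: stringent timeouts
ALTER ROLE jobin SET idle_session_timeout = '5min';
ALTER ROLE jobin SET idle_in_transaction_session_timeout = '10s';
-- Service account: no idle_session_timeout, only an in-transaction timeout
ALTER ROLE app_service SET idle_session_timeout = 0;
ALTER ROLE app_service SET idle_in_transaction_session_timeout = '2min';

A value of 0 disables the timeout for that role.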

Use network timeouts

There are alternative approaches if we suspect that the connections are just leaked without the application holding them.  TCP timeouts could be the best solution for such cases. PostgreSQL has TCP-related timeouts as a parameter option, which can be specified like any other PostgreSQL parameter option.

tcp_keepalives_idle: This parameter specifies the number of seconds of inactivity after which a keepalive message will be sent to the client by the PostgreSQL server. For example, if we set a value of 120, the server will send a keepalive message to the client after two minutes of inactivity and wait for a response. If there is an acknowledgment from the client, the connection is validated. What if there is no acknowledgment from the client? That is where the next parameter helps.

tcp_keepalives_interval:  This is the number of seconds after which the TCP keepalive message will be retransmitted if there is no acknowledgment from the client. For example, setting a value of 20 results in sending keepalive packets every 20 seconds until an acknowledgment is obtained. What if there is no acknowledgment for any of the keepalive attempts? This is where the next parameter comes into play.

tcp_keepalives_count: This is the number of keepalive messages that can be lost before the client connection is considered “dead” and terminated.

So, the overall settings summary discussed is:

tcp_keepalives_idle = 120
tcp_keepalives_interval = 20
tcp_keepalives_count = 6

But you may adjust the values according to the needs of your environment.
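
One possible way to apply these settings without editing postgresql.conf by hand is ALTER SYSTEM followed by a reload; a sketch using the example values above:

ALTER SYSTEM SET tcp_keepalives_idle = 120;
ALTER SYSTEM SET tcp_keepalives_interval = 20;
ALTER SYSTEM SET tcp_keepalives_count = 6;
SELECT pg_reload_conf();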

There is one more parameter, tcp_user_timeout, which has some documented caveats and reportedly affects other parameter values. So this is not something we should consider on day one.

From PostgreSQL 14 onwards, we have an additional option: client_connection_check_interval, which can detect whether the connection to the client has gone away. This is most useful if a client dies during a long-running statement, like the complex reporting queries of OLAP systems. By default, this check is off (value 0). Setting this value to 5 to 10 seconds will be of great value if you expect cases of lost client connectivity. Please note that the value is specified in milliseconds. PostgreSQL even allows us to set this value at the user level.
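
For example, a sketch of enabling the check globally, or only for a particular role (the role name is hypothetical):

ALTER SYSTEM SET client_connection_check_interval = '10s';
SELECT pg_reload_conf();
-- or scoped to one role:
ALTER ROLE olap_user SET client_connection_check_interval = '10s';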

With all the TCP timeouts in place, you may see entries like the following in the PostgreSQL logs when a timeout occurs.

2023-12-20 04:37:12.026 UTC [1230] LOG:  could not receive data from client: Connection timed out
2023-12-20 04:37:12.026 UTC [1230] LOG:  disconnection: session time: 0:22:23.419 user=jobin database=postgres host=fd42:8aba:4133:fb7d:216:3eff:fe3c:7bcb port=52984

Summary

There should be a cap on what a system can accept as a load. At the PostgreSQL level, it is the max_connections. I would consider it the primary duty of any Production DBA to protect the system from any possible abuse.

Dear DBAs, Please consider the following points to keep the system stable and protect it from any abuses.

  1. Configure connection settings appropriately. Set connection limits, such as max_connections and per-user limits, to reasonable values that align with expected usage patterns. Regularly review and adjust these settings as needed. The max_connections setting of a PostgreSQL instance should not be too far from 10x of available CPUs.
  2. Encourage the use of robust application-side pooling: Implement well-configured application-side pooling mechanisms to manage connections and prevent leaks efficiently. Employ libraries or frameworks that provide reliable pooling capabilities. In case the application framework doesn’t have a good connection pooler, consider external connection poolers like pgBouncer.
  3. Connections are owned by those who create them. If there is an excessive number of idle connections, the part of the system that is creating those connections needs to be fixed. Terminating them silently from the backend is not a solution; it only invites more problems.
  4. Never terminate connections from the backend unless there is a request to do so from the application/client who owns the connections.
  5. Avoid using parameters like idle_session_timeout at the global level; use it only at the user/role level, especially for interactive login accounts.
  6. Avoid idle_session_timeout for application/service accounts. It can be used for accounts that are used for interactive logins where the consequence of timeout is predictable.
  7. Use the alternative options discussed above, such as TCP timeouts, to handle leaked connections.
  8. Monitor the connections; that information is precious.

And most importantly, the crucial part. Empower yourself or your DBAs with the ability to say NO to any connection abuses. Junior DBAs are more vulnerable to pressure from other parts of organizations to take the wrong steps.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Aug
30
2023
--

PostgreSQL Parameters: Scope and Priority Users Should Know

postgresql parameters

PostgreSQL allows its users to set parameters at different scopes, and the same parameter can be specified at different places and using different methods. And there could be conflicts. Someone might be wondering why certain changes are not coming into effect, so it is important to understand/recollect the scope and priority of settings.

In this blog, I am trying to list the options available for users and list them in increasing order of priority. The purpose is to give a high-level view to users.

1. Compile time parameter settings

These are the set of parameters that are set at the time of compilation. This acts as the default value for PostgreSQL. We can check these values in the boot_val field of pg_settings.

select name,boot_val from pg_settings;

These compile-time settings have the least priority and can be overridden at any other level. However, some of these parameters cannot be modified by any other means. Changing these values at compile time is not intended for common use. If a PostgreSQL user wants to change these values, they need to recompile PostgreSQL from the source code. Some are exposed through configure command-line options. Some such configuration options are:
--with-blocksize=<BLOCKSIZE> sets the table block size in kB. The default is 8kB.
--with-segsize=<SEGSIZE> sets the table segment size in GB. The default is 1GB. This means PostgreSQL creates a new file in the data directory as the table size exceeds 1GB.
--with-wal-blocksize=<BLOCKSIZE> sets the WAL block size in kB; the default is 8kB.
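
For those building from source, a hypothetical build with non-default sizes could look like this (the values are purely illustrative):

./configure --with-blocksize=16 --with-segsize=2
make
make install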

Most of the parameters have compile time defaults. That is the reason why we can start running PostgreSQL by specifying a very minimal number of parameter values.

2. Data directory/initialization-specific parameter settings

Parameters can also be specified at the data directory initialization time.  Some of these parameters cannot be changed by other means or are difficult to change.

For example, wal_segment_size, which determines the WAL segment file size, is such a parameter. PostgreSQL generates WAL segment files of 16MB by default, and this can be specified only at initialization time. This is also the level at which the decision on whether to use data_checksums needs to be taken. It can be changed later using the pg_checksums utility, but that will be a painful exercise on a big database.

The default character encoding and locale settings to be used can be specified at this level. But this can be specified at the subsequent levels also. You may refer to initdb options for more information:  https://www.postgresql.org/docs/current/app-initdb.html.
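
A sketch of such an initialization, assuming an example data directory path, could be:

initdb -D /var/lib/pgsql/data --data-checksums --wal-segsize=64 --encoding=UTF8

Here --wal-segsize is specified in megabytes, and --data-checksums enables checksums from day one instead of a later pg_checksums run.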

The parameters taken from the specific data directory initialization, which override the built-in defaults, can be checked like this:

select name,setting from pg_settings where source='override';

This override includes some of the calculated auto-tune values for that environment.

3. PostgreSQL parameters set by environment variables

PostgreSQL executables, including the postmaster, honor many environment variables, but they are generally used by client tools. The most common one used by the PostgreSQL server (postmaster) is PGDATA, which sets the parameter data_directory. These variables can be specified by service managers like systemd.

$ cat /usr/lib/systemd/system/postgresql-14.service
...
Environment=PGDATA=/var/lib/pgsql/14/data/
…
Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
Environment=PG_OOM_ADJUST_VALUE=0

For many automation/scripting tasks, this will be handy. Here is an example:

$ export PGDATA=/home/postgres/data
$ export PGPORT=5434
$ pg_ctl start
waiting for server to start....2023-08-04 06:53:09.637 UTC [5787] LOG:  starting PostgreSQL 15.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2023-08-04 06:53:09.637 UTC [5787] LOG:  listening on IPv6 address "::1", port 5434
2023-08-04 06:53:09.637 UTC [5787] LOG:  listening on IPv4 address "127.0.0.1", port 5434
2023-08-04 06:53:09.639 UTC [5787] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5434"

As we can see, PostgreSQL took the port as 5434.

4. Configuration files

Probably, this is the method every novice user will be aware of. The fundamental configuration file is postgresql.conf, and it is the most common place to have global settings. PostgreSQL looks for a configuration file in PGDATA by default, but an alternate location can be specified using the postmaster command-line parameter config_file. The parameter specifications can be split into multiple files and directories because PostgreSQL supports the include and include_dir directives in the configuration files. So, there can be nested/cascaded configurations.

PostgreSQL rereads all its configuration files when it receives a SIGHUP signal. If the same parameter is set in multiple locations, the one read last wins. Among all configuration files, postgresql.auto.conf gets the highest priority because it is the last file to be read. That is where all “ALTER SYSTEM SET/RESET” commands keep their information.
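
For example, an ALTER SYSTEM change (the parameter and value here are chosen only for illustration) lands in postgresql.auto.conf and takes effect on reload:

ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();   -- equivalent to sending SIGHUP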

5. Command line argument to postmaster

The postmaster, aka postgres, has a feature to set parameters as command-line arguments (it has features to get the values, too). This is one of the most reliable methods used by many external tools to manage the PostgreSQL service. For example, the high availability solution Patroni passes some of the most critical parameters as command-line arguments. Here is how the postgres process with its command-line options looks in a Patroni environment:

/usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data --config-file=/var/lib/pgsql/14/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 
 --cluster_name=kc_primary_cluster --wal_level=replica --hot_standby=on --max_connections=100 --max_wal_senders=5 --max_prepared_transactions=0 
 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on

So Patroni can ensure that no local configuration mistake can adversely affect the availability and stability of the cluster. But changing these is possible only at server startup. Obviously, they take precedence over the values from the configuration files, and the scope is the whole instance. This answers many Patroni users’ questions about why they cannot change some of the parameters directly in the parameter file. PostgreSQL users can check which parameters came as command-line arguments like this:

postgres=# select name,setting from pg_settings where source='command line';
       	name        	      |  setting  	 
------------------------------+--------------------
 cluster_name          	      | perconapg_cluster
 hot_standby           	      | on
 listen_addresses      	      | 0.0.0.0
 max_connections       	      | 100
 max_locks_per_transaction    | 64
 max_prepared_transactions    | 0
 max_replication_slots 	      | 10
 max_wal_senders       	      | 5
 max_worker_processes  	      | 8
 port                  	      | 5432
 track_commit_timestamp	      | off
 wal_level             	      | replica
 wal_log_hints         	      | on
(13 rows)

Parameter specifications up to this level can have the “postmaster” context. The concept of “context” is discussed as part of the next section.

6. Database level setting

All options discussed so far have a global scope. Meaning they are applicable for the entire instance. But there could be reasons why a PostgreSQL user wants to change that at a specific database level. For example, one of the databases might be handling an OLTP workload where query parallelism may not be really needed and may have an adverse impact. But another database might be an OLAP system.

postgres=# ALTER DATABASE newdb SET max_parallel_workers_per_gather = 4;
ALTER DATABASE

The concept of context

At this stage, we should recollect another concept: the “context” of a parameter. For example, the network port at which PostgreSQL listens cannot be changed at the individual database level; a change of such parameters requires a PostgreSQL restart. So we say that the context of the parameter port is postmaster, because a change to such parameters requires the postmaster — the main process of PostgreSQL — to restart. Please refer to the documentation https://www.postgresql.org/docs/current/view-pg-settings.html to understand the different contexts of PostgreSQL parameters. From this level onwards, we won’t be allowed to change a certain set of parameters; any attempt will be prevented.

postgres=# ALTER DATABASE db1 SET max_connections=100;
ERROR:  parameter "max_connections" cannot be changed without restarting the server

max_connections is something that must be specified at the global (instance) level by the postmaster, and changing it requires a restart.

There is another set of parameters that can be changed without restarting the server but still needs to be communicated through the postmaster. That context is called sighup: we can signal the postmaster, and it will re-read such parameters and propagate them to all its child processes. Changing them at the database level is also prevented.

postgres=# ALTER DATABASE db1 SET log_filename='postgresql-DB1.log';
ERROR:  parameter "log_filename" cannot be changed now

You may even consider looking at the PostgreSQL source code:

https://github.com/postgres/postgres/blob/6fde2d9a005a5bc04aa059d3faeb865c8dd322ce/src/backend/utils/misc/guc.c#L3376

for a much deeper understanding of the logic of “context” and what is allowed in which level.

7. User-level settings

Each user can have their preferred parameter settings so that all sessions created by that user will have that setting in place.  Please remember that this user-level setting has a higher preference than database-level settings. Users can check their own user-level settings, like this:

select name,setting,source,context from pg_settings where  source='user';
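
As a hypothetical example, a reporting user that benefits from more working memory could get a user-level setting like this:

ALTER USER report_user SET work_mem = '256MB';

New sessions created by report_user will then pick up this value, which can be verified with the query above.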

8. Database – user combination

PostgreSQL allows us to have parameter settings that will be applicable when a particular user/role connects to a particular database.

For example:

ALTER USER admin IN DATABASE db1 SET max_parallel_workers_per_gather=6;

Setting at this level has even higher priority than everything mentioned before.

select name,setting,source,context from pg_settings where  name='max_parallel_workers_per_gather';
          	name           	| setting |	source 	| context
---------------------------------+---------+---------------+---------
 max_parallel_workers_per_gather | 6   	| database user | user
(1 row)

9. Parameters by the client connection request

There is an option to specify parameters while making a new connection. It can be passed to PostgreSQL as part of the connection string.

For example, I want to connect to the database to perform some bulk data loading and manipulation (ETL), and I don’t want to wait for any WAL writing. If, at all, there is any crash in between, I am OK to perform the ETL again. So, I am going to request a connection with synchronous_commit off.

$ psql "host=localhost user=postgres options='-c synchronous_commit=off'"
psql (14.8)
Type "help" for help.

postgres=# select name,setting,source,context from pg_settings where  name='synchronous_commit';
    	name        | setting   | source | context
--------------------+-----------+--------+---------
 synchronous_commit | off 	| client | user
(1 row)

10. Session-level setting

Each session can decide on its own settings at any point during execution and is allowed to modify these session-level settings as and when required.

postgres=# set jit=off;
SET
postgres=# select name,setting,source,context from pg_settings where  name='jit';
 name | setting | source  | context
------+---------+---------+---------
 jit  | off 	| session | user
(1 row)

A good use case: suppose we are going to rebuild a big index and know it is going to use considerable maintenance_work_mem. Setting this at the session level simplifies our life without affecting other sessions.

set maintenance_work_mem = '4GB';

11. Transaction-level settings

PostgreSQL allows us to specify parameters at a very small scope, like transaction level.

Here is an example of discouraging sequential scans in a particular transaction block.

postgres=# BEGIN;
BEGIN
postgres=*# SET LOCAL enable_seqscan=off;
SET

I prefer transaction-level settings because the changes are very local to the transaction and are reverted once the transaction is completed. This is the most preferred place to set work_mem to minimize its impact.
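
A minimal sketch of that pattern (the value is only an example):

BEGIN;
SET LOCAL work_mem = '512MB';
-- run the memory-hungry statement here
COMMIT;

SET LOCAL reverts automatically at COMMIT or ROLLBACK.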

12. Object-level settings

PostgreSQL allows us to specify the parameter specific to a program block, like a PL/pgSQL function. So, the setting goes as part of the function definition.

Here is an example of the function definition to test the function-level settings.

CREATE OR REPLACE FUNCTION checkParams()
 RETURNS BOOLEAN
as $$
DECLARE
  nm TEXT;
  setng TEXT;
  cntxt TEXT;
  src TEXT;
BEGIN
SELECT name,setting,context,source INTO nm,setng,cntxt,src from pg_settings where  name='enable_partitionwise_join';
RAISE NOTICE 'Parameter Name: % value:%  Context:%  Source:%',nm,setng,cntxt,src;
RETURN true;
END;
$$ LANGUAGE plpgsql
SET enable_partitionwise_join = 'on'
SET enable_seqscan = 'off';

PostgreSQL parameters summary

PostgreSQL parameter specification is very flexible and powerful, so understanding the scope, context, and priority is important for every user. A rule of thumb could be the broader the scope, the lower the priority.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Jul
20
2023
--

How To Measure the Network Impact on PostgreSQL Performance

Network Impact on PostgreSQL Performance

It is very common to see many infrastructure layers standing between a PostgreSQL database and the Application server.  The most common ones are connection poolers, load balancers, routers, firewalls, etc.  We often forget or take for granted the network hops involved and the additional overhead it creates on the overall performance. But it could cause severe performance penalties in many cases and overall throughput deterioration. I have been trying to get a good estimate of this overhead for quite some time. Previously I had written about how the volume of data transmission as part of SQL execution, as well as the cursor location, affects the overall performance.

Meanwhile, Hans-Jürgen Schönig’s presentation, which brought up the old discussion of Unix socket vs. TCP/IP connection, triggered me to write about other aspects of network impact on performance. He demonstrated a specific case of a 2x performance degradation while using TCP/IP connection.

How to detect and measure the impact

There is no easy mechanism for measuring the impact of network overhead. But a close analysis of the wait events from pg_stat_activity can tell us the story as closely as possible, so we should be sampling the wait events. Many methods exist for wait-event sampling, including extensions, but I prefer not to install special tools or extensions on the user environment for wait event sampling. At Percona Support, we use pg_gather to collect and study the wait events because it is a standalone SQL script and doesn’t require installing anything on the database systems. It is designed to be very lightweight as well; 2,000 samples are collected per session.
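
If no tooling is available at all, a rough manual sample can be taken by repeatedly running a query like the one below against pg_stat_activity (a simplistic sketch, nowhere near as complete as pg_gather’s sampling):

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
GROUP BY 1, 2
ORDER BY 3 DESC;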

pg_gather analysis report can show wait events and other information associated with each session.


But I will be discussing and highlighting only the wait events portion of it in this blog while going through different types of workloads and how network performance shows up in wait events.

Case 1: Query retrieving a large number of rows

Let’s consider the case of pg_dump taking a logical backup on a different machine.  If the network is fast, we may see a good amount of CPU utilization and “DataFileRead” as a wait event.

Of course, there are “ClientWrite” events, which is the wait event related to writing the data to the client (pg_dump, in this case). If the client is a lightweight tool like psql and the network is really fast, the “ClientWrite” may even become invisible.

But let’s see what the wait events look like if the network slows down.

We can see that the CPU utilization and “DataFileRead” wait events have dropped, indicating that the overall server-side session activity has slowed down. At the same time, “ClientWrite” has shot up to 1821, indicating that the session is spending considerable time sending data to its client side (pg_dump). There is also “ClientRead,” indicating that the acknowledgment from pg_dump is taking time.

The spike in “ClientWrite” does not depend on the client tool. Following is the screenshot of a regular psql session for a query retrieving a large number of records.

This excessive “ClientWrite” is good enough to spot the problem in these cases.

Case 2: Bulk data loading

This is the opposite of the previous case, but PostgreSQL has a lot more work to do for a write operation with bulk data. The following wait events were captured on a really fast/low-latency network.

Obviously, the PostgreSQL process has to spend time in “DataFileExtend,” “WALWrite,” and “WALSync.” Now, if the network slows down, many of those wait events we saw may become invisible as the performance bottleneck emerges.

The following are the wait events from the same bulk data load over a slower network.

As we can see, “ClientRead” has become the major wait event. This means the server session is spending more time reading from its client.

The change may not be dramatic in many systems, but overall “ClientRead” has become more prominent.

Case 3: Impact on transactions

One might ask what is so special about transactions. On an OLTP workload, statements could be too simple and small to cause any observable network impact. But the back-and-forth communication between the server and the client can result in unwanted delays between statements and the final commit or rollback. Yes, I mean delays/gaps between each statement.

The following are the wait events from a fast network and microtransactions using pgbench.

Obviously, there are high WAL-related wait events and CPU usage. But we can see there is considerable “ClientRead” also. This happens because there will be a lot of network interactions for microtransactions. ClientRead is unavoidable for transactions, and it’s OK to expect 5-10% of it.

But as the network slows down, the “ClientRead” becomes increasingly important. Following is the information from the same pgbench transaction workload over a slower network.

The ClientRead became the biggest wait event in this case.

You might wonder what the “Net/Delay*” that shows up is. This additional analysis is available in the new version of pg_gather (version 21) to assess the delay outside the transaction block. See the next section for details.

Case 4: Connection utilization

As the network latency increases, the client connection won’t be able to use the server session to the extent possible. The server session has to either wait on “ClientRead”/”ClientWrite” or sit idle. Either way, it can drastically affect the throughput of the system.

Within a transaction, the delay is captured as “ClientRead,” but the delay between two transactions is not captured because the session becomes “idle” momentarily. The new version of pg_gather estimates these momentary switches to idle, reporting the wasted server time as “Net/Delay*.” It could be due to network delays or poor application response; from the database side, it is difficult to distinguish between them. But the “Net/Delay*” can give a good idea about how much server time is wasted.

If it is possible to install PostgreSQL client tools on the application server, it is easy to simulate a load and study both network delay and application side response delay and compare that with the actual data.

The delay/latency becomes more visible when there is a lot of back-and-forth communication between the client and server.  This can be easily tested by creating a single statement file. (Idea taken from Hans-Jürgen.)

echo "SELECT 1" > query.sql

This can be executed against a remote database over a TCP connection for a specified number of seconds.

$ pgbench -h 10.197.42.1 -T 20 -f query.sql

On a fast network between my servers, I could get the following result as TPS of a single session.

…
latency average = 0.030 ms
initial connection time = 5.882 ms
tps = 32882.734311 (without initial connection time)

But the wait event analysis by pg_gather tells me that more time is spent on Net/Delay*.

It makes sense because “SELECT 1” doesn’t have much to do at the server, and this workload is all about sending back-and-forth communication.

With a local Unix socket connection, the single session throughput increased by more than double!

latency average = 0.013 ms
initial connection time = 1.498 ms
tps = 75972.733205 (without initial connection time)

But the wait event analysis tells us that still, the client-server communication is a major time consumer. 🙁

This kind of highly interactive workload could benefit from server-side programming (stored procedures/functions) or even an extension. Interestingly, the CPU usage is proportionally lower compared to the TPS when a Unix socket connection is used; that’s an important point to note. “ClientRead” increased because more data was transferred from the client.

If the network slows down in this case, the “Net/Delay*” also increases, and CPU usage and TPS drop because the session spends more time doing nothing between processing two statements.

Since this particular workload doesn’t have transactions and less data to send to the server, the “ClientRead” can drop to an unnoticeable level, as we see.

Summary

The “wait events” information from pg_stat_activity can tell us many details about performance and network congestion. Not just the sum of events; the gap between two wait events and the overall pattern also carry a lot of information to dig into. Properly collected and analyzed data tells the story from the PostgreSQL perspective and how it experiences the network. More importantly, the analysis becomes independent of database hosting or OS-level tools. No sophisticated tools or frameworks are required to achieve this; standalone SQL scripts like this one can be handy in spotting problems and bottlenecks. Even though this blog post is specific to the network, the wait event analysis can be applied to many other cases.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Apr
24
2023
--

PostgreSQL Indexes Can Hurt You: Negative Effects and the Costs Involved

PostgreSQL Indexes

Indexes are generally considered to be the panacea when it comes to SQL performance tuning, and PostgreSQL supports different types of indexes catering to different use cases. I keep seeing many articles and talks on “tuning” discussing how creating new indexes speeds up SQL but rarely ones discussing removing them. The urge to create more and more indexes is found to be causing severe damage in many systems. Many times, removing indexes is what we should be doing first before considering any new indexes for the benefit of the entire system. Surprised? Understanding the consequences and overhead of indexes can help to make an informed decision and potentially save the system from many potential problems.

At a very basic level, we should remember that indexes are not free of cost. The benefits come with a cost in terms of performance and resource consumption. The following is the list of ten problems/overheads that the excessive use of indexes can cause. This post is about PostgreSQL, but most of the problems also apply to other database systems.

1. Indexes penalize the transactions

We might see an improvement in the performance of a SELECT statement after adding an index. But we should not forget that the performance gains come with a cost to the transactions on the same table. Conceptually, every DML on a table needs to update all the indexes of the table. Even though there are a lot of optimizations for reducing the write amplification, it is a considerable overhead.

For example, let’s assume that there are five indexes on a table; every INSERT into the table will result in an INSERT of the index record on those five indexes. Logically, five index pages also will be updated. So effectively, the overhead is 5x.

2. Memory usage

Index pages must be in memory, regardless of whether any query uses them, because they need to get updated by transactions. Effectively, less memory becomes available for the pages of the table itself. The more indexes, the more memory is required for effective caching. If we don’t increase the available memory, this starts hurting the entire performance of the system.

3. Random writes: Updating indexes is more costly

Unlike INSERTs of new records into tables, index entries are less likely to be inserted into the same page. Indexes such as B-Tree indexes are known to cause more random writes.

4. Indexes need more cache than tables

Due to random writes and reads, indexes need more pages to be in the cache. Cache requirements for indexes are generally much higher than associated tables.

5. WAL generation

In addition to WAL records of the table updates, there will also be WAL records for indexes. This helps in crash recovery and replication. If you are using any wait event analysis tools/scripts like pg_gather, the overhead of the WAL generation will be clearly visible. The actual impact depends on the index type.

WAL

This is a synthetic test case, but if WAL-related wait events appear as any of the top wait events, it is a matter of concern for a transaction system, and we should take every step to address it.

 

Download Percona Distribution for PostgreSQL Today!

 

6. More and more I/O

Not just WAL records are generated; we will have more pages dirtied, as well. As the index pages get dirtied, they must be written back to files, leading to more I/O again—the “DataFileWrite” wait event, as seen in the previous screenshot.

Another side effect is indexes increase the total Active-Dataset size. By “Active dataset,” I mean the tables and indexes which are frequently queried and used. As the size of the active dataset increases, the cache becomes less and less efficient. Less-effective cache results in more datafile read, so read I/O is increased. This is in addition to the read I/O required to bring the additional index pages from storage for specific queries.

Again the pg_gather report of another system with mainly select queries shows this problem. As the Active-Dataset increases, PostgreSQL has no choice but to bring the pages from storage.

PostgreSQL Active-Dataset

A more significant percentage of “DataFileRead” sustaining for a longer duration indicates that the Active-Dataset is much bigger, which is not cachable.

7. Impact on VACUUM/AUTOVACUUM

The overhead is not only for inserting or updating index pages, as discussed in the previous points. There is overhead in maintaining it since the indexes also need cleanups of old tuple references.

I have seen cases where autovacuum workers on a single table run for a very long duration because of the size of the table and, most importantly, the excessive number of indexes on the table. In fact, it is widespread for users to see their autovacuum worker “stuck” for hours without showing any progress. This happens because index cleanup is an opaque stage of autovacuum and is not visible through views like pg_stat_progress_vacuum, other than the phase being indicated as vacuuming indexes.

Indexes can get bloated and become less efficient over time.  Periodic index maintenance (REINDEX) might be needed in many systems.

8. Tunnel vision while tuning

Tunnel vision is the loss of the wider field of view. A user may be concentrating on a particular SQL statement in an attempt to “tune” it and decide to create indexes for it. By creating an index to tune one query, we are shifting more system resources to that query. It may then give more performance to that particular statement while penalizing others.

But as we keep creating more and more indexes for tuning other queries, the resources will shift again towards other queries. This leads to a situation where the effort to tune every query penalizes every other query. Ultimately, everyone will be hurt, and only losers will be in this war. Someone trying to tune should consider how every part of the system can co-exist (maximizing business value) rather than absolute maximum performance for a particular query.

9. Greater storage requirement

Almost every day, I see cases where indexes take more storage than the tables themselves.

PostgreSQL index storage requirement

This may sound too silly for those with more money to spend on storage, but we should remember that this has a cascading effect. The total database size grows to a multiple of the actual data. So obviously, backups take more time, storage, and network resources, and then the same backup can put more load on the host machine. This would also increase the time to restore a backup and recover it. Bigger databases affect many things, including more time to build standby instances.

10. Indexes are more prone to corruption

I am not just talking about rarely occurring index-related bugs like silent index corruption of PostgreSQL 14 or index corruption due to glibc collation change, which keeps popping up now and then and affects many environments even today. Over decades of working with databases, I have observed that index corruptions are reported more frequently. (I hope anyone involved in PostgreSQL for years and who has seen hundreds of cases will agree with me). As we increase the number of indexes, we increase the probability.

What should we do?

A set of critical questions should accompany every new index consideration: Is it essential to have this index, or is it there just to speed up one query at the cost of yet another index? Is there a way to rewrite the query to get better performance? Is it OK to discard the small gains and live without the index?

Existing indexes also require a critical review over a period of time. All unused indexes (those indexes with idx_scan as zero in pg_stat_user_indexes) should be considered for dropping. Scripts like the one from pgexperts can help to do more analysis.
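
As a starting point, a query along these lines can list never-used indexes together with their size; review the results carefully (replicas, unique constraints, and recently reset statistics all matter) before dropping anything:

SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;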

The upcoming PostgreSQL 16 has one more column in pg_stat_user_indexes / pg_stat_all_indexes with the name last_idx_scan, which can tell us when the index was last used (a timestamp). This will help us to take a well-informed look at all the indexes in the system.

Summary

The summary in simple words: Indexes are not cheap. There is a cost, and the cost can be manifold. Indexes are not always good, and sequential scans are not always bad, either. My humble advice is to avoid trying to improve individual queries as the first step because it is a slippery slope. A top-down approach to tuning the system yields better results, starting from tuning the host machine, operating system, PostgreSQL parameters, schema, etc. An objective “cost-benefit analysis” is important before creating an index.

Our world-class PostgreSQL training teaches you all about indexes: when to use them, when not to use them, how they impact your system, etc. Come take our intensive instructor-led training course.

 

Learn more about Percona Training

Jan
24
2023
--

WAL Compression in PostgreSQL and Recent Improvements in Version 15

WAL Compression in PostgreSQL

Attempts to compress PostgreSQL WAL at different levels have always been around since the beginning. Some of the built-in features (wal_compression) have been there since 2016, and almost all backup tools do the WAL compression before taking it to the backup repository. But it is time to take another look at the built-in wal_compression because PostgreSQL 15 has much more to offer. There will be a considerable gain if we combine this with other sets of great improvements in WAL archiving in PostgreSQL 15, as discussed in previous posts New WAL Archive Module/Library in PostgreSQL 15 and Speed Up of the WAL Archiving in PostgreSQL 15.

I am not planning to look at WAL segment file compression by the backup tools because it is external to PostgreSQL and is primarily a feature of backup tooling. Even if we are not using specialized backup tools, we can still compress the WAL segment files as part of archiving, which can give great rewards.

Let’s look at what PostgreSQL offers internally for WAL compression. The WAL compression feature inside PostgreSQL kicks in (if we opt in) while doing full-page writes into the WAL, which can save a lot of I/O overhead. The reduced WAL segment size has further gains in replication and backup because less data needs to be transmitted.

What is Full Page Writes?

Novice users might be wondering what “Full Page Writes” are, and we should remember that PostgreSQL uses 8k pages.

postgres=# show block_size ;
block_size
------------
8192
(1 row)

But the host machine might be dealing with pages of smaller size, say 4k pages.

$ getconf PAGESIZE
4096

PostgreSQL deals with 8k pages as the “atomic unit” for reading and writing. But since the host machine has a smaller page size, it will split the 8k page and treat each OS page as an atomic unit. This causes a problem if there are abrupt disruptions; a part of the 8k page can be saved, while another part might be lost because the host machine may not treat the latter as part of the original piece. This is generally referred to as “partial page writes” or “torn pages.”

Such “torn pages” are corruptions from the database point of view. If a datafile with such a torn page exists, PostgreSQL loses the consistency of the page. This is not a problem for PostgreSQL alone; every database software needs to deal with it. For example, MySQL/InnoDB deals with this problem using a doublewrite buffer, from which a copy of the non-corrupt page can be obtained during recovery. PostgreSQL’s approach to the problem is slightly different: it writes a copy of the full page that is modified for the first time after a checkpoint into the WAL logs. Since WALs are synced frequently, and PostgreSQL can decide on the point up to which the recovery should happen, it is a safe place to save the copy of the “full page.”

The performance penalty of Full Page Writes

As mentioned above, when the database page is modified for the first time after a checkpoint, that needs to be safely written to WAL as the trustable reference. So during crash-recovery, PostgreSQL can safely apply the consistent pages from the WAL logs. But this comes with a considerable performance implication.

Now we know there is a high chance of too many full-page writes immediately after a checkpoint. This will be clearly visible in PostgreSQL performance benchmarks as a “Sawtooth wave” pattern observed by Vadim in his tests:

Sawtooth wave PostgreSQL

As we can see, the throughput suddenly drops after every checkpoint due to heavy WAL writing and gradually picks up until the next checkpoint.

Full Page compression and recent improvements

PostgreSQL 14 and older

Full pages are much bigger and self-contained. This allows those full pages to be compressed before they are written into the WAL segments. This feature landed in PostgreSQL way back in PostgreSQL 9.5; it uses the built-in LZ compression implementation, often referred to as “pglz”. But it was not too popular because of the CPU overhead, so it has never been part of the general advice for optimization. Users who opt in for full-page compression after analyzing their workload can just set the parameter wal_compression on all supported PostgreSQL versions and signal the server as a superuser.

ALTER SYSTEM SET wal_compression=ON;
SELECT pg_reload_conf();

PostgreSQL 15+

Modern compression algorithms started offering much better compression while taking fewer CPU cycles. Lz4 is a good example. PostgreSQL 15 modifies the same parameter wal_compression to take enumerated values like pglz, lz4, and zstd, in addition to on and off, which are used for backward compatibility. The boolean equivalent values like on, true, yes, and 1 are equivalent to “pglz.”
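
For example, on a PostgreSQL 15 build with lz4 support, switching could be as simple as:

ALTER SYSTEM SET wal_compression = 'lz4';
SELECT pg_reload_conf();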

Unlike pglz, which is a built-in compression library in PostgreSQL, the new compression functions are provided by external libraries, so they need to be enabled at build time. The configure flags are --with-lz4 and --with-zstd for lz4 and zstd, respectively.

For the prebuilt packages installed from repositories, you can check like this:

/usr/pgsql-15/bin/pg_config | grep "zstd\|lz4"
CONFIGURE =  '--enable-rpath' '--prefix=/usr/pgsql-15' '--includedir=/usr/pgsql-15/include' '--mandir=/usr/pgsql-15/share/man' '--datadir=/usr/pgsql-15/share' '--libdir=/usr/pgsql-15/lib' '--with-lz4' '--with-extra-version= - Percona Distribution' '--with-zstd' '--with-icu' '--with-llvm' '--with-perl' '--with-python' '--with-tcl' '--with-tclconfig=/usr/lib64' '--with-openssl' '--with-pam' '--with-gssapi' '--with-includes=/usr/include' '--with-libraries=/usr/lib64' '--enable-nls' '--enable-dtrace' '--with-uuid=e2fs' '--with-libxml' '--with-libxslt' '--with-ldap' '--with-selinux' '--with-systemd' '--with-system-tzdata=/usr/share/zoneinfo' '--sysconfdir=/etc/sysconfig/pgsql' '--docdir=/usr/pgsql-15/doc' '--htmldir=/usr/pgsql-15/doc/html' 'CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' 'LDFLAGS=-Wl,--as-needed' 'LLVM_CONFIG=/usr/bin/llvm-config' 'CLANG=/usr/bin/clang' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' 'PYTHON=/usr/bin/python3'
LIBS = -lpgcommon -lpgport -lselinux -lzstd -llz4 -lxslt -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lpthread -lrt -ldl -lm

Why compression of FP writes only?

Individual processes generate WAL records, and latency is very crucial for transactions. So a coordinated effort for compressing WAL record generation by multiple sessions may not add much value. But there are areas like indexing or bulk data load which could potentially benefit from WAL compression if such a feature is included in the future.

A quick look at compression options

When it comes to compression, the question will automatically be about the savings in I/O. As we know, compression comes at the expense of CPU usage. My objective is to quickly check whether there is any benefit when there is already high CPU utilization and whether there is any adverse (negative) effect on the overall TPS.

There is considerable saving if there are a lot of full-page writes. I could artificially trigger checkpoints to see the end effect on total WAL generation.

WAL generation postgresql

Lz4 gives compression comparable to the default pglz at a much lower CPU cost. Zstd can provide the highest compression (30% more compared to lz4).

In a system already on the brink of too much WAL generation, the uncompressed WAL can trigger many more checkpoints, resulting in even more WAL generation.

Summary

Some of the key points/takeaways I have from the discussion in the community and as well as in my simple tests:

  • The pglz compression method available in older versions is not very efficient. It uses more CPU and can affect TPS for specific workloads and machine configurations.
  • Modern compression algorithms and libraries are excellent. They do a much better job than what was available in PostgreSQL (pglz).
  • I couldn’t see any adverse effect on TPS in quick tests. Instead, I observed 10-15% better throughput with compression enabled, possibly because of less I/O wait.
  • Lz4 compression can be the choice if the database workload is CPU bound because it is light on the CPU. It gives compression very close to pglz without causing high CPU overhead.
  • Zstd can be chosen if the server load is not CPU bound, because it gives better compression at the expense of more CPU utilization.
  • An indirect benefit of WAL compression is that it reduces the chance of checkpoints being triggered by the volume of WAL generated (max_wal_size).

The actual benefit of compression depends on many factors: the configured checkpoint frequency, the chance of checkpoints being triggered by WAL volume, storage performance, acceptable CPU overhead, the CPU architecture, and more. With the introduction of the new options, the entry barrier is significantly lower.

Word of caution

If you are using lz4 or zstd compression, for all its greatness, please make sure that the PostgreSQL binaries on the standby are also capable of handling it. This warning mainly applies to users building their PostgreSQL from source code.
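Apart from re-running the pg_config check on the standby host, the list of values the server accepts for wal_compression should reflect what the binaries were built with, so a quick sanity check on the standby could be:

SELECT name, enumvals
FROM pg_settings
WHERE name = 'wal_compression';
-- 'lz4' and 'zstd' are expected to appear in enumvals only when the build includes those libraries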

Thanks to community

Thanks to Michael Paquier, who took the major part in the development of the PG 15 improvements; Justin Pryzby, who played a key role in this development; Andrey Borodin, who initiated the discussion; and many others who participated in the discussions.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!

Jan
04
2023
--

Speed Up of the WAL Archiving in PostgreSQL 15


Last time, I blogged about the New WAL Archive Module/Library feature available in PostgreSQL 15, which is quite transformative in how WALs are archived today in PostgreSQL. PostgreSQL 15 has many more improvements related to WAL archiving that are worth discussing. In this blog, I would like to highlight some of them, which solve major operational challenges for many PostgreSQL users.

Basics of WAL archiving

Let’s first discuss one of the problems that existed in PostgreSQL 14 and older.

When PostgreSQL generates WAL segment files in the pg_wal directory, it also generates associated .ready files in the pg_wal/archive_status subdirectory.

For example,

$ ls -alrth pg_wal/
drwx------  3 postgres postgres  68 Dec  7 05:47 .
drwx------ 21 postgres postgres  32 Dec 21 03:54 ..
-rw-------  1 postgres postgres 16M Dec 21 04:38 0000000200000008000000E4
-rw-------  1 postgres postgres 16M Dec 21 04:38 0000000200000008000000E5
-rw-------  1 postgres postgres 16M Dec 21 04:38 0000000200000008000000E6
-rw-------  1 postgres postgres 16M Dec 21 04:38 0000000200000008000000E7
drwx------  2 postgres postgres   6 Dec 21 04:38 archive_status
-rw-------  1 postgres postgres 16M Dec 21 04:38 0000000200000008000000E8

-bash-4.2$ ls -alrth pg_wal/archive_status/
total 5.0K
drwx------ 3 postgres postgres 68 Dec  7 05:47 ..
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E4.ready
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E5.ready
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E6.ready
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E7.ready
drwx------ 2 postgres postgres  6 Dec 21 04:38 .

This indicates that WAL segment files up to 0000000200000008000000E7 are ready for archiving, and 0000000200000008000000E8 is the current WAL segment file, which is not yet ready to be archived.

postgres=# SELECT pg_walfile_name(pg_current_wal_lsn());
 	pg_walfile_name 	 
--------------------------
 0000000200000008000000E8
(1 row)

Once the WALs are archived to the backup location (archive destination), the status changes to .done:

$ ls -alrth pg_wal/archive_status/
total 5.0K
drwx------ 3 postgres postgres 68 Dec  7 05:47 ..
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E4.done
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E5.done
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E6.done
-rw------- 1 postgres postgres  0 Dec 21 04:38 0000000200000008000000E7.done

PostgreSQL uses these status files to track which WAL files are still pending archiving.

PostgreSQL scans the pg_wal/archive_status/ directory to find the oldest WAL segment file that has not yet been archived.
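The same backlog is visible from SQL as well; for example, the number of segments still waiting to be archived can be counted with the pg_ls_archive_statusdir() function (available since PostgreSQL 12):

SELECT count(*) AS wal_pending_archive
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';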

PostgreSQL’s archiver process wakes up every 60 seconds (default) and tries to address every pending WAL segment by executing the internal function pgarch_ArchiverCopyLoop(), which in turn executes the archive_command for each of the WAL segment files. But WAL archiving should happen in the right order and only for the remaining WAL segment files.

Which WAL file is handled next is decided by the function pgarch_readyXlog(). The following comments in the PostgreSQL code tell the whole story.

/*
 * pgarch_readyXlog
 *
 * Return name of the oldest xlog file that has not yet been archived.
 * No notification is set that file archiving is now in progress, so
 * this would need to be extended if multiple concurrent archival
 * tasks were created. If a failure occurs, we will completely
 * re-copy the file at the next available opportunity.
 *
 * It is important that we return the oldest, so that we archive xlogs
 * in order that they were written, for two reasons:
 * 1) to maintain the sequential chain of xlogs required for recovery
 * 2) because the oldest ones will sooner become candidates for
 * recycling at time of checkpoint
 *
 * NOTE: the "oldest" comparison will consider any .history file to be older
 * than any other file except another .history file.  Segments on a timeline
 * with a smaller ID will be older than all segments on a timeline with a
 * larger ID; the net result being that past timelines are given higher
 * priority for archiving.  This seems okay, or at least not obviously worth
 * changing.
 */

The problem

But this function pgarch_readyXlog() needs to scan through the files in pg_wal/archive_status/ to decide which is the next candidate for archiving. So, effectively, each WAL file to be archived results in a full directory scan.

What if there are thousands or millions of files in pg_wal/archive_status/? This happens in high-transaction systems where WAL archiving cannot catch up with WAL generation during peak hours, or where WAL archiving has been failing for some time. Once a huge number of .ready status files accumulate, the directory scans themselves start taking more time, and the chance of WAL archiving ever catching up becomes very slim.

If this happens on PG 14 or older, the only solution is to increase wal_segment_size to a much bigger value so that fewer files are generated. We had customer cases where we increased wal_segment_size from the default 16MB to 1GB to address this problem. Of course, that is a dirty workaround with obvious consequences, such as a larger window of potential data loss if a backup needs to be restored.
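For reference, wal_segment_size is fixed when the cluster is initialized (initdb --wal-segsize) and can be confirmed on a running server:

SHOW wal_segment_size;  -- 16MB on a cluster initialized with defaults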

How PostgreSQL 15 solves this problem

Multiple solutions and patches were discussed, covering all possibilities. If you want to look into the full details, you can refer to the thread on the mailing list here:

https://www.postgresql.org/message-id/flat/CA%2BTgmobhAbs2yabTuTRkJTq_kkC80-%2Bjw%3DpfpypdOJ7%2BgAbQbw%40mail.gmail.com

The discussion converged into two approaches:

  1. Scan the directory, hold the result in an array, and provide it to the archive_command or module. Even though this can drastically reduce the number of directory scans, the scans still happen, and the associated O(n^2) complexity remains.
  2. A much smarter approach: predict the next WAL segment file name (based on the WAL file naming format) and look for it in the directory. Directory scans can be avoided for the major part of the logic, as illustrated below.
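The second approach relies on the fact that WAL segment file names are a direct function of the timeline and the LSN, so the "next" name can be computed without looking at the directory. An illustrative query, assuming the default 16MB segment size:

SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_segment,
       pg_walfile_name(pg_current_wal_lsn() + (16 * 1024 * 1024)::numeric) AS next_segment;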

It was a difficult decision which approach to take. After weighing all the implications, the first approach, which holds the WAL segment file names in an array, was selected, mainly because the array can also be used for a further improvement: sending multiple files to the archive_command or module at a time, which is another major area that needs improvement.

How it works in PostgreSQL 15

The idea is to scan the archive_status directory for .ready files and accumulate the list of WAL files to be archived into an array; the array size is controlled by a constant definition at compile time:

/*
 * Maximum number of .ready files to gather per directory scan.
 */
#define NUM_FILES_PER_DIRECTORY_SCAN 64

So a fresh directory scan is performed only after 64 .ready files have been handled.

Since it is very important to push timeline history files to the archive, they are prioritized over the WAL segments. This is done by explicitly triggering a directory scan whenever there is a timeline switch.

Overall, a 20x or better performance improvement has been reported in the community.

Better monitoring of WAL archiving

A new set of wait_events are added in PostgreSQL 15 for better observability and troubleshooting of WAL Archiving, Restore, and Cleanup phases.

  • ArchiveCleanupCommand: Waiting for archive_cleanup_command to complete.
  • ArchiveCommand: Waiting for archive_command to complete.
  • RecoveryEndCommand: Waiting for recovery_end_command to complete.
  • RestoreCommand: Waiting for restore_command to complete.

This wait event monitoring can tell us how much time is spent on specific actions. For example, the wait event “ArchiveCommand” tells us that the shell command specified in archive_command is being executed.

Tools/Scripts like pg_gather can effectively utilize these waits to understand what percentage of time is spent on executing the archive_command and whether the speed of archive_command is a bottleneck in WAL archiving.
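Since the archiver process also appears in pg_stat_activity, repeatedly sampling its wait event is a simple way to see whether it spends most of its time inside archive_command. A minimal sketch:

SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'archiver';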


Thanks to the community! I want to acknowledge the great contributions: Robert Haas, who was the committer/coordinator of the improvement; Dipesh Pandit, whose work showed the way forward; Nathan Bossart, who came up with the array approach; and Dilip Kumar.

Thanks to Fujii Masao for the wait event patches.

Thanks to others who participated in discussions and reviews: Michael Paquier, Bharath Rupireddy, Andres Freund, Andrey Borodin, Dilip Kumar, Stephen Frost, Kyotaro Horiguchi, Jeevan Ladhe, and many others.

Dec
02
2022
--

New WAL Archive Module/Library in PostgreSQL 15


PostgreSQL traditionally uses shell commands to achieve continuous WAL archiving, which is essential for backups and stable standby replication. In the past, we blogged about the inefficiency of that design and how some backup tools, like PgBackRest, solve the problem. It is a well-known problem in the PostgreSQL community and has been discussed many times.

It is also unfortunate that PostgreSQL documentation gives an unsafe command string, employing a combination of the test and cp commands as an example:

archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

Even though the documentation mentions it as an example to explain how the archive_command executes shell scripts/commands, we keep seeing many users copying this approach, falling into this documentation trap, and running into archiving issues.

The good news is that PostgreSQL 15 came up with a set of features to address problems associated with WAL archiving. PostgreSQL can now do WAL archiving using loadable modules; please refer to the following commit for the implementation details: https://git.postgresql.org/gitweb/?p=postgresql.git;h=5ef1eefd7. The ability to use an archive module/library is a revolutionary step in this area.

Specifying the archive_library

Specifying the archive_library can be done while the system is up and running. Just send a SIGHUP signal afterward to get the configuration reloaded:

postgres=# alter system set archive_library='<Library>';
postgres=# select pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)

It is important to keep an eye on the PostgreSQL logs, because anything going wrong will not be displayed in the psql session; it will only appear in the PostgreSQL logs:

2022-11-24 05:07:27.618 UTC [788] LOG:  received SIGHUP, reloading configuration files
2022-11-24 05:07:27.619 UTC [788] LOG:  parameter "archive_library" changed to "shell"
2022-11-24 05:07:27.620 UTC [1274] FATAL:  could not access file "shell": No such file or directory

We can remove the library specification if we want to use shell commands as in PostgreSQL 14 and older:

ALTER SYSTEM SET archive_library='';
OR
ALTER SYSTEM RESET archive_library ;

And make sure that no archive_library specification is in effect:

postgres=# show archive_library ;
archive_library
-----------------

(1 row)

At this stage, PostgreSQL will use the plain old archive_command for WAL archiving, which you may specify as usual:

alter system set archive_command = '<shell command>';

So, in summary, the archive_command will be considered only if the archive_library spec is empty.

basic_archive: The sample archive library

PostgreSQL 15 provides a simple, sample archive library as a reference implementation, which is part of the contrib modules. If the contrib modules are already installed, we can specify the archive_library like this:

postgres=# ALTER SYSTEM SET archive_library = 'basic_archive';

But this library needs further input about where to place the archived files, which can be specified using the module-specific parameter basic_archive.archive_directory.

Otherwise, we may start seeing messages like the following in the PostgreSQL logs:

WARNING:  archive_mode enabled, yet archiving is not configured

You may want to refer to the official documentation also.

However, this archive library will be loaded only by the archiver process, not by the backend processes of regular sessions. So any attempt to set or view the parameter can give you errors:

postgres=# ALTER SYSTEM SET basic_archive.archive_directory='/home/postgres/arch2';
ERROR:  unrecognized configuration parameter "basic_archive.archive_directory"

postgres=# show basic_archive.archive_directory;
ERROR:  unrecognized configuration parameter "basic_archive.archive_directory"

But regular sessions can LOAD the library if they want.

postgres=# LOAD 'basic_archive';
LOAD

postgres=# ALTER SYSTEM SET basic_archive.archive_directory='/home/postgres/arch2';                                                                                                                                                                             
ALTER SYSTEM

postgres=# show basic_archive.archive_directory;                                                                                                                                                                                             
basic_archive.archive_directory
---------------------------------

(1 row)

postgres=# SELECT pg_reload_conf();                                                                                                                                                                                                          
pg_reload_conf
----------------
t
(1 row)

postgres=# show basic_archive.archive_directory;
basic_archive.archive_directory
---------------------------------
/home/postgres/arch2
(1 row)

postgres=#

Here are a couple of important observations:

1) Instead of loading the library in a specific session, one might use other methods like shared_preload_libraries, but that is not advisable.

2) The module-specific settings may not show up in pg_settings unless the module is loaded in the session, as shown below.
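A quick way to verify the second point from a psql session, assuming the contrib module is installed:

LOAD 'basic_archive';
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'basic_archive.%';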

If multiple archive libraries are available, we can switch from one library to another while the system is live. The archiver process will restart behind the scenes, and the newly specified library will be loaded by the new archiver process. A message like this might appear in the PostgreSQL log:

LOG:  restarting archiver process because value of "archive_library" was changed

If the attempt to archive the WAL by basic_archive fails, there will be corresponding ERROR entries in the PostgreSQL log:

2022-11-24 11:24:31.330 UTC [2323] ERROR:  could not create file "/home/postgres/arch2/archtemp.00000001000000000000007B.2323.1669289071330": No such file or directory
2022-11-24 11:24:31.330 UTC [2323] WARNING:  archiving write-ahead log file "00000001000000000000007B" failed too many times, will try again later

Anyone closely observing the above error message might wonder what the file archtemp.00000001000000000000007B.2323.1669289071330 mentioned there is; the name doesn’t look like a regular WAL segment file. This happens because the basic_archive module first creates a temporary file in the destination with a name like that, fsyncs it, and then moves the file to its final name. This ensures better protection against crashes.

When the archiver is terminated by a signal (other than SIGTERM, which is used as part of a server shutdown) or by a shell error with an exit status greater than 125 (such as command not found), the archiver process aborts and gets restarted by the postmaster. In such cases, the failures are recorded in the pg_stat_archiver view. Errors in archive library execution can also be monitored using pg_stat_archiver:

postgres=# select * from pg_stat_archiver ;
-[ RECORD 1 ]------+------------------------------
archived_count     | 160
last_archived_wal  | 0000000100000000000000A1
last_archived_time | 2022-11-24 14:58:43.33668+00
failed_count       | 680
last_failed_wal    | 000000010000000000000087
last_failed_time   | 2022-11-24 14:49:51.716773+00
stats_reset        | 2022-11-15 08:58:34.939925+00
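Combining pg_stat_archiver with the current WAL position gives a rough idea of how far behind archiving is. An illustrative query:

SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_segment,
       last_archived_wal,
       failed_count,
       last_failed_wal
FROM pg_stat_archiver;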

Advantage of basic_archive

The basic_archive module does pretty much the same thing as the shell command:

'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

But in a much better way. We already discussed one of the advantages:

It creates a temporary file, fsyncs it, and durably moves it to the final archive file in the destination. This durability is something cp cannot provide, and it drastically reduces the chance that a broken file in the archive destination results in archive failures or, worse, affects database recoverability.

Another important functional advantage of basic_archive is its built-in feature to compare the file in the source (pg_wal) with the one in the archive destination when a file of the same name already exists. It compares the contents of the files, verifies that they are exactly the same, and reports “success” back to the archiver so that the archiver process can continue to the next WAL segment. This also reduces the chance of archive failures, because if the file was archived by the module but the server crashed before recording that fact, PostgreSQL will try to archive the same WAL segment again. The new sample basic_archive module silently succeeds on the second attempt if the file is already copied and has the same content.

Additionally, the basic_archive module has a custom exception handler. This allows the archiver to treat an ERROR as a normal failure and avoid restarting the archiver again.

Potential risks/problems of using archive library

Since archive modules are loaded into PostgreSQL, unlike shell commands executed by archive_command, they have access to server resources. So please pay attention and avoid modules from unknown or untrustworthy sources, as they pose a risk.

Module authors must also remember that the archiving callback runs in its own transient memory context. If the module needs to palloc() something that has to stick around for a while, it will need to do so in a different memory context. One might refer to the basic developer documentation and the basic_archive source code.

The sample basic_archive module may leave archtemp.* files behind in the archive directory after crashes or other strange failure cases. If that happens, users may have to clean them up before starting PostgreSQL again.

What can be expected in future

We can see a bright future with all these improvements, as the default archive_command could become yet another archive module in future versions of PostgreSQL.

I hope that basic_archive or new modules will come into existence with more features and configuration options, like compression, because Compression of PostgreSQL WAL Archives is Becoming More Important.

I would expect PostgreSQL support vendors to develop many more powerful libraries and add to the PostgreSQL ecosystem.

Database as a Service (DBaaS) vendors will be one of the biggest beneficiaries as they will have a very specific place and method for WAL archiving, which a module can handle in a much more efficient way.

I would expect the backup tools and methods to evolve. Now a backup tool can remain in daemon mode, waiting for messages from the PostgreSQL archiver; walg_archive is an example in this direction.

I think it’s reasonable to expect restore_library support in future PostgreSQL versions as well.

Thanks to Community

Special thanks to Nathan Bossart, who took the major role in improving WAL archiving in PostgreSQL 15, and others like Robert Haas for active participation from the beginning till the final commit.

Thanks to Many others who were involved in the discussions and reviews, namely Julien Rouhaud, Andrey Borodin, Robert Haas, Stephen Frost, Magnus Hagander, David Steele, Fujii Masao, Michael Paquier, Alvaro Herrera, Tom Lane, Benoit Lobréau, Peter Eisentraut, Bharath Rupireddy, Kyotaro Horiguchi, Ian Lawrence Barwick, and many others.

Great to see that PostgreSQL 15 started addressing many of the areas of inefficiency, starting from the removal of the stats collector and progress in asynchronous I/O.
