Nov 09, 2023

PostgreSQL Checkpoints, Buffers, and WAL Usage with Percona Monitoring and Management


In this blog post, we will discuss how to extend Percona Monitoring and Management (PMM) to get PostgreSQL metrics on checkpointing activity, internal buffers, and WAL usage. With this data, we’ll be able to better understand and tune our Postgres servers.

We’ll assume there are working PostgreSQL and PMM environments set up already. You can search the blog for more information on this if needed. For generating load, we used pgbench with the following commands:

shell> pgbench -i -s 100 sbtest
shell> pgbench -c 8 -j 3 -T $((60*60*4)) -N -S -P 1 sbtest

Creating a custom query collector

Note: This step will not be needed in upcoming versions of the PMM postgresql_exporter!

The first step is to include a new query that gathers data on WAL usage, since this is not yet collected by default in the latest version of PMM. Note, however, that only three graphs use these custom metrics, so if you want to try out the dashboard without this customization, you can skip this step (at the expense of having a bit less data).

For this, it’s as easy as executing the following (in the PMM client node that is monitoring the Postgres servers):

cd /usr/local/percona/pmm2/collectors/custom-queries/postgresql/high-resolution/
cat<<EOF >queries-postgres-wal.yml
# ######################################################
# This query gets information related to WAL LSN numbers in bytes, 
# which can be used to monitor the write workload.
# Returns a COUNTER: https://prometheus.io/docs/concepts/metric_types/
# ######################################################
pg_wal_lsn:
  query: "SELECT pg_current_wal_lsn() - '0/0' AS bytes, pg_walfile_name(pg_current_wal_lsn()) as file, pg_current_wal_insert_lsn() - '0/0' AS insert_bytes, pg_current_wal_flush_lsn() - '0/0' AS flush_bytes"
  metrics:
    - bytes:
        usage: "COUNTER"
        description: "WAL LSN (log sequence number) in bytes."
    - file:
        usage: "COUNTER"
        description: "WAL file name being written to."
    - insert_bytes:
        usage: "COUNTER"
        description: "WAL insert LSN (log sequence number) in bytes."
    - flush_bytes:
        usage: "COUNTER"
        description: "WAL flush LSN (log sequence number) in bytes."
EOF
chown pmm-agent:pmm-agent queries-postgres-wal.yml
chmod 660 queries-postgres-wal.yml

This will result in the following new metrics on PMM:

pg_wal_lsn_bytes
pg_wal_lsn_file
pg_wal_lsn_insert_bytes
pg_wal_lsn_flush_bytes
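Before relying on the dashboard, we can sanity-check the statement the collector runs directly in psql (this is simply the same query from the YAML file above; the byte counters should keep growing while pgbench is generating writes):

-- Same query the custom collector runs; the counters should increase under load
SELECT pg_current_wal_lsn() - '0/0' AS bytes,
       pg_walfile_name(pg_current_wal_lsn()) AS file,
       pg_current_wal_insert_lsn() - '0/0' AS insert_bytes,
       pg_current_wal_flush_lsn() - '0/0' AS flush_bytes;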

Importing the custom PMM dashboard

To view these new metrics, we will import a new custom PMM dashboard that presents this information in a structured and easy-to-understand way. This can be done in three easy steps (since I’ve uploaded it to Grafana Labs).

1. Click on the Dashboards -> Import menu: 

Percona Monitoring and Management Dashboard

2. Import via grafana.com with ID number (19600):

Import via Grafana

3. Select the PostgreSQL folder and Metrics, and import it:

Import dashboard from Grafana

After this, you’ll be taken to the dashboard, which I generally “star” for easy access later on.

Using the new dashboard

The dashboard is divided into three main areas, all of which have relevant configurations at the top and graphs following them. The main idea is to be able to easily detect what the server is doing at any point in time so we can better understand how it reacts to the workload and tune accordingly. These graphs and panels not only show changes in usage and workload patterns but also changes in configuration, which can help during root cause analysis and performance reviews.

Checkpointing section

Here, we will find everything related to checkpointing: when checkpoints are started (and why), and when they finish. Checkpointing has a large impact on Postgres’ write throughput and utilization, so it’s important to make sure it’s not being triggered before it’s needed.

In the following example, we can see that at the beginning there was a lot of forced checkpointing: not only was the checkpointer running more often than every checkpoint_timeout seconds, but the Checkpoints graph also shows requested (rather than scheduled) checkpoints. We can’t see it yet because the WAL graphs are at the bottom, but this was fixed after we increased max_wal_size.

Checkpointing PostgreSQL

Additionally, after the metrics refresh, we can see that the Checkpoint Timeout stat panel shows another change closer to the end of our currently selected time range:

This means that we have also changed the checkpoint_timeout value. This was done only after our workload was more stable and there were no more forced checkpoints being issued.
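If you want to cross-check what the dashboard shows directly on the server, the counters behind the requested vs. scheduled distinction can also be queried manually. On PostgreSQL 16 and earlier they live in pg_stat_bgwriter (newer versions move them to pg_stat_checkpointer), so a quick check could look like this:

-- Scheduled vs. requested (forced) checkpoints, plus time spent in them (ms)
SELECT checkpoints_timed,      -- triggered by checkpoint_timeout
       checkpoints_req,        -- requested/forced (e.g., max_wal_size exceeded)
       checkpoint_write_time,  -- ms spent writing buffers during checkpoints
       checkpoint_sync_time    -- ms spent syncing files during checkpoints
FROM pg_stat_bgwriter;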

Buffers section

This section holds configurations and metrics related to reads into and writes from the shared buffers. Again, demonstrating it by example, let’s see how much of an impact correctly sizing shared_buffers has on our test workload. It’s set to 128MB by default, which is hardly enough for any serious workload and means we will see a lot of churn in both reads and writes until we increase it. The only downside is that modifying shared_buffers needs a restart, so before doing that, we can increase bgwriter_lru_maxpages (which doesn’t need a restart) to avoid the background writer stopping each round and get a bit more performance out of it, at the expense of I/O.
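As a rough sketch of the two changes just described (the values below are placeholders for illustration, not recommendations):

-- bgwriter_lru_maxpages only needs a configuration reload to take effect
ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;
SELECT pg_reload_conf();
-- shared_buffers is set the same way, but only applies after a server restart
ALTER SYSTEM SET shared_buffers = '4GB';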

We can clearly see how, at first, there were constant reads into the shared buffers, and the background writer was doing most of the writes. We can also see the positive impact of increasing bgwriter_lru_maxpages, because that process is no longer being throttled. After we increased shared_buffers and restarted Postgres, reads dropped to almost nothing from the page cache or disks, and the writes are no longer done by the background writer but by the checkpointer.

WAL usage section

The last section pertains to write-ahead logs, which are another source of potential performance bottlenecks. Note that (for now) this is the only section that needs data coming from the custom query collector we added at the beginning, so you can use the rest of the dashboard even if you decide not to include it. PMM already collects the configurations, so we will only miss data from the graphs.

The WAL Writes Per Checkpoint Timeout graph will group the number of bytes that were written in chunks of checkpoint_timeout seconds so we can have a clear view of the expected max_wal_size value. If the chunks (shown as orange bars) exceed the max_wal_size (shown as a red line), it means that there will be forced checkpointing. We can easily dimension max_wal_size knowing that it should be larger than any orange bar, and we can tell when exactly it was changed, following changes in the red line. The WAL Writes graph uses the same metric but is shown as a rate, which can help us pinpoint heavy write times more granularly. Lastly, the WAL Locations graph is included for completeness and shows the different locations for WAL pointers (such as insertion and flush locations), which, according to the documentation, are mainly used for debugging purposes.
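For a quick ad hoc look at the same data outside the dashboard, we can reuse the LSN arithmetic from the custom query: sampling the first column at the start and end of a checkpoint_timeout window and taking the difference is roughly what the orange bars represent (a sketch, run manually in psql):

SELECT pg_current_wal_lsn() - '0/0' AS wal_bytes_so_far,
       current_setting('max_wal_size') AS max_wal_size,
       current_setting('checkpoint_timeout') AS checkpoint_timeout;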

Hints and tips

Most graphs contain additional information and links on either the configuration variables or the metrics they show, so they are useful in case we need a bit of help interpreting them. Just hover the mouse over the “i” icon on each of them:

Checkpointing PostgreSQL

Getting even more data

Starting with PostgreSQL 14, we have a new view with WAL metrics: pg_stat_wal.

The metrics used in this custom collector and dashboard will work for older versions, but in the future, it will be nice to see what data we can extract from this new view and how we can use it to tune our servers.
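As a starting point, a minimal query against the view (assuming PostgreSQL 14 or newer) could be:

SELECT wal_records,      -- WAL records generated
       wal_fpi,          -- full page images written to WAL
       wal_bytes,        -- total WAL generated, in bytes
       wal_buffers_full, -- times WAL was written because wal_buffers was full
       stats_reset
FROM pg_stat_wal;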

Conclusion

This is another example of how tunable and powerful PMM is: not only can we add our own custom metrics easily, but it already collects many metrics we can use out of the box. Additionally, it’s easy to share new PMM dashboards via the Grafana Labs page by simply publishing them and sharing the ID number. With all this new information at hand, we can now better tune our PostgreSQL instances!

Lastly, we are working closely with the PMM Dev team to include this new dashboard in the PMM server itself, so all these custom steps won’t be needed in future releases. Let us know if you have any comments on it.


Jun 09, 2023

Monitoring a PostgreSQL Patroni Cluster


Percona Monitoring and Management (PMM) boasts many functionalities that support its extension, be it by using Custom Queries, Custom Scripts, or by collecting data from already available External Exporters.

In this short blog post, we will see how to quickly (and easily) monitor a PostgreSQL cluster managed by Patroni. I will assume you already have an environment up and running in which the PMM client is monitoring the PostgreSQL database, and that it is a three-node PostgreSQL cluster, although this will of course work with other setups. If needed, refer to the online documentation on this:

https://docs.percona.com/percona-monitoring-and-management/setting-up/client/postgresql.html

Adding the Patroni metrics to PMM

Since version 2.1.0, Patroni exposes metrics in a Prometheus-compatible way via the /metrics endpoint:

https://patroni.readthedocs.io/en/latest/releases.html#version-2-1-0 

This means that we can instantly benefit from them by using the PMM External Exporters feature:

https://docs.percona.com/percona-monitoring-and-management/setting-up/client/external.html
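Before registering anything, we can optionally confirm that the endpoint is reachable from each client node (a quick check assuming Patroni’s REST API listens on its default port, 8008, on the local node; adjust host and port as needed):

shell> curl -s http://localhost:8008/metrics | head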

In a nutshell, we just need to run the following in each client node:

shell> pmm-admin add external --listen-port=8008 --service-name=pg-nodeX-patroni

Where --service-name can follow whatever naming scheme lets you easily tell which Patroni node it refers to.
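For the three-node cluster assumed here, that could look like the following (the service names are just an illustration):

shell> pmm-admin add external --listen-port=8008 --service-name=pg-node1-patroni
shell> pmm-admin add external --listen-port=8008 --service-name=pg-node2-patroni
shell> pmm-admin add external --listen-port=8008 --service-name=pg-node3-patroni
shell> pmm-admin list   # verify the new external services show up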

After running these commands and waiting some seconds, we should see metrics incoming in the Advanced Data Exploration dashboard, like the patroni_primary one:

Patroni Advanced Data Exploration dashboard

Until now, no Patroni dashboard used these metrics, so I took the chance while working on a customer ticket to improve this and created my own dashboard (which, in my opinion, is proof of the power behind PMM’s extensibility).

Importing the new Patroni Dashboard

I have uploaded the dashboard to Grafana Labs so it’s easy for everyone to import and use. You can use the ID 18870 to do so in three easy steps.

1- Click on the Import dashboard button in PMM:

2- Select the ID:

3- Select the PostgreSQL folder, and Metrics, and import it:

After this, you should be able to see all the information available from the Patroni /metrics endpoint in your PMM! (Actually, all but patroni_xlog_replayed_timestamp, patroni_cluster_unlocked, and patroni_failsafe_mode_is_active). The dashboard is split into three main sections:

In the headers, we can see information about a node in particular (the one selected in the service_name drop-down list):

  • Patroni information like version, communication with the DCS, if it’s the current leader, etc.
  • PostgreSQL information like version, if it’s running, its timeline number, etc.

Then we have the Patroni section, with graphs showing which nodes were the Patroni primary and replicas at any given time.

Finally, we have the PostgreSQL section, with graphs showing which nodes were the PostgreSQL primary and replicas at any given time, along with information on WAL generation and application.

The following two screenshots show this more clearly:

Conclusion

Adding new information to PMM is easy, and in this case, it’s even easier because all the moving parts are already developed, and it’s a matter of simply knowing how to connect them.


Jul 05, 2022

Enabling and Enforcing SSL/TLS for PostgreSQL Connections


Enabling SSL in PostgreSQL is very straightforward. In just three steps we can make sure the connections to it are more secure, using in-transit encryption via SSL/TLS:

  1. Make sure we have the server certificate and key files available
  2. Enable the SSL configuration (ssl = on)
  3. Make sure the pg_hba.conf file rules are updated accordingly

In this blog post, we are going to go through these steps, and we’ll also see how we can check and validate the connections are indeed using the safer SSL protocol.

What is SSL/TLS?

SSL (Secure Sockets Layer) is an encryption protocol designed to make network communications between two nodes secure. Without some form of network encryption, any third party that can examine network packets has access to the data sent between the client and server (in this case, the PostgreSQL data, which means users, passwords, and even SQL statements). TLS (Transport Layer Security) is the modern successor to SSL, and even though SSL is deprecated, its name is still commonly used to refer to both. For all intents and purposes, we treat them as aliases in this blog.

The PostgreSQL documentation pages offer us some more insight in this respect. If needed, consult the Secure TCP/IP Connections with SSL and SSL Support entries for more information.

Trying to enable SSL without Cert/Key Files

Let’s now see what happens when we try to enable SSL without having the needed certificate and key files in place:

postgres=# alter system set ssl=on;
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

We don’t see any errors, but are we really using SSL? If we check the error log, we’ll indeed see the errors:

2022-06-23 20:43:54.713 UTC [5284] LOG:  received SIGHUP, reloading configuration files
2022-06-23 20:43:54.714 UTC [5284] LOG:  parameter "ssl" changed to "on"
2022-06-23 20:43:54.715 UTC [5284] LOG:  could not load server certificate file "server.crt": No such file or directory
2022-06-23 20:43:54.715 UTC [5284] LOG:  SSL configuration was not reloaded

Creating certificates

So, we first need to create the aforementioned files. If you don’t already have valid certificate and key files, a quick one-liner for this is the following openssl command (it’s not the focus here to delve too much into this part of the process):

[root@node0 ~]# cd /var/lib/pgsql/14/data
[root@node0 data]# openssl req -nodes -new -x509 -keyout server.key -out server.crt -subj '/C=US/L=NYC/O=Percona/CN=postgres'
Generating a 2048 bit RSA private key
....+++
.........................+++
writing new private key to 'server.key'
-----

We changed the current working directory to the PostgreSQL data directory since we are on a RHEL-based system. If you are on a Debian-based one, you should store the files in /etc/ssl/certs/ and /etc/ssl/private/, or define/check the ssl_cert_file and ssl_key_file PostgreSQL configuration variables, respectively. Also, make sure the postgres user owns them and that they are only readable by it:

[root@node0 data]# chmod 400 server.{crt,key}
[root@node0 data]# chown postgres:postgres server.{crt,key}
[root@node0 data]# ll server.{crt,key}
-r--------. 1 postgres postgres 1212 Jun 23 20:49 server.crt
-r--------. 1 postgres postgres 1704 Jun 23 20:49 server.key
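If you keep the files outside the data directory (as in the Debian-based layout mentioned above), you can point PostgreSQL at them explicitly; the paths below are just an example:

ALTER SYSTEM SET ssl_cert_file = '/etc/ssl/certs/server.crt';
ALTER SYSTEM SET ssl_key_file = '/etc/ssl/private/server.key';
SELECT pg_reload_conf();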

Enabling SSL/TLS

Now we can enable SSL and reload the configuration again; this time with no errors shown:

postgres=# alter system set ssl=on;
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

2022-06-23 20:52:05.823 UTC [5284] LOG:  received SIGHUP, reloading configuration files
2022-06-23 20:52:05.823 UTC [5284] LOG:  parameter "ssl" changed to "on"

So far, we have enabled SSL, but unless we modify the pg_hba.conf file these settings won’t apply to any users (at least not in a forceful manner). This is the first step that can give us a false sense of security, so let’s go ahead and see how to fix it.

Enforcing SSL/TLS

As mentioned, the pg_hba.conf file is where we tune which connections are required to use SSL. We can instruct PostgreSQL to enforce this by using the “hostssl” keyword instead of the plain “host” one. Note that you may already see some connections using SSL at this point, because the plain “host” keyword allows connections that want to use SSL to do so. However, it does not enforce SSL (i.e., if the client doesn’t want to use SSL, PostgreSQL will not deny the connection).

Let’s imagine this is the pg_hba.conf file we have been using so far:

# TYPE  DATABASE        USER   ADDRESS            METHOD
local   all             all                       peer
host    all             all    127.0.0.1/32       scram-sha-256
host    all             all    ::1/128            scram-sha-256
host    all             all    0.0.0.0/0          md5
host    replication     all    10.124.33.113/24   md5

And we want to enforce SSL connections from all remote users (and also include remote replication connections):

# TYPE   DATABASE       USER    ADDRESS            METHOD
local    all            all                        peer
host     all            all     127.0.0.1/32       scram-sha-256
host     all            all     ::1/128            scram-sha-256
hostssl  all            all     0.0.0.0/0          md5
hostssl  replication    all     10.124.33.113/24   md5

Again, this is not enough if we are adamant about really enforcing connections to use SSL. We have to call pg_reload_conf() once more to make sure the changes are loaded into PostgreSQL itself:

postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

At this point, new remote non-SSL connections will be denied:

[root@node1 ~]# psql "host=10.124.33.132 sslmode=disable"
psql: error: connection to server at "10.124.33.132", port 5432 failed: FATAL:  no pg_hba.conf entry for host "10.124.33.113", user "postgres", database "postgres", no encryption

So, can we finally say we are fully secure now? No, not yet! Connections that were already established are not forced to use SSL until they reconnect.

Checking for connections using SSL/TLS

We can check for connections using SSL with the following query:

postgres=# select pg_ssl.pid, pg_ssl.ssl, pg_ssl.version,
           pg_sa.backend_type, pg_sa.usename, pg_sa.client_addr
           from pg_stat_ssl pg_ssl
           join pg_stat_activity pg_sa
             on pg_ssl.pid = pg_sa.pid;
 pid  | ssl | version |  backend_type  | usename  |  client_addr
------+-----+---------+----------------+----------+---------------
 5547 | f   |         | walsender      | postgres | 10.124.33.113
 5549 | f   |         | client backend | postgres | 10.124.33.132
 5556 | f   |         | client backend | postgres | 10.124.33.113
(3 rows)

In this case, the replication connection (walsender) is not yet using SSL, and neither are the other two connected clients, so we need to force them to reconnect if we want them to use SSL. As always, we recommend trying all these steps in a testing environment first, and when it’s time to do them in production, doing so in a properly established maintenance window (no matter how trivial the steps seem to be).

To force the replication connections to use SSL, one can either restart the service in the replica or use pg_terminate_backend (which will send the SIGTERM signal to the process and is safe to use in this context). In this case, we are using pg_terminate_backend in the primary itself, but it can also be used in the replica, provided we are using the correct PID number.

postgres=# select pg_terminate_backend(5547);
 pg_terminate_backend
----------------------
 t
(1 row)

After that, we should see the new replica connection correctly using the SSL/TLS protocol:

postgres=# select pg_ssl.pid, pg_ssl.ssl, pg_ssl.version,
           pg_sa.backend_type, pg_sa.usename, pg_sa.client_addr
           from pg_stat_ssl pg_ssl
           join pg_stat_activity pg_sa
             on pg_ssl.pid = pg_sa.pid;
 pid  | ssl | version |  backend_type  | usename  |  client_addr
------+-----+---------+----------------+----------+---------------
 5557 | t   | TLSv1.2 | walsender      | postgres | 10.124.33.113
 5549 | f   |         | client backend | postgres | 10.124.33.132
 5556 | f   |         | client backend | postgres | 10.124.33.113
(3 rows)

PID 5549 is our own connection, so that’s an easy fix:

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           5549
(1 row)

The connection with PID 5556 is the remaining one to check if we want to enforce SSL on all of them. On the client side, we can use \conninfo to check information about our current connection:

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" on host "10.124.33.132" at port "5432".
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)

Disabling SSL/TLS

If you want to disable SSL instead, be careful not to lose your client connection after you set ssl=off and change the pg_hba.conf file; otherwise, you may be locked out if no accounts use the plain “host” access method, and your only way back in is to restart the service. To be safe, first edit and reload the pg_hba.conf file so it includes “host” entries, and only then fully disable SSL (ssl=off).
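A minimal sketch of that safer order (assuming the relevant pg_hba.conf entries have already been switched back to “host”):

-- 1) After editing pg_hba.conf to use "host" entries, load those changes first
SELECT pg_reload_conf();
-- 2) Only then turn SSL off and reload once more
ALTER SYSTEM SET ssl = off;
SELECT pg_reload_conf();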

Conclusion

Enabling SSL/TLS for in-transit connection encryption is easy, but there are some pitfalls to be aware of when it comes to enforcing its usage. Simply enabling the configuration for it is not enough for it to be enforced, even if by default some connections may prefer using SSL when it’s available. If you need to ensure that all connections use SSL, edit the pg_hba.conf file accordingly and make sure it’s loaded. Remember that “hostssl” entries are the ones that force this behavior.

We can use tcpdump and Wireshark to check whether connections are indeed being encrypted, but that’s a topic for another blog…
