Replication from Percona Server for MySQL to PostgreSQL using pg_chameleon

postgres mysql replication using pg_chameleon

postgres mysql replication using pg_chameleonReplication is one of the well-known features that allows us to build an identical copy of a database. It is supported in almost every RDBMS. The advantages of replication may be huge, especially HA (High Availability) and load balancing. But what if we need to build replication between 2 heterogeneous databases like MySQL and PostgreSQL? Can we continuously replicate changes from a MySQL database to a PostgreSQL database? The answer to this question is pg_chameleon.

For replicating continuous changes, pg_chameleon uses the mysql-replication library to pull the row images from MySQL, which are transformed into a jsonb object. A pl/pgsql function in postgres decodes the jsonb and replays the changes into the postgres database. In order to setup this type of replication, your mysql binlog_format must be “ROW”.

A few points you should know before setting up this tool :

  1. Tables that need to be replicated must have a primary key.
  2. Works for PostgreSQL versions > 9.5 and MySQL > 5.5
  3. binlog_format must be ROW in order to setup this replication.
  4. Python version must be > 3.3

When you initialize the replication, pg_chameleon pulls the data from MySQL using the CSV format in slices, to prevent memory overload. This data is flushed to postgres using the COPY command. If COPY fails, it tries INSERT, which may be slow. If INSERT fails, then the row is discarded.

To replicate changes from mysql, pg_chameleon mimics the behavior of a mysql slave. It creates the schema in postgres, performs the initial data load, connects to MySQL replication protocol, stores the row images into a table in postgres. Now, the respective functions in postgres decode those rows and apply the changes. This is similar to storing relay logs in postgres tables and applying them to a postgres schema. You do not have to create a postgres schema using any DDLs. This tool automatically does that for the tables configured for replication. If you need to specifically convert any types, you can specify this in the configuration file.

The following is just an exercise that you can experiment with and implement if it completely satisfies your requirement. We performed these tests on CentOS Linux release 7.4.

Prepare the environment

Set up Percona Server for MySQL

InstallMySQL 5.7 and add appropriate parameters for replication.

In this exercise, I have installed Percona Server for MySQL 5.7 using YUM repo.

yum install
yum install Percona-Server-server-57
echo "mysql ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
usermod -s /bin/bash mysql
sudo su - mysql

pg_chameleon requires the following the parameters to be set in your my.cnf file (parameter file of your MySQL server). You may add the following parameters to /etc/my.cnf

binlog_format= ROW
log-bin = mysql-bin
server-id = 1

Now start your MySQL server after adding the above parameters to your my.cnf file.

$ service mysql start

Fetch the temporary root password from mysqld.log, and reset the root password using mysqladmin

$ grep "temporary" /var/log/mysqld.log
$ mysqladmin -u root -p password 'Secret123!'

Now, connect to your MySQL instance and create sample schema/tables. I have also created an emp table for validation.

$ wget
$ tar -xzf sakila-db.tar.gz
$ mysql -uroot -pSecret123! < sakila-db/sakila-schema.sql
$ mysql -uroot -pSecret123! < sakila-db/sakila-data.sql
$ mysql -uroot -pSecret123! sakila -e "create table emp (id int PRIMARY KEY, first_name varchar(20), last_name varchar(20))"

Create a user for configuring replication using pg_chameleon and give appropriate privileges to the user using the following steps.

$ mysql -uroot -p
create user 'usr_replica'@'%' identified by 'Secret123!';
GRANT ALL ON sakila.* TO 'usr_replica'@'%';

While creating the user in your mysql server (‘usr_replica’@’%’), you may wish to replace % with the appropriate IP or hostname of the server on which pg_chameleon is running.

Set up PostgreSQL

Install PostgreSQL and start the database instance.

You may use the following steps to install PostgreSQL 10.x

yum install
yum install postgresql10*
su - postgres
$ /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data start

As seen in the following logs, create a user in PostgreSQL using which pg_chameleon can write changed data to PostgreSQL. Also create the target database.

postgres=# CREATE USER usr_replica WITH ENCRYPTED PASSWORD 'secret';
postgres=# CREATE DATABASE db_replica WITH OWNER usr_replica;

Steps to install and setup replication using pg_chameleon

Step 1: In this exercise, I installed Python 3.6 and pg_chameleon 2.0.8 using the following steps. You may skip the python install steps if you already have the desired python release. We can create a virtual environment if the OS does not include Python 3.x by default.

yum install gcc openssl-devel bzip2-devel wget
cd /usr/src
tar xzf Python-3.6.6.tgz
cd Python-3.6.6
./configure --enable-optimizations
make altinstall
python3.6 -m venv venv
source venv/bin/activate
pip install pip --upgrade
pip install pg_chameleon

Step 2: This tool requires a configuration file to store the source/target server details, and a directory to store the logs. Use the following command to let pg_chameleon create the configuration file template and the respective directories for you.

$ chameleon set_configuration_files

The above command would produce the following output, which shows that it created some directories and a file in the location where you ran the command.

creating directory /var/lib/pgsql/.pg_chameleon
creating directory /var/lib/pgsql/.pg_chameleon/configuration/
creating directory /var/lib/pgsql/.pg_chameleon/logs/
creating directory /var/lib/pgsql/.pg_chameleon/pid/
copying configuration example in /var/lib/pgsql/.pg_chameleon/configuration//config-example.yml

Copy the sample configuration file to another file, lets say, default.yml

$ cd .pg_chameleon/configuration/
$ cp config-example.yml default.yml

Here is how my default.yml file looks after adding all the required parameters. In this file, we can optionally specify the data type conversions, tables to skipped from replication and the DML events those need to skipped for selected list of tables.

#global settings
pid_dir: '~/.pg_chameleon/pid/'
log_dir: '~/.pg_chameleon/logs/'
log_dest: file
log_level: info
log_days_keep: 10
rollbar_key: ''
rollbar_env: ''
# type_override allows the user to override the default type conversion into a different one.
    override_to: boolean
      - "*"
#postgres  destination connection
  host: "localhost"
  port: "5432"
  user: "usr_replica"
  password: "secret"
  database: "db_replica"
  charset: "utf8"
      host: "localhost"
      port: "3306"
      user: "usr_replica"
      password: "Secret123!"
      charset: 'utf8'
      connect_timeout: 10
      sakila: sch_sakila
#      -
#      -
      - usr_readonly
    lock_timeout: "120s"
    my_server_id: 100
    replica_batch_size: 10000
    replay_max_rows: 10000
    batch_retention: '1 day'
    copy_max_memory: "300M"
    copy_mode: 'file'
    out_dir: /tmp
    sleep_loop: 1
    on_error_replay: continue
    on_error_read: continue
    auto_maintenance: "disabled"
    gtid_enable: No
    type: mysql
#        - #skips inserts on the table
#        - delphis_mediterranea #skips deletes on schema delphis_mediterranea

Step 3: Initialize the replica using this command:

$ chameleon create_replica_schema --debug

The above command creates a schema and nine tables in the PostgreSQL database that you specified in the .pg_chameleon/configuration/default.yml file. These tables are needed to manage replication from source to destination. The same can be observed in the following log.

db_replica=# \dn
List of schemas
Name | Owner
public | postgres
sch_chameleon | target_user
(2 rows)
db_replica=# \dt sch_chameleon.t_*
List of relations
Schema | Name | Type | Owner
sch_chameleon | t_batch_events | table | target_user
sch_chameleon | t_discarded_rows | table | target_user
sch_chameleon | t_error_log | table | target_user
sch_chameleon | t_last_received | table | target_user
sch_chameleon | t_last_replayed | table | target_user
sch_chameleon | t_log_replica | table | target_user
sch_chameleon | t_replica_batch | table | target_user
sch_chameleon | t_replica_tables | table | target_user
sch_chameleon | t_sources | table | target_user
(9 rows)

Step 4: Add the source details to pg_chameleon using the following command. Provide the name of the source as specified in the configuration file. In this example, the source name is mysql and the target is postgres database defined under pg_conn.

$ chameleon add_source --config default --source mysql --debug

Once you run the above command, you should see that the source details are added to the t_sources table.

db_replica=# select * from sch_chameleon.t_sources;
-[ RECORD 1 ]-------+----------------------------------------------
i_id_source | 1
t_source | mysql
jsb_schema_mappings | {"sakila": "sch_sakila"}
enm_status | ready
t_binlog_name |
i_binlog_position |
b_consistent | t
b_paused | f
b_maintenance | f
ts_last_maintenance |
enm_source_type | mysql
v_log_table | {t_log_replica_mysql_1,t_log_replica_mysql_2}
$ chameleon show_status --config default
Source id Source name Type Status Consistent Read lag Last read Replay lag Last replay
----------- ------------- ------ -------- ------------ ---------- ----------- ------------ -------------
1 mysql mysql ready Yes N/A N/A

Step 5: Initialize the replica/slave using the following command. Specify the source from which you are replicating the changes to the PostgreSQL database.

$ chameleon init_replica --config default --source mysql --debug

Initialization involves the following tasks on the MySQL server (source).

1. Flush the tables with read lock
2. Get the master’s coordinates
3. Copy the data
4. Release the locks

The above command creates the target schema in your postgres database automatically.
In the default.yml file, we mentioned the following schema_mappings.

sakila: sch_sakila

So, now it created the new schema scott in the target database db_replica.

db_replica=# \dn
List of schemas
Name | Owner
public | postgres
sch_chameleon | usr_replica
sch_sakila | usr_replica
(3 rows)

Step 6: Now, start replication using the following command.

$ chameleon start_replica --config default --source mysql

Step 7: Check replication status and any errors using the following commands.

$ chameleon show_status --config default
$ chameleon show_errors

This is how the status looks:

$ chameleon show_status --source mysql
Source id Source name Type Status Consistent Read lag Last read Replay lag Last replay
----------- ------------- ------ -------- ------------ ---------- ----------- ------------ -------------
1 mysql mysql running No N/A N/A
== Schema mappings ==
Origin schema Destination schema
--------------- --------------------
sakila sch_sakila
== Replica status ==
--------------------- ---
Tables not replicated 0
Tables replicated 17
All tables 17
Last maintenance N/A
Next maintenance N/A
Replayed rows
Replayed DDL
Skipped rows

Now, you should see that the changes are continuously getting replicated from MySQL to PostgreSQL.

Step 8:  To validate, you may insert a record into the table in MySQL that we created for the purpose of validation and check that it is replicated to postgres.

$ mysql -u root -pSecret123! -e "INSERT INTO sakila.emp VALUES (1,'avinash','vallarapu')"
mysql: [Warning] Using a password on the command line interface can be insecure.
$ psql -d db_replica -c "select * from sch_sakila.emp"
 id | first_name | last_name
  1 | avinash    | vallarapu
(1 row)

In the above log, we see that the record that was inserted to the MySQL table was replicated to the PostgreSQL table.

You may also add multiple sources for replication to PostgreSQL (target).

Reference :

Please refer to the above documentation to find out about the many more options that are available with pg_chameleon

The post Replication from Percona Server for MySQL to PostgreSQL using pg_chameleon appeared first on Percona Database Performance Blog.


Tuning Autovacuum in PostgreSQL and Autovacuum Internals

Tuning Autovacuum in PostgreSQL

The performance of a PostgreSQL database can be compromised by dead tuples, since they continue to occupy space and can lead to bloat. We provided an introduction to VACUUM and bloat in an earlier blog post. Now, though, it’s time to look at autovacuum for postgres, and the internals you to know to maintain a high performance PostgreSQL database needed by demanding applications.

What is autovacuum ?

Autovacuum is one of the background utility processes that starts automatically when you start PostgreSQL. As you see in the following log, the postmaster (parent PostgreSQL process) with pid 2862 has started the autovacuum launcher process with pid 2868. To start autovacuum, you must have the parameter autovacuum set to ON. In fact, you should not set it to OFF in a production system unless you are 100% sure about what you are doing and its implications.

avi@percona:~$ps -eaf | egrep "/post|autovacuum"
postgres  2862     1  0 Jun17 pts/0    00:00:11 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
postgres  2868  2862  0 Jun17 ?        00:00:10 postgres: autovacuum launcher process
postgres 15427  4398  0 18:35 pts/1    00:00:00 grep -E --color=auto /post|autovacuum

Why is autovacuum needed ? 

We need VACUUM to remove dead tuples, so that the space occupied by dead tuples can be re-used by the table for future inserts/updates. To know more about dead tuples and bloat, please read our previous blog post. We also need ANALYZE on the table that updates the table statistics, so that the optimizer can choose optimal execution plans for an SQL statement. It is the autovacuum in postgres that is responsible for performing both vacuum and analyze on tables.

There exists another background process in postgres called Stats Collector that tracks the usage and activity information. The information collected by this process is used by autovacuum launcher to identify the list of candidate tables for autovacuum. PostgreSQL identifies the tables needing vacuum or analyze automatically, but only when autovacuum is enabled. This ensures that postgres heals itself and stops the database from developing more bloat/fragmentation.

Parameters needed to enable autovacuum in PostgreSQL are :

autovacuum = on  # ( ON by default )
track_counts = on # ( ON by default )


  is used by the stats collector. Without that in place, autovacuum cannot access the candidate tables.

Logging autovacuum

Eventually, you may want to log the tables on which autovacuum spends more time. In that case, set the parameter log_autovacuum_min_duration to a value (defaults to milliseconds), so that any autovacuum that runs for more than this value is logged to the PostgreSQL log file. This may help tune your table level autovacuum settings appropriately.

# Setting this parameter to 0 logs every autovacuum to the log file.
log_autovacuum_min_duration = '250ms' # Or 1s, 1min, 1h, 1d

Here is an example log of autovacuum vacuum and analyze

< 2018-08-06 07:22:35.040 EDT > LOG: automatic vacuum of table "vactest.scott.employee": index scans: 0
pages: 0 removed, 1190 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 110008 removed, 110008 remain, 0 are dead but not yet removable
buffer usage: 2402 hits, 2 misses, 0 dirtied
avg read rate: 0.057 MB/s, avg write rate: 0.000 MB/s
system usage: CPU 0.00s/0.02u sec elapsed 0.27 sec
< 2018-08-06 07:22:35.199 EDT > LOG: automatic analyze of table "vactest.scott.employee" system usage: CPU 0.00s/0.02u sec elapsed 0.15 sec

When does PostgreSQL run autovacuum on a table ? 

As discussed earlier, autovacuum in postgres refers to both automatic VACUUM and ANALYZE and not just VACUUM. An automatic vacuum or analyze runs on a table depending on the following mathematic equations.

The formula for calculating the effective table level autovacuum threshold is :

Autovacuum VACUUM thresold for a table = autovacuum_vacuum_scale_factor * number of tuples + autovacuum_vacuum_threshold

With the equation above, it is clear that if the actual number of dead tuples in a table exceeds this effective threshold, due to updates and deletes, that table becomes a candidate for autovacuum vacuum.

Autovacuum ANALYZE threshold for a table = autovacuum_analyze_scale_factor * number of tuples + autovacuum_analyze_threshold

The above equation says that any table with a total number of inserts/deletes/updates exceeding this threshold—since last analyze—is eligible for an autovacuum analyze.

Let’s understand these parameters in detail.

  • autovacuum_vacuum_scale_factor Or autovacuum_analyze_scale_factor : Fraction of the table records that will be added to the formula. For example, a value of 0.2 equals to 20% of the table records.
  • autovacuum_vacuum_threshold Or autovacuum_analyze_threshold : Minimum number of obsolete records or dml’s needed to trigger an autovacuum.

Let’s consider a table: percona.employee with 1000 records and the following autovacuum parameters.

autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 50
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 50

Using the above mentioned mathematical formulae as reference,

Table : percona.employee becomes a candidate for autovacuum Vacuum when,
Total number of Obsolete records = (0.2 * 1000) + 50 = 250

Table : percona.employee becomes a candidate for autovacuum ANALYZE when,
Total number of Inserts/Deletes/Updates = (0.1 * 1000) + 50 = 150

Tuning Autovacuum in PostgreSQL

We need to understand that these are global settings. These settings are applicable to all the databases in the instance. This means, regardless of the table size, if the above formula is reached, a table is eligible for autovacuum vacuum or analyze.

Is this a problem ?

Consider a table with ten records versus a table with a million records. Even though the table with a million records may be involved in transactions far more often, the frequency at which a vacuum or an analyze runs automatically could be greater for the table with just ten records.

Consequently, PostgreSQL allows you to configure individual table level autovacuum settings that bypass global settings.

ALTER TABLE scott.employee SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 100);

Output Log
avi@percona:~$psql -d percona
psql (10.4)
Type "help" for help.
percona=# ALTER TABLE scott.employee SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 100);

The above setting runs autovacuum vacuum on the table scott.employee only once there is more than 100 obsolete records.

How do we identify the tables that need their autovacuum settings tuned ? 

In order to tune autovacuum for tables individually, you must know the number of inserts/deletes/updates on a table for an interval. You can also view the postgres catalog view : pg_stat_user_tables to get that information.

percona=# SELECT n_tup_ins as "inserts",n_tup_upd as "updates",n_tup_del as "deletes", n_live_tup as "live_tuples", n_dead_tup as "dead_tuples"
FROM pg_stat_user_tables
WHERE schemaname = 'scott' and relname = 'employee';
 inserts | updates | deletes | live_tuples | dead_tuples
      30 |      40 |       9 |          21 |          39
(1 row)

As observed in the above log, taking a snapshot of this data for a certain interval should help you understand the frequency of DMLs on each table. In turn, this should help you with tuning your autovacuum settings for individual tables.

How many autovacuum processes can run at a time ? 

There cannot be more than autovacuum_max_workers number of autovacuum processes running at a time, across the instance/cluster that may contain more than one database. Autovacuum launcher background process starts a worker process for a table that needs a vacuum or an analyze. If there are four databases with autovacuum_max_workers set to 3, then, the 4th database has to wait until one of the existing worker process gets free.

Before starting the next autovacuum, it waits for autovacuum_naptime, the default is 1 min on most of the versions. If you have three databases, the next autovacuum waits for 60/3 seconds. So, the wait time before starting next autovacuum is always (autovacuum_naptime/N) where N is the total number of databases in the instance.

Does increasing autovacuum_max_workers alone increase the number of autovacuum processes that can run in parallel ?
NO. This is explained better in next few lines.

Is VACUUM IO intensive? 

Autovacuum can be considered as a cleanup. As discussed earlier, we have 1 worker process per table. Autovacuum reads 8KB (default block_size) pages of a table from disk and modifies/writes to the pages containing dead tuples. This involves both read and write IO. Thus, this could be an IO intensive operation, when there is an autovacuum running on a huge table with many dead tuples, during a peak transaction time. To avoid this issue, we have a few parameters that are set to minimize the impact on IO due to vacuum.

The following are the parameters used to tune autovacuum IO

  • autovacuum_vacuum_cost_limit : total cost limit autovacuum could reach (combined by all autovacuum jobs).
  • autovacuum_vacuum_cost_delay : autovacuum will sleep for these many milliseconds when a cleanup reaching autovacuum_vacuum_cost_limit cost is done.
  • vacuum_cost_page_hit : Cost of reading a page that is already in shared buffers and doesn’t need a disk read.
  • vacuum_cost_page_miss : Cost of fetching a page that is not in shared buffers.
  • vacuum_cost_page_dirty : Cost of writing to each page when dead tuples are found in it.
Default Values for the parameters discussed above.
autovacuum_vacuum_cost_limit = -1 (So, it defaults to vacuum_cost_limit) = 200
autovacuum_vacuum_cost_delay = 20ms
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20

Consider autovacuum VACUUM running on the table percona.employee.

Let’s imagine what can happen in 1 second. (1 second = 1000 milliseconds)

In a best case scenario where read latency is 0 milliseconds, autovacuum can wake up and go for sleep 50 times (1000 milliseconds / 20 ms) because the delay between wake-ups needs to be 20 milliseconds.

1 second = 1000 milliseconds = 50 * autovacuum_vacuum_cost_delay

Since the cost associated per reading a page in shared_buffers is 1, in every wake up 200 pages can be read, and in 50 wake-ups 50*200 pages can be read.

If all the pages with dead tuples are found in shared buffers, with an autovacuum_vacuum_cost_delay of 20ms, then it can read: ((200 / vacuum_cost_page_hit) * 8) KB in each round that needs to wait forautovacuum_vacuum_cost_delay amount of time.

Thus, at the most, an autovacuum can read : 50 * 200 * 8 KB = 78.13 MB per second (if blocks are already found in shared_buffers), considering the block_size as 8192 bytes.

If the blocks are not in shared buffers and need to fetched from disk, an autovacuum can read : 50 * ((200 / vacuum_cost_page_miss) * 8) KB = 7.81 MB per second.

All the information we have seen above is for read IO.

Now, in order to delete dead tuples from a page/block, the cost of a write operation is : vacuum_cost_page_dirty, set to 20 by default.

At the most, an autovacuum can write/dirty : 50 * ((200 / vacuum_cost_page_dirty) * 8) KB = 3.9 MB per second.

Generally, this cost is equally divided to all the autovacuum_max_workers number of autovacuum processes running in the Instance. So, increasing the autovacuum_max_workers may delay the autovacuum execution for the currently running autovacuum workers. And increasing the autovacuum_vacuum_cost_limit may cause IO bottlenecks. An important point to note is that this behaviour can be overridden by setting the storage parameters of individual tables, which would subsequently ignore the global settings.

postgres=# alter table percona.employee set (autovacuum_vacuum_cost_limit = 500);
postgres=# alter table percona.employee set (autovacuum_vacuum_cost_delay = 10);
postgres=# \d+ percona.employee
Table "percona.employee"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
id | integer | | | | plain | |
Options: autovacuum_vacuum_threshold=10000, autovacuum_vacuum_cost_limit=500, autovacuum_vacuum_cost_delay=10

Thus, on a busy OLTP database, always have a strategy to implement manual VACUUM on tables that are frequently hit with DMLs, during a low peak window. You may have as many parallel vacuum jobs as possible when you run it manually after setting relevant autovacuum_* settings. For this reason, a scheduled manual Vacuum Job is always recommended alongside finely tuned autovacuum settings.

The post Tuning Autovacuum in PostgreSQL and Autovacuum Internals appeared first on Percona Database Performance Blog.


This Week in Data with Colin Charles 48: Coinbase Powered by MongoDB and Prometheus Graduates in the CNCF

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

The call for submitting a talk to Percona Live Europe 2018 is closing today, and while there may be a short extension, have you already got your talk submitted? I suggest doing so ASAP!

I’m sure many of you have heard of cryptocurrencies, the blockchain, and so on. But how many of you realiize that Coinbase, an application that handles cryptocurrency trades, matching book orders, and more, is powered by MongoDB? With the hype and growth in interest in late 2017, Coinbase has had to scale. They gave an excellent talk at MongoDB World, titled MongoDB & Crypto Mania (the video is worth a watch), and they’ve also written a blog post, How we’re scaling our platform for spikes in customer demand. They even went driver hacking (the Ruby driver for MongoDB)!

It is great to see there be a weekly review of happenings in the Vitess world.

PingCap and TiDB have been to many Percona Live events to present, and recently hired Morgan Tocker. Morgan has migrated his blog from MySQL to TiDB. Read more about his experience in, This blog, now Powered by WordPress + TiDB. Reminds me of the early days of Galera Cluster and showing how Drupal could be powered by it!


Link List

  • Sys Schema MySQL 5.7+ – blogger from Wipro, focusing on an introduction to the sys schema on MySQL (note: still not available in the MariaDB Server fork).
  • Prometheus Graduates in the CNCF, so is considered a mature project. Criteria for graduation is such that “projects must demonstrate thriving adoption, a documented, structured governance process, and a strong commitment to community sustainability and inclusivity.” Percona benefits from Prometheus in Percona Monitoring & Management (PMM), so we should celebrate this milestone!
  • Replicating from MySQL 8.0 to MySQL 5.7
  • A while ago in this column, we linked to Shlomi Noach’s excellent post on MySQL High Availability at GitHub. We were also introduced to GitHub Load Balancer (GLB), which they ran on top of HAProxy. However back then, GLB wasn’t open; now you can get GLB Director: GLB: GitHub’s open source load balancer. The project describes GLB Director as: “… a Layer 4 load balancer which scales a single IP address across a large number of physical machines while attempting to minimise connection disruption during any change in servers. GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.”
  • F1 Query: Declarative Querying at Scale – a well-written paper.

Upcoming Appearances


I look forward to feedback/tips via e-mail at or on Twitter @bytebot.


The post This Week in Data with Colin Charles 48: Coinbase Powered by MongoDB and Prometheus Graduates in the CNCF appeared first on Percona Database Performance Blog.


Basic Understanding of Bloat and VACUUM in PostgreSQL

VACUUM and Bloat PostgreSQL

VACUUM and Bloat PostgreSQLImplementation of MVCC (Multi-Version Concurrency Control) in PostgreSQL is different and special when compared with other RDBMS. MVCC in PostgreSQL controls which tuples can be visible to transactions via versioning.

What is versioning in PostgreSQL?

Let’s consider the case of an Oracle or a MySQL Database. What happens when you perform a DELETE or an UPDATE of a row? You see an UNDO record maintained in a global UNDO Segment. This UNDO segment contains the past image of a row, to help database achieve consistency. (the “C” in A.C.I.D). For example, if there is an old transaction that depends on the row that got deleted, the row may still be visible to it because the past image is still maintained in the UNDO. If you are an Oracle DBA reading this blog post, you may quickly recollect the error

ORA-01555 snapshot too old

 . What this error means is—you may have a smaller undo_retention or not a huge UNDO segment that could retain all the past images (versions) needed by the existing or old transactions.

You may not have to worry about that with PostgreSQL.

Then how does PostgreSQL manage UNDO ?

In simple terms, PostgreSQL maintains both the past image and the latest image of a row in its own Table. It means, UNDO is maintained within each table. And this is done through versioning. Now, we may get a hint that, every row of PostgreSQL table has a version number. And that is absolutely correct. In order to understand how these versions are maintained within each table, you should understand the hidden columns of a table (especially xmin) in PostgreSQL.

Understanding the Hidden Columns of a Table

When you describe a table, you would only see the columns you have added, like you see in the following log.

percona=# \d scott.employee
                                          Table "scott.employee"
  Column  |          Type          | Collation | Nullable |                    Default
 emp_id   | integer                |           | not null | nextval('scott.employee_emp_id_seq'::regclass)
 emp_name | character varying(100) |           |          |
 dept_id  | integer                |           |          |

However, if you look at all the columns of the table in pg_attribute, you should see several hidden columns as you see in the following log.

percona=# SELECT attname, format_type (atttypid, atttypmod)
FROM pg_attribute
WHERE attrelid::regclass::text='scott.employee'
ORDER BY attnum;
 attname  |      format_type
 tableoid | oid
 cmax     | cid
 xmax     | xid
 cmin     | cid
 xmin     | xid
 ctid     | tid
 emp_id   | integer
 emp_name | character varying(100)
 dept_id  | integer
(9 rows)

Let’s understand a few of these hidden columns in detail.

tableoid : Contains the OID of the table that contains this row. Used by queries that select from inheritance hierarchies.
More details on table inheritance can be found here :

xmin : The transaction ID(xid) of the inserting transaction for this row version. Upon update, a new row version is inserted. Let’s see the following log to understand the xmin more.

percona=# select txid_current();
(1 row)
percona=# INSERT into scott.employee VALUES (9,'avi',9);
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 9;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
  647 |    0 |    0 |    0 |      9 | avi      |       9
(1 row)

As you see in the above log, the transaction ID was 646 for the command => select txid_current(). Thus, the immediate INSERT statement got a transaction ID 647. Hence, the record was assigned an xmin of 647. This means, no transaction ID that has started before the ID 647, can see this row. In other words, already running transactions with txid less than 647 cannot see the row inserted by txid 647. 

With the above example, you should now understand that every tuple has an xmin that is assigned the txid that inserted it.

Note: the behavior may change depending on the isolation levels you choose, would be discussed later in another blog post.

xmax : This values is 0 if it was not a deleted row version. Before the DELETE is committed, the xmax of the row version changes to the ID of the transaction that has issued the DELETE. Let’s observe the following log to understand that better.

On Terminal A : We open a transaction and delete a row without committing it.

percona=# BEGIN;
percona=# select txid_current();
(1 row)
percona=# DELETE from scott.employee where emp_id = 10;

On Terminal B : Observe the xmax values before and after the delete (that has not been committed).

Before the Delete
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
  649 |    0 |    0 |    0 |     10 | avi      |      10
After the Delete
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
  649 |  655 |    0 |    0 |     10 | avi      |      10
(1 row)

As you see in the above logs, the xmax value changed to the transaction ID that has issued the delete. If you have issued a ROLLBACK, or if the transaction got aborted, xmax remains at the transaction ID that tried to DELETE it (which is 655) in this case.

Now that we understand the hidden columns xmin and xmax, let’s observe what happens after a DELETE or an UPDATE in PostgreSQL. As we discussed earlier, through the hidden columns in PostgreSQL for every table, we understand that there are multiple versions of rows maintained within each table. Let’s see the following example to understand this better.

We’ll insert 10 records to the table : scott.employee

percona=# INSERT into scott.employee VALUES (generate_series(1,10),'avi',1);

Now, let’s DELETE 5 records from the table.

percona=# DELETE from scott.employee where emp_id > 5;
percona=# select count(*) from scott.employee;
(1 row)

Now, when you check the count after DELETE, you would not see the records that have been DELETED. To see any row versions that exist in the table but are not visible, we have an extension called pageinspect. The pageinspect module provides functions that allow you to inspect the contents of database pages at a low level, which is useful for debugging purposes. Let’s create this extension to see the older row versions those have been deleted.

percona=# CREATE EXTENSION pageinspect;
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |              tuple_data_split
    668 |      0 | {"\\x01000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x02000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x03000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x04000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x05000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x06000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x07000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x08000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x09000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x0a000000","\\x09617669","\\x01000000"}
(10 rows)

Now, we could still see 10 records in the table even after deleting 5 records from it. Also, you can observe here that t_xmax is set to the transaction ID that has deleted them. These deleted records are retained in the same table to serve any of the older transactions that are still accessing them.

We’ll take a look at what an UPDATE would do in the following Log.  

percona=# DROP TABLE scott.employee ;
percona=# CREATE TABLE scott.employee (emp_id INT, emp_name VARCHAR(100), dept_id INT);
percona=# INSERT into scott.employee VALUES (generate_series(1,10),'avi',1);
percona=# UPDATE scott.employee SET emp_name = 'avii';
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |               tuple_data_split
    672 |    673 | {"\\x01000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x02000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x03000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x04000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x05000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x06000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x07000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x08000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x09000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x0a000000","\\x09617669","\\x01000000"}
    673 |      0 | {"\\x01000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x02000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x03000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x04000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x05000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x06000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x07000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x08000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x09000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x0a000000","\\x0b61766969","\\x01000000"}
(20 rows)

An UPDATE in PostgreSQL would perform an insert and a delete. Hence, all the records being UPDATED have been deleted and inserted back with the new value. Deleted records have non-zero t_xmax value.

Records for which you see a non-zero value for t_xmax may be required by the previous transactions to ensure consistency based on appropriate isolation levels.

We discussed about xmin and xmax. What are these hidden columns cmin and cmax ?

cmax : The command identifier within the deleting transaction or zero. (As per the documentation). However, both cmin and cmax are always the same as per the PostgreSQL source code.

cmin : The command identifier within the inserting transaction. You could see the cmin of the 3 insert statements starting with 0, in the following log.

See the following log to understand how the cmin and cmax values change through inserts and deletes in a transaction.

On Terminal A
percona=# BEGIN;
percona=# INSERT into scott.employee VALUES (1,'avi',2);
percona=# INSERT into scott.employee VALUES (2,'avi',2);
percona=# INSERT into scott.employee VALUES (3,'avi',2);
percona=# INSERT into scott.employee VALUES (4,'avi',2);
percona=# INSERT into scott.employee VALUES (5,'avi',2);
percona=# INSERT into scott.employee VALUES (6,'avi',2);
percona=# INSERT into scott.employee VALUES (7,'avi',2);
percona=# INSERT into scott.employee VALUES (8,'avi',2);
percona=# COMMIT;
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
  644 |    0 |    0 |    0 |      1 | avi      |       2
  644 |    0 |    1 |    1 |      2 | avi      |       2
  644 |    0 |    2 |    2 |      3 | avi      |       2
  644 |    0 |    3 |    3 |      4 | avi      |       2
  644 |    0 |    4 |    4 |      5 | avi      |       2
  644 |    0 |    5 |    5 |      6 | avi      |       2
  644 |    0 |    6 |    6 |      7 | avi      |       2
  644 |    0 |    7 |    7 |      8 | avi      |       2
(8 rows)

If you observe the above output log, you see cmin and cmax values incrementing for each insert.

Now let’s delete 3 records from Terminal A and observe how the values appear in Terminal B before COMMIT.

On Terminal A
percona=# BEGIN;
percona=# DELETE from scott.employee where emp_id = 4;
percona=# DELETE from scott.employee where emp_id = 5;
percona=# DELETE from scott.employee where emp_id = 6;
On Terminal B, before issuing COMMIT on Terminal A
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
  644 |    0 |    0 |    0 |      1 | avi      |       2
  644 |    0 |    1 |    1 |      2 | avi      |       2
  644 |    0 |    2 |    2 |      3 | avi      |       2
  644 |  645 |    0 |    0 |      4 | avi      |       2
  644 |  645 |    1 |    1 |      5 | avi      |       2
  644 |  645 |    2 |    2 |      6 | avi      |       2
  644 |    0 |    6 |    6 |      7 | avi      |       2
  644 |    0 |    7 |    7 |      8 | avi      |       2
(8 rows)

Now, in the above log, you see that the cmax and cmin values have incrementally started from 0 for the records being deleted. Their values where different before the delete, as we have seen earlier. Even if you ROLLBACK, the values remain the same.

After understanding the hidden columns and how PostgreSQL maintains UNDO as multiple versions of rows, the next question would be—what would clean up this UNDO from a table? Doesn’t this increase the size of a table continuously? In order to understand that better, we need to know about VACUUM in PostgreSQL.

VACUUM in PostgreSQL

As seen in the above examples, every such record that has been deleted but is still taking some space is called a dead tuple. Once there is no dependency on those dead tuples with the already running transactions, the dead tuples are no longer needed. Thus, PostgreSQL runs VACUUM on such Tables. VACUUM reclaims the storage occupied by these dead tuples. The space occupied by these dead tuples may be referred to as Bloat. VACUUM scans the pages for dead tuples and marks them to the freespace map (FSM). Each relation apart from hash indexes has an FSM stored in a separate file called <relation_oid>_fsm.

Here, relation_oid is the oid of the relation that is visible in pg_class.

percona=# select oid from pg_class where relname = 'employee';
(1 row)

Upon VACUUM, this space is not reclaimed to disk but can be re-used by future inserts on this table. VACUUM stores the free space available on each heap (or index) page to the FSM file.

Running a VACUUM is a non-blocking operation. It never causes exclusive locks on tables. This means VACUUM can run on a busy transactional table in production while there are several transactions writing to it.

As we discussed earlier, an UPDATE of 10 records has generated 10 dead tuples. Let us see the following log to understand what happens to those dead tuples after a VACUUM.

percona=# VACUUM scott.employee ;
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |               tuple_data_split
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
    673 |      0 | {"\\x01000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x02000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x03000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x04000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x05000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x06000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x07000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x08000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x09000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x0a000000","\\x0b61766969","\\x01000000"}
(20 rows)

In the above log, you might notice that the dead tuples are removed and the space is available for re-use. However, this space is not reclaimed to filesystem after VACUUM. Only the future inserts can use this space.

VACUUM does an additional task. All the rows that are inserted and successfully committed in the past are marked as frozen, which indicates that they are visible to all the current and future transactions. We will be discussing this in detail in our future blog post “Transaction ID Wraparound in PostgreSQL”.

VACUUM does not usually reclaim the space to filesystem unless the dead tuples are beyond the high water mark.

Let’s consider the following example to see when a VACUUM could release the space to filesystem.

Create a table and insert some sample records. The records are physically ordered on the disk based on the primary key index.

percona=# CREATE TABLE scott.employee (emp_id int PRIMARY KEY, name varchar(20), dept_id int);
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), 'avi', 1);
INSERT 0 1000

Now, run ANALYZE on the table to update its statistics and see how many pages are allocated to the table after the above insert.

percona=# ANALYZE scott.employee ;
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
relpages | total_bytes | relsize
6        | 49152       | 49152
(1 row)

Let’s now see how VACUUM behaves when you delete the rows with emp_id > 500

percona=# DELETE from scott.employee where emp_id > 500;
percona=# VACUUM ANALYZE scott.employee ;
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
relpages | total_bytes | relsize
3        | 24576       | 24576
(1 row)

In the above log, you see that the VACUUM has reclaimed half the space to filesystem. Earlier, it occupied 6 pages (8KB each or as set to parameter : block_size). After VACUUM, it has released 3 pages to filesystem.

Now, let’s repeat the same exercise by deleting the rows with emp_id < 500

percona=# DELETE from scott.employee ;
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), 'avi', 1);
INSERT 0 1000
percona=# DELETE from scott.employee where emp_id < 500;
percona=# VACUUM ANALYZE scott.employee ;
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
        6 |       49152 |   49152
(1 row)

In the above example, you see that the number of pages still remain same after deleting half the records from the table. This means, VACUUM has not released the space to filesystem this time.

As explained earlier, if there are pages with no more live tuples after the high water mark, the subsequent pages can be flushed away to the disk by VACUUM. In the first case, it is understandable that there are no more live tuples after the 3rd page. So, the 4th, 5th and 6th page have been flushed to disk.

However, If you would need to reclaim the space to filesystem in the scenario where we deleted all the records with emp_id < 500, you may run VACUUM FULL. VACUUM FULL rebuilds the entire table and reclaims the space to disk.

percona=# VACUUM FULL scott.employee ;
percona=# VACUUM ANALYZE scott.employee ;
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
        3 |       24576 |   24576
(1 row)

Please note that VACUUM FULL is not an ONLINE operation. It is a blocking operation. You cannot read from or write to the table while VACUUM FULL is in progress. We will discuss about the ways to rebuild a table online without blocking in our future blog post.

The post Basic Understanding of Bloat and VACUUM in PostgreSQL appeared first on Percona Database Performance Blog.


This Week in Data with Colin Charles 47: MySQL 8.0.12 and It’s Time To Submit!

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Don’t wait, submit a talk for Percona Live Europe 2018 to be held in Frankfurt 5-7 November 2018. The call for proposals is ending soon, there is a committee being created, and it is a great conference to speak at, with a new city to boot!


  • A big release, MySQL 8.0.12, with INSTANT ADD COLUMN support, BLOB optimisations, changes around replication, the query rewrite plugin and lots more. Naturally this also means the connectors get bumped up to the 8.0.12, including a nice new MySQL Shell.
  • A maintenance release, with security fixes, MySQL 5.5.61 as well as MariaDB 5.5.61.
  • repmgr v4.1 helps monitor PostgreSQL replication, and can handle switch overs and failovers.

Link List

  • Saving With MyRocks in The Cloud – a great MyRocks use case, as in the cloud, resources are major considerations and you can save on I/O with MyRocks. As long as your workload is I/O bound, you’re bound to benefit.
  • Hasura GraphQL Engine allows you to get an instant GraphQL API on any PostgreSQL based application. This is in addition to Graphile. For MySQL users, there is Prisma.

Industry Updates

  • Jeremy Cole (Linkedin) ended his sabbatical to start work at Shopify. He was previously hacking on MySQL and MariaDB Server at Google, and had stints at Twitter, Yahoo!, his co-owned firm Proven Scaling, as well as MySQL AB.
  • Dremio raises $30 million from the likes of Cisco and more for their Series B. They are a “data-as-a-service” company, having raised a total of $45m in two rounds (Crunchbase).

Upcoming Appearances


I look forward to feedback/tips via e-mail at or on Twitter @bytebot.


The post This Week in Data with Colin Charles 47: MySQL 8.0.12 and It’s Time To Submit! appeared first on Percona Database Performance Blog.


This Week in Data with Colin Charles 44: MongoDB 4.0 and Facebook MyRocks

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

There have been two big pieces of news this week: the release of MongoDB 4.0 and the fact that Facebook has migrated the Messenger backend over to MyRocks.

MongoDB 4.0 is stable, with support for multi-document ACID transactions. I quite like the engineering chalk and talks videos on the transactions page. There are also improvements to help manage your MongoDB workloads in a Kubernetes cluster. MongoDB Atlas supports global clusters (geographically distributed databases, low latency writes, and data placement controls for regulatory compliance), HIPAA compliance, and more. ZDNet calls it the “operational database that is developer friendly”. The TechCrunch take was more focused on MongoDB Atlas, MongoDB launches Global Clusters to put geographic data control within reach of anyone.

In addition to that, I found this little snippet on CNBC featuring Michael Gordon, MongoDB CFO, very interesting: last quarter MongoDB Inc reported 53% year-over-year growth in their subscription revenue business. The fastest-growing piece of the business? Cloud-hosted database as a service offering. They partner with Amazon, Google and Microsoft. They are looking to grow in the Chinese market.

Did you attend MongoDB World 2018? I personally can’t wait to see the presentations. Do not forget to read the MongoDB 4.0 release notes in the manual. Take heed of this important note: “In most cases, multi-document transaction incurs a greater performance cost over single document writes, and the availability of multi-document transaction should not be a replacement for effective schema design.”

As for Facebook Messenger migrating to MyRocks, this blog post is highly detailed: Migrating Messenger storage to optimize performance. This is a migration from the previous HBase backend to MyRocks. End users should notice a more responsive product and better search. For Facebook, storage consumption went down by 90%! The migration methodology to ensure Messenger usage was not disrupted for end users is also worth paying attention to. A more personal note from Yoshinori Matsunobu, as MyRocks is something he’s been spearheading. Don’t forget that you can try out MyRocks in Percona Server for MySQL as well as in MariaDB Server 10.2 and 10.3. To use Zstandard (or zstd for short), Percona Server for MySQL supports this (MariaDB does not, but has varying other compression algorithms).

Have you seen the Percona Open Source Database Community Blog? Jean-François Gagné recently wrote about how he posted on the Community Blog (so a very nice behind the scenes kind of post), and I hope you also read A Nice Feature in MariaDB 10.3: No InnoDB Buffer Pool in Core Dumps. Don’t forget to add this new blog to your RSS feed readers.

Lastly, as a quick note, there will unlikely be a column next week. I’m taking a short vacation, so see you in the following week!


Link List

Industry Updates

  • Louis Fahrberger (formerly of Clustrix, MariaDB Corporation, InfoBright and MySQL) is now an Account Executive in Sales for MemSQL.
  • The Wall Street Journal reports on Oracle Cloud and how the business continues to grow. “Revenues from its cloud services businesses jumped 25% year over year to $1.7 billion for its fiscal fourth quarter that ended May 31”.
  • The Financial Times reports on Red Hat sinks as currency swings cloud full-year sales outlook. The CFO, Eric Shander said, “we continue to expect strong demand for our hybrid cloud enabling technologies”.

Upcoming appearances

  • OSCON – Portland, Oregon, USA – July 16-19 2018


I look forward to feedback/tips via e-mail at or on Twitter @bytebot.

The post This Week in Data with Colin Charles 44: MongoDB 4.0 and Facebook MyRocks appeared first on Percona Database Performance Blog.


Faster Point In Time Recovery (PITR) in PostgreSQL Using a Delayed Standby

PostgreSQL Point in Time Recovery

PostgreSQL Point in Time RecoveryThe need to recover a database back to a certain point in time can be a nerve-racking task for DBAs and for businesses. Can this be simplified? Could it be made to work faster? Can we recover to a given point in time with zero loss of transactions/records? Fortunately, the answer to these questions is yes. PostgreSQL Point in Time Recovery (PITR) is an important facility. It offers DBAs the ability to restore a PostgreSQL database simply, quickly and without the loss of transactions or data.

In this post, we’ll help you to understand how this can be achieved, and reduce the potential for pain in the event of panic situations where you need to perform a PITR.

Before proceeding further, let us understand what could force us to perform a PITR.

  1. Someone has accidentally dropped or truncated a table.
  2. A failed deployment has made changes to the database that are difficult to reverse.
  3. You accidentally deleted or modified a lot of data, and as a consequence you cannot run your applications.

In such scenarios, you would immediately look for the latest full backup and the relevant transaction logs (aka WALs in PostgreSQL) to recover up to a known point in the past, before the error occurred. But what if your backup is corrupt and not valid?

Well, it is very important to perform a backup and recovery validation to ensure that the backups are always recoverable—we will address this in a future post. But, if the backup that you are looking at is corrupt, that can be a nightmare. One such unlucky incident for GitLab, where there was a backup restoration failure, caused a major outage followed by a data loss after recovery.

Even the best of plans can be hard to realize in practice.

It may be that our backups are intact and recoverable. Can we afford to wait until we copy/download the backup and recover it to another disk or server? What if the database size is several hundreds of GBs or several TBs like GitLab’s?

The solution to the problem is: add another standby that is always delayed by a few hours or a day.

This is one of the great features available in PostgreSQL. If you have migrated from Oracle RDBMS to PostgreSQL, you can think of it as an equivalent to FLASHBACK DATABASE in Oracle. Flashback database helps you to rewind data back in time. However, the technique does not work if you have dropped a data file. In fact, this is the case for both Oracle RDBMS and PostgreSQL PITR. ?

Adding a Delayed Standby in PostgreSQL

It is important that we use features like streaming replication to achieve high availability in PostgreSQL. Most of the environments have 1 master with 1 or more slaves (standby), either in the same data centre or geographically distributed. To save the time needed for PITR, you can add another slave that can always be delayed by a certain amount of time—this could be hours or days.

For example, if I know that my deployment is determined to be successful when no issues are observed in the first 12 hours, then I might delay one of the standbys by 12 hours.

To delay a standby, once you have setup streaming replication between your PostgreSQL master and slave, the following parameter needs to be added to the recovery.conf file of the slave, followed by a restart.

recovery_min_apply_delay = '12h' # or '1min' or 1d'

Now, let’s consider an example where you have inserted 10000 records at 10:27:34 AM and you have accidentally deleted 5000 records at 10:28:43 AM. Let’s say that you have a standby that is delayed by 1 hour. The steps to perform PITR using the delayed standby through until 10:27:34 AM look like this:

Steps to perform PostgreSQL Point in Time Recovery using a delayed standby

Step 1

Stop the slave (delayed standby) immediately, as soon as you have noticed that an accidental change has happened. If you know that the change has been already applied on the slave, then you cannot perform the point in time recovery using this method.

$ pg_ctl -D $PGDATA stop -mf

Step 2

Rename the recovery.conf file in your standby to another name.

$ mv $PGDATA/recovery.conf $PGDATA/recovery.conf.old

Step 3

Create a new recovery.conf file with the required parameters for PITR.

# recovery.conf file always exists in the Data Directory of Slave
recovery_target_time = '2018-06-07 10:27:34 EDT'
restore_command = 'sh /var/lib/pgsql/scripts/ %p %f'
recovery_target_action = 'pause'
recovery_target_inclusive = 'false'


Specifies the timestamp up to which you wish to recover your database.


Shell command that can be used by PostgreSQL to fetch the required Transaction Logs (WALs) for recovery.
PostgreSQL sends the arguments %p (path to WAL file) and % f (WAL file name) to this shell command. These arguments can be used in the script you use to copy your WALs.

Here is an example script for your reference. This example relies on rsync. The script connects to the backup server to fetch the WALs requested by PostgreSQL. (We’ll cover the procedure to archive these WALs in another blog post soon: this could be a good time to subscribe to the Percona blog mailing list!)

$ cat /var/lib/pgsql/scripts/
# Enable passwordless ssh to Backup Server
# $1 is %p substituted by postgres as the path to WAL File
# $2 is the %f substituted by postgres as the WAL File Name
rsync --no-motd -ave ssh ${Backup_Server}:${ArchiveDir}/${wal} ${wal_with_path} >>$LOG 2>&1
if [ "$?" -ne "0" ]
echo "Restore Failed for WAL : $wal" >> $LOG
exit 1


This is the action that needs to be performed after recovering the instance up to the recovery_target_time. Setting this to pause would let you modify the recovery_target_time after recovery, if you need to. You can then replay the transactions at a slow pace until your desired recovery target is reached. For example, you can recover until 2018-06-07 10:26:34 EDT and then modify recovery_target_time to 2018-06-07 10:27:34 EDT when using pause.

When you know that all the data you are looking for has been recovered, you can issue the following command to stop the recovery process, change the timeline and open the database for writes.

select pg_wal_replay_resume();

Other possible settings for this parameter are promote and shutdown. These do not allow you to replay a few more future transactions after the recovery, as you can with pause.


Whether to stop recovery just after the specified recovery_target_time(true) or before(false).

Step 4

Start PostgreSQL using pg_ctl. Now, it should read the parameters in recovery.conf and perform the recovery until the time you set in the recovery_target_time.

$ pg_ctl -D $PGDATA start

Step 5

Here is how the log appears. It says that has performed point-in-time-recovery and has reached a consistent state as requested.

2018-06-07 10:43:22.303 EDT [1423] LOG: starting point-in-time recovery to 2018-06-07 10:27:34-04
2018-06-07 10:43:22.607 EDT [1423] LOG: redo starts at 0/40005B8
2018-06-07 10:43:22.607 EDT [1423] LOG: consistent recovery state reached at 0/40156B0
2018-06-07 10:43:22.608 EDT [1421] LOG: database system is ready to accept read only connections
2018-06-07 10:43:22.626 EDT [1423] LOG: recovery stopping before commit of transaction 570, time 2018-06-07 10:28:59.645685-04
2018-06-07 10:43:22.626 EDT [1423] LOG: recovery has paused
2018-06-07 10:43:22.626 EDT [1423] HINT: Execute pg_wal_replay_resume() to continue.

Step 6

You can now stop recovery and open the database for writes after PITR.

Before executing the next command, you may want to verify that you have got all the desired data by connecting to the database and executing some SQL’s. You can still perform reads before you stop recovery. If you notice that you need another few minutes (or hours) of transactions, then modify the parameter recovery_target_time and go back to step 4. Otherwise, you can stop the recovery by running the following command.

$ psql
select pg_wal_replay_resume();

Summing up

Using PostgreSQL Point in time Recovery is the most simple of procedures that does not involve any effort in identifying the latest backups, transaction logs and space or server to restore in a database emergency. These things happen! Also, it could save a lot of time because the replay of WALs is much faster than rebuilding an entire instance using backups, especially when you have a huge database.

Important post script: I tested and recorded these steps using PostgreSQL 10.4. It is possible with PostgreSQL 9.x versions, however, the parameters could change slightly and you should refer to the PostgreSQL documentation for the correct syntax.

The post Faster Point In Time Recovery (PITR) in PostgreSQL Using a Delayed Standby appeared first on Percona Database Performance Blog.


Scaling PostgreSQL with PgBouncer: You May Need a Connection Pooler Sooner Than You Expect

PostgreSQL connection pools

PostgreSQL connection poolsAs PostgreSQL based applications scale, the need to implement connection pooling can become apparent sooner than you might expect. Since, PostgreSQL to date has no built-in connection pool handler, in this post I’ll explore some of the options for implementing connection pooling. In doing so, we’ll take a look at some of the implications for application performance.

PostgreSQL implements connection handling by “forking” it’s main OS process into a child process for each new connection. An interesting practical consequence of this is that we get a full view of resource utilization per connection in PostgreSQL at the OS level (the output below is from top):

24379 postgres  20 0 346m 148m 122m R 61.7  7.4 0:46.36 postgres: sysbench sysbench ::1(40120)
24381 postgres  20 0 346m 143m 119m R 62.7  7.1 0:46.14 postgres: sysbench sysbench ::1(40124)
24380 postgres  20 0 338m 137m 121m R 57.7  6.8 0:46.04 postgres: sysbench sysbench ::1(40122)
24382 postgres  20 0 338m 129m 115m R 57.4  6.5 0:46.09 postgres: sysbench sysbench ::1(40126)

This extra visualization comes at some additional cost though: it is more expensive—in terms of time and memory mostly—to fork an OS process than it would be, for example, to spawn a new thread for an existing process. This might be irrelevant if the rate at which connections are opened and closed is low but becomes increasingly important to consider over time as this reality changes. That is possibly one of the reasons why the need for a connection pooling mechanism often manifests itself early in the scaling life of a PostgreSQL-based application.

Scaling connections

When an application server sends a connection request to a PostgreSQL database it will be received by the Postmaster process. Postmaster, observing the limit set by max_connections, will then fork itself, creating a new backend process to handle this new connection. This backend process will live until the connection is closed by the client or terminated by PostgreSQL itself.

If the application was conceived with the database in mind it will make an effective use of connections, reusing existing ones whenever possible while avoiding idle connections from laying around for too long. Unfortunately, that is not always the case. Plus there are legitimate situations where the application popularity/usage increases. These naturally increases the rate at which new connection requests are created. There is a practical limit for the number of connections a server can manage at a given time. Beyond these limits, we start seeing contention in different areas. This in turn affects the server’s capacity to process requests, which can lead to bigger problems.

Putting a cap on the number of connections that may exist at any given time helps somewhat. However, considering the time it takes to properly establish a new connection, if the rate at which new connection requests arrive continues to rise we may soon reach a scaling problem.

Connection pooling for PostgreSQL applications

What if we could instead recycle existing connections to serve new client requests to gain on time (by avoiding the creation and remotion of yet another backend process each time), ultimately increasing transaction throughput, while also making a better use of the resources available? The strategy that revolves around the use of a cache (or pool) of connections that are kept open on the database server and re-used by different client requests is known as connection pooling.

Given there is no built-in handler for PostgreSQL, there are typically two ways of implementing such a mechanism:

  1. On the application side. Some frameworks like Ruby on Rails include their own, built-in connection pool mechanism. There are also libraries that extend database driver functionality to include connection pooling support, such as c3P0.
  2. As an external service, sitting between the application and the database server. The application will then connect to this external service instead, which will relay each request from the application to the database server through one of the connections it maintains in its pool.

An application connection pooler might provide better integration with the application, for example, when it comes to the use of prepared statements and the re-utilization of the cached result set. However, sometimes they may fall short on understanding PostgreSQL protocol and this might result in things like failing to properly clear pre-cached memory. An external service, on the other hand, won’t provide such a tight integration with a particular application but usually allows for greater customization and better maintenance of the pool. This often provides the ability to increase and decrease the number of connections it maintains cached dynamically and according to the demand observed, while also respecting pre-set threshold marks. Probably the most popular connection pooler used with PostgreSQL is PgBouncer.

A simple test

In order to illustrate the impact a connection pooler might have on the performance of a PostgreSQL server, I took advantage of the recent tests we did with sysbench-tpcc on PostgreSQL and repeated them partially by making use of PgBouncer as a connection pooler.

When we first ran the tests our goal was to optimize PostgreSQL for sysbench-tpcc workload running with 56 concurrent clients (threads), with the server having the same amount of CPUs available. The goal this time was to vary the number of concurrent clients (56, 150, 300 and 600) to see how the server would cope with the scaling of connections.

Instead of running each round for 10 hours, however, I ran them for 30 minutes only. This might mask, or at least change, the effects of checkpointing and caching observed in our initial tests.


I compiled the latest version of PgBouncer (1.8.1) following the instruction on GitHub and installed it in our test box, alongside PostgreSQL.

PgBouncer can be configured with three different types of pooling:

  • Session pooling: once the client gets one of the connections in the pool assigned it will keep it until it disconnects (or a timeout is reached).
  • Transaction pooling: once the client gets a connection from the pool, it keeps it to run a single transaction only.  After that, the connection is returned to the pool. If the client wants to run other transactions it has to wait until it gets another connection assigned to it.
  • Statement pooling: in this mode, PgBouncer will return a connection to its pool as soon as the first query is processed, which means multi-statement transactions would break in this mode.

I went with transaction pooling for my tests as the workload of sysbench-tpcc is composed of several short and single-statement transactions. Here’s the configuration file I used in full, which I named pgbouncer.ini:

sbtest = host= port=5432 dbname=sbtest
listen_port = 6543
listen_addr =
auth_type = md5
auth_file = users.txt
logfile = pgbouncer.log
pidfile =
admin_users = postgres
pool_mode = transaction

Apart from pool_mode, the other variables that matter the most are (definitions below came from PgBouncer’s manual page):

  • default_pool_size: how many server connections to allow per user/database pair.
  • max_client_conn: maximum number of client connections allowed

The users.txt file specified by auth_file contains only a single line with the user and password used to connect to PostgreSQL; more elaborate authentication methods are also supported.

Running the test

I started PgBouncer as a daemon with the following command:

$ pgbouncer -d pgbouncer.ini

Apart from running the benchmark for only 30 minutes and varying the number of concurrent threads each time, I employed the exact same options for sysbench-tpcc we used in our previous tests. The example below is from the first run with threads=56:

$ ./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=1800 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=****** --db-driver=pgsql run > /var/lib/postgresql/Nando/56t.txt

For the tests using the connection pooler I adapted the connection options so as to connect with PgBouncer instead of PostgreSQL directly. Note it remains a local connection:

./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=1800 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=****** --pgsql-port=6543 --db-driver=pgsql run > /var/lib/postgresql/Nando/P056t.txt

After each sysbench-tpcc execution I cleared the OS cache with the following command:

$ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'

It is not like this will have made much of a difference. As discussed in our previous post shared_buffers was set with 75% of RAM, enough to fit all of the “hot data” in memory.


Without further ado here are the results I obtained:

comparing the effects of running sysbench-tpcc with a connection pooler
TPS from sysbench-tpcc: comparing direct connection to PostgreSQL and the use of PgBouncer as a connection pooler

When running sysbench-tpcc with only 56 concurrent clients the use of direct connections to PostgreSQL provided a throughput (TPS stands for transactions per second) 2.5 times higher than that obtained when using PgBouncer. The use of a connection pooler in this case was extremely detrimental to performance. At such small scale there were no gains obtained from a pool of connections, only overhead.

When running the benchmark with 150 concurrent clients, however, we start seeing the benefits of employing a connection pooler. In fact, such benefits most probably materialize much earlier. With hindsight going from 56 concurrent transactions to 150 was too big of an initial jump.

It was somewhat surprising to me to realize at first that such throughput could be sustained with PgBouncer even when doubling and then quadrupling the number of concurrent clients. What happens, in this case, is that instead of flooding PostgreSQL with that many requests at once, they all stop at PgBouncer’s door. PgBouncer only allows the next request to proceed to PostgreSQL once one of the connections in its pool is freed. Remember, I configured it in transaction pooling mode. The process is transparent to PostgreSQL. It has no idea how many requests are waiting at PgBouncer’s door to be processed (and thus it is spared the trouble and doesn’t freak out!). It’s effectively like outsourcing connection management, beyond the ones used by the pool itself, to a contractor. PgBouncer does this hard work so PostgreSQL doesn’t need to.

This strategy seems to work great for sysbench-tpcc. With other workloads, the balance point might lie elsewhere.

Experimenting with bigger and smaller connection pools

For the tests above I set default_pool_size on PgBouncer equal to the number of CPU cores available on this server (56). To explore the tuning of this parameter, I repeated these tests using bigger connection pools (150, 300, 600) as well as a smaller one (14). The following chart summarizes the results obtained:

sysbench-tpcc with pgbouncer: comparing different pool sizes
How the use of PgBouncer impacts throughput on sysbench-tpcc: comparing the use of different pool sizes

Using a much smaller connection pool (14), sized to ¼ of the number of available CPUs, still yielded a result almost as good. That says a lot about how much leveraging connection handling alone to PgBouncer already helps. Maybe some of PgBouncer’s “gatekeeper” qualities could be incorporated by PostgreSQL itself?

Doubling the number of connections in the pool didn’t make any practical difference. But as soon as we extrapolate that number to 600 (which is the number of maximum concurrent threads I’ve tested) throughput becomes comparable to when not using a connection pooler once the number of concurrent threads is greater than the number of available CPUs. That’s true even when running as many concurrent threads as there are connections available in the pool (600). There’s a practical limit to it, which clearly is on the PostgreSQL side. That’s expected, otherwise, the need for a connection pool wouldn’t be as important.

As a starting point, setting the connection pool size equal to the number of CPUs available in the server looks like a good idea. There may be a hard limit for the pool around 150 connections or so, after which further benefits aren’t realized. However, that’s only speculation. Further tests using both different hardware and workloads would be needed to investigate this properly.

Here’s the table that summarizes the results obtained, for a different view:

Do I need a connection pooler ?

The need to couple PostgreSQL with a connection pooler depends on a number of factors including:

  • The number of connections typically established with the server. Take into account that the number usually varies significantly during the day. Think about the average number per hour, during different hours in the day, and particularly when compared with peak time.
  • The effective throughput (TPS) produced by these connections
  • The number of CPUs available in view of the effective throughput
  • The nature of your workload:
    • Whether your application opens a new connection for each new request or tends to leave connections open for longer
    • Whether it is composed of various single-statement transactions (AUTOCOMMIT=ON style) or big and long transactions.

Unfortunately, I don’t have a formula for this. But now that you understand how a connection pooler such as PgBouncer works and where it tends to benefit PostgreSQL’s performance (even if only having sysbench-tpcc’s workload as the sole example here) it’s a matter of investigating the points above and experimenting. Experimenting is the key! Though possibly using reads only on a stand-by replica at first, to stay on the safe side…

In a future post, we will cover how the architecture of a database solution changes with the use of connection poolers: where to place them, how to cope with single point of failure issues and how they can be used alongside load balancers.

The post Scaling PostgreSQL with PgBouncer: You May Need a Connection Pooler Sooner Than You Expect appeared first on Percona Database Performance Blog.


Webinar 6/28: Securing Database Servers From External Attacks

securing database servers

securing database serversPlease join Percona’s Chief Evangelist Colin Charles on Thursday, June 28th, 2018, as he presents Securing Database Servers From External attacks at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).


A critical piece of your infrastructure is the database tier, yet people don’t pay enough attention to it judging by how many are bitten via poorly chosen defaults, or just a lack understanding of running a secure database tier. In this talk, I’ll focus on MySQL/MariaDB, PostgreSQL, and MongoDB, and cover external authentication, auditing, encryption, SSL, firewalls, replication, and more gems from over a decade of consulting in this space from Percona’s 4,000+ customers.

Register Now


Colin Charles

Chief Evangelist

Colin Charles is the Chief Evangelist at Percona. He was previously on the founding team of MariaDB Server in 2009, and had worked at MySQL since 2005, and been a MySQL user since 2000. Before joining MySQL, he worked actively on the Fedora and projects. He’s well known within open source communities in APAC, and has spoken at many conferences. Experienced technologist, well known in the open source world for work that spans nearly two decades within the community. Pays attention to emerging technologies from an integration standpoint. Prolific speaker at many industry-wide conferences delivering talks and tutorials with ease. Interests: application development, systems administration, database development, migration, Web-based technologies. Considered expert in Linux and Mac OS X usage/administration/roll-out’s. Specialties: MariaDB, MySQL, Linux, Open Source, Community, speaking & writing to technical audiences as well as business stakeholders.

The post Webinar 6/28: Securing Database Servers From External Attacks appeared first on Percona Database Performance Blog.


This Week in Data with Colin Charles 43: Polyglots, Security and DataOps.Barcelona

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

This is a short working week for me due to a family emergency. It caused me to skip speaking at DataOps.Barcelona and miss hanging out with the awesome of speakers and attendees. This is the first time I’ve missed a scheduled talk, and I received many messages about my absence. I am sure we will all meet again soon.

One of the talks I was planning to give at DataOps.Barcelona will be available as a Percona webinar next week: Securing Your Database Servers from External Attacks on Thursday, June 28, 2018, at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4). I am also giving a MariaDB 10.3 overview on Tuesday, June 26, 2018, at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4). I will “virtually” see you there.

If you haven’t already read Werner Vogel’s post A one size fits all database doesn’t fit anyone, I highly recommend it. It is true there is no “one size fits all” solution when it comes to databases. This is why Percona has made “the polyglot world” a theme. It’s why Amazon offers different database flavors: relational (Aurora for MySQL/PostgreSQL, RDS for MySQL/PostgreSQL/MariaDB Server), key-value (DynamoDB), document (DynamoDB), graph (Neptune), in-memory (ElastiCache for Redis & Memcached), search (Elasticsearch service). The article has a plethora of use cases, from AirBnB using Aurora, to Snapchat Stories and Tinder using DynamoDB, to Thomson Reuters using Neptune, down to McDonald’s using ElastiCache and Expedia using Elasticsearch. This kind of detail, and customer use case, is great.

There are plenty more stories and anecdotes in the post, and it validates why Percona is focused not just on MySQL, but also MariaDB, MongoDB, PostgreSQL and polyglot solutions. From a MySQL lens, it’s also worth noting that not one storage engine fits every use case. Facebook famously migrated a lot of their workload from InnoDB to MyRocks, and it is exciting to see Mark Callaghan stating that there are already three big workloads on MyRocks in production, with another two coming soon.


  • MariaDB 10.1.34 – including fixes for InnoDB defragmentation and full text search (MDEV-15824). This was from the WebScaleSQL tree, ported by KakaoTalk to MariaDB Server.
  • Percona XtraDB Cluster 5.6.40-26.25 – now with Percona Server for MySQL 5.6.40, including a new variable to configure rolling schema upgrade (RSU) wait for active commit connection timeouts.
  • Are you using the MariaDB Connector/C, Connector/J or Connector/ODBC? A slew of updates abound.

Link List

Industry Updates

Upcoming appearances

  • OSCON – Portland, Oregon, USA – July 16-19 2018


I look forward to feedback/tips via e-mail at or on Twitter @bytebot.

The post This Week in Data with Colin Charles 43: Polyglots, Security and DataOps.Barcelona appeared first on Percona Database Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by