Apr
09
2019
--

Continuous Replication From a Legacy PostgreSQL Version to a Newer Version Using Slony

Native streaming replication in PostgreSQL works only between servers running the same major version. We discussed logical replication in our previous blog post. In that post, we saw how logical replication can help us set up migration between two different PostgreSQL versions. However, logical replication works only for the currently supported versions of PostgreSQL, for example between PostgreSQL 9.4 and PostgreSQL 11. So what about legacy versions older than 9.4? Slony-I can help us meet this replication requirement.

Replication between different PostgreSQL versions with Slony-I is useful for migration from legacy database installations to the latest available version. So what is Slony and how does it work?

This post is the fourth of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series where we’ll be exploring different methods available to upgrade your PostgreSQL databases.

Slony

Slony is an application-level logical replication implementation for PostgreSQL. In other words, it is an external replication tool that requires a separate installation and configuration. Slony has been around for a long time. The latest version supports PostgreSQL versions from 8.4 up to 11.

The main goal of replication is to ship changes from one database server to another. To better understand the architecture, you should know the Slony-I terms Slon, Events, and Slonik.

An aside: Slony means elephants in Russian, and elephants are indeed reputed to have a great memory. A slightly angry, but nevertheless pretty elephant, “Slonik”, looks at you from the PostgreSQL logo image.

Slon

Slon is a daemon that runs on each PostgreSQL node in Slony-I replication. These daemons are used for processing configuration and replication events for each PostgreSQL server. Each PostgreSQL server is called a “node”. All nodes together form a Slony “cluster”.

The “publisher node” is the source of replicated changes, while “subscriber” nodes receive and apply changes from the publisher node.

In order to set up replication, we should specify all replicated tables as a “set”. The actual subscription works on a specific set. Changes to the tables being replicated are grouped together into SYNCs, and these groups of transactions are applied together to the subscriber nodes.

Events

Changes are transferred from the publisher in the form of “events”. When an event is processed by the slon daemon on a remote node, it generates a “confirmation”. Events are also used to notify nodes about configuration changes like adding/removing new nodes, new subscriptions or DDL changes.

Each event has a unique origin identifier, a sequence number, the transaction ID for the snapshot on the provider node for this event, multiple arguments, and a timestamp with time zone.

Triggers written in PL/pgSQL register all changes in the replicated tables. Unfortunately, there is no reliable way yet to handle changes to large objects (BLOBs), DDL, or changes to users and roles.

Slonik

Slonik means a little elephant. It is a command line utility with a parser and interpreter, and it accepts “slonik scripts” – a simple declarative scripting language. It is designed to overcome the limitations of procedural languages. You use slonik commands to set up or modify Slony replication, and they can be embedded in shell scripts. It can accept commands from standard input or from files. The following example shows a slonik script being fed to the slonik utility, embedded in a shell script.

The script to create the initial configuration for the simple master-slave setup of our pgbench database looks like this:

#!/bin/sh
slonik <<_EOF_
 cluster name = percona_pg;
 node 1 admin conninfo = 'dbname=pg93 host=pg93_host user=percona_pg93_user';
 node 2 admin conninfo = 'dbname=pg11 host=pg11_host user=percona_pg11_user';
 #--
 # Creates the _$(clustername) schema, in this example _percona_pg
 #--
 init cluster ( id=1, comment = 'Legacy PG Node');
 #--
 # Add a list of tables being replicated to a set.
 #--
 create set (id=1, origin=1, comment='pgbench');
 set add table (set id=1, origin=1, id=1, fully qualified name = 'public.pgbench_accounts', comment='accounts');
 set add table (set id=1, origin=1, id=2, fully qualified name = 'public.pgbench_branches', comment='branches');
 set add table (set id=1, origin=1, id=3, fully qualified name = 'public.pgbench_tellers', comment='tellers');
 set add table (set id=1, origin=1, id=4, fully qualified name = 'public.pgbench_history', comment='history');
 #--
 # Create the second node (the slave), tell the 2 nodes how to connect to
 # each other, and how they should listen for events.
 #--
 store node (id=2, comment = 'Target node', event node=1);
 store path (server = 1, client = 2, conninfo='dbname=pg93 host=pg93_host user=percona_pg93_user');
 store path (server = 2, client = 1, conninfo='dbname=pg11 host=pg11_host user=percona_pg11_user');
_EOF_

Why is Slony useful for migrations?

Despite the benefits of built-in logical replication, this external solution is a better fit for migrations from versions older than PostgreSQL 9.4. The trigger-based approach depends on the database API: both the older and the newer version should be compatible at the PL/pgSQL and SQL syntax level.

How to adapt your database for usage with Slony?

  • Your tables have to have primary keys. Add a serial field to all tables without a primary key.
  • OID BLOBs will not have their changes replicated. If you have columns with small-length values, you could convert these to BYTEA. For really large objects like images, it’s better to store the data externally, e.g. use S3 in the Amazon cloud. If it’s too hard to change your application, you could apply blob changes at the last stage of migration.
  • ALTER TABLE and other DDL operations. Slony can’t detect table structure changes. Instead, you should use the EXECUTE SCRIPT slonik command to apply an SQL file with DDL or SQL strings to the whole replication cluster, as sketched below.
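
For illustration, a minimal slonik script that propagates a DDL change to the cluster defined in the script above might look like this. The file name and the DDL it contains are hypothetical; check the Slony documentation for the EXECUTE SCRIPT options supported by your version.

#!/bin/sh
slonik <<_EOF_
 cluster name = percona_pg;
 node 1 admin conninfo = 'dbname=pg93 host=pg93_host user=percona_pg93_user';
 node 2 admin conninfo = 'dbname=pg11 host=pg11_host user=percona_pg11_user';
 #--
 # Apply /tmp/add_column.sql (e.g. an ALTER TABLE statement) on the origin
 # node and replay it on every subscriber in the cluster.
 #--
 execute script (set id = 1, filename = '/tmp/add_column.sql', event node = 1);
_EOF_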

Online migration from legacy PostgreSQL

  1. Create a replication user with superuser privileges. It’s possible to use fine-grained privileges, but they are significantly harder to set up.
  2. Create a database on the destination and set up access by TCP/IP
  3. Copy table definitions from the master to the slaves
  4. Install Slony-I. On servers with an old OS distribution you might find it simpler to install Slony-I from the source code.
  5. Define the cluster, the set of tables, and the connection information to the nodes as a list of slonik commands
  6. Start the slon daemon on each PostgreSQL server. Check standard output or log files for any potential communication errors.
  7. Execute subscription slonik commands to start the sync (see the sketch after this list)
  8. Test your read-only queries with the newer version of Postgres
  9. Once all the data has been replicated and is in sync, stop your applications and repoint them to the new Postgres server.
  10. Use “uninstall node” on the newer version of PostgreSQL to remove all traces of Slony replication
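
For steps 6 and 7, a minimal sketch reusing the cluster name, hosts, and users from the slonik script above might look like this:

# Step 6: start a slon daemon for each node (run on the respective servers)
slon percona_pg 'dbname=pg93 host=pg93_host user=percona_pg93_user' &
slon percona_pg 'dbname=pg11 host=pg11_host user=percona_pg11_user' &

# Step 7: subscribe node 2 (the new server) to replication set 1
slonik <<_EOF_
 cluster name = percona_pg;
 node 1 admin conninfo = 'dbname=pg93 host=pg93_host user=percona_pg93_user';
 node 2 admin conninfo = 'dbname=pg11 host=pg11_host user=percona_pg11_user';
 subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
_EOF_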

Downgrade steps

To downgrade, follow the same procedure as for the upgrade. Slony allows you to replicate from and to any version of PostgreSQL supported by your Slony version. The minimum supported version is 8.4.

Summary

So far, we have seen a high-level overview of how Slony can be helpful to perform an upgrade with the least possible downtime. Come and see more of this in action during our webinar. And don’t forget that at Percona Live in Austin, May 28-30 2019, we’ll have two days of PostgreSQL content in a dedicated Postgres track.


Image derived from photo by Chen Hu on Unsplash

Apr
04
2019
--

Replication Between PostgreSQL Versions Using Logical Replication


When you need to upgrade PostgreSQL, there are a variety of approaches that you can use. Not all of the options to upgrade Postgres are suitable if you want to avoid application downtime. When avoiding downtime is essential, you might consider using replication as a means of upgrade, and depending on your scenario, you can choose to approach this task using either logical replication or physical (streaming) replication. Here, we take a look at the difference between logical and physical replication in PostgreSQL. Then we explore how to accomplish an upgrade using logical replication in more detail, and by doing so, avoid application downtime. In a subsequent article, we’ll investigate physical replication.

We have already discussed a few methods available to perform PostgreSQL upgrades in our previous blog posts – PostgreSQL Upgrade Using pg_dumpall and PostgreSQL Upgrade Using pg_dump/pg_restore – as part of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series. However, both of these methods involve downtime for the application.

Types of logical replication

Here we’ll cover two types of replication you could implement:

  1. Replication between PostgreSQL 10 and 11 versions using built-in logical replication.
  2. Replication from PostgreSQL 9.4 (or any version < PG 11) to PostgreSQL 11 using an extension named pglogical.

We might opt to implement replication as a method of upgrade to minimize downtime for the application. Once all the data to date has been replicated to another PostgreSQL server on the latest version, you can simply switch your application to the new server with minimal downtime… though of course this does depend on the complexity of your application stack.

Logical replication in PostgreSQL allows users to perform selective replication of a few tables and to open a standby for writes, whereas physical replication in PostgreSQL is block-level replication. In that case, each database on the master is replicated to a standby, and the standby is not open for writes. Going forward, we’ll refer to physical replication as streaming replication.

With logical replication, a standby can have replication enabled from multiple masters. This could be helpful in situations where you need to replicate data from several PostgreSQL databases (OLTP) to a single PostgreSQL server for reporting and data warehousing.

One of the biggest advantages of logical over streaming replication is that logical replication allows us to replicate changes from an older version of PostgreSQL to a later version. Streaming replication works only when both the master and standby are of the same major version. As a best practice, we’d recommend they have the same minor version too.

Replication between PostgreSQL 10 and 11 versions

Starting from PostgreSQL 10, logical replication is available with the PostgreSQL source by default. So, you could easily replicate a PostgreSQL 10 database to PostgreSQL 11. Logical Replication uses a publish and subscribe model. The node that sends the changes becomes a publisher. And the node that subscribes to those changes becomes a subscriber. You may have one or more subscriptions to a publication.

Publication

A publication is a set of changes generated from a group of tables. It is also referred to as a change set or replication set. Publications can only contain tables and cannot contain any other objects. DML on these tables can be replicated, but not DDL.

In a publication, you can choose what type of DML to replicate: INSERT or DELETE or UPDATE or ALL. By default, it is ALL. You must have a replica identity set on the table being published to replicate UPDATEs and DELETEs to a subscriber. A replica identity helps in identifying the rows to be updated or deleted.
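
For example, to publish only inserts and updates for a single table (using the same hypothetical scott.employee table as in the examples further below):

CREATE PUBLICATION percpub_ins_upd FOR TABLE scott.employee WITH (publish = 'insert, update');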

The primary key of a table is its default replica identity. You can also make a unique index with no NULL values the replica identity. If there is no primary key or suitable unique index, you can set the replica identity to FULL. When the replica identity is set to FULL, Postgres uses the entire row as a key. Of course, this may be inefficient.
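
As an illustration, assuming a table scott.employee with a unique index emp_id_uniq over NOT NULL columns (both names are placeholders), the replica identity can be set with standard PostgreSQL commands:

-- Use a unique index over NOT NULL columns as the replica identity
ALTER TABLE scott.employee REPLICA IDENTITY USING INDEX emp_id_uniq;
-- Or, if no suitable key exists, fall back to the whole row (less efficient)
ALTER TABLE scott.employee REPLICA IDENTITY FULL;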

You might see errors if a table without a primary key and without a valid replica identity has been added to a publication and an UPDATE or a DELETE operation is then performed on it.

Subscription

A subscriber can subscribe to one or more publications. Before adding the subscription, you must ensure that the tables being replicated have been created on the subscriber node. In order to achieve that, you can perform a schema-only dump from the publisher to the subscriber node.

An example of logical replication

The following example steps work for logical replication between PostgreSQL 10 and 11 versions only.

On the publishing node, create a publication. You can either add all tables or choose to add selected tables to the publication.

-- For adding ALL Tables in Database
CREATE PUBLICATION percpub FOR ALL TABLES;
-- For adding Selected Tables in Database
CREATE PUBLICATION percpub FOR TABLE scott.employee, scott.departments;

On the subscriber node, create a subscription that refers to the publication on the publisher node. Perform a DDL dump of the tables to the subscriber node before creating the subscription, as mentioned above.

$ pg_dump -h publisher_server_ip -p 5432 -d percona -Fc -s -U postgres | pg_restore -d percona -h subscriber_node_ip -p 5432 -U postgres
CREATE SUBSCRIPTION percsub CONNECTION 'host=publisher_server_ip dbname=percona user=postgres password=secret port=5432' PUBLICATION percpub;

The above command also copies the pre-existing data from the tables. If you want to disable the copying of pre-existing data, you can use the following syntax. The subscription will then only replicate changes made on the publisher after you run this command.

CREATE SUBSCRIPTION percsub CONNECTION 'host=publisher_server_ip dbname=percona user=postgres password=oracle port=5432' PUBLICATION percpub WITH (copy_data = false);

Monitor the replication using the following command on the publishing node.

$ psql
\x
select * from pg_stat_replication;
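
Similarly, on the subscriber node you can check the subscription status through the pg_stat_subscription view (available from PostgreSQL 10 onwards):

$ psql
\x
select * from pg_stat_subscription;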

Replication between PostgreSQL 9.4 and PostgreSQL 11

Now, what about the versions that are older than PostgreSQL 10? For this purpose, there is an extension named pglogical that works for versions from 9.4 up to 11. Using pglogical, you can easily replicate PostgreSQL 9.4 to PostgreSQL 11.

The following sequence of steps demonstrates a high-level procedure to set up replication between PG 9.4 and PG 11 using the pglogical extension.

Step 1: Consider pgserver_94 to be the source server, with a database percona_94 running on PostgreSQL 9.4. Create the following extensions.

[pgserver_94:] $psql -d percona_94 -c "CREATE EXTENSION pglogical_origin"
CREATE EXTENSION
[pgserver_94:] $psql -d percona_94 -c "CREATE EXTENSION pglogical"
CREATE EXTENSION

Step 2: Now, you can go ahead and add either selected tables, or all the tables in one or more schemas, for replication. In the following example, you can see an error when there is no primary key on one of the tables.

[pgserver_94:] $psql -d percona_94
psql (9.4.21)
Type "help" for help.
percona_94=# SELECT pglogical.create_node(node_name := 'provider1',dsn := 'host=192.168.0.24 port=5432 dbname=percona_94');
create_node
-------------
2976894835
(1 row)
percona_94=# SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
ERROR: table pgbench_history cannot be added to replication set default
DETAIL: table does not have PRIMARY KEY and given replication set is configured to replicate UPDATEs and/or DELETEs
HINT: Add a PRIMARY KEY to the table
percona_94=# ALTER TABLE pgbench_history ADD PRIMARY KEY (tid,aid,delta);
ALTER TABLE
percona_94=# SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
replication_set_add_all_tables
--------------------------------
t
(1 row)

Step 3: On the subscriber node, which is our PostgreSQL 11 database, you can run similar commands as follows.

[pgserver_11:] $psql -d percona_11
psql (11.2)
Type "help" for help.
percona_11=# SELECT pglogical.create_node(node_name := 'subscriber1',dsn := 'host=127.0.0.1 port=5432 dbname=percona_11 password=secret');
 create_node
-------------
   330520249
(1 row)
percona_11=# SELECT pglogical.create_subscription(subscription_name := 'subscription1',provider_dsn := 'host=192.168.0.24 port=5432 dbname=percona_94 password=secret');
 create_subscription
---------------------
          1763399739
(1 row)

Step 4: You can then validate the replication status by querying a few tables that pglogical always updates:

percona_11=# select * from pglogical.local_sync_status;
 sync_kind | sync_subid | sync_nspname |   sync_relname   | sync_status | sync_statuslsn
-----------+------------+--------------+------------------+-------------+----------------
 f         | 1763399739 | public       | pgbench_accounts | r           | 0/2EB7D48
 f         | 1763399739 | public       | pgbench_history  | r           | 0/2EB7D48
 f         | 1763399739 | public       | pgbench_tellers  | r           | 0/2EB7D48
 f         | 1763399739 | public       | pgbench_branches | r           | 0/2EB7D48
 d         | 1763399739 |              |                  | r           | 0/0
(5 rows)
percona_11=# select * from pglogical.subscription;
   sub_id   |   sub_name    | sub_origin | sub_target | sub_origin_if | sub_target_if | sub_enabled |             sub_slot_name              |         sub_rep
lication_sets          | sub_forward_origins | sub_apply_delay
------------+---------------+------------+------------+---------------+---------------+-------------+----------------------------------------+----------------
-----------------------+---------------------+-----------------
 1763399739 | subscription1 | 2976894835 |  330520249 |    2402836775 |    2049915666 | t           | pgl_percona_11_provider1_subscription1 | {default,defaul
t_insert_only,ddl_sql} | {all}               | 00:00:00
(1 row)

Primary key selection

In step 2, you saw how all the tables of the public schema were added to a replication set by creating a primary key on the table that didn’t have one. The primary key I chose may not be the right one for that table, as it is just for demonstration. However, when you choose a primary key, make sure that you are selecting the right one. It needs to be unique and use column(s) that don’t normally contain NULLs. If you don’t research primary key selection thoroughly, you could cause downtime for your application. Here’s an example error that you could encounter:

[pgserver_94:] $pgbench -c 10 -T 300 -n percona_94
Client 7 aborted in state 12: ERROR: duplicate key value violates unique constraint "pgbench_history_pkey"
DETAIL: Key (tid, aid, delta)=(7, 63268, 2491) already exists.

So far we have seen how you can use pglogical to create replication from an older version to a newer version of PostgreSQL. After you have set up replication, you can easily switch your applications to the latest version with minimal downtime.

To see more in action, why not subscribe to our webinar? And don’t forget that at Percona Live in Austin, May 28-30 2019, we have two days of PostgreSQL content in a dedicated Postgres track.

Mar
15
2019
--

MySQL Ripple: The First Impression of a MySQL Binlog Server


Just about a month ago, Pavel Ivanov released Ripple under the Apache-2.0 license. Ripple is a MySQL binlog server: software which receives binary logs from MySQL or MariaDB servers and delivers them to another MySQL or MariaDB server. Practically, this is an intermediary master which does not store any data, except the binary logs themselves, and does not apply events. This solution saves a lot of resources on the actual master, since the binlog server acts only as a middle-man between the master and its actual slave(s).

An intermediary server that keeps binary logs only and does no other work is a prevalent use case: it allows us to remove I/O (binlog reads) and network (binlog retrieval) load from the actual master and free its resources for updates. The intermediary master distributes binary logs to the slaves connected to it. This way you can attach a larger number of slaves to such a server without affecting the application that is running updates on the master.

Currently, users exploit the Blackhole storage engine to emulate similar behavior. But Blackhole is just a workaround: it still executes all the events in the binary logs, requires a valid MySQL installation, and has a lot of issues. Such a pain!

Therefore a new product which can do the same job and is released with an open source license is something worth trying.

A simple test

For this blog, I did a simple test. First, I installed it as described in the README file. Instructions are pretty straightforward, and I successfully built the server on my Ubuntu 18.04.2 LTS laptop. The guidelines suggest installing libmariadbclient-dev, and I replaced libmysqlclient-dev, which I already had on my machine. Probably this was not needed, but since the tool claims to support both MySQL and MariaDB binary log formats, I preferred to install the MariaDB client.

There is no manual or usage instructions. However, the tool supports a -help command, and it is, again, straightforward.

The server can be started with options:

$./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address=127.0.0.1 -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000

Where:

  • -ripple_datadir : datadir where Ripple stores binary logs
  • -ripple_master_address : master host
  • -ripple_master_port : master port
  • -ripple_master_user : replication user
  • -ripple_server_ports : comma-separated list of ports on which Ripple will listen

I did not find an option for securing binary log retrieval: the slave can connect to the Ripple server with any credentials. Keep this in mind when deploying Ripple in production.

Now, let’s run a simple test. I have two servers, both running on localhost: one on port 13001 (master) and another on port 13002 (slave). The command line which I used to start rippled, shown above, points to the master. Binary logs are stored in the data directory:

$ ls -l data/
total 14920
-rw-rw-r-- 1 sveta sveta 15251024 Mar 6 01:43 binlog.000000
-rw-rw-r-- 1 sveta sveta 71 Mar 6 00:50 binlog.index

I pointed the slave to the Ripple server with the command:

mysql> change master to master_host='127.0.0.1',master_port=15000, master_user='ripple';
Query OK, 0 rows affected, 1 warning (0.02 sec)

Then I started the slave.
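
For completeness, starting the slave is just the usual statement:

mysql> start slave;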

On the master, I created the database sbtest and ran the sysbench oltp_read_write.lua test for a single table. After some time, I stopped the load and checked the content of the table on the master and the slave:

master> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
1 row in set (0.08 sec)
master> checksum table sbtest1;
+----------------+------------+
| Table | Checksum |
+----------------+------------+
| sbtest.sbtest1 | 4162333567 |
+----------------+------------+
1 row in set (0.11 sec)
slave> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
1 row in set (0.40 sec)
slave> checksum table sbtest1;
+----------------+------------+
| Table | Checksum |
+----------------+------------+
| sbtest.sbtest1 | 1797645970 |
+----------------+------------+
1 row in set (0.13 sec)
slave> checksum table sbtest1;
+----------------+------------+
| Table | Checksum |
+----------------+------------+
| sbtest.sbtest1 | 4162333567 |
+----------------+------------+
1 row in set (0.10 sec)

It took some time for the slave to catch up, but everything was applied successfully.

Ripple has nice verbose logging:

$ ./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address=127.0.0.1 -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 15:57:13.641451 27908 rippled.cc:48] InitPlugins
I0306 15:57:13.642007 27908 rippled.cc:60] Setup
I0306 15:57:13.642937 27908 binlog.cc:307] Starting binlog recovery
I0306 15:57:13.644090 27908 binlog.cc:350] Scanning binlog file: binlog.000000
I0306 15:57:13.872016 27908 binlog.cc:417] Binlog recovery complete
binlog file: binlog.000000, offset: 15251088, gtid: 6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192
I0306 15:57:13.872050 27908 rippled.cc:106] Recovered binlog
I0306 15:57:13.873811 27908 mysql_server_port_tcpip.cc:150] Listen on host: localhost, port: 15000
I0306 15:57:13.874282 27908 rippled.cc:62] Start
I0306 15:57:13.874511 27910 mysql_master_session.cc:181] Master session starting
I0306 15:57:13.882601 27910 mysql_client_connection.cc:148] connected to host: 127.0.0.1, port: 13001
I0306 15:57:13.895349 27910 mysql_master_session.cc:137] Connected to host: 127.0.0.1, port: 13001, server_id: 1, server_name:
W0306 15:57:13.898556 27910 mysql_master_session.cc:197] master does not support semi sync
I0306 15:57:13.898583 27910 mysql_master_session.cc:206] start replicating from '6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192'
I0306 15:57:13.899031 27910 mysql_master_session.cc:229] Master session entering main loop
I0306 15:57:13.899550 27910 binlog.cc:626] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
I0306 15:57:13.899572 27910 binlog.cc:616] Skip writing event [ Previous_gtids len = 67 ]
I0306 15:57:13.899585 27910 binlog.cc:626] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
...

Conclusion

It may be good to run more tests before using Ripple in production, and to explore its other options, but from a first view it seems to be a very nice and useful product.


Photo by Kishor on Unsplash

Mar
13
2019
--

Live MySQL Slave Rebuild with Percona Toolkit


Recently, we had an edge case where a MySQL slave went out-of-sync but couldn’t be rebuilt from scratch. The slave was acting as a master server to some applications, and data was being written to it. It was a design error, and this is not recommended, but it happened. So how do you synchronize the data in this circumstance? This blog post describes the steps taken to recover from this situation. The tools used to recover the slave were pt-slave-restart, pt-table-checksum, pt-table-sync and mysqldiff.

Scenario

To illustrate this situation, we built a master x slave configuration with sysbench running on the master server to simulate a general application workload. The environment was set up with Percona Server 5.7.24-26 and sysbench 1.0.16.

Below are the sysbench commands to prepare and simulate the workload:

# Create Data
sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
  --mysql-socket=/tmp/mysql_sandbox45008.sock --mysql-db=test --range_size=100 \
  --table_size=5000 --tables=100 --threads=1 --events=0 --time=60 \
  --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare
# Simulate Workload
sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
  --mysql-socket=/tmp/mysql_sandbox45008.sock --mysql-db=test --range_size=100 \
  --table_size=5000 --tables=100 --threads=10 --events=0 --time=6000 \
  --rand-type=uniform /usr/share/sysbench/oltp_read_write.lua --report-interval=1 run

With the environment set up, the slave server was stopped, and some operations were performed to desynchronize the slave and reproduce the problem.
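
The exact statements used are not shown here, but a plausible sketch of such operations (hypothetical, chosen to be consistent with the inconsistencies found later in this post) is to apply changes directly on the stopped slave and purge binary logs on the master:

# On the slave (hypothetical examples of direct, non-replicated changes):
slave1 [localhost] {msandbox} (test) > drop table joinit;
slave1 [localhost] {msandbox} (test) > alter table sbtest98 add column x int;
slave1 [localhost] {msandbox} (test) > delete from sbtest8 limit 1000;
# On the master: rotate and purge binary logs so the slave can no longer find its position
master [localhost] {msandbox} ((none)) > flush binary logs;
master [localhost] {msandbox} ((none)) > purge binary logs to 'mysql-bin.000007';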

Fixing the issue

With the slave desynchronized, replication was restarted. Immediately, the error below appeared:

Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

To recover the slave from this error, we had to point the slave to an existing binary log with a valid binary log position. To get a valid binary log position, the command shown below had to be executed on the master:

master [localhost] {msandbox} ((none)) > show master status\G
*************************** 1. row ***************************
File: mysql-bin.000007
Position: 218443612
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set:
1 row in set (0.01 sec)

Then, a CHANGE MASTER command was run on the slave:

slave1 [localhost] {msandbox} (test) > change master to master_log_file='mysql-bin.000007', MASTER_LOG_POS=218443612;
Query OK, 0 rows affected (0.00 sec)
slave1 [localhost] {msandbox} (test) > start slave;
Query OK, 0 rows affected (0.00 sec)

Now the slave had a valid binary log file to read, but since it was inconsistent, it hit another error:

Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows event on table test.sbtest8; Can't find record in 'sbtest8', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000005, end_log_pos 326822861

Working past the errors

Before fixing the inconsistencies, it was necessary to keep replication running and to skip the errors. For this, the pt-slave-restart tool was used. The tool needs to be run on the slave server:

pt-slave-restart --user root --socket=/tmp/mysql_sandbox45008.sock --ask-pass

The tool skips errors and starts the replication threads. Below is an example of the output of the pt-slave-restart:

$ pt-slave-restart --user root --socket=/tmp/mysql_sandbox45009.sock --ask-pass
Enter password:
2019-02-22T14:18:01 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007        1996 1146
2019-02-22T14:18:02 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007        8698 1146
2019-02-22T14:18:02 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007       38861 1146

Finding the inconsistencies

With the tool running in one terminal, the phase to check the inconsistencies began. First things first, an object definition check was performed using the mysqldiff utility. The mysqldiff tool is part of MySQL Utilities. To execute the tool:

$ mysqldiff --server1=root:msandbox@localhost:48008 --server2=root:msandbox@localhost:48009 test:test --difftype=sql --changes-for=server2

And below are the differences found between the master and the slave:

1) A table that doesn’t exist on the slave

# WARNING: Objects in server1.test but not in server2.test:
# TABLE: joinit

2) A wrong table structure on the slave

# Comparing `test`.`sbtest98` to `test`.`sbtest98` [FAIL]
# Transformation for --changes-for=server2:
#
ALTER TABLE `test`.`sbtest98`
DROP INDEX k_98,
DROP COLUMN x,
ADD INDEX k_98 (k);

After performing the recommendations on the slave (creating the table and applying the table modification), the object definitions were now equal. The next step was to check data consistency. For this, pt-table-checksum was used to identify which tables are out-of-sync. This command was run on the master server.

$ pt-table-checksum -uroot -pmsandbox --socket=/tmp/mysql_sandbox48008.sock --replicate=percona.checksums --create-replicate-table --empty-replicate-table --no-check-binlog-format --recursion-method=hosts

And an output example:

01 master]$ pt-table-checksum --recursion-method dsn=D=percona,t=dsns --no-check-binlog-format --nocheck-replication-filter --host 127.0.0.1 --user root --port 48008 --password=msandbox
Checking if all tables can be checksummed ...
Starting checksum ...
  at /usr/bin/pt-table-checksum line 332.
Replica lag is 66 seconds on bm-support01.bm.int.percona.com.  Waiting.
Replica lag is 46 seconds on bm-support01.bm.int.percona.com.  Waiting.
Replica lag is 33 seconds on bm-support01.bm.int.percona.com.  Waiting.
           TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
02-26T16:27:59      0      0     5000          0       1       0   0.037 test.sbtest1
02-26T16:27:59      0      0     5000          0       1       0   0.039 test.sbtest10
02-26T16:27:59      0      1     5000          0       1       0   0.033 test.sbtest100
02-26T16:27:59      0      1     5000          0       1       0   0.034 test.sbtest11
02-26T16:27:59      0      1     5000          0       1       0   0.040 test.sbtest12
02-26T16:27:59      0      1     5000          0       1       0   0.034 test.sbtest13

Fixing the data inconsistencies

Analyzing the DIFFS column, it is possible to identify which tables were compromised. With this information, the pt-table-sync tool was used to fix these inconsistencies. The tool synchronizes MySQL table data efficiently, performing changes on the master that do not modify its data, so that they are replicated and applied on the slave. The tool needs to be run on the slave server. Below is an example of the tool running:

$ pt-table-sync --execute --sync-to-master h=localhost,u=root,p=msandbox,D=test,t=sbtest100,S=/tmp/mysql_sandbox48009.sock

It is possible to perform a dry-run of the tool before executing the changes to check what changes the tool will apply:

$ pt-table-sync --print --sync-to-master h=localhost,u=root,p=msandbox,D=test,t=sbtest100,S=/tmp/mysql_sandbox48009.sock
REPLACE INTO `test`.`sbtest100`(`id`, `k`, `c`, `pad`) VALUES ('1', '1654', '97484653464-60074971666-42998564849-40530823048-27591234964-93988623123-02188386693-94155746040-59705759910-14095637891', '15000678573-85832916990-95201670192-53956490549-57402857633') /*percona-toolkit src_db:test src_tbl:sbtest100 src_dsn:D=test,P=48008,S=/tmp/mysql_sandbox48009.sock,h=127.0.0.1,p=...,t=sbtest100,u=root dst_db:test dst_tbl:sbtest100 dst_dsn:D=test,S=/tmp/mysql_sandbox48009.sock,h=localhost,p=...,t=sbtest100,u=root lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:17806 user:vinicius.grippa host:bm-support01.bm.int.percona.com*/;
REPLACE INTO `test`.`sbtest100`(`id`, `k`, `c`, `pad`) VALUES ('2', '3007', '31679133794-00154186785-50053859647-19493043469-34585653717-64321870163-33743380797-12939513287-31354198555-82828841987', '30122503210-11153873086-87146161761-60299188705-59630949292') /*percona-toolkit src_db:test src_tbl:sbtest100 src_dsn:D=test,P=48008,S=/tmp/mysql_sandbox48009.sock,h=127.0.0.1,p=...,t=sbtest100,u=root dst_db:test dst_tbl:sbtest100 dst_dsn:D=test,S=/tmp/mysql_sandbox48009.sock,h=localhost,p=...,t=sbtest100,u=root lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:17806 user:vinicius.grippa host:bm-support01.bm.int.percona.com*/;

After executing pt-table-sync, we recommend that you run pt-table-checksum again and check that the DIFFS column shows a value of 0.
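
For example, to re-check just the tables that were fixed, the same command can be re-run with a table filter (the --tables option restricts the checksum to the listed tables):

$ pt-table-checksum -uroot -pmsandbox --socket=/tmp/mysql_sandbox48008.sock --replicate=percona.checksums --no-check-binlog-format --recursion-method=hosts --tables=test.sbtest100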

Conclusion

This blog post was intended to cover the issues that can happen to a slave when it goes out-of-sync, such as DDL operations, binary log purges, and DML operations. This process involves many steps and it could take a long time to finish, especially for large databases. Note that this process might take longer than a backup/restore process. However, in situations like the one mentioned above, it might be the only solution to recover a slave.


Image based on Photo by Randy Fath on Unsplash

 

Mar
11
2019
--

Switch your PostgreSQL Primary for a Read Replica, Without Downtime


In my ongoing research to identify solutions and similarities between MySQL and PostgreSQL, I recently faced a simple issue. I needed to shift a slave from one IP to another and I did not want to have to restart the slave that is serving the reads. In MySQL, I can repoint replication online with the command CHANGE MASTER TO, so I was looking for a similar solution in Postgres. In my case, I could also afford some stale reads, so a few seconds of delay would have been OK, but I couldn’t take down the server.

After brief research, I found that there is no solution that allows you to do that without restarting the PostgreSQL server instance. I was a bit disappointed, because I was just trying to move the whole traffic from one subnet to another, so not really changing the Master, but just the pointer.

At this point I raised my question to my colleagues who are experts in PG. Initially they confirmed to me that there is no real dynamic solution/command for that. However, while discussing this, one of them (Jobin Augustine) suggested a not “officially supported” way that might work.

In brief, given that the WAL Receiver uses its own process, killing it would trigger an internal refresh operation, and that could result in having the replication restart from the new desired configuration.

This was an intriguing suggestion, but I wondered if it might have some negative side effects. In any case, I decided to try it and see what would happen.

This article describes the process I followed to test the approach. To be clear: this is not an “official” solution, and it is not recommended as best practice.

From now on in this article I will drop the standard MySQL terms and instead use Primary for Master and Replica for Slave.

Scenarios

I carried out two main tests:

  1. No load in writing
  2. Writing happening

For each of these I took these steps:

a) move the Replica to the same Primary (different IP)
b) move the Replica to a different Primary/Replica, creating a chain, so from:

                          +--------+
                          | Primary|
                          +----+---+
                               |
                +--------+     |    +--------+
                |Replica1+<----+--->+Replica2|
                +--------+          +--------+

To:

                          +-------+
                          |Primary|
                          +---+---+
                              |
                              v
                          +---+----+
                          |Replica2|
                          +---+----+
                              |
                              v
                          +---+----+
                          |Replica1|
                          +--------+

The other aim was to be as non-invasive as possible. Given that, I used kill with SIGQUIT (3) instead of the more brutal SIGKILL.

SIGQUIT: “The SIGQUIT signal is sent to a process by its controlling terminal when the user requests that the process quit and perform a core dump.”

Note that I did try this with SIGTERM (15), which is the nicest approach, but it didn’t in fact force the process to perform the shift as desired.

In general in all the following tests what I execute is:

ps aux|grep 'wal receiver'
kill -3 <pid>

These are the current IPs for node:

Node1 (Primary):

NIC1 = 192.168.1.81
NIC2 = 192.168.4.81
NIC3 = 10.0.0.81

Node2 (replica1):

NIC1 = 192.168.1.82
NIC2 = 192.168.4.82
NIC3 = 10.0.0.82

Node3 (replica2):

NIC1 = 192.168.1.83
NIC2 = 192.168.4.83
NIC3 = 10.0.0.83

The starting position is:

select pid,usesysid,usename,application_name,client_addr,client_port,backend_start,state,sent_lsn,write_lsn,flush_lsn,sync_state from pg_stat_replication;
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22495 |    24601 | replica | node2            | 192.168.4.82 |       49518 | 2019-02-06 11:07:46.507511-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

And now let’s roll the ball and see what happens.

Experiment 1 – moving to the same Primary, no load

I will move Node2 to point to 192.168.1.81

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

change to:

primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 8343 0.0 0.0 667164 2180 ? Ss Feb06 16:27 postgres: wal receiver process streaming 10/FD6C60E8

Checking the replication status:

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres  8343  0.0  0.0 667164  2180 ?        Ss   Feb06  16:27 postgres: wal receiver process   streaming 10/FD6C60E8
                                                                  Tue 19 Feb 2019 12:10:22 PM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 23748 |    24601 | replica | node2            | 192.168.4.82 |       49522 | 2019-02-19 12:09:31.054915-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Tue 19 Feb 2019 12:10:23 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(1 row)
                                                                  Tue 19 Feb 2019 12:10:26 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 23756 |    24601 | replica | node2            | 192.168.1.82 |       37866 | 2019-02-19 12:10:26.904766-05 | catchup   | 10/FD460000 | 10/FD3A0000 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Tue 19 Feb 2019 12:10:28 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 23756 |    24601 | replica | node2            | 192.168.1.82 |       37866 | 2019-02-19 12:10:26.904766-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)

It takes six seconds to kill the process, shift to a new IP, and perform the catch up.

Experiment 2 – moving to a different Primary (as a chain of replicas), no load

I will move Node2 to point to 192.168.4.83

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 25859 0.0 0.0 667164 3484 ? Ss Feb19 1:53 postgres: wal receiver process

On Node1

Thu 21 Feb 2019 04:23:26 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 31241 |    24601 | replica | node2            | 192.168.1.82 |       38232 | 2019-02-21 04:17:24.535662-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Thu 21 Feb 2019 04:23:27 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

On Node3

pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
-----+----------+---------+------------------+-------------+-------------+---------------+-------+----------+-----------+-----------+------------
(0 rows)
                                                                  Thu 21 Feb 2019 04:23:30 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 1435 |    24601 | replica | node2            | 192.168.4.82 |       58116 | 2019-02-21 04:23:29.846798-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

In this case, shifting to a new primary took four seconds.

Now all this is great, but I was working with NO load. What would happen with reads and writes taking place?

Experiment 3 – moving to the same Primary WITH load

I will move Node2 to point to 192.168.4.81

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 20765 0.2 0.0 667196 3712 ? Ss 06:23 0:00 postgres: wal receiver process streaming 11/E33F760

Thu 21 Feb 2019 06:23:03 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 31649 |    24601 | replica | node2            | 192.168.1.82 |       38236 | 2019-02-21 06:21:23.539493-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
                                                                 Thu 21 Feb 2019 06:23:04 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/904DCC0 | 11/904C000 | 11/904C000 | async
                                                                 Thu 21 Feb 2019 06:23:08 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | catchup   | 11/9020000 |            |            | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/9178000 | 11/9178000 | 11/9178000 | async
                                                                 Thu 21 Feb 2019 06:23:09 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async

In this case shifting to a new primary takes six seconds.

Experiment 4 – moving to a different Primary (as a chain of replicas) WITH load

I move Node2 to point to 192.168.4.83
In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 21158 6.3 0.0 667196 3704 ? Ds 06:30 0:09 postgres: wal receiver process streaming 11/4F000000

Node1

Thu 21 Feb 2019 06:30:56 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
(2 rows)
                                                                  Thu 21 Feb 2019 06:30:57 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/17DAA000 | 11/17DAA000 | 11/17DAA000 | async
(1 row)

Node3

Thu 21 Feb 2019 06:31:01 AM EST (every 1s)
 pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
-----+----------+---------+------------------+-------------+-------------+---------------+-------+----------+-----------+-----------+------------
(0 rows)
                                                                 Thu 21 Feb 2019 06:31:02 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |  state  |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
------+----------+---------+------------------+--------------+-------------+-------------------------------+---------+-------------+-------------+-------------+------------
 1568 |    24601 | replica | node2            | 192.168.4.82 |       58122 | 2019-02-21 06:31:01.937957-05 | catchup | 11/17960000 | 11/17800000 | 11/177F8CC0 | async
(1 row)
                                                                  Thu 21 Feb 2019 06:31:03 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 1568 |    24601 | replica | node2            | 192.168.4.82 |       58122 | 2019-02-21 06:31:01.937957-05 | streaming | 11/1A1D3D08 | 11/1A1D3D08 | 11/1A1D3D08 | async
(1 row)

In this case shifting to a new primary took seven seconds.

Finally, I did another test. I was wondering, can I move the server Node2 back under the main Primary Node1 while writes are happening?

Well, here’s what happened:

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

After I kill the process as I did in the previous examples, Node2 rejoined the Primary Node1, but …

Thu 21 Feb 2019 06:33:58 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
  1901 |    24601 | replica | node2            | 192.168.4.82 |       49900 | 2019-02-21 06:33:57.81308-05  | catchup   | 11/52E40000 | 11/52C00000 | 11/52BDFFE8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/5D3F9EC8 | 11/5D3F9EC8 | 11/5D3F9EC8 | async

… Node2 was not really able to catch up quickly, or at least not while the load on the primary was high. As soon as I reduced the application pressure:

Thu 21 Feb 2019 06:35:29 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
  1901 |    24601 | replica | node2            | 192.168.4.82 |       49900 | 2019-02-21 06:33:57.81308-05  | streaming | 11/70AE8000 | 11/70000000 | 11/70000000 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/70AE8000 | 11/70AE8000 | 11/70AE8000 | async

Node2 was able to catch up and align itself.

Conclusions

In all tests, the Replica was able to rejoin the Primary or the new primary, with obviously different times.

From the tests I carried out so far, it seems that modifying the replication source, and then killing the “WAL receiver” thread, is a procedure that allows us to shift the replication source without the need for a service restart.

This is even more efficient compared to the MySQL solution, given the time taken for the recovery and the flexibility.

What I am still wondering is whether this might cause data inconsistency issues. I asked some of the PostgreSQL experts inside the company, and it seems that the process should be relatively safe, but I would appreciate any feedback or comments in case you know this may not be a safe operation.

Good PostgreSQL to everybody!


Photo by rawpixel.com from Pexels

Jan
23
2019
--

MongoDB Replica Set Scenarios and Internals – Part II (Elections)


In this blog post, we will walk through the internals of the election process in MongoDB®, following on from a previous post on the internals of the replica set. You can read Part 1 here.

For this post, I refer to the same configuration we discussed before.

Elections: As the term suggests, in MongoDB there is a freedom to “vote”: individual nodes of the cluster can vote and select their primary member for that replica set cluster.

Why Elections? MongoDB maintains high availability through this process.

When do elections take place?

  1. When a node does not find a primary node within the election timeout limit. By default this value is 10s, and from MongoDB version 3.2 it can be changed according to your needs. The parameter to set this value is
    settings.electionTimeoutMillis

      and can be seen in the logs as:

settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5ba8ed10d4fddccfedeb7492') } }

From the mongo shell, the value for the

electionTimeoutMillis

  can be found in replica set configuration as:

rplint:SECONDARY> rs.conf()
{
	"_id" : "rplint",
	"version" : 3,
	"protocolVersion" : NumberLong(1),
	"members" : [
		{
			"_id" : 0,
			"host" : "m103:25001",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "192.168.103.100:25002",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "192.168.103.100:25003",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : 60000,
		"getLastErrorModes" : {
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("5c20ff87272eff3a5e28573f")
	}
}

More precisely the value for

electionTimeoutMillis

  can be found at:

rplint:SECONDARY> rs.conf().settings.electionTimeoutMillis
10000
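
If the default does not suit your environment, the election timeout can be changed through a replica set reconfiguration. A minimal sketch from the mongo shell, run against the primary (the 5000ms value is purely illustrative):

// Illustrative only: lower the election timeout to 5 seconds
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 5000
rs.reconfig(cfg)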

2.  When the priority of the existing primary is superseded by another node, for example during planned maintenance carried out through replica set configuration settings. The priority of a member node can be changed as explained here; see the sketch after the output below.

The priority of all three members can be seen from the replica set configuration like this:

rplint:SECONDARY> rs.conf().members[0].priority
1
rplint:SECONDARY>
rplint:SECONDARY>
rplint:SECONDARY> rs.conf().members[2].priority
1
rplint:SECONDARY> rs.conf().members[1].priority
1
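
To influence which member takes over during planned maintenance, you can raise a member's priority with rs.reconfig(). A hedged sketch; the member index and priority value are illustrative:

// Illustrative only: prefer member 1 as primary by giving it a higher priority
cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)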

How do elections work in a MongoDB replica set cluster?

Before real elections, the node runs a dry election. Dry election? Yes, the node first runs dry elections, and if the node wins a dry election, then an actual election begins. Here’s how:

  1. The candidate node asks every other node whether it would vote for it, through
    replSetRequestVotes

     , without incrementing its term.

  2. The primary node steps down if it finds a candidate with a term higher than its own. Otherwise the dry election fails, and the replica set continues to run as it did before.
  3. If the dry election succeeds, then an actual election begins.
  4. For the real election, the node increments its term and then votes for itself.
  5. The VoteRequester sends the
    replSetRequestVotes

     command through the ScatterGatherRunner, and then each node responds with its vote.

  6. The candidate that receives votes from the most nodes wins the election.
  7. Once the candidate wins, it transitions to primary. Through heartbeats it sends a notification to all other nodes.
  8. Then the candidate node checks if it needs to catch up from the former primary node.
  9. The node that receives the 
    replSetRequestVotes

     command checks its own term and then votes, but only after ReplicationCoordinator receives confirmation from TopologyCoordinator

  10. The TopologyCoordinator grants the vote after the following checks:
    1. The config version must match.
    2. The replica set name must match.
    3. An arbiter voter must not see any healthy primary of greater or equal priority.

An example

A primary (port 25002) transitions to secondary after receiving the

rs.stepDown()

  command.

2019-01-03T03:05:29.972+0000 I COMMAND  [conn124] Attempting to step down in response to replSetStepDown command
2019-01-03T03:05:29.976+0000 I REPL     [conn124] transition to SECONDARY
driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.15" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2019-01-03T03:05:40.874+0000 I REPL     [ReplicationExecutor] Member m103:25001 is now in state PRIMARY
2019-01-03T03:05:41.459+0000 I REPL     [rsBackgroundSync] sync source candidate: m103:25001
2019-01-03T03:05:41.459+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to m103:25001
2019-01-03T03:05:41.460+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to m103:25001, took 1ms (1 connections now open to m103:25001)
2019-01-03T03:05:41.461+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to m103:25001
2019-01-03T03:05:41.462+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to m103:25001, took 1ms (2 connections now open to m103:25001)

A dry election at the candidate node (port 25001) succeeds, since no primary was found:

2019-01-03T03:05:31.498+0000 I REPL     [rsBackgroundSync] could not find member to sync from
2019-01-03T03:05:36.493+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to 192.168.103.100:25002: InvalidSyncSource: Sync source was cleared. Was 192.168.103.100:25002
2019-01-03T03:05:39.390+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2019-01-03T03:05:39.390+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected. current term: 35
2019-01-03T03:05:39.391+0000 I REPL     [ReplicationExecutor] VoteRequester(term 35 dry run) received a yes vote from 192.168.103.100:25002; response message: { term: 35, voteGranted: true, reason: "", ok: 1.0 }

The dry election succeeds, and the node increments its term by 1 for the real election (here the term was 35 and becomes 36). It then transitions to primary and enters catch-up mode.

2019-01-03T03:05:39.391+0000 I REPL [ReplicationExecutor] dry election run succeeded, running for election in term 36
2019-01-03T03:05:39.394+0000 I REPL [ReplicationExecutor] VoteRequester(term 36) received a yes vote from 192.168.103.100:25003; response message: { term: 36, voteGranted: true, reason: "", ok: 1.0 }
2019-01-03T03:05:39.395+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 36
2019-01-03T03:05:39.395+0000 I REPL [ReplicationExecutor] transition to PRIMARY
2019-01-03T03:05:39.395+0000 I REPL [ReplicationExecutor] Entering primary catch-up mode.

Other nodes also receive information about the new primary.

2019-01-03T03:05:31.498+0000 I REPL [rsBackgroundSync] could not find member to sync from
2019-01-03T03:05:36.493+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to 192.168.103.100:25002: InvalidSyncSource: Sync source was cleared. Was 192.168.103.100:25002
2019-01-03T03:05:41.499+0000 I REPL [ReplicationExecutor] Member m103:25001 is now in state PRIMARY

This is how MongoDB maintains high availability: by electing a new primary from the replica set when the existing primary fails.


Photo by Daria Shevtsova from Pexels

Jan
18
2019
--

Replication Manager Works with MariaDB


Some time ago I wrote a script to manage asynchronous replication links between Percona XtraDB clusters. The original post can be found here. The script worked well with Percona XtraDB Cluster but it wasn’t working well with MariaDB®.  Finally, the replication manager works with MariaDB.

First, let’s review the purpose of the script. Managing replication links between Galera based clusters is a tedious task. There are many potential slaves and many potential masters. Furthermore, each replication link must have only a single slave. Just try to imagine how you would maintain the following replication topology:

A complex replication topology

The above topology consists of five clusters and four master-to-master links. The replication manager can easily handle this topology. Of course, it is not a fix for the limitations of asynchronous replication: you must make sure your writes are replication safe. You might want, for example, to maintain a global user list or to centralize some access logs. Just to refresh memories, here are some of the script highlights:

  • Uses the Galera cluster for Quorum
  • Configurable, arbitrarily complex topologies
  • The script stores the topology in database tables
  • Elects slaves automatically
  • Monitors replication links
  • Slaves can connect to a list of potential masters

As you probably know, MariaDB has a different GTID implementation and syntax for the multi-source replication commands. I took some time to investigate why the script was failing and fixed it. Now, provided you are using MariaDB 10.1.4+ with GTIDs, the replication manager works fine.
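
For illustration, this is roughly the MariaDB flavour of the multi-source commands the script now has to issue; the connection name, host, and credentials below are placeholders:

-- MariaDB named (multi-source) replication link using GTIDs -- placeholder values
CHANGE MASTER 'cluster2' TO
  MASTER_HOST='10.0.0.21', MASTER_USER='repl', MASTER_PASSWORD='secret',
  MASTER_USE_GTID=slave_pos;
START SLAVE 'cluster2';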

You can find the script here. Be aware that although I work for Percona, the script is not officially supported by Percona.

Nov
30
2018
--

PostgreSQL Streaming Physical Replication With Slots


PostgreSQL streaming physical replication with slots simplifies setup and maintenance procedures. Usually, you should estimate disk usage for the Write Ahead Log (WAL), set an appropriate limit on the number of segments, and set up a WAL archive procedure. In this article, you will see how to use replication with slots and understand what problems it could solve.

Introduction

PostgreSQL physical replication is based on WAL. The Write Ahead Log contains all database changes, saved in 16MB segment files. Normally, Postgres keeps only the segments needed between checkpoints, so with default settings just 1GB of WAL segment files is available.

Replication requires all WAL files created after backup and up until the current time. Previously, it was necessary to keep a huge archive directory (usually mounted by NFS to all slave servers). The slots feature introduced in 9.4 allows Postgres to track the latest segment downloaded by a slave server. Now, PostgreSQL can keep all segments on disk, even without archiving, if a slave is seriously behind its master due to downtime or networking issues. The drawback: the disk space could be consumed infinitely in the case of configuration error. Before continuing, if you need a better understanding of physical replication and streaming replication, I recommend you read “Streaming Replication with PostgreSQL“.

Create a sandbox with two PostgreSQL servers

To set up replication, you need at least two PostgreSQL servers. I’m using pgcli (pgc) to set up both servers on the same host. It’s easy to install on Linux, Windows, and OS X, and provides the ability to download and run any version of PostgreSQL on your staging server or even on your laptop.

python -c "$(curl -fsSL https://s3.amazonaws.com/pgcentral/install.py)"
mv bigsql master
cp -r master slave
$ cd master
master$ ./pgc install pg10
master$ ./pgc start pg10
$ cd ../slave
slave$ ./pgc install pg10
slave$ ./pgc start pg10

First of all you should allow the replication user to connect:

master$ echo "host replication replicator 127.0.0.1/32 md5" >> ./data/pg10/pg_hba.conf

If you are running master and slave on different servers, please replace 127.0.0.1 with the slave’s address.

Next pgc creates a shell environment file with PATH and all the other variables required for PostgreSQL:

master$ source ./pg10/pg10.env

Allow connections from the remote host, and create a replication user and slot on master:

master$ psql
postgres=# CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'replicator';
CREATE ROLE
postgres=# ALTER SYSTEM SET listen_addresses TO '*';
ALTER SYSTEM
postgres=# SELECT pg_create_physical_replication_slot('slot1');
pg_create_physical_replication_slot
-------------------------------------
(slot1,)

To apply the system variable changes and pg_hba.conf, restart the Postgres server:

master$ ./pgc stop ; ./pgc start
pg10 stopping
pg10 starting on port 5432

Test table

Create a table with lots of padding on the master:

master$ psql
psql (10.6)
Type "help" for help.
postgres=# CREATE TABLE t(id INT, pad CHAR(200));
postgres=# CREATE INDEX t_id ON t (id);
postgres=# INSERT INTO t SELECT generate_series(1,1000000) AS id, md5((random()*1000000)::text) AS pad;

Filling WAL with random data

To see the benefits of slots, we should fill the WAL with some data by running transactions. Repeat the update statement below to generate a huge amount of WAL data:

UPDATE t SET pad = md5((random()*1000000)::text);
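
If you prefer not to retype it, here is a sketch that repeats the update a few times from psql; the loop count is arbitrary:

DO $$
BEGIN
  -- run the padding update four times to produce plenty of WAL
  FOR i IN 1..4 LOOP
    UPDATE t SET pad = md5((random()*1000000)::text);
  END LOOP;
END
$$;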

Checking the current WAL size

You can check total size for all WAL segments from the shell or from psql:

master$ du -sh data/pg10/pg_wal
17M data/pg10/pg_wal
master$ source ./pg10/pg10.env
master$ psql
postgres=# \! du -sh data/pg10/pg_wal
17M data/pg10/pg_wal

Check maximum WAL size without slots activated

Before replication configuration, we can fill the WAL with random data and find that after 1.1G, the data/pg10/pg_wal directory size does not increase regardless of the number of update queries.

postgres=# UPDATE t SET pad = md5((random()*1000000)::text); -- repeat 4 times
postgres=# \! du -sh data/pg10/pg_wal
1.1G data/pg10/pg_wal
postgres=# UPDATE t SET pad = md5((random()*1000000)::text);
postgres=# \! du -sh data/pg10/pg_wal
1.1G data/pg10/pg_wal

Backup master from the slave server

Next, let’s make a backup for our slot1:

slave$ source ./pg10/pg10.env
slave$ ./pgc stop pg10
slave$ rm -rf data/pg10/*
# If you are running master and slave on different servers, replace 127.0.0.1 with master's IP address.
slave$ PGPASSWORD=replicator pg_basebackup -S slot1 -h 127.0.0.1 -U replicator -p 5432 -D $PGDATA -Fp -P -Xs -Rv

Unfortunately pg_basebackup hangs with: initiating base backup, waiting for checkpoint to complete.
We can wait for the next checkpoint, or force a checkpoint on the master. A checkpoint happens every checkpoint_timeout seconds, which is set to five minutes by default.

Forcing checkpoint on master:

master$ psql
postgres=# CHECKPOINT;

The backup continues on the slave side:

pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/92000148 on timeline 1
pg_basebackup: starting background WAL receiver
1073986/1073986 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/927FDDE8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed

The backup copies settings from the master, including its TCP port value. I’m running both master and slave on the same host, so I should change the port in the slave .conf file:

slave$ vim data/pg10/postgresql.conf
# old value port = 5432
port = 5433

Now we can return to the master and run some queries:

slave$ cd ../master
master$ source pg10/pg10.env
master$ psql
postgres=# UPDATE t SET pad = md5((random()*1000000)::text);
UPDATE t SET pad = md5((random()*1000000)::text);

After running these queries, the WAL size is now 1.4GB, already bigger than the 1.1GB we saw earlier! Repeat this update query three times and the WAL grows to 2.8GB:

master$ du -sh data/pg10/pg_wal
2.8G data/pg10/pg_wal

Clearly, the WAL could grow indefinitely until the whole disk space is consumed.
How do we find out the reason for this?

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
1/2A400630  | slot1     |  0/92000000 | 2.38

We have one slot that is 2.38GB behind the master.

Let’s repeat the update and check again. The gap has increased:

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
1/8D400238  |     slot1 | 0/92000000  | 3.93

Wait, though: we have already used slot1 for backup! Let’s start the slave:

master$ cd ../slave
slave$ ./pgc start pg10

Replication started without any additional change to recovery.conf:

slave$ cat data/pg10/recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replicator password=replicator passfile=''/home/pguser/.pgpass'' host=127.0.0.1 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
primary_slot_name = 'slot1'

The pg_basebackup -R option instructs the backup to write a recovery.conf file with all the required options, including primary_slot_name.

WAL size, all slots connected

The gap reduced several seconds after the slave started:

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
 1/8D400238 |     slot1 |  0/9A000000 | 3.80

And a few minutes later:

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
 1/9E5DECE0 |     slot1 |  1/9EB17250 | -0.01
postgres=# \!du -sh data/pg10/pg_wal
1.3G data/pg10/pg_wal

Slave server maintenance

Let’s simulate slave server maintenance with ./pgc stop pg10 executed on the slave. We’ll push some data onto the master again (execute the UPDATE query 4 times).

Now, “slot1” is again 2.36GB behind.

Removing unused slots

By now, you might realize that the problematic slot is not in use. In such cases, you can drop it so that the segments it was retaining can be recycled:

master$ psql
postgres=# SELECT pg_drop_replication_slot('slot1');

Finally the disk space is released:

master$ du -sh data/pg10/pg_wal
1.1G data/pg10/pg_wal

Important system variables

  • archive_mode is not required for streaming replication with slots.
  • wal_level – replica by default
  • max_wal_senders – set to 10 by default, a minimum of three for one slave, plus two for each additional slave
  • wal_keep_segments – 32 by default, not important here because PostgreSQL will keep all segments required by the slot
  • archive_command – not important for streaming replication with slots
  • listen_addresses – the only option you need to change, to allow remote slaves to connect
  • hot_standby – set to on by default, important for enabling reads on the slave
  • max_replication_slots – 10 by default (see the runtime configuration documentation: https://www.postgresql.org/docs/10/static/runtime-config-replication.html)

Summary

  • Physical replication setup is really easy with slots. By default in pg10, all settings are already prepared for replication setup.
  • Be careful with orphaned slots. PostgreSQL will not remove WAL segments for inactive slots with initialized restart_lsn.
  • Check pg_replication_slots restart_lsn value and compare it with current redo_lsn.
  • Avoid long downtime for slave servers with slots configured.
  • Please use meaningful names for slots, as that will simplify debugging.


Nov
29
2018
--

MySQL High Availability: Stale Reads and How to Fix Them


Continuing on the series of blog posts about MySQL High Availability, today we will talk about stale reads and how to overcome this issue.

The Problem

A stale read is a read operation that fetches an incorrect value from a source that has not synchronized an update operation to the value (source: Wiktionary).

A practical scenario is when your application applies INSERT or UPDATE data to your master/writer node, and has to read it immediately after. If this particular read is served from another server in the replication/cluster topology, the data is either not there yet (in case of an INSERT) or it still provides the old value (in case of an UPDATE).

If your application or part of your application is sensitive to stale reads, then this is something to consider when implementing HA/load balancing.

How NOT to fix stale reads

While working with customers, we have seen a few incorrect attempts to fix the issue:

SELECT SLEEP(X)

The most common incorrect approach that we see in Percona support is when customers add a sleep between the write and the read. This may work in some cases, but it’s not 100% reliable for all scenarios, and it can add latency when there is no need.

Let’s review an example where, by the time you query your slave, the data has already been applied, yet you have configured your transaction to start with a SELECT SLEEP(1). In this case, you just added 1000ms of latency when there was no need for it.

Another example could be when the slave is lagging behind by more than whatever you configured as the parameter of the sleep command. In that case, you will have to create a loop to keep retrying the sleep until the slave has applied the data: potentially this could take several seconds.

Reference: SELECT SLEEP.

Semisync replication

By default, MySQL replication is asynchronous, and this is exactly what causes the stale read. However, MySQL distributes a plugin that can make the replication semi-synchronous. We have seen customers enabling it hoping the stale reads problem will go away. In fact, that is not the case. The semi-synchronous plugin only ensures that at least one slave has received it (IO Thread has streamed the binlog event to relay log), but the action of applying the event is done asynchronously. In other words, stale reads are still a problem with semi-sync replication.

Reference: Semisync replication.

How to PROPERLY fix stale reads

There are several ways to fix/overcome this situation, and each one has its pros and cons:

1) MASTER_POS_WAIT

Consists of executing SHOW MASTER STATUS right after your write, getting the binlog file and position, connecting to a slave, and executing the MASTER_POS_WAIT function, passing the binlog file and position as parameters. The function blocks until the slave has applied that position. You can optionally pass a timeout so the call returns once it is exceeded.
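
A minimal sketch of the flow; the file name, position, and 5-second timeout are illustrative:

-- On the master, immediately after the write:
SHOW MASTER STATUS;   -- note File (e.g. mysql-bin.000012) and Position (e.g. 4302)

-- On the slave, before the read:
SELECT MASTER_POS_WAIT('mysql-bin.000012', 4302, 5);
-- Returns the number of events waited for, -1 on timeout, NULL if replication is not running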

Pros:

  • Works on all MySQL versions
  • No prerequisites

Cons:

  • Requires an application code rewrite.
  • It’s a blocking operation, and can add significant latency to queries in cases where a slave/node is too far behind.

Reference: MASTER_POS_WAIT.

2) WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS

Requires GTID: this is similar to the previous approach, but in this case, we need to track the executed GTID from the master (also available on SHOW MASTER STATUS).
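
A sketch of how this might look; the GTID set and timeout values are illustrative:

-- On the master, immediately after the write:
SHOW MASTER STATUS;   -- note Executed_Gtid_Set

-- On the slave, before the read (5 second timeout):
SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('3E11FA47-71CA-11E1-9E33-C80AA9429562:1-77', 5);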

Pros:

  • Works on all MySQL versions that support GTID.

Cons:

  • Requires an application code rewrite.
  • It’s a blocking operation, can add significant latency to queries in cases where a slave/node is too far behind.
  • As it requires GTID, it only works on versions from 5.6 onwards.

Reference: WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS

3) Querying slave_relay_log_info

Consists of enabling relay_log_info_repository=TABLE and sync_relay_log_info=1 on the slave, and using a similar approach to option 1. After the write, execute SHOW MASTER STATUS, connect to the slave, and query mysql.slave_relay_log_info, comparing the binlog name and position to verify whether the slave is already applying a position at or after the one you got from SHOW MASTER STATUS.
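
A rough sketch of the check; the column names come from the mysql.slave_relay_log_info table and may vary slightly between versions:

-- On the slave, with relay_log_info_repository=TABLE and sync_relay_log_info=1:
SELECT Master_log_name, Master_log_pos FROM mysql.slave_relay_log_info;
-- Compare these coordinates with File/Position from SHOW MASTER STATUS on the master;
-- if they are equal or ahead, the slave has applied the write and can serve the read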

Pros:

  • This is not a blocking operation.
  • In cases where the slave is missing the position you require, you can try to connect to another slave and repeat the process. There is even an option to fail over back to the master if none of the slaves have the said position.

Cons:

  • Requires an application code rewrite.
  • In cases of checking multiple slaves, this can add significant latency.

Reference: slave_relay_log_info.

4) wsrep-sync-wait

Requires Galera/Percona XtraDB Cluster: consists of setting a global/session variable to enforce consistency. This will block execution of subsequent queries until the node has applied all write-sets from its applier queue. It can be configured to trigger on multiple commands, such as SELECT, INSERT, and so on.
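
A minimal sketch of the session-level approach; the table and column in the SELECT are purely illustrative:

-- On a Galera / Percona XtraDB Cluster node serving reads:
SET SESSION wsrep_sync_wait = 1;             -- bit 1 enforces the causality check for READ statements
SELECT balance FROM accounts WHERE id = 42;  -- blocks until pending write-sets are applied
SET SESSION wsrep_sync_wait = 0;             -- optionally switch the check off again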

Pros:

  • Easy to implement. Built-in as a SESSION variable.

Cons:

  • Requires an application code rewrite in the event that you want to implement the solution on a per-session basis.
  • It’s a blocking operation, and can add significant latency to queries if a slave/node is too far behind.

Reference: wsrep-sync-wait

5) ProxySQL 2.0 GTID consistent reads

Requires MySQL 5.7 and GTID: MySQL 5.7 returns the GTID generated by a commit as part of the OK packet. ProxySQL, with the help of binlog readers installed on the MySQL servers, can keep track of which GTIDs each slave has already applied. With this information, plus the GTID received in the OK packet at the moment of the write, ProxySQL decides whether to route a subsequent read to one of the slaves/read nodes or have the master/write node serve the read.

Pros:

  • Transparent to the application – no code changes are required.
  • Adds minimal latency.

Cons:

  • This is still a new feature of ProxySQL 2.0, which is not yet GA.

Reference: GTID consistent reads.

Conclusions

Undesirable issues can arise from adding HA and distributing the load across multiple servers. Stale reads can cause an impact on applications sensitive to them. We have demonstrated various approaches you can use to overcome them.


Photo by Tim Evans on Unsplash

Nov
22
2018
--

Caveats With pt-table-checksum Using Row-Based Replication, and Replication Filters


As per the documentation, pt-table-checksum is a tool to perform online replication consistency checks by executing checksum queries on the master, which produces different results on replicas that are inconsistent with the master.

The master and each slave insert checksums into the percona.checksums table, and these are later compared for differences. It’s fairly obvious that the checksums need to be determined independently on each node, and so these inserts must be replicated as STATEMENT and not ROW. Otherwise, the slaves would just insert the same checksum as the master and not calculate it independently.

The tool only requires

binlog_format=STATEMENT

  for its own session. It sets this itself on the master, but will error if this isn’t already set on each slave node. The reason for the error is that the statement to change the session’s binlog_format will not be replicated. So if a slave has binlog_format=ROW then the slave itself will execute the checksum correctly, but the results will be written as a ROW. Any slaves of this slave in the chain will just insert the same result. See bug 899415.

This is only a problem if we have chained replication, and the error can be skipped with --no-check-binlog-format, so for simple replication setups with ROW or MIXED replication we can still use the tool. If we do not have a slave-of-slave style chained topology, then there’s no need to worry about this.
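
For example, on a simple (non-chained) topology you might skip that check explicitly; the host, user, and password below are placeholders:

# Hypothetical invocation on a non-chained topology, skipping the binlog_format check
pt-table-checksum --no-check-binlog-format --databases=testing h=master.example.com,u=checksum_user,p=secret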

However, there is one caveat to be aware of if you’re using replication filters: when a slave isn’t replicating a particular database due to binlog-ignore-db, this setting behaves differently with ROW based replication (RBR) vs. STATEMENT based.

Per the documentation, with RBR,

binlog-ignore-db=testing

will cause all updates to testing.* to be skipped. With STATEMENT-based replication it will cause all updates after

USE testing;

  to be ignored (regardless of where the updates were being written to).

pt-table-checksum operates in the following way:

use `testing`/*!*/;
SET TIMESTAMP=1541583280/*!*/;
REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT 'testing', 'testing', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, CONCAT(ISNULL(`id`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `testing`.`testing` /*checksum table*/

Due to the use testing, the slave will then skip these statements with no errors, and simply not write into percona.checksums.

As per the documentation:

The tool monitors replicas continually. If any replica falls too far behind in replication, pt-table-checksum pauses to allow it to catch up. If any replica has an error, or replication stops, pt-table-checksum pauses and waits.

In this case, you will see the tool continually wait, with the following debug output:

# pt_table_checksum:12398 10967 SELECT MAX(chunk) FROM `percona`.`checksums` WHERE db='testing' AND tbl='testing' AND master_crc IS NOT NULL
# pt_table_checksum:12416 10967 Getting last checksum on slave1
# pt_table_checksum:12419 10967 slave1 max chunk: undef
# pt_table_checksum:12472 10967 Sleep 0.25 waiting for chunks
# pt_table_checksum:12416 10967 Getting last checksum on slave1
# pt_table_checksum:12419 10967 slave1 max chunk: undef
# pt_table_checksum:12472 10967 Sleep 0.5 waiting for chunks
# pt_table_checksum:12416 10967 Getting last checksum on slave1
# pt_table_checksum:12419 10967 slave1 max chunk: undef

We don’t recommend using the tool with replication filters in place, but if --no-check-replication-filters is specified you should be aware of the differences in how different binlog formats handle these filters.

One workaround would be to replace

binlog-ignore-db=testing

with the following, which will just ignore writes to that database:

binlog-wild-ignore-table=testing.%

More resources

You can read more about pt-table-checksum in this blog post MySQL Replication Primer with pt-table-checksum and pt-table-sync

The latest version of Percona Toolkit can be downloaded from our website. All Percona software is open source and free.

Photo by Black ice from Pexels

 
