Mar
11
2019
--

Switch your PostgreSQL Primary for a Read Replica, Without Downtime

In my ongoing research to identify solutions and similarities between MySQL and PostgreSQL, I recently faced a simple issue. I needed to shift a slave from one IP to another without restarting the slave that is serving the reads. In MySQL, I can repoint replication online with CHANGE MASTER TO, so I was looking for a similar solution in PostgreSQL. In my case, I could also afford some stale reads, so a delay of a few seconds would have been OK, but I couldn't take down the server.
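
For reference, the MySQL-side repoint I had in mind is an online operation along these lines (a sketch; the host, port, and connection details are placeholders):

# On the MySQL replica: stop replication, repoint it, start it again. No server restart needed.
mysql -e "STOP SLAVE;"
mysql -e "CHANGE MASTER TO MASTER_HOST='192.168.1.81', MASTER_PORT=3306;"
mysql -e "START SLAVE;"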

After a brief search, I found that there is no solution that allows you to do this without restarting the PostgreSQL server instance.
I was a bit disappointed, because I was just trying to move all the traffic from one subnet to another, so not really changing the Master, just the pointer.

At this point I raised my question to my colleagues who are experts in PG. Initially they confirmed to me that there is no real dynamic solution/command for that. However, while discussing this, one of them (Jobin Augustine) suggested a way that is not "officially supported" but that might work.

In brief, since the WAL receiver runs as its own process, killing it triggers an internal refresh, which can result in replication restarting with the new desired configuration.

This was an intriguing suggestion, but I wondered if it might have some negative side effects. In any case, I decided to try it and see what would happen.

This article describes the process I followed to test the approach. To be clear: this is not an "official" solution, and it is not recommended as best practice.

From now on in this article I will drop the standard MySQL terms and instead use Primary for Master and Replica for Slave.

Scenarios

I carried out two main tests:

  1. no write load
  2. writes happening

For each of these, I took the following steps:

a) move the Replica to the same Primary (different IP)
b) move the Replica to a different Primary/Replica, creating a chain, so from:

          +--------+
          | Primary|
          +----+---+
               |
+--------+     |    +--------+
|Replica1+<----+--->+Replica2|
+--------+          +--------+

To:

+-------+
|Primary|
+---+---+
    |
    v
+---+----+
|Replica2|
+---+----+
    |
    v
+---+----+
|Replica1|
+--------+

The other goal was to be as non-invasive as possible. Given that, I used kill with SIGQUIT (3) instead of the more brutal SIGKILL.

SIGQUIT: "The SIGQUIT signal is sent to a process by its controlling terminal when the user requests that the process quit and perform a core dump."

Note that I did try this with SIGTERM (15), which is the gentlest approach, but it did not in fact force the process to perform the shift as desired.

In general, in all the following tests, this is what I execute:

ps aux|grep 'wal receiver'
kill -3 <pid>
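
Putting it all together, the whole shift boils down to something like the sketch below (the recovery.conf path and host values are examples, and this assumes the pre-PostgreSQL 12 recovery.conf setup used in these tests):

# 1. Point primary_conninfo at the new host in recovery.conf
sed -i 's/host=192.168.4.81/host=192.168.1.81/' "$PGDATA/recovery.conf"

# 2. Find the WAL receiver PID (pg_stat_wal_receiver shows it as well)
WR_PID=$(pgrep -f 'wal receiver')

# 3. SIGQUIT it: the startup process respawns the receiver, which reconnects using the edited conninfo
kill -3 "$WR_PID"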

These are the current IPs for each node:

Node1 (Primary):

NIC1 = 192.168.1.81
NIC2 = 192.168.4.81
NIC3 = 10.0.0.81

Node2 (replica1):

NIC1 = 192.168.1.82
NIC2 = 192.168.4.82
NIC3 = 10.0.0.82

Node3 (replica2):

NIC1 = 192.168.1.83
NIC2 = 192.168.4.83
NIC3 = 10.0.0.83

The starting position is:

select pid,usesysid,usename,application_name,client_addr,client_port,backend_start,state,sent_lsn,write_lsn,flush_lsn,sync_state from pg_stat_replication;
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22495 |    24601 | replica | node2            | 192.168.4.82 |       49518 | 2019-02-06 11:07:46.507511-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

And now let's get the ball rolling and see what happens.

Experiment 1 – moving to the same Primary, no load

I will move Node2 to point to 192.168.1.81

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

change to:

primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 8343 0.0 0.0 667164 2180 ? Ss Feb06 16:27 postgres: wal receiver process streaming 10/FD6C60E8

Checking the replication status:

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres  8343  0.0  0.0 667164  2180 ?        Ss   Feb06  16:27 postgres: wal receiver process   streaming 10/FD6C60E8
                                                                  Tue 19 Feb 2019 12:10:22 PM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 23748 |    24601 | replica | node2            | 192.168.4.82 |       49522 | 2019-02-19 12:09:31.054915-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Tue 19 Feb 2019 12:10:23 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(1 row)
                                                                  Tue 19 Feb 2019 12:10:26 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 23756 |    24601 | replica | node2            | 192.168.1.82 |       37866 | 2019-02-19 12:10:26.904766-05 | catchup   | 10/FD460000 | 10/FD3A0000 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Tue 19 Feb 2019 12:10:28 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 23756 |    24601 | replica | node2            | 192.168.1.82 |       37866 | 2019-02-19 12:10:26.904766-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)

It takes six seconds to kill the process, shift to a new IP, and perform the catch up.
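
The one-second snapshots shown above were taken on the primary with a simple watch loop; something along these lines (a sketch) reproduces it:

watch -n 1 'psql -c "select pid,usesysid,usename,application_name,client_addr,client_port,backend_start,state,sent_lsn,write_lsn,flush_lsn,sync_state from pg_stat_replication;"'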

Experiment 2 – moving to a different Primary (as a chain of replicas), no load

I will move Node2 to point to 192.168.4.83

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 25859 0.0 0.0 667164 3484 ? Ss Feb19 1:53 postgres: wal receiver process

On Node1

Thu 21 Feb 2019 04:23:26 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 31241 |    24601 | replica | node2            | 192.168.1.82 |       38232 | 2019-02-21 04:17:24.535662-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Thu 21 Feb 2019 04:23:27 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

On Node3

pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
-----+----------+---------+------------------+-------------+-------------+---------------+-------+----------+-----------+-----------+------------
(0 rows)
                                                                  Thu 21 Feb 2019 04:23:30 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 1435 |    24601 | replica | node2            | 192.168.4.82 |       58116 | 2019-02-21 04:23:29.846798-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

In this case, shifting to a new primary took four seconds.

Now all this is great, but I was working with NO load. What would happen with reads and writes taking place?

Experiment 3 – moving to the same Primary, WITH load

I will move Node2 to point to 192.168.4.81

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 20765 0.2 0.0 667196 3712 ? Ss 06:23 0:00 postgres: wal receiver process streaming 11/E33F760

Thu 21 Feb 2019 06:23:03 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 31649 |    24601 | replica | node2            | 192.168.1.82 |       38236 | 2019-02-21 06:21:23.539493-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
                                                                 Thu 21 Feb 2019 06:23:04 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/904DCC0 | 11/904C000 | 11/904C000 | async
                                                                 Thu 21 Feb 2019 06:23:08 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | catchup   | 11/9020000 |            |            | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/9178000 | 11/9178000 | 11/9178000 | async
                                                                 Thu 21 Feb 2019 06:23:09 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
 31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async

In this case shifting to a new primary takes six seconds.

Experiment 4 – moving to a different Primary (as a chain of replicas), WITH load

I move Node2 to point to 192.168.4.83
In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 21158 6.3 0.0 667196 3704 ? Ds 06:30 0:09 postgres: wal receiver process streaming 11/4F000000

Node1

Thu 21 Feb 2019 06:30:56 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
(2 rows)
                                                                  Thu 21 Feb 2019 06:30:57 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/17DAA000 | 11/17DAA000 | 11/17DAA000 | async
(1 row)

Node3

Thu 21 Feb 2019 06:31:01 AM EST (every 1s)
 pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
-----+----------+---------+------------------+-------------+-------------+---------------+-------+----------+-----------+-----------+------------
(0 rows)
                                                                 Thu 21 Feb 2019 06:31:02 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |  state  |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
------+----------+---------+------------------+--------------+-------------+-------------------------------+---------+-------------+-------------+-------------+------------
 1568 |    24601 | replica | node2            | 192.168.4.82 |       58122 | 2019-02-21 06:31:01.937957-05 | catchup | 11/17960000 | 11/17800000 | 11/177F8CC0 | async
(1 row)
                                                                  Thu 21 Feb 2019 06:31:03 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
 1568 |    24601 | replica | node2            | 192.168.4.82 |       58122 | 2019-02-21 06:31:01.937957-05 | streaming | 11/1A1D3D08 | 11/1A1D3D08 | 11/1A1D3D08 | async
(1 row)

In this case shifting to a new primary took seven seconds.

Finally, I did another test. I was wondering, can I move the server Node2 back under the main Primary Node1 while writes are happening?

Well, here’s what happened:

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

After I killed the process as in the previous examples, Node2 rejoined the Primary Node1, but …

Thu 21 Feb 2019 06:33:58 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
  1901 |    24601 | replica | node2            | 192.168.4.82 |       49900 | 2019-02-21 06:33:57.81308-05  | catchup   | 11/52E40000 | 11/52C00000 | 11/52BDFFE8 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/5D3F9EC8 | 11/5D3F9EC8 | 11/5D3F9EC8 | async

… Node2 was not really able to catch up quickly; or rather, it could not do so while the write load on the primary remained high. As soon as I reduced the application pressure:

Thu 21 Feb 2019 06:35:29 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
-------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
  1901 |    24601 | replica | node2            | 192.168.4.82 |       49900 | 2019-02-21 06:33:57.81308-05  | streaming | 11/70AE8000 | 11/70000000 | 11/70000000 | async
 22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/70AE8000 | 11/70AE8000 | 11/70AE8000 | async

Node2 was able to catch up and align itself.

Conclusions

In all tests, the Replica was able to rejoin the Primary or the new Primary, though the time it took obviously varied.

From the tests I carried out so far, it seems that modifying the replication source and then killing the WAL receiver process is a procedure that allows us to shift the replication source without the need for a service restart.

This is even more efficient than the MySQL solution, given the time taken for the recovery and the flexibility it offers.

What I am still wondering is whether this might cause some data inconsistency issues or not. I asked some of the PG experts inside the company, and it seems that the process should be relatively safe, but I would appreciate any feedback or comments in case you know this may not be a safe operation.

Good PostgreSQL to everybody!


Photo by rawpixel.com from Pexels

Mar
04
2019
--

Upcoming Webinar Wed 3/6: MySQL High Availability and Disaster Recovery

Join Percona CEO Peter Zaitsev as he presents MySQL High Availability and Disaster Recovery on Wednesday, March 6th, 2019, at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

Register Now

In this hour-long webinar, Peter describes the differences between high availability (HA) and disaster recovery (DR). Afterward, Peter will go through scenarios detailing how each is handled manually and in Amazon RDS.

He will review the pros and cons of managing HA and DR in the traditional database environment as well in the cloud. Having full control of these areas is daunting. However, Amazon RDS makes meeting these needs easier and more efficient.

Regardless of which path you choose, monitoring your environment is vital. Peter’s talk will make that message clear. A discussion of metrics you should regularly review to keep your environment working correctly and performing optimally concludes the webinar.

To learn more, register for Peter's webinar on MySQL High Availability and Disaster Recovery.

Feb
28
2019
--

Percona XtraDB Cluster 5.6.43-28.32 Is Now Available

Percona is glad to announce the release of Percona XtraDB Cluster 5.6.43-28.32 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

This release of Percona XtraDB Cluster includes support for Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.6.43-28.32 is now the current release, based on the corresponding Percona Server for MySQL and Galera library releases.

All Percona software is open-source and free.

Bugs Fixed

  • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Feb
28
2019
--

Percona XtraDB Cluster 5.7.25-31.35 Is Now Available

Percona is glad to announce the release of Percona XtraDB Cluster 5.7.25-31.35 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

This release of Percona XtraDB Cluster includes support for Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.7.25-31.35 is now the current release, based on the corresponding Percona Server for MySQL and Galera library releases.

All Percona software is open-source and free.

Bugs Fixed

  • PXC-2346: mysqld could crash when executing mysqldump --single-transaction while the binary log is disabled. This problem was also reported in PXC-1711, PXC-2371, and PXC-2419.
  • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Feb
20
2019
--

ProxySQL Native Support for Percona XtraDB Cluster (PXC)

ProxySQL in its versions up to 1.x did not natively support Percona XtraDB Cluster (PXC). Instead, it relied on the flexibility offered by the scheduler. This approach allowed users to implement their own preferred way to manage the ProxySQL behaviour in relation to the Galera events.

From version 2.0 we can use native ProxySQL support for PXC. The mechanism to activate native support is very similar to the one already in place for Group Replication.

In brief, it is based on the table [runtime_]mysql_galera_hostgroups, and the information needed is mostly the same:

  • writer_hostgroup: the hostgroup ID that refers to the WRITER
  • backup_writer_hostgroup: the hostgroup ID referring to the hostgroup that will contain the candidate writer servers
  • reader_hostgroup: the reader hostgroup ID, containing the list of servers that need to be taken into consideration
  • offline_hostgroup: the hostgroup ID that will eventually contain the writer that is put OFFLINE
  • active: True [1] / False [0], whether this configuration should be used or not
  • max_writers: the MAX number of writers you want to have at the same time. In a sane setup this should always be 1, but if you want to have multiple writers, you can define it up to the number of nodes.
  • writer_is_also_reader: if true [1], the writer will NOT be removed from the reader HG
  • max_transactions_behind: the wsrep_local_recv_queue length after which the node will be set OFFLINE. This must be set carefully, observing the node's behaviour.
  • comment: I suggest putting some meaningful notes here to identify what is what.

Given the above, let us see what we need to do in order to have a working Galera native solution.
I will have three servers:

192.168.1.205 (Node1)
192.168.1.21  (Node2)
192.168.1.231 (Node3)

As the set of hostgroups, I will have:

Writer  HG-> 100
Reader  HG-> 101
BackupW HG-> 102
offHG   HG-> 9101

To set it up

Servers first:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,1000);

Then the galera settings:

insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,0,1,1,16);

As usual if we want to have R/W split we need to define the rules for it:

insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'windmills','app_test',100,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1041,6033,'windmills','app_test',101,1,3,'^SELECT.*@@',1);
save mysql query rules to disk;
load mysql query rules to run;
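
A quick way to confirm which query rules are active at runtime is to ask the admin interface (a sketch; port 6032 is the default admin port, credentials are placeholders):

mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "
  SELECT rule_id, active, username, match_digest, destination_hostgroup, apply
    FROM runtime_mysql_query_rules
   ORDER BY rule_id;"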

Then another important variable: the server version. Please do yourself a favor and NEVER use the default.

update global_variables set variable_value='5.7.0' where variable_name='mysql-server_version';
LOAD MYSQL VARIABLES TO RUNTIME;SAVE MYSQL VARIABLES TO DISK;
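
To double-check the value that is actually in use, you can query the runtime table through the admin interface (a sketch; credentials are placeholders):

mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "
  SELECT variable_name, variable_value
    FROM runtime_global_variables
   WHERE variable_name='mysql-server_version';"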

Finally activate the whole thing:

save mysql servers to disk;
load mysql servers to runtime;

One thing to note before we go ahead: in the list of servers I had:

  1. Filled only the READER HG
  2. Used the same weight

This is because of the election mechanism ProxySQL uses to identify the writer, and the (many) problems that may come with it.

For now let us go ahead and see what happens when I load this information to runtime.
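
The hostgroup snapshots shown below come from the ProxySQL admin interface, joining the connection pool statistics with the runtime server list; a query along these lines (a sketch; credentials are placeholders) produces that layout:

mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "
  SELECT b.weight, c.*
    FROM stats_mysql_connection_pool c
    LEFT JOIN runtime_mysql_servers b
      ON c.hostgroup = b.hostgroup_id AND c.srv_host = b.hostname
   ORDER BY c.hostgroup, c.srv_host;"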

Before loading the configuration to runtime:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

After:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 1000   | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
| 1000   | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
mysql> select * from runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                active: 0  <----------- note this
            max_writers: 1
  writer_is_also_reader: 1
max_transactions_behind: 16
                comment: NULL
1 row in set (0.01 sec)

As we can see, ProxySQL has taken the nodes from my READER group and distributed them across the writer and backup_writer hostgroups.

But – and there is a but – wasn't my configuration set with active=0? Indeed it was, and I assume this is a bug (issue #1902).

The other thing we should note is that ProxySQL elected node 3 (192.168.1.231) as the writer.
As I said before, what should we do IF we want to have a specific node as the preferred writer?

We need to modify its weight. So, say we want to have node 1 (192.168.1.205) as the writer: we will need something like this:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,10000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);

Doing that will give us:

+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10000  | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 2209       |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 508        |
| 10000  | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 2209       |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 508        |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

As you may have noticed, given that we set node 1's WEIGHT higher, this node will also become the most utilized for reads.
We probably do not want that, so let us modify its reader weight.

update mysql_servers set weight=10 where hostgroup_id=101 and hostname='192.168.1.205';
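
As with the earlier changes, this only takes effect once it is pushed to runtime (and saved if you want it to persist), e.g.:

mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "
  LOAD MYSQL SERVERS TO RUNTIME;
  SAVE MYSQL SERVERS TO DISK;"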

At this point, if we trigger a failover by running set global wsrep_reject_queries=all; on node 1, ProxySQL will take action and elect another node as the writer:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 588        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 588        |
| 10000  | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 468        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Node 3 (192.168.1.231) is the new writer, and node 1 is in the special OFFLINE group.
Let's see now what happens IF we put node 1 back.

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10000  | 100       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
| 10000  | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Oops: the READER has come back with the HIGHEST weight and, as such, it will once more be the most used node. To fix it, we need to re-run the update as before.

But is there a way to avoid this? In short, the answer is NO!
This, in my opinion, is BAD and is worth a feature request, because it can really bring a node to its knees.

Now this is not the only problem. There is another point that is probably worth discussing: the fact that ProxySQL currently performs both FAILOVER and FAILBACK.

Failover is obviously something we want to have, but failback is another discussion. The point is that once the failover is complete and the cluster has redistributed the incoming requests, doing a failback is an impacting operation, and it can be a disruptive one too.

If all nodes are treated as equal, there is no real way to prevent it. If YOU set a node to be the main writer, however, something can be done; let us see what and how.
Say we have:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 1000   | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 613        |
| 10     | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 613        |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Let us put the node down
set global wsrep_reject_queries=all;

And check:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 506        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 506        |
| 1000   | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 527        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

We can now manipulate the weight in the special OFFLINE group and see what happens:

update mysql_servers set weight=10 where hostgroup_id=9101 and hostname='192.168.1.205';

Then I put the node up again:
set global wsrep_reject_queries=none;

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 573        |
| 10     | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 458	|
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 573	|
| 10     | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 458        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

That’s it, the node is back but with no service interruption.

At this point we can decide whether to make this node a reader like the others, or to wait and plan a proper time of day to put it back as the writer, while in the meantime it takes a bit of load to warm its buffer pool.

The other point – and an important piece of information – is what ProxySQL is actually checking on Galera. From reading the code, ProxySQL will trap the following:

  • read_only
  • wsrep_local_recv_queue
  • wsrep_desync
  • wsrep_reject_queries
  • wsrep_sst_donor_rejects_queries
  • primary_partition

Plus the standard sanity checks on the node.
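
Most of these map to status or global variables you can inspect directly on each node if you want to see what the monitor sees; for example (a sketch, with primary_partition corresponding to wsrep_cluster_status; connection options are placeholders):

mysql -h 192.168.1.205 -e "
  SHOW GLOBAL VARIABLES LIKE 'read_only';
  SHOW GLOBAL VARIABLES LIKE 'wsrep_desync';
  SHOW GLOBAL VARIABLES LIKE 'wsrep_reject_queries';
  SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_donor_rejects_queries';
  SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
  SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"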

Finally, to monitor the whole situation, we can use this:

mysql> select * from mysql_server_galera_log order by time_start_us desc limit 10;
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
| hostname      | port | time_start_us    | success_time_us | primary_partition | read_only | wsrep_local_recv_queue | wsrep_local_state | wsrep_desync | wsrep_reject_queries | wsrep_sst_donor_rejects_queries | error |
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
| 192.168.1.231 | 3306 | 1549982591661779 | 2884            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982591659644 | 2778            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982591658728 | 2794            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982586669233 | 2827            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982586663458 | 5100            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982586658973 | 4132            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982581665317 | 3084            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982581661261 | 3129            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982581658242 | 2786            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982576661349 | 2982            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+

As you can see above, the log table keeps track of what has changed. In this case, it reports that node 1 has wsrep_reject_queries activated, and it will continue to log this until we set wsrep_reject_queries=none.

Conclusions

ProxySQL's native Galera integration is a useful feature for managing any Galera implementation, whether it's Percona XtraDB Cluster (PXC), MariaDB Cluster or MySQL/Galera.

The generic approach is obviously a good thing; still, it may miss some specific extensions, like the performance_schema pxc_cluster_view table we have in PXC.

I've already raised my objection about failover/failback, and I am here again to remind you: whenever you do a controlled failover, REMEMBER to change the weight to prevent an immediate failback.

This is obviously not possible in the case of a real failover, where, for instance, a simple temporary eviction will cause two downtimes instead of only one. Some environments are fine with that; others are not.

Personally, I think there should be a FLAG in the configuration so that we can decide whether failback should be executed or not.

 

Jan
18
2019
--

Percona XtraDB Cluster Operator 0.2.0 Early Access Release Is Now Available

Percona announces the release of Percona XtraDB Cluster Operator 0.2.0 early access.

The Percona XtraDB Cluster Operator simplifies the deployment and management of Percona XtraDB Cluster in a Kubernetes or OpenShift environment. It extends the Kubernetes API with a new custom resource for deploying, configuring and managing the application through the whole life cycle.

Note: PerconaLabs and Percona-QA are open source GitHub repositories for unofficial scripts and tools created by Percona staff. These handy utilities can help save your time and effort.

Percona software builds located in the Percona-Lab and Percona-QA repositories are not officially released software, and also aren’t covered by Percona support or services agreements.

You can install the Percona XtraDB Cluster Operator on Kubernetes or OpenShift. While the operator does not support all the Percona XtraDB Cluster features in this early access release, instructions on how to install and configure it are already available along with the operator source code, hosted in our Github repository.
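
As a rough idea of what an install looks like, a minimal sketch follows; the manifest names other than deploy/cr.yaml (which is referenced in this post) are assumptions based on the operator repository layout, so check the official instructions for the exact steps:

# From a checkout of the operator source repository:
kubectl apply -f deploy/crd.yaml       # custom resource definitions (assumed file name)
kubectl apply -f deploy/rbac.yaml      # service account and role bindings (assumed file name)
kubectl apply -f deploy/operator.yaml  # the operator deployment itself (assumed file name)
kubectl apply -f deploy/cr.yaml        # your Percona XtraDB Cluster definition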

The Percona XtraDB Cluster Operator on Percona-Lab is an early access release. Percona doesn’t recommend it for production environments. 

New features

  • Advanced node assignment, implemented in this version, allows running containers with Percona XtraDB Cluster nodes on different hosts, availability zones, etc., to achieve higher availability and fault tolerance.
  • Cluster backups are now supported, and can be performed on a schedule or on demand.
  • Percona XtraDB Cluster Operator now supports private container registries like those in OpenShift so that Internet access is not required to deploy the operator.

Improvements

  • CLOUD-69: Annotations and labels are now passed from the deploy/cr.yaml configuration file to a StatefulSet for both Percona XtraDB Cluster and ProxySQL Pods
  • CLOUD-55: Now setting a password for the ProxySQL admin user is supported.
  • CLOUD-48: Migration to operator SDK 0.2.1

Fixed Bugs

  • CLOUD-82: Pods were stopped in random order during cluster removal, which could cause problems when recreating a cluster with the same name.
  • CLOUD-79: Setting long cluster name in the deploy/cr.yaml file made Percona XtraDB Cluster unable to start.
  • CLOUD-54: The clustercheck tool used the monitor user instead of its own clustercheck user for liveness and readiness probes.

Percona XtraDB Cluster is an open source, cost-effective and robust clustering solution for businesses. It integrates Percona Server for MySQL with the Galera replication library to produce a highly-available and scalable MySQL® cluster complete with synchronous multi-master replication, zero data loss and automatic node provisioning using Percona XtraBackup.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

Jan
11
2019
--

AWS Aurora MySQL – HA, DR, and Durability Explained in Simple Terms

It’s a few weeks after AWS re:Invent 2018 and my head is still spinning from all of the information released at this year’s conference. This year I was able to enjoy a few sessions focused on Aurora deep dives. In fact, I walked away from the conference realizing that my own understanding of High Availability (HA), Disaster Recovery (DR), and Durability in Aurora had been off for quite a while. Consequently, I decided to put this blog out there, both to collect the ideas in one place for myself, and to share them in general. Unlike some of our previous blogs, I’m not focused on analyzing Aurora performance or examining the architecture behind Aurora. Instead, I want to focus on how HA, DR, and Durability are defined and implemented within the Aurora ecosystem.  We’ll get just deep enough into the weeds to be able to examine these capabilities alone.

Aurora MySQL – What is it?

We’ll start with a simplified discussion of what Aurora is from a very high level.  In its simplest description, Aurora MySQL is made up of a MySQL-compatible compute layer and a multi-AZ (multi availability zone) storage layer. In the context of an HA discussion, it is important to start at this level, so we understand the redundancy that is built into the platform versus what is optional, or configurable.

Aurora Storage

The Aurora Storage layer presents a volume to the compute layer. This volume is built out in 10GB increments called protection groups.  Each protection group is built from six storage nodes, two from each of three availability zones (AZs).  These are represented in the diagram above in green.  When the compute layer—represented in blue—sends a write I/O to the storage layer, the data gets replicated six times across three AZs.

Durable by Default

In addition to the six-way replication, Aurora employs a 4-of-6 quorum for all write operations. This means that for each commit that happens at the database compute layer, the database node waits until it receives write acknowledgment from at least four out of six storage nodes. By receiving acknowledgment from four storage nodes, we know that the write has been saved in at least two AZs.  The storage layer itself has intelligence built-in to ensure that each of the six storage nodes has a copy of the data. This does not require any interaction with the compute tier. By ensuring that there are always at least four copies of data, across at least two datacenters (AZs), and ensuring that the storage nodes are self-healing and always maintain six copies, it can be said that the Aurora Storage platform has the characteristic of Durable by Default.  The Aurora storage architecture is the same no matter how large or small your Aurora compute architecture is.
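
To spell out the quorum arithmetic (the 3-of-6 read quorum is not mentioned above, but is part of AWS's published Aurora design):

V  = 6 copies (2 per AZ x 3 AZs)
Vw = 4 (write quorum),  Vr = 3 (read quorum)
Vr + Vw > V   ->  3 + 4 = 7 > 6   (any read quorum overlaps the most recent write quorum)
2 x Vw  > V   ->  2 x 4 = 8 > 6   (two conflicting writes cannot both reach quorum)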

One might think that waiting to receive four acknowledgments represents a lot of I/O time and is therefore an expensive write operation. However, Aurora database nodes do not behave the way a typical MySQL database instance would. Some of the round-trip execution time is mitigated by the way in which Aurora MySQL nodes write transactions to disk. For more information on exactly how this works, check out Amazon Senior Engineering Manager Kamal Gupta’s deep dive into Aurora MySQL from AWS re:Invent 2018.

HA and DR Options

While durability can be said to be a default characteristic to the platform, HA and DR are configurable capabilities. Let’s take a look at some of the HA and DR options available. Aurora databases are deployed as members of an Aurora DB Cluster. The cluster configuration is fairly flexible. Database nodes are given the roles of either Writer or Reader. In most cases, there will only be one Writer node. The Reader nodes are known as Aurora Replicas. A single Aurora Cluster may contain up to 15 Aurora Replicas. We’ll discuss a few common configurations and the associated levels of HA and DR which they provide. This is only a sample of possible configurations: it is not meant to represent an exhaustive list of the possible configuration options available on the Aurora platform.
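If you want to see these roles from inside the database itself, the sketch below is one hedged way to do it. It assumes the Aurora-specific information_schema.replica_host_status table and these column names are available on your engine version (they are not part of stock MySQL), so verify against the Aurora documentation before relying on it:

-- List cluster members and their replication lag from within Aurora MySQL.
-- The row whose session_id is 'MASTER_SESSION_ID' is the current Writer;
-- every other row is an Aurora Replica.
SELECT server_id,
       session_id,
       replica_lag_in_milliseconds
FROM information_schema.replica_host_status;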

Single-AZ, Single Instance Deployment

[Image: great durability with Aurora, but DR and HA less so]

The most basic implementation of Aurora is a single compute instance in a single availability zone. The compute instance is monitored by the Aurora Cluster service and will be restarted if the database instance or compute VM has a failure. In this architecture, there is no redundancy at the compute level. Therefore, there is no database level HA or DR. The storage tier provides the same high level of durability described in the sections above. The image below is a view of what this configuration looks like in the AWS Console.

Single-AZ, Multi-Instance

[Image: introducing HA into an Amazon Aurora solution]

HA can be added to a basic Aurora implementation by adding an Aurora Replica. We increase our HA level by adding Aurora Replicas within the same AZ. If desired, the Aurora Replicas can be used to also service some of the read traffic for the Aurora Cluster. This configuration cannot be said to provide DR because there are no database nodes outside the single datacenter or AZ. If that datacenter were to fail, then database availability would be lost until it was manually restored in another datacenter (AZ). It’s important to note that while Aurora has a lot of built-in automation, you will only benefit from that automation if your base configuration facilitates a path for the automation to follow. If you have a single-AZ base deployment, then you will not have the benefit of automated Multi-AZ availability. However, as in the previous case, durability remains the same. Again, durability is a characteristic of the storage layer. The image below is a view of what this configuration looks like in the AWS Console. Note that the Writer and Reader are in the same AZ.
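If you do route reads to the Aurora Replicas, a quick way to confirm which role a given connection landed on is to check InnoDB’s read-only flag. This is a minimal sketch based on documented Aurora behavior, worth verifying on your own cluster:

-- Returns 0 when connected to the Writer and 1 when connected to an Aurora Replica.
SELECT @@innodb_read_only;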

Multi-AZ Options

[Image: partial disaster recovery with Amazon Aurora]

Building on our previous example, we can increase our level of HA and add partial DR capabilities to the configuration by adding more Aurora Replicas. At this point we will add one additional replica in the same AZ, bringing the local-AZ footprint to three database instances (the Writer and two Aurora Replicas). We will also add one replica in each of the two remaining regional AZs. Aurora provides the option to configure automated failover priority for the Aurora Replicas. Your failover priority is best defined by your individual business needs. That said, one way to define the priority might be to set the first failover to the local-AZ replicas, and subsequent failover priority to the replicas in the other AZs. It is important to remember that AZs within a region are physical datacenters located within the same metro area. This configuration will provide protection for a disaster localized to the datacenter. It will not, however, provide protection for a city-wide disaster. The image below is a view of what this configuration looks like in the AWS Console. Note that we now have two Readers in the same AZ as the Writer and two Readers in two other AZs.

Cross-Region Options

The three configuration types we’ve discussed up to this point represent configuration options available within an AZ or metro area. There are also options available for cross-region replication in the form of both logical and physical replication.

Logical Replication

Aurora supports replication to up to five additional regions with logical replication.  It is important to note that, depending on the workload, logical replication across regions can be notably susceptible to replication lag.
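Because cross-region logical replication rides on MySQL binary logging, the source cluster must have binary logging enabled through its cluster parameter group. A minimal, hedged sanity check from the source side might look like the following (parameter names per MySQL/Aurora documentation):

-- Confirm binary logging is active before creating a cross-region (logical) replica;
-- on Aurora MySQL, binlog_format must not be OFF in the cluster parameter group.
SHOW VARIABLES LIKE 'binlog_format';
SHOW MASTER STATUS;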

Physical Replication

[Image: durability, high availability, and disaster recovery with Amazon Aurora]

One of the many announcements to come out of re:Invent 2018 is a product called Aurora Global Database. This is Aurora’s implementation of cross-region physical replication. Amazon’s published details on the solution indicate that it is storage level replication implemented on dedicated cross-region infrastructure with sub-second latency. In general terms, the idea behind a cross-region architecture is that the second region could be an exact duplicate of the primary region. This means that the primary region can have up to 15 Aurora Replicas and the secondary region can also have up to 15 Aurora Replicas. There is one database instance in the secondary region in the role of writer for that region. This instance can be configured to take over as the master for both regions in the case of a regional failure. In this scenario the secondary region becomes primary, and the writer in that region becomes the primary database writer. This configuration provides protection in the case of a regional disaster. It’s going to take some time to test this, but at the moment this architecture appears to provide the most comprehensive combination of Durability, HA, and DR. The trade-offs have yet to be thoroughly explored.

Multi-Master Options

Amazon is in the process of building out a new capability called Aurora Multi-Master. Currently, this feature is in preview phase and has not been released for general availability. While there were a lot of talks at re:Invent 2018 which highlighted some of the components of this feature, there is still no affirmative date for release. Early analysis points to the feature being localized to the AZ. It is not known if cross-region Multi-Master will be supported, but it seems unlikely.

Summary

As a post re:Invent takeaway, what I learned was that there is an Aurora configuration to fit almost any workload that requires strong performance behind it. Not all heavy workloads also demand HA and DR. If this describes one of your workloads, then there is an Aurora configuration that fits your needs. On the flip side, it is also important to remember that while data durability is an intrinsic quality of Aurora, HA and DR are not. These are completely configurable. This means that the Aurora architect in your organization must put thought and due diligence into the way they design your Aurora deployment. While we all need to be conscious of costs, don’t let cost consciousness become a blinder to reality. Just because your environment is running in Aurora does not mean you automatically have HA and DR for your database. In Aurora, HA and DR are configuration options, and just like the on-premises world, viable HA and DR have additional costs associated with them.


Jan
10
2019
--

PostgreSQL Updatable Views: Performing Schema Updates With Minimal Downtime

Recently, one of our customers asked us how to minimize downtime when upgrading the database structure with changes that are not backwards-compatible. It’s an interesting question and I would like to visit some alternatives here. I will use PostgreSQL for this series of posts and walk through updatable views, INSTEAD OF Triggers, and the Rule System. Later, we’ll discuss alternatives available for other databases like MySQL.

This first post will give an overview of the problem and also the first implementation of the solution in PostgreSQL using updatable Views.

The Motivation

Software is like a living organism and, as such, it evolves. It’s not surprising that database schemas also evolve, and this brings us to a problem: how do we minimize downtime when performing upgrades? Or, going further, is it possible to upgrade them without activating maintenance mode, thereby making the service unavailable for our customers?

Let’s say that we want to push out an update 2.0. It’s a major update, and in this update, there are application code changes and changes to the database such as altered tables, dropped columns, new tables and so on. Checking the changelog, we notice that most of the database changes are backwards-compatible but a few modified tables are not, so we can’t just push out the new database changes without breaking some functionality in the existing codebase. To avoid triggering errors while we upgrade the database, we need to shut down the application servers, update the database, update the codebase, and then get the servers back up and running again. That means that we need an unwanted maintenance window!

As per our definition of the problem, we want to get to the point where we don’t have to use this maintenance window, a point where the old and new codebase could coexist for a period of time while we upgrade the system. One solution is to not make changes that the current codebase can’t handle, but, as you may have already assumed, it isn’t really an option when we are constantly trying to optimize and improve our databases. Another option, then, would be to use PostgreSQL updatable views.

Updatable Views

PostgreSQL introduced automatically updatable views in version 9.3. The documentation[1] says that simple views are automatically updatable and the system will allow INSERT, UPDATE or DELETE statements to be used on the view in the same way as on a regular table. A view is automatically updatable if it satisfies all of the following conditions:

  • The view must have exactly one entry in its FROM list, which must be a table or another updatable view.
  • The view definition must not contain WITH, DISTINCT, GROUP BY, HAVING, LIMIT, or OFFSET clauses at the top level.
  • The view definition must not contain set operations (UNION, INTERSECT or EXCEPT) at the top level.
  • The view’s select list must not contain any aggregates, window functions, or set-returning functions.

Note that the idea is to give a simple mechanism that helps when using views, and if the view is automatically updatable the system will convert any INSERT, UPDATE or DELETE statement on the view into the corresponding statement on the underlying base table. This can also be used to increase security granularity, giving us the power to define privileges that operate at the level of the view rather than the underlying table. If the view uses a WHERE clause, we can add the CHECK OPTION to prevent the user from being able to UPDATE or INSERT rows that are not in the scope of the view. For example, let’s say we have a view created to limit the user to view records from a specific country. If the user changes the country of any record, those records would disappear from the view. The CHECK OPTION can help to prevent this from happening. I recommend reading the documentation for more information about how views work in PostgreSQL.
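To make the country example concrete, here is a minimal sketch; the accounts table and its column names are hypothetical, used only for illustration:

-- A view restricted to one country, protected with CHECK OPTION.
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL, country CHAR(2) NOT NULL);
CREATE VIEW accounts_br AS
    SELECT id, name, country FROM accounts WHERE country = 'BR'
    WITH CHECK OPTION;
-- Allowed: the new row satisfies the view's WHERE clause.
INSERT INTO accounts_br (id, name, country) VALUES (1, 'user_1', 'BR');
-- Rejected: the updated row would no longer be visible through the view,
-- so the CHECK OPTION raises an error instead of letting it silently disappear.
UPDATE accounts_br SET country = 'US' WHERE id = 1;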

Implementation

Using updatable views makes the implementation as simple as creating views. For our example, I will use the table below:

test=# CREATE TABLE t (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL, password VARCHAR(300) NOT NULL, date_created TIMESTAMP NOT NULL DEFAULT now());
CREATE TABLE
test=# INSERT INTO t(id, name, password) VALUES (1, 'user_1', 'pwd_1'), (2, 'user_2','pwd_2'),(3,'user_3','pwd_3'),(4,'user_4','pwd_4'),(5,'user_5','pwd_5');
INSERT 0 5
test=# SELECT * FROM t;
 id |  name  | password |        date_created
----+--------+----------+----------------------------
  1 | user_1 | pwd_1    | 2018-12-27 07:50:39.562455
  2 | user_2 | pwd_2    | 2018-12-27 07:50:39.562455
  3 | user_3 | pwd_3    | 2018-12-27 07:50:39.562455
  4 | user_4 | pwd_4    | 2018-12-27 07:50:39.562455
  5 | user_5 | pwd_5    | 2018-12-27 07:50:39.562455
(5 rows)

We then change the schema, renaming the column password to pwd and date_created to dt_created, and adding two more columns, pwd_salt and comment. The added columns are not a real problem because they can be either nullable or have a default value, but the column renames are a problem. The changes are:

test=# create schema v_10;
CREATE SCHEMA
test=# CREATE VIEW v_10.t AS SELECT id, name, password AS password, date_created AS date_created FROM public.t;
CREATE VIEW
test=# ALTER TABLE public.t RENAME COLUMN password TO pwd;
ALTER TABLE
test=# ALTER TABLE public.t RENAME COLUMN date_created TO dt_created;
ALTER TABLE
test=# ALTER TABLE public.t ADD COLUMN pwd_salt VARCHAR(100);
ALTER TABLE
test=# ALTER TABLE public.t ADD COLUMN comment VARCHAR(500);
ALTER TABLE

To make sure our application will work properly, we’ve defined that the tables will live in a specific main schema, which in this example is the public schema, while the views will live in the versioned schemas. In this case, if we have a change in one specific version that needs a view to guarantee backwards-compatibility, we just create the view inside the versioned schema and apply the changes to the table in the main schema. The application will always define the “search_path” as “versioned_schema,main_schema”, which is “v_10, public” in this example:

test=# SET search_path TO v_10, public;
SET
test=# SELECT * FROM t;
 id |  name  | password |        date_created
----+--------+----------+----------------------------
  1 | user_1 | pwd_1    | 2018-12-27 07:50:39.562455
  2 | user_2 | pwd_2    | 2018-12-27 07:50:39.562455
  3 | user_3 | pwd_3    | 2018-12-27 07:50:39.562455
  4 | user_4 | pwd_4    | 2018-12-27 07:50:39.562455
  5 | user_5 | pwd_5    | 2018-12-27 07:50:39.562455
(5 rows)
test=# select * from public.t;
 id |  name  |  pwd  |         dt_created         | pwd_salt | comment
----+--------+-------+----------------------------+----------+---------
  1 | user_1 | pwd_1 | 2018-12-27 07:50:39.562455 |          |
  2 | user_2 | pwd_2 | 2018-12-27 07:50:39.562455 |          |
  3 | user_3 | pwd_3 | 2018-12-27 07:50:39.562455 |          |
  4 | user_4 | pwd_4 | 2018-12-27 07:50:39.562455 |          |
  5 | user_5 | pwd_5 | 2018-12-27 07:50:39.562455 |          |
(5 rows)

As we can see, the application still sees the old schema, but does this work? What if someone updates the password of ID #3? Let’s check:

test=# UPDATE t SET password = 'new_pwd_3' WHERE id = 3;
UPDATE 1
test=# SELECT * FROM t;
 id |  name  | password  |        date_created
----+--------+-----------+----------------------------
  1 | user_1 | pwd_1     | 2018-12-27 07:50:39.562455
  2 | user_2 | pwd_2     | 2018-12-27 07:50:39.562455
  4 | user_4 | pwd_4     | 2018-12-27 07:50:39.562455
  5 | user_5 | pwd_5     | 2018-12-27 07:50:39.562455
  3 | user_3 | new_pwd_3 | 2018-12-27 07:50:39.562455
(5 rows)
test=# SELECT * FROM public.t;
 id |  name  |    pwd    |         dt_created         | pwd_salt | comment
----+--------+-----------+----------------------------+----------+---------
  1 | user_1 | pwd_1     | 2018-12-27 07:50:39.562455 |          |
  2 | user_2 | pwd_2     | 2018-12-27 07:50:39.562455 |          |
  4 | user_4 | pwd_4     | 2018-12-27 07:50:39.562455 |          |
  5 | user_5 | pwd_5     | 2018-12-27 07:50:39.562455 |          |
  3 | user_3 | new_pwd_3 | 2018-12-27 07:50:39.562455 |          |
(5 rows)

As we can see, the updatable view worked just like a charm! The new and old application codebases can coexist and work together while we roll out our upgrades. There are some restrictions, as explained in the documentation, like having only one table or view in the FROM list, but for its simplicity, updatable views do a great job. What about more complex cases where we need to split or join tables? We will discuss these in future articles and show how we can solve them with both TRIGGERS and the PostgreSQL Rule System.
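As a final, hypothetical cleanup step, once every client has been migrated to the new codebase, the compatibility layer can simply be dropped; a sketch of that last step:

-- The application switches its search_path back to the main schema...
SET search_path TO public;
-- ...and the versioned schema, together with its compatibility view, can be removed.
DROP SCHEMA v_10 CASCADE;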

References

[1] https://www.postgresql.org/docs/current/sql-createview.html


Photo by Egor Kamelev from Pexels

Jan
04
2019
--

Percona XtraDB Cluster 5.6.42-28.30 Is Now Available

Percona announces the release of Percona XtraDB Cluster 5.6.42-28.30 (PXC) on January 4, 2019. Binaries are available from the downloads section or our software repositories.

Percona XtraDB Cluster 5.6.42-28.30 is now the current release.

All Percona software is open-source and free.

Fixed Bugs

  • PXC-2281: Debug symbols were missing in Debian dbg packages.
  • PXC-2220: Starting two instances of Percona XtraDB Cluster on the same node could cause writing transactions to a page store instead of the galera.cache ring buffer, resulting in huge memory consumption because already applied write-sets were retained.
  • PXC-2230: gcs.fc_limit=0 is not allowed as a dynamic setting (to avoid generating flow control on every message), but it could still be set in my.cnf due to an inconsistent check.
  • PXC-2238: Setting read_only=1 caused a race condition.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Jan
04
2019
--

Percona XtraDB Cluster 5.7.24-31.33 Is Now Available

Percona is glad to announce the release of Percona XtraDB Cluster 5.7.24-31.33 (PXC) on January 4, 2019. Binaries are available from the downloads section or from our software repositories.

Percona XtraDB Cluster 5.7.24-31.33 is now the current release.

Deprecated

The following variables are deprecated starting from this release:

  • wsrep_preordered was used to turn on transparent handling of preordered replication events, applied locally first before being replicated to other nodes in the cluster. It is no longer needed because a performance fix eliminated the lag between the asynchronous replication channel and cluster replication.
  • innodb_disallow_writes, which made InnoDB avoid writes during SST, was deprecated in favor of the innodb_read_only variable.
  • wsrep_drupal_282555_workaround avoided duplicate value creation caused by buggy auto-increment logic, but the corresponding bug has since been fixed.
  • The session-level variable binlog_format=STATEMENT remains enabled only for pt-table-checksum; this will be addressed in upcoming releases of Percona Toolkit.

Fixed Bugs

  • PXC-2220: Starting two instances of Percona XtraDB Cluster on the same node could cause writing transactions to a page store instead of the galera.cache ring buffer, resulting in huge memory consumption because already applied write-sets were retained.
  • PXC-2230: gcs.fc_limit=0 is not allowed as a dynamic setting (to avoid generating flow control on every message), but it could still be set in my.cnf due to an inconsistent check.
  • PXC-2238: Setting read_only=1 caused a race condition.
  • PXC-1131: mysqld-systemd threw an error on MySQL restart when the error log did not exist on CentOS/RHEL 7.
  • PXC-2269: Although it is not dynamic, the pxc_encrypt_cluster_traffic variable could erroneously be changed with a SET GLOBAL statement.
  • PXC-2275: Checking the wsrep_node_address value in the wsrep_sst_common command-line parser caused the wrong variable to be parsed.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

 
