Oct 11, 2018

How to Fix ProxySQL Configuration When it Won’t Start

With the exception of the three configuration variables described here, ProxySQL will only parse the configuration files the first time it is started, or if the proxysql.db file is missing for some other reason.

If we want to change any of this data we need to do so via ProxySQL’s admin interface and then save the changes to disk. That’s fine if ProxySQL is running, but what if it won’t start because of these values?
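For reference, this is roughly how such a change is made through the admin interface while ProxySQL is running. Treat it as a sketch: the default admin credentials and admin port 6032 are assumed, and mysql-interfaces in particular only takes effect after a restart, which is why we save it to disk rather than loading it to runtime:

[root@centos7-pxc57-4 ~]# mysql -u admin -padmin -h 127.0.0.1 -P 6032
mysql> UPDATE global_variables SET variable_value='127.0.0.1:6033' WHERE variable_name='mysql-interfaces';
mysql> SAVE MYSQL VARIABLES TO DISK;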

For example, perhaps we accidentally configured ProxySQL to run on port 3306 and restarted it, but there’s already a production MySQL instance running on this port. ProxySQL won’t start, so we can’t edit the value that way:

2018-10-02 09:18:33 network.cpp:53:listen_on_port(): [ERROR] bind(): Address already in use

We could delete proxysql.db and have it reload the configuration files, but that would mean any changes we didn’t mirror into the configuration files will be lost.

Another option is to edit ProxySQL’s database file using sqlite3:

[root@centos7-pxc57-4 ~]# cd /var/lib/proxysql/
[root@centos7-pxc57-4 proxysql]# sqlite3 proxysql.db
sqlite> SELECT * FROM global_variables WHERE variable_name='mysql-interfaces';
mysql-interfaces|127.0.0.1:3306
sqlite> UPDATE global_variables SET variable_value='127.0.0.1:6033' WHERE variable_name='mysql-interfaces';
sqlite> SELECT * FROM global_variables WHERE variable_name='mysql-interfaces';
mysql-interfaces|127.0.0.1:6033

Or if we have a few edits to make we may prefer to do so with a text editor:

[root@centos7-pxc57-4 ~]# cd /var/lib/proxysql/
[root@centos7-pxc57-4 proxysql]# sqlite3 proxysql.db
sqlite> .output /tmp/global_variables
sqlite> .dump global_variables
sqlite> .exit

The above commands will dump the global_variables table into a file in SQL format, which we can then edit:

[root@centos7-pxc57-4 proxysql]# grep mysql-interfaces /tmp/global_variables
INSERT INTO "global_variables" VALUES('mysql-interfaces','127.0.0.1:3306');
[root@centos7-pxc57-4 proxysql]# vim /tmp/global_variables
[root@centos7-pxc57-4 proxysql]# grep mysql-interfaces /tmp/global_variables
INSERT INTO "global_variables" VALUES('mysql-interfaces','127.0.0.1:6033');

Now we need to restore this data. We’ll use the .restore command to empty the existing data first (since we are “restoring” from a backup file that doesn’t exist), and then read our edited dump back in:

[root@centos7-pxc57-4 proxysql]# sqlite3 proxysql.db
sqlite> .restore global_variables
sqlite> .read /tmp/global_variables
sqlite> .exit

Once we’ve made the change, we should be able to start ProxySQL again:

[root@centos7-pxc57-4 proxysql]# /etc/init.d/proxysql start
Starting ProxySQL: DONE!
[root@centos7-pxc57-4 proxysql]# lsof -i | grep proxysql
proxysql 15171 proxysql 19u IPv4 265881 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 20u IPv4 265882 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 21u IPv4 265883 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 22u IPv4 265884 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 23u IPv4 266635 0t0 TCP *:6032 (LISTEN)

While you are here

You might enjoy my recent post Using ProxySQL to connect to IPV6-only databases over IPV4

You can download ProxySQL from Percona repositories, and you might also want to check out our recorded webinars that feature ProxySQL too.

Oct 10, 2018

Instrumenting Read Only Transactions in InnoDB

An optimization that is probably not well known, but quite important, was introduced in MySQL 5.6: reduced overhead for “read only transactions”. While by a “transaction” we usually mean a query or a group of queries that change data, with transactional engines like InnoDB every data read or write operation is a transaction.

Now, as a non-locking read operation obviously has less impact on the data, it does not need all the instrumentation overhead a write transaction has. The main thing that can be avoided, as described in the documentation, is the transaction ID. So, since MySQL 5.6, a read only transaction does not get a transaction ID. Moreover, such a transaction is not visible in the SHOW ENGINE INNODB STATUS output, though I will not go deeper into what that really means under the hood in this article. The fact is that this optimization allows for better scaling of workloads with many RO threads. An example RO benchmark, where the 5.5 vs 5.6/5.7 difference is clearly visible, may be found here: https://www.percona.com/blog/2016/04/07/mysql-5-7-sysbench-oltp-read-results-really-faster/

To benefit from this optimization in MySQL 5.6, either a transaction has to start with the explicit START TRANSACTION READ ONLY clause or it must be an autocommit, non-locking SELECT statement. In version 5.7 and newer, it goes further, as a new transaction is treated as read-only until a locking read or write is executed, at which point it gets “upgraded” to a read-write one.

Information Schema Instrumentation

Let’s see how this looks (on MySQL 8.0.12) by looking at the information_schema.innodb_trx and information_schema.innodb_metrics tables. The second of these has the transaction counters disabled by default, so before the test we have to enable them with:

SET GLOBAL innodb_monitor_enable = 'trx%comm%';

or by adding the following parameter to the [mysqld] section of the configuration file and restarting the instance:

innodb_monitor_enable = "trx_%"

Now, let’s start a transaction which should be read only according to the rules:

mysql [localhost] {msandbox} (db1) > START TRANSACTION; SELECT count(*) FROM db1.t1;
Query OK, 0 rows affected (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 421988493944672
                trx_weight: 0
           trx_rows_locked: 0
         trx_rows_modified: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

The transaction started as above did not appear in SHOW ENGINE INNODB STATUS, and its trx_id looks strangely high. And the first surprise: for some reason, trx_is_read_only is 0. Now, what if we commit such a transaction, how do the counters change? (I reset them before the test):

mysql [localhost] {msandbox} (db1) > commit;
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     1 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     0 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.01 sec)

OK, so clearly it was a read-only transaction overall, just the trx_is_read_only property wasn’t set as expected. I had to report this problem here: https://bugs.mysql.com/bug.php?id=92558

What about an explicit RO transaction:

mysql [localhost] {msandbox} (db1) > START TRANSACTION READ ONLY; SELECT count(*) FROM db1.t1;
Query OK, 0 rows affected (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 421988493944672
                trx_weight: 0
           trx_rows_locked: 0
         trx_rows_modified: 0
          trx_is_read_only: 1
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > commit;
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     2 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     0 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.01 sec)

OK, both transactions are counted as the same type. Moreover, the two transactions shared the same strange trx_id, which appears to be a fake one. For a simple read executed in autocommit mode, the counters increase as expected too:

mysql [localhost] {msandbox} (db1) > select @@autocommit; SELECT count(*) FROM db1.t1;
+--------------+
| @@autocommit |
+--------------+
|            1 |
+--------------+
1 row in set (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     2 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     1 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.00 sec)

Now, let’s test how a transaction looks when we upgrade it to RW later:

mysql [localhost] {msandbox} (db1) > START TRANSACTION; SELECT count(*) FROM db1.t1;
Query OK, 0 rows affected (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 421988493944672
                trx_weight: 0
           trx_rows_locked: 0
         trx_rows_modified: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT count(*) FROM db1.t1 FOR UPDATE;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 4106
                trx_weight: 2
           trx_rows_locked: 4
         trx_rows_modified: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > commit;
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     1 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     2 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     1 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.00 sec)

OK, as seen above, after a locking read was done, our transaction has transformed: it got a real, unique trx_id assigned. Then, when committed, the RW counter increased.

Performance Schema Problem

Nowadays it may feel natural to use performance_schema for monitoring everything. And, indeed, we can monitor types of transactions with it as well. Let’s enable the needed consumers and instruments:

mysql [localhost] {msandbox} (db1) > UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%transactions%';
Query OK, 0 rows affected (0.00 sec)
Rows matched: 3  Changed: 0  Warnings: 0
mysql [localhost] {msandbox} (db1) > UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction';
Query OK, 0 rows affected (0.01 sec)
Rows matched: 1  Changed: 0  Warnings: 0
mysql [localhost] {msandbox} (db1) > SELECT * FROM performance_schema.setup_instruments WHERE NAME = 'transaction';
+-------------+---------+-------+------------+------------+---------------+
| NAME        | ENABLED | TIMED | PROPERTIES | VOLATILITY | DOCUMENTATION |
+-------------+---------+-------+------------+------------+---------------+
| transaction | YES     | YES   |            |          0 | NULL          |
+-------------+---------+-------+------------+------------+---------------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT * FROM performance_schema.setup_consumers WHERE NAME LIKE '%transactions%';
+----------------------------------+---------+
| NAME                             | ENABLED |
+----------------------------------+---------+
| events_transactions_current      | YES     |
| events_transactions_history      | YES     |
| events_transactions_history_long | YES     |
+----------------------------------+---------+
3 rows in set (0.01 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 0
COUNT_READ_WRITE: 0
 COUNT_READ_ONLY: 0
1 row in set (0.00 sec)

And let’s do some simple tests:

mysql [localhost] {msandbox} (db1) > START TRANSACTION; COMMIT;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 1
COUNT_READ_WRITE: 1
 COUNT_READ_ONLY: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     0 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     0 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.00 sec)

An empty transaction caused an increase to the RW counter in the Performance Schema view! Moreover, a simple autocommit SELECT increases it too:

mysql [localhost] {msandbox} (db1) > SELECT count(*) FROM db1.t1;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.01 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 2
COUNT_READ_WRITE: 2
 COUNT_READ_ONLY: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > START TRANSACTION READ ONLY; COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 3
COUNT_READ_WRITE: 2
 COUNT_READ_ONLY: 1
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     0 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     1 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.01 sec)

As seen above, when it comes to monitoring transactions via Performance Schema, everything seems completely broken: empty transactions increase the counters, and the only way to increase the RO counter is to run a read-only transaction explicitly, which again should not be counted when no real read was done from a table. For this reason I filed another bug report: https://bugs.mysql.com/bug.php?id=92364

PMM Dashboard

We implemented a transactions information view in PMM, based on information_schema.innodb_metrics, which, as presented above, is reliable and shows the correct counters. Therefore, I encourage everyone to enable it via the innodb_monitor_enable setting and have PMM graph it. It will look something like this:

Oct 10, 2018

MongoDB Replica set Scenarios and Internals

The MongoDB® replica set is a group of nodes with one node set as the primary, and all other nodes set as secondaries. Only the primary node accepts “write” operations, while the other nodes can only serve “read” operations, according to the read preferences defined. In this blog post, we’ll focus on some MongoDB replica set scenarios, and take a look at the internals.

Example configuration

We will refer to a three node replica set that includes one primary node and two secondary nodes running as:

"members" : [
{
"_id" : 0,
"name" : "192.168.103.100:25001",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 3533,
"optime" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-09-24T14:49:44Z"),
"electionTime" : Timestamp(1537797392, 2),
"electionDate" : ISODate("2018-09-24T13:56:32Z"),
"configVersion" : 3,
"self" : true
},
{
"_id" : 1,
"name" : "192.168.103.100:25002",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 3063,
"optime" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDurable" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-09-24T14:49:44Z"),
"optimeDurableDate" : ISODate("2018-09-24T14:49:44Z"),
"lastHeartbeat" : ISODate("2018-09-24T14:49:45.539Z"),
"lastHeartbeatRecv" : ISODate("2018-09-24T14:49:44.664Z"),
"pingMs" : NumberLong(0),
"syncingTo" : "192.168.103.100:25001",
"configVersion" : 3
},
{
"_id" : 2,
"name" : "192.168.103.100:25003",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2979,
"optime" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDurable" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-09-24T14:49:44Z"),
"optimeDurableDate" : ISODate("2018-09-24T14:49:44Z"),
"lastHeartbeat" : ISODate("2018-09-24T14:49:45.539Z"),
"lastHeartbeatRecv" : ISODate("2018-09-24T14:49:44.989Z"),
"pingMs" : NumberLong(0),
"syncingTo" : "192.168.103.100:25002",
"configVersion" : 3
}
]

Here, the primary is running on port 25001, and the two secondaries are running on ports 25002 and 25003 on the same host.

Secondary nodes can only sync from Primary?

No, it’s not mandatory. Each secondary can replicate data from the primary or from any other secondary. This is known as chaining, and it is enabled by default.

In the above replica set, you can see that the secondary node with "_id" : 2 is syncing from another secondary node, "_id" : 1, as shown by "syncingTo" : "192.168.103.100:25002".

This can also be found in the logs, where the parameter chainingAllowed : true appears as the default setting:

settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5ba8ed10d4fddccfedeb7492') } }

Chaining?

It means that a secondary member node is able to replicate from another secondary member node instead of from the primary node. This helps to reduce the load on the primary. If the replication lag is not tolerable, then chaining can be disabled.

For more details about chaining and the steps to disable it please refer to my earlier blog post here.
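As a minimal sketch (run from the mongo shell against the primary, and worth double-checking against the linked post and the MongoDB documentation for your version), disabling chaining is done by reconfiguring the replica set settings:

rplint:PRIMARY> cfg = rs.config()
rplint:PRIMARY> cfg.settings.chainingAllowed = false
rplint:PRIMARY> rs.reconfig(cfg)
rplint:PRIMARY> rs.config().settings.chainingAllowed
false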

Ok, then how does the secondary node select the source to sync from?

If Chaining is False

When chaining is explicitly set to false, the secondary node will sync from the primary node only, although this can be overridden temporarily.

If Chaining is True

  • Before choosing any sync node, the TopologyCoordinator performs validations like:
    • Whether chaining is set to true or false.
    • Whether that particular node is part of the current replica set configuration.
    • Identifying a node whose oplog is ahead and which has the lowest ping time.
    • The source code that includes these validations is here.
  • Once the validation is done, SyncSourceSelector relies on SyncSourceResolver, which contains the result and details for the new sync source.
  • To get the details and response, SyncSourceResolver coordinates with ReplicationCoordinator.
  • This ReplicationCoordinator is responsible for the replication, and coordinates with TopologyCoordinator.
  • The TopologyCoordinator is responsible for the topology of the cluster. It finds the primary’s oplog time and checks against maxSyncSourceLagSecs.
  • It will reject a candidate sync source if it lags more than maxSyncSourceLagSecs behind the newest oplog entry. The code for this can be found here.
  • If the criteria for source selection are not fulfilled, then the BackgroundSync thread waits and restarts the whole process again to get a sync source.

Example of “could not find member to sync from”, followed by finding a candidate to sync from in the next attempt

This can be found in the log like this: on receiving the message “could not find member to sync from” from the rsBackgroundSync thread, the whole internal process restarts and finds a member to sync from, i.e. “sync source candidate: 192.168.103.100:25001”, which means it is now syncing from the node 192.168.103.100 running on port 25001.

2018-09-24T13:58:43.197+0000 I REPL     [rsSync] transition to RECOVERING
2018-09-24T13:58:43.198+0000 I REPL     [rsBackgroundSync] could not find member to sync from
2018-09-24T13:58:43.201+0000 I REPL     [rsSync] transition to SECONDARY
2018-09-24T13:58:59.208+0000 I REPL     [rsBackgroundSync] sync source candidate: 192.168.103.100:25001

  • Once the sync source node is selected, SyncSourceResolver probes the sync source to confirm that it is able to fetch the oplogs.
  • The rollback ID (rbid) is also fetched after the first batch is returned by the OplogFetcher.
  • If all eligible sync sources are too fresh, such as during initial sync, then the sync source status is “Oplog start is missing” and earliestOpTimeSeen will set a new minValid.
  • This minValid is also set in the case of rollback and abrupt shutdown.
  • If the node has a minValid entry then this is checked against the eligible sync source node.

Example showing the selection of a new sync source when the existing source is found to be invalid

Here, as the logs show, during sync the node chooses a new sync source. This is because it found that the original sync source is not ahead, so it does not contain recent oplogs from which to sync.

2018-09-25T15:20:55.424+0000 I REPL     [replication-1] Choosing new sync source because our current sync source, 192.168.103.100:25001, has an OpTime ({ ts: Timestamp 1537879296000|1, t: 4 }) which is not ahead of ours ({ ts: Timestamp 1537879296000|1, t: 4 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)

2018-09-25T15:20:55.425+0000 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source 192.168.103.100:25001 (config version: 3; last applied optime: { ts: Timestamp 1537879296000|1, t: 4 }; sync source index: -1; primary index: -1) is no longer valid

  • If the secondary node is too far behind the eligible sync source node, then the node will enter maintenance mode, and a resync then needs to be called manually.
  • Once the sync source is chosen, BackgroundSync starts the OplogFetcher.

Example for oplogFetcher

Here is an example of fetching oplogs from the “oplog.rs” collection, filtering on timestamps greater than or equal to the required one:

2018-09-26T10:35:07.372+0000 I COMMAND  [conn113] command local.oplog.rs command: getMore { getMore: 20830044306, collection: "oplog.rs", maxTimeMS: 5000, term: 7, lastKnownCommittedOpTime: { ts: Timestamp 1537955038000|1, t: 7 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1537903865000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 7, readConcern: { afterOpTime: { ts: Timestamp 1537903865000|1, t: 6 } } } planSummary: COLLSCAN cursorid:20830044306 keysExamined:0 docsExamined:0 numYields:1 nreturned:0 reslen:451 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_command 3063398ms

When, and what details, do replica set nodes communicate with each other?

At regular intervals, all the nodes communicate with each other to check the status of the primary node, check the status of the sync source, get the oplogs, and so on.

The ReplicationCoordinator has a ReplicaSetConfig with a list of all the replica set nodes, and each node has a copy of it. This makes the nodes aware of the other nodes in the same replica set.

This is how nodes communicate in more detail:

Heartbeats: this checks the status of other nodes, i.e. alive or dead.

heartbeatInterval: every node, at an interval of two seconds, sends a heartbeat to the other nodes to make them aware that “yes, I am alive!”

heartbeatTimeoutSecs: this is a timeout; if a heartbeat is not returned within 10 seconds then that node is marked as inaccessible, or simply dead.

Every heartbeat is identified by these replica set details:

  • replica set config version
  • replica set name
  • Sender host address
  • id from the replicasetconfig

The source code can be referred to here.

When the remote node receives the heartbeat, it processes this data and validates if the details are correct. It then prepares a ReplSetHeartbeatResponse, that includes:

  • Name of the replica set, config version, and optime details
  • Details about primary node as per the receiving node.
  • Sync source details and state of receiving node

This heartbeat data is processed, and if primary details are found then the election gets postponed.

The TopologyCoordinator checks the heartbeat data and confirms whether the node is OK or not. If the node is OK then no action is taken. Otherwise, it needs to be reconfigured, or a priority takeover is initiated based on the config.

Response from oplog fetcher

To get the oplogs from the sync source, nodes communicate with each other. The oplog fetcher fetches oplogs through “find” and “getMore”. This only affects the downstream node, which gets metadata from its sync source to update its view of the replica set.

OplogQueryMetadata only comes with OplogFetcher responses, while ReplSetMetadata comes with all the replica set details, including the config version, and with replication commands.

Communication to update position (replSetUpdatePosition commands):

This is to get an update on the replication progress. ReplicationCoordinatorExternalState creates SyncSourceFeedback, which sends the replSetUpdatePosition commands.

These include oplog details, the replica set config version, and replica set metadata.

If a new node is added to the existing replica set, how will that node get the data?

If a new node is added to the existing replica set then the “initial sync” process takes place. This initial sync can be done in two ways:

  1. Just add the new node to the replica set and let the initial sync threads restore the data. Then it syncs from the oplogs until it reaches the secondary state (a minimal example of adding a node follows this list).
  2. Copy the data from a recent data directory (for example, from a backup) to the new node and restart it. It will then also sync from the oplogs until it reaches the secondary state.
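For option 1, adding the node is typically done from the primary in the mongo shell. This is just a sketch, with a placeholder host and port; the second command lets you watch the new member move from STARTUP2 to SECONDARY as the initial sync completes:

rplint:PRIMARY> rs.add("192.168.103.100:25004")
rplint:PRIMARY> rs.status().members.forEach(function(m) { print(m.name, m.stateStr) })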

This is how it works internally

When “initial sync” or “resync” is called by the ReplicationCoordinator, the node goes into the “STARTUP2” state, and this initial sync is done in the DataReplicator.

  • A sync source is selected to get the data from, then the node drops all the databases except the local database, and the oplog is recreated.
  • DatabasesCloner asks the sync source for a list of the databases, and for each database it creates a DatabaseCloner.
  • Each DatabaseCloner creates a CollectionCloner to clone the collections.
  • This CollectionCloner calls listIndexes on the sync source and creates a CollectionBulkLoader for parallel index creation while the data is being cloned.
  • The node also checks the sync source rollback id. If a rollback occurred, it restarts the initial sync. Otherwise, the DataReplicator is done with its work, and the ReplicationCoordinator assumes the role for ongoing replication.

Example of the “initial sync”:

Here the node enters “STARTUP2” (“transition to STARTUP2”). Then the sync source gets selected, and the node drops all the databases except the local database. Next, the replication oplog is created and the CollectionCloner is called.

Why is the local database not dropped? Because every node has its own “local” database, holding information about itself and the other nodes, this database is not replicated to other nodes.

2018-09-26T17:57:09.571+0000 I REPL     [ReplicationExecutor] transition to STARTUP2
2018-09-26T17:57:14.589+0000 I REPL     [replication-1] sync source candidate: 192.168.103.100:25003
2018-09-26T17:57:14.590+0000 I STORAGE  [replication-1] dropAllDatabasesExceptLocal 1
2018-09-26T17:57:14.592+0000 I REPL     [replication-1] creating replication oplog of size: 990MB... 2018-09-26T17:57:14.633+0000 I REPL     [replication-0] CollectionCloner::start called, on ns:admin.system.version

Having finished fetching all the oplogs, it finishes up the initial sync:

2018-09-26T17:57:15.685+0000 I REPL     [replication-0] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1537984626000|1, t: 9 }[-1139925876765058240]
2018-09-26T17:57:15.685+0000 I REPL     [replication-0] Initial sync attempt finishing up.

What are oplogs and where do these reside?

Oplog stands for “operation log”. We have used this term many times in this blog post, as these are the mandatory logs for the replica set. The operations are stored in a capped collection called “oplog.rs” that resides in the “local” database.

Below is how oplogs are stored in the “oplog.rs” collection, including details such as the timestamp, the operation, the namespace, and the output:

rplint:PRIMARY> use local
rplint:PRIMARY> show collections
oplog.rs
rplint:PRIMARY> db.oplog.rs.findOne()
{
 "ts" : Timestamp(1537797392, 1),
 "h" : NumberLong("-169301588285533642"),
 "v" : 2,
 "op" : "n",
 "ns" : "",
 "o" : {
 "msg" : "initiating set"
 }
}

It consists of the rolling write operations coming to the database. These oplogs are then replicated to the secondary node(s) to maintain high availability of the data in case of failover.

When a replica MongoDB instance starts, it creates an oplog of the default size. For WiredTiger, the default size is 5% of the free disk space, with a lower bound of 990MB. So here in the example it creates an oplog of 990MB. If you’d like to learn more about oplog size then please refer here.

2018-09-26T17:57:14.592+0000 I REPL     [replication-1] creating replication oplog of size: 990MB...
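A quick way to check the configured oplog size and the time window it currently covers is from the mongo shell; this is only a sketch, and the fields shown may vary slightly between MongoDB versions:

rplint:PRIMARY> use local
rplint:PRIMARY> db.oplog.rs.stats().maxSize
rplint:PRIMARY> rs.printReplicationInfo()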

What if the same oplog is applied multiple times, will that not lead to inconsistent data?

Fortunately, oplogs are idempotent, which means they produce the same result, leaving the data unchanged, even when applied multiple times.

Let’s check an example:

Take the $inc operator, which increments the value of the field “item” by 1. If this oplog were applied multiple times, and it were not idempotent, the result could be an inconsistent record. However, rather than increasing the item value multiple times, it is actually applied only once.

rplint:PRIMARY> use db1
//inserting one document
rplint:PRIMARY> db.col1.insert({item:1, name:"abc"})
//updating document by incrementing item value with 1
rplint:PRIMARY> db.col1.update({name:"abc"},{$inc:{item:1}})
//updated value is now item:2
rplint:PRIMARY> db.col1.find()
{ "_id" : ObjectId("5babd57cce2ef78096ac8e16"), "item" : 2, "name" : "abc" }

This is how these operations are stored in the oplog; note that the $inc value is stored in the oplog as a “$set”:

rplint:PRIMARY> db.oplog.rs.find({ns:"db1.col1"})
//insert operation
{ "ts" : Timestamp(1537987964, 2), "t" : NumberLong(9), "h" : NumberLong("8083740413874479202"), "v" : 2, "op" : "i", "ns" : "db1.col1", "o" : { "_id" : ObjectId("5babd57cce2ef78096ac8e16"), "item" : 1, "name" : "abc" } }
//$inc operation is changed as ""$set" : { "item" : 2"
{ "ts" : Timestamp(1537988022, 1), "t" : NumberLong(9), "h" : NumberLong("-1432987813358665721"), "v" : 2, "op" : "u", "ns" : "db1.col1", "o2" : { "_id" : ObjectId("5babd57cce2ef78096ac8e16") }, "o" : { "$set" : { "item" : 2 } } }

That means that however many times it is applied, it will generate the same result, so no inconsistent data!

I hope this blog post helps you to understand multiple scenarios for MongoDB replica sets, and how data replicates to the nodes.

Oct 08, 2018

Persistence of autoinc fixed in MySQL 8.0

The release of MySQL 8.0 has brought a lot of bold implementations that touched on things that have been avoided before, such as added support for common table expressions and window functions. Another example is the change in how AUTO_INCREMENT (autoinc) sequences are persisted, and thus replicated.

This new implementation carries the fix for bug #73563 (Replace result in auto_increment value less or equal than max value in row-based), which we’ve only found out about recently. The surprising part is that the use case we were analyzing is a somewhat common one; this must be affecting a good number of people out there.

Understanding the bug

The business logic of the use case is such that a UNIQUE column, found in a table whose id is managed by an AUTO_INCREMENT sequence, needs to be updated, and this is done with a REPLACE operation:

“REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.”

So, what happens in practice in this particular case is a DELETE followed by an INSERT of the target row.

We will explore this scenario here in the context of an oversimplified currency converter application that uses USD as base reference:

CREATE TABLE exchange_rate (
id INT PRIMARY KEY AUTO_INCREMENT,
currency VARCHAR(3) UNIQUE,
rate FLOAT(5,3)
) ENGINE=InnoDB;

Let’s add a trio of rows to this new table:

INSERT INTO exchange_rate (currency,rate) VALUES ('EUR',0.854), ('GBP',0.767), ('BRL',4.107);

which gives us the following initial set:

master (test) > select * from exchange_rate;
+----+----------+-------+
| id | currency | rate  |
+----+----------+-------+
|  1 | EUR      | 0.854 |
|  2 | GBP      | 0.767 |
|  3 | BRL      | 4.107 |
+----+----------+-------+
3 rows in set (0.00 sec)

Now we update the rate for Brazilian Reais using a REPLACE operation:

REPLACE INTO exchange_rate SET currency='BRL', rate=4.500;

With currency being a UNIQUE field the row is fully replaced:

master (test) > select * from exchange_rate;
+----+----------+-------+
| id | currency | rate  |
+----+----------+-------+
|  1 | EUR      | 0.854 |
|  2 | GBP      | 0.767 |
|  4 | BRL      | 4.500 |
+----+----------+-------+
3 rows in set (0.00 sec)

and thus the autoinc sequence is updated:

master (test) > SHOW CREATE TABLE exchange_rate\G
*************************** 1. row ***************************
     Table: exchange_rate
Create Table: CREATE TABLE `exchange_rate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`currency` varchar(3) DEFAULT NULL,
`rate` float(5,3) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `currency` (`currency`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

The problem is that the autoinc sequence is not updated in the replica as well:

slave1 (test) > select * from exchange_rate;show create table exchange_rate\G
+----+----------+-------+
| id | currency | rate  |
+----+----------+-------+
|  1 | EUR      | 0.854 |
|  2 | GBP      | 0.767 |
|  4 | BRL      | 4.500 |
+----+----------+-------+
3 rows in set (0.00 sec)
*************************** 1. row ***************************
     Table: exchange_rate
Create Table: CREATE TABLE `exchange_rate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`currency` varchar(3) DEFAULT NULL,
`rate` float(5,3) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `currency` (`currency`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Now, the moment we promote that replica as master and start writing to this table we’ll hit a duplicate key error:

slave1 (test) > REPLACE INTO exchange_rate SET currency='BRL', rate=4.600;
ERROR 1062 (23000): Duplicate entry '4' for key 'PRIMARY'

Note that:

a) the transaction fails and the row is not replaced, however the autoinc sequence is incremented:

slave1 (test) > SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE table_schema='test' AND table_name='exchange_rate';
+----------------+
| AUTO_INCREMENT |
+----------------+
|              5 |
+----------------+
1 row in set (0.00 sec)

b) this problem only happens with row-based replication (binlog_format=ROW), where REPLACE in this case is logged as a row UPDATE:

# at 6129
#180829 18:29:55 server id 100  end_log_pos 5978 CRC32 0x88da50ba Update_rows: table id 117 flags: STMT_END_F
### UPDATE `test`.`exchange_rate`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.107                /* FLOAT meta=4 nullable=1 is_null=0 */
### SET
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.5                  /* FLOAT meta=4 nullable=1 is_null=0 */

With statement-based replication—or even mixed format—the REPLACE statement is replicated as is: it will trigger a DELETE+INSERT in the background on the replica and thus update the autoinc sequence in the same way it did on the master.

This example (tested with Percona Server versions 5.5.61, 5.6.36 and 5.7.22) helps illustrate the issue with autoinc sequences not being persisted as they should be with row-based replication. However, MySQL’s Worklog #6204 includes a couple of scarier scenarios involving the master itself, such as when the server crashes while a transaction is writing to a table similar to the one used in the example above. MySQL 8.0 remedies this bug.

Workarounds

There are a few possible workarounds to consider if this problem is impacting you and if neither upgrading to the 8 series nor resorting to statement-based or mixed replication format are viable options.

We’ll be discussing three of them here: one that revolves around running checks before a failover (to detect and fix autoinc inconsistencies in replicas); another that requires reviewing all REPLACE statements like the one from our example and adapting them to include the id field, thus avoiding the bug; and finally one that requires changing the schema of affected tables in such a way that the target field is made the Primary Key of the table while id (autoinc) is converted into a UNIQUE key.

a) Detect and fix

The least intrusive of the workarounds we conceived for the problem at hand, in terms of query and schema changes, is to run a check for each of the tables that might be facing this issue in a replica, before we promote it to master in a failover scenario:

slave1 (test) > SELECT ((SELECT MAX(id) FROM exchange_rate)>=(SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE table_schema='test' AND table_name='exchange_rate')) as `check`;
+-------+
| check |
+-------+
|     1 |
+-------+
1 row in set (0.00 sec)

If the table does not pass the test, like ours didn’t at first (just before we attempted a REPLACE after we failed over to the replica), then update autoinc accordingly. The full routine (check + update of autoinc) could be made into a single stored procedure:

DELIMITER //
CREATE PROCEDURE CheckAndFixAutoinc()
BEGIN
 DECLARE done TINYINT UNSIGNED DEFAULT 0;
 DECLARE tableschema VARCHAR(64);
 DECLARE tablename VARCHAR(64);
 DECLARE columnname VARCHAR(64);  
 DECLARE cursor1 CURSOR FOR SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys') AND EXTRA LIKE '%auto_increment%';
 DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
 OPEN cursor1;  
 start_loop: LOOP
  FETCH cursor1 INTO tableschema, tablename, columnname;
  IF done THEN
    LEAVE start_loop;
  END IF;
  SET @get_autoinc = CONCAT('SELECT @check1 := ((SELECT MAX(', columnname, ') FROM ', tableschema, '.', tablename, ')>=(SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE TABLE_SCHEMA=\'', tableschema, '\' AND TABLE_NAME=\'', tablename, '\')) as `check`');
  PREPARE stm FROM @get_autoinc;
  EXECUTE stm;
  DEALLOCATE PREPARE stm;
  IF @check1>0 THEN
    BEGIN
      SET @select_max_id = CONCAT('SELECT @max_id := MAX(', columnname, ')+1 FROM ', tableschema, '.', tablename);
      PREPARE select_max_id FROM @select_max_id;
      EXECUTE select_max_id;
      DEALLOCATE PREPARE select_max_id;
      SET @update_autoinc = CONCAT('ALTER TABLE ', tableschema, '.', tablename, ' AUTO_INCREMENT=', @max_id);
      PREPARE update_autoinc FROM @update_autoinc;
      EXECUTE update_autoinc;
      DEALLOCATE PREPARE update_autoinc;
    END;
  END IF;
 END LOOP start_loop;  
 CLOSE cursor1;
END//
DELIMITER ;

It doesn’t allow for as clean a failover as we would like but it can be helpful if you’re stuck with MySQL<8.0 and binlog_format=ROW and cannot make changes to your queries or schema.

b) Include Primary Key in REPLACE statements

If we had explicitly included the id (Primary Key) in the REPLACE operation from our example it would have also been replicated as a DELETE+INSERT even when binlog_format=ROW:

master (test) > REPLACE INTO exchange_rate SET currency='BRL', rate=4.500, id=3;
# at 16151
#180905 13:32:17 server id 100  end_log_pos 15986 CRC32 0x1d819ae9  Write_rows: table id 117 flags: STMT_END_F
### DELETE FROM `test`.`exchange_rate`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.107                /* FLOAT meta=4 nullable=1 is_null=0 */
### INSERT INTO `test`.`exchange_rate`
### SET
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.5                  /* FLOAT meta=4 nullable=1 is_null=0 */
# at 16199
#180905 13:32:17 server id 100  end_log_pos 16017 CRC32 0xf11fed56  Xid = 184
COMMIT/*!*/;

We could point out that we are doing it wrong by not having the id included in the REPLACE statement in the first place; the reason for not doing so would be mostly related to avoiding an extra lookup for each REPLACE (to obtain the id for the currency we want to update). On the other hand, what if your business logic does expect the id to change with each REPLACE? You should keep such requirements in mind when considering this workaround, as it is effectively a functional change to what we had initially.
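As a hedged illustration of that extra lookup (this is not part of the original example, just a sketch), the id could be fetched into a session variable right before the REPLACE; note that the two statements are not atomic unless wrapped in a transaction:

master (test) > SELECT id INTO @brl_id FROM exchange_rate WHERE currency='BRL';
master (test) > REPLACE INTO exchange_rate SET id=@brl_id, currency='BRL', rate=4.600;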

c) Make the target field the Primary Key and keep autoinc as a UNIQUE key

If we make currency the Primary Key of our table and id a UNIQUE key instead:

CREATE TABLE exchange_rate (
id INT UNIQUE AUTO_INCREMENT,
currency VARCHAR(3) PRIMARY KEY,
rate FLOAT(5,3)
) ENGINE=InnoDB;

the same REPLACE operation will be replicated as a DELETE+INSERT too:

# at 19390
#180905 14:03:56 server id 100  end_log_pos 19225 CRC32 0x7042dcd5  Write_rows: table id 131 flags: STMT_END_F
### DELETE FROM `test`.`exchange_rate`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=0 is_null=0 */
###   @3=4.107                /* FLOAT meta=4 nullable=1 is_null=0 */
### INSERT INTO `test`.`exchange_rate`
### SET
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=0 is_null=0 */
###   @3=4.5                  /* FLOAT meta=4 nullable=1 is_null=0 */
# at 19438
#180905 14:03:56 server id 100  end_log_pos 19256 CRC32 0x79efc619  Xid = 218
COMMIT/*!*/;

Of course, the same would be true if we had just removed id entirely from the table and kept currency as the Primary Key. This would work in our particular test example but that won’t always be the case. Please note though that if you do keep id on the table you must make it a UNIQUE key: this workaround is based on the fact that this key becomes a second unique constraint, which triggers a different code path to log a replace operation. Had we made it a simple, non-unique key instead that wouldn’t be the case.

If you have any comments or suggestions about the issue addressed in this post, the workarounds we propose, or even a different view of the problem you would like to share please leave a comment in the section below.

Co-Author: Trey Raymond

Trey Raymond is a Sr. Database Engineer for Oath Inc. (née Yahoo!), specializing in MySQL. Since 2010, he has worked to build the company’s database platform and supporting team into industry leaders.

While a performance guru at heart, his experience and responsibilities range from hardware and capacity planning all through the stack to database tool and utility development.

He has a reputation for breaking things to learn something new.

Co-Author: Fernando Laudares

Fernando is a Senior Support Engineer with Percona. Fernando’s work experience includes the architecture, deployment and maintenance of IT infrastructures based on Linux, open source software and a layer of server virtualization. He’s now focusing on the universe of MySQL, MongoDB and PostgreSQL with a particular interest in understanding the intricacies of database systems, and contributes regularly to this blog. You can read his other articles here.

Oct 08, 2018

Detailed Logging for Enterprise-Grade PostgreSQL

In this penultimate post from our series on building an enterprise-grade PostgreSQL environment we cover the parameters we have enabled to configure detailed logging in the demo setup we will showcase in our upcoming webinar.

Detailed logging in PostgreSQL and log analyzer

Like other RDBMSs, PostgreSQL allows you to maintain a log of activities and error messages. Until PostgreSQL 9.6, PostgreSQL log files were generated in the pg_log directory (inside the data directory) by default. Since PostgreSQL 10, pg_log has been renamed to simply log. However, this directory can be changed to a different location by modifying the parameter log_directory.

Unlike MySQL, PostgreSQL writes the error and activity log to the same log file, so it may grow to several GBs when detailed logging is enabled. In these cases, logging becomes IO-intensive, and it is therefore recommended to store log files on different storage from the one hosting the data directory.

Parameters to enable detailed logging

Here’s a list of parameters used to customize logging in PostgreSQL. All of them need to be modified in the postgresql.conf or postgresql.auto.conf files.

logging_collector: in order to log any activity in PostgreSQL this parameter must be enabled. The backend process responsible for logging database activity is called logger; it gets started when logging_collector is set to ON. Changing this parameter requires a PostgreSQL restart.

log_min_duration_statement: this parameter is used primarily to set a time threshold: queries running longer than this are logged (as “slow queries”). Setting it to -1 disables this logging of statements. Setting it to 0 enables the logging of every statement running in the database, regardless of its duration. The time unit should follow the actual value, for example: 250ms, 250s, 250min, 1h. Changing this parameter does not require a PostgreSQL restart; a simple reload of the configuration is enough. For example:

log_min_duration_statement = 5s

  logs every statement running for 5 seconds or longer.

log_line_prefix: helps you customize every log line being printed in the PostgreSQL log file. You can log the process id, application name, database name and other details for every statement as required. The following log_line_prefix may be helpful in most scenarios:

log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'

The above setting records the following for every statement being logged:

%t : Time stamp without milliseconds
%p : Process id
%l-1 : Number of the log line for each session or process, starting at 1
%u : User name
%d : Database name
%a : Application name
%h : Remote host name or IP address

With the above settings employed for log_line_prefix, the log appears as follows:

2018-06-08 12:36:26 EDT [21315]: [1-1] user=postgres,db=percona,app=psql,client=192.168.0.12 LOG: duration: 2.611 ms statement: select * from departments where id = 26;
2018-06-08 12:36:49 EDT [21323]: [1-1] user=postgres,db=percona,app=psql,client=192.168.0.12 LOG: duration: 2.251 ms statement: select count(*) from departments;

For more on log_line_prefix, you can refer to the PostgreSQL documentation.

log_duration: enabling this parameter records the duration of every completed statement in the PostgreSQL log, irrespective of log_min_duration_statement. Keep in mind that, as happens with log_min_duration_statement, enabling log_duration may increase log file usage and affect the server’s general performance. For this reason, if you already have log_min_duration_statement enabled, it is often suggested to disable log_duration, unless there’s a specific need to keep track of both.

log_lock_waits: when log_lock_waits is enabled a log message is recorded when a session waits longer than deadlock_timeout to acquire a lock.

log_checkpoints: logs all checkpoints and restart points to the PostgreSQL log file.

log_rotation_size: defines the size limit for each log file; once it reaches this threshold the log file is rotated.
Example: 

log_rotation_size = '500MB'

  – every log file is limited to a size of 500 MB.

log_rotation_age: determines the maximum life span for a log file, forcing its rotation once this threshold is reached. This parameter is usually set in terms of hours, or maybe days, though the minimum granularity is a minute. However, if log_rotation_size is reached first, the log gets rotated anyway, irrespective of this setting.
Example: 

log_rotation_age = 1d

log_statement: controls what types of SQL statements are logged. The recommended setting is DDL, which logs all DDLs that are executed. Tracking those allows you to later audit when a given DDL was executed, and by whom. By monitoring and understanding the amount of information it may write to the log file you may consider modifying this setting. Other possible values are none, mod (which includes DDLs plus DMLs) and all.

log_temp_files: logs information about temporary files whose size is greater than this value (in KB).

log_directory: defines the directory in which log files are created. Once more, please note that if you have enabled detailed logging it is recommended to have a separate disk—different from the data directory disk—allocated for log_directory.
Example: 

log_directory = '/this_is_a_new_disk/pg_log'
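Putting the parameters above together, here is a hedged example of what the logging section of postgresql.conf could look like for such a setup; the values are purely illustrative and should be tuned to your environment:

logging_collector = on
log_directory = '/this_is_a_new_disk/pg_log'
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'
log_min_duration_statement = 5s
log_lock_waits = on
log_checkpoints = on
log_statement = 'ddl'
log_temp_files = 1024
log_rotation_size = '500MB'
log_rotation_age = 1d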

Log Analyzer – pgBadger

You cannot have a separate error log and a slow query log. Everything is written to one log file, which may be periodically rotated based on time and size. Over a period of time, this log file size may increase to several MBs or even GBs, depending on the amount of logging that has been enabled. It can get difficult for a DBA or developer to parse the log files and get a good view of what is running slowly and how many times a query has run. To help with this task you may use pgBadger, a log analyzer for PostgreSQL, to parse log files and generate a rich HTML-based report that you can access from a browser. An example report can be seen in the screenshots below.
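As a quick sketch (the paths are placeholders, and pgBadger offers many more options, including an incremental mode), generating a report can be as simple as:

pgbadger /this_is_a_new_disk/pg_log/postgresql-*.log -o /var/www/html/pgbadger_report.html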

We’ll be showing detailed logging and pgBadger in action in a couple of days so, if there’s still time, sign up for our October webinar!

Oct 05, 2018

PostgreSQL Extensions for an Enterprise-Grade System

In this current series of blog posts we have been discussing various relevant aspects of building an enterprise-grade PostgreSQL setup, such as security, backup strategy, high availability, and different methods to scale PostgreSQL. In this blog post, we’ll review some of the most popular open source extensions for PostgreSQL, used to expand its capabilities and address specific needs. We’ll cover some of them during a demo in our upcoming webinar on October 10.

Expanding with PostgreSQL Extensions

PostgreSQL is one of the world’s most feature-rich and advanced open source RDBMSs. Its features are not limited to those released by the community through major/minor releases. There are hundreds of additional features developed using the extension capabilities in PostgreSQL, which can cater to the needs of specific users. Some of these extensions are very popular and useful for building an enterprise-grade PostgreSQL environment. We previously blogged about a couple of FDW extensions (mysql_fdw and postgres_fdw) which allow PostgreSQL databases to talk to remote homogeneous or heterogeneous databases such as PostgreSQL, MySQL, and MongoDB. We will now cover a few other extensions that can expand your PostgreSQL server’s capabilities.

pg_stat_statements

The pg_stat_statements module provides a means for tracking execution statistics of all SQL statements executed by a server. The statistics gathered by the module are made available via a view named pg_stat_statements. This extension must be installed in each of the databases you want to track, and like many of the extensions in this list, it is available in the contrib package from the PostgreSQL PGDG repository.
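
As a minimal sketch of typical usage (assuming shared_preload_libraries already includes 'pg_stat_statements' and the server has been restarted), you can create the extension and then query the view for the most expensive statements:

CREATE EXTENSION pg_stat_statements;

-- top five statements by total execution time (column names as of PostgreSQL 10)
SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;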

pg_repack

Tables in PostgreSQL may end up with fragmentation and bloat due to the specific MVCC implementation in PostgreSQL, or simply due to a high number of rows being naturally removed. This could lead to not only unused space being held inside the table but also to sub-optimal execution of SQL statements. pg_repack is the most popular way to address this problem by reorganizing and repacking the table. It can reorganize the table’s content without placing an exclusive lock on it during the process. DMLs and queries can continue while repacking is happening.  Version 1.2 of pg_repack introduces further new features of parallel index builds, and the ability to rebuild just the indexes. Please refer to the official documentation for more details.
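
As an illustration, and assuming the pg_repack extension has already been created in the target database, a single table can be repacked from the command line along these lines (database and table names are placeholders):

$ pg_repack --dbname=mydb --table=public.mytable --jobs=2

The --jobs option requests parallel index builds, one of the features mentioned above.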

pgaudit

PostgreSQL has a basic statement logging feature. It can be implemented using the standard logging facility with

log_statement = all

 . But this is not sufficient for many audit requirements. One of the essential features for enterprise deployments is the capability for fine-grained auditing the user interactions/statements issued to the database. This is a major compliance requirement for many security standards. The pgaudit extension caters to these requirements.

The PostgreSQL Audit Extension (pgaudit) provides detailed session and/or object audit logging via the standard PostgreSQL logging facility. Please refer to the settings section of its official documentation for more details.
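
As a minimal configuration sketch (the set of audited statement classes below is just an example), pgaudit is loaded through shared_preload_libraries and tuned with its own settings in postgresql.conf:

shared_preload_libraries = 'pgaudit'
pgaudit.log = 'ddl, write'       # audit DDL plus INSERT/UPDATE/DELETE statements

followed by creating the extension in the target database:

CREATE EXTENSION pgaudit;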

pldebugger

This is a must-have extension for developers who work on stored functions written in PL/pgSQL. This extension is well integrated with GUI tools like pgAdmin, allowing developers to step through their code and debug it. Packages for pldebugger are also available in the PGDG repository and installation is straightforward. Once it is set up, we can step through and debug the code remotely.

The official git repo is available here
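
A typical setup sketch, assuming the packaged library is named plugin_debugger and the extension pldbgapi (names may vary slightly between packages), would be:

shared_preload_libraries = 'plugin_debugger'    # in postgresql.conf, requires a restart

CREATE EXTENSION pldbgapi;                      -- in the database holding the functions to debug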

plprofiler

This is a wonderful extension for finding out where code is slowing down. It is particularly helpful during complex migrations from proprietary databases, such as from Oracle to PostgreSQL, which can affect application performance. This extension can prepare a report on the overall execution time, including tables and flamegraphs, with clear information about each line of code. This extension is not, however, available from the PGDG repo: you will need to build it from source. Details on building and installing plprofiler will be covered in a future blog post. Meanwhile, the official repository and documentation are available here

PostGIS

PostGIS is arguably the most versatile implementation of the specifications of the Open Geospatial Consortium. We can see a large list of features in PostGIS that are rarely available in any other RDBMSs.

There are many users who have opted for PostgreSQL primarily because of the features supported by PostGIS. In fact, these features are not implemented as a single extension, but are instead delivered by a collection of extensions. This makes PostGIS one of the most complex extensions to build from source. Luckily, everything is available from the PGDG repository:

$ sudo yum install postgis24_10.x86_64

Once the postgis package is installed, we are able to create the extensions on our target database:

postgres=# CREATE EXTENSION postgis;
CREATE EXTENSION
postgres=# CREATE EXTENSION postgis_topology;
CREATE EXTENSION
postgres=# CREATE EXTENSION postgis_sfcgal;
CREATE EXTENSION
postgres=# CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION
postgres=# CREATE EXTENSION postgis_tiger_geocoder;
CREATE EXTENSION
postgres=# CREATE EXTENSION address_standardizer;
CREATE EXTENSION

Language Extensions: PL/Python, PL/Perl, PL/V8, PL/R, etc.

Another powerful feature of PostgreSQL is its programming languages support. You can code database functions/procedures in pretty much every popular language.

Thanks to the enormous number of libraries available, including machine learning ones, and to its vibrant community, Python has claimed the third spot amongst the most popular programming languages according to the TIOBE Programming Community index. Your team’s skills and libraries remain valid for PostgreSQL server coding too! Teams that regularly code in JavaScript for Node.js or Angular can easily write PostgreSQL server code in PL/V8. All of the packages required are readily available from the PGDG repository.
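
As a small illustration of server-side coding in another language, here is a sketch of a PL/Python function (it assumes the PL/Python3 package is installed; the language would be plpythonu for Python 2):

CREATE EXTENSION plpython3u;

CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer AS $$
    return max(a, b)
$$ LANGUAGE plpython3u;

SELECT pymax(10, 20);   -- returns 20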

cstore_fdw

cstore_fdw is an open source columnar store extension for PostgreSQL. Columnar stores provide notable benefits for analytics use cases where data is loaded in batches. Cstore_fdw’s columnar nature delivers performance by only reading relevant data from disk. It may compress data by 6 to 10 times to reduce space requirements for data archive. The official repository and documentation is available here
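
A minimal usage sketch, with a hypothetical events table, could look like this:

CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE events (
    event_time  timestamptz,
    customer_id int,
    payload     text
) SERVER cstore_server
  OPTIONS (compression 'pglz');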

HypoPG

HypoPG is an extension for adding support for hypothetical indexes – that is, without actually adding the index. This helps us to answer questions such as “how will the execution plan change if there is an index on column X?”. Installation and setup instructions are part of its official documentation
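
For example, we could simulate an index and check whether the planner would use it, without paying the cost of actually building it (table and column names below are placeholders):

CREATE EXTENSION hypopg;

SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- a plain EXPLAIN (without ANALYZE) will consider the hypothetical index
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;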

mongo_fdw

Mongo_fdw presents collections from mongodb as tables in PostgreSQL. This is a case where the NoSQL world meets the SQL world and features combine. We will be covering this extension in a future blog post. The official repository is available here

tds_fdw

Another important FDW (foreign data wrapper) extension in the PostgreSQL world is tds_fdw. Both Microsoft SQL Server and Sybase use the TDS (Tabular Data Stream) format. This FDW allows PostgreSQL to use tables stored in a remote SQL Server or Sybase database as local tables. It makes use of the FreeTDS libraries.

orafce

As previously mentioned, there are a lot of migrations underway from Oracle to PostgreSQL. Incompatible functions in PostgreSQL are often painful for those who are migrating server code. The “orafce” project implements some of the functions from the Oracle database. The functionality was verified on Oracle 10g and the module is useful for production work. Please refer to the list in its official documentation of the Oracle functions implemented in PostgreSQL

TimescaleDB

In this new world of IoT and connected devices, there is a growing need for time-series data stores. TimescaleDB can convert PostgreSQL into a scalable time-series data store. The official site is available here with all relevant links.
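
Getting started is usually a matter of creating the extension and converting a regular table into a hypertable; a sketch (with a placeholder table, and assuming timescaledb has been added to shared_preload_libraries) follows:

CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE conditions (
    time        timestamptz NOT NULL,
    device_id   text,
    temperature double precision
);

SELECT create_hypertable('conditions', 'time');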

pg_bulkload

Is loading a large volume of data into your database quickly and efficiently a challenge for you? If so, pg_bulkload may help you solve that problem. Official documentation is available here

pg_partman

PostgreSQL 10 introduced declarative partitions. But creating new partitions and maintaining existing ones, including purging unwanted partitions, requires a good dose of manual effort. If you are looking to automate part of this maintenance you should have a look at what pg_partman offers. The repository with documentation is available here.
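
As a sketch, assuming pg_partman 4.x with PostgreSQL 10 native partitioning and a placeholder events table already declared with PARTITION BY RANGE (created_at), daily partition maintenance can be bootstrapped like this:

CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

SELECT partman.create_parent('public.events', 'created_at', 'native', 'daily');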

wal2json

PostgreSQL has built-in support for logical decoding: extra information can be recorded in the WALs to facilitate it. wal2json is a popular output plugin for logical decoding, and it can be utilized for different purposes, including change data capture. In addition to wal2json, there are other output plugins: a concise list is available in the PostgreSQL wiki.
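
A quick way to see it in action, assuming the wal2json package is installed and wal_level is set to logical (with enough max_replication_slots), is with pg_recvlogical; the slot name below is arbitrary:

$ pg_recvlogical -d postgres --slot=test_slot --create-slot --plugin=wal2json
$ pg_recvlogical -d postgres --slot=test_slot --start -o pretty-print=1 -f -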

There are many more extensions that help us build an enterprise-grade PostgreSQL set up using open source solutions. Please feel free to comment and ask us if we know about one that satisfies your particular needs. Or, if there’s still time, sign up for our October webinar and ask us in person!

Oct
05
2018
--

MongoDB-Disable Chained Replication

MongoDB chained replication

In this blog post, we will learn what MongoDB chained replication is, why you might choose to disable it, and the steps you need to take to do so.

What is chain replication?

Chain Replication in MongoDB, as the name suggests, means that a secondary member is able to replicate from another secondary member instead of a primary.

Default settings

By default, chained replication is enabled in MongoDB. It helps to reduce the load on the primary, but it may lead to replication lag. When enabled, a secondary node selects its sync source based on ping time, choosing the closest member.

Reasons to disable chained replication

The main reason to disable chain replication is replication lag. In other words, the length of the delay between MongoDB writing an operation on the primary and replicating the same operation to the secondary.

In either case—chained replication enabled or disabled—replication works in the same way when the primary node fails: the secondary server will promote to the primary. Therefore, writing and reading of data from the application is not affected.

Steps to disable chained replication

1) Check the current status of chained replication in replica set configuration for “settings” like this:

PRIMARY> rs.config().settings
{
"chainingAllowed" : true,
}

2) To disable chained replication, set “chainingAllowed” to false and then reconfigure the replica set to apply the change:

PRIMARY> cg = rs.config()
PRIMARY> cg.settings.chainingAllowed = false
false
PRIMARY> rs.reconfig(cg)

3) Check the current status of chained replication again to confirm the change:

PRIMARY> rs.config().settings
{
	"chainingAllowed" : false,
}

Can I override sync source target even after disabling chaining?

Yes, even after you have disabled chained replication, you can still override the sync source, though only temporarily. The override remains in effect until:

  • the mongod instance restarts
  • the connection between the secondary node and its sync source closes
  • additionally, if chaining is enabled and the sync source falls more than 30 seconds behind another member, the SyncSourceResolver will choose another member with more recent oplog entries to sync from.

Override sync source

The “replSetSyncFrom” command can be used for this. For example, suppose the secondary node is syncing from host

192.168.103.100:27001

  and we would like to sync it from

192.168.103.100:27003

1) Check for the current host it is syncing from:

PRIMARY> rs.status()
{
			"_id" : 1,
			"name" : "192.168.103.100:27002",
			"syncingTo" : "192.168.103.100:27001",
			"syncSourceHost" : "192.168.103.100:27001",
		},

2) Login to that mongod, and execute:

SECONDARY> db.adminCommand( { replSetSyncFrom: "192.168.103.100:27003" })

3) Check replica set status again

SECONDARY> rs.status()
{
			"_id" : 1,
			"name" : "192.168.103.100:27002",
			"syncingTo" : "192.168.103.100:27003",
			"syncSourceHost" : "192.168.103.100:27003",
		},

This is how we can override the sync source for testing, maintenance, or when the replica is not syncing from the required host.

I hope this blog post helps you to understand how to disable chained replication or override the sync source for a specific purpose or reason. The preferred setting of the chainingAllowed parameter is true, as it reduces the load on the primary node and is also the default setting.

Oct
04
2018
--

The Importance of mysqlbinlog –version

Importance of MySQL binlog version

When deciding on your backup strategy, one of the key components for Point In Time Recovery (PITR) will be the binary logs. Thankfully, the mysqlbinlog command allows you to easily take binary log backups, including those that would otherwise be encrypted on disk using encrypt_binlog=ON.

When

mysqlbinlog

  is used with

--raw --read-from-remote-server --stop-never --verify-binlog-checksum

  then it will retrieve binary logs from whichever master it is pointed to, and store them locally on disk in the same format as they were written on the master. Here is an example with the extra arguments that would normally be used:

/usr/bin/mysqlbinlog --raw --read-from-remote-server \
 --stop-never --connection-server-id=1234 \
 --verify-binlog-checksum \
 --host=localhost --port=3306 mysql-bin.000024

This would retrieve the localhost binary logs (starting from mysql-bin.000024) reporting as server_id 1234, verify the checksum and then write each of them to disk.

Changes to the mysqlbinlog source code are relatively infrequent, except for when developing for a new major version, so you may be fooled into thinking that the specific version that you use is not so important—a little like the client version. This is something that is more likely to vary when you are taking remote backups.

Here is the result from the 5.7 branch of mysql-server to show the history of commits by year:

$ git blame --line-porcelain client/mysqlbinlog.cc | egrep "^(committer-time|committer-tz)" | cut -f2 -d' ' | while read ct; do read ctz; date --date "Jan 1, 1970 00:00:00 ${ctz} + ${ct} seconds" --utc +%Y; done | sort -n | uniq -c
   105 2000
    52 2001
    52 2002
   103 2003
   390 2004
   113 2005
    58 2006
   129 2007
   595 2008
    53 2009
   349 2010
   151 2011
   382 2012
   191 2013
   308 2014
   404 2015
    27 2016
    42 2017
    15 2018

Since the first GA release of 5.7 (October 2015), there haven’t been too many bugs and so if you aren’t using new features then you may think that it is OK to keep using the same version as before:

$ git log --regexp-ignore-case --grep bug --since="2015-10-19" --oneline client/mysqlbinlog.cc
1ffd7965a5e Bug#27558169 BACKPORT TO 5.7 BUG #26826272: REMOVE GCC 8 WARNINGS [noclose]
17c92835bb3 Bug #24674276 MYSQLBINLOG -R --HEXDUMP CRASHES FOR INTVAR,                   USER_VAR, OR RAND EVENTS
052dbd7b079 BUG#26878022 MYSQLBINLOG: ASSERTION `(OLD_MH->M_KEY == KEY) ||              (OLD_MH->M_KEY == 0)' FAILED
543129a577c BUG#26878022 MYSQLBINLOG: ASSERTION `(OLD_MH->M_KEY == KEY) || (OLD_MH->M_KEY == 0)' FAILED
ba1a99c5cd7 Bug#26825211 BACKPORT FIX FOR #25643811 TO 5.7
1f0f4476b28 Bug#26117735: MYSQLBINLOG READ-FROM-REMOTE-SERVER NOT HONORING REWRITE_DB FILTERING
12508f21b63 Bug #24323288: MAIN.MYSQLBINLOG_DEBUG FAILS WITH A LEAKSANITIZER ERROR
e8e5ddbb707 Bug#24609402 MYSQLBINLOG --RAW DOES NOT GET ALL LATEST EVENTS
22eec68941f Bug#23540182:MYSQLBINLOG DOES NOT FREE THE EXISTING CONNECTION BEFORE OPENING NEW REMOTE ONE
567bb732bc0 Bug#22932576 MYSQL5.6 DOES NOT BUILD ON SOLARIS12
efc42d99469 Bug#22932576 MYSQL5.6 DOES NOT BUILD ON SOLARIS12
6772eb52d66 Bug#21697461 MEMORY LEAK IN MYSQLBINLOG

However, this is not always the case and some issues are more obvious than others! To help show this, here are a couple of the issues that you might happen to notice.

Warning: option ‘stop-never-slave-server-id’: unsigned value <xxxxxxxx> adjusted to <yyyyy>

The server_id that is used by a server in a replication topology should always be unique within the topology. One of the easy ways to ensure this is to use a conversion of the external IPv4 address to an integer, such as INET_ATON, which provides you with an unsigned integer.
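
For example, using an address from the documentation range:

mysql> SELECT INET_ATON('203.0.113.25') AS server_id;
+------------+
| server_id  |
+------------+
| 3405803801 |
+------------+
1 row in set (0.00 sec)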

The introduction of

--connection-server-id

 (which deprecates

--stop-never-slave-server-id

 ) changes the behaviour here (for the better). Prior to this you may experience warnings where your server_id was cast to the equivalent of an UNSIGNED SMALLINT. This didn’t seem to be a reported bug, just fixed as a by-product of the change.

ERROR: Could not find server version: Master reported unrecognized MySQL version ‘xxx’

When running mysqlbinlog, the version of MySQL is checked so that the event format is set correctly. Here is the code from MySQL 5.7:

switch (*version) {
 case '3':
   glob_description_event= new Format_description_log_event(1);
   break;
 case '4':
   glob_description_event= new Format_description_log_event(3);
   break;
 case '5':
   /*
     The server is soon going to send us its Format_description log
     event, unless it is a 5.0 server with 3.23 or 4.0 binlogs.
     So we first assume that this is 4.0 (which is enough to read the
     Format_desc event if one comes).
   */
   glob_description_event= new Format_description_log_event(3);
   break;
 default:
   glob_description_event= NULL;
   error("Could not find server version: "
         "Master reported unrecognized MySQL version '%s'.", version);
   goto err;
 }

This section of the code last changed in 2008, but of course there is another vendor that no longer uses a 5-prefixed-version number: MariaDB. With MariaDB, it is impossible to take a backup without using a MariaDB version of the program, as you are told that the version is unrecognised. The MariaDB source code contains a change to this section to resolve the issue when the version was bumped to 10:

83c02f32375b client/mysqlbinlog.cc (Michael Widenius    2012-05-31 22:39:11 +0300 1900) case 5:
83c02f32375b client/mysqlbinlog.cc (Michael Widenius    2012-05-31 22:39:11 +0300 1901) case 10:

Interestingly, MySQL 8.0 gets a little closer to not producing an error (although it still does), but finally sees off those rather old ancestral relatives:

 switch (*version) {
   case '5':
   case '8':
   case '9':
     /*
       The server is soon going to send us its Format_description log
       event.
     */
     glob_description_event = new Format_description_log_event;
     break;
   default:
     glob_description_event = NULL;
     error(
         "Could not find server version: "
         "Master reported unrecognized MySQL version '%s'.",
         version);
     goto err;
 }

These are somewhat trivial examples. In fact, you are more likely to suffer from more serious differences, perhaps ones that do not become immediately apparent, if you are not matching the mysqlbinlog version to the one provided by the version for the server producing the binary logs.

Sadly, it is not so easy to check the versions as the reported version was seemingly left unloved for quite a while (Ver 3.4), so you should check the binary package versions (e.g. using Percona-Server-client-57-5.7.23 with Percona-Server-server-57-5.7.23). Thankfully, the good news is that MySQL 8.0 fixes it!
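
On an RPM-based system, for instance, one way to compare the installed client and server packages (both should report the same version) could be:

$ rpm -qa | grep -E 'Percona-Server-(client|server)-57'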

So reduce the risk and match your package versions!

Oct
03
2018
--

Finding Table Differences on Nullable Columns Using MySQL Generated Columns

MySQL generated columns

Some time ago, a customer had a performance issue with an internal process. He was comparing, finding, and reporting the rows that were different between two tables. This is simple if you use a LEFT JOIN and an

IS NULL

  comparison over the second table in the WHERE clause, but what if the column could be null? That is why he used UNION, GROUP BY and HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned DEFAULT NULL,
  `c` char(120) DEFAULT NULL,
  `pad` char(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k` (`k`,`c`,`pad`)
) ENGINE=InnoDB

It is slightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set one row on each table to NULL:

update sbtest1 set k=null limit 1;
update sbtest2 set k=null limit 1;

If we execute the comparison query, we get this result:

mysql> select "sbtest1",a.* from
    -> sbtest1 a left join
    -> sbtest2 b using (k,c,pad)
    -> where b.id is null union
    -> select "sbtest2",a.* from
    -> sbtest2 a left join
    -> sbtest1 b using (k,c,pad)
    -> where b.id is null;
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | id   | k    | c                                                                                                                       | pad                                                         |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
| sbtest2 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (3.00 sec)

As you can see, column k is NULL. In both cases the join failed to match those rows (a NULL is never equal to another NULL) and reported them as different. This is not new in MySQL, but it would be nice to have a way to sort this issue out.

Solution

The solution is based on GENERATED COLUMNS with a hash function (md5) and stored in a binary(16) column:

ALTER TABLE sbtest1
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);
ALTER TABLE sbtest2
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:

mysql> select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null union
    -> select "sbtest2",a.k,a.c,a.pad from
    -> sbtest2 a left join
    -> sbtest1 b using (real_id)
    -> where b.id is null;
Empty set (2.31 sec)

We can see an improvement in the query performance—it now takes 2.31 sec whereas before it took 3.00 sec—and the result is as expected. We could say that that’s all, and that no further improvement can be made. However, that is not true. Even though the query is running faster, it is possible to optimize it further:

mysql> select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null) union
    -> select "sbtest2",a.k,a.c,a.pad
    -> from sbtest2 a
    -> where a.id in (select a.id
    ->   from sbtest2 a left join
    ->   sbtest1 b using (real_id)
    ->   where b.id is null);
Empty set (1.60 sec)

Why is this faster? The first query is a UNION of two very similar sub-selects. Let’s check the explain plan of one of them:

mysql> explain select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null;
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref              | rows   | filtered | Extra                                |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE      | a     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL             | 315369 |   100.00 | NULL                                 |
|  1 | SIMPLE      | b     | NULL       | ref  | real_id       | real_id | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

As you can see, it is performing a full table scan over the first table and using real_id to join the second table. Because real_id is a virtual generated column, its expression has to be evaluated for every row in order to probe the second table’s index, and that takes time.

If we analyze the subquery of the second query:

mysql> explain select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null);
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref              | rows   | filtered | Extra                                |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE       | a           | NULL       | index  | PRIMARY       | k          | 187     | NULL             | 315369 |   100.00 | Using where; Using index             |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 4       | sbtest.a.id      |      1 |   100.00 | NULL                                 |
|  2 | MATERIALIZED | a           | NULL       | index  | PRIMARY       | real_id    | 17      | NULL             | 315369 |   100.00 | Using index                          |
|  2 | MATERIALIZED | b           | NULL       | ref    | real_id       | real_id    | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

We can see that it performs a full index scan over the first table, and that the generated column never has to be evaluated. That is how we go from a result that mishandles NULLs in 3.00 seconds, to a correct result in 2.31 seconds, and finally to a performant query that takes just 1.60 seconds.

Conclusions

This is not the first blog post that I’ve written about generated columns. I think it is a useful feature for several scenarios where you need to improve performance. In this particular case, it also provides a workaround for the problems LEFT JOINs have with NULL values. It is also important to mention that this improved a process in a real world scenario.


Oct
02
2018
--

Scaling PostgreSQL using Connection Poolers and Load Balancers for an Enterprise Grade environment

PostgreSQL Load balancing connection pooling

PostgreSQL® logo

In the previous blog posts in this series we’ve covered some of the essential aspects of an Enterprise-grade solution: security, high availability, and backups. Another important aspect is the scalability of the solution: as our application grows how do we accommodate an increase in traffic while maintaining the quality of the service (response time)? The answer to this question depends on the nature of the workload at play but it is often shaped around:

(a) improving its efficiency and
(b) increasing the resources available.

Why connection pooling ?

When it comes to improving the efficiency of a database workload, one of the first places we look is the list of slow queries: if the most popular ones can be optimized to run faster, we can easily gain some overall performance back. Arguably, we may look next at the number and frequency of client connections: is the workload composed of a high number of very frequent but short-lived connections? Or are client connections of a more moderate number, and tend to stick around for longer?

If we consider the first scenario further (a high number of short-lived connections), and bear in mind that each connection spawns a new OS process, the server may hit a practical limit as to the number of transactions, or connections, it can manage per second, given the hardware available and the workload being processed. Remember that PostgreSQL is process-based, as opposed to thread-based, and spawning a new process is an expensive operation in terms of resources, both CPU and memory.

A possible remedy for this would be the use of a connection pooler, acting as a mediator between the application and the database. The connection pooler keeps a number of connections permanently opened with the database, and receives and manages all incoming requests from clients itself, allowing them to temporarily use one of the connections it already has established with PostgreSQL. This removes the burden of creating a new process each time a client establishes a connection with PostgreSQL, and allows it to employ the resources that it would otherwise use for this into serving more requests (or completing them faster).

Rule of thumb?

A general rule of thumb that we often hear is that you may need a connection pooler once you reach around 350 concurrent connections. However, the actual threshold is highly dependent on your database traffic and server configuration: as we found out recently, you may need one much sooner.

You may implement connection pooling using your application’s native connection pooler (if there is one available) or through an external connection pooler such as PgBouncer or Pgpool-II. For the solution we have built, which we demonstrate in our webinar on October 10, we have used PgBouncer as our connection pooler.

PgBouncer

PgBouncer is a lightweight (thread-based) connection pooler that has been widely used in PostgreSQL based environments. It “understands” the PostgreSQL connection protocol and has been a stable project for over a decade.

PgBouncer allows you to configure the pool of connections to operate in three distinct modes: session, statement and transaction. Unless you have a good reason to reserve a connection in the pool for a single user for the duration of their session, or are operating exclusively with single statements, transaction mode is the one you should investigate.
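
As an illustration, a minimal [pgbouncer] section enabling transaction mode could look like this (the pool sizes are placeholder values to be tuned for your workload, and authentication settings are omitted):

[pgbouncer]
listen_addr = *
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20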

A feature that is central to our enterprise-grade solution is that you can add multiple connection strings using unique alias names (referred to as database names). This allows greater flexibility when mediating connections with multiple database servers. We can then have an alias named “master_db” that will route connections to a master/primary server and another alias named “slave_db” that will route connections to a slave/standby server.

Scaling up

Once efficiency is taken care of, we can start working on increasing the resources, or computing power, available to process database requests. Scaling vertically means, in short, upgrading the server: more and faster cores, memory, and storage. It’s a simple approach, but one that reaches a practical limitation rather quickly, and it is not in line with other requirements of an enterprise-grade solution, such as high availability. The alternative is scaling horizontally. As briefly introduced above, a common way to implement horizontal scalability is to redirect reads to standby servers (replicas) with the help of a proxy, which can also act as a load balancer, such as HAProxy. We’ll be discussing these ideas further here, and showcase their integration in our webinar.

HAProxy 

HAProxy is a popular open source TCP/HTTP load balancer that can distribute the workload across multiple servers. It can be leveraged in a PostgreSQL replication cluster built using streaming replication. When you build replication using the streaming replication method, standby replicas are open for reads. With the help of HAProxy you can efficiently utilize the computing power of all database servers, distributing read requests among the available replicas using algorithms such as Least Connections and Round Robin.

A combination of connection pooler and load balancer to scale PostgreSQL

The following diagram represents a simplified part of the architecture that composes the enterprise grade solution we’ve designed, where we employ PgBouncer and HAProxy to scale our PostgreSQL cluster:

Our PgBouncer contains two database (alias) names, one for redirecting writes to the master and another for balancing reads across standby replicas, as discussed above. Here is how the database section looks in the pgbouncer.ini, PgBouncer’s main configuration file:

[databases]
master = host=haproxy port=5002 dbname=postgres
slave = host=haproxy port=5003 dbname=postgres

Notice that both database entries redirect their connections to the HAProxy server, but each to a different port. HAProxy, in turn, is configured to route the connections according to the port on which they arrive. Considering the above PgBouncer config file as a reference, writes (master connections) are redirected to port 5002 and reads (slave connections) to port 5003. Here is how the HAProxy config file looks:

# Connections to port 5002
listen Master
    bind *:5002
    option tcp-check
    tcp-check send GET\ / HTTP/1.0\r\n
    tcp-check send HOST\r\n
    tcp-check send \r\n
    tcp-check expect string "role":\ "master"
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 pg0:5432 maxconn 100 check port 8008
    server pg1 pg1:5432 maxconn 100 check port 8008
    server pg2 pg2:5432 maxconn 100 check port 8008
# Connections to port 5003
listen Slaves
    bind *:5003
    option tcp-check
    tcp-check send GET\ / HTTP/1.0\r\n
    tcp-check send HOST\r\n
    tcp-check send \r\n
    tcp-check expect string "role":\ "replica"
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 pg0:5432 maxconn 100 check port 8008
    server pg1 pg1:5432 maxconn 100 check port 8008
    server pg2 pg2:5432 maxconn 100 check port 8008

As seen above:

  • connections to port 5002 are sent to the server with role: “master”
  • connections to port 5003 are sent to a server with role: “replica”

In a previous post, we discussed using Patroni in our high availability setup. HAProxy relies on Patroni to determine the role of the PostgreSQL server. Patroni is being used here for cluster management and automatic failover. By using Patroni’s REST API (on port 8008 in this scenario) we can obtain the role of a given PostgreSQL server. The example below shows this in practice, the IP addresses denoting the PostgreSQL servers in this setup:

$ curl -s 'http://192.168.50.10:8008' | python -c "import sys, json; print json.load(sys.stdin)['role']"
replica
$ curl -s 'http://192.168.50.20:8008' | python -c "import sys, json; print json.load(sys.stdin)['role']"
master
$ curl -s 'http://192.168.50.30:8008' | python -c "import sys, json; print json.load(sys.stdin)['role']"
replica

HAProxy can thus rely on Patroni’s REST API to redirect connections from the master alias in PgBouncer to a server with role master. Similarly, HAProxy uses server role information to redirect connections from a slave alias to one of the servers with role replica, using the appropriate load balancer algorithm.

This way, we ensure that the application uses the advantage of a connection pooler to leverage connections to the database, and also of the load balancer which distributes the read load to multiple database servers as configured.
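
From the application’s point of view, writes and reads are then simply directed at the corresponding PgBouncer alias. For example, with psql (the hostname and the default PgBouncer port 6432 are assumptions of this sketch):

$ psql -h pgbouncer.example.com -p 6432 -d master -c "INSERT INTO t1 VALUES (1)"
$ psql -h pgbouncer.example.com -p 6432 -d slave -c "SELECT count(*) FROM t1"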

There are many other open source connection poolers and load balancers available to build a similar setup. You can choose the one that best suits your environment—just make sure to test your custom solution appropriately before bringing it to production.

