Jan 13, 2022

How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster


Failover of the logical replication slot has always been a pain point when using logical replication in PostgreSQL. This missing feature undermined the use of logical replication and acted as one of its biggest deterrents. The stakes and impact were so high that many organizations had to discard their plans around logical replication, and it affected many plans for migrations to PostgreSQL. It was painful to see that many had to opt for proprietary/vendor-specific solutions instead.

At Percona, we have written about this in the past: Missing Piece: Failover of the Logical Replication Slot. In that post, we discussed one of the possible approaches to solve this problem, but there was no reliable mechanism to copy the slot information to a physical standby and maintain it.

The problem, in a nutshell, is this: the replication slot is always maintained only on the primary node. If there is a switchover/failover that promotes one of the standbys, the new primary won’t have any idea about the replication slot maintained by the previous primary node. This breaks logical replication from the downstream systems, or, if a new slot is created, it becomes unsafe to use.

The good news is that Patroni developers and maintainers addressed this problem in version 2.1.0 and provided a working solution without any invasive methods/extensions. For me, this is work that deserves a big round of applause from the Patroni community, and the intention of this blog post is to give it that recognition and make sure a bigger crowd is aware of it.

How to Set it Up

A ready-to-use Patroni package is available from the Percona repository. But you are free to use Patroni from any source.

Basic Configuration

In case you are excited about this and want to try it, the following steps might be helpful.

The entire discussion is about logical replication, so the minimum requirement is to have wal_level set to “logical”. If the existing Patroni configuration has wal_level set to “replica” and you want to use this feature, you can simply edit the Patroni configuration.
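
For example, the relevant fragment of the dynamic configuration edited via patronictl edit-config might look like this (a minimal sketch; the rest of the existing configuration is assumed to stay as it is):

postgresql:
  parameters:
    wal_level: logical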

Patroni configuration

However, this change requires a PostgreSQL restart:

The “Pending restart” flag marked with * indicates this.

You may use Patroni’s “switchover” feature to restart a node and put the change into effect, because the demoted node goes through a restart.

If there are any remaining nodes, they can be restarted later.
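
A hedged sketch of the commands involved (the cluster and member names are placeholders, not taken from this setup):

$ patronictl -c /etc/patroni/patroni.yml switchover <cluster_name>
$ patronictl -c /etc/patroni/patroni.yml restart <cluster_name> <member_name>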

Creating Logical Slots

Now we can add a permanent logical replication slot to PostgreSQL which will be maintained by Patroni.

Edit the patroni configuration:

$ patronictl -c /etc/patroni/patroni.yml edit-config

A slot specification can be added as follows:

…
slots:
  logicreplia:
    database: postgres
    plugin: pgoutput
    type: logical
…

The “slots:” section defines permanent replication slots. These slots will be preserved during switchover/failover. “pgoutput” is the decoding plugin for PostgreSQL logical replication.

Once the change is applied, the logical replication slot is created on the primary node, which can be verified by querying:

select * from pg_replication_slots;

The following is a sample output:

patroni output

Now here is the first level of magic! The same replication slot is created on the standbys as well. Yes, Patroni does it: it internally copies the replication slot information from the primary to all eligible standby nodes!

We can run the same query against pg_replication_slots on the standby and see similar information.

The following is an example showing the same replication slot reflecting on the standby side:

replication slot

This slot can be used by the subscription by explicitly specifying the slot name while creating the subscription.

CREATE SUBSCRIPTION sub2 CONNECTION '<connection_string>' PUBLICATION <publication_name> WITH (copy_data = true, create_slot = false, enabled = true, slot_name = logicreplia);

Alternatively, an existing subscription can be modified to use the new slot, which is what I generally prefer to do.

For example:

ALTER SUBSCRIPTION name SET (slot_name=logicreplia);

Corresponding PostgreSQL log entries can confirm the slot name change:

2021-12-27 15:56:58.294 UTC [20319] LOG:  logical replication apply worker for subscription "sub2" will restart because the replication slot name was changed
2021-12-27 15:56:58.304 UTC [20353] LOG:  logical replication apply worker for subscription "sub2" has started

From the publisher side, we can confirm the slot usage by checking the active_pid and the advancing LSN of the slots.
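
A hedged example of such a check on the publisher, using columns that exist in pg_replication_slots:

select slot_name, active, active_pid, confirmed_flush_lsn
  from pg_replication_slots
 where slot_type = 'logical';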

The second level of surprise! The replication slot information in all the standby nodes of the Patroni cluster is also advanced as the logical replication progresses on the primary side.

At a higher level, this is exactly what this feature is doing:

  1. Automatically creates/copies the replication slot information from the primary node of the Patroni cluster to all eligible standby nodes.
  2. Automatically advances the LSN on the standby slots as the LSN advances on the corresponding slot on the primary.

After a Switchover/Failover

In the event of a switchover or failover, we do not lose any slot information, as it is already maintained on the standby nodes.

After the switchover, the topology looks like this:

Now, any downstream logical replica can be repointed to the new primary.

postgres=# ALTER SUBSCRIPTION sub2 CONNECTION 'host=192.168.50.10 port=5432 dbname=postgres user=postgres password=vagrant';                                                                                                                                                  
ALTER SUBSCRIPTION

This resumes the replication, and the pg_replication_slots information can confirm it.

Summary + Key Points

A logical replication slot is conceptually possible only on the primary instance, because that is where the logical decoding happens. Now, with this improvement, Patroni makes sure that the slot information is also available on the standbys, so a standby will be ready to take over connections from the subscriber.

  • This solution requires PostgreSQL 11 or above because it uses the pg_replication_slot_advance() function, available from PostgreSQL 11 onwards, for advancing the slot.
  • The downstream connection can use HAProxy so that the connection will be automatically routed to the primary (not covered in this post).
  • No modification of the PostgreSQL code or creation of any extension is required.
  • The copying of the slot happens over the PostgreSQL protocol (libpq) rather than any OS-specific tools/methods. Patroni uses the rewind or superuser credentials and the pg_read_binary_file() function to read the slot information. Source code reference.
  • Once the logical slot is created on the replica side, Patroni uses pg_replication_slot_advance() to move the slot forward.
  • The permanent slot information is added to the DCS and continuously maintained by the primary instance of the Patroni cluster. A new DCS key with the name “status” is introduced and supported across all DCS options (ZooKeeper, etcd, Consul, etc.).
  • hot_standby_feedback must be enabled on all standby nodes where the logical replication slot needs to be maintained.
  • The Patroni parameter postgresql.use_slots must be enabled to make sure that every standby node uses a slot on the primary node. A minimal configuration sketch of these settings follows.
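
The following is a minimal sketch of the related Patroni settings mentioned above (key names as per the Patroni documentation; the values are illustrative):

postgresql:
  use_slots: true
  parameters:
    wal_level: logical
    hot_standby_feedback: 'on'
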
Jan 10, 2022

MySQL 8.0 Functional Indexes


Working with hundreds of different customers, I often face similar problems around running queries. One very common problem when trying to optimize a database environment is index usage. A query that cannot use an index is usually a long-running one, consuming more memory or triggering more disk IOPS.

A very common case is when a query uses a filter condition against a column that is involved in some kind of functional expression. An index on that column can not be used.

Starting from MySQL 8.0.13 functional indexes are supported. In this article, I’m going to show what they are and how they work.

The Well-Known Problem

As already mentioned, a very common problem about index usage is when you have a filter condition against one or more columns involved in some kind of functional expression.

Let’s see a simple example.

You have a table called products containing the details of your products, including a create_time TIMESTAMP column. If you would like to calculate the average price of your products on a specific month you could do the following:

mysql> SELECT AVG(price) FROM products WHERE MONTH(create_time)=10;
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

The query returns the right value, but take a look at the EXPLAIN:

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 99015
     filtered: 100.00
        Extra: Using where

 

The query triggers a full scan of the table. Let’s create an index on create_time and check again:

mysql> ALTER TABLE products ADD INDEX(create_time);
Query OK, 0 rows affected (0.71 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> explain SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 99015
     filtered: 100.00
        Extra: Using where

 

A full scan again. The index we created is not effective. Indeed, any time an indexed column is involved in a function, the index cannot be used.

To optimize the query, the workaround is to rewrite it differently in order to isolate the indexed column from the function.

Let’s test the following equivalent query:

mysql> SELECT AVG(price) FROM products WHERE create_time BETWEEN '2019-10-01' AND '2019-11-01';
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE create_time BETWEEN '2019-10-01' AND '2019-11-01'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: range
possible_keys: create_time
          key: create_time
      key_len: 5
          ref: NULL
         rows: 182
     filtered: 100.00
        Extra: Using index condition

 

Cool, now the index is used. Rewriting the query like this was the typical suggestion.

Quite a simple solution, but it was not always possible to change the application code, for many valid reasons. So, what to do then?

 

MySQL 8.0 Functional Indexes

Starting from version 8.0.13, MySQL supports functional indexes. Instead of indexing a simple column, you can create the index on the result of any function applied to a column or multiple columns.

Long story short, now you can do the following:

mysql> ALTER TABLE products ADD INDEX((MONTH(create_time)));
Query OK, 0 rows affected (0.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

Be aware of the double parentheses. The syntax is correct: the expression must be enclosed within parentheses to distinguish it from columns or column prefixes.

Indeed the following returns an error:

mysql> ALTER TABLE products ADD INDEX(MONTH(create_time));
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'create_time))' at line 1

Let’s now check our original query and see what happens to the EXPLAIN:

mysql> SELECT AVG(price) FROM products WHERE MONTH(create_time)=10;
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

 

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ref
possible_keys: functional_index
          key: functional_index
      key_len: 5
          ref: const
         rows: 182
     filtered: 100.00
        Extra: NULL

 

The query is no longer a full scan and runs faster. The functional_index has been used, with only 182 rows examined. Awesome.

Thanks to the functional index we are no longer forced to rewrite the query.

Which Functional Indexes are Permitted

We have seen an example involving a simple function applied to a column, but you are allowed to create more complex indexes.

A functional index may contain any kind of expression, not only a single function. The following patterns are valid functional indexes:

INDEX( ( col1 + col2 ) )
INDEX( ( FUNC(col1) + col2 - col3 ) )

You can use ASC or DESC as well:

INDEX( ( MONTH(col1) ) DESC )

You can have multiple functional parts, each one included in parentheses:

INDEX( ( col1 + col2 ), ( FUNC(col2) ) )

You can mix functional with nonfunctional parts:

INDEX( (FUNC(col1)), col2, (col2 + col3), col4 )

There are also limitations you should be aware of:

  • A functional key part cannot consist of only a column name. The following is not permitted:
    INDEX( (col1), (col2) )
  • The primary key cannot include a functional key part
  • The foreign key cannot include a functional key part
  • SPATIAL and FULLTEXT indexes cannot include functional key parts
  • A functional key part cannot refer to a column prefix

Lastly, remember that a functional index is useful only to optimize queries that use the exact same expression. An index created with nonfunctional parts can instead be used to solve multiple different queries.

For example, the following conditions cannot rely on the functional index we have created:

WHERE YEAR(create_time) = 2019

WHERE create_time > '2019-10-01'

WHERE create_time BETWEEN '2019-10-01' AND '2019-11-01'

WHERE MONTH(create_time + INTERVAL 1 YEAR) = 10

All these will trigger a full scan.

Functional Index Internal

The functional indexes are implemented as hidden virtual generated columns. For this reason, you can emulate the same behavior even on MySQL 5.7 by explicitly creating the virtual column. We can test this, starting by dropping the indexes we have created so far.

mysql> SHOW CREATE TABLE products\G
*************************** 1. row ***************************
       Table: products
Create Table: CREATE TABLE `products` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `description` longtext,
  `price` decimal(8,2) DEFAULT NULL,
  `create_time` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `create_time` (`create_time`),
  KEY `functional_index` ((month(`create_time`)))
) ENGINE=InnoDB AUTO_INCREMENT=149960 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

 

mysql> ALTER TABLE products DROP INDEX `create_time`, DROP INDEX `functional_index`;
Query OK, 0 rows affected (0.03 sec)

We can try now to create the virtual generated column:

mysql> ALTER TABLE products ADD COLUMN create_month TINYINT GENERATED ALWAYS AS (MONTH(create_time)) VIRTUAL;
Query OK, 0 rows affected (0.04 sec)

Create the index on the virtual column:

mysql> ALTER TABLE products ADD INDEX(create_month);
Query OK, 0 rows affected (0.55 sec)

 

mysql> SHOW CREATE TABLE products\G
*************************** 1. row ***************************
       Table: products
Create Table: CREATE TABLE `products` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `description` longtext,
  `price` decimal(8,2) DEFAULT NULL,
  `create_time` timestamp NULL DEFAULT NULL,
  `create_month` tinyint GENERATED ALWAYS AS (month(`create_time`)) VIRTUAL,
  PRIMARY KEY (`id`),
  KEY `create_month` (`create_month`)
) ENGINE=InnoDB AUTO_INCREMENT=149960 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

 

We can now try our original query. We expect to see the same behavior as the functional index.

mysql> SELECT AVG(price) FROM products WHERE MONTH(create_time)=10;
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ref
possible_keys: create_month
          key: create_month
      key_len: 2
          ref: const
         rows: 182
     filtered: 100.00
        Extra: NULL

 

Indeed, the behavior is the same. The index on the virtual column can be used and the query is optimized.

The good news is that you can use this workaround to emulate a functional index even on 5.7, getting the same benefits. The advantage of MySQL 8.0 is that it is completely transparent: there is no need to explicitly create the virtual column.

Since the functional index is implemented as a hidden virtual column, there is no additional space needed for the data, only the index space will be added to the table.

By the way, this is the same technique used for creating indexes on JSON documents’ fields.
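
As a hedged illustration (the attributes JSON column below is hypothetical and not part of the products table used in this post), a functional index on a JSON field follows the same pattern:

ALTER TABLE products ADD COLUMN attributes JSON;
ALTER TABLE products ADD INDEX((CAST(attributes->>"$.color" AS CHAR(20)) COLLATE utf8mb4_bin));

As with any functional index, only queries repeating the exact same expression in their filter condition can take advantage of it.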

Conclusion

The functional index support is an interesting improvement you can find in MySQL 8.0. Some of the queries that required rewriting to get optimized don’t require that anymore. Just remember that only the queries having the same filter pattern can rely on the functional index. Then you need to create additional indexes or other functional indexes to improve other search patterns.

The same feature can be implemented on MySQL 5.7 with the explicit creation of a virtual generated column and the index.

For more detailed information, read the following page:

https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts

Jan 07, 2022

Configure wiredTiger cacheSize Inside Percona Distribution for MongoDB Kubernetes Operator


Nowadays we are seeing a lot of customers starting to use our Percona Distribution for MongoDB Kubernetes Operator. The Percona Kubernetes Operators are based on best practices for the configuration of a Percona Server for MongoDB replica set or sharded cluster. One of the main tunables in MongoDB is the WiredTiger cache, which defines the cache used by the storage engine, and we can set it based on our load.

In this blog post, we will see how to define the resources’ memory and set the wiredTiger cache for the shard replicaset to improve the performance of the sharded cluster.

The Necessity of WT cache

The parameter storage.wiredTiger.engineConfig.cacheSizeGB limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache. To accommodate the additional consumers of RAM, you may have to set WiredTiger’s internal cache size properly.
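
For reference, this is how that limit looks in a plain mongod configuration file (a minimal sketch; the 1.5 value is only an example):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1.5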

Starting from MongoDB 3.4, the default WiredTiger internal cache size is the larger of either:

50% of (RAM - 1 GB), or 256 MB.

For example, on a system with a total of 4GB of RAM the WiredTiger cache will use 1.5GB of RAM (0.5 * (4 GB – 1 GB) = 1.5 GB). Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache because that is more than half of the total RAM minus one gigabyte (0.5 * (1.25 GB – 1 GB) = 128 MB < 256 MB).

WT cacheSize in Kubernetes Operator

The MongoDB WiredTiger cacheSize can be tuned with the parameter storage.wiredTiger.engineConfig.cacheSizeRatio, and its default value is 0.5. As explained above, if the memory limit allocated to the system is too low, then the WT cache is set to 256MB or calculated as per the formula.

Prior to PSMDB Operator 1.9.0, the cacheSizeRatio could be tuned under the sharding section of the cr.yaml file. This is deprecated from v1.9.0 and unavailable from v1.12.0, so you have to use the cacheSizeRatio parameter available under the replsets configuration instead. The main thing you need to check before changing the cacheSize is that the resources' memory limit is also large enough for your cacheSize requirement, i.e., the section below limiting the memory:

     resources:
       limits:
         cpu: "300m"
         memory: "0.5G"
       requests:
         cpu: "300m"
         memory: "0.5G"

 

https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/psmdb/container.go#L307

This is the source code that calculates the cache size from mongod.storage.wiredTiger.engineConfig.cacheSizeRatio:

// In normal situations WiredTiger does this default-sizing correctly but under Docker
// containers WiredTiger fails to detect the memory limit of the Docker container. We
// explicitly set the WiredTiger cache size to fix this.
//
// https://docs.mongodb.com/manual/reference/configuration-options/#storage.wiredTiger.engineConfig.cacheSizeGB//

func getWiredTigerCacheSizeGB(resourceList corev1.ResourceList, cacheRatio float64, subtract1GB bool) float64 {
	maxMemory := resourceList[corev1.ResourceMemory]
	var size float64
	if subtract1GB {
		size = math.Floor(cacheRatio * float64(maxMemory.Value()-gigaByte))
	} else {
		size = math.Floor(cacheRatio * float64(maxMemory.Value()))
	}
	sizeGB := size / float64(gigaByte)
	if sizeGB < minWiredTigerCacheSizeGB {
		sizeGB = minWiredTigerCacheSizeGB
	}
	return sizeGB
}

 

Changing the cacheSizeRatio

Here, for the test, we deployed the PSMDB operator on GCP. You can refer here for the steps – https://www.percona.com/doc/kubernetes-operator-for-psmongodb/gke.html. With the latest operator v1.11.0, the sharded cluster has been started with one shard and one config server replicaSet, along with mongos pods.

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-cluster-name-cfg-0 2/2 Running 0 4m9s
my-cluster-name-cfg-1 2/2 Running 0 2m55s
my-cluster-name-cfg-2 2/2 Running 1 111s
my-cluster-name-mongos-758f9fb44-d4hnh 1/1 Running 0 99s
my-cluster-name-mongos-758f9fb44-d5wfm 1/1 Running 0 99s
my-cluster-name-mongos-758f9fb44-wmvkx 1/1 Running 0 99s
my-cluster-name-rs0-0 2/2 Running 0 4m7s
my-cluster-name-rs0-1 2/2 Running 0 2m55s
my-cluster-name-rs0-2 2/2 Running 0 117s
percona-server-mongodb-operator-58c459565b-fc6k8 1/1 Running 0 5m45s

Now log in to the shard and check the default memory allocated to the container and to the mongod instance. In the output below, the host memory size is 15GB, but the memory limit for this container is only 476MB:

rs0:PRIMARY> db.hostInfo()
{
"system" : {
"currentTime" : ISODate("2021-12-30T07:16:59.441Z"),
"hostname" : "my-cluster-name-rs0-0",
"cpuAddrSize" : 64,
"memSizeMB" : NumberLong(15006),
"memLimitMB" : NumberLong(476),
"numCores" : 4,
"cpuArch" : "x86_64",
"numaEnabled" : false
},
"os" : {
"type" : "Linux",
"name" : "Red Hat Enterprise Linux release 8.4 (Ootpa)",
"version" : "Kernel 5.4.144+"
},
"extra" : {
"versionString" : "Linux version 5.4.144+ (builder@7d732a1aec13) (Chromium OS 12.0_pre408248_p20201125-r7 clang version 12.0.0 (/var/tmp/portage/sys-devel/llvm-12.0_pre408248_p20201125-r7/work/llvm-12.0_pre408248_p20201125/clang f402e682d0ef5598eeffc9a21a691b03e602ff58)) #1 SMP Sat Sep 25 09:56:01 PDT 2021",
"libcVersion" : "2.28",
"kernelVersion" : "5.4.144+",
"cpuFrequencyMHz" : "2000.164",
"cpuFeatures" : "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities",
"pageSize" : NumberLong(4096),
"numPages" : 3841723,
"maxOpenFiles" : 1048576,
"physicalCores" : 2,
"mountInfo" : [
..
..

 

The WiredTiger cache size (in MB) allocated on the shard is as follows:

rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024
256

The cache size of 256MB is too low for a real environment. So let's see how to tune the memory limit and also the cacheSize of the WT engine. You can use the cacheSizeRatio parameter to set the WT cache ratio (out of 1) and the resources memory limit to set the memory allocated to the container. To do this, edit the cr.yaml file under the deploy directory of the operator. From PSMDB Operator v1.9.0, editing the cacheSizeRatio parameter under the mongod section is deprecated, so for the WT cache limit use the cacheSizeRatio parameter under the “replsets” section, and set the memory under the resources limits. Here we set 3G for the container and a ratio of 0.8 (80%) for the cache calculation.

deploy/cr.yaml:58

46       configuration: |
47 #        operationProfiling:
48 #          mode: slowOp
49 #        systemLog:
50 #          verbosity: 1
51         storage:
52           engine: wiredTiger
53 #          inMemory:
54 #            engineConfig:
55 #              inMemorySizeRatio: 0.9
56           wiredTiger:
57             engineConfig:
58               cacheSizeRatio: 0.8

 

deploy/cr.yaml:229-232:

226          resources:
227            limits:
228              cpu: "300m"
229              memory: "3G"
230            requests:
231              cpu: "300m"
232              memory: "3G"

 

Apply the new cr.yaml

# kubectl apply -f deploy/cr.yaml
perconaservermongodb.psmdb.percona.com/my-cluster-name configured

The shard pods are re-allocated and you can check the progress as follows:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-cluster-name-cfg-0 2/2 Running 0 36m
my-cluster-name-cfg-1 2/2 Running 0 35m
my-cluster-name-cfg-2 2/2 Running 1 34m
my-cluster-name-mongos-758f9fb44-d4hnh 1/1 Running 0 34m
my-cluster-name-mongos-758f9fb44-d5wfm 1/1 Running 0 34m
my-cluster-name-mongos-758f9fb44-wmvkx 1/1 Running 0 34m
my-cluster-name-rs0-0 0/2 Init:0/1 0 13s
my-cluster-name-rs0-1 2/2 Running 0 60s
my-cluster-name-rs0-2 2/2 Running 0 8m33s
percona-server-mongodb-operator-58c459565b-fc6k8 1/1 Running 0 38m

Now check the new settings of WT cache as follows:

rs0:PRIMARY> db.hostInfo().system
{
"currentTime" : ISODate("2021-12-30T08:37:38.790Z"),
"hostname" : "my-cluster-name-rs0-1",
"cpuAddrSize" : 64,
"memSizeMB" : NumberLong(15006),
"memLimitMB" : NumberLong(2861),
"numCores" : 4,
"cpuArch" : "x86_64",
"numaEnabled" : false
}
rs0:PRIMARY> 
rs0:PRIMARY> 
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024
1474

Here, the memory calculation for WT is roughly as follows (the memory limit should be more than 1G, else 256MB is allocated by default):
(memory limit – 1G) * cacheSizeRatio

(2861 MB – 1024 MB) * 0.8 ≈ 1470 MB, which is close to the ~1474 MB configured above (the small difference comes from GB vs. GiB conversions and the operator rounding the cache size to two decimals).

 

NOTE:

Until PSMDB Operator v1.10.0, the operator applies the change of cacheSizeRatio only if resources.limits.cpu is also set. This is a bug, and it was fixed in v1.11.0 – refer to https://jira.percona.com/browse/K8SPSMDB-603. So if you are on an older version, don't be surprised, and make sure resources.limits.cpu is set as well.

https://github.com/percona/percona-server-mongodb-operator/blob/v1.10.0/pkg/psmdb/container.go#L194

if limit, ok := resources.Limits[corev1.ResourceCPU]; ok && !limit.IsZero() {
	args = append(args, fmt.Sprintf(
		"--wiredTigerCacheSizeGB=%.2f",
		getWiredTigerCacheSizeGB(resources.Limits, replset.Storage.WiredTiger.EngineConfig.CacheSizeRatio, true),
	))
}

From v1.11.0:
https://github.com/percona/percona-server-mongodb-operator/blob/v1.11.0/pkg/psmdb/container.go#L194

if limit, ok := resources.Limits[corev1.ResourceMemory]; ok && !limit.IsZero() {
    args = append(args, fmt.Sprintf(
       "--wiredTigerCacheSizeGB=%.2f",
       getWiredTigerCacheSizeGB(resources.Limits, replset.Storage.WiredTiger.EngineConfig.CacheSizeRatio, true),
))
}

 

Conclusion

So based on the application load, you will need to set the cacheSize of WT for better performance. You can use the above methods to tune the cache size for the shard replicaset in the PSMDB operator.

Reference Links :

https://www.percona.com/doc/kubernetes-operator-for-psmongodb/operator.html

https://www.percona.com/doc/kubernetes-operator-for-psmongodb/gke.html

https://www.percona.com/doc/kubernetes-operator-for-psmongodb/operator.html#mongod-storage-wiredtiger-engineconfig-cachesizeratio

MongoDB 101: How to Tune Your MongoDB Configuration After Upgrading to More Memory

Jan 05, 2022

In Application and Database Design, Small Things Can Have a Big Impact


With modern application design, systems are becoming more diverse, varied and have more components than ever before. Developers are often forced to become master chefs adding the ingredients from dozens of different technologies and blending them together to create something tasty and amazing. But with so many different ingredients, it is often difficult to understand how the individual ingredients interact with each other. The more diverse the application, the more likely it is that some seemingly insignificant combination of technology may cause cascading effects.

Many people I talk to have hundreds if not thousands of different libraries, APIs, components, and services making up the systems they support. In this type of environment, it is very difficult to know what small thing could add up to something much bigger. Look at some of the more recent big cloud or application outages: they often have their root cause in something rather small.

Street Fight: Python 3.10 -vs- 3.9.7

Let me give you an example of something I ran across recently. I noticed that two different nodes running the same hardware and application code were performing drastically differently from one another. The app server was running close to 100% CPU on one server while the other was around 40%. Each had the same workload, the same database, etc. It turned out that one server was using Python 3.10.0 and the other was running 3.9.7. This, combined with MySQL Connector/Python, led to almost a 50% reduction in database throughput (the 3.10.0 release saw a regression). However, this performance change was not seen either in my PostgreSQL testing or in testing the mysqlclient connector. It happened only when running the pure Python version of the MySQL connector. See the results below:

Python 3.10 -vs- 3.9.7

Note: This workload was using a warmed buffer pool, with the workload running for over 24 hours before the test. I cycled the application server when changing either the version of Python or the MySQL library. These tests are repeatable regardless of the length of the run. All data fits into memory here. I am not trying to make an authoritative statement on a specific technology, merely pointing out the complexity of layers of technology.

Looking at this purely from the user perspective, this particular benchmark simulates certain types of users.  Let’s look at the number of users who could complete their actions per second.  I will also add another pure python MySQL driver to the mix.

Note: There appears to be a significant regression in 3.10. The application server was significantly busier when using Python 3.10 and one of the pure python drivers than when running the same test in 3.9 or earlier.

The main difference between MySQL Connector/Python and mysqlclient is that mysqlclient uses the C libmysqlclient library. Oddly, the official MySQL Connector says it should switch between the pure Python and C versions if available, but I was not seeing that behavior (so I have to look into it). This resulted in the page load time for the app in Python 3.10.0 taking 0.05 seconds, up from 0.03 seconds in Python 3.9.7. However, the key thing I wanted to highlight is that sometimes seemingly small or insignificant changes can lead to a drastic difference in performance and stability. You could be running along fine for months or even years before upgrading something that you would not think would drastically impact performance.

Can This Be Fixed With Better Testing?

You may think this is a poster child for testing before upgrading components, and while that is a requirement, it won’t necessarily prevent this type of issue. While technology combinations, upgrades, and releases can often have some odd side effects, oftentimes issues they introduce remain hidden and don’t manifest until some outside influence pops up. Note: These generally happen at the worst time possible, like during a significant marketing campaign, event, etc. The most common way I see nasty bugs get exposed is not through a release or in testing but often with a change in workload. Workload changes can hide or raise bottlenecks at the worst times. Let’s take a look at the above combination of Python version and different connectors with a different workload:

Here the combination of reporting and read/write workloads pushes all the app nodes and database nodes to the redline. These look fairly similar in terms of performance, but the workload is hiding the issues I mentioned above. A system pushed to the red will behave differently than it will in the real world. If you ended up testing the upgrade of Python on your app servers to 3.10.0 by pushing your systems to the max, you might see the above small regression as within acceptable limits. In reality, however, the upgrade could net you a 50% decrease in throughput when moved to production.

Do We Care?

Depending on how your application is built, many people won't notice the above-mentioned performance regression after the upgrade happens. First, most people do not run their servers even close to 100% load, so adding more load to the boxes may not immediately impact their users' performance. Adding 0.02 seconds of load time for a user may be imperceptible unless under heavy load (which would increase that load time). The practical impact is that you reach the point where you need to add more nodes or upgrade your instances sooner.

Second, scaling application nodes automatically is almost a requirement in most modern cloud-native environments. Reaching a point where you need to add more nodes and more processing power will come with increases in users on your application, so it is easily explained away.

Suppose users won't immediately notice and the system will automatically expand as needed (preventing you from knowing or getting involved). In that case, do we, or should we, care about adding more nodes or servers? Adding nodes may be cheap, but it is not free.

First, there is a direct cost to you. Take your hosting costs for your application servers and double them in the case above. What is that? $10K, $100K, $1M a year? That is money that is wasted. Look no further than the recent news lamenting the ever-increasing costs of the cloud.

Second, there is a bigger cost that comes with complexity. Observability is such a huge topic because everything we end up doing in modern environments is done en masse. The more nodes and servers you have, the more potential exists for issues or for one node behaving badly. While our goal is to create a system where everything is replaceable and can be torn down and rebuilt to overcome problems, this is often not the reality. Instead, we end up replicating bad code and underlying bottlenecks, making a single problem a problem at 100x the scale.

We need to care. Application workload is a living entity that grows, shrinks, and expands with users. While modern systems need to scale quickly up and down to meet the demand, that should not preclude us from looking out for hidden issues and bottlenecks. It is vitally important that we understand our application's workload and look for deviations from the normal patterns. We need to ask why something changed and dig in to find the answer. Just because we can build automation to mitigate a problem does not mean we should get complacent and lazy about fixing and optimizing our systems.

Jan 05, 2022

Comparing AMD EPYC Performance with Intel Xeon in GCP


Recently we were asked to check the performance of the new family of AMD EPYC processors when using MySQL in Google Cloud Virtual Machines. This was motivated by a user running MySQL in the N1 machine family who was willing to upgrade to the N2D generation, considering the potential cost savings of the new AMD family.

The idea behind the analysis is to do a side-by-side comparison of performance considering some factors: 

  • EPYC processors have demonstrated better performance in purely CPU-based operations according to published benchmarks. 
  • EPYC platform has lower costs compared to the Intel Xeon platform. 

The goal of this analysis is to check whether the cost reduction from upgrading N1 to N2D is worth the change without suffering performance problems, and eventually whether the machine size can be reduced from the current 64 cores (N1 n1-highmem-64 – Intel Haswell) to either N2D 64 cores (n2d-highmem-64 – AMD Rome) or even 48 cores (n2d-highmem-48 – AMD Rome). To provide some extra context, we included N2 (the new generation of Intel machines) in the analysis.

In order to do a purely CPU performance comparison we created 4 different VMs:

NAME: n1-64
MACHINE_TYPE: n1-highmem-64
Intel Haswell – Xeon 2.30GHz
*This VM corresponds to the same type as the type we use in Production.

NAME: n2-64
MACHINE_TYPE: n2-highmem-64
Intel Cascade Lake – Xeon 2.80GHz

NAME: n2d-48
MACHINE_TYPE: n2d-highmem-48
AMD Epyc Rome – 2.25Ghz

NAME: n2d-64
MACHINE_TYPE: n2d-highmem-64
AMD Epyc Rome – 2.25Ghz

For the analysis, we used MySQL Community Server 5.7.35-log and this is the basic configuration:

[mysqld]
datadir   = /var/lib/mysql
socket = /var/lib/mysql/mysql.sock
log-error   = /var/lib/mysql/mysqld.err
pid-file = /var/run/mysqld/mysqld.pid
server_id                       = 100
log_bin
binlog_format                   = ROW
sync_binlog                     = 1000
expire_logs_days                = 2
skip_name_resolve

innodb_buffer_pool_size         = 350G
innodb_buffer_pool_instances    = 32
innodb_concurrency_tickets      = 5000
innodb_thread_concurrency       = 128
innodb_write_io_threads         = 16
innodb_read_io_threads          = 16
innodb_flush_log_at_trx_commit  = 1
innodb_flush_method             = O_DIRECT
innodb_log_file_size            = 8G
innodb_file_per_table           = 1
innodb_autoinc_lock_mode        = 2
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup  = 1

table_open_cache                = 5000
thread_cache_size               = 2000
query_cache_size                = 0
query_cache_type                = 0

In all cases, we attached a 1TB balanced persistent disk so we get enough IO performance for the tests. We wanted to normalize all the specs so we can focus on CPU performance, so don't pay too much attention to the chances of improving performance for IO operations and so on.

The analysis is based on the sysbench OLTP read-only workload with an in-memory dataset; the reason for this is that we want to generate traffic that can saturate the CPU while not being affected by IO or memory.

The approach for the benchmark was also simple: we executed the read-only OLTP workload with 16, 32, 64, 128, and 256 threads, with a one-minute wait between runs. Scripts and results from the tests can be found here.
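
The runs were roughly of the following shape (a hedged sketch only – the actual scripts are in the linked repository, and the host, table count, and table size here are illustrative assumptions):

for threads in 16 32 64 128 256; do
  sysbench oltp_read_only --mysql-host=<primary_ip> --mysql-user=sbtest --mysql-password=<password> \
    --tables=16 --table-size=1000000 --threads=$threads --time=300 run
  sleep 60
done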

Let’s jump into the analysis. These are the numbers of queries that the instances are capable of running:

MySQL Queries

The maximum amount of TPS by Instance Type by the number of threads:

Threads/Instance    N1-64    N2-64    N2D-48    N2D-64
16                  164k     230k     144k      155k
32                  265k     347k     252k      268k
64                  415k     598k     345k      439k
128                 398k     591k     335k      444k
256                 381k     554k     328k      433k

Some observations: 

  • In all cases, we reached the maximum TPS at 64 threads; this is somewhat expected as we are not generating CPU context switches.
  • Roughly, we get a maximum of 598k TPS on n2-highmem-64 and 444k TPS on n2d-highmem-64, which are the bigger instance types. While this is expected, the Intel-based architecture outperforms AMD by about 35%.
  • Maximum TPS seems to be reached at 64 threads, which is expected considering the number of CPU threads we can use in parallel.
  • While n1-highmem-64 (Intel Xeon) and n2d-highmem-48 (AMD EPYC) seem to start suffering performance issues when the number of threads exceeds the number of cores, the bigger instances running with 64 cores are capable of sustaining the throughput a bit better; these instances start to be impacted when we reach 4x the number of CPU cores.

Let’s have a look at the CPU utilization on each node:

CPU utilization on each node

Additional observations: 

  • n1-highmem-64 and n2d-highmem-48 are reaching 100% utilization at 64 threads running. 
  • With 64 threads running, n2-highmem-64 reaches 100% utilization while n2d-highmem-64 is still below it, although Intel provides better throughput overall, probably due to its faster CPU clock (2.8GHz vs. 2.25GHz).
  • For 128 and 256 threads all CPUs show similar utilization. 

For the sake of the analysis, these are the estimated costs of each of the machines used (at the time of writing the post):
n1-highmem-64 $2,035.49/month = $0.000785297/second
n2-highmem-64 $2,549.39/month = $0.000983561/second
n2d-highmem-48 $1,698.54/month = $0.000655301/second
n2d-highmem-64 $2,231.06/month = $0.000860748/second

At peak TPS, the costs above give us roughly:
n1-highmem-64 costs are $0.0000000019/trx
n2-highmem-64 costs are $0.0000000016/trx
n2d-highmem-48 costs are $0.0000000019/trx
n2d-highmem-64 costs are $0.0000000019/trx
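
As a worked example using the numbers above: n1-highmem-64 at its 64-thread peak is $0.000785297 per second divided by 415,000 transactions per second, which is roughly $0.0000000019 per transaction; the other values are computed the same way.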

Conclusions

While this is not an exhaustive analysis of all the implications of CPU performance for MySQL workloads, we get a very good understanding of the cost vs. performance trade-off.

  • The n1 family, currently used in production, shows very similar performance to the n2d family (AMD) when running with the same number of cores. This changes a lot when we move to the n2 family (Intel), which outperforms all other instances.
  • While the cut in costs from moving to n2d-highmem-48 would represent ~$4k/year, the performance penalty is close to 20%.
  • Comparing the costs per trx at peak load, we can see that n2-64 and n2d-64 are pretty much the same, but n2-64 will give us 35% more throughput; this is definitely something to consider if we plan to squeeze the CPU power.
  • If the consideration is to move to the newer generation, then the n2d-highmem-64 is a very good choice to balance performance and costs, but the n2-highmem-64 will give much better performance per dollar spent.
Jan 03, 2022

MongoDB – Converting Replica Set to Standalone


One of the reasons for using MongoDB is its simplicity in scaling its structure, either as a Replica Set (scale-up) or as a Sharded Cluster (scale-out).

Although a bit tricky in Sharded environments, the reverse process is also feasible.

From the possibilities, you can:

So far, so good. But what if, for some reason, the mongo project that started as a Replica Set no longer requires such a structure, and a Standalone server now meets the requirements?

  • Can you shrink from Replica Set to a Standalone server? 
    • If possible, what is the procedure to achieve this configuration?

This blog post will walk through the necessary steps to convert a Replica Set into a Standalone server.

Observations:

Before starting the process, it is important to mention that it's not advised to run a Standalone configuration on production servers, as your entire availability will rely on a single point of failure.

Another observation: this activity requires downtime, because you will need to remove the replication parameter, which cannot be done at runtime.

Last note: to avoid any data changes during the procedure, it's recommended to stop writes from the application. You will also need to adjust the connection string at the end of the process for the new configuration.

How To:

The testing environment is a standard Replica Set configuration with three nodes, as seen at the following replication status.

[
	{
		"_id" : 0,
		"name" : "node1:27017",
		"stateStr" : "PRIMARY"
	},
	{
		"_id" : 1,
		"name" : "node2:27017",
		"stateStr" : "SECONDARY"
	},
	{
		"_id" : 2,
		"name" : "node3:27017",
		"stateStr" : "SECONDARY"
	}
]

  • The PRIMARY is our reliable data source; thus, it will be the Standalone server at the end of the process.

There is no need to trigger any election process unless you have a preferred node to become the Standalone. If that's your case and that node is not the PRIMARY yet, please make sure to make it PRIMARY before starting the procedure.
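
A hedged sketch of that step (assuming the preferred node already has the highest priority, so it wins the election once the current primary steps aside):

rs0:PRIMARY> rs.stepDown(120)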

That said, let’s get our hands dirty!

1) Removing the secondaries:

  • From the PRIMARY node, let’s remove nodes 2 and 3:
rs0:PRIMARY> rs.remove("node2:27017")
{
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1640109346, 1),
		"signature" : {
			"hash" : BinData(0,"663Y5u3GkZkQ0Ci7EU8IYUldVIM="),
			"keyId" : NumberLong("7044208418021703684")
		}
	},
	"operationTime" : Timestamp(1640109346, 1)
}

rs0:PRIMARY> rs.remove("node3:27017")
{
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1640109380, 1),
		"signature" : {
			"hash" : BinData(0,"RReAzLZmHWfzVcdnLoGJ/uz04Vo="),
			"keyId" : NumberLong("7044208418021703684")
		}
	},
	"operationTime" : Timestamp(1640109380, 1)
}

Please note: Although my Replica Set comprises three nodes, you must leave only the PRIMARY in the replication at this step. If you have more nodes, the process is the same.

Then, the expected status should be:

rs0:PRIMARY> rs.status().members
[
	{
		"_id" : 0,
		"name" : "node1:27017",
		"health" : 1,
		"state" : 1,
		"stateStr" : "PRIMARY",
		"uptime" : 1988,
		"optime" : {
			"ts" : Timestamp(1640109560, 1),
			"t" : NumberLong(1)
		},
		"optimeDate" : ISODate("2021-12-21T17:59:20Z"),
		"syncSourceHost" : "",
		"syncSourceId" : -1,
		"infoMessage" : "",
		"electionTime" : Timestamp(1640107580, 2),
		"electionDate" : ISODate("2021-12-21T17:26:20Z"),
		"configVersion" : 5,
		"configTerm" : 1,
		"self" : true,
		"lastHeartbeatMessage" : ""
	}
]

 

2) Changing the configuration:

  • Even though node2 and node3 are no longer part of the replication, let's stop their mongod processes, keeping their data safe, as it can be used in a recovery scenario if necessary:
node2-shell> systemctl stop mongod
node3-shell> systemctl stop mongod

  • Then, you can remove the replication parameter from the PRIMARY and restart it. But first, in the configuration file:
    • Please remove or comment out the lines:
#replication:
#  replSetName: "rs0"

If you start MongoDB via the command line, please remove the option  --replSet.

  • Once changed, you can restart the mongod on PRIMARY:
node1-shell> systemctl restart mongod

3) Removing replication objects:

  • Once connected to the former PRIMARY, now a Standalone, MongoDB will warn you with the following message:
2021-12-21T18:23:52.056+00:00: Document(s) exist in 'system.replset', but started without --replSet. Database contents may appear inconsistent with the writes that were visible when this node was running as part of a replica set. Restart with --replSet unless you are doing maintenance and no other clients are connected. The TTL collection monitor will not start because of this. For more info see http://dochub.mongodb.org/core/ttlcollections

 

That is because the node was working under a Replica Set configuration, and although it's a Standalone now, the old structure still exists.

For further detail, that structure resides in the local database, which holds all the data necessary for replication and for the instance itself:

mongo> use local
switched to db local

mongo> show collections
oplog.rs
replset.election
replset.minvalid
startup_log
system.replset

 

To remove that warning, you should remove the old Replica Set object on local.system.replset.

local.system.replset holds the replica set’s configuration object as its single document. To view the object’s configuration information, issue rs.conf() from mongosh. You can also query this collection directly.

However, it's important to mention that local.system.replset is a system collection, and there is no built-in role that you can use to remove its documents – unless you create a custom role that grants anyAction on anyResource to the user, or you grant the internal __system role to a user.


Please note, both of these options give access to every resource in the system and are intended for internal use. Do not use these permissions other than in exceptional circumstances.

For this process, we will create a new temporary user with the __system privilege, which we can drop later after the action.

3. a) Creating the temporary user:

mongo> db.getSiblingDB("admin").createUser({user: "dba_system", "pwd": "sekret","roles" : [{ "db" : "admin", "role" : "__system" }]});

Confirmation output:

Successfully added user:{
   "user":"dba_system",
   "roles":[
      {
         "db":"admin",
         "role":"__system"
      }
   ]
}

3. b) Removing the object:
  1. First; Connect with the created user.
  2. Then:
mongo> use local;
mongo> db.system.replset.remove({})

Confirmation output:

WriteResult({ "nRemoved" : 1 })

 

3. c) Dropping the temporary user:
  1. Disconnect from that system user.
  2. Then:
mongo> db.dropUser("dba_system")

 

4) Restart the service:

shell> systemctl restart mongod

 

Once reconnected, you will not see that warning again, and the server is ready!

 

5) Connection advice:

  • Please make sure to adjust the connection string in the app and clients, passing only the Standalone server; for example:
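
A hedged illustration using the hostname from this test (database name and options are placeholders):

# Before (Replica Set):
mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0
# After (Standalone):
mongodb://node1:27017/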

 

Conclusion

The conversion process, in general, is simple and should not run into problems. But it is essential to highlight that we strongly recommend testing this procedure in a lower environment first. And if you plan to use it in production, please bear in mind the observations mentioned above.

And if you have any questions, feel free to contact us in the comment section below!

 

See you!

 

Dec 30, 2021

MongoDB Config Server Upgrade From SCCC to CSRS ReplicaSet


Recently, I have had some of our customers doing an upgrade from v3.0 to v4.x (yeah, we still see older MongoDB versions, and we suggest everyone upgrade to stay on the current GA!). There were some pain points in the upgrade process; in particular, before migrating to v3.4, you need to change from the SCCC config setup to the CSRS replicaSet setup. Interestingly, I did some tests in that process and would like to share them here. Even though this topic is pretty outdated, it will be beneficial to people out there who still need to migrate from an older version and are stuck at this config server upgrade.

In this blog post, we will see how to migrate a mirrored config server configuration (SCCC) to the replicaSet configuration (CSRS), along with the storage engine change from MMAPv1 to WiredTiger. From MongoDB 3.2.4, the replicaSet configuration is supported for config servers too, and support for the SCCC (Sync Cluster Connection Configuration) setup is removed in MongoDB 3.4. So it is important to make sure the config servers are configured as a replicaSet before upgrading to MongoDB 3.4. Also, the config servers must run on WiredTiger when configured as a replicaSet. There are two ways to do this – one is to replace the existing config servers in place, and the other is to migrate to a new set of servers/ports. Let's see them in the following topics.

Migrate From SCCC to CSRS on the Different Hosts/Ports:

This section gives a brief overview of migrating the existing mirrored config servers to different hosts or ports. It is always recommended to back up your config servers before these steps and to schedule downtime to avoid any writes during this crucial step. The test sharded cluster setup runs on Percona Server for MongoDB (PSMDB) v3.2.22 with SCCC set up on the following ports:

47080 – mongos
47087, 47088, 47089 – mongoc – config servers with mmapv1 engine (sccc)
47081, 47082, 47083 – shard01 replicaset
47084, 47085, 47086 – shard02 replicaset

New Config servers to migrate (for testing purposes, using localhost but different ports to migrate):

47090, 47091, 47092 – new mongoc (csrs)
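
Before making any changes, the config metadata can be dumped as a safety net – a hedged example using one of this test setup's config server ports (the output path is illustrative):

$ mongodump --host localhost --port 47087 --db config --out /backup/configdb-before-csrs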

Now stop the balancer:

mongos> sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...
WriteResult({ "nMatched" : 0, "nUpserted" : 1, "nModified" : 0, "_id" : "balancer" })
mongos> 
mongos> sh.getBalancerState()
false

 

Initiate the replicaSet in one of the config servers (here using config db running on 47087). Use this member as a base for migrating from SCCC to CSRS:

$ mongo32 localhost:47087/admin
MongoDB shell version: 3.2.21-2-g105acca0d4
connecting to: localhost:47087/admin
configsvr> 
configsvr> rs.initiate({ _id: "configRepl", configsvr: true, version:1, members: [ {_id:0, host: "localhost:47087"} ] } )
{ "ok" : 1 }

 

Start the new config DB instances using the same replicaset name used in the above initiate command:

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/configrepl/rs1/db/ --logpath /home/balaguru/mongodb/testshard32/data/configrepl/rs1/mongod.log --port 47090 --fork --configsvr --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 476923
child process started successfully, parent exiting

 

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/configrepl/rs2/db/ --logpath /home/balaguru/mongodb/testshard32/data/configrepl/rs2/mongod.log --port 47091 --fork --configsvr --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 478193
child process started successfully, parent exiting

 

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/configrepl/rs3/db/ --logpath /home/balaguru/mongodb/testshard32/data/configrepl/rs3/mongod.log --port 47092 --fork --configsvr --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 478254
child process started successfully, parent exiting

 

Now restart 47087 with the options --replSet and --configsvrMode, which allow this mirrored config DB to be part of the replicaSet while still running the MMAPv1 engine.

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/config/db --logpath /home/balaguru/mongodb/testshard32/data/config/mongod.log --port 47087 --fork --configsvr --storageEngine mmapv1 --configsvrMode sccc --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 477682
child process started successfully, parent exiting

 

Add the new config DB instances to this replicaSet (set priority/votes 0 to avoid election):

configRepl:PRIMARY> rs.add({host: "localhost:47090", priority:0, votes:0})
{ "ok" : 1 }

configRepl:PRIMARY> rs.add({host: "localhost:47091", priority:0, votes:0})
{ "ok" : 1 }

configRepl:PRIMARY> rs.add({host: "localhost:47092", priority:0, votes:0})
{ "ok" : 1 }

configRepl:PRIMARY> rs.status()
{
        "set" : "configRepl",
        "date" : ISODate("2021-11-23T08:11:13.065Z"),
        "myState" : 1,
        "term" : NumberLong(1),
        "configsvr" : true,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "localhost:47087",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 154,
                        "optime" : {
                                "ts" : Timestamp(1637655072, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:12Z"),
                        "electionTime" : Timestamp(1637654919, 1),
                        "electionDate" : ISODate("2021-11-23T08:08:39Z"),
                        "configVersion" : 4,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "localhost:47090",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 69,
                        "optime" : {
                                "ts" : Timestamp(1637655069, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:09Z"),
                        "lastHeartbeat" : ISODate("2021-11-23T08:11:11.608Z"),
                       "lastHeartbeatRecv" : ISODate("2021-11-23T08:11:11.611Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "localhost:47087",
                        "configVersion" : 4
                },
                {
                        "_id" : 2,
                        "name" : "localhost:47091",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 6,
                        "optime" : {
                                "ts" : Timestamp(1637655069, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:09Z"),
                        "lastHeartbeat" : ISODate("2021-11-23T08:11:11.608Z"),
                        "lastHeartbeatRecv" : ISODate("2021-11-23T08:11:09.609Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "localhost:47090",
                        "configVersion" : 4
                },
                {
                        "_id" : 3,
                        "name" : "localhost:47092",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 3,
                        "optime" : {
                                "ts" : Timestamp(1637655069, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:09Z"),
                        "lastHeartbeat" : ISODate("2021-11-23T08:11:11.608Z"),
                        "lastHeartbeatRecv" : ISODate("2021-11-23T08:11:11.620Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "localhost:47091",
                        "configVersion" : 4
                }
        ],
        "ok" : 1
}

 

Shut down one of the other mirrored config servers (either 47088 or 47089). If one of the mirrored DBs is down during the migration, the sharded cluster goes into read-only mode, i.e., DDL statements will fail (as shown in the example below). So it is advisable to perform this activity during a maintenance window to avoid unexpected errors from the application side.
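
A sketch of that shutdown, assuming the 3.2 mongo shell is on the path (db.shutdownServer() is the standard helper and must be run against the admin database):

$ mongo localhost:47088/admin --eval "db.shutdownServer()"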

// try to create new db and collection as follows to test when 47088 is down
mongos> use vinodh  
switched to db vinodh
mongos> db.testCreateColl.insert({id:1})
WriteResult({
        "writeError" : {
                "code" : 13663,
                "errmsg" : "unable to target write op for collection vinodh.testCreateColl :: caused by :: Location13663: exception creating distributed lock vinodh/balaguru-UbuntuPC:47080:1637654366:1487823871 :: caused by :: SyncClusterConnection::update prepare failed:  localhost:47088 (127.0.0.1) failed:9001 socket exception [CONNECT_ERROR] server [couldn't connect to server localhost:47088, connection attempt failed] "
        }
})

 

Once the replicaSet is in sync, load the current configuration (cfg = rs.conf()) and change the priority and votes of the other nodes so they can participate in elections:

configRepl:PRIMARY> cfg.members[1].host
localhost:47090
configRepl:PRIMARY> cfg.members[1].priority
0
configRepl:PRIMARY> cfg.members[1].priority = 1
1
configRepl:PRIMARY> cfg.members[1].votes 
0
configRepl:PRIMARY> cfg.members[1].votes = 1
1
configRepl:PRIMARY> cfg.members[2].priority = 1
1
configRepl:PRIMARY> cfg.members[3].priority = 1
1
configRepl:PRIMARY> cfg.members[2].votes = 1
1
configRepl:PRIMARY> cfg.members[3].votes = 1
1
configRepl:PRIMARY> rs.reconfig(cfg)
{ "ok" : 1 }

 

Now hand over the PRIMARY role to one of the new members and make 47087 a SECONDARY:

configRepl:PRIMARY> rs.stepDown()
2021-11-23T13:58:37.885+0530 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host 'localhost:47087'  :
DB.prototype.runCommand@src/mongo/shell/db.js:135:1
DB.prototype.adminCommand@src/mongo/shell/db.js:152:1
rs.stepDown@src/mongo/shell/utils.js:1202:12
@(shell):1:1

2021-11-23T13:58:37.887+0530 I NETWORK  [thread1] trying reconnect to localhost:47087 (127.0.0.1) failed
2021-11-23T13:58:37.887+0530 I NETWORK  [thread1] reconnect localhost:47087 (127.0.0.1) ok
configRepl:SECONDARY>

 

Restart the old config member 47087 without the --configsvrMode option. Here 47087 will go into the REMOVED state, which is expected since it still runs the MMAPv1 engine and is no longer started with the configsvrMode option.
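
A sketch of that restart, based on the earlier startup command for 47087 (same dbpath and MMAPv1 engine, only the --configsvrMode option dropped):

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/config/db --logpath /home/balaguru/mongodb/testshard32/data/config/mongod.log --port 47087 --fork --configsvr --storageEngine mmapv1 --replSet configRepl

Then restart mongos with the new --configdb option, listing only the new config servers (--configdb "configRepl/localhost:47090,localhost:47091,localhost:47092"):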

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongos --logpath /home/balaguru/mongodb/testshard32/data/mongos.log --port 47080 --configdb "configRepl/localhost:47090,localhost:47091,localhost:47092" --fork

 

Now 47087 is ready to be removed from the replicaSet finally:

configRepl:PRIMARY> rs.remove("localhost:47087")
{ "ok" : 1 }

You can then check the shard status and the data through mongos. Once everything looks good, stop the other mirrored config DBs on ports 47088 and 47089, and then re-enable the balancer.
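
A sketch of those final steps (sh.status() and sh.startBalancer() are the standard mongo shell helpers; the shutdown commands assume the old mirrored servers are still listening on ports 47088 and 47089):

// through mongos on port 47080: verify the shard metadata and data
mongos> sh.status()

// stop the old mirrored config servers (same shutdown helper as above)
$ mongo localhost:47088/admin --eval "db.shutdownServer()"
$ mongo localhost:47089/admin --eval "db.shutdownServer()"

// finally, re-enable the balancer
mongos> sh.startBalancer()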

 

Migrate From SCCC to CSRS on the Same Hosts:

This section gives a brief overview of reusing the existing mirrored hosts/ports. It summarizes the steps up to enabling the replicaSet on one of the mirrored config servers. For this test case, the sharded cluster runs PSMDB v3.2.22 with SCCC set up on the following ports:

27100 – mongos
27107, 27108, 27109 – mongoc – config servers with mmapv1 engine
27101, 27102, 27103 – shard01 replicaset
27104, 27105, 27106 – shard02 replicaset

Here, the instance running on port 27107 is used to start the migration. As in the method above (a sketch of these two steps follows the list):

  • Initiate the replicaSet
  • Restart the instance with the --replSet and --configsvrMode options
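
A sketch of those two steps, following the same pattern as the previous section (the binary and dbpath below are the ones used later in this walkthrough for 27107):

// initiate the replicaSet from a shell connected to 27107
> rs.initiate({ _id: "csReplSet", configsvr: true, version: 1, members: [ {_id: 0, host: "localhost:27107"} ] })

// restart 27107, keeping MMAPv1 and adding the --replSet and --configsvrMode options
$ /home/vinodh.krishnaswamy/mongo/mongodb-linux-x86_64-3.2.22/bin/mongod --dbpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config/db --logpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config/mongod.log --port 27107 --fork --configsvr --storageEngine mmapv1 --configsvrMode sccc --replSet csReplSet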

The replicaSet status on the 27107 instance:

csReplSet:PRIMARY> rs.status()
{
 "set" : "csReplSet",
 "date" : ISODate("2021-10-07T09:04:20.266Z"),
 "myState" : 1,
 "term" : NumberLong(1),
 "configsvr" : true,
 "heartbeatIntervalMillis" : NumberLong(2000),
 "members" : [
  {
   "_id" : 0,
   "name" : "localhost:27107",
   "health" : 1,
   "state" : 1,
   "stateStr" : "PRIMARY",
   "uptime" : 9,
   "optime" : {
    "ts" : Timestamp(1633597452, 1),
    "t" : NumberLong(1)
   },
   "optimeDate" : ISODate("2021-10-07T09:04:12Z"),
   "infoMessage" : "could not find member to sync from",
   "electionTime" : Timestamp(1633597451, 1),
   "electionDate" : ISODate("2021-10-07T09:04:11Z"),
   "configVersion" : 1,
   "self" : true
  }
 ],
 "ok" : 1
}

 

Then stop the second config instance (27108) and clear its dbpath to remove the MMAPv1 files. Start it with the replicaSet and WiredTiger engine options as below:

$ rm -rf data/config2/db/*

$ /home/vinodh.krishnaswamy/mongo/mongodb-linux-x86_64-3.2.22/bin/mongod --dbpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config2/db --logpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config2/mongod.log --port 27108 --fork --configsvr  --wiredTigerCacheSizeGB 1  --replSet csReplSet
about to fork child process, waiting until server is ready for connections.
forked process: 42341
child process started successfully, parent exiting

At this point, as mentioned earlier, remember that the sharded cluster goes into read-only mode when an existing mirrored instance is down or unreachable. So don’t be surprised if you see DDL errors in this situation.

Now add 27108 to the replicaset from the PRIMARY (27107) with priority: 0 and votes: 0 so that this server does not take part in elections:

$ mongo32 localhost:27107
MongoDB shell version: 3.2.22
connecting to: localhost:27107/test
csReplSet:PRIMARY> rs.add({host: "localhost:27108", priority: 0, votes: 0 })
{ "ok" : 1 }

 

Then repeat the same steps for the third config server, 27109: remove its DB files, restart it with the --replSet and WiredTiger options, and add it to the replicaSet from the PRIMARY:

csReplSet:PRIMARY> rs.add({host: "localhost:27109", priority:0 , votes: 0})
{ "ok" : 1 }

 

Once they are in sync and the replicaSet looks healthy, you can adjust the priority and votes of the added members (27108, 27109) as follows to allow them to participate in elections:

csReplSet:PRIMARY> cfg = rs.conf()
{
 "_id" : "csReplSet",
 "version" : 3,
 "configsvr" : true,
 "protocolVersion" : NumberLong(1),
 "members" : [
  {
   "_id" : 0,
   "host" : "localhost:27107",
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 1,
   "tags" : {
   },
   "slaveDelay" : NumberLong(0),
   "votes" : 1
  },
  {
   "_id" : 1,
   "host" : "localhost:27108",
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 0,
   "tags" : {
   },
   "slaveDelay" : NumberLong(0),
  "votes" : 0
  },
  {
   "_id" : 2,
   "host" : "localhost:27109",
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 0,
   "tags" : {
   },
   "slaveDelay" : NumberLong(0),
   "votes" : 0
  }
 ],

 "settings" : {
  "chainingAllowed" : true,
  "heartbeatIntervalMillis" : 2000,
  "heartbeatTimeoutSecs" : 10,
  "electionTimeoutMillis" : 10000,
  "getLastErrorModes" : {
  },
  "getLastErrorDefaults" : {
   "w" : 1,
   "wtimeout" : 0
  },
  "replicaSetId" : ObjectId("615eb78fdb1ed4092c166786")
 }
}

csReplSet:PRIMARY> cfg.members[1].priority = 1
1
csReplSet:PRIMARY> cfg.members[2].priority = 1
1
csReplSet:PRIMARY> cfg.members[1].votes = 1
1
csReplSet:PRIMARY> cfg.members[2].votes = 1
1
csReplSet:PRIMARY> cfg.members[2].votes
1
csReplSet:PRIMARY> rs.reconfig(cfg)
{ "ok" : 1 }

 

Then go to the first config server (27107) and step it down so another member becomes PRIMARY. You can then bring it down, clear its data, and change its storage engine to WiredTiger as above, this time without the --configsvrMode option. When it rejoins, it performs an initial sync from its syncSource in the replicaSet.

csReplSet:PRIMARY> rs.stepDown()
2021-10-07T05:14:26.617-0400 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host 'localhost:27107'  :
DB.prototype.runCommand@src/mongo/shell/db.js:135:1
DB.prototype.adminCommand@src/mongo/shell/db.js:153:16
rs.stepDown@src/mongo/shell/utils.js:1202:12
@(shell):1:1

2021-10-07T05:14:26.619-0400 I NETWORK  [thread1] trying reconnect to localhost:27107 (127.0.0.1) failed
2021-10-07T05:14:26.619-0400 I NETWORK  [thread1] reconnect localhost:27107 (127.0.0.1) ok
csReplSet:SECONDARY>

 

As a reminder, if you forget to change the storage engine to WiredTiger and restart without the --configsvrMode option, the node will go into the REMOVED state, so please make note of it. (Here it was purposefully started this way as an example.)

$ mongo32 localhost:27107
MongoDB shell version: 3.2.22
connecting to: localhost:27107/test
csReplSet:REMOVED>

 

Now, bring down the DB instance 27107, clear its DB files, and start it back with the WiredTiger and --replSet options (dropping the --configsvrMode option this time):

$ rm -rf data/config/db/*

$ /home/vinodh.krishnaswamy/mongo/mongodb-linux-x86_64-3.2.22/bin/mongod --dbpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config/db --logpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config/mongod.log --port 27107 --fork --configsvr --wiredTigerCacheSizeGB 1 --replSet csReplSet
about to fork child process, waiting until server is ready for connections.
forked process: 35202
child process started successfully, parent exiting

 

Then restart the mongos instances with the --configdb option below for the new config servers to take effect:

mongos <options> --configdb csReplSet/localhost:27107,localhost:27108,localhost:27109

From the mongos logfile, you will see the below message if the config server replicaset was connected successfully:

2021-10-07T05:48:25.679-0400 I NETWORK  [conn6] Successfully connected to csReplSet/localhost:27107,localhost:27108,localhost:27109 (1 connections now open to csReplSet/localhost:27107,localhost:27108,localhost:27109 with a 0 second timeout)

Conclusion

We at Percona don’t recommend running EOL versions, as they no longer receive bug fixes and you cannot use recently added features. We hope this blog helps those who are still looking to migrate from older versions.

Dec
27
2021
--

Backup Performance Comparison: mysqldump vs MySQL Shell Utilities vs mydumper vs mysqlpump vs XtraBackup

MySQL Backup Performance Comparison

In this blog post, we will compare the performance of performing a backup from a MySQL database using mysqldump, the MySQL Shell feature called Instance Dump, mysqlpump, mydumper, and Percona XtraBackup. All these available options are open source and free to use for the entire community.

To start, let’s see the results of the test.

Benchmark Results

The benchmark was run on an m5dn.8xlarge instance, with 128GB of RAM, 32 vCPUs, and 2x NVMe disks of 600GB each (one for the backup and the other for the MySQL data). The MySQL version was 8.0.26, configured with an 89GB buffer pool, 20GB of redo log, and a sample database of 177GB (more details below).

We can observe the results in the chart below:

MySQL Backup Results

And if we analyze the chart only for the multi-threaded options:

multi-threaded options

As we can see, for each tool I ran the command three times in order to experiment with 16, 32, and 64 threads. The exception is mysqldump, which does not have a parallel option and only runs in single-threaded mode.

We can observe interesting outcomes:

  1. When using zstd compression, mydumper really shines in terms of performance. This option was added not long ago (MyDumper 0.11.3).
  2. When mydumper uses gzip instead, MySQL Shell becomes the fastest backup option.
  3. In third place, we have Percona XtraBackup.
  4. mysqlpump is the fourth fastest, followed closely by mydumper when using gzip.
  5. mysqldump is the classic old-school way to perform dumps and is the slowest of all the tools compared.
  6. In a server with more CPUs, the potential parallelism increases, giving even more advantage to the tools that can benefit from multiple threads.

Hardware and Software Specs

These are the specs of the benchmark:

  • 32 CPUs
  • 128GB Memory
  • 2x NVMe disks 600 GB
  • Centos 7.9
  • MySQL 8.0.26
  • MySQL shell 8.0.26
  • mydumper 0.11.5 – gzip
  • mydumper 0.11.5 – zstd
  • Xtrabackup 8.0.26

The my.cnf configuration:

[mysqld]
innodb_buffer_pool_size = 89G
innodb_log_file_size = 10G

Performance Test

For the test, I used sysbench to populate MySQL. To load the data, I chose the tpcc method:

$ ./tpcc.lua  --mysql-user=sysbench --mysql-password='sysbench' --mysql-db=percona --time=300 --threads=64 --report-interval=1 --tables=10 --scale=100 --db-driver=mysql prepare

Before starting the comparison, I ran mysqldump once and discarded the results to warm up the cache; otherwise, our test would be biased because the first backup would have to fetch data from the disk and not from the cache.
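
A sketch of such a warm-up run, discarding the output (assuming the same dump options as the timed run below):

$ mysqldump --all-databases --max-allowed-packet=4294967295 --single-transaction > /dev/null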

With everything set, I started the mysqldump with the following options:

$ time mysqldump --all-databases --max-allowed-packet=4294967295 --single-transaction -R --master-data=2 --flush-logs | gzip > /backup/dump.dmp.gz

For the Shell utility:

$ mysqlsh
MySQL JS > shell.connect('root@localhost:3306');
MySQL localhost:3306 ssl test JS > util.dumpInstance("/backup", {ocimds: true, compatibility: ["strip_restricted_grants","ignore_missing_pks"],threads: 16})

For mydumper:

$ time mydumper  --threads=16  --trx-consistency-only --events --routines --triggers --compress --outputdir /backup/ --logfile /backup/log.out --verbose=2

PS: To use zstd, there are no changes in the command line, but you need to download the zstd binaries.

For mysqlpump:

$ time mysqlpump --default-parallelism=16 --all-databases > backup.out

For xtrabackup:

$ time xtrabackup --backup --parallel=16 --compress --compress-threads=16 --datadir=/mysql_data/ --target-dir=/backup/

Analyzing the Results

And what do the results tell us?

Parallel methods have similar performance throughput. The mydumper tool cut the execution time by 50% when using zstd instead of gzip, so the compression method makes a big difference when using mydumper.

For the util.dumpInstance utility, one advantage is that the tool stores data in both binary and text format and uses zstd compression by default. Like mydumper, it uses multiple files to store the data and has a good compression ratio. 

XtraBackup got third place with a few seconds of difference from MySQL shell. The main advantage of XtraBackup is its flexibility, providing PITR and encryption for example. 

Next, mysqlpump is more efficient than mydumper with gzip, but only by a small margin. Both are logical backup methods and work in a similar way. I tested mysqlpump with zstd compression, but the results were the same, which is why I didn’t add it to the chart. One possible explanation is that mysqlpump streams the data to a single file.
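
For reference, that zstd test can be approximated by piping mysqlpump’s single output stream through the zstd binary (a sketch; zstd has to be installed separately):

$ time mysqlpump --default-parallelism=16 --all-databases | zstd > backup.out.zst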

Lastly, mysqldump has the most predictable behavior, with similar execution times across different runs. The lack of parallelism and built-in compression is a disadvantage for mysqldump; however, since it has been present since the earliest MySQL versions, it is still used as a logical backup method, as we see in Percona support cases.

Please leave in the comments below what you thought about this blog post, if I missed something, or if it helped you. I will be glad to discuss it!

Useful Resources

Finally, you can reach us through social networks, our forum, or by accessing our material online.

Dec
09
2021
--

A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade

Percona XtraDB Cluster Non-Blocking Operation

Percona XtraDB Cluster 8.0.25 (PXC) has introduced a new option to perform online schema modifications: NBO (Non-Blocking Operation).

When using PXC, the cluster relies on the wsrep_OSU_method parameter to define the Online Schema Upgrade (OSU) method the node uses to replicate DDL statements. 

Until now, we normally have three options:

  • Use Total Order Isolation (TOI, the default)
  • Use Rolling Schema Upgrade (RSU)
  • Use Percona’s online schema change tool (TOI + PTOSC)

Each method has some positive and negative aspects. TOI locks the whole cluster, preventing it from accepting data modifications for the entire time it takes to perform the DDL operation. RSU will misalign the schema definition between the nodes, and in any case, the node performing the DDL operation is still locked. Finally, TOI+PTOSC relies on creating triggers and copying data, so in some cases this can be very impactful. 

The new Non-Blocking Operation (NBO) method helps to reduce the impact on the cluster and makes it easier to perform some DDL operations.

At the moment we only support a limited set of operations with NBO like:

  • ALTER INDEX
  • CREATE INDEX
  • DROP INDEX

Any other command will result in an error message ER_NOT_SUPPORTED_YET.
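
For example, with the session set to NBO, a DDL that is not an index operation is expected to be rejected (a sketch; the added column is only an illustration):

DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD COLUMN note varchar(10);
-- expected to fail with ER_NOT_SUPPORTED_YET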

But let us see how it works and what the impact is, while also comparing it with the default method, TOI.

What we will do is work with five connections:

1 – to perform the DDL
2 – to insert data into the table being altered
3 – to insert data into a different table
4-5 – to check the operations on the other two nodes

PXC must be at least Version 8.0.25-15.1.

The table we will modify is:

DC1-1(root@localhost) [windmills_s]>show create table windmills_test\G
*************************** 1. row ***************************
       Table: windmills_test
Create Table: CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB AUTO_INCREMENT=8199260 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
1 row in set (0.00 sec)

And contains ~five million rows.

DC1-1(root@localhost) [windmills_s]>select count(*) from windmills_test;
+----------+
| count(*) |
+----------+
|  5002909 |
+----------+
1 row in set (0.44 sec)

The Commands

Connection 1:

ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
  ALTER TABLE windmills_test drop INDEX idx_1, ALGORITHM=INPLACE;

 

Connection 2:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connection 3:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills8  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connections 4-5:

while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_s)"|grep -i -v localhost;sleep 1;done

Operations

  • Start the inserts from connections 2 and 3
  • Start the monitoring commands in connections 4 and 5 on the other nodes
  • Execute: 
    • For TOI 
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=TOI;
    • For NBO
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
    • For both
      • DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=shared;

Let’s Run It

Altering a Table with TOI

DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
Query OK, 0 rows affected (1 min 4.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

 

Inserts in the altering table (connection 2):

.450
.492
64.993 <--- Alter blocks all inserts on the table we are altering
.788
.609

 

Inserts on the other table (connection 3):

.455
.461
64.161 <--- Alter blocks all inserts on all the other tables as well
.641
.483

 

On the other nodes at the same time of the ALTER we can see:

Id  User             db         Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
15	system user		windmills_s	Query	102	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	102238	0	       0    <--- time from start

In short, we have the whole cluster locked for ~64 seconds. During this period of time, all the operations to modify data or structure were on hold. 

Let’s Now Try With NBO

Inserts in the altering table:

.437
.487
120.758 <---- Execution time increase
.617
.510

 

Inserts on the other table:

.468
.485
25.061 <---- still a metalock, but not locking the other tables for the whole duration 
.494
.471

 

On the other nodes at the same time of the ALTER we can see:

Id      User         db             Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
110068	system user	 windmills_s	Connect	86	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	 120420	 0	          0

 

In this case, what is also interesting to note is that:

  1. We have a moment of metalock:
    1. 110174 pmm 127.0.0.1:42728 NULL Query 2 Waiting for table metadata lock  SELECT x FROM information_schema.tables  WHERE TABLE_SCHEMA = 'windmills_s' 1486 10    0
    2. 110068  system user connecting host windmills_s Connect 111 closing tables ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 111000 0 0
  2. The execution time is longer

Summarizing:

                                                 TOI           NBO
Time on hold for inserts on the altering table   ~64 sec       ~120 sec
Time on hold for inserts on another table        ~64 sec       ~25 sec
Metalock                                         whole time    only at the end

What is Happening? What are the Differences and Why Does it Take Longer with NBO?

Let’s see at a very high level how the two work:

  • TOI: when you issue a DDL like ADD INDEX a metadata lock is taken on the table and it will be released only at the end of the operation. During this time, you cannot:
    • Perform DMLs on any cluster node
    • Alter another table in the cluster
  • NBO: the metadata lock is taken at the start and at the end for a very brief period of time. The ADD INDEX operation will then work on each node independently. The lock taken at the end is to have all the nodes agree on the operation and commit or rollback (using cluster error voting). This final phase costs a bit more in time and is what adds a few seconds to the operation execution. But during the operation:
    • You can alter another table (using NBO)
    • You can continue to insert data, except in the table(s) you are altering.
    • On node crash, the operation will continue on the other nodes, and if successful it will persist.  

In short, the cluster’s behavior changes significantly when using NBO, offering significant flexibility compared to TOI. The added cost in time should not increase linearly with the size of the table; it relates more to how efficiently a single node can perform the ALTER operation.

Conclusion

NBO can significantly help to reduce the impact of DDL on the cluster; for now, it is limited to the widely used creation/modification/drop of an index. But in the future… we may expand it. 

The feature is still a technology preview, so do not trust it in production, but test it and let us know what you think. 

Final comment: another distribution has introduced NBO, but only if you buy the enterprise version.

Percona, which is truly open source with facts not just words, has implemented NBO in standard PXC, and the code is fully open source. This is not the first one, but just another of the many features Percona is offering for free while others ask you to buy the enterprise version.

Enjoy the product and let us have your feedback! Great MySQL to all! 

Dec
03
2021
--

MyDumper Github Repository Is Now an Organization

MyDumper Github Repository organization

For a long time, MyDumper has been in Max Bubenick’s personal GitHub repository. Now, we have decided to move it to a new MyDumper Organization, as requested earlier this year by a user from the community.

There were also two other reasons why we decided to move it. The first is related to how the project is evolving, and the second is that it will allow us to implement integrations with other projects.

We can see the evolution of the project, noting the increase in commits of the last year:

mydumper

We tried to keep a release cycle of every two months, focusing on closing as many bugs as possible and implementing newly requested features. It was not an easy task, as many changes had to be made in the mydumper and myloader engines to allow the new features to be developed. 

Seeing the progress that has been made, we are encouraged to expect a version 1.0.1 release next year. This will be a huge step for the project, as it will mean that MyDumper is mature enough to handle any export/import task with ease.

On the other hand, moving MyDumper to an Organization will allow us to create an official Docker image and a CI pipeline with CircleCI. Both are on the community wish list, but they will also be useful for current and future contributors, as one of the most important and difficult tasks is testing, given the number of use cases we have. We expect to see higher-quality releases, not just because of the quality of the code, but also because the tests will cover most of the uses of mydumper and myloader.

Now MyDumper has its own merch, too. By shopping for the merch you contribute to the project. Learn more here.
