Group replication is a fault-tolerant/highly available replication topology that ensures if the primary node goes down, one of the other candidates or secondary members takes over so write and read operations can continue without any interruptions. However, there are some scenarios where, due to outages, network partitions, or database crashes, the group membership could be broken, or we end […]
29
2023
Compression Methods in MongoDB: Snappy vs. Zstd
Compression in any database is necessary as it has many advantages, like storage reduction, data transmission time, etc.
Storage reduction alone results in significant cost savings, and we can save more data in the same space. As the amount of data grows, the need for efficient data compression becomes increasingly important to save storage space, reduce I/O overhead, and improve query performance.
In this blog, we will discuss both data and network-level compression offered in MongoDB. We will discuss snappy and zstd for data block and zstd compression in a network.
Percona Server for MongoDB (PSMDB) supports all types of compression and enterprise-grade features for free. I am using PSMDB 6.0.4 here.
Data compression
MongoDB offers various block compression methods used by the WiredTiger storage engine, like snappy, zlib, and zstd.
When data is written to disk, MongoDB compresses it with a specified block compression method and then writes it to disk. When this data block is read, it decompresses it in memory and presents it to the incoming request.
Block compression is a type of compression that compresses data in blocks rather than compressing the entire data set at once. Block compression can improve performance by allowing data to be read and written in smaller chunks.
By default, MongoDB provides a snappy block compression method for storage and network communication.
Snappy compression is designed to be fast and efficient regarding memory usage, making it a good fit for MongoDB workloads. Snappy is a compression library developed by Google.
Benefits of snappy compression in MongoDB:
- Fast compression and decompression speeds
- Low CPU usage
- A streamable format that allows for quick processing
- Minimal impact on query performance
Zstandard Compression or zstd, another newer block compression method provided by MongoDB starting for v4.2, provides higher compression rates. Zstd is a compression library that Facebook developed.
Zstd typically offers a higher compression ratio than snappy, meaning that it can compress data more effectively and achieve a smaller compressed size for the same input data.
Benefits of zstd compression in MongoDB:
- Higher compression ratios than Snappy
- Highly configurable compression levels
- Fast compression and decompression speeds
- Minimal impact on query performance
To enable zstd block compression, you need to specify the block compressor as “zstd” in the configuration file:
storage: engine: wiredTiger wiredTiger: collectionConfig: blockCompressor: zstd blockCompressorQuality: 6 #(available since v5.0) engineConfig: cacheSizeGB: 4
In the above example, blockCompressorQuality is set to 6, which is the default.
blockCompressorQuality specifies the level of compression applied when using the zstd compressor. Values can range from 1 to 22.
The higher the specified value for zstdCompressionLevel, the higher the compression which is applied. So, it becomes very important to test for the optimal required use case before implementing it in production.
Here, we are going to test snappy and zstd compression with the following configurations.
Host config: 4vCPU, 14 GB RAM
DB version: PSMDB 6.0.4
OS: CentOS Linux 7
I’ve used mgenerate command to insert a sample document.
mgeneratejs '{"name": "$name", "age": "$age", "emails": {"$array": {"of": "$email", "number": 3}}}' -n 120000000 | mongoimport --uri mongodb://localhost:27017/<db> --collection <coll_name> --mode insert
Sample record:
_id: ObjectId("64195975e40cea62af1be510"), name: 'Verna Grant', age: 44, emails: [ 'guzwev@gizusuzu.mv', 'ba@ewobisrut.tl', 'doz@bi.ag' ]
I’ve created a collection using the below command with a specific block compression method. This does not affect any existing collection or any new collection being created after this.
db.createCollection("user", {storageEngine: {wiredTiger: {configString: "block_compressor=zstd"}}})
If any new collection is created in the default manner, it will always be the default snappy or compression method specified in the mongod config file.
At the time of insert ops, no other queries or DML ops were running in the database.
Snappy
Data size: 14.95GB
Data size after compression: 10.75GB
Avg latency: 12.22ms
Avg cpu usage: 34%
Avg insert ops rate: 16K/s
Time taken to import 120000000 document: 7292 seconds
Zstd (with default compression level 6)
Data size: 14.95GB
Data size after compression: 7.69GB
Avg latency: 12.52ms
Avg cpu usage: 31.72%
Avg insert ops rate: 14.8K/s
Time taken to import 120000000 document: 7412 seconds
We can see from the above comparison that we can save almost 3GB of disk space without impacting the CPU or memory.
Network compression
MongoDB also offers network compression.
This can further reduce the amount of data that needs to be transmitted between server and client over the network. This, in turn, requires less bandwidth and network resources, which can improve performance and reduce costs.
It supports the same compression algorithms for network compression, i.e., snappy, zstd, and zlib. All these compression algorithms have various compression ratios and CPU needs.
To enable network compression in mongod and mongos, you can specify the compression algorithm by adding the following line to the configuration file.
net: compression: compressors: snappy
We can also use multiple compression algorithms like
net: compression: compressors: snappy,zstd,zlib
The client should also use at least one or the same compression method specified in the config to have data over the network compressed, or the data between the client and server would be uncompressed.
In the below example, I am using a python driver to connect to my server with no compression, and zstd compression algorithm
I am doing simple find ops on the sample record shown above.
This is the outbound data traffic without any compression method
Here we can see data transmitted is around 2.33MB/s:
Now, I’ve enabled zstd compression algorithm in both the server and client
client = pymongo.MongoClient("mongodb://user:pwd@xx.xx.xx.xx:27017/?replicaSet=rs1&authSource=admin&compressors=zstd")
Here we can see data avg outbound transmission is around 1MB/s which is almost a 50% reduction.
Note that network compression can have a significant impact on network performance and CPU usage. In my case, there was hardly anything else running, so I did not see any significant CPU usage.
Conclusion
Choosing between snappy and zstd compression depends on the specific use cases. By understanding the benefits of each algorithm and how they are implemented in MongoDB, you can choose the right compression setting for your specific use case and save some disk space.
Choosing the appropriate compression algorithm is important based on your specific requirements and resources. It’s also important to test your applications with and without network compression to determine the optimal configuration for your use case.
I also recommend using Percona Server for MongoDB, which provides MongoDB enterprise-grade features without any license, as it is free. You can learn more about it in the blog MongoDB: Why Pay for Enterprise When Open Source Has You Covered?
Percona also offers some more great products for MongoDB, like Percona Backup for MongoDB, Percona Kubernetes Operator for MongoDB, and Percona Monitoring and Management.
Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.
28
2023
WiredTiger Logging and Checkpoint Mechanism
In this blog post, we will dive deep into WiredTiger’s Logging and Checkpoint mechanism.
Every database system has to ensure durability and reliability. MongoDB uses classic Write-Ahead-Logging (WAL) using Journals and Checkpoints.
Starting with the basics, why is WAL needed in the first place? It’s to ensure that our data is durable after each write operation and to make it persistent and consistent without compromising the performance.
In terms of MongoDB, it achieves WAL and data durability using a combination of both Journaling and Checkpoints. Let’s understand both of them.
1. Journal
It’s a process where every write operation gets written (appended) from Memory to a Journal file, AKA transaction log that exists on disk at a specific interval configured using “journalCommitIntervalMs.”
This acts as a step to ensure durability by recovering lost data from the same journal files in case of crashes, power, and hardware failures between the checkpoints (see below)
Here’s what the process looks like.
- For each write operation, MongoDB writes the changes into Journal files, AKA transaction log files, which is basically a WAL mechanism used by MongoDB, as discussed above. This happens at every journalCommitIntervalMs.
- The same data, in the form of pages inside the Wiredtiger cache, are also marked dirty.
Example of journal file when exported using WiredTiger Binary (wt):
$ wt printlog -u -x { "lsn" : [15,256], "hdr_flags" : "", "rec_len" : 256, "mem_len" : 256, "type" : "commit", "txnid" : 3836082, "ops": [ { "optype": "row_put", "fileid": 14 0xe, "key": "u00e8du001au0015bu00ffu00ffu00dfu00c1", "key-hex": "e8641a1562ffffdfc1", "value": "gu0000u0000u0000u0002o…., "value-hex": "67000000026f7….." } ] }, { "lsn" : [15,512], "hdr_flags" : "", "rec_len" : 128, "mem_len" : 128, "type" : "commit", "txnid" : 3836083, "ops": [ { "optype": "row_modify", "fileid": 6 0x6, "key": "u0081", "key-hex": "81", "value": "u0001u0000u0000u….", "value-hex": "010000000000000008000000000000003e0000000000000008000000000000000100000063151a64" }
The important part of it is the byte and offset, which contains any data modifications that happened.
2. Checkpoint
The role of a checkpoint in durability and consistency is equally important. A checkpoint is equivalent to a log, which records the changes in related data files after the last checkpoint.
Each checkpoint consists of a root page, three lists of pages pointing to specific locations on the disk, and the file size on the disk.
At every checkpoint interval (Default 60 seconds), MongoDB flushes the modified pages that are marked as dirty in the cache to their respective data files (both collection-*.wt and index-*.wt).
Using the same “wt” utility, we can list the checkpoints and view the information they contain. The checkpoint information shown below is stored with respect to each data file (collection and index). These checkpoints are stored in WiredTiger.wt file.
The above diagram shows the information present in a checkpoint, while the below output shows how it looks when listed using the “wt” utility.
$ wt list -c WiredTigerCheckpoint.33: Sun Mar 26 08:35:59 2022 (size 8 KB) file-size: 8 KB, checkpoint-size: 4 KB offset, size, checksum root : 8192, 4096, 3824871989 (0xe3faea35) alloc: 12288, 4096, 4074814944 (0xf2e0bde0) discard : 0, 0, 0 (0) available : 0, 0, 0 (0) WiredTigerCheckpoint.34: Sun Mar 26 08:35:59 2022 (size 8 KB) file-size: 8 KB, checkpoint-size: 4 KB offset, size, checksum root : 8192, 4096, 997122142 (0x3b6ee05e) alloc: 12288, 4096, 4074814944 (0xf8e0cde0) discard : 0, 0, 0 (0) available : 0, 0, 0 (0)
This key information resides inside each checkpoint and consists of the following:
- Root page:
- Contains the size (size) of the root page, the position in the file (offset), and the checksum (checksum). When a checkpoint is created, a new root page will be generated.
- Internal page:
- Only carries the keys. WiredTiger traverses through internal pages to look for the respective Leaf page.
- Leaf page:
- Contains actual key: value pair
- Allocated list pages:
- After the recent checkpoint, WiredTiger block manager keeps a record of newly allocated pages and their information, such as size, offset, and checksum.
- Discarded list pages:
- Upon completion of the last checkpoint, associated pages will be discarded; however, key information such as size, offset, and the checksum of each such discarded page will be stored.
- Upon completion of the last checkpoint, associated pages will be discarded; however, key information such as size, offset, and the checksum of each such discarded page will be stored.
- Available list pages:
- When this checkpoint is executed, all pages allocated by the WiredTiger block manager but not yet used; when deleting a previously created checkpoint, the available pages attached to it will be merged into the latest available list of this checkpoint, and also the size, offset, and checksum of each available page will be recorded.
- File size:
- Information about the size of a data file on disk upon completion of a checkpoint.
Although both the processes (involving disk) might look the same, they have different purposes. Journal, on the one hand, is an append-only operation in a journal file, AKA transaction log file present on disk. Checkpoints, on the other hand, deal with persisting the data on respective data files, which does include a lot of overhead due to the complexity involved, especially random disk operations and reconciliation.
Generally, the checkpoint is triggered.
- At every 60 seconds (default), unless there’s a large amount of data that needs to be written, which creates a backlog due to I/O bottlenecks.
- When eviction_dirty_target or eviction_dirty_trigger reaches 5% and 20%, respectively. However, it’s not normal and only happens when there’s too much write activity beyond what the hardware can handle.
So, what happens when there’s an unexpected crash or hardware failure? Let’s take a look at the process when we start mongod.
- MongoD attempts to go into crash recovery and looks for anything there in the Journal files.
The trimmed output would look something like the one below in the “mongod log” files.
{"t":{"$date":"2023-03-27T11:22:48.360+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1679916168:360670][9811:0x7f43b45d 7bc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 15 through 16"}} 7bc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1679916159, 1)"}} {"t":{"$date":"2023-03-27T11:22:48.688+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1679916168:688481][9811:0x7f43b45d 7bc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1679916154, 1)"}} {"t":{"$date":"2023-03-27T11:22:48.695+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1679916168:695497][9811:0x7f43b45d 7bc0], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 10, snapshot max: 10 snapshot count: 0, oldest timestamp: (1679916154, 1) , meta checkpoint timest amp: (1679916159, 1) base write gen: 11982970"}} {"t":{"$date":"2023-03-27T11:22:48.705+00:00"},"s":"I", "c":"RECOVERY", "id":23987, "ctx":"initandlisten","msg":"WiredTiger recoveryTimestamp","attr":{"recoveryTimestamp":{"$timestamp":{" t":1679916159,"i":1}}}}
2. Identifies the last successful checkpoint from the data files and recovers the uncommitted dirty data from the journal files back into the WireTtiger cache. The same pages will then again be marked as dirty.
Output is trimmed to show relevant information only.
file:demo/collection/108-2625234990440311433.wt access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=snappy………checkpoint=(WiredTigerCheckpoint.33=(addr="018081e49e1d91ae9a81e4b44eefcd9b81e4be132c8a808080e30d3fc0e30c4fc0",order=33,time=1679897278,size=819200,newest_start_durable_ts=7215101747935576783,oldest_start_ts=0,......,checkpoint_backup_info=,checkpoint_lsn=(4294967295,2147483647)
3. This dirty page’s data will then again be ready to be flushed out during the next checkpoint to their respective data files on disk. This is handled by “WiredTiger Block Manager.” Unwanted journal entries will be then cleaned up accordingly post-checkpoint execution.
Voila!! We now have a durable and consistent data state even after a crash.
References:
Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.
15
2023
How Bloom Filters Work in MyRocks
Bloom filters are an essential component of an LSM-based database engine like MyRocks. This post will illustrate through a simple example how bloom filters work in MyRocks.
Why?
With MyRocks/RocksDB, data is stored in a set of large SST files. When MyRocks needs to find the value associated with a given key, it uses a bloom filter to guess if the key could potentially be in an SST file.
How?
A bloom filter is a space-efficient way of storing information about a list of keys. At its base, there is a bitmap and a hash function. The hash value of the keys stored in an SST is computed, and the results are used to set some bits to “1” in the bitmap. When you want to know if a key is present or not in the list, you run it through the hash function and check if the corresponding bits in the bitmap are “1” or “0”. If one of the bits is “0”, you know for sure the key is not in the list. If all the bits are “1”, the value may be present. The probability of a false positive depends only on a few factors:
- Size of the bitmap
- Number of keys in the list
- Number of bits set to “1” for each key value
An example
Let’s assume we are creating a list of names and want a bloom filter on that list. We’ll use an 8 bytes bloom filter (64 bits), and we’ll set two bits according to this function:
mysql> set @md5value=md5('Yves');select mod(conv(mid(@md5value,1,8),16,10),32) b1, mod(conv(mid(@md5value,9,8),16,10),32)+32 b2; Query OK, 0 rows affected (0.00 sec) +------+------+ | b1 | b2 | +------+------+ | 15 | 39 | +------+------+ 1 row in set (0.00 sec)
So the value ‘Yves’ sets the bits 15 and 39 to ‘1’.
Now, if I add 10 names (taken from the Percona slack channel “architecture_projects”) we have:
- Alexey: 29 53
- Alick: 8 49
- Bennie: 31 63
- Brenda: 20 54
- Brett: 12 61
- Carlos: 3 46
- Corrado: 27 34
- David: 15 57
- Deb: 13 38
- Eliana: 11 39
The bitmap now looks like:
0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1
Now, if I want to use the bitmap to “guess” if “Bennie” is in the list, that will work just fine, bits 31 and 63 are “1”. But… “Yves” is not in the list even though bits 15 (from David) and 39 (from Eliana) are set to “1”. A query for “Yves” would be a false positive.
On the other hand, let’s consider “Francisco” for which the bits are 27 and 60. Bit 27 is set to “1” by the “Corrado” value, but bit 60 is “0”. We thus know for sure that “Francisco” is not in the list.
Tuning
In terms of tuning, two parameters can be tuned, the size of the bitmap and the number of bits set by every value. The bitmap size must be large enough so that the proportion of bits set to “1” is not overwhelming. In the above example, if you add all the first names of the people at Percona you’ll end up with only “1”. That would be obviously useless. To avoid this issue, we could instead use a 1 GB bitmap. The number of false positives would be low, but 1GB is a lot of memory. In most cases a bitmap of a few hundred bytes is sufficient.
With MyRocks, you can control the size of the bloom filter bitmap with the column family parameter memtable_prefix_bloom_size_ratio. For good performance, the filter blocks are cached in the RocksDB block cache and normally stay there since they are accessed frequently.
The filter_policy variable controls the number of bits per key in the column family settings. By default, 10 bits are used per key which yields less than 1% of false positives.
Conclusion
I hope that this post has provided you with a basic understanding of bloom filters and how they are implemented in MyRocks. LSM storage engines like MyRocks are very different from the more common B-Tree-based storage engines like InnoDB. Bloom filter is an important feature improving performance of LSM storage engines. A good understanding of how bloom filters work can only lead to improved database performance.
Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!
10
2023
Percona XtraBackup and MySQL 5.7 Queries in Waiting for Table Flush State
Percona XtraBackup is an open source hot backup utility for MySQL-based servers. To take consistent and hot backup, it uses various locking methods, especially for non-transactional storage engine tables. This blog post discusses the cause and possible solution for queries with ?Waiting for table flush state in processlist when taking backups using Percona XtraBackup. Only MySQL 5.7 version is affected by this, as per my tests.
Type of locks taken by Percona XtraBackup
Before discussing the main issue, let’s learn about the type of locks used by Percona XtraBackup to take consistent backups. Percona XtraBackup uses backup locks as a lightweight alternative to FLUSH TABLES WITH READ LOCK. This feature is available in Percona Server for MySQL 5.6+. MySQL 8.0 allows acquiring an instance-level backup lock via the LOCK INSTANCE FOR BACKUP statement.
Locking is only done for MyISAM storage engine tables and other non-InnoDB tables after Percona XtraBackup finishes backing up all InnoDB/XtraDB data and logs. With backup locks, Percona XtraBackup uses LOCK TABLES FOR BACKUP automatically to copy non-InnoDB data and avoid blocking DML queries that modify InnoDB tables.
So for Percona Server for MySQL 5.7 and Percona XtraDB Cluster 5.7, it uses the following locking command while performing MyISAM and other non-InnoDB tables:
Executing LOCK TABLES FOR BACKUP... Starting to backup non-InnoDB tables and files
For upstream MySQL 5.7 versions, it uses the following locking commands:
Executing FLUSH NO_WRITE_TO_BINLOG TABLES... Executing FLUSH TABLES WITH READ LOCK... Starting to backup non-InnoDB tables and files
MySQL 8.0 uses the LOCK INSTANCE FOR BACKUP command.
Queries with ?Waiting for table flush state in processlist output while XtraBackup is running, this issue is visible in upstream MySQL 5.7 versions only; Percona Server for MySQL/Percona XtraDB Cluster/MySQL 8.0 will not have this issue since they use a different locking mechanism.
Root cause
When MySQL has a long-running query for a table/s, running XtraBackup will run FLUSH NO_WRITE_TO_BINLOG TABLES and all types of new queries on a table where a long-running query is running will have Waiting for table flush state in processlist.
This will be resolved after the long-running query is finished, and the next backup will complete successfully.
Test case example:
CREATE TABLE `joinit` ( `i` int(11) NOT NULL AUTO_INCREMENT, `s` varchar(64) DEFAULT NULL, `t` time NOT NULL, `g` int(11) NOT NULL, PRIMARY KEY (`i`) ) ENGINE=innodb; Create table joinit_new like joinit; INSERT INTO joinit VALUES (NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 ))); INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit; INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 33554618 rows
#Start sysbench load on the database:
$ sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb --table-size=1000 --tables=5 --mysql-db=test --mysql-host=127.0.0.1 --mysql-user=msandbox --mysql-password=msandbox --threads=4 --time=0 --report-interval=1 --events=0 --db-driver=mysql run
Session 1: Ran the following query:
mysql > select distinct i from joinit for update;
mysql > show processlist; +-----+----------+-----------+-------+---------+------+--------------+------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +-----+----------+-----------+-------+---------+------+--------------+------------------------------------------+ | 23 | msandbox | localhost | test1 | Query | 6 | Sending data | select distinct i from joinit for update | | 76 | msandbox | localhost | NULL | Query | 0 | starting | show processlist | | 171 | msandbox | localhost | test1 | Sleep | 57 | | NULL | | 207 | msandbox | localhost | test | Sleep | 135 | | NULL | | | 243 | msandbox | localhost | test | Execute | 0 | starting | COMMIT | | 244 | msandbox | localhost | test | Execute | 0 | starting | COMMIT | | 245 | msandbox | localhost | test | Execute | 0 | starting | COMMIT | | 247 | msandbox | localhost | NULL | Sleep | 3 | | NULL | +-----+----------+-----------+-------+---------+------+--------------+------------------------------------------+ 9 rows in set (0.00 sec)
#Start XtraBackup:
xtrabackup --user=msandbox --password=msandbox --backup --target-dir=~/backup Copying ./mysql/help_relation.ibd to /home/lalit/backup/mysql/help_relation.ibd ...done >> log scanned up to (13605240561) Executing FLUSH NO_WRITE_TO_BINLOG TABLES... >> log scanned up to (13605923511) >> log scanned up to (13606547653) >> log scanned up to (13607307899)
Session 2: while session1 long-query is running:
mysql > insert into joinit_new (SELECT * from joinit);
Session 3: while session1 long-query is running:
mysql > select count(*) from test1.joinit;
As we can see from the example, all new queries on joinit table are in the Waiting for table flush state. Queries on other tables are not affected.
mysql [localhost] {msandbox} ((none)) > show processlist; +-----+----------+-----------+-------+---------+------+-------------------------+-----------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +-----+----------+-----------+-------+---------+------+-------------------------+-----------------------------------------------+ | 23 | msandbox | localhost | test1 | Query | 22 | Sending data | select distinct i from joinit for update | | 76 | msandbox | localhost | NULL | Query | 0 | starting | show processlist | | 171 | msandbox | localhost | test1 | Query | 7 | Waiting for table flush | insert into joinit_new (SELECT * from joinit) | | 207 | msandbox | localhost | test | Query | 3 | Waiting for table flush | select count(*) from test1.joinit | | 242 | msandbox | localhost | test | Execute | 0 | starting | COMMIT | | | 245 | msandbox | localhost | test | Execute | 0 | closing tables | SELECT c FROM sbtest5 WHERE id=? | | 247 | msandbox | localhost | NULL | Query | 11 | Waiting for table flush | FLUSH NO_WRITE_TO_BINLOG TABLES | +-----+----------+-----------+-------+---------+------+-------------------------+-----------------------------------------------+ 9 rows in set (0.00 sec)
Once the long query is completed, we don’t see the Waiting for table flush state in processlist, and backup will continue.
>> log scanned up to (13619342387) >> log scanned up to (13619630994) Executing FLUSH TABLES WITH READ LOCK... >> log scanned up to (13619733489) >> log scanned up to (13619737067)
mysql > show processlist; +-----+----------+-----------+-------+---------+------+------------------------------+-----------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +-----+----------+-----------+-------+---------+------+------------------------------+-----------------------------------------------+ | 23 | msandbox | localhost | test1 | Sleep | 184 | | NULL | | 76 | msandbox | localhost | NULL | Query | 0 | starting | show processlist | | 171 | msandbox | localhost | test1 | Query | 169 | Sending data | insert into joinit_new (SELECT * from joinit) | | | 242 | msandbox | localhost | test | Execute | 131 | Waiting for global read lock | UPDATE sbtest3 SET k=k+1 WHERE id=? | | 243 | msandbox | localhost | test | Execute | 131 | Waiting for global read lock | UPDATE sbtest2 SET k=k+1 WHERE id=? | 247 | msandbox | localhost | NULL | Query | 131 | Waiting for global read lock | FLUSH TABLES WITH READ LOCK | +-----+----------+-----------+-------+---------+------+------------------------------+-----------------------------------------------+ 9 rows in set (0.00 sec) mysql > show processlist; +-----+----------+-----------+-------+---------+------+----------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +-----+----------+-----------+-------+---------+------+----------+------------------+ | 23 | msandbox | localhost | test1 | Sleep | 256 | | NULL | | 76 | msandbox | localhost | NULL | Query | 0 | starting | show processlist | | 171 | msandbox | localhost | test1 | Sleep | 241 | | NULL | | 207 | msandbox | localhost | test | Sleep | 237 | | NULL | +-----+----------+-----------+-------+---------+------+----------+------------------+ 4 rows in set (0.00 sec)
Writing /home/lalit/backup/backup-my.cnf ...done Writing /home/lalit/backup/xtrabackup_info ...done xtrabackup: Transaction log of lsn (13596553991) to (13619761343) was copied. completed OK!
Possible solutions
List of options that you can use with Percona XtraBackup:
- xtrabackup –ftwrl-wait-timeout (seconds) – how long to wait for a good moment. Default is 0, not to wait.
- xtrabackup –ftwrl-wait-query-type – which long queries should be finished before FLUSH TABLES WITH READ LOCK is run. Default is ALL.
- xtrabackup –ftwrl-wait-threshold (seconds) – how long the query should be running before we consider it long running and potential blocker of global lock.
- xtrabackup –kill-long-queries-timeout (seconds) – how much time we give for queries to complete after FLUSH TABLES WITH READ LOCK is issued before start to kill. Default if 0, not to kill.
- xtrabackup –kill-long-query-type – which queries should be killed once kill-long-queries-timeout has expired. The default is SELECT.
Note: Killing select queries is OK but for DML/DDL it could lead to data inconsistency) so better to retry backup later/during non-peak hours.
References:
How Percona XtraBackup works (8.0)
09
2023
COUNT(*) vs COUNT(col) in MySQL
Looking at how people are using COUNT(*) and COUNT(col), it looks like most of them think they are synonyms and just use what they happen to like, while there is a substantial difference in performance and even query results. Also, we find a difference in execution on InnoDB and MyISAM engines.
NOTE: All tests were applied for MySQL version 8.0.30, and in the background, I ran every query three to five times to make sure that all of them were fully cached in the buffer pool (for InnoDB) or by the filesystem (for MyISAM).
Count function for Innodb engine:
Let’s have look at the following series of examples for InnoDB engine:
CREATE TABLE count_innodb ( id int(10) unsigned NOT NULL AUTO_INCREMENT, val_with_nulls int(11) default NULL, val_no_null int(10) unsigned NOT NULL, PRIMARY KEY idx (id) ) ENGINE=InnoDB DEFAULT CHARSET=latin1; (mysql) > select count(*) from count_innodb; +----------+ | count(*) | +----------+ | 10000000 | +----------+ 1 row in set (0.38 sec) (mysql) > select count(val_no_null) from count_innodb; +--------------------+ | count(val_no_null) | +--------------------+ | 10000000 | +--------------------+ 1 row in set (0.38 sec)
In this InnoDB engine, we can see that it requires some time to get COUNT(*) and COUNT(val_no_null) of rows for the table, and as we will see further, MyiSAM is significantly faster compared to InnoDB table in the sense of getting an answer for COUNT(*).
But why we can’t just cache the actual number of the rows? InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. Consequently, SELECT COUNT(*) statements only count rows visible to the current transaction. By the way, we can use information schema to instantly get approximately the number of rows of the table in question:
(mysql) > select table_rows from information_schema.tables where table_name='count_innodb'; +------------+ | TABLE_ROWS | +------------+ | 9980586 | +------------+ 1 row in set (0.00 sec)
As you can see it’s not the exact number of rows. However, sometimes a rough count might be sufficient.
Let’s take a look into COUNT(val_with_nulls);
(mysql) > select count(val_with_nulls) from count_innodb; +-----------------------+ | count(val_with_nulls) | +-----------------------+ | 9990001 | +-----------------------+ 1 row in set (2.14 sec)
And there, as you can see we have differences in the result of COUNT(*) vs COUNT(val_with_nulls)
Why? Because the val_with_nulls column is not defined as NOT NULL there can be some NULL values in it and so MySQL has to perform a table scan to find out. This is also why the result is different for the second query
So COUNT(*) and COUNT(col) queries not only could have substantial performance differences but also ask different questions.
Let’s have another round of queries, there let’s take a look at how InnoDB manages to do COUNT(*), COUNT(val_no_null), COUNT(val_with_nulls) with the same WHERE clause:
(mysql) > select count(*) from count_innodb where id<1000000; +----------+ | count(*) | +----------+ | 980000 | +----------+ 1 row in set (0.30 sec) (mysql) > explain select count(*) from count_innodb where id<1000000\G *************************** 1. row *************************** select_type: SIMPLE table: count_innodb type: range possible_keys: PRIMARY key: PRIMARY rows: 1955802 filtered: 100.00 Extra: Using where; Using index (mysql) > select count(val_no_null) from count_innodb where id<1000000; +--------------------+ | count(val_no_null) | +--------------------+ | 980000 | +--------------------+ 1 row in set (0.33 sec) (mysql) > explain select count(val_no_null) from count_innodb where id<1000000\G *************************** 1. row *************************** select_type: SIMPLE table: count_innodb type: range possible_keys: PRIMARY key: PRIMARY rows: 2013804 filtered: 100.00 Extra: Using where
We can see the performance of the query is equal for both cases, and it has only differences of 10%, and if you pay closer attention to EXPLAIN for COUNT(*) query, you will notice Using index. This means that MySQL can use only the index and does not touch the rest table data, which might be sufficient to get the count of rows for huge tables.
You might want to use columns that already have an index to speed up the query for huge tables.
Will we have any surprises with COUNT(val_with_nulls)? Let’s see:
(mysql) > select count(val_with_nulls) from count_innodb where id<1000000; +-----------------------+ | count(val_with_nulls) | +-----------------------+ | 970001 | +-----------------------+ 1 row in set (0.33 sec) (mysql) > explain select count(val_with_nulls) from count_innodb where id<1000000\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: count_innodb partitions: NULL type: range possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: NULL rows: 1955802 filtered: 100.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)
No surprises; we can see the performance of the query is pretty even across all COUNT(*), COUNT(val_with_nulls), COUNT(val_with_nulls).
Count function for MyISAM engine:
Now let’s take a look into COUNT() function for the MyISAM engine:
CREATE TABLE count_myisam ( id int(10) unsigned NOT NULL, val_with_nulls int(11) default NULL, val_no_null int(10) unsigned NOT NULL, KEY idx (id) ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci; (mysql) > select count(*) from count_myisam; +----------+ | count(*) | +----------+ | 10000000 | +----------+ 1 row in set (0.00 sec) (mysql) > select count(val_no_null) from count_myisam; +--------------------+ | count(val_no_null) | +--------------------+ | 10000000 | +--------------------+ 1 row in set (0.00 sec)
What flash speed we saw there!
As this is a MyISAM table, we have cached the number of rows inside of the table, this is how the MyISAM engine works. That is why it can instantly answer COUNT(*) and COUNT(val_no_null) queries.
Please, pay attention to the difference between engines: InnoDB is a transaction engine, and MyISAM is a non-transactional storage engine.
(mysql) > select count(val_with_nulls) from count_myisam; +-----------------------+ | count(val_with_nulls) | +-----------------------+ | 9990001 | +-----------------------+ 1 row in set (14.18 sec)
But when it comes to COUNT(val_with_nulls) for MyISAM table we can see that’s a slower InnoDB in 7 times; what a huge difference. Also, we can see the same behavior for COUNT(val_with_nulls), as NULL values obviously will not be considered. MySQL Optimizer does a good job in this case, doing a full table scan only if it is needed because the column can be NULL.
Now let’s try a few more queries for MyISAM table with WHERE clause:
(mysql) > select count(*) from count_myisam where id<1000000; +----------+ | count(*) | +----------+ | 1001237 | +----------+ 1 row in set (0.41 sec) (mysql) > explain select count(*) from count_myisam where id<1000000 \\G *************************** 1. row *************************** select_type: SIMPLE table: count_myisam type: range possible_keys: idx key: idx rows: 1041561 filtered: 100.00 Extra: Using where; Using index (mysql) > select count(val_no_null) from count_myisam where id<1000000; +--------------------+ | count(val_no_null) | +--------------------+ | 1001237 | +--------------------+ 1 row in set (2.55 sec) (mysql) > explain select count(val_no_null) from count_myisam where id<1000000\\G *************************** 1. row *************************** select_type: SIMPLE table: count_myisam type: range possible_keys: idx key: idx rows: 1041561 filtered: 100.00 Extra: Using index condition; Using MRR (mysql) > select count(val_with_nulls) from count_myisam where id<1000000; +-----------------------+ | count(val_with_nulls) | +-----------------------+ | 1000281 | +-----------------------+ 1 row in set (2.55 sec) (mysql) > explain select count(val_with_nulls) from count_myisam where id<1000000\\G *************************** 1. row *************************** select_type: SIMPLE table: count_myisam type: range possible_keys: idx key: idx rows: 1041561 filtered: 100.00 Extra: Using index condition; Using MRR
As you can see, even if you have a WHERE clause, performance for COUNT(*) and COUNT(col) can be significantly different. In fact, this example shows a five times performance difference because all data fits in memory (for your information, as it’s the MyISAM engine, caching of data happens in the filesystem cache level). For IO-bound workloads, you frequently can see even a 100 times performance difference in this case.
The COUNT(*) query can use a covering index even while COUNT(col) can’t. Of course, you can extend the index to be (id,val_with_nulls) and get the query to be index covered again, but I would use this workaround only if you can’t change the query (ie, it is a third-party application) or case of when the column name is in the query for a reason, and you need a count of non-NULL values.
It is worth to note in this case, MySQL Optimizer does not do a good job of optimizing the query. One could notice (val_with_nulls) column is not null, so COUNT(val_with_nulls) is the same as COUNT(*), and so the query could be run as an index-covered query. It does not, and both queries have to perform row reads in this case.
(mysql) > alter table count_myisam drop key idx, add key idx (id,val_with_nulls); Query OK, 10000000 rows affected (1 min 38.71 sec) Records: 10000000 Duplicates: 0 Warnings: 0 (mysql) > select count(val_with_nulls) from count_myisam where id<1000000; +-----------------------+ | count(val_with_nulls) | +-----------------------+ | 1000281 | +-----------------------+ 1 row in set (0.42 sec) (mysql) > select count(*) from count_myisam where id<1000000; +----------+ | count(*) | +----------+ | 1000762 | +----------+ 1 row in set (0.56 sec)
As you can see, extending the index helps improve COUNT(val_with_nulls) query for null values about seven times compared to COUNT(val_with_nulls) without index. But also, you can see that COUNT(*) becomes around 0,6 times slower, probably because the index becomes about two times longer in this case.
At last, I want to dispel some of the delusions about COUNT(0) and COUNT(1).
(mysql) > select count(1) from count_innodb where id<1000000; +----------+ | count(1) | +----------+ | 980000 | +----------+ 1 row in set (0.30 sec) (mysql) > select count(0) from count_innodb where id<1000000; +----------+ | count(0) | +----------+ | 980000 | +----------+ 1 row in set (0.30 sec) (mysql) > explain select count(1) from count_innodb where id<1000000\G *************************** 1. row *************************** select_type: SIMPLE table: count_innodb type: range possible_keys: PRIMARY key: PRIMARY rows: 1955802 filtered: 100.00 Extra: Using where; Using index (mysql) > explain select count(0) from count_innodb where id<1000000\G *************************** 1. row *************************** select_type: SIMPLE table: count_innodb type: range possible_keys: PRIMARY key: PRIMARY rows: 1955802 filtered: 100.00 Extra: Using where; Using index
As you can see, the performance and explain of the queries are the same, and it does not really matter what number you will put inside brackets in COUNT() function. It can be whatever number you want and it will be fully equal to COUNT(*) by performance and by the actual output of this query.
23
2022
A MyRocks Use Case
I wrote this post on MyRocks because I believe it is the most interesting new MySQL storage engine to have appeared over the last few years. Although MyRocks is very efficient for writes, I chose a more generic workload that will provide a different MyRocks use case.
The use case is the TPC-C benchmark but executed not on a high-end server but on a lower-spec virtual machine that is I/O limited like for example, with AWS EBS volumes. I decided to use a virtual machine with two CPU cores, four GB of memory, and storage limited to a maximum of 1000 IOPs of 16KB. The storage device has performance characteristics pretty similar to an AWS gp2 EBS volume of about 330 GB in size. I emulated these limits using the KVM iotune settings in my lab.
<iotune> <total_iops_sec>1000</total_iops_sec> <total_bytes_sec>16384000</total_bytes_sec> </iotune>
MyRocks and RocksDB
If you wonder what is the difference between MyRocks and RocksDB, consider MyRocks as the piece of code, or the glue, that allows MySQL to store data in RocksDB. RocksDB is a very efficient key-value store based on LSM trees. MyRocks stores table rows in RocksDB using an index id value concatenated with the primary key as the key and then the internal MySQL binary row representation as the value. MyRocks handles indexes in a similar fashion. There are obviously tons of details but that is the main principle behind MyRocks. Inside MyRocks, there is an embedded instance of RocksDB running.
Dataset
The TPC-C dataset I used was with a scale of 200. As seen in the figure below, the sizes of the dataset are very different using InnoDB vs MyRocks. While with InnoDB the size is 20GB, it is only 4.3GB with MyRocks. This is a tribute to the efficient compression capabilities of MyRocks.
A keen observer will quickly realize the compressed dataset size with MyRocks is roughly the same as the amount of memory of the virtual machine. This is not an accident, it is on purpose. I want to illustrate, maybe using an obvious use case, that you can’t use general rules like “InnoDB is faster for reads” or “MyRocks is only good for writes”. A careful answer would be: “it depends…”
TPC-C on MyRocks
In order to be able to run the sysbench TPC-C script, you need to use a binary collation and the read-committed isolation level. You must also avoid foreign key constraints. A typical sysbench invocation would look like this:
./tpcc.lua --mysql-host=10.0.4.112 --mysql-user=sysbench --mysql-password=sysbench --mysql-ssl=off \ --mysql-db=sysbench --threads=4 --scale=200 --use_fk=0 --mysql_storage_engine=rocksdb \ --mysql_table_options="COLLATE latin1_bin" --trx_level=RC --report-interval=10 --time=3600 run
I used a rocksdb_block_cache_size of 512MB. I wanted most of the memory to be available for the file cache, where the compressed SST files will be cached. The block cache just needs to be large enough to keep the index and filter blocks in memory. In terms of compression, the relevant settings in the column family options are:
compression_per_level=kLZ4Compression;bottommost_compression=kZSTD;compression_opts=-14:1:0
MyRocks uses ZStd:1 compression for the bottom level and LZ4 for the upper levels. The bottom-level compression is really critical as it contains most of the data.
Being an LSM-type storage engine, RocksDB must frequently perform level compactions. Level compactions consume IOPs and in environments where IOPs are scarce, those impact performance. Fortunately, RocksDB has the variable rocksdb_rate_limiter_bytes_per_sec to limit the impacts of compaction. The IO bandwidth used by the background compaction threads is limited by this parameter. The following figure illustrates the impacts.
As the filesystem cache and the block cache warms up, the TPC-C transaction rates rise from 50 to around 175/s. After roughly 500s, the need for compaction arises and the performance drops. With no rate limit (0), the background threads consume too much IOPs and the compaction adversely affects the workload. With lower values of rocksdb_rate_limiter_bytes_per_sec, the impacts are reduced and the compactions are spread over longer periods of time.
For this environment, a rate limit of 4 MB/s achieves the lowest performance drops. Once warmed, the performance level never felt under 100 Trx/s. If you set rocksdb_rate_limiter_bytes_per_sec too low, like at 1MB/s, compaction cannot keep up and processing has to stall for some time. You should allocate enough bandwidth for compaction to avoid these stalls.
Long term stability
Over time, as data accumulates in the RocksDB LSM tree, performance can degrade. Using the 2 MB/s rate limiter, I pushed the runtime to 10 hours and observed very little degradation as shown in the following figure.
There are of course many compaction events but the performance baseline remains stable.
MyRocks Vs InnoDB
Now, how does this workload perform on InnoDB? InnoDB is more IO bound than MyRocks, essentially the 20GB dataset is large for the 3GB buffer pool.
The compaction event diminishes MyRocks performance but even then, the transaction rate stays well above the InnoDB one. Over the course of one hour, InnoDB executed 125k transactions while MyRocks achieved in excess of 575k transactions. Even if InnoDB uses compression (CMP8k), the performance level is still much lower.
Conclusion
I hope this post has raised your interest in the MyRocks storage engine. If you are paying too much for cloud-based storage and IOPs, make sure you evaluate MyRocks as it has super compression capabilities and is IO efficient.
Note: all the raw results, scripts, and configuration files used for this post can be found on Github.
17
2022
MyRocks Use Case: Big Dataset
One of the questions I am often asked is in what cases I would prefer MyRocks over InnoDB. We have covered MyRocks in our blog previously:
MyRocks Performance – Percona Database Performance Blog
Saving With MyRocks in The Cloud – Percona Database Performance Blog
But it would be good to refresh some materials.
This time I want to take an interesting (and real) data set, which I also covered previously: the Reddit Comments dataset (see Big Dataset: All Reddit Comments – Analyzing with ClickHouse – Percona Database Performance Blog). The dataset is still available for download from http://files.pushshift.io/reddit/submissions/ and it includes all recent comments to June 2022.
The size of the dataset is what is interesting, for example, the comments for January 2022 is 118GB and the total January-June 2022 dataset size is 729GB.
What is also interesting about this dataset is it comes in JSON format, and now MySQL provides wide capabilities to work with JSON files directly.
For example, we can directly load the dataset as:
mysqlsh root@127.0.0.1/rs -- util importJson /storage/vadim/reddit/RS_2022-01 --table=stor --tableColumn=doc
Into a table created as:
CREATE TABLE
stor
(
doc
json DEFAULT NULL,
id
int NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id
)
)
Now we can compare load time into MyRocks vs InnoDB for the same files.
It is worth noting that I used a fast NVMe storage for the MySQL database, which is beneficial for InnoDB (as InnoDB performs better on the fast storage, see Saving With MyRocks in The Cloud – Percona Database Performance Blog).
So loading six files into MyRocks storage engine:
Processed 124.26 GB in 32091070 documents in 1 hour 45 min 4.8562 sec (5.09K documents/s)
Processed 116.83 GB in 29843162 documents in 1 hour 40 min 45.0204 sec (4.94K documents/s)
Processed 128.81 GB in 32677372 documents in 1 hour 54 min 16.1496 sec (4.77K documents/s)
Processed 130.34 GB in 33002461 documents in 1 hour 56 min 43.0586 sec (4.71K documents/s)
Processed 142.88 GB in 34838318 documents in 2 hours 1 min 22.7340 sec (4.78K documents/s)
Processed 139.09 GB in 34395243 documents in 2 hours 4 min 49.8066 sec (4.59K documents/s)
And the final size in MyRocks is 238G. MyRocks uses a mix of LZ4 + Zstandard compression to achieve this size (smaller than the original dataset).
The same for InnoDB storage engine:
Processed 124.26 GB in 32091070 documents in 2 hours 17 min 33.0404 sec (3.89K documents/s)
Processed 116.83 GB in 29843162 documents in 2 hours 8 min 6.3595 sec (3.88K documents/s)
Processed 128.81 GB in 32677372 documents in 2 hours 28 min 8.5292 sec (3.68K documents/s)
Processed 130.34 GB in 33002461 documents in 2 hours 31 min 25.8357 sec (3.63K documents/s)
Processed 142.88 GB in 34838318 documents in 2 hours 44 min 56.9327 sec (3.52K documents/s)
Processed 139.09 GB in 34395243 documents in 2 hours 39 min 53.3889 sec (3.59K documents/s)
The final size in InnoDB is 991G.
Note: InnoDB does not use compression in this test. Although the InnoDB engine has compression capabilities, this is not what we typically recommend to use due to different issues.
The MyRock average insert throughout is 4.81K documents/sec and InnoDB is 3.7K documents/sec, so there is a 24 percent gain in loading speed comparing MyRocks to InnoDB.
Now let’s compare some aggregation queries, with benefits to show how we can handle JSON documents in MySQL.
Find the top subreddits (by the number of comments) that are not marked as “Adult only” (I did not risk including “adult” subreddits in the output…):
SELECT doc->"$.subreddit",count(*) cnt FROM stor WHERE JSON_VALUE(doc, "$.over_18" RETURNING UNSIGNED) IS NOT TRUE GROUP BY 1 ORDER BY cnt DESC LIMIT 100;
The first lines of output:
+-------------------------+--------+
| doc->"$.subreddit" | cnt |
+-------------------------+--------+
| "AskReddit" | 329683 |
| "teenagers" | 131401 |
| "memes" | 102118 |
| "AutoNewspaper" | 82080 |
| "relationship_advice" | 66883 |
| "UltraAlgo" | 64958 |
| "antiwork" | 54697 |
| "NoStupidQuestions" | 52531 |
| "NFTsMarketplace" | 48647 |
| "CryptoCurrency" | 46817 |
And execution times:
MyRocks: 100 rows in set (6 min 20.31 sec)
InnoDB: 100 rows in set (8 min 10.29 sec)
The not-so-good parts of MyRocks
Full disclosure: I also want to show the part where MyRocks performs worse than InnoDB. For this, I will use queries that extract comments from the author.
For this we need to create an index by “author”, and because this is all stored in a JSON document, we need to create a VIRTUAL column “author” as:
ALTER TABLE stor ADD COLUMN author VARCHAR(255) GENERATED ALWAYS AS ( doc->"$.author" );
And for InnoDB this is practically an instant operation:
Query OK, 0 rows affected (1 min 22.61 sec)
While for MyRocks it takes time:
Query OK, 61934232 rows affected (45 min 46.61 sec)
The difference is that MyRocks does not support any “INSTANT” DDL operations yet, while InnoDB does. However, we see that exactly “INSTANT” operations cause major bugs and incompatibilities in InnoDB in versions 8.0.29 and 8.0.30 (see Percona XtraBackup 8.0.29 and INSTANT ADD/DROP Columns – Percona Database Performance Blog) so I would ask for extra caution before using “INSTANT” operations in InnoDB.
And now we can create an index on a virtual column:
MyRocks:
ALTER TABLE stor ADD KEY (author);
Query OK, 0 rows affected (14 min 37.34 sec)
InnoDB:
ALTER TABLE stor ADD KEY (author);
Query OK, 0 rows affected (40 min 4.39 sec)
MyRocks took much less time creating a new index on a column. To show where MyRocks is worse I will use the poor-man benchmark available in MySQL, namely the BENCHMARK command, such as:
SELECT BENCHMARK(50000000,(with t1 as (with q as (select FLOOR(1 + RAND()*(7685162 -1 + 1)) c1) select * from authors,q where id=q.c1) select count(*) cnt from stor,t1 where stor.author=t1.author limit 100));
Basically, I will execute 50mln times a query that extracts comments from a random author.
For InnoDB execution times (less is better, three different attempts):
1 row in set (2 min 14.66 sec)
1 row in set (1 min 53.10 sec)
1 row in set (1 min 40.18 sec)
For MyRocks:
1 row in set (6 min 38.60 sec)
1 row in set (5 min 42.83 sec)
1 row in set (6 min 4.76 sec)
To put this into perspective, for InnoDB it resulted in 442k queries/sec, and for MyRocks 137k queries/sec (compression and write-optimized data structure play a role here). We need to highlight these results WITHOUT network communication. In normal circumstances, there will be added network latency and the difference between two engines will be a lot less.
Conclusion
MyRocks makes a good use case for a big dataset (in our test we loaded ¾ TB dataset) providing a good insertion rate and small compressed size of the final product.
As a drawback of compression and a write-optimized engine, MyRocks is behind InnoDB in quick “index lookup” queries, which is acceptable in my opinion, but you should evaluate this for your usage.
Appendix:
Hardware specifications and configurations:
2022-MyRocks-reddit/benchmarkspec.md at main · Percona-Lab-results/2022-MyRocks-reddit (github.com)
10
2022
Building Percona Server for MySQL 8.0 with RocksDB Storage Engine on macOS
In Percona Server for MySQL 8.0.29-21, we added one more patch that helps us to build server code on macOS. To be precise here, we still could do this even before this patch but only partially. Now it is possible to build RocksDB Storage Engine as well.
A word of disclaimer here, at the moment, by macOS we still understand macOS for Intel x86_64 architecture (the most recent ARM versions with Apple M1 / M2 processors are out of the scope of this blog post). Moreover, Percona does not provide official macOS packages for Percona Server for MySQL. It’s just that a lot of our developers use Macs and we care a lot about this platform. As proof, here is a link to our Azure pipelines CI scripts with the latest CI changes that accompany this patch.
Prerequisites
In order to build Percona Server for MySQL, we need to install a number of dependencies. The most important one is Xcode which can be downloaded and installed directly from the App Store. Make sure that you run it at least once after installing. You will be asked to install Xcode Command Line Tools – please answer “yes” as this is the component we are looking for.
We will also need to install a number of third-party libraries on which Percona Server for MySQL code depends and the easiest way to do this would be via the brew package manager. If you don’t already have it installed, please follow the instructions.
After that, install the following tools/libraries via brew:
brew install cmake zstd opensl@1.1 libevent lz4 icu4c protobuf libfido2
Please also notice that macOS itself, Xcode Command Line Tools, and brew packages are being constantly updated and when you try to follow the provided instructions their versions may not be the same as at the time when this blog post was written and as a result, something may not go as expected. So, just in case, I am including an excerpt from the brew config output to simply give you an idea of how old those components were.
HOMEBREW_VERSION: 3.5.4-52-g0070591 ORIGIN: https://github.com/Homebrew/brew HEAD: 007059160f1a9d7afba296e9aa30ab52d4ef29b7 Core tap ORIGIN: https://github.com/Homebrew/homebrew-core Core tap HEAD: 83ce9a03239fe496e292f23a576bc7d1bcea4cca Core tap branch: master Clang: 13.1.6 build 1316 Git: 2.32.1 => /Library/Developer/CommandLineTools/usr/bin/git Curl: 7.79.1 => /usr/bin/curl macOS: 12.4-x86_64 Xcode: 13.4.1
Building
The first step would be to create a workspace directory and make it default:
mkdir ws cd ws
Second, download Percona Server for MySQL source code from the git repository. In these instructions, we will be checking out code marked with the Percona-Server-8.0.29-21 tag (just because it was the most recent release when this blog post was written) but feel free to experiment with the trunk (the head of the 8.0 branch).
git clone https://github.com/percona/percona-server.git cd percona-server git checkout -b current Percona-Server-8.0.29-21 git submodule init git submodule update
Then, we create a separate directory percona-build for intermediate objects and result libraries/executables.
cd .. mkdir percona-build cd percona-build
After that, we need to run cmake to configure the project and generate Unix Makefiles.
cmake ../percona-server \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DBUILD_CONFIG=mysql_release \ -DMYSQL_MAINTAINER_MODE=OFF \ -DDOWNLOAD_BOOST=ON \ -DWITH_BOOST=../deps \ -DWITH_SYSTEM_LIBS=ON \ -DWITHOUT_TOKUDB=ON -DWITH_ROCKSDB=ON
I am not going to describe every parameter here as there is an official MySQL Server documentation page but the most important ones are:
- -DCMAKE_BUILD_TYPE=RelWithDebInfo – configure to build optimized binaries with Debug information
- -DMYSQL_MAINTAINER_MODE=OFF – do not treat compiler warnings as errors
- -DWITH_SYSTEM_LIBS=ON – use system libraries (those we installed via the brew) instead of their bundled version
- -DWITHOUT_TOKUDB=ON – do not build TokuDB Storage Engine, it is deprecated and is incompatible with macOS
- -DWITH_ROCKSDB=ON – build with RocksDB Storage Engine (the main reason for this blog post)
And finally, initiate the build process
cmake --build . -- -j
Please be patient as this may take from 15 to 60 min depending on your hardware.
Testing
Just to make sure that everything has been built successfully let us run a few MySQL tests via MTR.
./mysql-test –debug-server rocksdb.1st
If you see something like this:
============================================================================== TEST NAME RESULT TIME (ms) COMMENT ------------------------------------------------------------------------------ [ 25%] rocksdb.1st 'write_prepared' [ pass ] 89 [ 50%] rocksdb.1st 'write_unprepared' [ pass ] 85 [ 75%] rocksdb.1st 'write_committed' [ pass ] 85 [100%] shutdown_report [ pass ] ------------------------------------------------------------------------------
Congratulations, you have successfully built Percona Server for MySQL with RocksDB on macOS.
07
2022
Configure wiredTiger cacheSize Inside Percona Distribution for MongoDB Kubernetes Operator
Nowadays we are seeing a lot of customers starting to use our Percona Distribution for MongoDB Kubernetes Operator. The Percona Kubernetes Operators are based on best practices for the configuration of a Percona Server for MongoDB replica set or the sharded cluster. The main component in MongoDB is the wiredTiger cache which helps to define the cache used by this engine and we can set it based on our load.
In this blog post, we will see how to define the resources’ memory and set the wiredTiger cache for the shard replicaset to improve the performance of the sharded cluster.
The Necessity of WT cache
The parameter storage.wiredTiger.engineConfig.cacheSizeGB limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache. To accommodate the additional consumers of RAM, you may have to set WiredTiger’s internal cache size properly.
Starting from MongoDB 3.4, the default WiredTiger internal cache size is the larger of either:
50% of (RAM - 1 GB), or 256 MB.
For example, on a system with a total of 4GB of RAM the WiredTiger cache will use 1.5GB of RAM (0.5 * (4 GB – 1 GB) = 1.5 GB). Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache because that is more than half of the total RAM minus one gigabyte (0.5 * (1.25 GB – 1 GB) = 128 MB < 256 MB).
WT cacheSize in Kubernetes Operator
The mongodb wiredTiger cacheSize can be tune with the parameter storage.wiredTiger.engineConfig.cacheSizeRatio and its default value is 0.5. As explained above, if the system allocated memory limit is too low, then the WT cache is set to 256M or calculated as per the formula.
Prior to PSMDB operator 1.9.0, the cacheSizeRatio can be tuned under the sharding section of the cr.yaml file. This is deprecated from v1.9.0+ and unavailable from v1.12.0+. So you have to use the cacheSizeRatio parameter available under replsets configuration instead. The main thing that you will need to check here before changing the cacheSize is to make sure that the resources’ memory limit allocated is also available as per your cacheSize’s requirement. i.e the below section limiting the memory:
resources: limits: cpu: "300m" memory: "0.5G" requests: cpu: "300m" memory: "0.5G"
https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/psmdb/container.go#L307
From the source code that calculates the mongod.storage.wiredTiger.engineConfig.cacheSizeRatio:
// In normal situations WiredTiger does this default-sizing correctly but under Docker // containers WiredTiger fails to detect the memory limit of the Docker container. We // explicitly set the WiredTiger cache size to fix this. // // https://docs.mongodb.com/manual/reference/configuration-options/#storage.wiredTiger.engineConfig.cacheSizeGB// func getWiredTigerCacheSizeGB(resourceList corev1.ResourceList, cacheRatio float64, subtract1GB bool) float64 { maxMemory := resourceList[corev1.ResourceMemory] var size float64 if subtract1GB { size = math.Floor(cacheRatio * float64(maxMemory.Value()-gigaByte)) } else { size = math.Floor(cacheRatio * float64(maxMemory.Value())) } sizeGB := size / float64(gigaByte) if sizeGB < minWiredTigerCacheSizeGB { sizeGB = minWiredTigerCacheSizeGB } return sizeGB }
Changing the cacheSizeRatio
Here for the test, we deployed the PSMDB operator on GCP. You can refer here for the steps – https://www.percona.com/doc/kubernetes-operator-for-psmongodb/gke.html. With the latest operator v1.11.0, the sharded cluster has been started with a shard and a config server replicaSets along with mongos pods.
$ kubectl get pods NAME READY STATUS RESTARTS AGE my-cluster-name-cfg-0 2/2 Running 0 4m9s my-cluster-name-cfg-1 2/2 Running 0 2m55s my-cluster-name-cfg-2 2/2 Running 1 111s my-cluster-name-mongos-758f9fb44-d4hnh 1/1 Running 0 99s my-cluster-name-mongos-758f9fb44-d5wfm 1/1 Running 0 99s my-cluster-name-mongos-758f9fb44-wmvkx 1/1 Running 0 99s my-cluster-name-rs0-0 2/2 Running 0 4m7s my-cluster-name-rs0-1 2/2 Running 0 2m55s my-cluster-name-rs0-2 2/2 Running 0 117s percona-server-mongodb-operator-58c459565b-fc6k8 1/1 Running 0 5m45s
Now login into the shard and check the default memory allocated to the container and to the mongod instance. In below, the memory size available is 15G, but the memory limit to use in this container is 476MB only:
rs0:PRIMARY> db.hostInfo() { "system" : { "currentTime" : ISODate("2021-12-30T07:16:59.441Z"), "hostname" : "my-cluster-name-rs0-0", "cpuAddrSize" : 64, "memSizeMB" : NumberLong(15006), "memLimitMB" : NumberLong(476), "numCores" : 4, "cpuArch" : "x86_64", "numaEnabled" : false }, "os" : { "type" : "Linux", "name" : "Red Hat Enterprise Linux release 8.4 (Ootpa)", "version" : "Kernel 5.4.144+" }, "extra" : { "versionString" : "Linux version 5.4.144+ (builder@7d732a1aec13) (Chromium OS 12.0_pre408248_p20201125-r7 clang version 12.0.0 (/var/tmp/portage/sys-devel/llvm-12.0_pre408248_p20201125-r7/work/llvm-12.0_pre408248_p20201125/clang f402e682d0ef5598eeffc9a21a691b03e602ff58)) #1 SMP Sat Sep 25 09:56:01 PDT 2021", "libcVersion" : "2.28", "kernelVersion" : "5.4.144+", "cpuFrequencyMHz" : "2000.164", "cpuFeatures" : "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities", "pageSize" : NumberLong(4096), "numPages" : 3841723, "maxOpenFiles" : 1048576, "physicalCores" : 2, "mountInfo" : [ .. ..
The cachesize in MB of wiredTiger engine allocated in Shard is as follows:
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024 256
The cache size of 256MB is too low for the real environment. So let’s see how to tune the memory limit and also the cacheSize of WT engine. You can use the parameter called cacheSizeRatio to mention the WT cache ratio (out of 1) and memlimit to mention the memory allocated to the container. To do this, edit the cr.yaml file under deploy directory in the operator to change the settings. From the PSMDB operator v1.9.0, editing cacheSizeRatio parameter under mongod section is deprecated. So for the WT cache limit, use the cacheSizeRatio parameter under the section “replsets” and to set memory, use the memlimit parameter. Setting 3G for the container and 80% of the memory calculations.
deploy/cr.yaml:58
46 configuration: | 47 # operationProfiling: 48 # mode: slowOp 49 # systemLog: 50 # verbosity: 1 51 storage: 52 engine: wiredTiger 53 # inMemory: 54 # engineConfig: 55 # inMemorySizeRatio: 0.9 56 wiredTiger: 57 engineConfig: 58 cacheSizeRatio: 0.8
deploy/cr.yaml:229-232:
226 resources: 227 limits: 228 cpu: "300m" 229 memory: "3G" 230 requests: 231 cpu: "300m" 232 memory: "3G"
Apply the new cr.yaml
# kubectl appli -f deploy/cr.yaml perconaservermongodb.psmdb.percona.com/my-cluster-name configured
The shard pods are re-allocated and you can check the progress as follows:
$ kubectl get pods NAME READY STATUS RESTARTS AGE my-cluster-name-cfg-0 2/2 Running 0 36m my-cluster-name-cfg-1 2/2 Running 0 35m my-cluster-name-cfg-2 2/2 Running 1 34m my-cluster-name-mongos-758f9fb44-d4hnh 1/1 Running 0 34m my-cluster-name-mongos-758f9fb44-d5wfm 1/1 Running 0 34m my-cluster-name-mongos-758f9fb44-wmvkx 1/1 Running 0 34m my-cluster-name-rs0-0 0/2 Init:0/1 0 13s my-cluster-name-rs0-1 2/2 Running 0 60s my-cluster-name-rs0-2 2/2 Running 0 8m33s percona-server-mongodb-operator-58c459565b-fc6k8 1/1 Running 0 38m
Now check the new settings of WT cache as follows:
rs0:PRIMARY> db.hostInfo().system { "currentTime" : ISODate("2021-12-30T08:37:38.790Z"), "hostname" : "my-cluster-name-rs0-1", "cpuAddrSize" : 64, "memSizeMB" : NumberLong(15006), "memLimitMB" : NumberLong(2861), "numCores" : 4, "cpuArch" : "x86_64", "numaEnabled" : false } rs0:PRIMARY> rs0:PRIMARY> rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024 1474
Here, the memory calculation for WT is done roughly as follows (Memory limit should be more than 1G, else 256MB is allocated by default:
(Memory limit – 1G) * cacheSizeRatio
(2861 - 1) *0.8 = 1467
NOTE:
Till PSMDB operator v1.10.0, the operator takes the change of cacheSizeRatio only if the resources.limit.cpu is also set. This is a bug and it got fixed in v1.11.0 – refer https://jira.percona.com/browse/K8SPSMDB-603 . So if you’re in an older version, don’t be surprised and you have to make sure the resources.limit.cpu is set as well.
https://github.com/percona/percona-server-mongodb-operator/blob/v1.10.0/pkg/psmdb/container.go#L194
if limit, ok := resources.Limits[corev1.ResourceCPU]; ok && !limit.IsZero() { args = append(args, fmt.Sprintf( "--wiredTigerCacheSizeGB=%.2f", getWiredTigerCacheSizeGB(resources.Limits, replset.Storage.WiredTiger.EngineConfig.CacheSizeRatio, true), )) }
From v1.11.0:
https://github.com/percona/percona-server-mongodb-operator/blob/v1.11.0/pkg/psmdb/container.go#L194
if limit, ok := resources.Limits[corev1.ResourceMemory]; ok && !limit.IsZero() { args = append(args, fmt.Sprintf( "--wiredTigerCacheSizeGB=%.2f", getWiredTigerCacheSizeGB(resources.Limits, replset.Storage.WiredTiger.EngineConfig.CacheSizeRatio, true), )) }
Conclusion
So based on the application load, you will need to set the cacheSize of WT for better performance. You can use the above methods to tune the cache size for the shard replicaset in the PSMDB operator.
Reference Links :
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/operator.html
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/gke.html
MongoDB 101: How to Tune Your MongoDB Configuration After Upgrading to More Memory