Feb 22, 2018

How to Restore MySQL Logical Backup at Maximum Speed

The ability to restore MySQL logical backups is a significant part of disaster recovery procedures. It’s a last line of defense.

Even if you lose all data from a production server, physical backups (data file snapshots created with an offline copy or with Percona XtraBackup) could contain the same internal database structure corruption as the production data. Backups in a simple plain text format allow you to avoid such corruption and to migrate between database formats (e.g., during a software upgrade or downgrade), or even help with migration from a completely different database solution.

Unfortunately, the restore speed for logical backups is usually poor, and for a big database it can take days or even weeks to get the data back. It’s therefore important to tune backups and MySQL for the fastest possible restore, and to change the settings back before resuming production operations.

Disclaimer

All results are specific to my combination of hardware and dataset, but could be used as an illustration for MySQL database tuning procedures related to logical backup restore.

Benchmark

There is no general advice for tuning a MySQL database for a bulk logical backup load, and any parameter should be verified with a test on your hardware and database. In this article, we will explore some variables that help that process. To illustrate the tuning procedure, I’ve downloaded IMDB CSV files and created a MySQL database with pyimdb.

You may repeat the whole benchmark procedure, or just look at settings changed and resulting times.

Database:

  • 16GB – InnoDB database size
  • 6.6GB – uncompressed mysqldump sql
  • 5.8GB – uncompressed CSV + create table statements.

The simplest restore procedure for logical backups created by the mysqldump tool:

mysql -e 'create database imdb;'
time mysql imdb < imdb.sql
# real 129m51.389s

This requires slightly more than two hours to restore the backup into the MySQL instance started with default settings.

I’m using the Docker image percona:latest – it contains Percona Server 5.7.20-19 running on a laptop with 16GB RAM, Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, two disks: SSD KINGSTON RBU-SNS and HDD HGST HTS721010A9.

Let’s start with some “good” settings: a buffer pool bigger than the default, 2x1GB transaction log files, sync disabled (because we are using a slow HDD), and large values for IO capacity. The load should be faster with big batches, so we also use 1GB for max_allowed_packet.

The values were chosen to be bigger than the MySQL defaults because I’m trying to see the difference from the commonly suggested values (like giving 80% of RAM to the InnoDB buffer pool).

docker run --publish-all --name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
  --innodb_buffer_pool_size=4GB
  --innodb_log_file_size=1G
  --skip-log-bin
  --innodb_flush_log_at_trx_commit=0
  --innodb_flush_method=nosync
  --innodb_io_capacity=2000
  --innodb_io_capacity_max=3000
  --max_allowed_packet=1G
  time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
  # real 59m34.252s

The load is IO-bound, and setting global foreign_key_checks=0 and unique_checks=0 has no effect because these variables are already disabled in the dump file.

How can we reduce IO?

Disable InnoDB double write: --innodb_doublewrite=0

time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
# real 44m49.963s

A huge improvement, but we still have an IO-bound load.

We will not be able to improve the load time significantly for an IO-bound load. Let’s move to SSD:

time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
# real 33m36.975s

Is it vital to disable disk sync for the InnoDB transaction log?

sudo rm -rf mysql/*
docker rm p57
docker run -v /home/ihanick/Private/Src/tmp/data-movies/imdb.sql:/root/imdb.sql -v /home/ihanick/Private/Src/tmp/data-movies/mysql:/var/lib/mysql
--name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
--innodb_buffer_pool_size=4GB
--innodb_log_file_size=1G
--skip-log-bin
--innodb_flush_log_at_trx_commit=0
--innodb_io_capacity=700
--innodb_io_capacity_max=1500
--max_allowed_packet=1G
--innodb_doublewrite=0
# real 33m49.724s

There is no significant difference.

By default, mysqldump produces SQL data, but it could also save data to CSV format:

cd /var/lib/mysql-files
mkdir imdb
chown mysql:mysql imdb/
time mysqldump --max_allowed_packet=128M --tab /var/lib/mysql-files/imdb imdb1
# real 1m45.983s
sudo rm -rf mysql/*
docker rm p57
docker run -v /srv/ihanick/tmp/imdb:/var/lib/mysql-files/imdb -v /home/ihanick/Private/Src/tmp/data-movies/mysql:/var/lib/mysql
--name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
--innodb_buffer_pool_size=4GB
--innodb_log_file_size=1G
--skip-log-bin
--innodb_flush_log_at_trx_commit=0
--innodb_io_capacity=700
--innodb_io_capacity_max=1500
--max_allowed_packet=1G
--innodb_doublewrite=0
time (
mysql -e 'drop database imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1 ;
for i in $PWD/*.txt ; do mysqlimport imdb1 $i ; done
)
# real 21m56.049s
1.5X faster, just because of changing the format from SQL to CSV!

We’re still using only one CPU core, so let’s improve the load with the --use-threads=4 option:

time (
mysql -e 'drop database if exists imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1
mysqlimport --use-threads=4 imdb1 $PWD/*.txt
)
# real 15m38.147s

In the end, the load is still not fully parallel due to a big table: all other tables are loaded, but one thread is still active.

Let’s split the CSV files into smaller ones, for example 100k rows in each file, and load them with GNU/parallel:

# /var/lib/mysql-files/imdb/test-restore.sh
apt-get update ; apt-get install -y parallel
cd /var/lib/mysql-files/imdb
time (
mkdir -p split1 ; cd split1
for i in ../*.txt ; do echo $i ; split -a 6 -l 100000 -- $i `basename $i .txt`. ; done
for i in `ls *.*|sed 's/^[^.]*\.//'|sort -u` ; do
mkdir ../split-$i
for j in *.$i ; do mv $j ../split-$i/${j/$i/txt} ; done
done
)
# real 2m26.566s
time (
mysql -e 'drop database if exists imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1
parallel 'mysqlimport imdb1 /var/lib/mysql-files/imdb/{}/*.txt' ::: split-*
)
#real 16m50.314s

Split is not free, but you can split your dump files right after backup.

The load is parallel now, but the single big table strikes back with ‘setting auto-inc lock’ waits in SHOW ENGINE INNODB STATUS\G.
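To confirm that auto-inc locking is what serializes the load, a crude check like the following can be run periodically while the import is in progress (a sketch; the exact message text may vary between server versions):

mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -i "auto-inc"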

Using the --innodb_autoinc_lock_mode=2 option fixes this issue: 16m2.567s.

We got slightly better results with just mysqlimport --use-threads=4. Let’s check whether hyperthreading helps and whether the problem is caused by the “parallel” tool:

  • Using four parallel jobs for load: 17m3.662s
  • Using four parallel jobs for load and two threads: 16m4.218s

There is no difference between GNU/Parallel and the --use-threads option of mysqlimport.

Why 100k rows? With 500k rows: 15m33.258s

Now we have performance better than for mysqlimport --use-threads=4.

How about 1M rows at once? Just 16m52.357s.

I see periodic log-flushing messages, so let’s use bigger transaction logs (2x4GB): 12m18.160s:

--innodb_buffer_pool_size=4GB --innodb_log_file_size=4G --skip-log-bin --innodb_flush_log_at_trx_commit=0 --innodb_io_capacity=700 --innodb_io_capacity_max=1500 --max_allowed_packet=1G --innodb_doublewrite=0 --innodb_autoinc_lock_mode=2 --performance-schema=0

Let’s compare these numbers with myloader 0.6.1, also running with four threads (myloader takes only the -d parameter here; the myloader execution time is listed under the corresponding mydumper command):

# oversized statement size to get 0.5M rows in one statement, single statement per chunk file
mydumper -B imdb1 --no-locks --rows 500000 --statement-size 536870912 -o 500kRows512MBstatement
17m59.866s
mydumper -B imdb1 --no-locks -o default_options
17m15.175s
mydumper -B imdb1 --no-locks --chunk-filesize 128 -o chunk128MB
16m36.878s
mydumper -B imdb1 --no-locks --chunk-filesize 64 -o chunk64MB
18m15.266s

It would be great to test mydumper with the CSV format, but unfortunately it hasn’t been implemented in the last 1.5 years: https://bugs.launchpad.net/mydumper/+bug/1640550.

Returning to the parallel CSV file load with even bigger transaction logs (2x8GB): 11m15.132s.

What about a bigger buffer pool: --innodb_buffer_pool_size=12G? 9m41.519s

Let’s check six-year-old server-grade hardware: Intel(R) Xeon(R) CPU E5-2430 with SAS raid (used only for single SQL file restore test) and NVMe (Intel Corporation PCIe Data Center SSD, used for all other tests).

I’m using similar options as for previous tests, with 100k rows split for CSV files load:

--innodb_buffer_pool_size=8GB --innodb_log_file_size=8G --skip-log-bin --innodb_flush_log_at_trx_commit=0 --innodb_io_capacity=700 --innodb_io_capacity_max=1500 --max_allowed_packet=1G --innodb_doublewrite=0 --innodb_autoinc_lock_mode=2

  • A single SQL file created by mysqldump loaded in 117m29.062s = 2x slower.
  • 24 parallel processes of mysqlimport: 11m51.718s
  • Again, hyperthreading makes a huge difference! 12 parallel jobs: 18m3.699s.
  • Due to the higher concurrency, the adaptive hash index becomes a source of locking contention. After disabling it with --skip-innodb_adaptive_hash_index: 10m52.788s.
  • In many places, disabling unique checks is referred to as a performance booster: 10m52.489s
    You can spend more time reading advice about unique_checks; it might help for some databases with many unique indexes (in addition to the primary one). A small sketch of relaxing these checks follows this list.
  • The buffer pool is smaller than the dataset. Can you change the old/new pages split to make inserts faster? No: --innodb_old_blocks_pct=5: 10m59.517s.
  • O_DIRECT is also recommended: --innodb_flush_method=O_DIRECT: 11m1.742s.
  • O_DIRECT is not able to improve performance by itself, but if it allows you to use a bigger buffer pool: O_DIRECT + 30% bigger buffer pool: --innodb_buffer_pool_size=11G: 10m46.716s.
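As promised above, here is a minimal sketch of relaxing unique and foreign key checks for the whole restore. The global values are picked up by the new connections that mysqlimport opens; remember to restore them afterwards:

# relax the checks before the import (sketch; values are reverted afterwards)
mysql -e "SET GLOBAL unique_checks=0; SET GLOBAL foreign_key_checks=0;"
# ... run the parallel mysqlimport jobs here ...
mysql -e "SET GLOBAL unique_checks=1; SET GLOBAL foreign_key_checks=1;"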

Conclusions

  • There is no common solution for improving the logical backup restore procedure.
  • If you have an IO-bound restore: disable InnoDB doublewrite. It’s safe because even if the database crashes during restore, you can restart the operation.
  • Do not use SQL dumps for databases larger than 5-10GB. CSV files are much faster than mysqldump+mysql: implement mysqldump --tab + mysqlimport, or use mydumper/myloader with an appropriate chunk-filesize.
  • The number of rows per LOAD DATA INFILE batch is important. Usually 100K-1M; use binary search (2-3 iterations) to find a good value for your dataset.
  • InnoDB log file size and buffer pool size are really important options for backup restore performance.
  • O_DIRECT reduces insert speed, but it’s good if you can increase the buffer pool size.
  • If you have enough RAM or SSD, the restore procedure is limited by CPU. Use a faster CPU (higher frequency, turboboost).
  • Hyperthreading also counts.
  • A powerful server could be slower than your laptop (12×2.4GHz vs. 4×2.8+turboboost).
  • Even with modern hardware, it’s hard to expect backup restore faster than 50MBps (for the final size of InnoDB database).
  • You can find a lot of different advice on how to improve backup load speed. Unfortunately, it’s not possible to implement improvements blindly, and you should know the limits of your system with general Unix performance tools like vmstat, iostat and various MySQL commands like SHOW ENGINE INNODB STATUS (all can be collected together with pt-stalk).
  • Percona Monitoring and Management (PMM) also provides good graphs, but you should be careful with QAN: full slow query log during logical database dump restore can cause significant processing load.
  • Default MySQL settings could cost you a 10x backup restore slowdown.
  • This benchmark is aimed at speeding up the restore procedure while the application is not running and the server is not used in production. Make sure that you have reverted all configuration parameters back to production values after the load; a sketch of such a revert follows this list. For example, if you disable the InnoDB doublewrite buffer during restore and leave it disabled in production, you may get scary data corruption due to partial InnoDB page writes.
  • If the application is running during the restore, in most cases you will get an inconsistent database, because the restore methods discussed above do not provide locking or transactional consistency.
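As mentioned above, here is a sketch of the kind of production-safe server options to restore once the load is done, in the same one-line format used earlier. The exact values depend on your environment, and innodb_doublewrite is not dynamic, so a restart with the production configuration is required:

# production-safe values to restore after the load (a sketch; adapt to your my.cnf)
--innodb_doublewrite=1 --innodb_flush_log_at_trx_commit=1 --innodb_flush_method=O_DIRECT --log-bin=mysql-bin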
Feb 19, 2018

Percona Server for MySQL 5.7.21-20 Is Now Available

Percona announces the GA release of Percona Server for MySQL 5.7.21-20 on February 19, 2018. Download the latest version from the Percona web site or the Percona Software Repositories. You can also run Docker containers from the images in the Docker Hub repository.

Based on MySQL 5.7.21, including all the bug fixes in it, Percona Server for MySQL 5.7.21-20 is the current GA release in the Percona Server for MySQL 5.7 series. Percona provides completely open-source and free software.

New Features:
  • A new string variable version_suffix allows changing the suffix of the Percona Server version string returned by the read-only version variable. Also, version_comment was converted from a global read-only to a global read-write variable.
  • A new keyring_vault_timeout variable allows setting the number of seconds for the Vault server connection timeout. Bug fixed #298.
Bugs Fixed:
  • The mysqld startup script was unable to detect the jemalloc library location for preloading, and that prevented starting Percona Server on systemd-based machines. Bugs fixed #3784 and #3791.
  • There was a problem with fulltext search, which could find a word with punctuation marks in natural language mode only, but not in boolean mode. Bugs fixed #258, #2501 (upstream #86164).
  • Build errors were present on FreeBSD (caused by fixing the bug #255 in Percona Server 5.6.38-83.0) and on MacOS (caused by fixing the bug #264 in Percona Server 5.7.20-19). Bugs fixed #2284 and #2286.
  • A number of fixes were introduced to remove GCC 7 compilation warnings for the Percona Server build. Bugs fixed #3780 (upstream #89420, #89421, and #89422).
  • A CMake error occurred during compilation with the bundled zlib. Bug fixed #302.
  • A GCC 7 warning fix introduced a regression in Percona Server that led to a wrong SQL query being built to access the remote server when the Federated storage engine was used. Bug fixed #1134.
  • It was possible to enable encrypt_binlog with no binary or relay logging enabled. Bug fixed #287.
  • Long buffer wait times were occurring on busy servers in case of the IMPORT TABLESPACE command. Bug fixed #276.
  • Server queries that contained JSON special characters and were logged by Audit Log Plugin in JSON format caused invalid output due to lack of escaping. Bug fixed #1115.
  • Percona Server now uses Travis CI for additional tests. Bug fixed #3777.

Other bugs fixed: #257, #264, #1090 (upstream #78048), #1109, #1127, #2204, #2414, #2415, #3767, #3794, and #3804 (upstream #89598).

 This release also contains fixes for the following CVE issues: CVE-2018-2565, CVE-2018-2573, CVE-2018-2576, CVE-2018-2583, CVE-2018-2586, CVE-2018-2590, CVE-2018-2612, CVE-2018-2600, CVE-2018-2622, CVE-2018-2640, CVE-2018-2645, CVE-2018-2646, CVE-2018-2647, CVE-2018-2665, CVE-2018-2667, CVE-2018-2668, CVE-2018-2696, CVE-2018-2703, CVE-2017-3737.
MyRocks Changes:
  • A new behavior makes Percona Server fail to restart on detected data corruption;  rocksdb_allow_to_start_after_corruption variable can be passed to mysqld as a command line parameter to switch off this restart failure.
  • A new cmake option ALLOW_NO_SSE42 was introduced to allow building MyRocks on hosts that do not support the SSE 4.2 instruction set, which makes MyRocks usable without FastCRC32-capable hardware. Bug fixed MYR-207.
  • rocksdb_bytes_per_sync  and rocksdb_wal_bytes_per_sync  variables were turned into dynamic ones.
  • rocksdb_flush_memtable_on_analyze variable has been removed.
  • rocksdb_concurrent_prepare is now deprecated, as it has been renamed in upstream to  rocksdb_two_write_queues.
  • rocksdb_row_lock_deadlocks and rocksdb_row_lock_wait_timeouts global status counters were added to track the number of deadlocks and the number of row lock wait timeouts.
  • Creating a table with a string column indexed with a non-binary collation now generates a warning about using an inefficient collation instead of an error. Bug fixed MYR-223.
TokuDB Changes:
  • A memory leak was fixed in the PerconaFT library, caused by not destroying PFS key objects on shutdown. Bug fixed TDB-98.
  • A clang-format configuration was added to PerconaFT and TokuDB. Bug fixed TDB-104.
  • A data race was fixed in minicron utility of the PerconaFT. Bug fixed TDB-107.
  • Row count and cardinality decreased to zero after a long-running REPLACE load.

Other bugs fixed: TDB-48, TDB-78, TDB-93, and TDB-99.

The release notes for Percona Server for MySQL 5.7.21-20 are available in the online documentation. Please report any bugs on the project bug tracking system.

Feb 15, 2018

Troubleshooting MySQL Crashes Webinar: Q&A

In this blog, I will provide answers to the Q&A for the Troubleshooting MySQL Crashes webinar.

First, I want to thank everybody for attending our January 25, 2018, webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I was unable to answer fully during the webinar.

Q: I have the 600 seconds “Long semaphore wait” assertion failure / crashing issue following DDL queries, sometimes on the master, sometimes just the slaves. Any hints for troubleshooting these? How can I understand what semaphore holding threads are doing?

A: These are the hardest errors to troubleshoot, especially because in some cases (like long-running CHECK TABLE commands) long semaphore waits could be expected and appropriate behavior. If you see long semaphore waits when performing DDL operations, it makes sense to consider using the pt-online-schema-change or gh-ost utilities. Also, check the list of supported online DDL operations in the MySQL User Reference Manual.
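For reference, a typical pt-online-schema-change invocation looks roughly like the following; the database, table, and index names are placeholders, and you would normally start with --dry-run before --execute:

pt-online-schema-change --alter "ADD INDEX idx_col1 (col1)" D=mydb,t=mytable --dry-run
pt-online-schema-change --alter "ADD INDEX idx_col1 (col1)" D=mydb,t=mytable --execute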

But if you want to know how to analyze such messages, let’s check the output from page #17 in the slide deck used in the webinar:

2018-01-19T20:38:43.381127Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
a writer (thread id 139965530261248) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 3454
Last time write locked in file /mnt/workspace/percona-server-5.7-binaries-release/label_exp/
debian-wheezy-x64/percona-server-5.7.14-8/storage/innobase/btr/btr0btr.cc line 177
2018-01-19T20:38:43.381143Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139965135804160 has waited at buf0buf.cc line 4196 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4f257d33c0 created in file hash0hash.cc line 353
a writer (thread id 139965345621760) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0buf.cc line 4196
Last time write locked in file ...

The line

--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:

shows that some transaction was waiting for a semaphore. The code responsible for this wait is located on line 3454 in file ibuf0ibuf.cc. I received this crash when I ran Percona Server for MySQL version 5.7.14-8. Therefore, to check what this code is doing, I need to use Percona Server 5.7.14-8 source code:

sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/ibuf/ibuf0ibuf.cc
...
3454 btr_pcur_open(ibuf->index, ibuf_entry, PAGE_CUR_LE, mode, &pcur, &mtr);
...

A few lines above in the same file, we find the function definition and its comment:

3334 /** Buffer an operation in the insert/delete buffer, instead of doing it
3335 directly to the disk page, if this is possible.
3336 @param[in] mode BTR_MODIFY_PREV or BTR_MODIFY_TREE
3337 @param[in] op operation type
3338 @param[in] no_counter TRUE=use 5.0.3 format; FALSE=allow delete
3339 buffering
3340 @param[in] entry index entry to insert
3341 @param[in] entry_size rec_get_converted_size(index, entry)
3342 @param[in,out] index index where to insert; must not be unique
3343 or clustered
3344 @param[in] page_id page id where to insert
3345 @param[in] page_size page size
3346 @param[in,out] thr query thread
3347 @return DB_SUCCESS, DB_STRONG_FAIL or other error */
3348 static MY_ATTRIBUTE((warn_unused_result))
3349 dberr_t
3350 ibuf_insert_low(
3351 ulint mode,
3352 ibuf_op_t op,
3353 ibool no_counter,
3354 const dtuple_t* entry,
3355 ulint entry_size,
3356 dict_index_t* index,
3357 const page_id_t& page_id,
3358 const page_size_t& page_size,
3359 que_thr_t* thr)
3360 {
...

The first line of the comment gives us an idea that InnoDB tries to insert data into the change buffer.

Now, let’s check the next line from the error log file:

S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/buf/buf0buf.cc
...
1446 /* If PFS_SKIP_BUFFER_MUTEX_RWLOCK is defined, skip registration
1447 of buffer block rwlock with performance schema.
1448
1449 If PFS_GROUP_BUFFER_SYNC is defined, skip the registration
1450 since buffer block rwlock will be registered later in
1451 pfs_register_buffer_block(). */
1452
1453 rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);
...

And again let’s check what this function is doing:

1402 /********************************************************************//**
1403 Initializes a buffer control block when the buf_pool is created. */
1404 static
1405 void
1406 buf_block_init(

Even without knowledge of how InnoDB works internally, by reading only these comments I can guess that a thread waits for some global InnoDB lock when it tries to insert data into the change buffer. The solution for this issue could be disabling the change buffer, limiting write concurrency, upgrading, or using a software solution that allows you to scale writes.
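A minimal sketch of the two server-side knobs mentioned above (both are dynamic global variables in MySQL/Percona Server 5.7; the concurrency value is only an example):

mysql -e "SET GLOBAL innodb_change_buffering = 'none';"   # disable the change buffer
mysql -e "SET GLOBAL innodb_thread_concurrency = 16;"     # cap concurrent threads inside InnoDB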

Q: For the page cleaner messages: when running the app using replication we didn’t get them. After switching to PXC we started getting them. Is there something particular to PXC we should look at to help resolve this?

A: Page cleaner messages could be a symptom of starved IO activity. You need to compare the Percona XtraDB Cluster (PXC) and standalone server installations and check how exactly the write load increased.

Q: Hi, I have one question: we have a query joining on BLOB or TEXT fields that is causing system locks and high CPU alerts. Can you please suggest how we can make it work? Can you please send the answer in text, as I missed some information?

A: If you are joining on BLOB or TEXT fields, you most likely don’t use indexes. This means that InnoDB has to perform a full table scan. It increases IO and CPU activity by itself, but it also increases the number of locks that InnoDB has to set to resolve the query. Even if you have partial indexes on the BLOB and TEXT columns, mysqld has to compare the full values for the equality comparison, so it cannot use the index alone to resolve the ON clause. It is a best practice to avoid such kinds of JOINs. You can use surrogate integer keys, for example.
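A sketch of the surrogate-key idea, assuming hypothetical tables t1 and t2 with a large val column and an id column: store an indexable hash of the value and join on it, keeping the full comparison only as a tie-breaker for hash collisions.

mysql mydb <<'SQL'
ALTER TABLE t1
  ADD COLUMN val_hash BINARY(16) AS (UNHEX(MD5(val))) STORED,
  ADD INDEX idx_val_hash (val_hash);
ALTER TABLE t2
  ADD COLUMN val_hash BINARY(16) AS (UNHEX(MD5(val))) STORED,
  ADD INDEX idx_val_hash (val_hash);
-- the join can now use the index; the large values are compared only on hash matches
SELECT t1.id, t2.id
  FROM t1 JOIN t2
    ON t1.val_hash = t2.val_hash AND t1.val = t2.val;
SQL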

Q: Hi, please notice that “MySQL server has gone away” is the worst one, in my opinion, and there was no mention of it. Can you share some tips on this? Thank you. By the way, both the Oracle MySQL and Percona error logs do not help with that.

A: The “MySQL server has gone away” error may be the result of a crash. In this case, you need to handle it like any other crash symptom. But in most cases, this is a symptom of network failure. Unfortunately, MySQL doesn’t have much information about why connection failures happen. Probably because, from mysqld’s point of view, a problematic network only means that the client unexpectedly disconnected after a timeout, and the client still waiting for a response receives “MySQL server has gone away”. I discussed these kinds of errors in my “Troubleshooting hardware resource usage” webinar. A good practice for situations when you see this kind of error often is not to leave idle connections open for a long time.
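One server-side way to enforce that practice is to lower the idle-connection timeouts so abandoned connections get closed instead of piling up; the values below are only examples:

mysql -e "SET GLOBAL wait_timeout = 600; SET GLOBAL interactive_timeout = 600;"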

Q: I see that a lot of hard investigation work goes into exploring the possibilities of what could be going wrong. Is there a plan on the development roadmap to improve the error log output messages? If you can comment on that…

A: Percona Engineering does a lot for better diagnostics. For example, Percona Server for MySQL has an extended slow log file format, and Percona Server for MySQL 5.7.20 introduced a new innodb_print_lock_wait_timeout_info variable that allows logging information about all InnoDB lock wait timeout errors (manual). More importantly, it logs not only the blocked transaction, but also the locking transaction. This feature was requested at lp:1657737 for one of our Percona Support customers and is now implemented.

The Oracle MySQL Engineering team also does a lot for better error logging. These improvements started in version 5.7.2, when the variable log_error_verbosity was introduced. Version 8.0.4 added much better tuning control. You can read about it in the Release Notes.
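For reference, a sketch of turning both settings on at runtime (I assume both are dynamic here; innodb_print_lock_wait_timeout_info is Percona Server-specific, while log_error_verbosity is available in upstream MySQL 5.7.2 and later):

mysql -e "SET GLOBAL innodb_print_lock_wait_timeout_info = ON;"
mysql -e "SET GLOBAL log_error_verbosity = 3;"   # 3 = errors, warnings and notes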

Q: Hello, do you use strace to find exactly which table has problems when there is no clear information in the MySQL error log?

A: I am not a big fan of strace when debugging mysqld crashes, but Percona Support certainly uses this tool. I myself prefer to work with strace when debugging client issues, such as trying to identify why Percona XtraBackup behaves incorrectly.

Thanks everybody for attending the webinar. You can find the slides and recording of the webinar at the Troubleshooting MySQL Crashes web page.

Feb 09, 2018

This Week in Data with Colin Charles 27: Percona Live Tutorials Released and a Comprehensive Review of the FOSDEM MySQL DevRoom

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Percona Live Santa Clara 2018 update: tutorials have been announced. The committee rated over 300 talks, and easily 70% of the schedule should go live next week as well. In practice, then, you should see about 50 talks announced next week. There’s been great competition: we only have 70 slots in total, so about 1 in 5 talks gets picked — talk about a competitive ratio.

FOSDEM

FOSDEM was truly awesome last week. From a Percona standpoint, we had a lot of excellent booth traffic (being outside of the PostgreSQL room on Saturday, and not too far out from the MySQL room on Sunday). We gave away bottle openers — useful in Brussels with all the beer; we tried a new design with a magnet to attach it to your fridge — stickers, some brochures, but most of all we had plenty of great conversations. There was quite a crowd from Percona, and it was excellent to see the MySQL & Friends DevRoom almost constantly full! A few of us brave souls managed to stay there the whole day, barely with any breaks, so as to enjoy all the talks.

I find the quality of talks to be extremely high. And when it comes to a community run event, with all content picked by an independent program committee, FOSDEM really sets the bar high. There is plenty of competition to get a good talk in, and I enjoyed everything we picked (yes, I was on the committee too). We’ve had plenty of events in the ecosystem that sort of had “MySQL” or related days, but FOSDEM might be the only one that has really survived. I understand we will have a day of some sort at SCALE16x, but even that has been scaled down. So if you care about the MySQL ecosystem, you will really want to ensure that you are at FOSDEM next year.

This year, we started with the usual MySQL Day on Friday. I could not be present, as I was at the CentOS Dojo, giving a presentation. So, the highlight of Friday for me? The community dinner. Over 80 people showed up, I know there was a waiting list, and lots of people were trying to get tickets at the last minute. Many missed out too; sorry, better luck next year; and also, hopefully, we will get a larger venue going forward. I really thank the organizers for this — we affectionately refer to them as the Belconians (i.e. a handful of Perconians based in Belgium). The conversation, the food, the drink — they were all excellent. It’s good to see representation from all parts of the community: MySQL, Percona, MariaDB, Pythian, and others. So thank you again, Liz, Dimitri, Tom, and Kenny in absentia. I think Tjerk also deserves special mention for always helping (this year with the drinks).

As for FOSDEM itself, beyond the booth, I think the most interesting stuff was the talks. There are video recordings and slides of pretty much all talks, but I will also give you the “Cliff’s Notes” of them here.

MySQL DevRoom talk quick summaries

Beyond WHERE and GROUP BY – Sergei Golubchik

  • EXCEPT is in MariaDB Server 10.3
  • recursive CTEs are good for hierarchical data, graphs, data generation, Turing complete (you can use it to solve Sudoku even)
  • non-recursive CTEs can be an alternative syntax for subqueries in the FROM clause
  • Window functions:
    • Normal: one result per row, depend on that row only
    • Aggregate: one result per group, depending on the whole group
    • Window: one result per row, depending on the whole group
  • System versioned tables with AS OF
  • Aggregate stored functions

MySQL 8.0 Performance: InnoDB Re-Design – Dimitri Kravtchuk

  • Contention-Aware Transactions Scheduling (CATS), since 8.0.3. Not all transactions are equal, FIFO could not be optimal, unblock the most blocking transactions first
  • CATS (VATS) had a few issues, and there were bugs (they thought everything worked since MariaDB Server had implemented it). They spent about 9 months before fixing everything.
  • Where does CATS help? Workloads hitting row lock contentions. You can monitor via SHOW ENGINE INNODB MUTEX.
  • the main problem is because of repeatable read versus read committed transaction isolation on the same workload. You really need to understand your workload when it comes to VATS.

MySQL 8.0 Roles – Giuseppe Maxia

  • Created like a user, granted like privileges. You need to activate them to use them.
  • Before roles, you created a user, then grant, grant, and more grant’s… Add another user? Same deal. Lots of repetitive work and a lot of chances to make mistakes.
  • Faster user administration – define a role, assign it many times. Centralized grant handling – grant and revoke privileges to roles, add/edit all user profiles.
  • You need to remember to set the default role.
  • A user can have many roles; default role can be a list of roles.
  • Roles are users without a login – roles are saved in user tables. This is useful from an account lock/unlock perspective.
  • You can grant a user to a user
  • SET ROLE is for session management; SET DEFAULT ROLE is a permanent assignment of a role for a user. SET ROLE DEFAULT means assign the default role for this user for this session
  • The role_edges table reports which roles are assigned to which users. default_roles keeps track of the current default roles assigned to users. A default role may not exist.

Histogram support in MySQL 8.0 – Øystein Grøvlen

  • You can now do ANALYZE TABLE table UPDATE HISTOGRAM on column WITH n BUCKETS;
  • New storage engine API for sampling (default implementation is full table scan even when sampling)
  • Histogram is stored in a JSON column in the data dictionary. Grab this from the INFORMATION_SCHEMA.
  • Histograms are useful for columns that are not the first column of any index, and used in WHERE conditions of JOIN queries, queries with IN-subqueries, ORDER BY … LIMIT queries. Best fit: low cardinality columns (e.g. gender, orderStatus, dayOfWeek, enums), columns with uneven distribution (skew), stable distribution (do not change much over time)
  • How many buckets? equi-height, 100 buckets should be enough.
  • Histograms are stored in the data dictionary, so will persist over restarts of course.

Let’s talk database optimizers – Vicențiu Ciorbaru

TLS for MySQL at Large Scale – Jaime Crespo

  • Literally took 3 lines in the my.cnf to turn on TLS
  • https://dbtree.wikimedia.org
  • They wanted to do a data centre failover and wanted to ensure replication would be encrypted.
  • They didn’t have proper orchestration in place (MySQL could have this too). Every time OpenSSL or MySQL had to be upgraded, the daemon needed restarting. If there was an incompatible change, you had to sync master/replicas too.
  • The automation and orchestration that Wikipedia uses: https://fosdem.org/2018/schedule/event/cumin_automation/ (it is called Cumin: https://wikitech.wikimedia.org/wiki/Cumin)
  • Server support was poor – OpenSSL – so they had to deploy wmf-mysql and wmf-mariadb of their own
  • Currently using MariaDB 10.0, and looking to migrate to MariaDB 10.1
  • Client library pain they’ve had
  • TLSv1.2 from the beginning (2015).
  • 20-50x slower for actual connecting; the impact is less than 5% for the actual query performance. Just fix client libraries, make them use persistent connections. They are now very interested in ProxySQL for this purpose.
  • https://grafana.wikimedia.org/?orgId=1
  • Monty asks, would a double certificate help? Jaime says sure. But he may not actually use double certificates; might not solve CA issues, and the goal is not to restart the server.
  • Monty wonders why not to upgrade to 10.2? “Let’s talk outside because it’s a much larger question.”

MySQL InnoDB Cluster – Miguel Araújo

  • group replication: update everywhere (multi-master), virtually synchronous replication, automatic server failover, distributed recovery, group reconfiguration, GCS (implementation of Paxos – group communication system). HA is a critical factor.
  • mysqlsh: interactive and batch operations. Document store (CRUD and relational access)
  • admin API in mysqlsh: create & manage clusters, hide complexity of configuration/provisioning/orchestration of the InnoDB clusters. Works with JavaScript and Python
  • Usability. HA out of the box.
  • It’s easy to join a new node; new node goes into recovery mode (and as long as you have all the binary logs, this is easy; otherwise start from a backup)
  • SET PERSIST – run a command remotely, and the configuration is persisted in the server
  • Network flapping? Group replication will just reject the node from the cluster if it’s flapping too often.

Why we’re excited about MySQL 8 – Peter Zaitsev

  • Native data dictionary – atomic, crash safe, DDLs, no more MyISAM system table requirements
  • Fast INFORMATION_SCHEMA
  • utf8mb4 as default character set
  • Security: roles, breakdown of SUPER privileges, password history, faster cached-SHA2 authentication (default), builds using OpenSSL (like Percona Server), skip grants blocks remote connections, logs now encrypted when tablespace encryption enabled
  • Persistent AUTO_INCREMENT
  • auto-managed undo tablespaces – do not use system table space for undo space. Automatically reclaim space on disks.
  • Self-tuning, limited to InnoDB (innodb_dedicated_server to auto-tune)
  • partial in-place update for JSON – update a field in a JSON object without a full rewrite. Good for counters/statuses/timestamps. Update/removal of an element is supported
  • Invisible indexes – test impact of dropping indexes before actually dropping them. Maintained but unused by the optimizer. If not needed or used, then drop away.
  • TmpTable Storage Engine – more efficient storage engine for internal temporary tables. Efficient storage for VARCHAR and VARBINARY columns. Good for GROUP BY queries. Doesn’t support BLOB/TEXT columns yet (this reverts to InnoDB temp table now)
  • Backup locks – prevent operations which may result in inconsistent backups. CHECK INSTANCE FOR BACKUP (something Percona Server has had before)
  • Optimizer histograms – detailed statistics on columns, not just indexes
  • improved cost model for the optimizer – www.unofficialmysqlguide.com
  • Performance Schema – faster (via “fake” indexes), error instrumentation, response time histograms (global & per query), digest summaries
  • select * from sys.session – fast potential replacement for show processlist
  • RESTART (command)
  • SET PERSIST – e.g. change the buffer pool size, and this helps during a restart
  • assumes default storage is SSD now
  • binary log on by default, log_slave_updates enabled by default, and log expires after 30 days by default
  • query cache removed. Look at ProxySQL or some other caching solution
  • native partitioning only – remove partitions from MyISAM or convert to InnoDB
  • resource groups – isolation and better performance (map queries to specific CPU cores; can jail your costly queries, like analytical queries)
  • Feature Requests: better single thread performance, no parallel query support

MySQL Test Framework for Support and Bugs Work – Sveta Smirnova

  • MTR allows you to add multiple connections
  • has commands for flow control

ProxySQL – GTID Consistent Reads – René Cannaò, Nick Vyzas

  • threshold is configurable in increments of 1 second. Replication lag can be monitored with ProxySQL. Want to ensure you don’t have stale reads.
  • Why is GTID important? To guarantee consistency. Auto positioning for restructuring topologies.
  • --session-track-gtids is an important feature which allows sending the GTID for a transaction on the OK packet. Not available in MariaDB.
  • There is a ProxySQL Binlog Reader now – GTID information about a MySQL server to all connected ProxySQL instances. Lightweight process to run on your MySQL server.
  • ProxySQL can be configured to enforce GTID consistency for reads on any hostgroup/replication hostgroup.
  • Live demo by René

Turbocharging MySQL with Vitess – Sugu Sougoumarane

  • trend for the cloud: container instances, short-lived containers, tolerate neighbors, discoverability. No good tools yet for Kubernetes.
  • non-ideal options: application-level sharding, NoSQL, paid solutions, NewSQL (CockroachDB, TiDB, Yugabyte)
  • Vitess: leverage MySQL at massive scale, opensource, 8+ years of work, and multiple production examples
  • Square uses Vitess for Square Cash application.
  • Can MySQL run on Docker? Absolutely, many of the companies do huge QPS on Docker.
  • YouTube does a major re-shard every 2-3 months. No one notices nowadays when that happens.
  • app server connects to vtgate, and only underneath it’s a bunch of smaller databases with vttablet + mysqld. The lockserver is what makes it run well in the cloud.
  • pluggable architecture with no compromise on performance: monitoring, health check, ACLs, tracing, more.
  • at most, it adds about 2ms overhead to connections
  • Go coding standards are enforced, unit tests with strict coverage requirements, end-to-end tests, Travis, CodeClimate and Netlify. Readability is king.
  • On February 5 2018, it will be a CNCF project. One year of due diligence. They said there was nothing to compare it with. Looked at maturity and contributors. It’s becoming a truly community-owned project! (CNCF to Host Vitess is already live as of now)
  • roadmap: full cross-shard queries, migration tools, simplify configurability, documentation.
  • full MySQL protocol, but a limited query set – they want to get it to a point where it accepts a full MySQL query.

Orchestrator on Raft – Shlomi Noach

  • Raft: guaranteed to be in-order replication log, an increasing index. This is how nodes choose a leader based on who has the higher index. Get periodic snapshots (node runs a full backup).
  • HashiCorp raft, a Golang raft implementation, used by Consul
  • orchestrator manages topology for HA topologies; also want orchestrator to be highly available. Now with orchestrator/raft, remove the MySQL backend dependency, and you can have data center fencing too. Now you get: better cross-DC deploys, DC-local KV control, and also Kubernetes friendly.
  • n-orchestrator nodes, each node still runs its own backend (either MySQL or SQLite). Orchestrator provides the communication for SQLite between the nodes. Only one (the Raft leader) will handle failovers
  • implementation & deployment @ Github – one node per DC (deployed at 3 different DCs). 1-second raft polling interval. 2 major DCs, one in the cloud. Step-down, raft-yield, SQLite-backed log store, and still a MySQL backend (SQLite backend use case is in the works)
  • They patched the HashiCorp raft library. The library doesn’t care about the identity of nodes, with Github they do want to control the identity of the leader. There is an “active” data center, and locality is important. This is what they mean by raft-yield (picking a candidate leader).
  • The ability for a leader to step down is also something they had to patch.
  • HashiCorp Raft only supports LMDB and another database, so the replication log is now kept in a relational SQLite backed log store. Another patch.
  • once orchestrator can’t run its own self-health check, it recognizes this. The application can tell raft now that it’s stepping down. Takes 5 seconds to step down, and raft then promotes another orchestrator node to be the leader. This is their patch.
  • can also grab leadership
  • DC fencing handles network partitioning.
  • orchestrator is Consul-aware. Upon failover, orchestrator updates Consul KV with the identity of the promoted master.
  • considerations to watch out for: what happens if, upon replay of the Raft log, you hit two failovers for the same cluster? NOW() and otherwise time-based assumptions. Reapplying snapshot/log upon startup
  • roadmap: use Kubernetes (cluster IP based configuration in progress, already container friendly via auto-re-provisioning of nodes via Raft)

MyRocks Roadmaps – Yoshinori Matsunobu

  • Facebook has a large User Database (UDB). Social graph, massively sharded, low latency, automated operations, pure flash storage (constrained by space, not CPU/IOPS)
  • They have a record cache in-front of MySQL – Tao for reads. If cache misses, then it hits the database. And all write requests go thru MySQL. UDB has to be fast to ensure a good user experience.
  • they also at Facebook run 2 instances of MySQL on the same machine, because CPU wasn’t huge, but the space savings were awesome.
  • design decisions: clustered index (same as InnoDB), slower for reads, faster for writes (bloom filters, column family), support for transactions including consistency between binlog and MyRocks. Faster data loading/deletes/replication, dynamic options (instead of having to restart mysqld), TTL (comparable to HBase TTL feature, specify the TTL, any data older than time, can be removed), online logical (for recovery purposes) & binary backup (for creating replicas)
  • Pros: smaller space, better cache hit rate, writes are faster so you get faster replication, much smaller bytes written
  • Cons: no statement based replication, GAP locks, foreign keys, full-text index, or spatial index support. Need to use case sensitive collations for performance. Reads are slower, especially if the data fits in memory. Dependent on file system and OS; lack of solid direct I/O (uses buffered I/O). You need a kernel newer than 4.6. Too many tuning options beyond the buffer pool, such as bloom filters, compactions, etc.
  • https://twitter.com/deniszh/status/960163082642382849
  • Completed InnoDB to MyRocks migration. Saved 50% space in UDB compared to compressed InnoDB.
  • Roadmaps: getting in MariaDB and Percona Server for MySQL. Read Mark’s blog for matching read performance vs InnoDB. Supporting mixed engines. Better replication and bigger instance sizes.
  • mixed engines: InnoDB and MyRocks on the same instance, though a single transaction does not overlap engines. Plan to extend xtrabackup to integrate myrocks_hotbackup. Backport gtid_pos_auto_engines from MariaDB?
  • Removing engine log. Could be caused by binlog and engine log, which requires 2pc and ordered commits. Use one log? Either binlog or binlog like service or RocksDB WAL? Rely on binlog now (semi-sync, binlog consumers), need to determine how much performance is gained by stopping writing to WAL.
  • Parallel replication apply is important in MySQL 8
  • support bigger instance sizes: a shared-nothing database is not a general purpose database. Today you can get 256GB+ RAM and 10TB+ flash on commodity servers. Why not run one big instance and put everything there? Bigger instances may help general purpose small-mid applications. Then you don’t have to worry about sharding. Atomic transactions, joins and secondary keys will just work. Amazon Aurora today supports a 60TB instance!
  • today: you can start deploying slaves with consistency check. Many status counters for instance monitoring.

ProxySQL internals – René Cannaò

  • reduce latency, scale, maximize throughput. A single instance can handle hundreds of thousands of connections and thousands of backend servers.
  • threading models: one thread per connection (blocking I/O), thread pooling (non-blocking I/O, scalable).
  • ProxySQL thread pool implementation: known as “MySQL threads”, fixed number of worker threads (configurable), all threads listen on the same port(s), client connections are not shared between threads, all threads perform their own network I/O, and it uses poll() (does that scale? True, but there is a reason why poll over epoll)
  • threads never share client connections – no need for synchronization, thread contention is reduced, each thread calls poll(). Possibly imbalanced load as a con (one thread may have way more connections than another). Is it really a problem? Most of the time, no; connections will automatically balance.
  • poll() is O(N), epoll() is O(1). poll() is faster than epoll() for fewer connections (around 1000). Performance degrades when there are a lot of connections, so by default it uses poll() instead of epoll(); around 50,000 connections performance degrades badly – so ProxySQL has auxiliary threads.
  • MySQL_Session() is implemented as a state machine. Stores metadata associated with the client session (running timers, default hostgroup, etc.)

MySQL Point-in-time recovery like a rockstar – Frederic Descamps

Releases

  • Percona Monitoring and Management 1.7.0 (PMM) – This release features improved support for external services, which enables a PMM Server to store and display metrics for any available Prometheus exporter. For example, you could deploy the postgres_exporter and use PMM’s external services feature to store PostgreSQL metrics in PMM. Immediately, you’ll see these new metrics in the Advanced Data Exploration dashboard. Then you could leverage many of the pre-developed PostgreSQL dashboards available on Grafana.com, and with a minimal amount of edits have a working PostgreSQL dashboard in PMM!
  • MariaDB Server 10.1.31 – usual updates to storage engines, and a handful of bug fixes.

Link List

Upcoming appearances

  • SCALE16x – Pasadena, California, USA – March 8-11 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Feb 08, 2018

Fsync Performance on Storage Devices

While preparing a post on the design of ZFS based servers for use with MySQL, I stumbled on the topic of fsync call performance. The fsync call is very expensive, but it is essential to databases as it allows for durability (the “D” of the ACID acronym).

Let’s first review the type of disk IO operations executed by InnoDB in MySQL. I’ll assume the default InnoDB variable values.

The first and most obvious type of IO is page reads and writes from the tablespaces. The pages are most often read one at a time, as 16KB random read operations. Writes to the tablespaces are also typically 16KB random operations, but they are done in batches. After every batch, fsync is called on the tablespace file handle.

To avoid partially written pages in the tablespaces (a source of data corruption), InnoDB performs a doublewrite. During a doublewrite operation, a batch of dirty pages, from 1 to about 100 pages, is first written sequentially to the doublewrite buffer and fsynced. The doublewrite buffer is a fixed area of the ibdata1 file, or a specific file with the latest Percona Server for MySQL 5.7. Only then do the writes to the tablespaces of the previous paragraph occur.

That leaves us with the writes to the InnoDB log files. During those writes, the transaction information — a kind of binary diff of the affected pages — is written to the log files and then the log file is fsynced. The duration of the fsync call can be a major contributor to the COMMIT latency.

Because the fsync call takes time, it greatly affects the performance of MySQL; you have probably noticed that there are many status variables related to fsyncs. To overcome the inherent limitations of the storage devices, group commit allows multiple simultaneous transactions to fsync the log file once for all the transactions waiting for the fsync. There is no need for a transaction to call fsync for a write operation that another transaction already forced to disk. A series of write transactions sent over a single database connection, however, cannot benefit from group commit.
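For example, a quick way to watch how often InnoDB actually calls fsync is to look at the fsync-related status counters (cumulative since server start):

mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_%fsyncs';"
# returns Innodb_data_fsyncs and Innodb_os_log_fsyncs, plus their "pending" counterparts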

Fsync Results

In order to evaluate the fsync performance, I used the following Python script:

#!/usr/bin/python
import os, sys, mmap
# Open a file
fd = os.open( "testfile", os.O_RDWR|os.O_CREAT|os.O_DIRECT )
m = mmap.mmap(-1, 512)
for i in range(1, 1000):
   os.lseek(fd, 0, os.SEEK_SET)  # rewind to the start of the file
   m[1] = "1"                    # modify the 512-byte buffer
   os.write(fd, m)               # write it to the file
   os.fsync(fd)                  # force the write (and metadata) to disk
# Close opened file
os.close( fd )

The script opens a file with the O_DIRECT flag, writes and fsyncs it 1000 times, and closes the file. I added O_DIRECT after an internal discussion with my colleagues, but it doesn’t change the results and it doesn’t remove the need for calling fsync. We’ll discuss the impact of O_DIRECT in more detail after we have reviewed the results. The script is called with the time command like below:

root@lab:/tmp/testfsync# time python /root/fsync.py
real 0m18.320s
user 0m0.060s
sys 0m0.096s

In the above example using a 7.2k rpm drive, the fsync rate is about 56/s for a latency of 18ms. A 7.2k RPM drive performs 120 rotations per second. On average, the fsyncs require a bit more than two rotations to complete. The filesystem appears to make very little difference: ext4 and XFS show similar results. That means if MySQL uses such storage devices for the InnoDB log files, the latency of each transaction is at least 18ms. If the application workload requires 200 write transactions per second, they’ll need to be executed using at least four database connections.

So, let’s begin with rotational devices. These are becoming a bit less common now with databases, especially without a raid controller. I could only find a few.

Drive RPM Rate Latency Notes
WDC WD2500BJKT 5400 22/s 45 ms Laptop SATA from 2009
ST2000LM003 5400 15/s 66 ms USB-3 portable drive
ST3750528AS 7200 40/s 25 ms Desktop grade SATA
WD2502ABYS-18B7A0 7200 56/s 18 ms Desktop grade SATA
HUA723020ALA641 7200 50/s 20 ms Enterprise grade SATA, md mirror
Dell SAS unknown 7200 58/s 17 ms Behind Perc ctrl but no write cache
HDWE150 7200 43/s 23 ms Recent Desktop grade SATA, 5TB

 

I unfortunately didn’t have access to any 10k or 15k RPM drives that were not behind a raid controller with a write cache. If you have access to such drives, run the above script a few times and send me your results; that would help create a more complete picture! So, we can see a correlation between the rotational speed and the fsync rate, which makes sense. The faster a disk turns, the faster it can fsync. The fsync call saves the data and then updates the metadata. Hence, the heads need to move. That’s probably the main explanation for the remaining disparity. A good point: all drives appear to be fully compliant with the SATA flush command even though they all have an enabled write cache. Disabling the drives’ write caches made no difference.

With the above numbers, the possible transaction rates in fully ACID mode are pretty depressing. But those were rotational drives; what about SSDs? SSDs are memory devices and are much faster for random IO operations. They are extremely fast for reads, and good for writes. But as you will see below, not that great for fsyncs.

Drive rate latency notes
SAMSUNG MZ7LN512 160/s 6.3ms Consumer grade SATA
Crucial_CT480M500SSD1 108/s 9.3ms Consumer grade SATA
Intel 520 2031/s 0.49ms Consumer grade SATA
SAMSUNG MZVPV512HDGL 104/s 9.6ms Consumer grade NVMe
Samsung SSD 960 PRO 267/s 3.8ms High-end consumer grade NVMe
Intel PC-3100 1274/s 0.79ms Low-end consumer grade NVMe (cheat?)
Intel 750 2038/s 0.49ms High-end consumer grade NVMe
Intel PC-3700 7380/s 0.14ms High-end enterprise-grade NVMe

 

Again, this is a small sample of the devices I have access to. All SSD/flash devices have write caches, but only the high-end devices have capacitors to flush their write cache to the flash on a loss of power. The PC-3100 device is actually in my home server, and it is obviously cheating. If you look at the card specs on the Intel website, it doesn’t have the “Enhanced Power Loss Data Protection” and “End-to-End Data Protection” features. The much more expensive PC-3700 does. I use the PC-3100 as a ZFS L2ARC device, so I am good. In general, the performance of a flash device varies a bit more than that of rotational devices, since factors like the number of recent writes and the filling factor come into play.

Even when using a high-end NVMe device like the PC-3700, you can’t reach 10k fully ACID transactions per second at low thread concurrency. How do you reach the higher levels? The answer here is the good old raid controller with a protected write cache. The write cache is basically using DRAM memory protected from power loss by a battery. SAN controllers have similar caches. The writes to the InnoDB log files are sequential writes interleaved with fsyncs. The raid controller concatenates the sequential writes, eventually writing one big chunk on disk and… ignoring the fsyncs. Here’s the result from the only device I had access to:

Drive rate latency notes
Dell Perc with BBU 23000/s 0.04ms Array of 7.2k rpm drives

 

That’s extremely fast but, of course, it is memory. I modified the script to loop 10k times instead of 1k. In theory, even a single slave thread doing simple transactions could reach a rate of 20k/s or more while being fully ACID.

Discussion

We must always consider the results we got in the previous section in the context of a given application. For example, a server using an Intel PC-3700 NVMe card can do more than 7000 fully ACID transactions per second even if it is fully durable provided those transactions are issued by a sufficient number of threads. Adding threads will not allow scaling infinitely. At some point, other bottlenecks like mutex contention or page flushing will dominate.

We often say that Galera-based cluster solutions like Percona XtraDB Cluster (PXC) add latency to the transactions, since they involve communication over the network. With the Galera protocol, a commit operation returns only when all the nodes have received the data. Thus, it is a good practice to relax the local durability and use innodb_flush_log_at_trx_commit set to 0 or 2. On a local network, the ping time is always below 1ms and often below 0.1ms. As a result, the transaction latency is often smaller.

About fdatasync

The fsync system call is not the only one that persists data to disk. There is also the fdatasync call. fdatasync persists the data to disk but does not update the metadata information like the file size and last update time. Said otherwise, it performs one write operation instead of two. In the Python script, if I replace os.fsync with os.fdatasync, here are the results for a subset of devices:

Drive rpm rate latency notes
ST2000LM003 5400 72/s 13 ms USB-3 portable drive
WD2502ABYS-18B7A0 7200 118/s 8.5 ms Desktop grade SATA
SAMSUNG MZ7LN512 N/A 333/s 3.0ms Consumer grade SATA
Crucial_CT480M500SSD1 N/A 213/s 4.7ms Consumer grade SATA
Samsung SSD 960 PRO N/A 714/s 1.4ms High-end consumer grade NVMe

 

In all cases, the resulting rates have more than doubled. The fdatasync call has a troubled history, as there were issues with it many years ago. Because of those issues, InnoDB never uses fdatasync, only fsync. You can find the following comment in the InnoDB os/os0file.cc:

/* We let O_SYNC only affect log files; note that we map O_DSYNC to
O_SYNC because the datasync options seemed to corrupt files in 2001
in both Linux and Solaris */

2001 is a long time ago. Given the above results, maybe we should reconsider the use of fdatasync. From the Linux man page on fdatasync, you find:

fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled. For example,
changes to st_atime or st_mtime (respectively, time of last
access and time of last modification; see stat(2)) do not require
flushing because they are not necessary for a subsequent data
read to be handled correctly. On the other hand, a change to
the file size (st_size, as made by say ftruncate(2)), would
require a metadata flush.

So, even with fdatasync, operations like extending an InnoDB tablespace will update the metadata correctly. This appears to be interesting low-hanging fruit in terms of MySQL performance. In fact, WebScaleSQL already has fdatasync available.

O_DIRECT

Why do we need an fsync or fdatasync with O_DIRECT? With O_DIRECT, the OS is not buffering anything along the way, so the data should be persisted, right? Actually, the OS is not buffering, but the device very likely is. Here are a few results to highlight the point, using a 7.2k rpm SATA drive:

Test                                     | rate   | latency
O_DIRECT, drive write cache enabled      | 4651/s | 0.22 ms
O_DIRECT, drive write cache disabled     | 101/s  | 9.9 ms
ASYNC + fdatasync, write cache enabled   | 119/s  | 8.4 ms
ASYNC + fdatasync, write cache disabled  | 117/s  | 8.5 ms

The drive write cache was enabled/disabled using the hdparm command. Clearly, there’s no way the drive can persist 4651 writes per second. O_DIRECT doesn’t send the SATA flush command to the disk, so we are only writing to the drive write cache. If the drive write cache is disabled, the rate falls to a more reasonable value of 101/s. What is interesting, and I don’t really understand why, is that opening the file in async mode and performing fdatasync is significantly faster. As expected, the presence of the drive write cache has no impact on ASYNC + fdatasync: when the fdatasync call occurs, the data is still in the OS file cache.

If you want to use only O_DIRECT, you should make sure all the storage write caches are crash safe. That’s why MySQL adds an fsync call after a write to a file opened with O_DIRECT.
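If you want to reproduce this behavior, here is a minimal Python sketch of an O_DIRECT write loop. It assumes Linux and Python 3, and the file path is illustrative. O_DIRECT requires the buffer, offset and length to be aligned, which is why an anonymous mmap (always page-aligned) is used as the write buffer.

import mmap, os, time

FILE_NAME = "/path/on/device/under/test/odirect_test"   # illustrative path
BLOCK = 4096                                             # page-sized writes satisfy O_DIRECT alignment rules

fd = os.open(FILE_NAME, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
buf = mmap.mmap(-1, BLOCK)      # anonymous mmap is page-aligned, as O_DIRECT requires
buf.write(b"a" * BLOCK)

start = time.time()
for _ in range(1000):
    os.write(fd, buf)           # bypasses the OS page cache, but the data may stop in the drive write cache
elapsed = time.time() - start

# Without this final flush, the rate above may only reflect the drive write cache.
# This is exactly why MySQL issues an fsync after writes to files opened with O_DIRECT.
os.fsync(fd)
os.close(fd)

print("rate: %.0f/s" % (1000 / elapsed))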

ZFS

These days, I find it difficult to write a blog post without talking about ZFS. How does ZFS handle fsync and fdatasync? ZFS, like a database, performs write-ahead logging in the ZIL (ZFS Intent Log). That means calls like fsync and fdatasync return when the data has been persisted to the ZIL, not to the actual filesystem. The real write operation is done a few seconds later by a background thread, so the added write for the metadata does not impact performance right away. My home server uses ZFS over a pair of 7.2k RPM drives and doesn’t have a SLOG device. The ZIL is thus stored on the 7.2k RPM drives. The results are the following:

Test          | rpm  | rate  | latency
ZFS fsync     | 7200 | 104/s | 9.6 ms
ZFS fdatasync | 7200 | 107/s | 9.3 ms

Remember that with ZFS, you need to disable the O_DIRECT mode. The fdatasync rate appears to be slightly faster, but the difference is not really significant. With ZFS, the fsync/fdatasync performance relates to where the ZIL is stored. If there is no SLOG device, the ZIL is stored with the data, and thus the persistence performance of the devices used for the data matters. If there is a SLOG device, the persistence performance is governed by the specs of the device(s) on which the SLOG is located. That’s a very important aspect to consider when designing a MySQL server that will use ZFS. The design of such a server will be discussed in more detail in a future post.

Feb
05
2018
--

Four Ways MySQL Executes GROUP BY

MySQL GROUP BY

In this blog post, I’ll look into four ways MySQL executes GROUP BY.

In my previous blog post, we learned that indexes or other means of finding data might not be the most expensive part of query execution. For example, MySQL GROUP BY could potentially be responsible for 90% or more of the query execution time. 

The main complexity when MySQL executes GROUP BY is computing aggregate functions in a GROUP BY statement. How this works is shown in the documentation for UDF Aggregate Functions. As we can see, the requirement is that a UDF gets all the values that constitute a single group one after another, so that it can compute the aggregate function value for that group before moving on to the next one.

The problem, of course, is that in most cases the source data values aren’t grouped. Values coming from a variety of groups follow one another during processing. As such, we need a special step to handle MySQL GROUP BY.
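To make this requirement concrete, here is a small Python sketch (an illustration, not MySQL code) of how an aggregate such as COUNT(*) can be computed on the fly when the rows already arrive ordered by the grouping column; this is essentially what the index-ordered plan below relies on.

# rows are (k, value) pairs already sorted by k, e.g. delivered by an index scan on k
rows = [(2, 'a'), (2, 'b'), (2, 'c'), (4, 'x'), (5, 'p'), (5, 'q')]

current_k, count = None, 0
for k, _ in rows:
    if k != current_k:
        if current_k is not None:
            print(current_k, count)   # the group is complete, emit its aggregate
        current_k, count = k, 0
    count += 1
if current_k is not None:
    print(current_k, count)           # emit the last group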

Let’s look at the same table we looked at before:

mysql> show create table tbl\G
*************************** 1. row ***************************
      Table: tbl
Create Table: CREATE TABLE `tbl` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `k` int(11) NOT NULL DEFAULT '0',
 `g` int(10) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `k` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2340933 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

And the same GROUP BY statements executed in different ways:

1: Index Ordered GROUP BY in MySQL

mysql> select k, count(*) c from tbl group by k order by k limit 5;
+---+---+
| k | c |
+---+---+
| 2 | 3 |
| 4 | 1 |
| 5 | 2 |
| 8 | 1 |
| 9 | 1 |
+---+---+
5 rows in set (0.00 sec)
mysql> explain select k, count(*) c from tbl group by k order by k limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 5
    filtered: 100.00
       Extra: Using index
1 row in set, 1 warning (0.00 sec)

In this case, we have an index on the column we use for GROUP BY. This way, we can just scan data group by group and perform GROUP BY on the fly (inexpensively).

It works especially well when we use LIMIT to restrict the number of groups we retrieve or when a “covering index” is in use, as a sequential index-only scan is a very fast operation.

If you have a small number of groups, though, and no covering index, an index-ordered scan can cause a lot of IO, so this might not be the best plan.

2: External Sort GROUP BY in MySQL

mysql> explain select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using filesort
1 row in set, 1 warning (0.00 sec)
mysql> select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5;
+---+---+
| g | c |
+---+---+
| 0 | 1 |
| 1 | 2 |
| 4 | 1 |
| 5 | 1 |
| 6 | 2 |
+---+---+
5 rows in set (0.88 sec)

If we do not have an index that allows us to scan the data in group order, we can instead get data sorted through an external sort (also referred to as “filesort” in MySQL).

You may notice I’m using an SQL_BIG_RESULT hint here to get this plan. Without it, MySQL won’t choose this plan in this case.

In general, MySQL prefers to use this plan only if we have a large number of groups, because in this case sorting is more efficient than having a temporary table (which we will talk about next).

3: Temporary Table GROUP BY in MySQL

mysql> explain select g, sum(g) s from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary
1 row in set, 1 warning (0.00 sec)
mysql> select  g, sum(g) s from tbl group by g order by null limit 5;
+---+------+
| g | s    |
+---+------+
| 0 |    0 |
| 1 |    2 |
| 4 |    4 |
| 5 |    5 |
| 6 |   12 |
+---+------+
5 rows in set (7.75 sec)

In this case, MySQL also does a full table scan. But instead of running additional sort passes, it creates a temporary table. This temporary table contains one row per group, and with each incoming row the value for the corresponding group is updated. Lots of updates! While this might be reasonable for an in-memory temporary table, it becomes very expensive if the resulting table is so large that updates cause a lot of disk IO. In this case, external sort plans are usually better.
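Conceptually, this plan behaves like a lookup structure keyed by the group value and updated for every incoming row, as in this small Python sketch (an illustration of the idea, not MySQL internals):

# rows arrive in arbitrary group order; one entry per group is kept and updated in place
rows = [(0, 0), (1, 1), (4, 4), (1, 1), (6, 6), (6, 6)]

groups = {}                            # plays the role of the temporary table: group value -> running SUM
for g, v in rows:
    groups[g] = groups.get(g, 0) + v   # every incoming row updates its group's row

for g in sorted(groups):               # MySQL 5.7 also sorts the result unless ORDER BY NULL is given
    print(g, groups[g])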

Note that while MySQL selects this plan by default for this use case, it is almost 10x slower than the plan we get using the SQL_BIG_RESULT hint (7.75s vs. 0.88s above).

You may notice I added “ORDER BY NULL” to this query. This is to show you the “clean” temporary-table-only plan. Without it, we get this plan:

mysql> explain select g, sum(g) s from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

In it, we get the “worst of both worlds”: Using temporary and Using filesort.

MySQL 5.7 always returns GROUP BY results sorted in group order, even if the query doesn’t require it (which can then require an expensive additional sort pass). ORDER BY NULL signals that the application doesn’t need this.

You should note that in some cases – such as JOIN queries with aggregate functions accessing columns from different tables – using temporary tables for GROUP BY might be the only option.

If you want to force MySQL to use a plan that does temporary tables for GROUP BY, you can use the SQL_SMALL_RESULT  hint.

4:  Index Skip-Scan-Based GROUP BY in MySQL

The previous three GROUP BY execution methods apply to all aggregate functions. Some of them, however, have a fourth method.

mysql> explain select k,max(id) from tbl group by k\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 2
    filtered: 100.00
       Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)
mysql> select k,max(id) from tbl group by k;
+---+---------+
| k | max(id) |
+---+---------+
| 0 | 2340920 |
| 1 | 2340916 |
| 2 | 2340932 |
| 3 | 2340928 |
| 4 | 2340924 |
+---+---------+
5 rows in set (0.00 sec)

This method applies only to very special aggregate functions: MIN() and MAX(). These do not need to go through all the rows in the group to compute the value.

They can just jump directly to the minimum or maximum value in the group (if there is a suitable index).

How can you find the MAX(ID) value for each group if the index is only built on the (K) column? This is an InnoDB table, and InnoDB tables effectively append the PRIMARY KEY to all secondary indexes. (K) becomes (K,ID), allowing us to use the skip-scan optimization for this query.
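The idea can be sketched outside of MySQL as well. In the hypothetical Python illustration below, a sorted list of (k, id) pairs stands in for the (K, ID) index, and for each distinct k we jump straight to the end of the group instead of reading every row:

import bisect

# stands in for the secondary index: (k, id) pairs in index order (the small ids are made up)
index = [(0, 10), (0, 2340920), (1, 5), (1, 2340916), (2, 7), (2, 2340932)]

pos = 0
while pos < len(index):
    k = index[pos][0]
    # find the first entry of the next group; the entry just before it holds MAX(id) for k
    next_pos = bisect.bisect_left(index, (k + 1,), lo=pos)
    print(k, index[next_pos - 1][1])
    pos = next_pos                    # skip the rest of the group entirely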

This optimization is only enabled if there is a large number of rows per group. Otherwise, MySQL prefers more conventional means to execute this query (like Index Ordered GROUP BY detailed in approach #1).

While we’re on MIN()/MAX() aggregate functions, other optimizations apply to them as well. For example, if you have an aggregate function with no GROUP BY (effectively having one group for the whole table), MySQL fetches those values from indexes during the statistics analysis phase and avoids reading the table during the execution stage altogether:

mysql> explain select max(k) from tbl\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: NULL
  partitions: NULL
        type: NULL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: NULL
    filtered: NULL
       Extra: Select tables optimized away
1 row in set, 1 warning (0.00 sec)

Filtering and Group By

We have looked at four ways MySQL executes GROUP BY.  For simplicity, I used GROUP BY on the whole table, with no filtering applied. The same concepts apply when you have a WHERE clause:

mysql> explain select g, sum(g) s from tbl where k>4 group by g order by NULL limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary
1 row in set, 1 warning (0.00 sec)

In this case, we use a range scan on the K column for data filtering/lookup and resolve the GROUP BY with a temporary table.

In some cases, the methods do not conflict. In others, however, we have to choose either to use one index for GROUP BY or another index for filtering:

mysql> alter table tbl add key(g);
Query OK, 0 rows affected (4.17 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain select g, sum(g) s from tbl where k>1 group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k,g
         key: g
     key_len: 4
         ref: NULL
        rows: 16
    filtered: 50.00
       Extra: Using where
1 row in set, 1 warning (0.00 sec)
mysql> explain select g, sum(g) s from tbl where k>4 group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k,g
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

Depending on the specific constants used in this query, we can see that we either use an index-ordered scan for GROUP BY (and “give up” using the index to resolve the WHERE clause), or use an index to resolve the WHERE clause (but use a temporary table to resolve GROUP BY).

In my experience, this is where MySQL GROUP BY does not always make the right choice. You might need to use FORCE INDEX to execute queries the way you want them to.

Summary

I hope this article provides a good overview of how MySQL executes GROUP BY.  In my next blog post, we will look into techniques you can use to optimize GROUP BY queries.

Jan
16
2018
--

Webinar January 18, 2018: MySQL Troubleshooting and Performance Optimization with Percona Monitoring and Management (PMM) Part 2

Percona Monitoring and Management

Join Percona’s Product Manager Michael Coburn as he presents MySQL Troubleshooting and Performance Optimization with Percona Monitoring and Management (PMM) Part 2 on Thursday, January 18, 2018, at 11:00 am PST / 2:00 pm EST (UTC-8).

Tags: Percona Monitoring and Management, PMM, Monitoring, MySQL, Performance, Optimization, DBA, SysAdmin, DevOps
Experience Level: Expert

Optimizing MySQL performance and troubleshooting MySQL problems are two of the most critical and challenging tasks for MySQL DBAs. The databases powering your applications need to handle heavy traffic loads while remaining responsive and stable, so that you can deliver an excellent user experience. Furthermore, DBAs are also expected to find cost-efficient means of solving these issues.

In this webinar — the second part of a two-part series — Michael discusses how you can optimize and troubleshoot MySQL performance and demonstrates how Percona Monitoring and Management (PMM) enables you to solve these challenges using free and open source software. We will look at specific, common MySQL problems and review the essential components in PMM that allow you to diagnose and resolve them.

By the end of this webinar, you will have a better understanding of how you can troubleshoot MySQL problems in your database.

Register for the webinar now.

Michael Coburn, Product Manager

Michael joined Percona as a Consultant in 2012 and progressed through various roles, including Managing Consultant, Principal Architect, Technical Account Manager, and Technical Support Engineer. He is now the Product Manager of Percona Monitoring and Management.

Dec
18
2017
--

Webinar Wednesday, December 20, 2017: InnoDB Performance Optimization

InnoDB Performance Optimization

Join Percona’s CEO and Co-Founder, Peter Zaitsev, as he presents InnoDB Performance Optimization on Wednesday, December 20, 2017, at 11:00 am PST / 2:00 pm EST (UTC-8).

InnoDB is one of the most commonly used storage engines for MySQL and Percona Server for MySQL. It balances high reliability with high performance and is the focus of the majority of storage engine development by the MySQL and Percona Server for MySQL teams.

This webinar looks at InnoDB, including new developments in MySQL 5.7 as well as Percona Server for MySQL. In it, Peter explains how to use InnoDB and covers many of the configuration options that help you get the best performance from your application.

Register for the webinar.

Peter Zaitsev, CEO

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016. Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group.

A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads. 

Dec
04
2017
--

Internal Temporary Tables in MySQL 5.7

InnoDB row operations graph from PMM

In this blog post, I investigate a case of spiking InnoDB Rows inserted in the absence of a write query, and find internal temporary tables to be the culprit.

Recently I was investigating an interesting case for a customer. We could see regular spikes on a graph depicting the “InnoDB rows inserted” metric (jumping from 1K/sec to 6K/sec), however we were not able to correlate those spikes with other activity. The innodb_row_inserted graph (picture from PMM demo) looked similar to this (but on a much larger scale):

InnoDB row operations graph from PMM

Other graphs (Com_*, Handler_*) did not show any spikes like that. I’ve examined the logs (we were not able to enable general log or change the threshold of the slow log), performance_schema, triggers, stored procedures, prepared statements and even reviewed the binary logs. However, I was not able to find any single write query which could have caused the spike to 6K rows inserted.

Finally, I figured out that I was focusing on the wrong queries. I was trying to correlate the spikes on the InnoDB Rows inserted graph to the DML queries (writes). However, the spike was caused by SELECT queries! But why would SELECT queries cause the massive InnoDB insert operation? How is this even possible?

It turned out that this is related to temporary tables on disk. In MySQL 5.7 the default setting for internal_tmp_disk_storage_engine is InnoDB. That means that if a SELECT needs to create a temporary table on disk (e.g., for GROUP BY), it will use the InnoDB storage engine.

Is that bad? Not necessarily. Krunal Bauskar originally published a blog post about InnoDB intrinsic table performance in MySQL 5.7. The InnoDB internal temporary tables are not redo/undo logged, so in general performance is better. However, here is what we need to watch out for:

  1. Change of the place where MySQL stores temporary tables. InnoDB temporary tables are stored in ibtmp1 tablespace file. There are a number of challenges with that:
    • Location of the ibtmp1 file. By default it is located inside the innodb datadir. Originally MyISAM temporary tables were stored in  tmpdir. We can configure the size of the file, but the location is always relative to InnoDB datadir, so to move it to tmpdir we need something like this: 
      innodb_temp_data_file_path=../../../tmp/ibtmp1:12M:autoextend
    • Like other tablespaces, it never shrinks back (though it is truncated on restart). A huge temporary table can fill the disk and hang MySQL (a bug has been opened). One way to avoid that is to set a maximum size for the ibtmp1 file: 
      innodb_temp_data_file_path=ibtmp1:12M:autoextend:max:1G
    • Like other InnoDB tables it has all the InnoDB limitations, i.e., InnoDB row or column limits. If it exceeds these, it will return “Row size too large” or “Too many columns” errors. The workaround is to set internal_tmp_disk_storage_engine to MYISAM.
  2. When all temp tables go to InnoDB, it may increase the total engine load as well as affect other queries. For example, if originally all datasets fit into the buffer pool and temporary tables were created outside of InnoDB, they did not affect the InnoDB memory footprint. Now, if a huge temporary table is created as an InnoDB table, it will use the innodb_buffer_pool and may “evict” existing pages, so other queries may perform slower.

Conclusion

Beware of this new behavior in MySQL 5.7: internal temporary tables (those that are created for SELECTs when a temporary table is needed) are stored in the InnoDB ibtmp1 tablespace file. In most cases this is faster. However, it can change the original behavior. If needed, you can switch the creation of internal temp tables back to MyISAM: 

set global internal_tmp_disk_storage_engine=MYISAM

Nov
20
2017
--

InnoDB Page Compression: the Good, the Bad and the Ugly

In this blog post, we’ll look at some of the facets of InnoDB page compression.

Somebody recently asked me about the best way to handle JSON data compression in MySQL. I took a quick look at InnoDB page compression and wanted to share my findings.

There is also some great material on this topic that was prepared and presented by Yura Sorokin at Percona Live Europe 2017: https://www.percona.com/live/e17/sessions/percona-xtradb-compressed-columns-with-dictionaries-an-alternative-to-innodb-table-compression. Yura also implemented Compressed Columns in Percona Server.

First, the good part.

InnoDB page compression is actually really easy to use and provides a decent compression ratio. To use it, I just ran

CREATE TABLE commententry (...) COMPRESSION="zlib";

 – and that’s all. By the way, for my experiment I used the subset of Reddit comments stored in JSON (described here: Big Dataset: All Reddit Comments – Analyzing with ClickHouse).

This method got me a compressed table of 3.9GB. Compare this to 8.4GB for an uncompressed table and it’s about a 2.15x compression ratio.

Now, the bad part.

As InnoDB page compression uses “hole punching,” the standard Linux utils do not always properly support files created this way. In fact, to see the size “3.9GB” I had to use du --block-size=1 tablespace_name.ibd, as the standard ls -l tablespace_name.ibd shows the wrong size (8.4GB). There is a similar limitation when copying files: the standard cp old_file new_file may not always work, and to be sure I had to use cp --sparse=always old_file new_file.
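If you prefer not to remember the du flags, the allocated size of a hole-punched file can also be checked programmatically. Here is a small Python sketch (the file name is the same illustrative one as above) comparing the apparent size with the blocks actually allocated on disk:

import os

st = os.stat("tablespace_name.ibd")
apparent = st.st_size            # what ls -l reports
allocated = st.st_blocks * 512   # st_blocks is in 512-byte units; this matches what du reports
print("apparent: %.1f GB, allocated: %.1f GB" % (apparent / 1e9, allocated / 1e9))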

Speaking about copying, here’s the ugly part.

The actual time to copy the sparse file was really bad.

On a fairly fast device (a Samsung SM863), copying the sparse file mentioned above, with its compressed on-disk size of 3.9GB, took 52 minutes! That’s shocking, so let me repeat it: 52 minutes to copy a 3.9GB file on an enterprise SATA SSD.

By comparison, copying the regular 8.4GB file takes 9 seconds! Compare 9 seconds to 52 minutes.

To be fair, the NVMe device (Intel® SSD DC D3600) handles sparse files much better. It took only 12 seconds to copy the same sparse file on this device.

Having considered all this, it is hard to recommend that you use InnoDB page compression for serious production. Well, unless you power your database servers with NVMe storage.

For JSON data, Compressed Columns in Percona Server for MySQL should work quite well, using a dictionary to store the JSON keys – give it a try!
