Oct
17
2017
--

Webinar Wednesday, October 18, 2017: How to Scale with MongoDB

Scale with MongoDB

Scale with MongoDBJoin Percona’s Senior Technical Services Engineer Adamo Tonete as he presents How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

In this webinar, we will talk about how to scale with MongoDB, up to thousands of writes and reads per second. What are the common issues when you scale with MongoDB? Is it better to shard or to add further secondaries?

We will walk through many common scaling situations, and through the steps needed to deploy a sharded cluster: from a single instance to a sharded environment. We will also talk about common mistakes/pitfalls a company can make when scaling its database – and how to avoid such situations.

Register for the webinar.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL database administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24×7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three has moved to NoSQL technologies without giving up relational databases. He likes to play video games and study everything that is related to engines. Adamo lives with his wife in São Paulo, Brazil.

Oct
01
2017
--

One Million Tables in MySQL 8.0

MySQL 8.0

In my previous blog post, I talked about new general tablespaces in MySQL 8.0. Recently MySQL 8.0.3-rc was released, which includes a new data dictionary. My goal is to create one million tables in MySQL and test the performance.

Background questions

Q: Why million tables in MySQL? Is it even realistic? How does this happen?

Usually, millions of tables in MySQL is a result of “a schema per customer” Software as a Service (SaaS) approach. For the purposes of customer data isolation (security) and logical data partitioning (performance), each “customer” has a dedicated schema. You can think of a WordPress hosting service (or any CMS based hosting) where each customer has their own dedicated schema. With 10K customers per MySQL server, we could end up with millions of tables.

Q: Should you design an application with >1 million tables?

Having separate tables is one of the easiest designs for a multi-tenant or SaaS application, and makes it easy to shard and re-distribute your workload between servers. In fact, the table-per-customer or schema-per-customer design has the quickest time-to-market, which is why we see it a lot in consulting. In this post, we are not aiming to cover the merits of should you do this (if your application has high churn or millions of free users, for example, it might not be a good idea). Instead, we will focus on if the new data dictionary provides relief to a historical pain point.

Q: Why is one million tables a problem?

The main issue results from the fact that MySQL needs to open (and eventually close) the table structure file (FRM file). With one million tables, we are talking about at least one million files. Originally MySQL fixed it with table_open_cache and table_definition_cache. However, the maximum value for table_open_cache is 524288. In addition, it is split into 16 partitions by default (to reduce the contention). So it is not ideal. MySQL 8.0 has removed FRM files for InnoDB, and will now allow you to create general tablespaces. I’ve demonstrated how we can create tablespace per customer in MySQL 8.0, which is ideal for “schema-per-customer” approach (we can move/migrate one customer data to a new server by importing/exporting the tablespace).

One million tables in MySQL 5.7

Recently, I’ve created the test with one million tables. The test creates 10K databases, and each database contains 100 tables. To use a standard benchmark I’ve employed sysbench table structure.

mysql> select count(*) from information_schema.schemata where schema_name like 'test_sbtest%';
+----------+
| count(*) |
+----------+
| 10000    |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from information_schema.tables where table_schema like 'test_sbtest%';
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (4.61 sec)

This also creates a huge overhead: with one million tables we have ~two million files. Each .frm file and .ibd file size sums up to 175G:

# du -sh /ssd/mysql_57
175G    /ssd/mysql_57

Now I’ve used sysbench Lua script to insert one row randomly into one table

pathtest = "/usr/share/sysbench/tests/include/oltp_legacy/"
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end
function event()
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   local oltp_tables_count = 100
   local oltp_db_count = 10000
   table_name = "test_sbtest_" .. sb_rand_uniform(1, oltp_db_count) .. ".sbtest".. sb_rand_uniform(1, oltp_tables_count)
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
   rs = db_query("INSERT INTO " .. table_name ..
                       " (id, k, c, pad) VALUES " ..
                       string.format("(%d, %d, '%s', '%s')", i, k_val, c_val,
                                     pad_val))
   end
end

With:

local oltp_tables_count = 100
   local oltp_db_count = 10000

Sysbench will choose one table randomly out of one million. With oltp_tables_count = 1 and oltp_db_count = 100, it will only choose the first table (sbtest1) out of the first 100 databases (randomly).

As expected, MySQL 5.7 has a huge performance degradation when going across one million tables. When running a script that only inserts data into 100 random tables, we can see ~150K transactions per second. When the data is inserted in one million tables (chosen randomly) performance drops to 2K (!) transactions per second:

Insert into 100 random tables:

SQL statistics:
    queries performed:
        read:                            0
        write:                           16879188
        other:                           0
        total:                           16879188
    transactions:                        16879188 (140611.72 per sec.)
    queries:                             16879188 (140611.72 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Insert into one million random tables:

SQL statistics:
    queries performed:
        read:                            0
        write:                           243533
        other:                           0
        total:                           243533
    transactions:                        243533 (2029.21 per sec.)
    queries:                             243533 (2029.21 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

This is expected. Here I’m testing the worse case scenario, where we can’t keep all table open handlers and table definitions in cache (memory) since the table_open_cache and table_definition_cache both have a limit of 524288.

Also, normally we can expect a huge skew between access to the tables. There can be only 20% active customers (80-20 rule), meaning that we can only expect an active access to 2K databases. In addition, there will be old or unused tables so we can expect around 100K or less of active tables.

Hardware and config files

The above results are from this server:

Processors   | 64xGenuine Intel(R) CPU @ 2.00GHz
Memory Total | 251.8G
Disk         | Samsung 950 Pro PCIE SSD (nvme)

Sysbench script:

sysbench $conn --report-interval=1 --num-threads=32 --max-requests=0 --max-time=600 --test=/root/drupal_demo/insert_custom.lua run

My.cnf:

innodb_buffer_pool_size = 100G
innodb_io_capacity=20000
innodb_flush_log_at_trx_commit = 0
innodb_log_file_size = 2G
innodb_flush_method=O_DIRECT_NO_FSYNC
skip-log-bin
open_files_limit=1000000
table_open_cache=524288
table_definition_cache=524288

One million tables in MySQL 8.0 + general tablespaces

In MySQL 8.0 is it easy and logical to create one general tablespace per each schema (it will host all tables in this schema). In MySQL 5.7, general tablespaces are available – but there are still .frm files.

I’ve used the following script to create 100 tables in one schema all in one tablespace:

mysql test -e "CREATE TABLESPACE t ADD DATAFILE 't.ibd' engine=InnoDB;"
for i in {1..10000}
do
           mysql test -e "create table ab$i(i int) tablespace t"
done

The new MySQL 8.0.3-rc also uses the new data dictionary, so all MyISAM tables in the mysql schema are removed and all metadata is stored in additional mysql.ibd file.

Creating one million tables

Creating InnoDB tables fast enough can be a task by itself. Stewart Smith published a blog post a while ago where he focused on optimizing time to create 30K tables in MySQL.

The problem is that after creating an .ibd file, MySQL needs to “fsync” it. However, when creating a table inside the tablespace, there is no fsync. I’ve created a simple script to create tables in parallel, one thread per database:

#/bin/bash
function do_db {
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..100}
do
        for i in {1..100}
        do
                let c=$c+1
                echo $c
                db="test_sbtest_$c"
                do_db &
        done
        wait
done

That script works perfectly in MySQL 8.0.1-dmr and creates one million tables in 25 minutes and 28 seconds (1528 seconds). That is ~654 tables per second. That is significantly faster than ~30 tables per second in the original Stewart’s test and 2x faster than a test where all fsyncs were artificially disabled using libeat-my-data library.

Unfortunately, in MySQL 8.0.3-rc some regression was introduced. In MySQL 8.0.3-rc I can see heavy mutex contention, and the table creation speed dropped from 25 minutes to ~280 minutes. I’ve filed a bug report: performance regression: “create table” speed and scalability in 8.0.3.

Size on disk

With general tablespaces and no .frm files, the size on disk decreased:

# du -h -d1 /ssd/
147G    /ssd/mysql_801
119G    /ssd/mysql_803
175G    /ssd/mysql_57

Please note though that in MySQL 8.0.3-rc, with new native data dictionary, the size on disk increased as it needs to write additional information (Serialized Dictionary Information, SDI) to the tablespace files:

InnoDB: Serialized Dictionary Information (SDI) is now present in all InnoDB tablespace files
except for temporary tablespace and undo tablespace files.
SDI is serialized metadata for schema, table, and tablespace objects.
The presence of SDI data provides metadata redundancy.
...
The inclusion of SDI data in tablespace files increases tablespace file size.
An SDI record requires a single index page, which is 16k in size by default.
However, SDI data is compressed when it is stored to reduce the storage footprint.

The general mysql data dictionary in MySQL 8.0.3 is 6.6Gb:

6.6G /ssd/mysql/mysql.ibd

Benchmarking the insert speed in MySQL 8.0 

I’ve repeated the same test I’ve done for MySQL 5.7 in MySQL 8.0.3-rc (and in 8.0.1-dmr), but using general tablespace. I created 10K databases (=10K tablespace files), each database has100 tables and each database resides in its own tablespace.

There are two new tablespace level caches we can use in MySQL 8.0: tablespace_definition_cache and schema_definition_cache:

tablespace_definition_cache = 15000
schema_definition_cache = 524288

Unfortunately, with one million random table accesses in MySQL 8.0 (both 8.0.1 and 8.0.3), we can still see that it stalls on opening tables (even with no .frm files and general tablespaces):

mysql> select conn_id, current_statement, state, statement_latency, lock_latency from sys.processlist where current_statement is not null and conn_id <> CONNECTION_ID();
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
| conn_id | current_statement                                                 | state          | statement_latency | lock_latency |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
|     199 | INSERT INTO test_sbtest_9749.s ... 8079-53209333270-93105555128') | Opening tables | 4.45 ms           | 0 ps         |
|     198 | INSERT INTO test_sbtest_1863.s ... 9574-29782886623-39251573705') | Opening tables | 9.95 ms           | 5.67 ms      |
|     189 | INSERT INTO test_sbtest_3948.s ... 9365-63502117132-66650064067') | Opening tables | 16.29 ms          | 15.38 ms     |
|     190 | INSERT INTO test_sbtest_6885.s ... 8436-41291265610-60894472357') | Opening tables | 13.78 ms          | 9.52 ms      |
|     191 | INSERT INTO test_sbtest_247.sb ... 7467-89459234028-92064334941') | Opening tables | 8.36 ms           | 3.18 ms      |
|     192 | INSERT INTO test_sbtest_9689.s ... 8058-74586985382-00185651578') | Opening tables | 6.89 ms           | 0 ps         |
|     193 | INSERT INTO test_sbtest_8777.s ... 1900-02582963670-01868315060') | Opening tables | 7.09 ms           | 5.70 ms      |
|     194 | INSERT INTO test_sbtest_9972.s ... 9057-89011320723-95018545652') | Opening tables | 9.44 ms           | 9.35 ms      |
|     195 | INSERT INTO test_sbtest_6977.s ... 7902-29158428721-66447528241') | Opening tables | 7.82 ms           | 789.00 us    |
|     196 | INSERT INTO test_sbtest_129.sb ... 2091-86346366083-87657045906') | Opening tables | 13.01 ms          | 7.30 ms      |
|     197 | INSERT INTO test_sbtest_1418.s ... 6581-90894769279-68213053531') | Opening tables | 16.35 ms          | 10.07 ms     |
|     208 | INSERT INTO test_sbtest_4757.s ... 4592-86183240946-83973365617') | Opening tables | 8.66 ms           | 2.84 ms      |
|     207 | INSERT INTO test_sbtest_2152.s ... 5459-55779113235-07063155183') | Opening tables | 11.08 ms          | 3.89 ms      |
|     212 | INSERT INTO test_sbtest_7623.s ... 0354-58204256630-57234862746') | Opening tables | 8.67 ms           | 2.80 ms      |
|     215 | INSERT INTO test_sbtest_5216.s ... 9161-37142478639-26288001648') | Opening tables | 9.72 ms           | 3.92 ms      |
|     210 | INSERT INTO test_sbtest_8007.s ... 2999-90116450579-85010442132') | Opening tables | 1.33 ms           | 0 ps         |
|     203 | INSERT INTO test_sbtest_7173.s ... 2718-12894934801-25331023143') | Opening tables | 358.09 us         | 0 ps         |
|     209 | INSERT INTO test_sbtest_1118.s ... 8361-98642762543-17027080501') | Opening tables | 3.32 ms           | 0 ps         |
|     219 | INSERT INTO test_sbtest_5039.s ... 1740-21004115002-49204432949') | Opening tables | 8.56 ms           | 8.44 ms      |
|     202 | INSERT INTO test_sbtest_8322.s ... 8686-46403563348-31237202393') | Opening tables | 1.19 ms           | 0 ps         |
|     205 | INSERT INTO test_sbtest_1563.s ... 6753-76124087654-01753008993') | Opening tables | 9.62 ms           | 2.76 ms      |
|     213 | INSERT INTO test_sbtest_5817.s ... 2771-82142650177-00423653942') | Opening tables | 17.21 ms          | 16.47 ms     |
|     216 | INSERT INTO test_sbtest_238.sb ... 5343-25703812276-82353892989') | Opening tables | 7.24 ms           | 7.20 ms      |
|     200 | INSERT INTO test_sbtest_2637.s ... 8022-62207583903-44136028229') | Opening tables | 7.52 ms           | 7.39 ms      |
|     204 | INSERT INTO test_sbtest_9289.s ... 2786-22417080232-11687891881') | Opening tables | 10.75 ms          | 9.01 ms      |
|     201 | INSERT INTO test_sbtest_6573.s ... 0106-91679428362-14852851066') | Opening tables | 8.43 ms           | 7.03 ms      |
|     217 | INSERT INTO test_sbtest_1071.s ... 9465-09453525844-02377557541') | Opening tables | 8.42 ms           | 7.49 ms      |
|     206 | INSERT INTO test_sbtest_9588.s ... 8804-20770286377-79085399594') | Opening tables | 8.02 ms           | 7.50 ms      |
|     211 | INSERT INTO test_sbtest_4657.s ... 4758-53442917995-98424096745') | Opening tables | 16.62 ms          | 9.76 ms      |
|     218 | INSERT INTO test_sbtest_9672.s ... 1537-13189199316-54071282928') | Opening tables | 10.01 ms          | 7.41 ms      |
|     214 | INSERT INTO test_sbtest_1391.s ... 9241-84702335152-38653248940') | Opening tables | 21.34 ms          | 15.54 ms     |
|     220 | INSERT INTO test_sbtest_6542.s ... 7778-65788940102-87075246009') | Opening tables | 2.96 ms           | 0 ps         |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
32 rows in set (0.11 sec)

And the transactions per second drops to ~2K.

Here I’ve expected different behavior. With the .frm files gone and with tablespace_definition_cache set to more than 10K (we have only 10K tablespace files), I’ve expected that MySQL does not have to open and close files. It looks like this is not the case.

I can also see the table opening (since the server started):

mysql> show global status like '%open%';
+----------------------------+-----------+
| Variable_name              | Value     |
+----------------------------+-----------+
| Com_ha_open                | 0         |
| Com_show_open_tables       | 0         |
| Innodb_num_open_files      | 10040     |
| Open_files                 | 0         |
| Open_streams               | 0         |
| Open_table_definitions     | 524288    |
| Open_tables                | 499794    |
| Opened_files               | 22        |
| Opened_table_definitions   | 1220904   |
| Opened_tables              | 2254648   |
| Slave_open_temp_tables     | 0         |
| Table_open_cache_hits      | 256866421 |
| Table_open_cache_misses    | 2254643   |
| Table_open_cache_overflows | 1254766   |
+----------------------------+-----------+

This is easier to see on the graphs from PMM. Insert per second for the two runs (both running 16 threads):

  1. The first run is 10K random databases/tablespaces and one table (sysbench is choosing table#1 from a randomly chosen list of 10K databases). This way there is also no contention on the tablespace file.
  2. The second run is a randomly chosen table from a list of one million tables.

As we can see, the first run is dong 50K -100K inserts/second. Second run is only limited to ~2.5 inserts per second:

MySQL 8.0

“Table open cache misses” grows significantly after the start of the second benchmark run:
MySQL 8.0

As we can see, MySQL performs ~1.1K table definition openings per second and has ~2K table cache misses due to the overflow:

MySQL 8.0

When inserting against only 1K random tables (one specific table in a random database, that way we almost guarantee that one thread will always write to a different tablespace file), the table_open_cache got warmed up quickly. After a couple of seconds, the sysbench test starts showing > 100K tps. The processlist looks much better (compare the statement latency and lock latency to the above as well):

mysql> select conn_id, current_statement, state, statement_latency, lock_latency from sys.processlist where current_statement is not null and conn_id <> CONNECTION_ID();
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
| conn_id | current_statement                                                 | state          | statement_latency | lock_latency |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
|     253 | INSERT INTO test_sbtest_3293.s ... 2282-95400708146-84684851551') | starting       | 22.72 us          | 0 ps         |
|     254 | INSERT INTO test_sbtest_3802.s ... 4030-35983148190-23616685226') | update         | 62.88 us          | 45.00 us     |
|     255 | INSERT INTO test_sbtest_5290.s ... 2361-58374942527-86207214617') | Opening tables | 36.07 us          | 0 ps         |
|     256 | INSERT INTO test_sbtest_5684.s ... 4717-34992549120-04746631452') | Opening tables | 37.61 us          | 37.00 us     |
|     257 | INSERT INTO test_sbtest_5088.s ... 5637-75275906887-76112520982') | starting       | 22.97 us          | 0 ps         |
|     258 | INSERT INTO test_sbtest_1375.s ... 8592-24036624620-65536442287') | query end      | 98.66 us          | 35.00 us     |
|     259 | INSERT INTO test_sbtest_8764.s ... 8566-02569157908-49891861265') | Opening tables | 47.13 us          | 37.00 us     |
|     260 | INSERT INTO test_sbtest_560.sb ... 2605-08226572929-25889530906') | query end      | 155.64 us         | 38.00 us     |
|     261 | INSERT INTO test_sbtest_7776.s ... 0243-86335905542-37976752368') | System lock    | 46.68 us          | 32.00 us     |
|     262 | INSERT INTO test_sbtest_6551.s ... 5496-19983185638-75401382079') | update         | 74.07 us          | 40.00 us     |
|     263 | INSERT INTO test_sbtest_7765.s ... 5428-29707353898-77023627427') | update         | 71.35 us          | 45.00 us     |
|     265 | INSERT INTO test_sbtest_5771.s ... 7065-03531013976-67381721569') | query end      | 138.42 us         | 39.00 us     |
|     266 | INSERT INTO test_sbtest_8603.s ... 7158-66470411444-47085285977') | update         | 64.00 us          | 36.00 us     |
|     267 | INSERT INTO test_sbtest_3983.s ... 5039-55965227945-22430910215') | update         | 21.04 ms          | 39.00 us     |
|     268 | INSERT INTO test_sbtest_8186.s ... 5418-65389322831-81706268892') | query end      | 113.58 us         | 37.00 us     |
|     269 | INSERT INTO test_sbtest_1373.s ... 1399-08304962595-55155170406') | update         | 131.97 us         | 59.00 us     |
|     270 | INSERT INTO test_sbtest_7624.s ... 0589-64243675321-62971916496') | query end      | 120.47 us         | 38.00 us     |
|     271 | INSERT INTO test_sbtest_8201.s ... 6888-31692084119-80855845726') | query end      | 109.97 us         | 37.00 us     |
|     272 | INSERT INTO test_sbtest_7054.s ... 3674-32329064814-59707699237') | update         | 67.99 us          | 35.00 us     |
|     273 | INSERT INTO test_sbtest_3019.s ... 1740-35410584680-96109859552') | update         | 5.21 ms           | 33.00 us     |
|     275 | INSERT INTO test_sbtest_7657.s ... 4985-72017519764-59842283878') | update         | 88.91 us          | 48.00 us     |
|     274 | INSERT INTO test_sbtest_8606.s ... 0580-38496560423-65038119567') | freeing items  | NULL              | 37.00 us     |
|     276 | INSERT INTO test_sbtest_9349.s ... 0295-94997123247-88008705118') | starting       | 25.74 us          | 0 ps         |
|     277 | INSERT INTO test_sbtest_3552.s ... 2080-59650597118-53885660147') | starting       | 32.23 us          | 0 ps         |
|     278 | INSERT INTO test_sbtest_3832.s ... 1580-27778606266-19414961452') | freeing items  | 194.14 us         | 51.00 us     |
|     279 | INSERT INTO test_sbtest_7685.s ... 0234-22016898044-97277319766') | update         | 62.66 us          | 40.00 us     |
|     280 | INSERT INTO test_sbtest_6026.s ... 2629-36599580811-97852201188') | Opening tables | 49.41 us          | 37.00 us     |
|     281 | INSERT INTO test_sbtest_8273.s ... 7957-39977507737-37560332932') | update         | 92.56 us          | 36.00 us     |
|     283 | INSERT INTO test_sbtest_8584.s ... 7604-24831943860-69537745471') | starting       | 31.20 us          | 0 ps         |
|     284 | INSERT INTO test_sbtest_3787.s ... 1644-40368085836-11529677841') | update         | 100.41 us         | 40.00 us     |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
30 rows in set (0.10 sec)

What about the 100K random tables? That should fit into the table_open_cache. At the same time, the default 16 table_open_cache_instances split 500K table_open_cache, so each bucket is only ~30K. To fix that, I’ve set table_open_cache_instances = 4 and was able to get ~50K tps average. However, the contention inside the table_open_cache seems to stall the queries:

MySQL 8.0

There are only a very limited amount of table openings:

MySQL 8.0

 

Conclusion

MySQL 8.0 general tablespaces looks very promising. It is finally possible to create one million tables in MySQL without the need to create two million files. Actually, MySQL 8 can handle many tables very well as long as table cache misses are kept to a minimum.

At the same time, the problem with “Opening tables” (worst case scenario test) still persists in MySQL 8.0.3-rc and limits the throughput. I expected to see that MySQL does not have to open/close the table structure file. I also hope the create table regression bug is fixed in the next MySQL 8.0 version.

I’ve not tested other new features in the new data dictionary in 8.0.3-rc: i.e., atomic DDL (InnoDB now supports atomic DDL, which ensures that DDL operations are either committed in their entirety or rolled back in case of an unplanned server stoppage). That is the topic of the next blog post.

Sep
21
2017
--

Percona Live Europe Featured Talks: Modern sysbench – Teaching an Old Dog New Tricks with Alexey Kopytov

Percona Live Europe 2017

Percona Live EuropeWelcome to another post in our series of interview blogs for the upcoming Percona Live Europe 2017 in Dublin. This series highlights a number of talks that will be at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post is with Alexey Kopytov, sofware developer and maintainer of sysbench. His talk is Modern sysbench: Teaching an Old Dog New Tricks. His presentation present new features provided by recent releases and explain how they can be used to create complex benchmark scenarios and collect performance metrics with a simple Lua API. It will also run a live demo of some of the new sysbench features.

In our conversation, we discussed benchmarking your database environment:

Percona: How did you get into database technology? What do you love about it?

Alexey: It was 2003, and I was working as a software developer for a boring company providing hosted VoIP solutions. I was a big fan of the free and open source software philosophy, which was way less popular back then than it is today. I contributed to a number of open source projects in my free time, but I also had a dream of developing open source software as part of my paid job. This looked completely unrealistic at the time, until I came across a job posting on a Russian IT forum about a Swedish company called MySQL AB looking for software developers to work remotely on MySQL! That sounded like my dream job, so I applied.

I knew very little about database internals at the time, so looking back I was giving terrible answers during my job interviews. Nevertheless, I joined the High Performance Group at MySQL AB after a few months, and that has defined my professional life for many years.

I love database technology because it presents the toughest challenges in software development. Most problems and solutions related to ever-evolving hardware, scalability and data processing requirements are discovered first by people from the database world.

Percona: Your talk is called “Modern sysbench: Teaching an Old Dog New Tricks”. What is sysbench used for generally, why is it important and how have you used it in your career? 

Alexey: sysbench was an internal project that I took over as soon as I joined MySQL AB. We used it to troubleshoot customer issues, find performance bottlenecks in MySQL and evaluate new features. Of course it was an open source project, so over the years we’ve got many people from the MySQL community using sysbench for all kinds of performance research like testing new hardware, identifying performance-related issues and comparing MySQL configurations, versions and forks.

Percona: What are some of the important new developments in the latest release?

Alexey: This year sysbench got a major upgrade in terms of features and performance to meet the modern world of many-core CPUs, powerful storage devices and distributed database systems capable of processing millions of transactions per second. Some feature highlights from the latest release include simplified command-line interface, a revamped API which allows creating more complex benchmark scenarios with less code, new performance metrics, customizable reports and more!

Percona: What do you want attendees to take away from your session? Why should they attend?

Alexey: sysbench is quite popular, but most people rarely use it more than a few bundled OLTP-style benchmarks. I’d like to explain its full potential, especially the possibilities provided by the new features. I want people to use it to create their own benchmarks, not necessarily related to MySQL, and hopefully find sysbench useful in areas that I have not even envisioned myself.

Percona: What are you most looking forward to at Percona Live Europe 2017?

Alexey: For me Percona Live conferences have always been the place where I can feel the pulse of the technology and learn from the smartest people in the industry. This is especially true now that Percona Live provides talks on diverse topics from communities and database management technologies other than MySQL. Which makes it an even greater event to share ideas, solutions and expertise.

Want to find out more about Alexey, sysbench and database benchmarking? Register for Percona Live Europe 2017, and see his talk Modern sysbench: Teaching an Old Dog New Tricks. Register now to get the best price! Use discount code SeeMeSpeakPLE17 to get 10% off your registration.

Percona Live Open Source Database Conference Europe 2017 in Dublin is the premier European open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, MariaDB, MongoDB, time series database, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference Europe will be September 25-27, 2017 at the Radisson Blu Royal Hotel, Dublin.

Aug
28
2017
--

Looking at Disk Utilization and Saturation

DIsk Utilization and Saturation small

In this blog post, I will look at disk utilization and saturation.

In my previous blog post, I wrote about CPU utilization and saturation, the practical difference between them and how different CPU utilization and saturation impact response times. Now we will look at another critical component of database performance: the storage subsystem. In this post, I will refer to the storage subsystem as “disk” (as a casual catch-all). 

The most common tool for command line IO performance monitoring is

iostat

, which shows information like this:

root@ts140i:~# iostat -x nvme0n1 5
Linux 4.4.0-89-generic (ts140i)         08/05/2017      _x86_64_        (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.51    0.00    2.00    9.45    0.00   88.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 3555.57 5887.81 52804.15 87440.73    29.70     0.53    0.06    0.13    0.01   0.05  50.71
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.60    0.00    1.06   20.77    0.00   77.57
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.50    0.00    1.26    6.08    0.00   92.16
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7653.20    0.00 113497.60     0.00    29.66     0.99    0.13    0.13    0.00   0.12  93.52

The first line shows the average performance since system start. In some cases, it is useful to compare the current load to the long term average. In this case, as it is a test system, it can be safely ignored. The next line shows the current performance metrics over five seconds intervals (as specified in the command line).

The

iostat

 command reports utilization information in the %util column, and you can look at saturation by either looking at the average request queue size (the avgqu-sz column) or looking at the r_await and w_await columns (which show the average wait for read and write operations). If it goes well above “normal” then the device is over-saturated.

As in my previous blog post, we’ll perform some system Sysbench runs and observe how the

iostat

 command line tool and Percona Monitoring and Management graphs behave.

To focus specifically on the disk, we’re using the Sysbench fileio test. I’m using just one 100GB file, as I’m using DirectIO so all requests hit the disk directly. I’m also using “sync” request submission mode so I can get better control of request concurrency.

I’m using an Intel 750 NVME SSD in this test (though it does not really matter).

Sysbench FileIO 1 Thread

root@ts140i:/mnt/data# sysbench  --threads=1 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      7113.16
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  111.14
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0001s
   total number of events:              4267910
Latency (ms):
        min:                                  0.07
        avg:                                  0.14
        max:                                  6.18
        95th percentile:                      0.17

A single thread run is always great as a baseline, as with only one request in flight we should expect the best response time possible (though typically not the best throughput possible).

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68

Disk LatencyDIsk Utilization and Saturation

The Disk Latency graph confirms the disk IO latency we saw in the

iostat

 command, and it will be highly device-specific. We use it as a baseline to compare changes to with higher concurrency.

Disk IO Utilization

DIsk Utilization and Saturation 2

Disk IO utilization is close to 100% even though we have just one outstanding IO request (queue depth). This is the problem with Linux disk utilization reporting: unlike CPUs, Linux does not have direct visibility on how the IO device is designed. How many “execution units” does it really have? How are they utilized?  Single spinning disks can be seen as a single execution unit while RAID, SSDs and cloud storage (such as EBS) are more than one.

Disk Load

DIsk Utilization and Saturation 3

This graph shows the disk load (or request queue size), which roughly matches the number of threads that are hitting disk as hard as possible.

Saturation (IO Load)

DIsk Utilization and Saturation 4

The IO load on the Saturation Metrics graph shows pretty much the same numbers. The only difference is that unlike Disk IO statistics, it shows the summary for the whole system.

Sysbench FileIO 4 Threads

Now let’s increase IO to four concurrent threads and see how disk responds:

sysbench  --threads=4 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      26248.44
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  410.13
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0002s
   total number of events:              15749205
Latency (ms):
        min:                                  0.06
        avg:                                  0.15
        max:                                  8.73
        95th percentile:                      0.21

We can see the number of requests scales almost linearly, while request latency changes very little: 0.14ms vs. 0.15ms. This shows the device has enough execution units internally to handle the load in parallel, and there are no other bottlenecks (such as the connection interface).

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 28808.60    0.00 427668.00     0.00    29.69     4.05    0.14    0.14    0.00   0.03  99.92

Disk Latency

DIsk Utilization and Saturation 5

Disk Utilization

DIsk Utilization and Saturation 6

Disk Load

DIsk Utilization and Saturation 7

Saturation Metrics (IO Load)

DIsk Utilization and Saturation 8

These stats and graphs show interesting picture: we barely see a response time increase for IO requests, while utilization inches closer to 100% (with four threads submitting requests all the time, it is hard to catch the time when the disk does not have any requests in flight). The load is near four (showing the disk has to handle four requests at the time on average).

Sysbench FileIO 16 Threads

root@ts140i:/mnt/data# sysbench  --threads=16 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      76845.96
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  1200.72
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0003s
   total number of events:              46107727
Latency (ms):
        min:                                  0.07
        avg:                                  0.21
        max:                                  9.72
        95th percentile:                      0.36

Going from four to 16 threads, we again see a good throughput increase with a mild response time increase. If you look at the results closely, you will notice one more interesting thing: the average response time has increased from 0.15ms to 0.21ms (which is a 40% increase), while the 95% response time has increased from 0.21ms to 0.36ms (which is 71%). I also ran a separate test measuring 99% response time, and the difference is even larger: 0.26ms vs. 0.48ms (or 84%).

This is an important observation to make: once saturation starts to happen, the variance is likely to increase and some of the requests will be disproportionately affected (beyond what the average response time shows).

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 82862.20    0.00 1230567.20     0.00    29.70    16.33    0.20    0.20    0.00   0.01 100.00

Disk IO Latency

DIsk Utilization and Saturation 9

Disk IO Utilization

DIsk Utilization and Saturation 10

Disk Load

DIsk Utilization and Saturation 11

Saturation Metrics IO Load

DIsk Utilization and Saturation 12

The graphs show an expected figure: the disk load and IO load from saturation are up to about 16, and utilization remains at 100%.

One thing to notice is increased jitter in the graphs. IO utilization jumps to over 100% and disk IO load spikes to 18, when there should not be as many requests in flight. This comes from how this information is gathered. An attempt is made to sample this data every second, but with the loaded system it takes time for this process to work: sometimes when we try to get the data for a one-second interval but really get data for 1.05- or 0.95-second intervals. When the math is applied to the data, it creates the spikes and dips in the graph when there should be none. You can just ignore them if you’re looking at the big picture.

Sysbench FileIO 64 Threads

Finally, let’s run sysbench with 64 concurrent threads hitting the disk:

root@ts140i:/mnt/data# sysbench  --threads=64 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      127840.59
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  1997.51
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0014s
   total number of events:              76704744
Latency (ms):
        min:                                  0.08
        avg:                                  0.50
        max:                                  9.34
        95th percentile:                      1.25

We can see the average has risen from 0.21ms to 0.50 (more than two times), and 95% almost tripped from 0.36ms to 1.25ms. From a practical standpoint, we can see some saturation starting to happen, but we’re still not seeing a linear response time increase with increasing numbers of parallel operations as we have seen with CPU saturation. I guess this points to the fact that this IO device has a lot of parallel capacity inside and can process requests more effectively (even going from 16 to 64 concurrent threads).

Over the series of tests, as we increased concurrency from one to 64, we saw response times increase from 0.14ms to 0.5ms (or approximately three times). The 95% response time at this time grew from 0.17ms to 1.25ms (or about seven times). For practical purposes, this is where we see the IO device saturation start to show.

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 138090.20    0.00 2049791.20     0.00    29.69    65.99    0.48    0.48    0.00   0.01 100.24

We’ll skip the rest of the graphs as they basically look the same, just with higher latency and 64 requests in flight.

Sysbench FileIO 256 Threads

root@ts140i:/mnt/data# sysbench  --threads=256 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      131558.79
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  2055.61
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0026s
   total number of events:              78935828
Latency (ms):
        min:                                  0.10
        avg:                                  1.95
        max:                                 17.08
        95th percentile:                      3.89

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 142227.60    0.00 2112719.20     0.00    29.71   268.30    1.89    1.89    0.00   0.01 100.00

With 256 threads, finally we’re seeing the linear growth of the average response time that indicates overload and queueing to process requests. There is no easy way to tell if it is due to the IO bus saturation (we’re reading 2GB/sec here) or if it is the internal device processing ability.  

As we’ve seen a less than linear increase in response time going from 16 to 64 connections, and a linear increase going from 64 to 256, we can see the “optimal” concurrency for this device: somewhere between 16 and 64 connections. This allows for peak throughput without a lot of queuing.

Before we get to the summary, I want to make an important note about this particular test. The test is a random reads test, which is a very important pattern for many database workloads, but it might not be the dominant load for your environment. You might be write-bound as well, or have mainly sequential IO access patterns (which could behave differently). For those other workloads, I hope this gives you some ideas on how to also analyze them.

Another Way to Think About Saturation

When I asked the Percona staff for feedback on this blog post by, my colleague Yves Trudeau provided another way to think about saturation: measure saturation as percent increase in the average response time compared to the single user. Like this:

Threads Avg Response Time Saturation
1 0.14
4 0.15 1.07x  or 7%
16 0.21 1.5x  or 50%
64 0.50 3.6x or 260%
256 1.95 13.9x or 1290%

 

Summary

We can see how understanding disk utilization and saturation is much more complicated than for the CPU:

  • The Utilization metric (as reported by
    iostat

     and by PMM) is not very helpful for showing true storage utilization, as it only measures the time when there is at least one request in flight. If you had the same metric for the CPU, it would correspond to something running on at least one of the cores (not very useful for highly parallel systems).

  • Unlike a CPU, Linux tools do not provide us with information about the structure of the underlying storage and how much parallel load it should be able to handle without saturation. Even more so, storage might well have different low-level resources that cause saturation. For example, it could be the network connection, SATA BUS or even the kernel IO stack for older kernels and very fast storage.
  • Saturation as measured by the number of requests in flight is helpful for guessing if there might be saturation, but since we do not know how many requests the device can efficiently process concurrently, just looking the raw metric doesn’t let us determine that the device is overloaded.
  • Avg Response Time is a great metric for looking at saturation, but as with the response time you can’t say what response time is good or bad for this device. You need to look at it in context and compare it to the baseline. When you’re looking at the Avg Response Time, make sure you’re looking at read request response time vs. write request response time separately, and keep the average request size in mind to ensure we are comparing apples to apples.
Jul
12
2017
--

Gh-ost benchmark against pt-online-schema-change performance

gh-ost-benchmark

In this blog post, I will run a gh-ost benchmark against the performance of pt-online-schema-change.

When gh-ost came out, I was very excited. As MySQL ROW replication became commonplace, you could use it to track changes instead of triggers. This practice is cleaner and safer compared to Percona Toolkit’s pt-online-schema-change. Since gh-ost doesn’t need triggers, I assumed it would generate lower overhead and work faster. I frequently called it “pt-online-schema-change on steroids” in my talks. Finally, I’ve found some time to check my theoretical claims with some benchmarks.

DISCLAIMER: These benchmarks correspond to one specific ALTER TABLE on the table of one specific structure and hardware configuration. I have not set up a broad set of tests. If you have other results – please comment!

Benchmark Setup Details

  • pt-online-schema-change from Percona Toolkit 3.0.3
  • gh-ost 1.0.36
  • Percona Server 5.7.18 on Ubuntu 16.04 LTS
  • Hardware: 28CPU cores/56 Threads.  128GB Memory.   Samsung 960 Pro 512GB
  • Sysbench 1.0.7

Prepare the table by running:

sysbench --threads=40 --rate=0 --report-interval=1 --percentile=99 --events=0 --time=0 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 prepare

The table size is about 3GB (completely fitting to innodb_buffer_pool).

Run the benchmark in “full ACID” mode with:

  • sync_binlog=1
  • innodb_flush_log_at_trx_commit=1
  • innodb_doublewrite=1

This is important as this workload is heavily commit-bound, and extensively relies on group commit.

This is the pt-online-schema-change command to alter table:

time pt-online-schema-change --execute --alter "ADD COLUMN c1 INT" D=sbtest,t=sbtest1

This the gh-ost command to alter table:

time ./gh-ost  --user="sbtest" --password="sbtest" --host=localhost --allow-on-master --database="sbtest" --table="sbtest1"  --alter="ADD COLUMN c1 INT" --execute

Tests Details

For each test the old sysbench table was dropped and a new one prepared. I tested alter table in three different cases:

  • When nothing else was running (“Idle Load”)   
  • When the system handled about 2% of load it can handle at full capacity (“Light Background Load”)
  • When the system handled about 40% of the possible load, with sysbench injected about 25% of the transactions/sec the system could handle at full load (“Heavy Background Load”)

I measured the alter table completion times for all cases, as well as the overhead generated by the alter (in other words, how much peak throughput is reduced by running alter table through the tools).

Idle Load

gh-ost benchmark 1

For the Idle Load test, pt-online-schema-change completed nearly twice as fast as gh-ost. This was a big surprise for me. I haven’t looked into the reasons or details yet, though I can see most of the CPU usage for gh-ost is on the MySQL server side. Perhaps the differences relate to the SQL used to perform non-blocking alter tables.

Light Background Load

I generated the Light Background Load by running the sysbench command below. It corresponds to a roughly 4% load, as the system can handle some 2500 transactions/sec at this concurrency under full load. Adjust the 

--rate

 value to scale it for your system.

time sysbench --threads=40 --rate=100 --report-interval=1 --percentile=99 --events=0 --time=0 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 run

gh-ost benchmark 2

The numbers changed (as expected), but pt-online-schema-change is still approximately twice as fast as gh-ost.

What is really interesting in this case is how a relatively light background load affects the process completion time. It took both pt-online-schema-change and gh-ost about 2.7x times longer to finish! 

Heavy Background Load

I generated the Heavy Background Load running the sysbench command below. It corresponds to a roughly 40% load, as the system can handle some 2500 transactions/sec at this concurrency under full load. Adjust

--rate

 value to scale it for your system.

time sysbench --threads=40 --rate=1000 --report-interval=1 --percentile=99 --events=0 --time=0 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 run

gh-ost benchmark 3

What happened in this case? When the load gets higher, gh-ost can’t keep up with binary log processing, and just never finishes at all. While this may be surprising at first, it makes sense if you think more about how these tools work. pt-online-schema-change uses triggers, and while they have a lot of limitations and overhead they can execute in parallel. gh-ost, on the other hand, processes the binary log in a single thread and might not be able to keep up.   

In MySQL 5.6 we didn’t have parallel replication, which applies writes to the same table in parallel. For that version the gh-ost limitation probably isn’t as big a deal, as such a heavy load would also cause replication lag. MySQL 5.7 has parallel replication. This makes it much easier to quickly replicate workloads that are too heavy for gh-ost to handle.

I should note that the workload being simulated in this benchmark is a rather extreme case. The table being altered by gh-ost here is at the same time handling a background load so high it can’t be replicated in a single thread.

Future versions of gh-ost could improve this issue by applying binlog events in parallel, similar to what MySQL replicas do.

An excerpt from the gh-ost log shows how it is totally backed up trying to apply the binary log:

root@rocky:/tmp# time ./gh-ost  --user="sbtest" --password="sbtest" --host=localhost --allow-on-master --database="sbtest" --table="sbtest1"  --alter="ADD COLUMN c1 INT" --execute
2017/06/25 19:16:05 binlogsyncer.go:75: [info] create BinlogSyncer with config &{99999 mysql localhost 3306 sbtest sbtest  false false <nil>}
2017/06/25 19:16:05 binlogsyncer.go:241: [info] begin to sync binlog from position (rocky-bin.000018, 640881773)
2017/06/25 19:16:05 binlogsyncer.go:134: [info] register slave for master server localhost:3306
2017/06/25 19:16:05 binlogsyncer.go:568: [info] rotate to (rocky-bin.000018, 640881773)
2017-06-25 19:16:05 ERROR parsing time "" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "" as "2006"
# Migrating `sbtest`.`sbtest1`; Ghost table is `sbtest`.`_sbtest1_gho`
# Migrating rocky:3306; inspecting rocky:3306; executing on rocky
# Migration started at Sun Jun 25 19:16:05 -0400 2017
# chunk-size: 1000; max-lag-millis: 1500ms; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.sbtest.sbtest1.sock
Copy: 0/9872432 0.0%; Applied: 0; Backlog: 0/100; Time: 0s(total), 0s(copy); streamer: rocky-bin.000018:641578191; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 0; Backlog: 100/100; Time: 1s(total), 1s(copy); streamer: rocky-bin.000018:641626699; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 640; Backlog: 100/100; Time: 2s(total), 2s(copy); streamer: rocky-bin.000018:641896215; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 1310; Backlog: 100/100; Time: 3s(total), 3s(copy); streamer: rocky-bin.000018:642178659; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 1920; Backlog: 100/100; Time: 4s(total), 4s(copy); streamer: rocky-bin.000018:642436043; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 2600; Backlog: 100/100; Time: 5s(total), 5s(copy); streamer: rocky-bin.000018:642722777; State:
...
Copy: 0/9872432 0.0%; Applied: 120240; Backlog: 100/100; Time: 3m0s(total), 3m0s(copy); streamer: rocky-bin.000018:694142377; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 140330; Backlog: 100/100; Time: 3m30s(total), 3m30s(copy); streamer: rocky-bin.000018:702948219; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 160450; Backlog: 100/100; Time: 4m0s(total), 4m0s(copy); streamer: rocky-bin.000018:711775662; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 180600; Backlog: 100/100; Time: 4m30s(total), 4m30s(copy); streamer: rocky-bin.000018:720626338; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 200770; Backlog: 100/100; Time: 5m0s(total), 5m0s(copy); streamer: rocky-bin.000018:729509960; State: migrating; ETA: N/A

Online Schema Change Performance Impact

For this test I started the alter table, waited 60 seconds and then ran sysbench at full speed for five minutes. Then I measured how much the performance was impacted by running the tool:

sysbench --threads=40 --rate=0 --report-interval=1 --percentile=99 --events=0 --time=300 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 run

gh-ost benchmark 4

As we can see, gh-ost has negligible overhead in this case. pt-online-schema-change on the other hand, had peformance reduced by 12%. It is worth noting though that pt-online-schema-change still makes progress in this case (though slowly), while gh-ost would never complete.

If anything, I was surprised at how little impact the pt-online-schema-change run had on sysbench performance.

It’s important to note that in this case we only measured the overhead for the “copy” stage of the online schema change. Another thing you should worry about is the impact to performance during “table rotation” (which I have not measured).

Summary

While gh-ost introduces a number of design advantages, and gives better results in some situation, I wouldn’t call it always superior the tried and true pt-online-schema-change. At least in some cases, pt-online-schema-change offers better performance than gh-ost and completes a schema change when gh-ost is unable to keep up. Consider trying out both tools and see what works best in your situation.

Jul
06
2017
--

ClickHouse: One Year!

ClickHouse One Year

ClickHouse One YearIn this blog, we’ll look at ClickHouse on its one year anniversary.

It’s been a year already since the Yandex team released ClickHouse as open source software. I’ve had an interest in this project from the very start, as I didn’t think there was an open source analytical database that could compete with industry leaders like Vertica (for example).

This was an exciting year for ClickHouse early adopters. Let’s look at what it accomplished so far.

ClickHouse initially generated interest due to the Yandex name – the most popular search engine in Russia. It wasn’t long before jaw-dropping responses popped up: guys, this thing is crazy fast! Many early adopters who tried ClickHouse were really impressed.

Fast doesn’t mean convenient though. That was the main community concern to ClickHouse over the past months. Developed as an internal project for an internal customer (Yandex.Metrica), ClickHouse had a lot of limitations for general community use. It took several months before Yandex could restructure the team and mindset, establish proper communication with the community and start addressing external needs. There are still a lot of things that need to be done. The public roadmap is not easily available, for example, but the wheels are pointed in the right direction. The ClickHouse team has added a lot of the features people were screaming for, and more are in ClickHouse’s future plans.

The Yandex guys are actively attending international conferences, and they were:

They are speaking much more in Russia (no big surprise).

We were very excited by Yandex’s ClickHouse performance claims at Percona, and could not resist making our own benchmarks:

ClickHouse did very well in these benchmarks. There are many other benchmarks by other groups as well, including a benchmark against Amazon RedShift by Altinity.

The first ClickHouse production deployments outside of Yandex started in October-November 2016. Today, Yandex reports that dozens of companies around the world are using ClickHouse in production, with the biggest installations operating with up to several petabytes of data. Hundreds of other enterprises are deploying pilot installations or actively evaluating the software.

There are also interesting reports from CloudFare (How Cloudflare analyzes 1M DNS queries per second) and from Carto (Geospatial processing with ClickHouse).

There are also various community projects around ClickHouse worth mentioning:

Percona is also working to adapt ClickHouse to our projects. We are using ClickHouse to handle Query Analytics and as a long term metrics data for Metrics inside a new version (under development) of Percona Monitoring and Management.

I also will be speaking about ClickHouse at BIG DATA DAY LA 2017 on August 5th, 2017. You are welcome to attend if you are in Los Angeles this day!

ClickHouse has the potential to become one of the leading open source analytical DBMSs – much like MySQL and PostreSQL are leaders for OLTP workloads. We will see in the next couple of years if it happens or not. Congratulations to the Yandex team on their one-year milestone!

Jun
22
2017
--

ClickHouse in a General Analytical Workload (Based on a Star Schema Benchmark)

ClickHouse

ClickHouseIn this blog post, we’ll look at how ClickHouse performs in a general analytical workload using the star schema benchmark test.

We have mentioned ClickHouse in some recent posts (ClickHouse: New Open Source Columnar Database, Column Store Database Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark), where it showed excellent results. ClickHouse by itself seems to be event-oriented RDBMS, as its name suggests (clicks). Its primary purpose, using Yandex Metrica (the system similar to Google Analytics), also points to an event-based nature. We also can see there is a requirement for date-stamped columns.

It is possible, however, to use ClickHouse in a general analytical workload. This blog post shares my findings. For these tests, I used a Star Schema benchmark — slightly-modified so that able to handle ClickHouse specifics.

First, let’s talk about schemas. We need to adjust to ClickHouse data types. For example, the biggest fact table in SSB is “lineorder”. Below is how it is defined for Amazon RedShift (as taken from https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-create-test-data.html):

CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);

For ClickHouse, the table definition looks like this:

CREATE TABLE lineorderfull (
        LO_ORDERKEY             UInt32,
        LO_LINENUMBER           UInt8,
        LO_CUSTKEY              UInt32,
        LO_PARTKEY              UInt32,
        LO_SUPPKEY              UInt32,
        LO_ORDERDATE            Date,
        LO_ORDERPRIORITY        String,
        LO_SHIPPRIORITY         UInt8,
        LO_QUANTITY             UInt8,
        LO_EXTENDEDPRICE        UInt32,
        LO_ORDTOTALPRICE        UInt32,
        LO_DISCOUNT             UInt8,
        LO_REVENUE              UInt32,
        LO_SUPPLYCOST           UInt32,
        LO_TAX                  UInt8,
        LO_COMMITDATE           Date,
        LO_SHIPMODE             String
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER),8192);

From this we can see we need to use datatypes like UInt8 and UInt32, which are somewhat unusual for database world datatypes.

The second table (RedShift definition):

CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

For ClickHouse, I defined as:

CREATE TABLE customerfull (
        C_CUSTKEY       UInt32,
        C_NAME          String,
        C_ADDRESS       String,
        C_CITY          String,
        C_NATION        String,
        C_REGION        String,
        C_PHONE         String,
        C_MKTSEGMENT    String,
        C_FAKEDATE      Date
)Engine=MergeTree(C_FAKEDATE,(C_CUSTKEY),8192);

For reference, the full schema for the benchmark is here: https://github.com/vadimtk/ssb-clickhouse/blob/master/create.sql.

For this table, we need to define a rudimentary column C_FAKEDATE Date in order to use ClickHouse’s most advanced engine (MergeTree). I was told by the ClickHouse team that they plan to remove this limitation in the future.

To generate data acceptable by ClickHouse, I made modifications to ssb-dbgen. You can find my version here: https://github.com/vadimtk/ssb-dbgen. The most notable change is that ClickHouse can’t accept dates in CSV files formatted as “19971125”. It has to be “1997-11-25”. This is something to keep in mind when loading data into ClickHouse.

It is possible to do some preformating on the load, but I don’t have experience with that. A common approach is to create the staging table with datatypes that match loaded data, and then convert them using SQL functions when inserting to the main table.

Hardware Setup

One of the goals of this benchmark to see how ClickHouse scales on multiple nodes. I used a setup of one node, and then compared to a setup of three nodes. Each node is 24 cores of “Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz” CPUs, and the data is located on a very fast PCIe Flash storage.

For the SSB benchmark I use a scale factor of 2500, which provides (in raw data):

Table lineorder – 15 bln rows, raw size 1.7TB, Table customer – 75 mln rows

When loaded into ClickHouse, the table lineorder takes 464GB, which corresponds to a 3.7x compression ratio.

We compare a one-node (table names lineorderfull, customerfull) setup vs. a three-node (table names lineorderd, customerd) setup.

Single Table Operations

Query:

SELECT
    toYear(LO_ORDERDATE) AS yod,
    sum(LO_REVENUE)
FROM lineorderfull
GROUP BY yod

One node:

7 rows in set. Elapsed: 9.741 sec. Processed 15.00 billion rows, 90.00 GB (1.54 billion rows/s., 9.24 GB/s.)

Three nodes:

7 rows in set. Elapsed: 3.258 sec. Processed 15.00 billion rows, 90.00 GB (4.60 billion rows/s., 27.63 GB/s.)

We see a speed up of practically three times. Handling 4.6 billion rows/s is blazingly fast!

One Table with Filtering

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorderfull
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)

One node:

1 rows in set. Elapsed: 3.175 sec. Processed 2.28 billion rows, 18.20 GB (716.60 million rows/s., 5.73 GB/s.)

Three nodes:

1 rows in set. Elapsed: 1.295 sec. Processed 2.28 billion rows, 18.20 GB (1.76 billion rows/s., 14.06 GB/s.)

It’s worth mentioning that during the execution of this query, ClickHouse was able to use ALL 24 cores on each box. This confirms that ClickHouse is a massively parallel processing system.

Two Tables (Independent Subquery)

In this case, I want to show how Clickhouse handles independent subqueries:

SELECT sum(LO_REVENUE)
FROM lineorderfull
WHERE LO_CUSTKEY IN
(
    SELECT C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
    WHERE C_REGION = 'ASIA'
)

One node:

1 rows in set. Elapsed: 28.934 sec. Processed 15.00 billion rows, 120.00 GB (518.43 million rows/s., 4.15 GB/s.)

Three nodes:

1 rows in set. Elapsed: 14.189 sec. Processed 15.12 billion rows, 121.67 GB (1.07 billion rows/s., 8.57 GB/s.)

We  do not see, however, the close to 3x speedup on three nodes, because of the required data transfer to perform the match LO_CUSTKEY with C_CUSTKEY

Two Tables JOIN

With a subquery using columns to return results, or for GROUP BY, things get more complicated. In this case we want to GROUP BY the column from the second table.

First, ClickHouse doesn’t support traditional subquery syntax, so we need to use JOIN. For JOINs, ClickHouse also strictly prescribes how it must be written (a limitation that will also get changed in the future). Our JOIN should look like:

SELECT
    C_REGION,
    sum(LO_EXTENDEDPRICE * LO_DISCOUNT)
FROM lineorderfull
ANY INNER JOIN
(
    SELECT
        C_REGION,
        C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
) USING (LO_CUSTKEY)
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)
GROUP BY C_REGION

One node:

5 rows in set. Elapsed: 31.443 sec. Processed 2.35 billion rows, 28.79 GB (74.75 million rows/s., 915.65 MB/s.)

Three nodes:

5 rows in set. Elapsed: 25.160 sec. Processed 2.58 billion rows, 33.25 GB (102.36 million rows/s., 1.32 GB/s.)

In this case the speedup is not even two times. This corresponds to the fact of the random data distribution for the tables lineorderd and customerd. Both tables were defines as:

CREATE TABLE lineorderd AS lineorder ENGINE = Distributed(3shards, default, lineorder, rand());
CREATE TABLE customerd AS customer ENGINE = Distributed(3shards, default, customer, rand());

Where  rand() defines that records are distributed randomly across three nodes. When we perform a JOIN by LO_CUSTKEY=C_CUSTKEY, records might be located on different nodes. One way to deal with this is to define data locally. For example:

CREATE TABLE lineorderLD AS lineorderL ENGINE = Distributed(3shards, default, lineorderL, LO_CUSTKEY);
CREATE TABLE customerLD AS customerL ENGINE = Distributed(3shards, default, customerL, C_CUSTKEY);

Three Tables JOIN

This is where it becomes very complicated. Let’s consider the query that you would normally write:

SELECT sum(LO_REVENUE),P_MFGR, toYear(LO_ORDERDATE) yod FROM lineorderfull ,customerfull,partfull WHERE C_REGION = 'ASIA' and
LO_CUSTKEY=C_CUSTKEY and P_PARTKEY=LO_PARTKEY GROUP BY P_MFGR,yod ORDER BY P_MFGR,yod;

With Clickhouse’s limitations on JOINs syntax, the query becomes:

SELECT
    sum(LO_REVENUE),
    P_MFGR,
    toYear(LO_ORDERDATE) AS yod
FROM
(
    SELECT
        LO_PARTKEY,
        LO_ORDERDATE,
        LO_REVENUE
    FROM lineorderfull
    ALL INNER JOIN
    (
        SELECT
            C_REGION,
            C_CUSTKEY AS LO_CUSTKEY
        FROM customerfull
    ) USING (LO_CUSTKEY)
    WHERE C_REGION = 'ASIA'
)
ALL INNER JOIN
(
    SELECT
        P_MFGR,
        P_PARTKEY AS LO_PARTKEY
    FROM partfull
) USING (LO_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

By writing queries this way, we force ClickHouse to use the prescribed JOIN order — at this moment there is no optimizer in ClickHouse and it is totally unaware of data distribution.

There is also not much speedup when we compare one node vs. three nodes:

One node execution time:

35 rows in set. Elapsed: 697.806 sec. Processed 15.08 billion rows, 211.53 GB (21.61 million rows/s., 303.14 MB/s.)

Three nodes execution time:

35 rows in set. Elapsed: 622.536 sec. Processed 15.12 billion rows, 211.71 GB (24.29 million rows/s., 340.08 MB/s.)

There is a way to make the query faster for this 3-way JOIN, however. (Thanks to Alexander Zaytsev from https://www.altinity.com/ for help!)

Optimized query:

SELECT
    sum(revenue),
    P_MFGR,
    yod
FROM
(
    SELECT
        LO_PARTKEY AS P_PARTKEY,
        toYear(LO_ORDERDATE) AS yod,
        SUM(LO_REVENUE) AS revenue
    FROM lineorderfull
    WHERE LO_CUSTKEY IN
    (
        SELECT C_CUSTKEY
        FROM customerfull
        WHERE C_REGION = 'ASIA'
    )
    GROUP BY
        P_PARTKEY,
        yod
)
ANY INNER JOIN partfull USING (P_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

Optimized query time:

One node:

35 rows in set. Elapsed: 106.732 sec. Processed 15.00 billion rows, 210.05 GB (140.56 million rows/s., 1.97 GB/s.)

Three nodes:

35 rows in set. Elapsed: 75.854 sec. Processed 15.12 billion rows, 211.71 GB (199.36 million rows/s., 2.79 GB/s.

That’s an improvement of about 6.5 times compared to the original query. This shows the importance of understanding data distribution, and writing the optimal query to process the data.

Another option for dealing with JOIN complexity, and to improve performance, is to use ClickHouse’s dictionaries. These dictionaries are described here: https://www.altinity.com/blog/2017/4/12/dictionaries-explained.

I will review dictionary performance in future posts.

Another traditional way to deal with JOIN complexity in an analytics workload is to use denormalization. We can move some columns (for example, P_MFGR from the last query) to the facts table (lineorder).

Observations

  • ClickHouse can handle general analytical queries (it requires special schema design and considerations, however)
  • Linear speedup is possible, but it depends on query design and requires advanced planning — proper speedup depends on data locality
  • ClickHouse is blazingly fast (beyond what I’ve seen before) because it can use all available CPU cores for query, as shown above using 24 cores for single server and 72 cores for three nodes
  • Multi-table JOINs are cumbersome and require manual work to achieve better performance, so consider using dictionaries or denormalization
May
24
2017
--

Percona Software and Roadmap Update with CEO Peter Zaitsev: Q2 2017

Percona Software and Services

This blog post is a summary of the Percona Software and Roadmap Update – Q2 2017 webinar given by Peter Zaitsev on May 4, 2017. This webinar reflects changes and updates since the last update (Q1 2017).

A full recording of this webinar, along with the presentation slide deck, can be found here.

Percona Software

Below are the latest and upcoming features in Percona’s software. All of Percona’s software is 100% free and open source, with no restricted “Enterprise” version. Percona doesn’t restrict users with open core or “open source, eventually” (BSL) licenses.

Percona Server for MySQL 5.7

Latest Improvements

Features About To Be Released 

  • Integration of TokuDB and Performance Schema
  • MyRocks integration in Percona Server
  • Starting to look towards MySQL 8

Percona XtraBackup 2.4

Latest Improvements

Percona Toolkit

Latest Improvements

Percona Server for MongoDB 3.4

Latest Improvements

Percona XtraDB Cluster 5.7

Latest Improvements

Performance Improvement Benchmarks

Below, you can see the benchmarks for improvements to Percona XtraDB Cluster 5.7 performance. You can read about the improvements and benchmark tests in more detail here and here.

Percona Software and Roadmap Update

Percona XtraDB Cluster 5.7 Integrated with ProxySQL 1.3

Percona Monitoring and Management

New in Percona Monitoring and Management

Advanced MariaDB Dashboards in PMM (Links go to PMM Demo)

Percona Q217 Roadmap 4

Improved MongoDB Dashboards in PMM (Links go to PMM Demo)

Percona Q217 Roadmap 7

Percona Q217 Roadmap 9

Percona Q217 Roadmap 10

Check out the PMM Demo

Thanks for tuning in for an update on Percona Software and Roadmap Update – Q2 2017.

New Percona Online Store – Easy to Buy, Pay Monthly

May
03
2017
--

Storing UUID and Generated Columns

Storing UUID

A lot of things have been said about UUID, and storing UUID in an optimized way. Now that we have generated columns, we can store the decomposed information inside the UUID and merge it again with generated columns. This blog post demonstrates this process.

First, I used a simple table with one char field that I called uuid_char to establish a base case. I used this table with and without a primary key:

CREATE TABLE uuid_char (
uuid char(36) CHARACTER SET utf8 NOT NULL,
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE uuid_char_pk (
uuid char(36) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (uuid)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

I performed the tests on a local VM over MySQL 5.7.17 for 30 seconds, with only two threads, because I wanted to just compare the executions:

sysbench
--oltp-table-size=100000000
--test=/usr/share/doc/sysbench/tests/db/insert_uuid_generated_columns.uuid_char.lua
--oltp-tables-count=4
--num-threads=2
--mysql-user=root
--max-requests=5000000
--report-interval=5
--max-time=30
--mysql-db=generatedcolumn
run

One pair of executions is with the UUID generated by sysbench, which simulates the UUID that comes from the app:

rs = db_query("INSERT INTO uuid_char (uuid) VALUES " .. string.format("('%s')",c_val))

An alternative execution is for when the UUID is generated by the MySQL function uuid():

rs = db_query("INSERT INTO uuid_char (uuid) VALUES (uuid())")

Below we can see the results: 

The inserts are faster without a PK (but only by 5%), and using the uuid() function doesn’t impact performance.

Now, let’s see the alternative method, which is decomposing the UUID. It has four main information sets:

  • Timestamp: this is a number with seven decimals.
  • MAC: the MAC address of the device that creates the UUID
  • Unique value: this value avoids duplicate cases scenarios
  • UUID version: this will always be “1”, as we are going to use version 1. If you are going to use another version, you will need to review the functions that I used.

The structure of the table that we’ll use is:

CREATE TABLE `uuid_generated` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

To understand how a UUID is unwrapped, I used this store procedure (which receives a UUID and inserts it into the table):

CREATE PROCEDURE ins_generated_uuid (uuid char(38))
begin
set @hex_timestamp = concat(substring(uuid, 16, 3), substring(uuid, 10, 4), substring(uuid, 1, 8));
set @timestamp = concat(conv(@hex_timestamp,16,10)div 10000000 - (141427 * 24 * 60 * 60),'.',right(conv(@hex_timestamp,16,10),7));
set @mac = conv(right(uuid,12),16,10);
set @temp_uniq = unhex(substring(uuid,20,4));
insert into uuid_generated (timestamp,mac,temp_uniq) values (@timestamp,@mac,@temp_uniq);
end ;;

Explanation:

  • @hex_timestamp is a temporary variable that collects the timestamp in hexadecimal format from the different sections of the UUID
  • @timestamp transforms the hexadecimal timestamp to a decimal number
  • @mac pulls the last number in the UUID (a MAC) so we can store it in as a bigint
  • @temp_uniq is a value to conserve the uniqueness, which is why we store it as binary and it is at the end of the Primary Key

If I wanted to get the UUID again, I can use these two generated columns:

`hex_timestamp` char(40) GENERATED ALWAYS AS (conv(((`timestamp` * 10000000) + (((141427 * 24) * 60) * 600000000)),10,16)) VIRTUAL,
`uuid_generated` char(38) GENERATED ALWAYS AS (concat(right(`hex_timestamp`,8),'-',substr(`hex_timestamp`,4,4),'-1',left(`hex_timestamp`,3),'-',convert(hex(`temp_uniq`) using utf8),'-',lpad(conv(`mac`,10,16),12,'0'))) VIRTUAL,

We performed tests over five scenarios:

  • Without the generated columns, the insert used data generated dynamically
  • Same as before, but we added a char field that stores the UUID
  • With the char field, and adding the generated column
  • We used the store procedure detailed before to insert the data into the table
  • We also tested the performance using triggers

The difference between the Base and the previous table structure with Primary Keys is very small. So, the new basic structure has no impact on performance.

We see that Base and +Char Field have the same performance. So leaving a char field has no performance impact (it just uses more disk space).

Using generated columns impact performance. This is expected, as the columns are generated to validate the type before the row is inserted.

Finally, the use of triggers and store procedure has the same impact in performance.

These are the three structures to the tables:

CREATE TABLE `uuid_generated` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `uuid_generated_char` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
`uuid` char(38) DEFAULT NULL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `uuid_generated_char_plus` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
`uuid` char(38) DEFAULT NULL,
`hex_timestamp` char(40) GENERATED ALWAYS AS (conv(((`timestamp` * 10000000) + (((141427 * 24) * 60) * 600000000)),10,16)) VIRTUAL,
`uuid_generated` char(38) GENERATED ALWAYS AS (concat(right(`hex_timestamp`,8),'-',substr(`hex_timestamp`,4,4),'-1',left(`hex_timestamp`,3),'-',convert(hex(`temp_uniq`) using utf8),'-',lpad(conv(`mac`,10,16),12,'0'))) VIRTUAL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

And this is the trigger:

DROP TRIGGER IF EXISTS ins_generated_uuid;
delimiter ;;
CREATE TRIGGER ins_uuid_generated BEFORE INSERT ON uuid_generated
FOR EACH ROW
begin
set @hex_timestamp = concat(substring(NEW.uuid, 16, 3), substring(NEW.uuid, 10, 4), substring(NEW.uuid, 1, 8));
set NEW.timestamp = concat(conv(@hex_timestamp,16,10)div 10000000 - (141427 * 24 * 60 * 60),'.',right(conv(@hex_timestamp,16,10),7));
set NEW.mac = conv(right(NEW.uuid,12),16,10);
set NEW.temp_uniq = unhex(substring(NEW.uuid,20,4));
end ;;
delimiter ;

Conclusions

Decomposing the UUID is an alternative to storing them in order, but it won’t speed up inserts. It is simpler to execute queries over a range of dates, and look at the row for a particular device, as you will be able to use the MAC (it is recommended to add an index for it). Generated columns give you the possibility to build the UUID back in just one string.

Apr
10
2017
--

ProxySQL Rules: Do I Have Too Many?

In this blog post we are going to take a closer look at ProxySQL rules. How do they work, and how big is the performance impact of having many rules?

I would like to say thank you to Renè, who was willing to answer all my questions during my tests.

Overview

ProxySQL is heavily based on the query rules. We can set up ProxySQL without rules based only on the host groups, but if we want read/write splitting or sharding (or anything else) we need rules.

ProxySQL knows the SQL protocol and language, so we can easily create rules based on username, schema name and even on the query itself. We can write regular expressions that match the query digest. Let me show you an example:

insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('Testuser',601,1,3,'^SELECT');

This rule matches all the queries starting with “SELECT”, and sends them to host group 601.

After version 1.3.1, the default regex engine was RE2. Starting after version 1.4, the default regex engine will be PCRE.

I would like to highlight three options that can have a bigger impact on your rules than you think: flagINflagOUTapply.

With regards to the manual:

. . .these allow us to create “chains of rules” that get applied one after the other. An input flag value is set to 0, and only rules with flagIN=0 are considered at the beginning. When a matching rule is found for a specific query, flagOUT is evaluated and if NOT NULL the query will be flagged with the specified flag in flagOUT. If flagOUT differs from flagIN, the query will exit the current chain and enters a new chain of rules having flagIN as the new input flag. If flagOUT matches flagIN, the query will be re-evaluated again against the first rule with said flagIN. This happens until there are no more matching rules, or apply is set to 1 (which means this is the last rule to be applied)

You might not be sure what this means, but I will show you later.

As you can see, adding a rule is easy and we can add hundreds of rules, But is there any performance impact?

Test Case

We can write rules based on any part of the query (for example, “userid” or some “sharding key”). In these tests I wrote the rules based on table names because I can easily generate tables with “sysbench”, and run queries against these tables.

I created 1000 tables using sysbench, and I am going to test them with a direct MySQL connection, ProxySQL without rules, with ten rules and with 100 rules.

Time to do some tests to see if adding 100 or more rules have any effect on the performance?

I used two c4.4xlarge instances with SSDs, and I am going to share the steps so anybody can repeat my test and share/compare the results. NodeA is running MySQL 5.7.17 server, and NodeB is running “ProxySQL 1.3.4: and sysbench. During the test I increased the sysbench threads in the following steps:1,2,4,8,12,16,20,24.

I tried to use the simplest ProxySQL configuration as possible:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('10.10.10.243',600,3306,1000,0);
INSERT INTO mysql_replication_hostgroups VALUES (600,'','');
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;
insert into mysql_users (username,password,active,default_hostgroup,default_schema) values ('testuser_rw','Testpass1.',1,600,'test');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK; 

Only one server, one host group. I tried to measure the impact the rules had, so in all the test I sent the queries to the same host group. I only changed the rules (and some ProxySQL settings, as I will explain later).

As I mentioned, I am going to filter based on table names. Here are the 100 rules that I used:

insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest1b');
insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest2b'); ... insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest100b');First Test

First I ran tests with a direct MySQL connection, ProxySQL without rules, ProxySQL with ten rules and ProxySQL with 100 rules.

ProxySQL rules

ProxySQL itself has an impact on the performance, but there is a big difference between 10 and 100 rules. So adding more and more rules can have a negative effect on the performance.

That’s all? Can we do anything to speed things up? I used the default ProxySQL settings. Let’s have a look what can we tune.

Increasing the Number of Threads

Let’s go step by step. First we can increase the thread number inside ProxySQL (the default is 4). We will increase it to 8:

UPDATE global_variables SET variable_value='8' WHERE variable_name='mysql-threads';
SAVE MYSQL VARIABLES TO DISK;

ProxySQL has to be restarted after this changes.

ProxySQL rules

With this simple changes, we can improve the performance. As we can see, the difference is getting larger and larger as we increase the number of the sysbench threads.

Compiling

By compiling our own package, we can gain some extra performance. It is not clear why, so we opened a ticket for further investigation:

ProxySQL rules

I removed some of the columns because the graph got to busy.

ProxySQL 1.4

In ProxySQL 1.4 (which is not GA yet), we can change between the regex engines. However, even using the same engine (RE2) is faster in 1.4:

ProxySQL rules

Apply

As I mentioned, ProxySQL has a few important parameters like “apply”. With apply, if the query matches a rule it won’t check the remaining rules. In an ideal world, if you have 100 rules and 100 queries in random order which match only one rule, you only have to check 50 rules on average.

The new rules:

insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest,apply) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest1b',1);

ProxySQL rules

As you can see it didn’t help at all. But why? Because in this test we have 1000 tables, and we are running queries on all of the tables. This means 90% the queries have to check all the rules anyway. Let’s make a test with 100 tables to see if the “apply” helps or not:

ProxySQL rules

As we can see, with 100 tables we get a much better performance. But of course this is not a valid solution because we can’t just drop tables, “userids” or “sharding keys”. In the next post I will show you how to use “apply” in a more effective way.

Conclusion

So far, ProxySQL 1.4 with the PCRE engine and eight threads gives us the best performance with 100 rules and 1000 tables. As we can see, both the number of the rules and the query distribution matter. Both impact the performance. In my next blog post, I will show you how you can add some logic into your rules so that, even if you have more rules, you will get better performance.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com