Aug 04, 2022

Introducing Performance Improvement of Window Functions in PostgreSQL 15


When working with databases, there are always projects oriented toward performing analytics and reporting tasks over the information stored in the database. Usually, these tasks leverage window functions to do calculations “across a set of table rows that are somehow related to the current row,” as described in the documentation.

There are several built-in window functions available in PostgreSQL. In the latest release, PostgreSQL 15, some performance improvements were added for the rank(), row_number(), and count() functions. First, let’s review what these functions can do.

The window functions

As mentioned above, window functions let us perform calculations on a set of table rows related to the current one. The “set of table rows” is usually identified as a “partition” defined by a column or columns. The documentation describes the named functions as follows:

rank () → bigint

Returns the rank of the current row, with gaps; that is, the row_number of the first row in its peer group.

row_number () → bigint

Returns the current row number within its partition, counting from 1.

count ( * ) → bigint

Computes the number of input rows.

Consider the following table and data:

                                          Table "public.employee"
    Column      |           Type           | Collation | Nullable |                 Default
-----------------+--------------------------+-----------+----------+------------------------------------------
emp_id          | integer                  |           | not null | nextval('employee_emp_id_seq'::regclass)
emp_name        | character varying(20)    |           | not null |
emp_country     | character varying(35)    |           | not null |
emp_salary      | integer                  |           | not null |
date_of_joining | timestamp with time zone |           | not null | now()
Indexes:
    "employee_pkey" PRIMARY KEY, btree (emp_id)

demo=# SELECT * FROM employee ;
emp_id  |   emp_name   | emp_country | emp_salary |    date_of_joining
--------+--------------+-------------+------------+------------------------
      1 | KELIO        | Japon       |       2000 | 2021-10-26 00:00:00+00
      2 | JEAN-VINCENT | Canada      |       6500 | 2021-01-22 00:00:00+00
      3 | JUNO         | Japon       |       4000 | 2021-02-27 00:00:00+00
      4 | GUY-EMMANUEL | Salvador    |       2000 | 2020-07-27 00:00:00+00
      5 | WALI         | Japon       |       7000 | 2021-01-31 00:00:00+00
      6 | HENRI-PAUL   | Canada      |       4500 | 2021-08-19 00:00:00+00
      7 | MUHAMED      | France      |       5000 | 2021-07-20 00:00:00+00
      8 | MUHITTIN     | Madagascar  |       2500 | 2021-12-31 00:00:00+00
      9 | DEVLIN       | Madagascar  |       7000 | 2022-04-03 00:00:00+00
     10 | JOSUE        | Salvador    |       5500 | 2020-09-25 00:00:00+00
(10 rows)

rank()

We can use the rank() function to get the rank of employees (id) per country based on their salary. Look at the next example.

demo=# SELECT 
         emp_id, 
         emp_salary, 
         emp_country, 
         rank() OVER (PARTITION BY emp_country ORDER BY emp_salary DESC) 
       FROM employee;

emp_id  | emp_salary | emp_country | rank
--------+------------+-------------+------
      2 |       6500 | Canada      |    1
      6 |       4500 | Canada      |    2
      7 |       5000 | France      |    1
      5 |       7000 | Japon       |    1
      3 |       4000 | Japon       |    2
      1 |       2000 | Japon       |    3
      9 |       7000 | Madagascar  |    1
      8 |       2500 | Madagascar  |    2
     10 |       5500 | Salvador    |    1
      4 |       2000 | Salvador    |    2
(10 rows)

row_number()

In the next example, the row_number() function gets a sorted list of employees’ names per country and their relative numeric position.

demo=# SELECT 
         emp_id, 
         emp_name, 
         emp_country, 
         row_number() OVER (PARTITION BY emp_country ORDER BY emp_name) 
       FROM employee;

emp_id  |   emp_name   | emp_country | row_number
--------+--------------+-------------+------------
      6 | HENRI-PAUL   | Canada      |          1
      2 | JEAN-VINCENT | Canada      |          2
      7 | MUHAMED      | France      |          1
      3 | JUNO         | Japon       |          1
      1 | KELIO        | Japon       |          2
      5 | WALI         | Japon       |          3
      9 | DEVLIN       | Madagascar  |          1
      8 | MUHITTIN     | Madagascar  |          2
      4 | GUY-EMMANUEL | Salvador    |          1
     10 | JOSUE        | Salvador    |          2
(10 rows)

count()

The count() function is an old acquaintance of almost everyone with access to a SQL engine. It is part of the aggregate functions list, but it can act as a window function when the call is followed by an OVER clause. So we can use it to know how many employees share the same salary as a given employee.

demo=# SELECT 
         emp_name, 
         emp_salary, 
         emp_country, 
         count(*) OVER (PARTITION BY emp_salary) 
       FROM employee;

  emp_name   | emp_salary | emp_country | count
--------------+------------+-------------+-------
KELIO        |       2000 | Japon       |     2
GUY-EMMANUEL |       2000 | Salvador    |     2
MUHITTIN     |       2500 | Madagascar  |     1
JUNO         |       4000 | Japon       |     1
HENRI-PAUL   |       4500 | Canada      |     1
MUHAMED      |       5000 | France      |     1
JOSUE        |       5500 | Salvador    |     1
JEAN-VINCENT |       6500 | Canada      |     1
WALI         |       7000 | Japon       |     2
DEVLIN       |       7000 | Madagascar  |     2
(10 rows)

Window functions in PostgreSQL 15

Now that we have refreshed what the window functions are, let’s consider what the PostgreSQL 15 release notes say:

Improve the performance of window functions that use row_number(), rank(), and count() (David Rowley)

Accordingly, if we are users of the window functions and move from a previous version to version 15, we should see an improvement in the performance of our workload. Let’s test it.

Laboratory case

To test the performance of the window functions, I created three instances of PostgreSQL, (a) version 13, (b) version 14, and (c) version 15. 

I used the same public.employee table from the previous examples and loaded it with 10K rows. Then I executed the same queries we saw before for the window functions and captured the output of the EXPLAIN (ANALYZE) command, which executes the query and reports the timing of every plan node, including the one that evaluates the window function.
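
For reference, here is a minimal sketch of how such a table can be populated with 10K random rows; the country list and salary range are illustrative, not the exact data used for the tests:

INSERT INTO employee (emp_name, emp_country, emp_salary, date_of_joining)
SELECT
  upper(substr(md5(random()::text), 1, 10)),                 -- random name
  (ARRAY['Japon','Canada','France','Madagascar','Salvador'])[(floor(random() * 5) + 1)::int],
  (floor(random() * 13) + 1)::int * 500,                     -- salary between 500 and 6500
  now() - random() * interval '2 years'
FROM generate_series(1, 10000);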

The shape of the EXPLAIN (ANALYZE) plan was the same on every version of PostgreSQL; only the timings differed.
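
For example, the rank() plans below were captured with a command along these lines:

EXPLAIN (ANALYZE)
SELECT
  emp_id,
  emp_salary,
  emp_country,
  rank() OVER (PARTITION BY emp_country ORDER BY emp_salary DESC)
FROM employee;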

rank() 

PG15

                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
WindowAgg  (cost=1288.49..1488.49 rows=10000 width=25) (actual time=11.946..18.732 rows=10000 loops=1)
  ->  Sort  (cost=1288.49..1313.49 rows=10000 width=17) (actual time=11.928..12.803 rows=10000 loops=1)
        Sort Key: emp_country, emp_salary DESC
        Sort Method: quicksort  Memory: 980kB
        ->  Seq Scan on employee  (cost=0.00..180.00 rows=10000 width=17) (actual time=0.008..2.402 rows=10000 loops=1)
Planning Time: 0.143 ms
Execution Time: 19.268 ms
(7 rows)

PG14

                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
WindowAgg  (cost=844.39..1044.39 rows=10000 width=25) (actual time=12.585..20.921 rows=10000 loops=1)
  ->  Sort  (cost=844.39..869.39 rows=10000 width=17) (actual time=12.560..13.545 rows=10000 loops=1)
        Sort Key: emp_country, emp_salary DESC
        Sort Method: quicksort  Memory: 1020kB
        ->  Seq Scan on employee  (cost=0.00..180.00 rows=10000 width=17) (actual time=0.011..1.741 rows=10000 loops=1)
Planning Time: 0.449 ms
Execution Time: 21.407 ms
(7 rows)

PG13

                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
WindowAgg  (cost=844.39..1044.39 rows=10000 width=25) (actual time=18.949..28.619 rows=10000 loops=1)
  ->  Sort  (cost=844.39..869.39 rows=10000 width=17) (actual time=18.896..19.998 rows=10000 loops=1)
        Sort Key: emp_country, emp_salary DESC
        Sort Method: quicksort  Memory: 1020kB
        ->  Seq Scan on employee  (cost=0.00..180.00 rows=10000 width=17) (actual time=0.011..1.228 rows=10000 loops=1)
Planning Time: 0.460 ms
Execution Time: 29.111 ms
(7 rows)

From the WindowAgg node, we can easily see that the total time was smaller on PG15 than on the other two versions. The performance improvement is clear here.

To verify this is consistent, I collected the total time of the WindowAgg node from 500 executions and plotted the following graph.

[Graph: WindowAgg total time over 500 executions on PG13, PG14, and PG15, with trend lines]

We can see the timing from PG15 is better than from the other versions. I also added trend lines: PG13 performed the “worst,” and even though PG14 showed a better trend, PG15 was the best.

I did the same exercise for the row_number() and count() functions. 

row_number() 

[Graph: WindowAgg total time over 500 executions for row_number()]

count()

I also collected the results of 500 executions for these last two functions. In both cases, the difference was smaller than for rank(), but PG15 still showed better results, as seen in the trend lines.

The performance improvement for the rank(), row_number(), and count() window functions introduced in PostgreSQL 15 will let all those analytic and reporting projects process their data faster than in previous versions. As always, every environment has its own characteristics and challenges, but as we saw in these quick tests, the new version delivers the improvement right out of the box.

Jul 29, 2022

How to Benchmark Replication Performance in MySQL


In this blog, I will cover the important aspects you need to test when benchmarking a replication setup. MySQL has great tools that can be used to test its performance. They include:

sysbench – https://github.com/akopytov/sysbench

BMK-kit – http://dimitrik.free.fr/blog/posts/mysql-perf-bmk-kit.html

mysqlslap – https://dev.mysql.com/doc/refman/8.0/en/mysqlslap.html

LinkBench – https://github.com/facebookarchive/linkbench

I will not describe how to use them here, as you can find instructions at the provided links or on the Percona blog by browsing the tags sysbench, benchmark, and benchmarks, and the category benchmarks.

However, most of the instructions assume standalone MySQL servers. If you want to test how your replication setup performs, you may want to run additional checks.

Can the replica catch up to the source server?

To answer this question, you need to run a load on the source machine and monitor the replica delay. You can do it by examining the SHOW REPLICA STATUS (SHOW SLAVE STATUS) output, particularly the value of the Seconds_Behind_Source (Seconds_Behind_Master) parameter. The same parameter is monitored by the “MySQL Replication Lag” graph under the “MySQL Replication Summary” dashboard in Percona Monitoring and Management (PMM).

A more advanced check would be to compare the Exec_Source_Log_Pos (Exec_Master_Log_Pos), Relay_Log_File pair with the Read_Source_Log_Pos (Read_Master_Log_Pos), Source_Log_File pair, since Seconds_Behind_Source may be affected by long-running commands and return wrong values.

You may tune your replica server by adding parallelization if you see increasing lag. Check also how the binlog_transaction_dependency_tracking option works: https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html#sysvar_binlog_transaction_dependency_tracking
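
As a sketch, and assuming MySQL 8.0.26 or later (where the replica_* variable names exist), the parallelization setup could look like this; treat the worker count as a starting point to experiment with, not a recommendation:

-- On the source: track transaction dependencies by writeset so that
-- independent transactions can be applied in parallel on the replica
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';

-- On the replica: the applier must be stopped to change these
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 4;
START REPLICA SQL_THREAD;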

Can the replica run queries while applying updates from the source server?

Once you are happy with the replication lag, you may start adding the load you plan to run on this replica and test its performance. For example, if you plan to use the replica for handling read-only queries, you can run a read-only benchmark on it. You need to watch the performance of the read-only benchmark and replication lag.

If you plan to run analytical queries on the replica: test them. If you plan to use this replica for taking backups: take the backup and measure how it will affect the performance of the replication.

Synchronous replication

Synchronous replication solutions such as Galera, Percona XtraDB Cluster (PXC), and Group Replication are affected by the flow control paused time. You can use PMM to monitor it.

For PXC/Galera clusters, watch the “Flow Control Paused Time” graph under the “PXC/Galera Cluster Summary” dashboard.

Or you can watch the value of the wsrep_flow_control_paused variable. See Galera Flow Control in Percona XtraDB Cluster for MySQL for more details.

For Group Replication, you need to check the value of the COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE column in the performance_schema.replication_group_member_stats table. It shows how many transactions are waiting in the queue on the secondary node to be applied.

You may also watch the “Replication Lag” graph under the “MySQL Group Replication” dashboard in PMM.
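
For example, a quick check of the applier queue could be:

SELECT MEMBER_ID, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
FROM performance_schema.replication_group_member_stats;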

Your best test is your production

As Dimitri Kravtchuk said at FOSDEM 2020, your best test is your production.

While it is hard to create such a benchmark on the server that accepts writes from users, for the replica it is comparatively easy. All you need to do is create a replica of your production server and not make it available to the application yet. First, watch if it can handle the replication lag while idle, then start adding load to it. You may also use the mirroring feature of ProxySQL. Note, however, that you need to check whether some of the limitations described in this post still apply.

How fast will the replica catch up?

One more thing you may want to test is how fast the replica recovers after a delay in replicating updates from the source server. To perform this test, stop the replica while the load is running on the source server, restart it after a certain amount of time, and watch the replication lag. If you are testing a replica that serves a read-only load, let that load test continue.

It is better to perform the stop, wait, and start as one series of commands so that you have control over how long the replica was stopped. For example, to stop the replica for one hour, use the following commands:

STOP REPLICA;
SELECT SLEEP(3600);
START REPLICA;

Then wait until the replica is running:

SELECT SERVICE_STATE FROM performance_schema.replication_applier_status;

Once this query returns ON, start monitoring the value of Seconds_Behind_Source in the SHOW REPLICA STATUS output until it becomes 0. Once that happens, record the time. This way, you will know how much time is needed to apply one hour of changes.
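
A rough sketch for automating that measurement, again assuming credentials in ~/.my.cnf:

# Wait until the replica reports zero lag, then print the time
until [ "$(mysql -e 'SHOW REPLICA STATUS\G' \
           | awk '/Seconds_Behind_Source/ {print $2}')" = "0" ]; do
  sleep 10
done
date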

Conclusion

You can use the same benchmarking tools that are recommended for standalone MySQL servers to perform performance tests against replication setups. You only need to monitor replication lag together with other performance characteristics.

Jul 14, 2022

Parallel Commits for Transactions Using postgres_fdw on PostgreSQL 15


Consuming data from a remote foreign database into our local database just like another local table is a great feature that adds an excellent level of flexibility to the design of a database solution. 

This option is present in a number of database products: Oracle (DBLINK), MySQL (FEDERATED engine), and, obviously, PostgreSQL (dblink, postgres_fdw) all support it. PostgreSQL supports multiple foreign data wrappers, but in this blog, we will review a specific feature added to postgres_fdw in the new version 15.

How postgres_fdw works

Let’s say that you have a local and a remote PostgreSQL server. The second one has a table that your application using the local server needs to access. You might think of any of the following solutions:

  1. Configure your application to be able to connect both servers.
  2. Set up some background jobs to perform a copy every now and then.
  3. Configure logical replication between the two servers so the local table syncs up with the remote one.

Any of the above has its own advantages and disadvantages. For the first one, you might need to configure your network and grant access privileges to both DB servers. All this might add complexity and open the door to some security issues. The second option will add extra load on both servers, and the data is not going to be up-to-date in real time, so some reading inconsistencies might happen. The third one could actually be a good option, but if you are planning to MODIFY the synced data, then you might face some collisions.

Then the solution might be relying on the Foreign Data Wrapper. From the official documentation, the following steps are required to access data from a remote PostgreSQL server:

  1. Install the postgres_fdw extension using CREATE EXTENSION.
  2. Create a foreign server object, using CREATE SERVER, to represent each remote database you want to connect to. Specify connection information, except user and password, as options of the server object.
  3. Create a user mapping, using CREATE USER MAPPING, for each database user you want to allow to access each foreign server. Specify the remote user name and password to use as user and password options of the user mapping.
  4. Create a foreign table, using CREATE FOREIGN TABLE or IMPORT FOREIGN SCHEMA, for each remote table you want to access. The columns of the foreign table must match the referenced remote table. You can, however, use table and/or column names different from the remote tables, if you specify the correct remote names as options of the foreign table object.
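
A minimal sketch of these four steps; the server name, connection options, user, and column list here are hypothetical:

CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote-host', port '5432', dbname 'demo');

CREATE USER MAPPING FOR app_user
  SERVER remote_server
  OPTIONS (user 'remote_user', password 'secret');

CREATE FOREIGN TABLE employee_remote (
  emp_id   integer,
  emp_name varchar(20)
)
  SERVER remote_server
  OPTIONS (schema_name 'public', table_name 'employee');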

You will end with something like this:

[Diagram: the local PostgreSQL server accessing a remote table through postgres_fdw]

So now, with the proper privileges, your application is able to access the data from the remote table just like any other local table; you can SELECT from it or even execute DML statements against it.

What has changed in PG15?

The described functionality has been present in PostgreSQL for a while now. The new enhancement arriving in version 15 is the parallel transaction commit when using postgres_fdw. This is particularly useful when your solution design relies on multiple remote servers, such as a distributed PostgreSQL database.

From the Transaction Management documentation, we know that “during a query that references any remote tables on a foreign server, postgres_fdw opens a transaction on the remote server if one is not already open corresponding to the current local transaction”. If our local transaction references multiple remote tables from different foreign servers, a (sub)transaction is opened on each remote server.

Up to version 14, when there are multiple (sub)transactions tied to a local transaction and the transaction is committed locally, postgres_fdw commits each remote (sub)transaction one by one (sequentially). This behavior is still the default in version 15, but now we have the option to enable parallel_commit for each foreign server tied to a local transaction; in that case, when the local transaction commits, all the remote (sub)transactions commit at once (in parallel).

Laboratory case

I built a laboratory to test the sequential and the parallel remote (sub)transaction commits and catch the differences.

I decided to use the pgbench tool for the benchmark, with its option to create partitions for the pgbench_accounts table. I adapted these partitions to be foreign tables, just as described in a very interesting Percona blog post about sharding on PostgreSQL, which is worth a look. Finally, I got a partitioned table with 70 partitions distributed over seven remote servers, each one tied to a postgres_fdw foreign server.

The next diagram gives some insight.

[Diagram: 70 pgbench_accounts partitions distributed over seven remote servers, each tied to a postgres_fdw foreign server]

Pgbench execution

To test the new feature, I tried two different loads, first with the parallel_commit option disabled (the default) and then enabled:

  • I ran the “normal” pgbench workload. It accesses one row per transaction, which means every local transaction opened only one remote (sub)transaction.
  • I ran a pgbench “custom” script to retrieve 100 rows per transaction. In this case, every local transaction opened multiple remote (sub)transactions.

Running the regular pgbench load against the PG15 server was as simple as:

pgbench --progress-timestamp -c 16 -j 4 -T 3600 -P 2 \
  -h pg15 -U pgbench pgbench

After getting the results, I enabled the parallel_commit option on each foreign server as follows:

ALTER SERVER remote_server<N> OPTIONS (ADD parallel_commit 'true');
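
To double-check that the option is in place, the foreign server options can be inspected, for example:

SELECT srvname, srvoptions FROM pg_foreign_server;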

And repeated the pgbench load. The results are as follows.

[Graph: TPS for the regular pgbench load, parallel_commit off vs. on]

We see no big difference between the load with the parallel option set OFF or ON. This was expected because, in the regular pgbench load, every local transaction opened only one remote transaction.

For the second test scenario, I used a custom pgbench script that retrieves 100 rows in each transaction. It was as follows:

$ cat pgbench_custom.sql
\set v1 random(1, 100000 * :scale)
\set v2 :v1 + 100

BEGIN;
  SELECT abalance FROM pgbench_accounts
  WHERE aid BETWEEN :v1 AND :v2;
END;

Before running the load, I disabled the parallel_commit option on the foreign servers by executing the following for each one:

ALTER SERVER remote_server<N> OPTIONS (DROP parallel_commit);

And then I ran the custom load with NO parallel commit:

pgbench --progress-timestamp -c 8 -j 4 -T 3600 -P 2 \
  -h pg15 -U pgbench -f pgbench_custom.sql pgbench

Then I enabled it on each foreign server with the command I used before and ran the custom load again. The results are as follows.

[Graph: TPS for the custom 100-row load, parallel_commit off vs. on]

Now we see a clear improvement with the parallel_commit option enabled. The number of TPS was consistently higher during the test when the option was enabled, showing how performance can be improved.

In general, the TPS dropped compared with the regular pgbench load. The details are out of the scope of this blog post; the goal here was to get an idea of how performance might improve using the new parallel_commit option under similar conditions.

Conclusion

The capabilities a PostgreSQL database gets with postgres_fdw have been known for a while, and multiple system designs have taken advantage of them. 

The addition of the new parallel_commit option when using postgres_fdw in PostgreSQL 15 brings a very good opportunity to improve the performance of our databases if they are designed as distributed systems and our local transactions are tied to multiple remote (sub)transactions. Every solution and system design requires its own analysis. This blog post only intends to show the new option and how we can deliver higher throughput under the same conditions just by enabling it.

Jan 17, 2022

Comparing Graviton (ARM) Performance to Intel and AMD for MySQL (Part 3)


Recently we published the first part (m5, m5a, m6g) and the second part (c5, c5a, c6g) of our research comparing Graviton (ARM) with AMD and Intel CPUs on AWS. In the first part, we selected general-purpose EC2 instances with the same configurations (amount of vCPU). In the second part, we compared compute-optimized EC2 instances under the same conditions. The main goal was to see the trend and make a general comparison of CPU types on the AWS platform, only for MySQL. We didn’t set the goal of comparing the performance of different CPU types. Our expertise is in MySQL performance tuning. We share the research “as is” with all scripts, and anyone interested can rerun and reproduce it.
All scripts, raw logs and additional plots are available on GitHub: 

(arm_cpu_comparison_m5, csv_file_with_all_data_m5, arm_cpu_comparison_c5, csv_file_with_all_data_c5, arm_cpu_comparison_m6, csv_file_with_all_data_m6).

We were happy to see the reactions from our Percona Blog readers to our research. And we are open to any feedback. If anyone has any ideas on updating our methodology, we would be happy to correct it. 

This post is a continuation of the research based on our interest in general-purpose EC2 instances (and, of course, because we saw that our audience wanted to see it). The main inspiration for this research was feedback from our readers that we had compared different generations of instances, especially old AMD instances (m5a.*), against the latest Graviton instances (m6g.*). Additionally, we decided to use the latest Intel instances (m6i.*) too.

Today, we will talk about the latest AWS general-purpose EC2 instances: M6i, M6a, and M6g (complete list in the appendix).

Short Conclusion:

  1. In most cases for m6i, m6g, and m6a instances, Intel shows better performance in throughput for MySQL read transactions. However, AMD instances are pretty close to Intel’s results.
  2. Sometimes Intel shows a significant advantage — almost 200k rps more (almost 45% better) than Graviton. However, AMD’s gap wasn’t as significant as in previous results. Unfortunately, we compared Graviton with the others, so we didn’t concentrate on comparing AMD with Intel.
  3. In a few words: m6i instances (with Intel) are better in their class than the other m6a and m6g instances (in performance for MySQL). This advantage starts from 5%-10% and can be up to 45% compared with the other CPUs.
  4. But Graviton instances are still cheaper.

Details, or How We Got Our Short Conclusion:

Disclaimer:

  1. Tests were run on M6i.* (Intel), M6a.* (AMD), and M6g.* (Graviton) EC2 instances in the US-EAST-1 region (list of EC2 in the appendix). Only instances of the same class, without additional upgrades, were selected. The main goal is to take the same instances, with the only difference being the CPU type, and identify their performance for MySQL.
  2. Monitoring was done with Percona Monitoring and Management (PMM).
  3. OS: Ubuntu 20.04 LTS 
  4. Load tool (sysbench) and target DB (MySQL) installed on the same EC2 instance.
  5. Oracle MySQL Community Server — 8.0.26-0 — installed from the official Ubuntu repository packages.
  6. Load tool: sysbench —  1.0.18
  7. innodb_buffer_pool_size=80% of available RAM
  8. Test duration is five minutes for each thread count and then a 90-second cool down before the next iteration.
  9. Tests were run four times independently (to smooth outliers / to have more reproducible results), then the results were averaged for the graphs. The graphs also show the min and max values seen during the tests, which show the range of variance.
  10. We use the “high-concurrency” scenario definition for scenarios where the number of threads is bigger than the number of vCPUs, and “low-concurrency” for scenarios where the number of threads is less than or equal to the number of vCPUs on the EC2 instance.
  11. We are comparing MySQL behavior on the same class of EC2, not CPU performance.
  12. We got some feedback regarding our methodology, and we will update it in the next iteration with a different configuration, but for this particular research we kept the previous one to be able to compare “apples to apples”.
  13. The post is not sponsored by any external company. It was produced using only Percona resources. We do not control what CPUs AWS uses in its instances; we only operate with what they offer.

Test Case:

Prerequisite:

To use only the CPU (without disk and network), we decided to use only read queries served from memory. To do this, we performed the following actions.

1. Create a DB with 10 tables of 10,000,000 rows each

sysbench oltp_read_only --threads=10 --mysql-user=sbtest --mysql-password=sbtest --table-size=10000000 --tables=10 --db-driver=mysql --mysql-db=sbtest prepare

2. Load all the data into the buffer pool (warm-up)

sysbench oltp_read_only --time=300 --threads=10 --table-size=1000000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

Test:

3. Run the same scenario in a loop with different concurrency THREAD values (1,2,4,8,16,32,64,128) on each EC2 instance

sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run
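
Putting it together, a sketch of the driver loop (the 90-second pause matches the cool down mentioned in the disclaimer):

for THREAD in 1 2 4 8 16 32 64 128; do
  sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 \
    --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run
  sleep 90   # cool down before the next iteration
done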

Results:

Result reviewing was split into four parts:

  1. For “small” EC2 with 2, 4, and 8 vCPU
  2. For “medium” EC2 with 16  and 32 vCPU
  3. For  “large” EC2 with 48 and 64 vCPU
  4. For all scenarios to see the overall picture.

There would be four graphs for each test:

  1. Throughput (queries per second) that EC2 could perform for each scenario (number of threads)
  2. Latency 95 percentile that EC2 could perform for each scenario (number of threads)
  3. Relative comparing Graviton and Intel, Graviton, and AMD
  4. Absolute comparing Graviton and Intel, Graviton and AMD

Validation that all the load goes to the CPU, not to disk I/O or the network, was also done using PMM (Percona Monitoring and Management).


pic 0.1. OS monitoring during all test stages

 

Result for EC2 with 2, 4, and 8 vCPU:


Plot 1.1.  Throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.2.  Latencies (95 percentile) during the test for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

OVERVIEW:

  1. Based on plot 1.1, we could say that EC2 with Intel doesn’t have an absolute advantage compared with Graviton and AMD.
  2. Still, Intel and AMD show an advantage of a little over 20% compared with Graviton.
  3. In numbers, it is five thousand or more requests per second.
  4. AMD showed better results for the two-vCPU instances.
  5. And it looks like, in the M6 class, Graviton CPUs show the worst results compared with the others.

Result for EC2 with 16 and 32 vCPU:


Plot 2.1. Throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.2. Latencies (95 percentile) during the test for EC2 with 16 and 32 vCPU for scenarios with 1,2 4,8,16,32,64,128 threads

 

Plot 2.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

OVERVIEW:

  1. Plot 2.1 shows that Intel vCPUs are more performance-efficient. AMD is second, and Graviton is third.
  2. According to plots 2.3.1 and 2.3.2, Intel is better than Graviton by up to 30%, and AMD is better than Graviton by up to 20%. Graviton has an exceptional performance advantage over AMD in some scenarios, but with this configuration and these instance classes it is an exception, seen in the plot 2.3.2 scenarios for 8 and 16 concurrent threads.
  3. In real numbers, Intel could execute up to 140k more read transactions than the Graviton CPUs, and AMD up to 70k more read transactions than Graviton (plot 2.1, plot 2.4.1).
  4. In most cases, AMD and Intel are better than Graviton EC2 instances (plot 2.1, plot 2.3.2, plot 2.4.2).

 

Result for EC2 with 48 and 64 vCPU:


Plot 3.1. Throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.2. Latencies (95 percentile) during the test for EC2 with 48 and 64 vCPU for scenarios with 1,2 4,8,16,32,64,128 threads

 

Plot 3.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. For “large” instances, Intel is still better than the other vCPUs. AMD was still in second place, except for a few cases where Graviton overtook it (plot 3.1).
  2. According to plot 3.3.1, Intel showed an advantage over Graviton of up to 45%. On the other hand, AMD showed an advantage over Graviton of up to 20% in the same cases.
  3. There were two cases where Graviton showed somewhat better results, but they are exceptions.
  4. In real numbers: Intel could generate 150k-200k more read transactions than Graviton, and AMD could execute 70k-130k more read transactions than Graviton.

 

Full Result Overview:

Plot 4.1.1. Throughput (queries per second) – bar plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.1.2. Throughput (queries per second) – line plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.2.1. Latencies (95 percentile) during the test – bar plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.2.2. Latencies (95 percentile) during the test – line plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.3.2 Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.5.1. Percentage comparison INTEL and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.5.2. Numbers comparison INTEL and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

Final Thoughts

  1. We compared general-purpose EC2 instances (m6i, m6a, m6g) from the AWS platform and their behavior for MySQL.
  2. In this competition, Graviton instances (m6g) do not provide any competitive results for MySQL.
  3. There was some strange behavior: AMD and Intel showed their best performance when the load (the number of threads) equaled the number of vCPUs. According to plot 4.1.2, we can see a jump in performance when the load matches the amount of vCPU; this point was hard to see on the bar chart, but it is very interesting. Graviton, however, worked more smoothly, without any “jumps,” which is why it showed exceptionally better results than AMD in some scenarios.
  4. Last point: everyone wants to see an AMD vs. Intel comparison (plots 4.5.1 and 4.5.2). The result: Intel is better in most cases, and AMD was better only in one case with 2 vCPU. The advantage of Intel compared with AMD could rise up to 96% for “large” instances (in some cases), which is unbelievable. But in most cases, this advantage means Intel could run about 30% more MySQL read transactions than AMD.
  5. It is still an open question regarding the economic efficiency of all these EC2 instances. We will research this topic and answer this question a little bit later.

APPENDIX:

List of EC2 used in research:

CPU type | Model name | EC2 | Memory GB | Amount vCPU | EC2 price per hour (USD)
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.large | 8 | 2 | $0.0864
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.xlarge | 16 | 4 | $0.1728
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.2xlarge | 32 | 8 | $0.3456
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.4xlarge | 64 | 16 | $0.6912
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.8xlarge | 128 | 32 | $1.3824
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.12xlarge | 192 | 48 | $2.0736
AMD | AMD EPYC 7R13 Processor 2650 MHz | m6a.16xlarge | 256 | 64 | $2.7648
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.large | 8 | 2 | $0.077
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.xlarge | 16 | 4 | $0.154
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.2xlarge | 32 | 8 | $0.308
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.4xlarge | 64 | 16 | $0.616
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.8xlarge | 128 | 32 | $1.232
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.12xlarge | 192 | 48 | $1.848
Graviton | ARMv8 AWS Graviton2 2500 MHz | m6g.16xlarge | 256 | 64 | $2.464
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.large | 8 | 2 | $0.096
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.xlarge | 16 | 4 | $0.192
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.2xlarge | 32 | 8 | $0.384
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.4xlarge | 64 | 16 | $0.768
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.8xlarge | 128 | 32 | $1.536
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.12xlarge | 192 | 48 | $2.304
Intel | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz | m6i.16xlarge | 256 | 64 | $3.072

 

my.cnf:

[mysqld]
ssl=0
performance_schema=OFF
skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000
join_buffer_size=256K
sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=2G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size=${80%_OF_RAM}
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_log_buffer_size=64M

default_storage_engine=InnoDB
innodb_flush_log_at_trx_commit = 1
innodb_doublewrite= 1
innodb_flush_method= O_DIRECT
innodb_file_per_table= 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0
max_prepared_stmt_count=1000000
bind_address = 0.0.0.0

[client]

 

Jan 04, 2022

Taking a Look at BTRFS for MySQL


Following my post MySQL/ZFS Performance Update, a few people have suggested I should take a look at BTRFS (“butter-FS”, “b-tree FS”) with MySQL. BTRFS is a filesystem with an architecture and a set of features similar to ZFS, and with a GPL license. It is a copy-on-write (CoW) filesystem supporting snapshots, RAID, and data compression. These are compelling features for a database server, so let’s have a look.

Many years ago, in 2012, Vadim wrote a blog post about BTRFS, and the results were disappointing. Needless to say, since 2012 a lot of work and effort has been invested in BTRFS. So, this post will examine the BTRFS version that comes with the latest Ubuntu LTS, 20.04. It is not bleeding edge, but it is likely the most recent release and Linux kernel I see in production environments. Ubuntu 20.04 LTS is based on Linux kernel 5.4.0.

Test Environment

Doing benchmarks is not my core duty at Percona, as, above all, I am a consultant working with our customers. I didn’t want to have a cloud-based instance running mostly idle for months. Instead, I used a KVM instance in my personal lab and made sure the host could easily provide the required resources. The test instance has the following characteristics:

  • 4 CPU
  • 4 GB of RAM
  • 50 GB of storage space throttled at 500 IOPS and 8MB/s

The storage bandwidth limitation is to mimic the AWS EBS behavior capping the IO size to 16KB. With KVM, I used the following iotune section:

<iotune>
  <total_iops_sec>500</total_iops_sec>
  <total_bytes_sec>8192000</total_bytes_sec>
</iotune>

These IO limitations are very important to consider in respect to the following results and analysis.

For these benchmarks, I use Percona Server for MySQL 8.0.22-13 and unless stated otherwise, the relevant configuration variables are:

[mysqld]
skip-log-bin
datadir = /var/lib/mysql/data
innodb_log_group_home_dir = /var/lib/mysql/log
innodb_buffer_pool_chunk_size=32M
innodb_buffer_pool_size=2G
innodb_lru_scan_depth = 256 # given the bp size, makes more sense
innodb_log_file_size=500M
innodb_flush_neighbors = 0
innodb_fast_shutdown = 2 # skip flushing
innodb_flush_method = O_DIRECT # ZFS uses fsync
performance_schema = off

Benchmark Procedure

For this post, I used the sysbench implementation of the TPCC. The table parameter was set to 10 and the size parameter was set to 20. Those settings yielded a MySQL dataset size of approximately 22GB uncompressed.

All benchmarks used eight threads and lasted two hours. For simplicity, I am only reporting here the total number of events over the duration of the benchmark. TPCC results usually report only one type of event, New order transactions. Keep this in mind if you intend to compare my results with other TPCC benchmarks.

Finally, the dataset is refreshed for every run either by restoring a tar archive or by creating the dataset using the sysbench prepare option.
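
As a sketch, assuming the Percona-Lab sysbench-tpcc scripts and hypothetical connection options, the prepare and run steps look roughly like this:

./tpcc.lua --mysql-user=root --mysql-db=tpcc --db-driver=mysql \
  --tables=10 --scale=20 --threads=8 prepare
./tpcc.lua --mysql-user=root --mysql-db=tpcc --db-driver=mysql \
  --tables=10 --scale=20 --threads=8 --time=7200 --report-interval=10 run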

BTRFS

BTRFS was created and mounted using:

mkfs.btrfs /dev/vdb -f
mount -t btrfs /dev/vdb /var/lib/mysql -o compress-force=zstd:1,noatime,autodefrag
mkdir /var/lib/mysql/data; mkdir /var/lib/mysql/log;
chown -R mysql:mysql /var/lib/mysql

ZFS

ZFS was created and configured using:

zpool create -f bench /dev/vdb
zfs set compression=lz4 atime=off logbias=throughput bench
zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k -o primarycache=metadata bench/data
zfs create -o mountpoint=/var/lib/mysql/log bench/log
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
echo 0 > /sys/module/zfs/parameters/zfs_arc_min

Since the ZFS file cache, the ARC, is compressed, an attempt was made with 3GB of ARC and only 256MB of buffer pool. I called this configuration “ARC bias”.

ext4

mkfs.ext4 -F /dev/vdb
mount -t ext4 /dev/vdb /var/lib/mysql -o noatime,norelatime

Results

TPCC events

The performance results are presented below. I must admit my surprise at the low results of BTRFS. Over the two-hour period, BTRFS wasn’t able to reach 20k events. The “btrfs tar” configuration restored the dataset from a tar file instead of doing a “prepare” using the database. This helped BTRFS somewhat (see the filesystem size results below), but it was clearly insufficient to really make a difference. I really wonder if it is a misconfiguration on my part; contact me in the comments if you think I made a mistake.

The ZFS performance is more than three times higher, reaching almost 67k events. By squeezing more data in the ARC, the ARC bias configuration even managed to execute more events than ext4, about 97k versus 94k.


TPCC performance comparison, ext4, BTRFS and ZFS

Filesystem Size

Performance, although an important aspect, is not the only consideration behind the decision of using a filesystem like BTRFS or ZFS. The impacts of data compression on filesystem size is also important. The resulting filesystem sizes are presented below.


TPCC dataset size comparison, ext4, BTRFS and ZFS

The uncompressed TPCC dataset size with ext4 is 21GB. Surprisingly, when the dataset is created by the database, with small random disk operations, BTRFS appears to not compress the data at all. This is in stark contrast to the behavior when the dataset is restored from a tar archive. The restore of the tar archive causes large sequential disk operations which trigger compression. The resulting filesystem size is 4.2GB, a fifth of the original size.

Although this is a great compression ratio, the fact that normal database operations yield no compression with BTRFS is really problematic. Using ZFS with lz4 compression, the filesystem size is 6GB. Also, the ZFS compression ratio is not significantly affected by the method of restoring the data.
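
For anyone repeating these measurements, here is a sketch of how the on-disk sizes and compression ratios can be inspected; the paths and pool names match the mounts created above:

du -sh --apparent-size /var/lib/mysql/data   # logical (uncompressed) size
btrfs filesystem du -s /var/lib/mysql/data   # BTRFS on-disk usage
zfs get compressratio bench/data             # ZFS compression ratio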

The performance issues of BTRFS have also been observed by Phoronix; their SQLite and PostgreSQL results are pretty significant and in line with the results presented in this post. It seems that BTRFS is not optimized for the small random IO type of operations requested by database engines. Phoronix recently published an update for kernel 5.14, and the situation may have improved, but the BTRFS random IO write results still seem to be quite low.

Conclusion

Although I have been pleased with the ease of installation and configuration of BTRFS, a database workload seems to be far from optimal for it. BTRFS struggles with small random IO operations and doesn’t compress the small blocks. So until these shortcomings are addressed, I will not consider BTRFS as a prime contender for database workloads.

Oct 25, 2021

Comparing Graviton (ARM) Performance to Intel and AMD for MySQL (Part 2)


Recently we published the first part of our research comparing Graviton (ARM) with AMD and Intel CPUs on AWS. In the first part, we selected general-purpose EC2 instances with the same configurations (amount of vCPU). The main goal was to see the trend and make a general comparison of CPU types on the AWS platform, only for MySQL. We didn’t set the goal of comparing the performance of different CPU types. Our expertise is in MySQL performance tuning. We share the research “as is” with all scripts, and anyone interested can rerun and reproduce it.
All scripts, raw logs and additional plots are available on GitHub: (2021_10_arm_cpu_comparison_c5, csv_file_with_all_data).

We were happy to see the reactions from our Percona Blog readers to our research, and we are open to any feedback. If anyone has any ideas on updating our methodology, we would be happy to correct it.

This post is a continuation of research based on our interest in compute-optimized EC2 (and, of course, because we saw that our audience wanted to see it). Today, we will talk about (AWS) Compute Optimized EC2: C5, C5a, C6g (complete list in appendix).

Next time we are going to share our findings on the economic efficiency of m5 and c5 instances.

Short Conclusion:

  1. In most cases for c5, c5a, and c6g instances, Intel shows better performance in throughput for MySQL read transactions.
  2. Sometimes Intel shows a significant advantage — almost 100k rps more than the other CPUs.
  3. In a few words: c5 instances (with Intel) are better in their class than the other c5a and c6g instances (in performance). This advantage starts from 5% and can be up to 40% compared with the other CPUs.

Details and Disclaimer:

  1. Tests were run on C5.* (Intel), C5a.* (AMD), and C6g.* (Graviton) EC2 instances in the US-EAST-1 region (list of EC2 in the appendix).
  2. Monitoring was done with PMM
  3. OS: Ubuntu 20.04 LTS
  4. Load tool (sysbench) and target DB (MySQL) installed on the same EC2 instance.
  5. Oracle MySQL Community Server — 8.0.26-0 — installed from official packages.
  6. Load tool: sysbench —  1.0.18
  7. innodb_buffer_pool_size=80% of available RAM
  8. Test duration is 5 minutes for each thread count and then 90 seconds cool down before the next iteration.
  9. Tests were run 3 times (to smooth outliers / to have more reproducible results). Then results were averaged for graphs. 
  10. We are going to use the “high-concurrency” scenario definition for scenarios where the number of threads is bigger than the number of vCPUs, and “low-concurrency” for scenarios where the number of threads is less than or equal to the number of vCPUs on the EC2 instance.
  11. We are comparing MySQL behavior on the same class of EC2, not CPU performance.

Test Case:

Prerequisite:

1. Create a DB with 10 tables of 10,000,000 rows each

sysbench oltp_read_only --threads=10 --mysql-user=sbtest --mysql-password=sbtest --table-size=10000000 --tables=10 --db-driver=mysql --mysql-db=sbtest prepare

2. Load all the data into the buffer pool (warm-up)

sysbench oltp_read_only --time=300 --threads=10 --table-size=1000000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

 

Test:

3. Run the same scenario in a loop with different concurrency THREAD values (1,2,4,8,16,32,64,128) on each EC2 instance

sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

Results:

Result reviewing was split into three parts:

  1. For “small” EC2 with 2, 4, and 8 vCPU
  2. For “medium” EC2 with 16 vCPU and for “large” EC2 with 48 and 64 vCPU (AWS does not have a C5 EC2 with 64 vCPU)
  3. For all scenarios to see the overall picture.

There would be four graphs for each test:

  1. Throughput (queries per second) that EC2 could perform for each scenario (amount of threads).
  2. Latency 95 percentile that  EC2 could perform for each scenario (amount of threads).
  3. Relative comparing Graviton and Intel, Graviton, and AMD.
  4. Absolute comparing Graviton and Intel, Graviton, and AMD.

Validation that all load goes to CPU, not to DISK I/O, was done also using PMM (Percona Monitoring and Management). 

pic 0.1. OS monitoring during all test stages (picture is for example)

Result for EC2 with 2, 4, and 8 vCPU:

plot 1.1. Throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.2. Latencies (95 percentile) during the test for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. Based on plot 1.1, we could say that EC2 with Intel has an absolute advantage compared with Graviton and AMD. 
  2. This advantage in most scenarios fluctuates between 10% – 20%.
  3. In numbers, it is over 3,000 requests per second. 
  4. There is one scenario where Graviton does better, on EC2 with 8 vCPU (c6g.2xlarge). But the advantage is so tiny (near 2%) that it could be a statistical error, so we can’t say the benefit is relevant.

Result for EC2 with 16, 48 and 64 vCPU:

plot 2.1.  Throughput (queries per second)  for EC2 with 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.2.  Latencies (95 percentile) during the test for EC2 with 16, 48 and 64  vCPU for scenarios with 1,2 4,8,16,32,64,128 threads

 

plot 2.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16, 48 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16, 48 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. Plot 2.1 shows that Intel has an advantage over the other vCPUs in our conditions (there are no EC2 instances with 64 Intel vCPUs to have a full picture of the comparison).
  2. This advantage could be near 20% for EC2 with 16 vCPU and up to 40% for EC2 with 48 vCPU. However, it is possible to see that this advantage decreases as the number of threads increases.
  3. In real numbers, Intel could execute up to 100k more read transactions than the other CPUs (plot 2.1, plot 2.4.1).
  4. On the other hand, in one high-performance scenario, we could see a small advantage (3%) for Graviton. However, it is so small that it could be a statistical error again (plot 2.3.1).
  5. In most cases, Graviton shows better results than AMD (plot 2.1, plot 2.3.2, plot 2.4.2).

Whole Result Overview:

plot 3.1. Throughput (queries per second) for EC2 with 2, 4, 8, 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.2.  Latencies (95 percentile) during the test for EC2 with 2, 4, 8, 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.3.1. Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16 and 48 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.3.2. Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16 AND 48 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Final Thoughts

  1. We compared compute-optimized EC2 instances (c5, c5a, c6g) from the AWS platform and their behavior for MySQL.
  2. It is still an open question regarding the economic efficiency of all these EC2 instances. We will research this topic and answer this question a little bit later.
  3. In these tests, AMD does not provide any competitive results for MySQL. It is possible that in other tasks it could show much better, competitive results.

APPENDIX:

List of EC2 used in research:

CPU type | EC2 | Amount vCPU | Memory GB | EC2 price per hour (USD)
AMD | c5a.large | 2 | 4 | 0.077
AMD | c5a.xlarge | 4 | 8 | 0.154
AMD | c5a.2xlarge | 8 | 16 | 0.308
AMD | c5a.4xlarge | 16 | 32 | 0.616
AMD | c5a.12xlarge | 48 | 96 | 1.848
AMD | c5a.16xlarge | 64 | 128 | 2.464
Intel | c5.large | 2 | 4 | 0.085
Intel | c5.xlarge | 4 | 8 | 0.170
Intel | c5.2xlarge | 8 | 16 | 0.340
Intel | c5.4xlarge | 16 | 32 | 0.680
Intel | c5.12xlarge | 48 | 96 | 2.040
Graviton | c6g.large | 2 | 4 | 0.068
Graviton | c6g.xlarge | 4 | 8 | 0.136
Graviton | c6g.2xlarge | 8 | 16 | 0.272
Graviton | c6g.4xlarge | 16 | 32 | 0.544
Graviton | c6g.12xlarge | 48 | 96 | 1.632
Graviton | c6g.16xlarge | 64 | 128 | 2.176

 

my.cnf:

[mysqld]
ssl=0
performance_schema=OFF
skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000
 join_buffer_size=256K
 sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=2G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size=${80%_OF_RAM}
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_log_buffer_size=64M

default_storage_engine=InnoDB
innodb_flush_log_at_trx_commit  = 1
innodb_doublewrite= 1
innodb_flush_method= O_DIRECT
innodb_file_per_table= 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0
max_prepared_stmt_count=1000000 
bind_address = 0.0.0.0
[client]

 

Oct 15, 2021

Comparing Graviton (ARM) Performance to Intel and AMD for MySQL


Recently, AWS presented its own CPU on ARM architecture for server solutions: Graviton.

As a result, they updated some of their EC2 instance lines with the new postfix “g” (e.g. m6g.small, r5g.nano, etc.). In their review and presentation, AWS showed impressive results: up to 20 percent faster in some benchmarks. On the other hand, some reviewers said that Graviton does not show any significant gains and, in some cases, showed worse performance than Intel.

We decided to investigate this and run our own research on Graviton performance, comparing it with other CPUs (Intel and AMD) directly for MySQL.

Disclaimer

  1. The test is designed to be CPU-bound only, so we use a read-only test and make sure there is no I/O activity during the test.
  2. Tests were run on m5.* (Intel), m5a.* (AMD), and m6g.* (Graviton) EC2 instances in the US-EAST-1 region. (See the list of EC2 instances in the appendix.)
  3. Monitoring was done with Percona Monitoring and Management (PMM).
  4. OS: Ubuntu 20.04 LTS.
  5. The load tool (sysbench) and the target DB (MySQL) were installed on the same EC2 instance.
  6. MySQL 8.0.26-0, installed from official packages.
  7. Load tool: sysbench 1.0.18.
  8. innodb_buffer_pool_size=80% of available RAM.
  9. Test duration is five minutes for each thread count, followed by a 90-second warm down before the next iteration.
  10. Tests were run three times (to smooth outliers and make results more reproducible), and the results were averaged for the graphs.
  11. "High-concurrency" refers to scenarios where the number of threads is greater than the number of vCPUs on the EC2 instance; "low-concurrency" refers to scenarios where the number of threads is less than or equal to the number of vCPUs.
  12. Scripts to reproduce the results are on our GitHub.

Test Case

Prerequisite:

1. Create a DB with 10 tables of 10,000,000 rows each

sysbench oltp_read_only --threads=10 --mysql-user=sbtest --mysql-password=sbtest --table-size=10000000 --tables=10 --db-driver=mysql --mysql-db=sbtest prepare

2. Warm up by loading all of the data into the buffer pool

sysbench oltp_read_only --time=300 --threads=10 --table-size=1000000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

Test:

Run the same scenario in a loop with a different concurrency level THREAD (1,2,4,8,16,32,64,128) on each EC2 instance; a sketch of the loop follows the command below.

sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run
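
A minimal driver sketch for this loop, assuming the five-minute duration and 90-second warm down described in the disclaimer (the THREAD list and the sleep are the only moving parts):

#!/bin/bash
# run the same read-only scenario at each concurrency level
for THREAD in 1 2 4 8 16 32 64 128; do
    sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 \
        --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run
    sleep 90   # warm down before the next iteration
done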

Results:

Result reviewing was split into three parts:

  1. "small" EC2 with 2, 4, and 8 vCPU
  2. "medium" EC2 with 16 and 32 vCPU
  3. "large" EC2 with 48 and 64 vCPU

The "small", "medium", and "large" labels are just synthetic names, based on the number of vCPUs per EC2 instance, that make reviewing easier.
There are four graphs for each test:

  1. Throughput (queries per second) that the EC2 instance could achieve for each scenario (number of threads)
  2. Latency (95 percentile) that the EC2 instance showed for each scenario (number of threads)
  3. Relative (percentage) comparison of Graviton and Intel
  4. Absolute (numbers) comparison of Graviton and Intel

Validation that all of the load went to the CPU, and not to disk I/O, was also done using PMM (Percona Monitoring and Management).

pic 0.1. OS monitoring during all test stages

From pic 0.1, we can see that there was no disk I/O activity during the tests, only CPU activity. The main disk activity occurred during the DB creation stage.

Result for EC2 with 2, 4, and 8 vCPU

plot 1.1. Throughput (queries per second) for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.2.  Latencies (95 percentile) during the test for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.3.  Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, and 8  vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.4.  Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. AMD has the highest latencies in all scenarios and on all EC2 instances. We won't repeat this in the following overviews, and it is the reason AMD is excluded from the percentage and absolute comparisons with the other CPUs (plots 1.3, 1.4, etc.).
  2. On instances with two and four vCPUs, Intel shows an advantage of less than 10 percent in all scenarios.
  3. However, on the instance with 8 vCPUs, Intel shows an advantage only in scenarios where the number of threads is less than or equal to the number of vCPUs.
  4. On the EC2 instance with eight vCPUs, Graviton started to show an advantage. It shows good results in scenarios where the number of threads exceeds the number of vCPUs, growing up to 15 percent in high-concurrency scenarios with 64 and 128 threads, which are 8 and 16 times the number of available vCPUs.
  5. Graviton starts showing an advantage on EC2 with eight vCPUs in scenarios where threads outnumber vCPUs. This pattern holds in all following scenarios: the more load relative to the CPU count, the better its results.

 

Result for EC2 with 16 and 32 vCPU

plot 2.1.  Throughput (queries per second)  for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.2.  Latencies (95 percentile) during the test for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.3.  Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.4.  Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. For the same load on EC2 with 16 and 32 vCPUs, Graviton continues to have an advantage when the number of threads is greater than the number of available vCPUs.
  2. Graviton shows an advantage of up to 10 percent in high-concurrency scenarios, while Intel leads by up to 20 percent in low-concurrency scenarios.
  3. In high-concurrency scenarios, Graviton could show an impressive difference in the number of (read) transactions per second, up to 30,000 TPS.

Result for EC2 with 48 and 64 vCPU

plot 3.1.  Throughput (queries per second)  for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.2.  Latencies (95 percentile) during the test for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.3.  Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.4.  Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. Intel shows a significant advantage in most scenarios where the number of threads is less than or equal to the number of vCPUs. Intel seems really good for this kind of task, and with spare vCPUs available its advantage can reach up to 35 percent.
  2. However, Graviton shows outstanding results when the number of threads is larger than the number of vCPUs, with an advantage of 5 to 14 percent over Intel.
  3. In real numbers, Graviton's advantage could be up to 70,000 transactions per second over Intel in high-concurrency scenarios.

Total Result Overview

 

plot 4.2.  Latencies (95 percentile) during the test for EC2 with 2,4,8,16,32,48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 4.3.  Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2,4,8,16,32,48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 4.4.  Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2,4,8,16,32,48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

Conclusions

  1. ARM CPUs show better results on EC2 instances with more vCPUs and under higher load, especially in high-concurrency scenarios.
  2. On small EC2 instances under a light load, ARM CPUs show less impressive performance, so we can't see their benefits compared with Intel EC2 instances.
  3. Intel is still the leader in low-concurrency scenarios, and it definitely wins on EC2 instances with a small number of vCPUs.
  4. AMD does not show competitive results in any case.

Final Thoughts

  1. AMD: we have a lot of questions about the EC2 instances on AMD, so it would be a good idea to check what was going on with those instances during the test and verify the general performance of their CPUs.
  2. We found out that under some specific conditions, Intel and Graviton can compete with each other. The other side of the coin is economics: what is cheaper to use in each situation? The next article will be about that.
  3. It would be a good idea to try EC2 with Graviton for a real high-concurrency DB.
  4. It seems worthwhile to run additional scenarios with 256 and 512 threads to check the hypothesis that Graviton works better when threads outnumber vCPUs.

APPENDIX:

List of EC2 used in research:

CPU type   EC2            Price per hour (USD)   vCPU   RAM (GB)
--------   ------------   --------------------   ----   --------
Graviton   m6g.large      0.077                     2          8
Graviton   m6g.xlarge     0.154                     4         16
Graviton   m6g.2xlarge    0.308                     8         32
Graviton   m6g.4xlarge    0.616                    16         64
Graviton   m6g.8xlarge    1.232                    32        128
Graviton   m6g.12xlarge   1.848                    48        192
Graviton   m6g.16xlarge   2.464                    64        256
Intel      m5.large       0.096                     2          8
Intel      m5.xlarge      0.192                     4         16
Intel      m5.2xlarge     0.384                     8         32
Intel      m5.4xlarge     0.768                    16         64
Intel      m5.8xlarge     1.536                    32        128
Intel      m5.12xlarge    2.304                    48        192
Intel      m5.16xlarge    3.072                    64        256
AMD        m5a.large      0.086                     2          8
AMD        m5a.xlarge     0.172                     4         16
AMD        m5a.2xlarge    0.344                     8         32
AMD        m5a.4xlarge    0.688                    16         64
AMD        m5a.8xlarge    1.376                    32        128
AMD        m5a.12xlarge   2.064                    48        192
AMD        m5a.16xlarge   2.752                    64        256

 

my.cnf:
[mysqld]
ssl=0
performance_schema=OFF
skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000
join_buffer_size=256K
sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=2G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size=${80%_OF_RAM}
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_log_buffer_size=64M

default_storage_engine=InnoDB
innodb_flush_log_at_trx_commit  = 1
innodb_doublewrite= 1
innodb_flush_method= O_DIRECT
innodb_file_per_table= 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0
max_prepared_stmt_count=1000000 
bind_address = 0.0.0.0
[client]

 

Sep
21
2021
--

Making Aurora Write Latency 15x Higher (or More!) by Choosing a Bad Primary Key

Aurora MySQL Write Latency

Primary Key design is an important thing for InnoDB performance, and choosing a poor PK definition will have an impact on performance and also on write propagation in databases. With Aurora, this impact is even worse than you may expect.

In short, we consider a poor Primary Key definition in InnoDB to be "anything but quasi-sequential values", which may cause very random access to data and thus increase the IO dependency.

In this post, I’ll try to demonstrate the potential impact of the primary key design when running on Aurora, and how a bad design can lead to a 15x write latency penalty (or more).

The Analysis

Recently I worked on a case where a customer was having issues with scaling writes in Aurora MySQL. While this is a known limitation in Aurora considering how the distributed storage layer syncs out data among all nodes of the cluster, we observed additional latency occurring when more clients were writing to the database.

The first thing I noticed is that their main table had a poor definition of the Primary Key as they were using UUID-like columns based on VARCHAR data types. In this case, the nature of values for the Primary Key was very random, which is really bad for a b-tree based storage like InnoDB.

With this in mind, I referred to a great post from my colleague Yves Trudeau explaining why UUIDs are bad for performance, so based on this premise I decided to try to measure how big this impact can be in the Aurora world.
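
As an aside, when an application must keep UUIDs, MySQL 8.0 offers a way to make time-based (version 1) UUIDs index friendly: UUID_TO_BIN() with the swap flag reorders the timestamp bytes so stored values become roughly sequential. A hedged sketch of the idea (the table name test_ordered_PK is hypothetical and was not part of these tests):

mysql -h francisco-sysbench.cluster -u admin -pxxxxxx percona <<'SQL'
-- store the UUID as 16 bytes; the swap flag makes v1 UUIDs quasi-sequential
CREATE TABLE test_ordered_PK (
  id binary(16) NOT NULL,
  number int(10) NOT NULL,
  PRIMARY KEY (id),
  KEY idx_number (number)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO test_ordered_PK (id, number) VALUES (UUID_TO_BIN(UUID(), 1), 1);
SQL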

The set of tests I’ve run were using a db.r5.2xlarge Aurora MySQL Cluster (8vCPU and 64GB of ram) which is pretty similar to the cluster my customer was using.

First, I’ve started with two very basic tables to avoid any extra overhead but something close to a real case:

CREATE TABLE `test_sequential_PK` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`number` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_number` (`number`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `test_random_PK` (
`id` varchar(26) NOT NULL,
`number` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_number` (`number`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Then I created two simple Lua scripts to execute with sysbench (super simple, I'd say) as they were just for doing inserts, using either the auto_increment property for the test_sequential_PK table or random values for the test_random_PK table.

The final purpose was not to measure Aurora performance as a whole but the write latency when the Primary Key is not optimal.

I’ve started the process by warming up the instances for few days by running both sysbench scripts and pushing the instance really hard while I was filling up the InnoDB Buffer Pool, results were pretty good for some time until the traffic became IO-bound:

Amazon Aurora Latency

It took a few days, but after some time, we started to see an increase in write latency. I created the initial dataset using 50 concurrent threads, which means the graphic above is not very helpful for the analysis I'm going to make.

After I was sure the buffer pool was filled and the instance was warmed up, I verified that the dataset was bigger than the buffer pool:

SELECT concat(table_schema,'.',table_name) schema_table,
       concat(round(table_rows/1000000,2),'M') rows,
       concat(round(data_length/(1024*1024*1024),2),'G') Data,
       round(data_length/table_rows,0) DataRow,
       concat(round(index_length/(1024*1024*1024),2),'G') idx,
       round(index_length/table_rows,0) as IdxRow,
       concat(round((data_length+index_length)/(1024*1024*1024),2),'G') totSize,
       round(index_length/data_length,2) idxfrac
FROM information_schema.TABLES
where table_schema in ('percona')
ORDER BY data_length+index_length DESC LIMIT 10;
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+
| schema_table                  | rows     | Data    | DataRow | idx    | IdxRow | totSize | idxfrac |
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+
| percona.test_random_PK        | 1586.25M | 104.53G |      71 | 73.04G |     49 | 177.57G |    0.70 |
| percona.test_sequential_PK    | 1840.49M | 54.74G  |      32 | 24.54G |     14 | 79.27G  |    0.45 |
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+

I’ll explain the difference between table sizes later in this post.

After this, I started to run separate tests to check how the write latency is affected by our table design.

First I ran sysbench as follows:

sysbench /usr/share/sysbench/insert_sequence.lua    --mysql-user=admin --mysql-password=xxxxxx --mysql-host=francisco-sysbench.cluster --mysql-db=percona --report-interval=1   --threads=15  --events=0 --time=0 run

MySQL Client Thread Activity

We can see that performance remains steady at roughly 124 microseconds on average with all threads running, so there appears to be no impact for about 24 hours.

Then I tested the random insert by running:

sysbench /usr/share/sysbench/insert_random.lua    --mysql-user=admin --mysql-password=xxxxxx --mysql-host=francisco-sysbench.cluster --mysql-db=percona --report-interval=1   --threads=15  --events=0 --time=0 run

MySQL InnoDB
This is a huge impact if you consider that the random access causes the instance to suffer roughly a 15x increase in write latency.

These numbers were very impressive compared to my previous experience, so, being extra curious, I checked what was reported in CloudWatch for Aurora write latency over the previous three days.

Cloudwatch for Aurora Write latency

It’s quite clear the 3 stages of the checks:

  • Warming up the instance and buffer pool by pushing really hard with sysbench in both tables (peak of load at the beginning)
  • First round of sequential inserts using sequential write (sequential writes)
  • Last round of random access (random writes)

As I said, the code used in each script was very simple. This is on purpose, to avoid adding overhead anywhere other than the database. For reference, this is the interesting portion of code for random writes:

-- statement template: %s receives the random id, %i the payload value
local inserts = {
    "INSERT INTO percona.test_random_PK (id, number) VALUES ('%s', %i)"
}

-- each '#' is replaced by a random character, yielding a random 23-char id
local rnd_str_tmpl = "###########-###########"

function execute_inserts()
    -- build a random primary key and insert one row
    local id = sysbench.rand.string(rnd_str_tmpl)
    con:query(string.format(inserts[1], id, sb_rand(1, sysbench.opt.table_size)))
end

And for sequential writes:

-- the id column is omitted so InnoDB assigns sequential auto_increment values
local inserts = {
    "INSERT INTO percona.test_sequential_PK (number) VALUES (%i)"
}

function execute_inserts()
    con:query(string.format(inserts[1], sb_rand(1, sysbench.opt.table_size)))
end

Conclusion

Primary Key design for InnoDB tables has been discussed at length in several posts, and the impact of the UUID format in particular was perfectly described in the post I mentioned above, so there are no surprises. What I found interesting in this analysis is that in Aurora there seems to be an extra impact from this random access.

Given that, what’s happening underneath is not quite clear. I just can try to elaborate a theory:

In Aurora, there are no dirty pages. Basically, every commit is synchronized to disk immediately so all replicas can be virtually in sync with the source server. This reduces the chance of hitting the same page in memory on every write operation, so the whole sequence of grabbing the page, placing it in memory, updating it, flushing, and committing synchronously to the block storage is particularly expensive in Aurora.

Additionally, since every secondary index entry includes a copy of the Primary Key, a large random PK noticeably inflates the data size (and thus the disk footprint), as we saw in the table sizes above, which may also cause extra disk utilization.

So be warned: plan your PKs properly if you don't want to see a huge performance degradation in AWS Aurora as the database grows and the workload increases.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Download Percona Distribution for MySQL Today

Jul
19
2021
--

Performing ETL Using Inheritance in PostgreSQL

ETL Using Inheritance in PostgreSQL

Good database maintenance includes not only the standard adding, updating, and deleting of records, etc., but periodic edits to the table schema too. Operations such as adding, editing, and removing table columns are part of today’s life-cycle reality as new functionality is constantly being added.

In quieter and less demanding times, one could get away with schema updates with minimal impact by performing the operation during periods of low load. However, as all database systems have become more critical to the business bottom line, maintenance windows have, by necessity, become much smaller and more time-sensitive.

So the question is asked: How can one update a multi-terabyte table with near-zero downtime?

Looking deep into our Postgres bag of tricks, we turn not to the most recent and advanced features but instead leverage a very old capability that has been part of Postgres since it was first released as an open source project, i.e. its object-relational implementation of inheritance.

Use Case

Before going into the details of the solution let’s define just what we are trying to solve.

Strictly speaking; there’s two use cases that come to mind when using inheritance as the prime ETL data migration mechanism:

  • Removing table OIDs, such as when moving to Postgres version 12 and greater.
  • Performing DML/DDL operations which include:
    • updating the data
    • adding or removing table columns

Life starts getting complicated when dealing with issues such as updating data types. But we’ll talk about this at the end of the blog.

About Inheritance

Unlike object-oriented programming languages, where the child inherits attributes from the parent, in Postgres it’s the parent that has the ability to inherit from the child (if only this were true in real life ;-)). Thus a table column from a child has the potential of becoming available in the parent relation.

Consider the following snippet of code: Two parents and three children are created and populated with records:

BEGIN;
    drop table if exists father, mother cascade;

    create table father(c1 int, c2 int, c3 int);
    create table mother(c1 int, c2 int, c4 int);

-- notice that although not declared here, column "c4" from mother is added to these tables
    create table son(c1 int, c2 int, c3 int) inherits (father,mother);
    create table daughter(c2 int, c4 int, c5 int) inherits (father,mother);

-- this table inherits only those columns from "father"
    create table cousin (c1 int, c2 int, c3 int, c6 int) inherits (father);
COMMIT;
BEGIN;
    insert into son values(1,1,1,1);
    insert into daughter values(2,2,2,2,2);
    insert into cousin values(3,3,3,3);
    insert into father values(10,10,10);
    insert into mother values(11,11,11);
COMMIT;

Columns declared in the parents must therefore exist in the children, i.e. they are merged. But notice that columns unique to a child are not propagated to the parents:

Table "public.father"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
Number of child tables: 3 (Use \d+ to list them.)

 

Table "public.mother"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c4     | integer |           |          |
Number of child tables: 2 (Use \d+ to list them.)

 

 

Table "public.son"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
 c4     | integer |           |          |
Inherits: father,
          mother
Table "public.daughter"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
 c4     | integer |           |          |
 c5     | integer |           |          |
Inherits: father,
          mother
Table "public.cousin"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
 c6     | integer |           |          |
Inherits: father
 

And even though records inserted into a child are visible from the parent, the reverse is NOT true: records inserted into the parent are not visible from the children:

db02=# select * from father;
 c1 | c2 | c3
----+----+----
 10 | 10 | 10
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
db02=# select * from mother;
 c1 | c2 | c4
----+----+----
 11 | 11 | 11
  1 |  1 |  1
  2 |  2 |  2
 
db02=# select * from son;
 c1 | c2 | c3 | c4
----+----+----+----
  1 |  1 |  1 |  1
db02=# select * from daughter;
 c1 | c2 | c3 | c4 | c5
----+----+----+----+----
  2 |  2 |  2 |  2 |  2
db02=# select * from cousin;
 c1 | c2 | c3 | c6
----+----+----+----
  3 |  3 |  3 |  3

Developing The ETL/Migration Model

Performing data migration under production conditions should take into consideration these four (4) distinct query operations:

  • SELECT from both the target and source tables at the same time.
  • UPDATE and/or DELETE records from both the target and source tables.
  • INSERT new records into the target table.
  • Moving data from the source to the target tables.

For the sake of discussion, we’ll demonstrate using one source table and one target table, each inheriting from a single parent:

create table parent(c1 int primary key, c2 int, c3 int);
create table source(like parent including all) inherits (parent);
create table target(like parent including all) inherits (parent);

Querying Both Target And Source Tables

Inheritance makes the SELECT query a straightforward operation. This query checks all the tables for the queried record(s):

explain select * from parent;
QUERY PLAN
--------------------------------------------------------------------------
Append (cost=0.00..81.21 rows=4081 width=12)
-> Seq Scan on parent parent_1 (cost=0.00..0.00 rows=1 width=12)
-> Seq Scan on source parent_2 (cost=0.00..30.40 rows=2040 width=12)
-> Seq Scan on target parent_3 (cost=0.00..30.40 rows=2040 width=12)

UPDATE and/or DELETE Records

Similarly to SELECT queries, one doesn’t have to worry about editing the existing application’s DML operations when performing UPDATE and DELETE. Again, notice how both the source and target tables are queried, as well as the parent:

explain update parent set c2=0 where c1=1;
QUERY PLAN
---------------------------------------------------------------------------------
 Update on parent  (cost=0.00..16.34 rows=3 width=18)
   Update on parent
   Update on source
   Update on target
   ->  Seq Scan on parent  (cost=0.00..0.00 rows=1 width=18)
         Filter: (c1 = 1)
   ->  Index Scan using source_pkey on source  (cost=0.15..8.17 rows=1 width=18)
         Index Cond: (c1 = 1)
   ->  Index Scan using target_pkey on target  (cost=0.15..8.17 rows=1 width=18)
         Index Cond: (c1 = 1)

INSERT New Records Into The Target Table

The thing to keep in mind about INSERT is that without a redirect mechanism all records are inserted into the parent.

Since everybody already knows about triggers, I thought it would be fun to use a REWRITE RULE instead:

create rule child_insert as
on
    insert to parent
do instead
    insert into target values (NEW.*);

 

Table "public.parent"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 c1     | integer |           | not null |         | plain   |              |
 c2     | integer |           |          |         | plain   |              |
 c3     | integer |           |          |         | plain   |              |
Indexes:
    "parent_pkey" PRIMARY KEY, btree (c1)
Rules:
    child_insert AS
    ON INSERT TO parent DO INSTEAD  INSERT INTO target (c1, c2, c3)
  VALUES (new.c1, new.c2, new.c3)
Child tables: source,
              target

And here’s our validation; notice how the INSERT is redirected from parent to target:

EXPLAIN insert into parent (c1,c2,c3) values (1,1,1);
QUERY PLAN
-----------------------------------------------------
 Insert on target  (cost=0.00..0.01 rows=1 width=12)
   ->  Result  (cost=0.00..0.01 rows=1 width=12)
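
For completeness, here is a trigger-based equivalent of the rewrite rule, a sketch following the classic partitioning-trigger pattern (you would create either the rule or this trigger, not both):

psql db02 <<'SQL'
CREATE OR REPLACE FUNCTION parent_insert_redirect() RETURNS trigger AS $$
BEGIN
    INSERT INTO target VALUES (NEW.*);
    RETURN NULL;   -- suppress the insert into parent itself
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION requires PostgreSQL 11+; use EXECUTE PROCEDURE on older releases
CREATE TRIGGER parent_insert_redirect
    BEFORE INSERT ON parent
    FOR EACH ROW EXECUTE FUNCTION parent_insert_redirect();
SQL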

Moving Records Between Source And Target Tables

It’s time to introduce the last mechanism needed to perform the actual data migration. Essentially, the data is moved in batches otherwise, if you can afford the downtime of moving your records in one very large transaction, this dog and pony show is a bit redundant.

In the real world we need to anticipate multiple processes attempting simultaneous EXCLUSIVE LOCKS. In the case that one or more records are locked by another operation the following example demonstrates how one can simply skip over them:

--
-- EX: insert a single record in table "source"
--
insert into source (c1,c2,c3) values (2,2,2);

-- move 1,000 records at a time
-- from table "source" to "target"
--
with a as (select * from source for update skip locked limit 1000),
     b as (delete from source using a where c1=a.c1
           returning    source.c1,
                        source.c2,
                        source.c3)
insert into target select * from b;
Running the batch under EXPLAIN ANALYZE produces:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Insert on target  (cost=27.92..41.52 rows=680 width=12) (actual time=0.082..0.082 rows=0 loops=1)
   CTE b
     ->  Delete on source  (cost=9.42..27.92 rows=680 width=6) (actual time=0.050..0.053 rows=1 loops=1)
           ->  Bitmap Heap Scan on source  (cost=9.42..27.92 rows=680 width=6) (actual time=0.021..0.023 rows=1 loops=1)
                 Recheck Cond: (c1 >= 0)
                 Heap Blocks: exact=1
                 ->  Bitmap Index Scan on source_pkey  (cost=0.00..9.25 rows=680 width=0) (actual time=0.010..0.011 rows=1 loops=1)
                       Index Cond: (c1 >= 0)
   ->  CTE Scan on b  (cost=0.00..13.60 rows=680 width=12) (actual time=0.054..0.057 rows=1 loops=1)
 Planning Time: 0.237 ms
 Execution Time: 0.173 ms

Putting It All Together

It’s time to demonstrate a Proof Of Concept using pgbench.

SETUP

Initialize database db02:

dropdb --if-exists db02
createdb db02
pgbench -s 10 -i db02
List of relations
 Schema |       Name       | Type  |  Owner   |  Size   | Description
--------+------------------+-------+----------+---------+-------------
 public | pgbench_accounts | table | postgres | 128 MB  |
 public | pgbench_branches | table | postgres | 40 kB   |
 public | pgbench_history  | table | postgres | 0 bytes |
 public | pgbench_tellers  | table | postgres | 40 kB   |

Create the tables parent and child.

NOTE: In order to demonstrate data migration from a deprecated table, table pgbench_accounts is altered by adding OIDs.

create table parent (like pgbench_accounts including all) without oids;
create table child (like pgbench_accounts including all) inherits (parent) without oids;

alter table pgbench_accounts set with oids, inherit parent;

alter table pgbench_accounts rename to pgbench_accounts_deprecated;
alter table parent rename to pgbench_accounts;

TEST

This query is at the heart of the solution. Any exclusively locked record that it tries to move is automatically skipped and a new attempt can be made the next time this script is invoked.

with a as (select * from pgbench_accounts_deprecated order by 1 for update skip locked limit 10),
     b as (delete
                from pgbench_accounts_deprecated
                using a
                where pgbench_accounts_deprecated.aid=a.aid
                returning pgbench_accounts_deprecated.aid,
                          pgbench_accounts_deprecated.bid,
                          pgbench_accounts_deprecated.abalance,
                          pgbench_accounts_deprecated.filler)
insert into child select * from b;

VALIDATE

explain analyze select * from pgbench_accounts order by 1 limit 13;
QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..1.45 rows=13 width=97) (actual time=0.012..0.016 rows=13 loops=1)
   ->  Merge Append  (cost=0.72..56212.42 rows=1000211 width=97) (actual time=0.011..0.013 rows=13 loops=1)
         Sort Key: pgbench_accounts.aid
         ->  Index Scan using parent_pkey on pgbench_accounts  (cost=0.12..8.14 rows=1 width=352) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts_deprecated  (cost=0.42..43225.43 rows=1000000 width=97) (actual time=0.006..0.007 rows=3 loops=1)
         ->  Index Scan using child_pkey on child  (cost=0.14..51.30 rows=210 width=352) (actual time=0.002..0.003 rows=10 loops=1)
 Planning Time: 0.084 ms
 Execution Time: 0.030 ms

MIGRATION SCRIPT

The query has been incorporated into this script moving 1,000 records every 5 seconds.

#!/bin/bash

set -e

export PGHOST=/tmp PGPORT=10011 PGDATABASE=db02 PGUSER=postgres

SLEEP=5
REC=1000

SQL="
with a as (select * from pgbench_accounts_deprecated order by 1 for update skip locked limit $REC),
     b as (delete
                from pgbench_accounts_deprecated
                using a
                where pgbench_accounts_deprecated.aid=a.aid
                returning pgbench_accounts_deprecated.aid,
                          pgbench_accounts_deprecated.bid,
                          pgbench_accounts_deprecated.abalance,
                          pgbench_accounts_deprecated.filler)
insert into child select * from b;

with a(count_child)    as (select count(*) from child),
     b(count_accounts) as (select count(*) from pgbench_accounts_deprecated)
select a.count_child, b.count_accounts from a,b;
"

while true
do
    echo "--- $(date): Executing Query, moving $REC records now ... ---"
    psql <<<"$SQL"
    echo "sleeping: $SLEEP seconds ..." && sleep $SLEEP
done

BENCHMARKING

While the aforementioned script is active, pgbench runs the benchmark:

#
# doesn't block
#
pgbench -c 2 -j 4 --protocol=simple -T 120 db02

CAVEAT

This is a simple and powerful method, but there is a limitation: common columns between tables must be of the same data type.

For example, if column c1 in table source is of data type int and you want to migrate the data into table target where the same column c1 is of data type bigint, then this method won’t work. An alternate solution, however, could take advantage of updatable views, combined with the appropriate triggers and rewrite rules; a sketch of the idea follows.
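
A minimal sketch of that idea for the int-to-bigint case above (the view name parent_v is hypothetical): the view exposes the wider type to readers, while writes through it would additionally need INSTEAD OF triggers, because the cast makes the column non-updatable.

psql db02 <<'SQL'
CREATE VIEW parent_v AS
    SELECT c1::bigint AS c1, c2, c3 FROM parent;
SQL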

Jul
09
2021
--

MySQL/ZFS Performance Update

MySQL/ZFS Performance Update

As some of you likely know, I have a favorable view of ZFS and especially of MySQL on ZFS. As I published a few years ago, the argument for ZFS was less about performance than about its useful features like data compression and snapshots. At the time, ZFS was significantly slower than xfs and ext4 except when the L2ARC was used.

Since then, however, ZFS on Linux has progressed a lot, and I have also learned how to tune it better. I also found out that the sysbench benchmark I used at the time was not a fair choice, since the dataset it generates compresses much less than a realistic one. For all these reasons, I believe it is time to revisit the performance aspect of MySQL on ZFS.

ZFS Evolution

In 2018, I reported ZFS performance results based on version 0.6.5.6, the default version available in Ubuntu Xenial. The present post uses version 0.8.6-1 of ZFS, the default one available on Debian Buster. Between the two versions, there have been in excess of 3600 commits adding a number of new features, like support for TRIM operations and the addition of the efficient zstd compression algorithm.

ZFS 0.8.6-1 is not bleeding edge: there have been more than 1700 commits since, and after 0.8.6 the ZFS release number jumped to 2.0. The big addition included in the 2.0 release is native encryption.
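
For those running ZFS 2.0 or later, switching a dataset to zstd is a one-liner; a sketch assuming the bench pool created in the Configuration section below (this does not apply to the 0.8.6-1 release actually tested here):

zfs set compression=zstd bench/data
zfs get compression,compressratio bench/data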

Benchmark Tools

The classic sysbench MySQL database benchmarks use a dataset containing mostly random data. Such datasets don’t compress much, far less than most real-world datasets I have worked with. The compressibility of the dataset is important since the ZFS caches, the ARC and L2ARC, store compressed data. A better compression ratio essentially means more data is cached and fewer IO operations are needed.

A well-known tool to benchmark a transactional workload is TPCC. Furthermore, the dataset created by TPCC compresses rather well, making it more realistic in the context of this post. The sysbench TPCC implementation was used.
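
For reference, the tpcc.lua script invoked later in this post ships with the Percona-Lab sysbench-tpcc scripts; fetching them typically looks like:

git clone https://github.com/Percona-Lab/sysbench-tpcc.git
cd sysbench-tpcc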

Test Environment

Since I am already familiar with AWS and Google cloud, I decided to try Azure for this project. I launched these two virtual machines:

tpcc:

  • benchmark host
  • Standard D2ds_v4 instance
  • 2 vCPU, 8 GB of RAM, and 75 GB of temporary storage
  • Debian Buster

db:

  • Database host
  • Standard E4-2ds-v4 instance
  • 2 vCPU, 32 GB of RAM, and 150 GB of temporary storage
  • 256GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
  • Debian Buster
  • Percona server 8.0.22-13

Configuration

By default, and unless specified otherwise, the ZFS filesystems are created with:

zpool create bench /dev/sdc
zfs set compression=lz4 atime=off logbias=throughput bench
zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k \
           -o primarycache=metadata bench/data
zfs create -o mountpoint=/var/lib/mysql/log bench/log

There are two ZFS filesystems. bench/data is optimized for the InnoDB dataset while bench/log is tuned for the InnoDB log files. Both are compressed using lz4 and the logbias parameter is set to throughput which changes the way the ZIL is used. With ext4, the noatime option is used.

ZFS has also a number of kernel parameters, the ones set to non-default values are:

zfs_arc_max=2147483648
zfs_async_block_max_blocks=5000
zfs_delete_blocks=1000
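
These are kernel module parameters; a sketch of how one might apply them, either at runtime through the module’s /sys interface or persisted via modprobe.d (the file name is an assumption):

echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=2147483648 zfs_async_block_max_blocks=5000 zfs_delete_blocks=1000
EOF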

Essentially, these settings limit the ARC size to 2GB and throttle down the aggressiveness of ZFS for deletes. Finally, the database configuration differs slightly between ZFS and ext4. There is a common section:

[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
log-error = /var/log/mysql/error.log
skip-log-bin
datadir = /var/lib/mysql/data
innodb_buffer_pool_size = 26G
innodb_flush_log_at_trx_commit = 1 # TPCC reqs.
innodb_log_file_size = 1G
innodb_log_group_home_dir = /var/lib/mysql/log
innodb_flush_neighbors = 0
innodb_fast_shutdown = 2

and when ext4 is used:

innodb_flush_method = O_DIRECT

and when ZFS is used:

innodb_flush_method = fsync
innodb_doublewrite = 0 # ZFS is transactional
innodb_use_native_aio = 0
innodb_read_io_threads = 10
innodb_write_io_threads = 10

ZFS doesn’t support O_DIRECT but it is ignored with a message in the error log. I chose to explicitly set the flush method to fsync. The doublewrite buffer is not needed with ZFS and I was under the impression that the Linux native asynchronous IO implementation was not well supported by ZFS so I disabled it and increased the number of IO threads. We’ll revisit the asynchronous IO question in a future post.

Dataset

I use the following command to create the dataset:

./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=8 --tables=10 --scale=200 --db-driver=mysql prepare

The resulting dataset has a size of approximately 200GB. The dataset is much larger than the buffer pool so the database performance is essentially IO-bound.

Test Procedure

The execution of every benchmark was scripted and followed these steps:

  1. Stop MySQL
  2. Remove all datafiles
  3. Adjust the filesystem
  4. Copy the dataset
  5. Adjust the MySQL configuration
  6. Start MySQL
  7. Record the configuration
  8. Run the benchmark

Results

For the benchmark, I used the following invocation:

./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=16 --time=7200 --report-interval=10 --tables=10 --scale=200 --db-driver=mysql run

The TPCC benchmark uses 16 threads for a duration of two hours. The duration is sufficiently long to reach a steady state and to exhaust the storage burst capacity. Sysbench returns the total number of TPCC transactions per second every 10s. This number includes not only the New Order transactions but also the other transaction types like Payment, Order Status, etc. Be aware of that if you want to compare these results with other TPCC benchmarks.

In those conditions, the figure below presents the rates of TPCC transactions over time for ext4 and ZFS.

TPCC transactions ZFS

MySQL TPCC results for ext4 and ZFS

During the initial 15 minutes, the buffer pool warms up, but at some point the workload shifts from IO-read bound to IO-write and CPU bound. Then, at around 3000s, the SSD Premium burst capacity is exhausted and the workload becomes purely IO-bound. I was surprised enough by the results to rerun the benchmarks and make sure. The results for both ext4 and ZFS are qualitatively similar; any difference is within the margin of error. That essentially means that if you configure ZFS properly, it can be as IO efficient as ext4.

What is interesting is the amount of storage used. While the dataset on ext4 consumed 191GB, the lz4 compression of ZFS yielded a dataset of only 69GB. That’s a huge difference, a factor of 2.8, which could save a decent amount of money over time for large datasets.

Conclusion

It appears that it was indeed a good time to revisit the performance of MySQL with ZFS. In a fairly realistic use case, ZFS is on par with ext4 regarding performance while still providing the extra benefits of data compression, snapshots, etc. In a future post, I’ll examine the use of cloud ephemeral storage with ZFS and see how this can further improve performance.
