Oct
15
2021
--

Comparing Graviton (ARM) Performance to Intel and AMD for MySQL


Recently, AWS introduced its own ARM-architecture CPU for server solutions.

It was Graviton. As a result, AWS updated some of its EC2 instance lines with a new “g” suffix (e.g. m6g, r6g, etc.). In its reviews and presentations, AWS showed impressive results, with Graviton being up to 20 percent faster in some benchmarks. On the other hand, some reviewers said that Graviton does not show any significant gains and, in some cases, performs worse than Intel.

We decided to investigate this and do our own research on Graviton performance, comparing it directly with other CPUs (Intel and AMD) for MySQL.

Disclaimer

  1. The test is designed to be CPU bound only, so we will use a read-only test and make sure there is no I/O activity during the test.
  2. Tests were run on m5.* (Intel), m5a.* (AMD), and m6g.* (Graviton) EC2 instances in the us-east-1 region. (See the list of EC2 instances in the appendix.)
  3. Monitoring was done with Percona Monitoring and Management (PMM).
  4. OS: Ubuntu 20.04 LTS.
  5. Load tool (sysbench) and target DB (MySQL) installed on the same EC2 instance.
  6. MySQL 8.0.26-0, installed from official packages.
  7. Load tool: sysbench 1.0.18.
  8. innodb_buffer_pool_size=80% of available RAM.
  9. Test duration is five minutes for each thread count, followed by a 90-second cool-down before the next iteration.
  10. Tests were run three times (to smooth out outliers and get more reproducible results), and the results were then averaged for the graphs.
  11. We call a scenario high-concurrency when the number of threads is greater than the number of vCPUs, and low-concurrency when the number of threads is less than or equal to the number of vCPUs on the EC2 instance.
  12. Scripts to reproduce the results are available on our GitHub.

Test Case

Prerequisite:

1. Create a database with 10 tables of 10,000,000 rows each:

sysbench oltp_read_only --threads=10 --mysql-user=sbtest --mysql-password=sbtest --table-size=10000000 --tables=10 --db-driver=mysql --mysql-db=sbtest prepare

2. Warm up by loading all the data into the buffer pool:

sysbench oltp_read_only --time=300 --threads=10 --table-size=1000000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

Test:

Run the same scenario in a loop with different concurrency levels THREAD (1, 2, 4, 8, 16, 32, 64, 128) on each EC2 instance:

sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run
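
A minimal bash sketch of such a loop; the cool-down sleep and shell wrapping are assumptions, while the sysbench call itself mirrors the command above:

#!/bin/bash
for THREAD in 1 2 4 8 16 32 64 128; do
    echo "Running scenario with ${THREAD} threads"
    sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 \
        --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run
    sleep 90   # cool-down between iterations, per the disclaimer above
done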

Results:

The review of the results is split into three parts:

  1. “small” EC2 instances with 2, 4, and 8 vCPUs
  2. “medium” EC2 instances with 16 and 32 vCPUs
  3. “large” EC2 instances with 48 and 64 vCPUs

The “small”, “medium”, and “large” labels are just synthetic names to make the review easier; they group the instances by the number of vCPUs per EC2 instance.
There are four graphs for each test:

  1. Throughput (queries per second) that the EC2 instance could deliver in each scenario (number of threads)
  2. Latency (95th percentile) for each scenario (number of threads)
  3. Relative (percentage) comparison of Graviton and Intel
  4. Absolute comparison of Graviton and Intel

Validation that all the load went to the CPU, and not to disk I/O, was also done using PMM (Percona Monitoring and Management).

pic 0.1 – OS monitoring during all test stages

From pic 0.1, we can see that there was no disk I/O activity during the tests, only CPU activity. The main disk activity happened during the DB creation stage.

Result for EC2 with 2, 4, and 8 vCPU

plot 1.1. Throughput (queries per second) for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.2.  Latencies (95 percentile) during the test for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.3.  Percentage comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 1.4.  Absolute comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 2, 4, and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. AMD has the highest latencies in all scenarios and on all EC2 instances. We won’t repeat this in the following overviews, and it is the reason we exclude AMD from the percentage and absolute comparisons with the other CPUs (plots 1.3 and 1.4, etc.).
  2. On instances with two and four vCPUs, Intel shows an advantage of less than 10 percent in all scenarios.
  3. However, on the instance with eight vCPUs, Intel shows an advantage only in scenarios where the number of threads is less than or equal to the number of vCPUs.
  4. On the EC2 instance with eight vCPUs, Graviton starts to show an advantage. It shows good results in scenarios where the number of threads is higher than the number of vCPUs, growing up to 15 percent in the high-concurrency scenarios with 64 and 128 threads, which is 8 and 16 times the number of available vCPUs.
  5. Graviton starts showing an advantage on EC2 instances with eight vCPUs in scenarios where the thread count exceeds the vCPU count. This pattern holds in all the following tests: the more the load exceeds the number of vCPUs, the better Graviton’s relative result.

 

Result for EC2 with 16 and 32 vCPU

plot 2.1.  Throughput (queries per second)  for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.2.  Latencies (95 percentile) during the test for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.3.  Percentage comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 2.4.  Absolute comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. On EC2 instances with 16 and 32 vCPUs under the same load, Graviton continues to have an advantage when the number of threads is greater than the number of available vCPUs.
  2. Graviton shows an advantage of up to 10 percent in high-concurrency scenarios, while Intel shows up to 20 percent in low-concurrency scenarios.
  3. In high-concurrency scenarios, Graviton can show a remarkable difference of up to 30,000 more (read) transactions per second.

Result for EC2 with 48 and 64 vCPU

plot 3.1.  Throughput (queries per second)  for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.2.  Latencies (95 percentile) during the test for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.3.  Percentage comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 3.4.  Absolute comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. Intel shows a significant advantage in most scenarios where the number of threads is less than or equal to the number of vCPUs. Intel seems really good for this kind of task, and when it has some spare vCPUs the advantage can reach up to 35 percent.
  2. However, Graviton shows outstanding results when the number of threads is greater than the number of vCPUs, with an advantage of 5 to 14 percent over Intel.
  3. In absolute numbers, Graviton’s advantage can reach up to 70,000 transactions per second over Intel in high-concurrency scenarios.

Total Result Overview

 

plot 4.2.  Latencies (95 percentile) during the test for EC2 with 2,4,8,16,32,48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 4.3.  Percentage comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 2,4,8,16,32,48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

plot 4.4.  Absolute comparison of Graviton and Intel CPUs in throughput (queries per second) for EC2 with 2,4,8,16,32,48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

Conclusions

  1. ARM CPUs show better results on EC2 instances with more vCPUs and under higher load, especially in high-concurrency scenarios.
  2. On small EC2 instances and under small loads, ARM CPUs show less impressive performance, so we can’t see their benefit compared with Intel instances.
  3. Intel is still the leader in low-concurrency scenarios, and it definitely wins on EC2 instances with a small number of vCPUs.
  4. AMD did not show competitive results in any of the cases.

Final Thoughts

  1. AMD: we have a lot of questions about the AMD-based EC2 instances, so it would be a good idea to check what was going on with those instances during the test and verify the general CPU performance on them.
  2. We found that under some specific conditions, Intel and Graviton can compete with each other. But the other side of the coin is economics: which is cheaper to use in each situation? The next article will cover that.
  3. It would be a good idea to try EC2 instances with Graviton for a real high-concurrency DB.
  4. It seems worth running additional scenarios with 256 and 512 threads to check the hypothesis that Graviton works better when the number of threads exceeds the number of vCPUs.

APPENDIX:

List of EC2 used in research:

CPU type   EC2 instance     Price per hour (USD)   vCPU   RAM
Graviton   m6g.large        0.077                  2      8 GB
Graviton   m6g.xlarge       0.154                  4      16 GB
Graviton   m6g.2xlarge      0.308                  8      32 GB
Graviton   m6g.4xlarge      0.616                  16     64 GB
Graviton   m6g.8xlarge      1.232                  32     128 GB
Graviton   m6g.12xlarge     1.848                  48     192 GB
Graviton   m6g.16xlarge     2.464                  64     256 GB
Intel      m5.large         0.096                  2      8 GB
Intel      m5.xlarge        0.192                  4      16 GB
Intel      m5.2xlarge       0.384                  8      32 GB
Intel      m5.4xlarge       0.768                  16     64 GB
Intel      m5.8xlarge       1.536                  32     128 GB
Intel      m5.12xlarge      2.304                  48     192 GB
Intel      m5.16xlarge      3.072                  64     256 GB
AMD        m5a.large        0.086                  2      8 GB
AMD        m5a.xlarge       0.172                  4      16 GB
AMD        m5a.2xlarge      0.344                  8      32 GB
AMD        m5a.4xlarge      0.688                  16     64 GB
AMD        m5a.8xlarge      1.376                  32     128 GB
AMD        m5a.12xlarge     2.064                  48     192 GB
AMD        m5a.16xlarge     2.752                  64     256 GB

 

my.cnf

[mysqld]
ssl=0
performance_schema=OFF
skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000
join_buffer_size=256K
sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=2G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size=${80%_OF_RAM}
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_log_buffer_size=64M

default_storage_engine=InnoDB
innodb_flush_log_at_trx_commit  = 1
innodb_doublewrite= 1
innodb_flush_method= O_DIRECT
innodb_file_per_table= 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0
max_prepared_stmt_count=1000000 
bind_address = 0.0.0.0
[client]

 

Sep
21
2021
--

Making Aurora Write Latency 15x Higher (or More!) by Choosing a Bad Primary Key

Aurora MySQL Write Latency

Primary Key design is important for InnoDB performance, and choosing a poor PK definition has an impact on performance and also on write propagation in databases. With Aurora, this impact is even worse than you may expect.

In short, we consider a poor Primary Key definition in InnoDB to be “anything but quasi-sequential values”, which can cause very random access to data and thus increase the IO dependency.

In this post, I’ll try to demonstrate the potential impact of the primary key design when running on Aurora, and how a bad design can lead to a 15x write latency penalty (or more).

The Analysis

Recently I worked on a case where a customer was having issues with scaling writes in Aurora MySQL. While this is a known limitation in Aurora considering how the distributed storage layer syncs out data among all nodes of the cluster, we observed additional latency occurring when more clients were writing to the database.

The first thing I noticed was that their main table had a poor Primary Key definition: they were using UUID-like columns based on VARCHAR data types. The values for the Primary Key were therefore very random, which is really bad for a B-tree-based storage engine like InnoDB.

With this in mind, I referred to a great post from my colleague Yves Trudeau explaining why UUIDs are bad for performance, so based on this premise I decided to try to measure how big this impact can be in the Aurora world.

The set of tests I ran used a db.r5.2xlarge Aurora MySQL cluster (8 vCPUs and 64GB of RAM), which is pretty similar to the cluster my customer was using.

First, I started with two very basic tables, to avoid any extra overhead while staying close to a real case:

CREATE TABLE `test_sequential_PK` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`number` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_number` (`number`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `test_random_PK` (
`id` varchar(26) NOT NULL,
`number` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_number` (`number`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Then I created two simple Lua scripts to execute with sysbench – super simple, I’d say – as they just perform inserts, using either the auto_increment property for the test_sequential_PK table or random values for the test_random_PK table.

The final purpose was not to measure Aurora performance as a whole but the write latency when the Primary Key is not optimal.

I started the process by warming up the instance for a few days, running both sysbench scripts and pushing the instance really hard while filling up the InnoDB buffer pool. Results were pretty good for some time, until the traffic became IO-bound:

Amazon Aurora Latency

It took a few days but after some time we started to see an increase in the write latency. I created an initial set of data using 50 concurrent threads, which means the graphic above is not very helpful for the analysis I’m going to make.

After I was sure the buffer pool was filled and the instance was warmed up, I verified that the dataset was bigger than the buffer pool:

SELECT concat(table_schema,'.',table_name) schema_table,
       concat(round(table_rows/1000000,2),'M') rows,
       concat(round(data_length/(1024*1024*1024),2),'G') Data,
       round(data_length/table_rows,0) DataRow,
       concat(round(index_length/(1024*1024*1024),2),'G') idx,
       round(index_length/table_rows,0) as IdxRow,
       concat(round((data_length+index_length)/(1024*1024*1024),2),'G') totSize,
       round(index_length/data_length,2) idxfrac
FROM information_schema.TABLES
where table_schema in ('percona')
ORDER BY data_length+index_length DESC LIMIT 10;
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+
| schema_table                  | rows     | Data    | DataRow | idx    | IdxRow | totSize | idxfrac |
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+
| percona.test_random_PK        | 1586.25M | 104.53G |      71 | 73.04G |     49 | 177.57G |    0.70 |
| percona.test_sequential_PK    | 1840.49M | 54.74G  |      32 | 24.54G |     14 | 79.27G  |    0.45 |
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+

I’ll explain the difference between table sizes later in this post.

After this, I started to run separate tests to check how the write latency is affected by our table design.

First I ran sysbench as follows:

sysbench /usr/share/sysbench/insert_sequence.lua    --mysql-user=admin --mysql-password=xxxxxx --mysql-host=francisco-sysbench.cluster --mysql-db=percona --report-interval=1   --threads=15  --events=0 --time=0 run

MySQL Client Thread Activity

We can see that performance remains steady at roughly 124 microseconds on average with all threads running, so it looks like there is no impact over about 24 hours.

Then I tested the random insert by running:

sysbench /usr/share/sysbench/insert_random.lua    --mysql-user=admin --mysql-password=xxxxxx --mysql-host=francisco-sysbench.cluster --mysql-db=percona --report-interval=1   --threads=15  --events=0 --time=0 run

MySQL InnoDB

This is a huge impact if you consider what the random access is doing to the instance: roughly a 15x increase in write latency.

These numbers were very impressive compared to my previous experience, so, being extra curious, I checked what was reported in CloudWatch for Aurora write latency over the previous three days.

Cloudwatch for Aurora Write latency

The three stages of the checks are quite clear:

  • Warming up the instance and buffer pool by pushing really hard with sysbench on both tables (the peak of load at the beginning)
  • First round of inserts using the sequential Primary Key (sequential writes)
  • Last round using random access (random writes)

As I said, the code used in each script was very simple, on purpose, to avoid adding overhead anywhere other than the database. For reference, this is the interesting portion of the code for random writes:

local inserts = {
"INSERT INTO percona.test_random_PK (id, number) VALUES ('%s', %i)"
}

local rnd_str_tmpl = "###########-###########"

function execute_inserts()
    local id = sysbench.rand.string(rnd_str_tmpl)
    con:query(string.format(inserts[1], id, sb_rand(1, sysbench.opt.table_size)))
end

And for sequential writes:

local inserts = {
"INSERT INTO percona.test_sequential_PK (number) VALUES (%i)"
}

function execute_inserts()
    -- INSERT
    con:query(string.format(inserts[1], sb_rand(1, sysbench.opt.table_size)))
end

Conclusion

Primary key design for InnoDB tables has been discussed at length in several posts, and the impact of the UUID format in particular was perfectly described in the post I mentioned above, so there are no surprises there. What I found interesting in this analysis is that on Aurora there seems to be an extra impact from this random access.

Given that, what’s happening underneath is not entirely clear. I can only try to elaborate a theory:

In Aurora, there are no dirty pages: basically, every commit is synchronized to disk immediately so that all replicas can be virtually in sync with the source server. This reduces the chance of hitting the same page in memory every time you perform a write operation, so the whole sequence of grabbing the page, placing it in memory, updating the information, flushing, and committing synchronously to the block storage is particularly expensive in Aurora.

Additionally, given that every secondary index entry carries a copy of the Primary Key, the data size (and thus the disk footprint) can grow a lot, as we have seen above, which may also cause extra disk utilization.

So be warned: plan your PKs properly if you don’t want to see a huge performance degradation in AWS Aurora as the database grows and the workload increases.


Jul
19
2021
--

Performing ETL Using Inheritance in PostgreSQL

ETL Using Inheritance in PostgreSQL

Good database maintenance includes not only performing the standard adding, updating, and deleting of records, but periodic edits to the table schema too. Operations such as adding, editing, and removing table columns are part of today’s life-cycle reality, as new functionality is constantly being added.

In quieter, less demanding times, one could get away with schema updates with minimal impact by performing the operation during low-load periods. However, as database systems have become more critical to the business bottom line, maintenance windows have, by necessity, become much smaller and more time-sensitive.

So the question is asked: How can one update a multi-terabyte table with near-zero downtime?

Looking deep into our Postgres bag of tricks, we turn not to the most recent and advanced features but instead leverage a very old capability that’s been part of Postgres since it was first released as an open source project, i.e. its object-relational implementation of inheritance.

Use Case

Before going into the details of the solution let’s define just what we are trying to solve.

Strictly speaking, there are two use cases that come to mind when using inheritance as the prime ETL data migration mechanism:

  • Removing table OIDs, such as when moving to Postgres version 12 and greater.
  • Performing DML/DDL operations which include:
    • updating the data
    • adding or removing table columns

Life starts getting complicated when dealing with issues such as updating data types. But we’ll talk about this at the end of the blog.

About Inheritance

Unlike object-oriented programming languages, where the child inherits attributes from the parent, in Postgres it’s the parent that has the ability to inherit from the child (if only this were true in real life ;-)). Thus the rows of a child table have the potential of becoming visible through the parent relation.

Consider the following snippet of code: Two parents and three children are created and populated with records:

BEGIN;
    drop table if exists father, mother cascade;

    create table father(c1 int, c2 int, c3 int);
    create table mother(c1 int, c2 int, c4 int);

-- notice that although not declared here, column "c4" from mother is merged into these tables
    create table son(c1 int, c2 int, c3 int) inherits (father,mother);
    create table daughter(c2 int, c4 int, c5 int) inherits (father,mother);

-- this table inherits only those columns from "father"
    create table cousin (c1 int, c2 int, c3 int, c6 int) inherits (father);
COMMIT;
BEGIN;
    insert into son values(1,1,1,1);
    insert into daughter values(2,2,2,2,2);
    insert into cousin values(3,3,3,3);
    insert into father values(10,10,10);
    insert into mother values(11,11,11);
COMMIT;

Columns declared in the parents must therefore exist in the child, i.e. they are merged. But notice those columns unique to the child are not necessarily propagated to the parents:

Table "public.father"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
Number of child tables: 3 (Use \d+ to list them.)

 

Table "public.mother"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c4     | integer |           |          |
Number of child tables: 2 (Use \d+ to list them.)

 

 

Table "public.son"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
 c4     | integer |           |          |
Inherits: father,
          mother
Table "public.daughter"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
 c4     | integer |           |          |
 c5     | integer |           |          |
Inherits: father,
          mother
Table "public.cousin"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c1     | integer |           |          |
 c2     | integer |           |          |
 c3     | integer |           |          |
 c6     | integer |           |          |
Inherits: father
 

And even though records inserted into a child can be seen from the parent, the reverse is NOT true: records inserted into the parent are not seen by the children:

db02=# select * from father;
 c1 | c2 | c3
----+----+----
 10 | 10 | 10
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
db02=# select * from mother;
 c1 | c2 | c4
----+----+----
 11 | 11 | 11
  1 |  1 |  1
  2 |  2 |  2
 
db02=# select * from son;
 c1 | c2 | c3 | c4
----+----+----+----
  1 |  1 |  1 |  1
db02=# select * from daughter;
 c1 | c2 | c3 | c4 | c5
----+----+----+----+----
  2 |  2 |  2 |  2 |  2
db02=# select * from cousin;
 c1 | c2 | c3 | c6
----+----+----+----
  3 |  3 |  3 |  3

Developing The ETL/Migration Model

Performing data migration under production conditions should take into consideration these four (4) distinct query operations:

  • SELECT from both the target and source tables at the same time.
  • UPDATE and/or DELETE records from both the target and source tables.
  • INSERT new records into the target table.
  • Moving data from the source to the target tables.

For the sake of discussion, we’ll demonstrate using one source and one target table, each inheriting from a single parent:

create table parent(c1 int primary key, c2 int, c3 int);
create table source(like parent including all) inherits (parent);
create table target(like parent including all) inherits (parent);

Querying Both Target And Source Tables

Inheritance makes the SELECT query a straightforward operation. This query checks all the tables for the queried record(s):

explain select * from parent;
QUERY PLAN
--------------------------------------------------------------------------
Append (cost=0.00..81.21 rows=4081 width=12)
-> Seq Scan on parent parent_1 (cost=0.00..0.00 rows=1 width=12)
-> Seq Scan on source parent_2 (cost=0.00..30.40 rows=2040 width=12)
-> Seq Scan on target parent_3 (cost=0.00..30.40 rows=2040 width=12)

UPDATE and/or DELETE Records

Similarly to SELECT queries, one doesn’t have to worry about editing the existing application’s DML operation(s) when performing UPDATE and DELETE. Again, notice how both source and target tables are queried as well as the parent:

explain update parent set c2=0 where c1=1;
QUERY PLAN
---------------------------------------------------------------------------------
 Update on parent  (cost=0.00..16.34 rows=3 width=18)
   Update on parent
   Update on source
   Update on target
   ->  Seq Scan on parent  (cost=0.00..0.00 rows=1 width=18)
         Filter: (c1 = 1)
   ->  Index Scan using source_pkey on source  (cost=0.15..8.17 rows=1 width=18)
         Index Cond: (c1 = 1)
   ->  Index Scan using target_pkey on target  (cost=0.15..8.17 rows=1 width=18)
         Index Cond: (c1 = 1)

INSERT New Records Into The Target Table

The thing to keep in mind about INSERT is that without a redirect mechanism all records are inserted into the parent.

Since everybody already knows about triggers, I thought it would be fun to use a REWRITE RULE instead:

create rule child_insert as
on
    insert to parent
do instead
    insert into target values (NEW.*);

 

Table "public.parent"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 c1     | integer |           | not null |         | plain   |              |
 c2     | integer |           |          |         | plain   |              |
 c3     | integer |           |          |         | plain   |              |
Indexes:
    "parent_pkey" PRIMARY KEY, btree (c1)
Rules:
    child_insert AS
    ON INSERT TO parent DO INSTEAD  INSERT INTO target (c1, c2, c3)
  VALUES (new.c1, new.c2, new.c3)
Child tables: source,
              target

And here’s our validation; notice how the INSERT is redirected from parent to target:

EXPLAIN insert into parent (c1,c2,c3) values (1,1,1);
QUERY PLAN
-----------------------------------------------------
 Insert on target  (cost=0.00..0.01 rows=1 width=12)
   ->  Result  (cost=0.00..0.01 rows=1 width=12)

Moving Records Between Source And Target Tables

It’s time to introduce the last mechanism needed to perform the actual data migration. Essentially, the data is moved in batches; if you can afford the downtime of moving your records in one very large transaction, this dog and pony show is a bit redundant.

In the real world, we need to anticipate multiple processes attempting simultaneous EXCLUSIVE LOCKS. The following example demonstrates how one can simply skip over any records that are locked by another operation:

--
-- EX: insert a single record in table "source"
--
insert into source (c1,c2,c3) values (2,2,2);

-- move 1,000 records at a time
-- from table "source" to "target"
--
with a as (select * from source for update skip locked limit 1000),
     b as (delete from source using a where c1=a.c1
           returning    source.c1,
                        source.c2,
                        source.c3)
insert into target select * from b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Insert on target  (cost=27.92..41.52 rows=680 width=12) (actual time=0.082..0.082 rows=0 loops=1)
   CTE b
     ->  Delete on source  (cost=9.42..27.92 rows=680 width=6) (actual time=0.050..0.053 rows=1 loops=1)
           ->  Bitmap Heap Scan on source  (cost=9.42..27.92 rows=680 width=6) (actual time=0.021..0.023 rows=1 loops=1)
                 Recheck Cond: (c1 >= 0)
                 Heap Blocks: exact=1
                 ->  Bitmap Index Scan on source_pkey  (cost=0.00..9.25 rows=680 width=0) (actual time=0.010..0.011 rows=1 loops=1)
                       Index Cond: (c1 >= 0)
   ->  CTE Scan on b  (cost=0.00..13.60 rows=680 width=12) (actual time=0.054..0.057 rows=1 loops=1)
 Planning Time: 0.237 ms
 Execution Time: 0.173 ms

Putting It All Together

It’s time to demonstrate a Proof Of Concept using pgbench.

SETUP

Initialize database db02:

dropdb --if-exists db02
createdb db02
pgbench -s 10 -i db02
List of relations
 Schema |       Name       | Type  |  Owner   |  Size   | Description
--------+------------------+-------+----------+---------+-------------
 public | pgbench_accounts | table | postgres | 128 MB  |
 public | pgbench_branches | table | postgres | 40 kB   |
 public | pgbench_history  | table | postgres | 0 bytes |
 public | pgbench_tellers  | table | postgres | 40 kB   |

Create the tables parent and child.

NOTE: In order to demonstrate data migration from a deprecated table, table pgbench_accounts is altered by adding OIDs.

create table parent (like pgbench_accounts including all) without oids;
create table child (like pgbench_accounts including all) inherits (parent) without oids;

alter table pgbench_accounts set with oids, inherit parent;

alter table pgbench_accounts rename to pgbench_accounts_deprecated;
alter table parent rename to pgbench_accounts;

TEST

This query is at the heart of the solution. Any exclusively locked record that it tries to move is automatically skipped and a new attempt can be made the next time this script is invoked.

with a as (select * from pgbench_accounts_deprecated order by 1 for update skip locked limit 10),
     b as (delete
                from pgbench_accounts_deprecated
                using a
                where pgbench_accounts_deprecated.aid=a.aid
                returning pgbench_accounts_deprecated.aid,
                          pgbench_accounts_deprecated.bid,
                          pgbench_accounts_deprecated.abalance,
                          pgbench_accounts_deprecated.filler)
insert into child select * from b;

VALIDATE

explain analyze select * from pgbench_accounts order by 1 limit 13;
QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..1.45 rows=13 width=97) (actual time=0.012..0.016 rows=13 loops=1)
   ->  Merge Append  (cost=0.72..56212.42 rows=1000211 width=97) (actual time=0.011..0.013 rows=13 loops=1)
         Sort Key: pgbench_accounts.aid
         ->  Index Scan using parent_pkey on pgbench_accounts  (cost=0.12..8.14 rows=1 width=352) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts_deprecated  (cost=0.42..43225.43 rows=1000000 width=97) (actual time=0.006..0.007 rows=3 loops=1)
         ->  Index Scan using child_pkey on child  (cost=0.14..51.30 rows=210 width=352) (actual time=0.002..0.003 rows=10 loops=1)
 Planning Time: 0.084 ms
 Execution Time: 0.030 ms

MIGRATION SCRIPT

The query has been incorporated into this script moving 1,000 records every 5 seconds.

#!/bin/bash

set -e

export PGHOST=/tmp PGPORT=10011 PGDATABASE=db02 PGUSER=postgres

SLEEP=5
REC=1000

SQL="
with a as (select * from pgbench_accounts_deprecated order by 1 for update skip locked limit $REC),
     b as (delete
                from pgbench_accounts_deprecated
                using a
                where pgbench_accounts_deprecated.aid=a.aid
                returning pgbench_accounts_deprecated.aid,
                          pgbench_accounts_deprecated.bid,
                          pgbench_accounts_deprecated.abalance,
                          pgbench_accounts_deprecated.filler)
insert into child select * from b;

with a(count_child)    as (select count(*) from child),
     b(count_accounts) as (select count(*) from pgbench_accounts_deprecated)
select a.count_child, b.count_accounts from a,b;
"

while true
do
    echo "--- $(date): Executing Query, moving $REC records now ... ---"
    psql <<<"$SQL"
    echo "sleeping: $SLEEP seconds ..." && sleep $SLEEP
done

BENCHMARKING

While the aforementioned script is active, pgbench runs the benchmark.

#
# doesn't block
#
pgbench -c 2 -j 4 --protocol=simple -T 120 db02

CAVEAT

This is a simple and powerful method, but there is a limitation: common columns between tables must be of the same data type.

For example, if column c1 in table source is of data type int and you want to migrate the data into table target where the same column c1 has data type bigint, then this method won’t work. An alternate solution, however, could take advantage of Updatable Views, which you can read more about here, together with the appropriate triggers and rewrite rules.

Jul
09
2021
--

MySQL/ZFS Performance Update


As some of you likely know, I have a favorable view of ZFS, and especially of MySQL on ZFS. As I published a few years ago, the argument for ZFS was less about performance than about its useful features like data compression and snapshots. At the time, ZFS was significantly slower than xfs and ext4, except when the L2ARC was used.

Since then, however, ZFS on Linux has progressed a lot, and I have also learned how to tune it better. I also found out that the sysbench benchmark I used at the time was not a fair choice, since the dataset it generates compresses much less than a realistic one. For all these reasons, I believe it is time to revisit the performance aspect of MySQL on ZFS.

ZFS Evolution

In 2018, I reported ZFS performance results based on version 0.6.5.6, the default version available in Ubuntu Xenial. The present post uses version 0.8.6-1 of ZFS, the default one available on Debian Buster. Between the two versions, there are in excess of 3600 commits adding a number of new features, like support for trim operations and the addition of the efficient zstd compression algorithm.

ZFS 0.8.6-1 is not bleeding edge; there have been more than 1700 commits since, and after 0.8.6 the ZFS release number jumped to 2.0. The big addition included in the 2.0 release is native encryption.

Benchmark Tools

The classic sysbench MySQL database benchmarks have a dataset containing mostly random data. Such datasets don’t compress much, less than most real-world datasets I worked with. The compressibility of the dataset is important since ZFS caches, the ARC and L2ARC, store compressed data. A better compression ratio essentially means more data is cached and fewer IO operations will be needed.

A well-known tool to benchmark a transactional workload is TPCC. Furthermore, the dataset created by TPCC compresses rather well making it more realistic in the context of this post. The sysbench TPCC implementation was used.

Test Environment

Since I am already familiar with AWS and Google cloud, I decided to try Azure for this project. I launched these two virtual machines:

tpcc:

  • benchmark host
  • Standard D2ds_v4 instance
  • 2 vCPUs, 8 GB of RAM, and 75 GB of temporary storage
  • Debian Buster

db:

  • Database host
  • Standard E4-2ds-v4 instance
  • 2 vCPUs, 32 GB of RAM, and 150 GB of temporary storage
  • 256GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
  • Debian Buster
  • Percona server 8.0.22-13

Configuration

By default and unless specified, the ZFS filesystems are created with:

zpool create bench /dev/sdc
zfs set compression=lz4 atime=off logbias=throughput bench
zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k \
           -o primarycache=metadata bench/data
zfs create -o mountpoint=/var/lib/mysql/log bench/log

There are two ZFS filesystems. bench/data is optimized for the InnoDB dataset while bench/log is tuned for the InnoDB log files. Both are compressed using lz4 and the logbias parameter is set to throughput which changes the way the ZIL is used. With ext4, the noatime option is used.

ZFS also has a number of kernel module parameters; the ones set to non-default values are:

zfs_arc_max=2147483648
zfs_async_block_max_blocks=5000
zfs_delete_blocks=1000
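
These are OpenZFS kernel module parameters; a rough sketch of how they can be persisted and applied at runtime follows (the modprobe.d path is the usual ZFS-on-Linux convention and is an assumption for your distribution):

# Persist the settings so they are applied whenever the zfs module is loaded
echo "options zfs zfs_arc_max=2147483648 zfs_async_block_max_blocks=5000 zfs_delete_blocks=1000" \
  | sudo tee /etc/modprobe.d/zfs.conf

# The ARC limit can also be adjusted on a running system
echo 2147483648 | sudo tee /sys/module/zfs/parameters/zfs_arc_max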

Essentially, the above settings limit the ARC size to 2GB and they throttle down the aggressiveness of ZFS for deletes. Finally, the database configuration is slightly different between ZFS and ext4. There is a common section:

[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
log-error = /var/log/mysql/error.log
skip-log-bin
datadir = /var/lib/mysql/data
innodb_buffer_pool_size = 26G
innodb_flush_log_at_trx_commit = 1 # TPCC reqs.
innodb_log_file_size = 1G
innodb_log_group_home_dir = /var/lib/mysql/log
innodb_flush_neighbors = 0
innodb_fast_shutdown = 2

and when ext4 is used:

innodb_flush_method = O_DIRECT

and when ZFS is used:

innodb_flush_method = fsync
innodb_doublewrite = 0 # ZFS is transactional
innodb_use_native_aio = 0
innodb_read_io_threads = 10
innodb_write_io_threads = 10

ZFS doesn’t support O_DIRECT but it is ignored with a message in the error log. I chose to explicitly set the flush method to fsync. The doublewrite buffer is not needed with ZFS and I was under the impression that the Linux native asynchronous IO implementation was not well supported by ZFS so I disabled it and increased the number of IO threads. We’ll revisit the asynchronous IO question in a future post.

Dataset

I use the following command to create the dataset:

./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=8 --tables=10 --scale=200 --db-driver=mysql prepare

The resulting dataset has a size of approximately 200GB. The dataset is much larger than the buffer pool so the database performance is essentially IO-bound.

Test Procedure

The execution of every benchmark was scripted (a rough sketch follows the list) and followed these steps:

  1. Stop MySQL
  2. Remove all datafiles
  3. Adjust the filesystem
  4. Copy the dataset
  5. Adjust the MySQL configuration
  6. Start MySQL
  7. Record the configuration
  8. Run the benchmark
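
A minimal bash sketch of such a driver script, purely illustrative; the paths, dataset location, and wrapper script names are assumptions, not the author's actual tooling:

#!/bin/bash
set -e

DATADIR=/var/lib/mysql/data        # matches the datadir in the my.cnf above
LOGDIR=/var/lib/mysql/log          # matches innodb_log_group_home_dir above
DATASET=/backup/tpcc-200w          # assumed location of the prepared TPCC dataset copy
CONFIG=$1                          # e.g. ./bench.sh my-zfs.cnf or ./bench.sh my-ext4.cnf

systemctl stop mysql                               # 1. stop MySQL
rm -rf "${DATADIR:?}"/* "${LOGDIR:?}"/*            # 2. remove all datafiles
# 3. adjust the filesystem here (recreate/tune the ZFS datasets or remount ext4)
cp -a "${DATASET}/data/." "${DATADIR}/"            # 4. copy the dataset
cp -a "${DATASET}/log/."  "${LOGDIR}/"
cp "${CONFIG}" /etc/mysql/my.cnf                   # 5. adjust the MySQL configuration
systemctl start mysql                              # 6. start MySQL
cp /etc/mysql/my.cnf "run-$(date +%F-%H%M).cnf"    # 7. record the configuration
./run_tpcc.sh | tee "run-$(date +%F-%H%M).log"     # 8. run the benchmark (wrapper around the tpcc.lua call below)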

Results

For the benchmark, I used the following invocation:

./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=16 --time=7200 --report-interval=10 --tables=10 --scale=200 --db-driver=mysql run

The TPCC benchmark uses 16 threads for a duration of 2 hours. The duration is sufficiently long to allow for a steady state and to exhaust the storage burst capacity. Sysbench returns the total number of TPCC transactions per second every 10s. This number includes not only the New Order transactions but also the other transaction types like payment, order status, etc. Be aware of that if you want to compare these results with other TPCC benchmarks.

In those conditions, the figure below presents the rates of TPCC transactions over time for ext4 and ZFS.

TPCC transactions ZFS

MySQL TPCC results for ext4 and ZFS

During the initial 15 minutes, the buffer pool warms up, but at some point the workload shifts from IO-read bound to IO-write and CPU bound. Then, at around 3000s, the SSD Premium burst capacity is exhausted and the workload is only IO-bound. I was surprised enough by the results to rerun the benchmarks to make sure. The results for both ext4 and ZFS are qualitatively similar; any difference is within the margin of error. That essentially means that if you configure ZFS properly, it can be as IO efficient as ext4.

What is interesting is the amount of storage used. While the dataset on ext4 consumed 191GB, the lz4 compression of ZFS yielded a dataset of only 69GB. That’s a huge difference, a factor of 2.8, which could save a decent amount of money over time for large datasets.

Conclusion

It appears that it was indeed a good time to revisit the performance of MySQL with ZFS. In a fairly realistic use case, ZFS is on par with ext4 regarding performance while still providing the extra benefits of data compression, snapshots, etc. In a future post, I’ll examine the use of cloud ephemeral storage with ZFS and see how this can further improve performance.

Mar
08
2021
--

Testing the Value of ScaleFlux Computational Storage Drive (CSD) for PostgreSQL

ScaleFlux Computational Storage Drive PostgreSQL

Some time ago we at Percona were approached by ScaleFlux Inc to benchmark their latest hardware appliance, the CSD 2000 Drive, which is a next-generation SSD computational storage drive. It goes without saying that a truly relevant report requires us to be as honest and as forthright as possible. In other words, my mission was to, ahem, see what kind of mayhem I could cause.

Benchmarking is a bit like cooking; it requires key ingredients, strict adherence to following a set of instructions, mixing the ingredients together, and a bit of heat to make it all happen. In this case, the ingredients include the Linux OS running Ubuntu 18.04 on both the database and the bench-marking hosts, PostgreSQL version 12, SysBench the modular, cross-platform, and multi-threaded benchmark tool, and a comparable, competing appliance i.e. the Intel DC P4610 series drive. The two appliances are mounted as partitions respectively both using the same type of file system.

 

Once the environment is ready, the next step involves declaring and implementing the benchmarking rules, which consist of various types of DML and DDL activity. Keep in mind that, apart from the classic OLAP vs. OLTP modes of database processing, executing a benchmark that closely follows real production activity can be problematic; when pushed to full capacity, every production system is to some extent unique. Therefore, for our purposes, we used the testing regime SysBench offers by default.
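
For reference, a default SysBench OLTP run against PostgreSQL looks roughly like the following (after a matching prepare step); the host, credentials, and table counts are placeholders rather than the exact matrix used for the whitepaper:

sysbench oltp_read_write --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-user=sbtest \
  --pgsql-password=sbtest --pgsql-db=sbtest --tables=10 --table-size=1000000 \
  --threads=16 --time=300 run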

Once the system was ready, loading started out slow and gentle. The idea was to develop a baseline for the various types of activity and Postgres runtime conditions. Then, the bench-marking intensity was gradually increased to the point where we eventually started getting interesting results.


Needless to say, it took quite a bit of time running the various permutations, double-checking our numbers, graphing the data, and then, after all that, interpreting the output. I’m not going to go into great detail about the analysis itself. Instead, I encourage you to look at the whitepaper.

So after all this effort, what was the takeaway?

There are two key observations that I’d like to share:

  1. At peak loading, the ScaleFlux CSD 2000 Drive demonstrated less performance variance than the Intel DC P4610. Variance here is the statistical encapsulation of the IO read-write spread between maximum and minimum values. The significance is server predictability. This becomes important when, for example, finely tuned application processes depend upon consistent performance from the RDBMS. Many times I’ve seen applications get upset when the response time between inserting, updating, or deleting data and getting the resultant queries suddenly changes.
  2. Remarkable space savings were realized when the Postgres fillfactor was reduced. As you know, the fillfactor can become a critical runtime parameter in regards to performance when high-frequency UPDATE and DELETE operations take place on the same tuple over and over again.

Finally, one last item… I didn’t mention it but we also benchmarked MySQL for ScaleFlux. The results were pretty remarkable. It’s worth your while to have a look at that one too.

ScaleFlux White Papers:

Feb
26
2021
--

Connection Queuing in pgBouncer: Is it a Magical Remedy?

Connection Queuing in pgBouncer

Yes, this post is about connection queueing, not just pooling. Because “connection pooling” – pre-created connections as a pool – is a much-celebrated feature. Almost every discussion on connection pool/pgBouncer starts with the overhead of establishing a new connection to PostgreSQL… and how pre-created connections in the pool can save the world.

But there is a non-celebrity feature in pgBouncer (not denying the others) that can address some really big operational challenges: connection queueing. Many new PostgreSQL users don’t know that something like this exists. In this blog post, I am planning to discuss some of the problems related to connection management first, and then explore how connection queueing can address those problems.

Problem 1: Spike in Load can Jam and Halt the Server

PostgreSQL has a dedicated backend server process for every user connection, so there will be one backend process running on a CPU core for every active query/session. This means there is a one-to-one mapping between active sessions and running processes on the server. If we consider parallel execution of SQL statements, there will be even more running processes than active sessions. In many real-world cases, a sudden spike in load can result in hundreds of active queries starting at once while the server is equipped with a small number of CPUs (sometimes just virtual CPUs with only a fraction of the performance). As the number of active sessions/processes increases, the overhead of scheduling and context switches takes over. Many times, the host server becomes unresponsive, and even opening a bash shell/terminal can take time. This is quite easy to simulate: just 10 active connections on a two-vCPU server with a SELECT-only workload can cause it.
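
A rough sketch of generating such a SELECT-only load with pgbench (assuming the pgbench tables were initialized beforehand; the host and duration are placeholders, and the pattern mirrors the pgbench invocation used later in this post):

pgbench -c 10 -j 10 -S -T 300 -h pghost -U postgres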

With two active sessions:

$ time ssh -t postgres@pghost 'ls -l'
real 0m0.474s
user 0m0.017s
sys 0m0.006s

When there are 10 active sessions on PostgreSQL, just establishing an ssh connection to the server took 15 seconds.

real 0m15.307s
user 0m0.026s
sys 0m0.015s

**These are indicative numbers from a very specific system and do not qualify for a benchmark.

Generally, we could see that as the number of active sessions approaches double the number of CPU cores the performance penalty starts increasing heavily.

Many times, the problem won’t end there. Session-level resource allocations (work_mem, temporary tables, etc.) can add to overall server resource consumption. As the host server slows down, each session takes more time to complete while holding its resources, which can lead to even more accumulation of active sessions. It is a spiral of evil. There are many real-world cases where the entire show ended in a complete halt of the host server, or the OOM killer kicked in, terminating the PostgreSQL process and forcing it into crash recovery.

 


Problem 2: “Too Many Clients Already” Errors

A few smart DBAs will prevent this database disaster by setting max_connections to a smaller value than the database can handle, which is the right thing to do from a DB server perspective. Allowing an excessive number of connections to the database can lead to different types of abuse, attacks, and disasters. But the flip side is that an abusive application may be greeted with a message as follows:

FATAL:  sorry, too many clients already

The same will be logged in the PostgreSQL log.

2021-02-15 13:40:50.017 UTC [12829] FATAL:  sorry, too many clients already

Unfortunately, this could lead to an application crash or misbehavior. From the business point of view, we just shifted the problem from database to application.

Problem 3: Big max_connection Value and Overhead

Because of the above-mentioned problem, it is common to see max_connections set to a very high value. The overhead of connections is probably one of the most actively discussed topics these days because PostgreSQL 14 is expected to include some of the connection scalability fixes. Andres Freund blogged about the details of analyzing the connection scalability problem and how it was addressed.

Even an idling connection may occupy server resources like memory. The overhead is considered very low on a properly configured server; however, the impact can be heavy in reality. Again, a lot depends on the workload. There are at least a few reported cases of up to 50MB of memory consumption per session, which means 500 idle connections can result in up to 25GB of memory usage.

In addition to this, more connections can lead to more lock-management-related overhead. And don’t forget that the system becomes more vulnerable to sudden spikes as max_connections is increased.

Solution: Connection Queueing

At the very least, connection queueing is the queueing of connections so that they can absorb a sudden spike in load. Connections can be put into a queue rather than being rejected straight away or sent to the server to jam it. This streamlines the execution: the PostgreSQL server can keep doing what it can do rather than dealing with a jam.

Let me demonstrate with an example. For this demonstration, I set max_connections to “2”, assuming that this is the maximum the server can accommodate without causing too many context switches; more connections than that should not come in and overload my database.

postgres=# show max_connections ;
2

A third connection to the database will result in an error as expected.

psql: error: FATAL: sorry, too many clients already

Now let’s use pgBouncer for the connection queue. Many users may not know that this queueing exists by default. I used the following pgbouncer configuration for testing:

[databases]
pgbounce = host=172.16.2.16 port=5432 dbname=postgres

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = pgbouncer.log
pidfile = pgbouncer.pid
admin_users = postgres
application_name_add_host=1
default_pool_size=1
min_pool_size=1
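
With this configuration in place, clients point at the pooler’s listen port rather than at PostgreSQL directly; a rough example (the pgBouncer host name is a placeholder, while the port and database alias come from the configuration above):

psql -h pgbouncer_host -p 6432 -U postgres pgbounce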

Yes, the pooler will establish only one connection to the database. pgBouncer establishes this one connection when the first client connection arrives, because min_pool_size is 1. The pgBouncer log says:

2021-02-16 15:19:30.725 UTC [2152] LOG C-0x1fb0c58: pgbounce/postgres@172.16.2.56:54754 login attempt: db=pgbounce user=postgres tls=no
2021-02-16 15:19:30.726 UTC [2152] LOG S-0x1fbba80: pgbounce/postgres@172.16.2.16:5432 new connection to server (from 172.16.2.144:58968)

pgbouncer pool statistics also shows the details:

pgbouncer=# show pools;
 database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         1 |          0 |         0 |       0 |       1 |         0 |        0 |       0 |          0 | session

But the beauty is that we won’t get any more “FATAL: sorry, too many clients already” errors. All client connections are accepted and put into the connection queue. For example, here I have five client connections; please see the value of cl_active:

pgbouncer=# show pools;
 database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         5 |          0 |         0 |       0 |       1 |         0 |        0 |       0 |          0 | session

As the client connections become active (each with a SQL statement), those that cannot get a server connection are put into the waiting queue:

pgbouncer=# show pools;
 database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         1 |          4 |         1 |       0 |       0 |         0 |        0 |      28 |     438170 | session

Each client connection is executed over the available database connection, one after another. This was a case with a single database connection. If the connection count and pool size are increased, multiple client connections can hit the server at the same time and the queue size drops. The following is a case with two connections (pool size two) to the database:

database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         2 |          1 |         2 |       0 |       0 |         0 |        0 |       4 |     978081 | session

Putting Connection Queueing into a Simple Test

This is not extensive benchmarking, but a quick test to see the benefits on a typical host with two virtual CPUs. I created 20 active connections to PostgreSQL with a select-only load using pgbench:

pgbench -c 20 -j 20 -h pghost -U postgres -S -T 100

As the 20 server processes started running, the load average went through the roof.

As you can see in the screenshot, the load average spiked to over 17. And as expected, the server response also became consistently very poor.

time ssh -t postgres@pghost 'ls -l'
real 0m17.768s
user 0m0.019s
sys 0m0.007s

At this stage, I tried sending the same 20 active connections through the connection queue of pgBouncer with a pool size of four (default_pool_size=4). The pgBouncer runs on the client side.
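
The invocation is the same pgbench command as before, just pointed at the pooler; a sketch assuming pgBouncer listens locally on port 6432 with the database alias from the earlier configuration:

pgbench -c 20 -j 20 -h 127.0.0.1 -p 6432 -U postgres -S -T 100 pgbounce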

Since there are only four server-side processes, the load average dropped drastically. The maximum I could see is 1.73:

The server response is also very good.

$ time ssh -t postgres@pghost 'ls -l'
real 0m0.559s
user 0m0.021s
sys 0m0.008s

A load average of 17+ vs 1.73! That must be too good to be true.

There was a bit of skepticism about whether the low load on the server and the better server response were coming at the cost of database throughput. I was expecting to see not-so-great throughput numbers, so I took the same test to a more consistently-performing platform (AWS r5.large with two virtual CPUs). To my surprise, the numbers were even better.

The following are the numbers I got. At least it is not bad; it’s better, by about 5 percent.

Direct          With Queueing
20190.301781    21663.454921
20308.115945    21646.195661
20434.780692    21218.44989

Since we are using just four connections on the database side in this case, it also gives us the opportunity to reduce the max_connections value on the database side. Another check was whether switching to transaction-level pooling could save more, because the database connection goes back to the pool after each transaction and can serve another client connection. This could result in better concurrency.

Queue + max_connection=5    Queue + max_connection=5 + transaction-level pool
21897.685891                23318.655016
21913.841813                23486.485856
21933.129685                23633.39607

As expected, it delivered even better numbers. I would like to encourage readers to do more tests and proper benchmarking.

Summary and References

A really large number of application/client connections can be multiplexed over very few database connections using connection queueing. This queue helps absorb spikes in connections without overloading the database server. Instead of session processes queueing up on the OS scheduler/run queue, the connections are kept safely outside and the database server can operate at full throttle without contention. Streamlined database traffic also results in better throughput.

Jan
27
2021
--

Deploying Any Version of Leading Open Source Databases for Tests and Experiments

Deploying Any Version of Leading Open Source Databases

I want to present a tool for running a specific version of open source databases in a single instance, replication setups, and Kubernetes. AnyDbVer deploys MySQL/MariaDB/MongoDB/PostgreSQL for testing and experiments.

It Could Be Started By…

Docker (or Podman) or dbdeployer (MySQL-Sandbox successor) could also start a specific database version, but such installations are significantly different from production setups.

Ansible Playbook

There is an Ansible playbook with a few roles configurable by environment variables.

Bash Scripts

You may ignore the Ansible layer because the ./anydbver bash script hides it.

In LXD containers

It’s not practical to reserve physical servers or virtual machines for all tasks. Frequently you need just a few database instances and the ability to restart them with systemd and check logs with journalctl. AnydbVer spawns one or more Linux containers managed by Canonical (Ubuntu) LXD, and LXD containers support systemd without requiring a layered filesystem setup.

Best Performance with Linux Running on Hardware Directly

Linux Containers are not an emulation layer. From the “host” system, you can see all processes and files created by containers. Thus the performance is very similar to the same database running on a physical server.

You Can Run Vagrant + VirtualBox as Well, For Other OS

LXD setup is relatively easy, but sometimes you may need to run AnydbVer just with a few commands. Vagrant could start an unmodified Ubuntu virtual machine and automatically configure AnydbVer for you.

You can find the required files and setup instructions at https://github.com/ihanick/anydbver.

Clone the https://github.com/ihanick/anydbver repository to set up LXD or start Vagrant.

git clone https://github.com/ihanick/anydbver.git
cd anydbver
vagrant up
vagrant ssh
cd anydbver

Single Instance Usage

Imagine that you need the exact CentOS 7 package version: Percona Server for MySQL 5.6.43-rel84.3:

$ ./anydbver deploy percona-server:5.6.43-rel84.3
$ ./anydbver ssh
$ mysql
mysql> select version();

You are not limited to using the full version specification. To use the latest matching version, reduce 5.6.43-rel84.3 down to 5.6.43 or even 5.6 (see the example after the list below). To run other databases, replace percona-server with:

  • pxc: Percona XtraDB Cluster
  • mysql: Oracle MySQL Community Server
  • mariadb: MariaDB
  • mariadb-cluster: MariaDB Galera Cluster
  • pg: Postgres build from https://www.postgresql.org/
  • ppg: Percona Distribution for PostgreSQL
  • psmdb: Percona Server for MongoDB
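
For instance, the following relaxed specifications would pick the latest matching builds (a small sketch based on the matching rule above; the exact versions resolved depend on what is currently in the repositories):

$ ./anydbver deploy percona-server:5.6   # latest matching Percona Server 5.6.x build
$ ./anydbver deploy pg:12                # latest PostgreSQL 12.x build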

Multiple Instances

Several containers do not consume a significant amount of resources; five MySQL container instances consume roughly the same resources as five individual processes running on the same host. The syntax is:

$ ./anydbver deploy <default node definition> node1 <node1 definition> node2 <node2 definition> ...

For example, run two independent MySQL instances:

$ ./anydbver deploy mysql node1 mysql

SSH access to default node:

$ ./anydbver ssh
# or
$ ./anydbver ssh default

All other nodes (replace node1 with other node names):

$ ./anydbver ssh node1

Hostnames

You can specify a server hostname. For example, let’s run two Percona Server for MySQL instances:

$ ./anydbver deploy ps:5.7 hostname:leader.percona.local node1 ps:5.7 hostname:follower.percona.local
$ ./anydbver ssh leader
# or: ./anydbver ssh leader.percona / ./anydbver ssh leader.percona.local
[root@leader ~]# mysql --host follower.percona.local --prompt '\h mysql>'
follower.percona.local mysql>

Replication

The most interesting part of modern open-source databases is replication. Even active-active replication setups start from a single server (leader or master/primary). Start the first node normally and attach additional nodes with master:nodename or leader:nodename. PXC or Galera servers can participate in both synchronous and asynchronous replication; thus, for Galera clusters, you need the galera-master or galera-leader syntax.

Start a 3 node Percona XtraDB cluster (latest 5.7):

./anydbver deploy pxc:5.7 node1 pxc:5.7 galera-master:default node2 pxc:5.7 galera-master:default

Run master and two async slaves with Percona Server for MySQL and add all three servers to ProxySQL setup:

./anydbver deploy ps:5.7 node1 ps:5.7 master:default node2 ps:5.7 master:default node3 proxysql master:default

Setup physical replication with slots for Postgresql 12.3:

./anydbver deploy pg:12.3 node1 pg:12.3 master:default

Make a Mongo replica set named rs0:

./anydbver deploy psmdb replica-set:rs0 node1 psmdb master:default replica-set:rs0 node2 psmdb master:default replica-set:rs0

MongoDB Sharding

MongoDB sharding setup requires several server types: servers with data (shardsrv), configuration servers (configsrv), and mongos server:

./anydbver deploy \
psmdb:4.2 replica-set:rs0 shardsrv \
node1 psmdb:4.2 master:default replica-set:rs0 shardsrv \
node2 psmdb:4.2 master:default replica-set:rs0 shardsrv \
node3 psmdb:4.2 configsrv replica-set:cfg0 \
node4 psmdb:4.2 configsrv replica-set:cfg0 master:node3 \
node5 psmdb:4.2 configsrv replica-set:cfg0 master:node3 \
node6 psmdb:4.2 mongos-cfg:cfg0/node3,node4,node5 mongos-shard:rs0/default,node1,node2

Containers and Orchestration

The fact that we are already using containers (LXD) shouldn’t confuse you. We can still run Docker images inside our nodes; nested containers and Podman make this possible.

Run Percona Monitoring and Management Docker containers

Let’s deploy the default node with Podman and run the Percona Monitoring and Management (PMM) server docker container in it. Percona Server for MySQL 5.7 with PMM client will run on node1:

./anydbver deploy pmm node1 ps:5.7 pmm-client pmm-server:default

Run multi-node Kubernetes cluster

Kubernetes can also utilize nested containers. There are several small Kubernetes distributions: minikube, microk8s, k3s, k0s. The simplest fully functional Kubernetes setup can be implemented with k3s.

The first k3s LXD container runs the API server and workers. Additional LXD containers can run more workers. Multiple workers are important for running complex HA applications with hostname anti-affinity.

./anydbver deploy k3s node1 k3s-master:default node2 k3s-master:default
./anydbver ssh

On a default node, we can execute kubectl or helm:

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ihanick-node2 Ready <none> 14m v1.19.2+k3s1
ihanick-default Ready master 17m v1.19.2+k3s1
ihanick-node1 Ready <none> 15m v1.19.2+k3s1

Percona Kubernetes Operator for Percona XtraDB Cluster

The Kubernetes cluster allows running stateful applications, like databases, services (similar to Amazon AWS S3), or monitoring solutions. Let’s start:

  • 4 node cluster
  • MinIO (implements S3 api) for backups
  • Percona XtraDB Cluster (managed by the operator)
  • Percona Monitoring and Management
./anydbver deploy k3s \
node1 k3s-master:default \
node2 k3s-master:default \
node3 k3s-master:default \
default k8s-minio k8s-pmm k8s-pxc backup
./anydbver ssh
kubectl get pods
NAME READY STATUS RESTARTS AGE
svclb-monitoring-service-n5nsx 1/1 Running 0 20m
svclb-monitoring-service-htssw 1/1 Running 0 20m
svclb-monitoring-service-n9kt4 1/1 Running 0 20m
svclb-monitoring-service-7btbh 1/1 Running 0 20m
minio-service-6db6667fb9-tk69n 1/1 Running 0 20m
monitoring-0 1/1 Running 0 20m
percona-xtradb-cluster-operator-7886ccf6b5-rtwxc 1/1 Running 0 18m
cluster1-pxc-0 2/2 Running 2 17m
cluster1-haproxy-0 2/2 Running 0 17m
cluster1-haproxy-1 2/2 Running 0 12m
cluster1-haproxy-2 2/2 Running 0 12m
cluster1-pxc-1 2/2 Running 1 13m
cluster1-pxc-2 2/2 Running 1 10m

Percona Kubernetes Operator for Percona Server for MongoDB

You can install and configure the database in the same way, regardless of the architecture:

./anydbver deploy k3s \
  node1 k3s-master:default \
  node2 k3s-master:default \
  node3 k3s-master:default \
  default k8s-minio k8s-mongo backup

Zalando Postgres Operator

You are not limited to Percona-only products and can run other K8s operators:

./anydbver deploy k3s node1 k3s-master:default node2 k3s-master:default node3 k3s-master:default default k8s-pg

Summary

AnyDbVer is a useful deployment tool for experiments with MySQL, MariaDB, MongoDB, and PostgreSQL in single-instance, replication, and Kubernetes setups.

AnyDbVer restrictions:

  • It is not a production deployment tool.
  • The deployment process takes a long time, because the tool downloads and installs all packages from OS and vendor repositories.
Jan
22
2021
--

PostgreSQL on ARM-based AWS EC2 Instances: Is It Any Good?

PostgreSQL on ARM-based AWS EC2

The expected growth of ARM processors in data centers has been a hot topic of discussion for quite some time, and we were curious to see how it performs with PostgreSQL. The lack of generally available ARM-based servers for testing and evaluation was a major obstacle. The icebreaker came when AWS announced its ARM-based processor offering in its cloud in 2018. But we couldn’t see much excitement immediately, as many considered it “experimental” stuff. We were also cautious about recommending it for critical use and never put enough effort into evaluating it. But when the second-generation, Graviton2-based instances were announced in May 2020, we wanted to take a serious, independent look at the price/performance of the new instances from the standpoint of running PostgreSQL.

Important: Note that while it’s tempting to call this a comparison of PostgreSQL on x86 vs. ARM, that would not be correct. These tests compare PostgreSQL on two virtual cloud instances, and that includes far more moving parts than just the CPU. We’re primarily focusing on the price/performance of two particular AWS EC2 instances based on two different architectures.

Test Setup

For this test, we picked two similar instances. One is the older m5d.8xlarge, and the other is a new Graviton2-based m6gd.8xlarge. Both instances come with local “ephemeral” storage that we’ll be using here. Using very fast local drives should help expose differences in other parts of the system and avoid testing cloud storage. The instances are not perfectly identical, as you’ll see below, but are close enough to be considered the same grade. We used Ubuntu 20.04 AMIs and PostgreSQL 13.1 from the PGDG repo. We performed tests with small (in-memory) and large (IO-bound) database sizes.

Instances

Specifications and On-Demand pricing of the instances as per the AWS pricing information for Linux in the Northern Virginia region. With the currently listed prices, m6gd.8xlarge is 25% cheaper.

Graviton2 (arm) Instance

Instance : m6gd.8xlarge 	
Virtual CPUs : 32
RAM  : 128 GiB 	
Storage : 1 x 1900 NVMe SSD (1.9 TiB)
Price : $1.4464 per Hour

Regular (x86) Instance

Instance : m5d.8xlarge
Virtual CPUs : 32
RAM : 128 GiB
Storage : 2 x 600 NVMe SSD (1.2 TiB)
Price : $1.808 per Hour

OS and PostgreSQL setup

We selected Ubuntu 20.04.1 LTS AMIs for the instances and didn’t change anything on the OS side. On the m5d.8xlarge instance, the two local NVMe drives were unified in a single raid0 device. PostgreSQL was installed using .deb packages available from the PGDG repository.

The PostgreSQL version string confirms the OS architecture:

postgres=# select version();
                                                                version                                                                 
----------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

** aarch64 stands for 64-bit ARM architecture

The following PostgreSQL configuration was used for testing.

max_connections = '200'
shared_buffers = '32GB'
checkpoint_timeout = '1h'
max_wal_size = '96GB'
checkpoint_completion_target = '0.9'
archive_mode = 'on'
archive_command = '/bin/true'
random_page_cost = '1.0'
effective_cache_size = '80GB'
maintenance_work_mem = '2GB'
autovacuum_vacuum_scale_factor = '0.4'
bgwriter_lru_maxpages = '1000'
bgwriter_lru_multiplier = '10.0'
wal_compression = 'ON'
log_checkpoints = 'ON'
log_autovacuum_min_duration = '0'

pgbench Tests

First, a preliminary round of tests was done using pgbench, the micro-benchmarking tool shipped with PostgreSQL. This allows us to test different combinations of client and job counts, like:

pgbench -c 16 -j 16 -T 600 -r

Here, 16 client connections and 16 pgbench jobs feeding those connections are used.
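
To step through several concurrency levels in one go, a simple loop over the client/job counts can be used (a sketch only; the host, user, and database names are placeholders, not the exact invocation used in the tests):

for c in 1 2 4 8 16 32 64 128; do
    pgbench -h pghost -U postgres -c "$c" -j "$c" -T 600 -r postgres
done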

Read-Write Without Checksum

The default load that pgbench creates is a TPC-B-like read-write load. We ran it against a PostgreSQL instance that does not have checksums enabled.

We could see a 19% performance gain on ARM.

x86 (tps) 28878
ARM (tps) 34409

Read-Write With Checksum

We were curious whether the checksum calculation has any performance impact due to the architecture difference when PostgreSQL-level checksums are enabled. From PostgreSQL 12 onwards, checksums can be enabled on an existing cluster using the pg_checksums utility as follows:

pg_checksums -e -D $PGDATA

x86 (tps) 29402
ARM (tps) 34701

To our surprise, the results were marginally better! Since the difference is only around 1.7%, we consider it noise. At least we feel it is OK to conclude that enabling checksums doesn’t cause any noticeable performance degradation on these modern processors.
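
As a quick sanity check, whether checksums ended up enabled on a cluster can be confirmed at any time from SQL, since data_checksums is exposed as a read-only setting (output shown for the cluster after running pg_checksums -e):

$ psql -c 'SHOW data_checksums;'
 data_checksums
----------------
 on
(1 row)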

Read-Only Without Checksum

Read-only loads are expected to be CPU-centric. Since we selected a database size that fully fits into memory, we could eliminate IO-related overheads.

x86 (tps) 221436.05
ARM (tps) 288867.44

The results showed a 30% gain in tps for ARM over the x86 instance.

Read-Only With Checksum

We wanted to check whether we could observe any tps change with checksums enabled when the load becomes purely CPU-centric.

x86 (tps) 221144.3858
ARM (tps) 279753.1082

The results were very close to the previous ones, with a 26.5% gain for ARM.

In the pgbench tests, we observed that as the load becomes more CPU-centric, the performance difference increases. We couldn’t observe any performance degradation with checksums.

Note on checksums

PostgreSQL calculates and writes the checksum for pages when they are written out of and read into the buffer pool. In addition, hint bits are always logged when checksums are enabled, increasing WAL IO pressure. To correctly assess the overall checksum overhead, we would need longer and larger tests, similar to the ones we did with sysbench-tpcc.

Testing With sysbench-tpcc

We decided to perform more detailed tests using sysbench-tpcc. We were mainly interested in the case where the database fits into memory. On a side note, while PostgreSQL on the ARM server showed no issues, sysbench was much more finicky there than on the x86 instance.

Each round of testing consisted of a few steps (a rough sketch follows the list):

  1. Restore the data directory of the necessary scale (10/200).
  2. Run a 10-minute warmup test with the same parameters as the large test.
  3. Checkpoint on the PG side.
  4. Run the actual test.
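
A sketch of one such round for the PostgreSQL case is shown below; the paths, scale, thread counts, and durations are assumptions for illustration, and the exact scripts used are in the GitHub repo referenced at the end of the post:

# 1. restore a pre-built data directory of the required scale (10 or 200 warehouses)
sudo systemctl stop postgresql
sudo rsync -a --delete /backups/pgdata-200w/ /var/lib/postgresql/13/main/
sudo systemctl start postgresql

# 2. 10-minute warmup with the same parameters as the real test
sysbench ./tpcc.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-user=postgres \
  --pgsql-db=sbtest --tables=10 --scale=200 --threads=16 --time=600 run

# 3. checkpoint on the PostgreSQL side
psql -U postgres -c 'CHECKPOINT;'

# 4. run the actual test
sysbench ./tpcc.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-user=postgres \
  --pgsql-db=sbtest --tables=10 --scale=200 --threads=16 --time=1800 --report-interval=1 run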

In-memory, 16 threads:

In-memory, 16 threads

With this moderate load, the ARM instance shows around 15.5% better performance than the x86 instance. Here and below, the percentage difference is based on the mean tps value.

You might be wondering why there is a sudden drop in performance towards the end of the test. It is related to checkpointing with full_page_writes. Even though we used a Pareto distribution for the in-memory testing, a considerable number of pages are written out after each checkpoint. In this case, the instance showing higher performance triggered a WAL-based checkpoint earlier than its counterpart. These dips are present across all the tests performed.

In-memory, 32 threads:

In-memory, 32 threads

When concurrency increased to 32, the difference in performance reduced to nearly 8%.

In-memory, 64 threads:

In-memory, 64 threads

Pushing the instances close to their saturation point (remember, both are 32-vCPU instances), we see the difference reducing further to 4.5%.

In-memory, 128 threads:

In-memory, 128 threads

When both instances are past their saturation point, the difference in performance becomes negligible, although it’s still there at 1.4%. Additionally, we observed a 6-7% drop in throughput (tps) for ARM and a 4% drop for x86 when concurrency increased from 64 to 128 on these 32-vCPU machines.

Not everything we measured is favorable to the Graviton2-based instance. In the IO-bound tests (~200 GB dataset, 200 warehouses, uniform distribution), we saw less difference between the two instances, and at 64 and 128 threads, the regular m5d instance performed better. You can see this in the combined plots below.

A possible reason for this, especially the significant meltdown at 128 threads on m6gd.8xlarge, is that it lacks the second drive that m5d.8xlarge has. There is no perfectly comparable pair of instances available currently, so we still consider this a fair comparison; each instance type has its advantage. More testing and profiling is necessary to correctly identify the cause, as we expected the local drives to affect the tests only negligibly. IO-bound testing with EBS could be performed to try to remove the local drives from the equation.

More details of the test setup, results of the tests, scripts used, and data generated during the testing are available from this GitHub repo.

Summary

There were not many cases where the ARM instance was slower than the x86 instance in the tests we performed, and the results were consistent throughout the testing over the last couple of days. While the ARM-based instance is 25 percent cheaper, it shows a 15-20% performance gain in most of the tests over the corresponding x86-based instance. So ARM-based instances give conclusively better price/performance in all aspects. We expect more and more cloud providers to offer ARM-based instances in the future. Please let us know if you would like to see any other type of benchmark tests.

Join the discussion on Hacker News

Jan
19
2021
--

The MySQL Clone Wars: Plugin vs. Percona XtraBackup

MySQL Plugin vs. Percona XtraBackup

Large replication topologies are quite common nowadays, and this kind of architecture often requires a quick method to rebuild a replica from another server.

The Clone Plugin, available since MySQL 8.0.17, is a great feature that allows cloning databases out of the box. It is easy to rebuild a replica or to add new nodes to a cluster using the plugin. Before the release of the plugin, the best open-source alternative was Percona XtraBackup for MySQL Databases.

In this blog post, we compare both alternatives for cloning purposes. If you need to perform backups, Percona XtraBackup is a better tool as it supports compression and incremental backups, among other features not provided by the plugin. The plugin supports compression only for network transmission, not for storage.

But one of the plugin’s strong points is simplicity. Once it is installed and configured, cloning a database is straightforward: issuing a single command from the destination server is enough.
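
As an illustration of that simplicity, a typical clone sequence looks roughly like the following (a sketch of the plugin’s standard workflow rather than the exact commands used in this test; host names and credentials are placeholders):

mysql> INSTALL PLUGIN clone SONAME 'mysql_clone.so';           -- run on both donor and recipient
mysql> SET GLOBAL clone_valid_donor_list = 'donor-host:3306';  -- on the recipient
mysql> CLONE INSTANCE FROM 'clone_user'@'donor-host':3306 IDENTIFIED BY 'clone_password';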

Percona XtraBackup, on the other hand, is a more complex tool. The cloning process involves several stages: backup, stream, write, and prepare. Most of these stages can take place in parallel: we can stream the backup to the new server using netcat and, at the same time, write it into the destination directory. The only sequential stage is the last one: prepare.
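
A minimal sketch of that pipeline, without compression and with placeholder hosts and paths (not the exact commands used for the benchmark), could look like this:

# on the destination server: listen on a port and unpack the stream into the datadir
nc -l 4444 | xbstream -x -C /var/lib/mysql          # netcat listen syntax varies by flavor

# on the source server: take the backup and stream it over the network
xtrabackup --backup --stream=xbstream --parallel=2 --target-dir=/tmp | nc destination-host 4444

# back on the destination, once the stream completes: the sequential prepare stage
xtrabackup --prepare --target-dir=/var/lib/mysql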

Test Characteristics

We used sysbench to create 200 tables of 124 MB each, for a total of 24 GB. Both the source and replica virtual machines run 4 cores, 8 GB of RAM, and 60 GB of storage. We created the disks on the same datastore.

During the tests, we did not generate additional operations on the database. We measured only the clone process, reducing the benchmark complexity. Otherwise, we would have to take into consideration things like application response time, or the number of transactions executed. This is beyond the scope of this assessment.

We tested different combinations of clone and Percona XtraBackup operations. For XtraBackup, we tested 1 to 4 threads, with and without compression. In the case of compression, we allocated the same number of threads to compression and decompression. For the clone plugin, we tested auto (which lets the server decide how many threads will perform the clone) and 1 to 4 threads. We also tested with and without compression. Finally, we executed all the tests using three different network limits: 500mbps, 1000mbps, and 4000mbps. These make a total of 54 tests, executed 12+1 times each.

All times are in seconds. In the graphs below, lower values are better.

Method

Clone

Out of the required parameters to operate the clone plugin, the following were set up accordingly in the recipient server:

  • clone_max_concurrency=<maximum number of threads> Defines the maximum number of threads used for a remote cloning operation when autotuning is enabled. Otherwise, this is the exact number of threads that remote cloning uses.
  • clone_autotune_concurrency If enabled, the clone operation uses up to clone_max_concurrency threads. The default is 16.
  • clone_enable_compression If enabled, the remote clone operation uses compression.

Percona XtraBackup

To stream the backup we used the xbstream format and sent the data to the remote server using netcat. We applied the following parameters:

  • parallel=<number of threads> Xtrabackup and xbstream parameter that defines the number of threads used for backup and restore operations.
  • rebuild-threads The number of threads used for the rebuild (prepare) operation.
  • decompress_threads and compress_threads Xtrabackup and xbstream parameters that define the number of threads used for compression operations.

Some people use additional parameters like innodb-read-io-threads, innodb-write-io-threads, or innodb-io-capacity, but these parameters only affect the behavior of InnoDB background threads. They have no impact during backup and restore operations.

Results

Clone

No compression

For the lower-bandwidth tests, the number of threads used does not make a difference. Once we increase the bandwidth, we see that the time is cut in half when we move from one thread to two. Going beyond that improves things only slightly; we probably reach the disk I/O limit.

Clone Plugin performance without compression.

The auto option is consistently the fastest one.

Compression

Compression is supposed to improve performance for lower-bandwidth connections, but we see that this is not the case here. Bandwidth has no impact on execution time, and compression makes the clone slower. Again, auto gives the best results, equivalent to 4 threads.
clone plugin with compression

Percona XtraBackup

No Compression

Without compression, we see again that the number of threads does not make any difference in the lower-bandwidth test. When we increase the bandwidth, the number of threads matters, but we quickly reach I/O limits.

Percona Xtrabackup stream without compression

Compression

When using compression, the backup requires less time to complete in almost every case compared with the uncompressed option, even when bandwidth is not the limit.

Percona Xtrabackup stream with compression

Conclusion

We see that, when using compression, the clone plugin is the slower option, while Percona XtraBackup gives great results for all bandwidths. Without compression, the clone plugin is faster when using more than two threads, while XtraBackup is faster with fewer threads.

Xtrabackup vs. Clone plugin - results summary

Below is a chart comparing the worst and best results. As expected, the worst results correspond to single-thread executions.

The Clone Plugin is a great option for its simplicity. Percona XtraBackup is excellent for saving bandwidth and provides better results with fewer threads. With enough threads and bandwidth available, both solutions provide comparable results.

Oct
09
2020
--

Amazon Aurora Multi-Primary First Impression

Amazon Aurora Multi-Primary First Impression

For what reason should I use a real multi-primary setup?

To be clear, I do not mean a multi-writer solution where any node can become the active writer in case of need, as with Percona XtraDB Cluster (PXC) or Percona Server for MySQL using Group Replication. No, we are talking about a multi-primary setup where I can write at the same time on multiple nodes. I want to insist on this “why?”.

After having excluded the possible solutions mentioned above, both covering the famous 99.995% availability, which is 26.30 minutes of downtime in a year, what is left?

Disaster Recovery? Well, that is something I would love to have, but to be a real DR solution we need to put several kilometers (miles for imperial) in the middle.

And we know (see here and here) that aside from some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA; OK, that is a valid reason. Or I may need to scale the number of writes, which is a valid reason as well. This means, in the end, that I am looking at a multi-primary setup because it should:

  • Scale writes (more nodes, more writes)
    • Consistent reads (what I write on A must be visible on B)
  • Give me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB cluster. But my (virtual) boss was at AWS re-invent and someone mentioned to him that Aurora Multi-Primary does what I was looking for. This (long) article is my voyage in discovering if that is true or … not.

Given that I am focused on the behavior first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use low-level Aurora instances and will perform the tests from two EC2 instances in the same VPC/region as the nodes.

You can find the details about the tests on GitHub here.

Finally, I will test:

  • Connection speed
  • Stale read
  • Write single node for baseline
  • Write on both nodes:
    • Scaling splitting the load by schema
    • Scaling same schema

Test Results

Let us start to have some real fun. The first test is …

Connection Speed

The purpose of this test is to evaluate the time taken to open a new connection and the time taken to close it. Opening and closing connections can be a very expensive operation, especially if applications do not use a connection pooling mechanism.

Amazon Aurora Multi-Primary


As we can see, ProxySQL turns out to be the most efficient way to deal with opening connections, which was expected given the way it is designed to reuse open connections towards the backend.


The close connection operation is different, and here ProxySQL seems to take a little bit longer.

As a global observation, we can say that by using ProxySQL we get more consistent behavior. Of course, this test is a simplistic one, and we are not checking scalability (from 1 to N connections), but it is good enough to give us an initial feel. Specific connection tests will be the focus of the next blog on Aurora MM.

Stale Reads

Aurora Multi-Primary uses the same mechanism as default Aurora to update the buffer pool: the page cache update, just working in both directions. This means that the buffer pool of Node2 is updated with the modifications performed on Node1 and vice versa.

To verify whether an application would really be able to have consistent reads, I ran this test. It is meant to measure if, and how many, stale reads we get when writing on one node and reading from the other.

Amazon Aurora multi-primary has two consistency models:

Aurora consistency model
As an interesting fact, the result was that with the default consistency model (INSTANCE_RAW), we got 100% stale reads. Given that, I focused on identifying the cost of using the other consistency model (REGIONAL_RAW), which allows an application to have consistent reads.

The results indicate an increase of 44% in total execution time, and of 95% (22 times slower) in write execution.

It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models.

The graph below shows in yellow how long the application must wait to see the correct data on the reader node. In blue is the amount of time the application waits to get back the same consistent read because it must wait for the commit on the writer.

lag time in nanoseconds

As you can see, the two are more or less aligned.

Given the performance cost imposed by using REGIONAL_RAW,  all the other tests are done with the default INSTANCE_RAW, unless explicitly stated.

Writing Tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

 sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single Node (Baseline)

Before starting the comparative analysis, I was looking to define what was the “limit” of traffic/load for this platform.

baseline reads/writes

From the graph above, we can see that this setup scales up to 128 threads and after that, the performance remains more or less steady.

Amazon claims that we can roughly double the performance when using both nodes in write mode and writing to different schemas to avoid conflicts.

aurora scalability

Once more, remember I am not interested in the absolute numbers here, but I am expecting the same behavior. Given that, our expectation is to see:

expected scalability

Write on Both Nodes, Different Schemas

So AWS recommends this as the scaling solution:


And I diligently followed the advice. I used two EC2 nodes in the same subnet as the Aurora nodes, writing to different schemas (tpcc & tpcc2).

Overview

Let us make it short and go straight to the point. Did we get the expected scalability?

Well, no:


We only got a 26% increase, quite far from the expected 100%. Let us see what happened in detail (if not interested, just skip to the next test).

Node 1

Schema read writes Aurora

Node 2


As you can see, Node1 was (more or less) keeping up with expectations and was close to the expected performance. But Node2 was just not keeping up, and performance there was simply terrible.

The graphs below show what happened.

While Node1 was (again, more or less) scaling up to the baseline expectations (128 threads), Node2 fell to its knees at 16 threads. Node2 was never able to scale up.

Reads

Node 1


Node1 is scaling reads as expected, but here and there we can also see performance deterioration.

Node 2


Node2 is not scaling Reads at all.

Writes

Node 1


Same as Read.

Node 2


Same as read.

Now someone may think I made a mistake and was writing to the same schema. I assure you I was not. Check the next test to see what happened when using the same schema.

Write on Both Nodes,  Same Schema

Overview

Now, now, Marco, this is unfair. You know this will cause contention. Yes, I do! But nonetheless, I was curious to see what was going to happen and how the platform would deal with that level of contention.

My expectation was to see a lot of performance degradation and an increased number of locks. About conflicts, I was not wrong; after the test, node2 reported:

+-------------+---------+-------------------------+
| table       | index   | PHYSICAL_CONFLICTS_HIST |
+-------------+---------+-------------------------+
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |
+-------------+---------+-------------------------+

This is obviously a strong indication that something was not working right. In terms of performance gain, if we compare ONLY the results at 128 threads:


Even with this high level of conflict, we still get a 12% performance gain.

The problem is that, in general, the two nodes behave quite badly. If you check the graph below, you can see that the level of conflict is such that it prevents the nodes not only from scaling but also from acting consistently.

Node 1

Write on Both Nodes,  Same Schema

Node 2


Reads

In the following graphs, we can see how node1 had issues and actually crashed three times during the tests with 32/64/512 threads. Node2 was always up, but its performance was very low.

Node 1


Node 2


Writes

Node 1


Node 2


Recovery From Crashed Node

Regarding recovery time, reading the AWS documentation and listening to presentations, I often heard that Aurora Multi-Primary is a zero-downtime solution, or other statements like: “in applications where you can’t afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn’t use the failover mechanism, because it doesn’t need to promote another DB instance to have read/write capability.”

To achieve this, the suggested solution I found was to have applications point directly at the node endpoints instead of using the cluster endpoint.

In this context, the solution pointing to the nodes should be able to fail over within a second or so, while the cluster endpoint:

Recovery From Crashed Node

Personally, I think that designing an architecture where the application is responsible for the connection to the database and for failover is some kind of relic from 2001. But if you feel this is the way, well, go for it.

What I did for testing was to use ProxySQL, as plain as possible, with nothing else but the basic native monitor. I then compared the results with the tests using the cluster endpoint. In this way, I adopted the advice of pointing directly at the nodes, but did it the way we do things nowadays.
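
For reference, that “as plain as possible” setup boils down to registering the two node endpoints and an application user on the ProxySQL admin interface, and letting the native monitor shun a node that becomes unavailable. A minimal sketch (the hostgroup number, hostnames, and credentials are placeholders, not the exact configuration used):

mysql -u admin -padmin -h 127.0.0.1 -P 6032

INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, 'aurora-node1.xxxxxx.us-east-1.rds.amazonaws.com', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, 'aurora-node2.xxxxxx.us-east-1.rds.amazonaws.com', 3306);
INSERT INTO mysql_users (username, password, default_hostgroup) VALUES ('app_user', 'app_password', 10);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;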

The results are below and they confirm (more or less) the data coming from Amazon.


A downtime of seven seconds is quite a long time nowadays, especially if I am targeting the five-nines solution, which, remember, allows 864 ms of downtime per day. Using ProxySQL gets closer to that, but it is still too long to be called zero downtime.
I also had fail-back issues when using the AWS cluster endpoint, given that it was not able to move the connections to the joining node seamlessly.

Last but not least, when using the consistency level INSTANCE_RAW, I had some data issues as well as a PK conflict:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry ‘18828082’ for key ‘PRIMARY’)

Conclusions

As stated at the beginning of this long blog, the reasonable expectations for going with a multi-primary solution were:

  • Scale writes (more nodes more writes)
  • Give me zero downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Honestly, I feel we have completely failed on the scaling point. Probably, if I used the largest Aurora instances, I would get much better absolute numbers, and it would take longer to encounter the same issues, but I would encounter them eventually. In any case, if the multi-primary solution is designed to provide that scalability, it should do so at any size.

I did not have zero downtime, but I was able to failover pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is prone to negative effects like stale reads. Given that, I consider this requirement not satisfied in full.

Given all of the above, I think this solution could be valid only for high availability (close to five nines), but since it comes with some limitations, I do not feel comfortable preferring it over others just for that; in the end, default Aurora is already good enough as a highly available solution.

References

AWS re:Invent 2019: Amazon Aurora Multi-Master: Scaling out database write performance

Working with Aurora multi-master clusters

Improving Enterprises HA and Disaster Recovery Solutions Reviewed

Robust HA Solutions with ProxySQL

Limitations of multi-master clusters
