Mar
08
2021
--

Testing the Value of ScaleFlux Computational Storage Drive (CSD) for PostgreSQL

ScaleFlux Computational Storage Drive PostgreSQL

Some time ago we at Percona were approached by ScaleFlux Inc to benchmark their latest hardware appliance, the CSD 2000 Drive, which is a next-generation SSD computational storage drive. It goes without saying that a truly relevant report requires us to be as honest and as forthright as possible. In other words, my mission was to, ahem, see what kind of mayhem I could cause.

Benchmarking is a bit like cooking; it requires key ingredients, strict adherence to a set of instructions, mixing the ingredients together, and a bit of heat to make it all happen. In this case, the ingredients include Ubuntu 18.04 running on both the database and the benchmarking hosts, PostgreSQL version 12, SysBench (the modular, cross-platform, and multi-threaded benchmark tool), and a comparable, competing appliance, i.e. the Intel DC P4610 series drive. The two appliances are mounted as separate partitions, both using the same type of file system.

 

Once the environment is ready, the next step involves declaring and implementing the benchmarking rules, which consist of various types of DML and DDL activity. Keep in mind that, apart from the classic OLAP vs. OLTP modes of database processing, executing a benchmark that closely follows real production activity can be problematic: every production system is to some extent unique, especially when pushed to its full capacity. Therefore, for our purposes, we used the testing regime SysBench offers by default.

Once the system was ready, loading started out slow and gentle. The idea was to develop a baseline for the various types of activity and Postgres runtime conditions. Then, the benchmarking intensity was gradually increased to the point where we eventually started getting interesting results.


Needless to say, it took quite a bit of time running the various permutations, double-checking our numbers, graphing the data, and then, after all that, interpreting the output. I'm not going to go into great detail about the analysis itself. Instead, I encourage you to read the whitepaper.

So after all this effort, what was the takeaway?

There are two key observations that I’d like to share:

  1. At peak loading, the ScaleFlux CSD 2000 Drive demonstrated less performance variance than the Intel DC P4610 (variance being the statistical measure of the IO read-write spread between maximum and minimum values). The significance is server predictability. This becomes important when, for example, finely tuned application processes depend upon consistent performance from the RDBMS. Many a time I've seen applications get upset when the response time between inserting, updating, or deleting data and getting the resultant queries back would suddenly change.
  2. Remarkable space savings were realized when the Postgres fillfactor was reduced. As you know, the fillfactor can become a critical runtime parameter in regards to performance when high-frequency UPDATE and DELETE operations hit the same tuples over and over again (see the sketch below for how it is set).
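
For illustration, the fillfactor is set per table; a minimal sketch, where the table name and the 70% value are purely illustrative and not taken from the whitepaper:

psql -c "ALTER TABLE orders SET (fillfactor = 70);"
psql -c "VACUUM FULL orders;"   # rewrites existing pages so the new fillfactor takes effect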

Finally, one last item… I didn’t mention it but we also benchmarked MySQL for ScaleFlux. The results were pretty remarkable. It’s worth your while to have a look at that one too.

ScaleFlux White Papers:

Feb
26
2021
--

Connection Queuing in pgBouncer: Is it a Magical Remedy?

Connection Queuing in pgBouncer

Yes, this post is about connection queueing, not just pooling, because "connection pooling" – pre-created connections kept as a pool – is already a much-celebrated feature. Almost every discussion on connection pooling/pgBouncer starts with the overhead of establishing a new connection to PostgreSQL… and how pre-created connections in the pool can save the world.

But there is a non-celebrity feature in pgBouncer (not denying the others) that can address some really big operational challenges: connection queueing. Many new PostgreSQL users don't even know it exists. In this blog post, I plan to discuss some of the problems related to connection management first, and then explore how connection queueing can address those problems.

Problem 1: Spike in Load can Jam and Halt the Server

PostgreSQL has a dedicated backend server process for every user connection, so there will be one backend process running on a CPU core for every active query/session. This means there is a one-to-one mapping between active sessions and running processes on the server. If we consider parallel execution of SQL statements, there will be even more running processes than active sessions. In many real-world cases, a sudden spike in load can result in hundreds of active queries starting at once while the server is equipped with a small number of CPUs (sometimes just virtual CPUs with only a fraction of the performance). As the number of active sessions/processes increases, the overhead of scheduling and context switches takes over. Many times, the host server becomes unresponsive, and even opening a bash shell/terminal can take time. This is quite easy to simulate: just 10 active connections on a two-virtual-CPU server with a SELECT-only workload can cause it.
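
One way to reproduce such a pile-up (mirroring the pgbench run used later in this post, so the host and duration here are only illustrative) is to fire 10 select-only clients at a two-vCPU host:

pgbench -c 10 -j 10 -h pghost -U postgres -S -T 300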

With two active sessions:

$ time ssh -t postgres@pghost 'ls -l'
real 0m0.474s
user 0m0.017s
sys 0m0.006s

When there are 10 active sessions on PostgreSQL, just establishing an ssh connection to the server took 15 seconds.

real 0m15.307s
user 0m0.026s
sys 0m0.015s

** These are indicative numbers from a very specific system and do not qualify as a benchmark.

Generally, we could see that as the number of active sessions approaches double the number of CPU cores, the performance penalty starts increasing heavily.

Many times, the problem won't end there. Session-level resource allocations (work_mem, temporary tables, etc.) add to overall server resource consumption. As the host server slows down, each session takes more time to complete while holding on to its resources, which leads to further accumulation of active sessions. It is a spiral of evil. There are many real-world cases where the entire show ended in a complete halt of the host server, or the OOM killer kicked in, terminating the PostgreSQL process and forcing it into crash recovery.

 


Problem 2: “Too Many Clients Already” Errors

A few smart DBAs will prevent this database disaster by setting max_connections to a value the database server can actually handle, which is the right thing to do from a DB server perspective. Allowing an excessive number of connections to the database can lead to different types of abuses, attacks, and disasters. But the flip side is that an abusive application may be greeted with the following message:

FATAL:  sorry, too many clients already

The same will be logged in the PostgreSQL log.

2021-02-15 13:40:50.017 UTC [12829] FATAL:  sorry, too many clients already

Unfortunately, this could lead to an application crash or misbehavior. From the business point of view, we have just shifted the problem from the database to the application.

Problem 3: Big max_connection Value and Overhead

Because of the above-mentioned problem, it is common to see max_connections set to a very high value. The overhead of connections is probably one of the most actively discussed topics these days because PostgreSQL 14 is expected to include some of the connection scalability fixes. Andres Freund blogged about the details of analyzing the connection scalability problem and how it is being addressed.

Even idle connections may hold server resources like memory. The overhead is considered very low on a properly configured server; however, the impact can be heavy in reality. Again, a lot depends on the workload. There are at least a few reported cases of up to 50MB of memory consumption per session; that means 500 idle connections can result in up to 25GB of memory usage.

In addition to this, more connections lead to more lock-management overhead. And don't forget that the system becomes more vulnerable to sudden spikes as max_connections is increased.

Solution: Connection Queueing

At its core, connection queueing means that incoming connections are held in a queue so that a sudden spike in load can be absorbed, rather than connections being rejected straight away or sent to the server and jamming it. This streamlines execution: the PostgreSQL server can keep doing what it can do rather than dealing with a jam.

Let me demonstrate with an example. For this demonstration, I set max_connections to "2", assuming that this is the maximum the server can accommodate without causing too many context switches, so that excess connections cannot overload my database.

postgres=# show max_connections ;
2

A third connection to the database will result in an error as expected.

psql: error: FATAL: sorry, too many clients already

Now let's use pgBouncer for the connection queue. Many users may not know that queueing is there by default. I used the following pgBouncer configuration for testing:

[databases]
pgbounce = host=172.16.2.16 port=5432 dbname=postgres

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = pgbouncer.log
pidfile = pgbouncer.pid
admin_users = postgres
application_name_add_host=1
default_pool_size=1
min_pool_size=1
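
Client applications then connect to pgBouncer's listen_port (6432) and the pgbounce database alias instead of connecting to PostgreSQL directly. A minimal example, assuming pgBouncer runs on the local host and the user is present in userlist.txt:

$ psql -h 127.0.0.1 -p 6432 -U postgres pgbounce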

Yes, the pooler will establish only one connection to the database. pgBouncer establishes this one connection as soon as the first client connection arrives, because min_pool_size is 1. The pgBouncer log says:

2021-02-16 15:19:30.725 UTC [2152] LOG C-0x1fb0c58: pgbounce/postgres@172.16.2.56:54754 login attempt: db=pgbounce user=postgres tls=no
2021-02-16 15:19:30.726 UTC [2152] LOG S-0x1fbba80: pgbounce/postgres@172.16.2.16:5432 new connection to server (from 172.16.2.144:58968)

The pgBouncer pool statistics also show the details:

pgbouncer=# show pools;
 database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         1 |          0 |         0 |       0 |       1 |         0 |        0 |       0 |          0 | session

But the beauty is that we won't get any more "FATAL: sorry, too many clients already" errors. All client connections are accepted and put into the connection queue. For example, here I have five client connections; please see the value of cl_active:

pgbouncer=# show pools;
 database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         5 |          0 |         0 |       0 |       1 |         0 |        0 |       0 |          0 | session

As each client connection becomes active (with a SQL statement), it is moved to the waiting queue (cl_waiting).

pgbouncer=# show pools;
 database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         1 |          4 |         1 |       0 |       0 |         0 |        0 |      28 |     438170 | session

Each client connection is executed over an available database connection, one after another. This was a case with a single database connection. If the connection count and pool size are increased, multiple client connections can hit the server at the same time and the queue size drops. The following is a case with two connections (pool size two) to the database:

database  |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode 
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
 pgbounce  | postgres  |         2 |          1 |         2 |       0 |       0 |         0 |        0 |       4 |     978081 | session

Putting Connection Queueing into a Simple Test

This is not extensive benchmarking, but a quick test to see the benefits for a typical host with two virtual CPUs. I created 20 active connections to PostgreSQL with a select-only load using pgbench:

pgbench -c 20 -j 20 -h pghost -U postgres -S -T 100

As the 20 server processes started running, the load average went through the roof.

As you can see in the screenshot, the load average spiked to 17+. And, as expected, the server response also became consistently poor.

$ time ssh -t postgres@pghost 'ls -l'
real 0m17.768s
user 0m0.019s
sys 0m0.007s

At this stage, I tried sending the same 20 active connections through the connection queue of pgBouncer with a pool size of four (default_pool_size=4). pgBouncer runs on the client side.
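
The pgbench invocation stays the same except that it now points at the pgBouncer port. A sketch, assuming pgBouncer listens locally on 6432 with the pgbounce database alias from the earlier configuration:

pgbench -c 20 -j 20 -h 127.0.0.1 -p 6432 -U postgres -S -T 100 pgbounce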

Since there are only four server-side processes, the load average dropped drastically. The maximum I could see is 1.73:

The server response is also very good.

$ time ssh -t postgres@pghost 'ls -l'
real 0m0.559s
user 0m0.021s
sys 0m0.008s

A load average of 17+ vs. 1.73! That seems too good to be true.

There was a bit of skepticism about whether the low load on the server and the better server response were coming at the cost of database throughput. I was expecting to see not-so-great throughput numbers, so I repeated the same test on a more consistently performing platform (AWS r5.large with two virtual CPUs). To my slight surprise, the numbers were even better.

The following are the numbers I got. At least it is not bad; it’s better. 5% better.

Direct       | With Queueing
20190.301781 | 21663.454921
20308.115945 | 21646.195661
20434.780692 | 21218.44989

Since we are using just four connections on the database side in this case, it also gives us the opportunity to reduce the max_connections value on the database side. Another check was whether switching to transaction-level pooling saves even more, because a database connection goes back to the pool after each transaction and can serve another client connection, which could result in better concurrency.
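
Switching to transaction-level pooling is a one-line change in the pgBouncer configuration followed by a reload; a minimal sketch (the configuration path is an assumption):

# In the [pgbouncer] section of /etc/pgbouncer/pgbouncer.ini:
#   pool_mode = transaction
# Then reload pgBouncer through its admin console:
psql -h 127.0.0.1 -p 6432 -U postgres pgbouncer -c "RELOAD;"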

Queue + max_connections=5 | Queue + max_connections=5 + transaction-level pooling
21897.685891              | 23318.655016
21913.841813              | 23486.485856
21933.129685              | 23633.39607

As expected, it delivered even better numbers. I would like to encourage readers to do more tests and proper benchmarking.

Summary and References

A really large number of application/client connections can be multiplexed over very few database connections using connection queueing. This queue helps absorb any spike in connections without overloading the database server. Instead of session processes piling up on the OS scheduler/run queue, the connections can be kept safely outside, and the database server can operate at full throttle without contention. Streamlined database traffic also results in better throughput.

Jan
27
2021
--

Deploying Any Version of Leading Open Source Databases for Tests and Experiments

Deploying Any Version of Leading Open Source Databases

I want to present a tool for running a specific version of open source databases in a single instance, replication setups, and Kubernetes. AnyDbVer deploys MySQL/MariaDB/MongoDB/PostgreSQL for testing and experiments.

It Could Be Started By…

Docker (or Podman) or dbdeployer (MySQL-Sandbox successor) could also start a specific database version, but such installations are significantly different from production setups.

Ansible Playbook

There is an Ansible playbook with a few roles configurable by environment variables.

Bash Scripts

You may ignore the Ansible layer because the ./anydbver bash script hides it.

In LXD containers

It's not practical to reserve physical servers or virtual machines for all tasks. Frequently you need just a few database instances and the ability to restart them with systemd and check logs with journalctl. AnydbVer spawns one or more Linux Containers managed by Canonical's (Ubuntu) LXD, and LXD containers support systemd without requiring a layered filesystem setup.
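
A hypothetical session to illustrate this, where the mysqld service name assumes a Percona Server for MySQL deployment inside the node:

./anydbver ssh node1
sudo systemctl status mysqld
sudo journalctl -u mysqld --since "10 minutes ago"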

Best Performance with Linux Running on Hardware Directly

Linux Containers are not an emulation layer. From the "host" system, you can see all processes and files created by containers. Thus the performance is very similar to that of the same database running on a physical server.

You Can Run Vagrant + VirtualBox as Well, For Other OS

LXD setup is relatively easy, but sometimes you may need to run AnydbVer just with a few commands. Vagrant could start an unmodified Ubuntu virtual machine and automatically configure AnydbVer for you.

You can find the required files and setup instructions at https://github.com/ihanick/anydbver.

Clone the https://github.com/ihanick/anydbver repository to setup LXD or start Vagrant.

git clone https://github.com/ihanick/anydbver.git
cd anydbver
vagrant up
vagrant ssh
cd anydbver

Single Instance Usage

Imagine that you need the exact CentOS 7 package version: Percona Server for MySQL 5.6.43-rel84.3:

$ ./anydbver deploy percona-server:5.6.43-rel84.3
$ ./anydbver ssh
$ mysql
mysql> select version();

You are not limited to using the full version specification. To use the latest matching version, reduce 5.6.43-rel84.3 down to 5.6.43 or even 5.6. To run other databases, replace percona-server with:

  • pxc: Percona XtraDB Cluster
  • mysql: Oracle MySQL Community Server
  • mariadb: MariaDB
  • mariadb-cluster: MariaDB Galera Cluster
  • pg: Postgres build from https://www.postgresql.org/
  • ppg: Percona Distribution for PostgreSQL
  • psmdb: Percona Server for MongoDB

Multiple Instances

Several containers do not consume a significant amount of resources; actually, five MySQL container instances will consume about the same resources as five individual MySQL processes running on the same host. The syntax is:

$ ./anydbver deploy <default node definition> node1 <node1 definition> node2 <node2 definition> ...

For example, run two independent MySQL instances:

$ ./anydbver deploy mysql node1 mysql

SSH access to default node:

$ ./anydbver ssh
# or
$ ./anydbver ssh default

All other nodes (replace node1 with other node names):

$ ./anydbver ssh node1

Hostnames

You can also specify a server hostname. For example, let's run two Percona Server for MySQL instances:

$ ./anydbver deploy ps:5.7 hostname:leader.percona.local node1 ps:5.7 hostname:follower.percona.local
$ ./anydbver ssh leader
or ./anydbver ssh leader.percona or leader.percona.local
[root@leader ~]# mysql --host follower.percona.local --prompt '\h mysql>'
follower.percona.local mysql>

Replication

The most interesting part of modern open source databases is replication. Even active-active replication setups start from a single server (a leader, or master/primary). Start the first node normally and attach additional nodes with master:nodename or leader:nodename. PXC or Galera servers can participate in both synchronous and asynchronous replication; thus, for Galera clusters, you need the galera-master or galera-leader syntax.

Start a 3-node Percona XtraDB Cluster (latest 5.7):

./anydbver deploy pxc:5.7 node1 pxc:5.7 galera-master:default node2 pxc:5.7 galera-master:default

Run a master and two async slaves with Percona Server for MySQL and add all three servers to a ProxySQL setup:

./anydbver deploy ps:5.7 node1 ps:5.7 master:default node2 ps:5.7 master:default node3 proxysql master:default

Set up physical replication with slots for PostgreSQL 12.3:

./anydbver deploy pg:12.3 node1 pg:12.3 master:default

Make a Mongo replica set named rs0:

./anydbver deploy psmdb replica-set:rs0 node1 psmdb master:default replica-set:rs0 node2 psmdb master:default replica-set:rs0

MongoDB Sharding

MongoDB sharding setup requires several server types: servers with data (shardsrv), configuration servers (configsrv), and mongos server:

./anydbver deploy \
psmdb:4.2 replica-set:rs0 shardsrv \
node1 psmdb:4.2 master:default replica-set:rs0 shardsrv \
node2 psmdb:4.2 master:default replica-set:rs0 shardsrv \
node3 psmdb:4.2 configsrv replica-set:cfg0 \
node4 psmdb:4.2 configsrv replica-set:cfg0 master:node3 \
node5 psmdb:4.2 configsrv replica-set:cfg0 master:node3 \
node6 psmdb:4.2 mongos-cfg:cfg0/node3,node4,node5 mongos-shard:rs0/default,node1,node2

Containers and Orchestration

The fact that we are already using containers (LXD) shouldn't confuse you. We can still run Docker images inside our nodes; nested containers and Podman make it possible.

Run Percona Monitoring and Management Docker containers

Let's deploy the default node with Podman and run the Percona Monitoring and Management (PMM) server Docker container in it. Percona Server for MySQL 5.7 with the PMM client will run on node1:

./anydbver deploy pmm node1 ps:5.7 pmm-client pmm-server:default

Run multi-node Kubernetes cluster

Kubernetes can also utilize nested containers. There are several small Kubernetes distributions: minikube, microk8s, k3s, k0s. The simplest fully functional Kubernetes setup could be implemented with k3s.

The first k3s LXD container executes the API server and workers. Additional LXD containers could run more workers. Multiple workers are important to run complex HA applications with hostname anti-affinity.

./anydbver deploy k3s node1 k3s-master:default node2 k3s-master:default
./anydbver ssh

On a default node, we can execute kubectl or helm:

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ihanick-node2 Ready <none> 14m v1.19.2+k3s1
ihanick-default Ready master 17m v1.19.2+k3s1
ihanick-node1 Ready <none> 15m v1.19.2+k3s1

Percona Kubernetes Operator for Percona XtraDB Cluster

The Kubernetes cluster allows running stateful applications, like databases, services (similar to Amazon AWS S3), or monitoring solutions. Let’s start:

  • A 4-node cluster
  • MinIO (implements the S3 API) for backups
  • Percona XtraDB Cluster (managed by the operator)
  • Percona Monitoring and Management
./anydbver deploy k3s \
node1 k3s-master:default \
node2 k3s-master:default \
node3 k3s-master:default \
default k8s-minio k8s-pmm k8s-pxc backup
./anydbver ssh
kubectl get pods
NAME READY STATUS RESTARTS AGE
svclb-monitoring-service-n5nsx 1/1 Running 0 20m
svclb-monitoring-service-htssw 1/1 Running 0 20m
svclb-monitoring-service-n9kt4 1/1 Running 0 20m
svclb-monitoring-service-7btbh 1/1 Running 0 20m
minio-service-6db6667fb9-tk69n 1/1 Running 0 20m
monitoring-0 1/1 Running 0 20m
percona-xtradb-cluster-operator-7886ccf6b5-rtwxc 1/1 Running 0 18m
cluster1-pxc-0 2/2 Running 2 17m
cluster1-haproxy-0 2/2 Running 0 17m
cluster1-haproxy-1 2/2 Running 0 12m
cluster1-haproxy-2 2/2 Running 0 12m
cluster1-pxc-1 2/2 Running 1 13m
cluster1-pxc-2 2/2 Running 1 10m

Percona Kubernetes Operator for Percona Server for MongoDB

You can install and configure the database in the same way, regardless of the architecture:

./anydbver deploy k3s \
  node1 k3s-master:default \
  node2 k3s-master:default \
  node3 k3s-master:default \
  default k8s-minio k8s-mongo backup

Zalando Postgres Operator

You are not limited to Percona-only products and can run other K8s operators:

./anydbver deploy k3s node1 k3s-master:default node2 k3s-master:default node3 k3s-master:default default k8s-pg

Summary

AnyDbVer is a useful deployment tool for experiments with single instances and replication setups of MySQL, MariaDB, MongoDB, and PostgreSQL, as well as ProxySQL, Percona Monitoring and Management, and Kubernetes operators.

AnyDbVer restrictions:

  • It is not a production deployment tool.
  • The deployment process takes a long time. The tool downloads and installs all packages from OS and Vendor repositories.
Jan
22
2021
--

PostgreSQL on ARM-based AWS EC2 Instances: Is It Any Good?

PostgreSQL on ARM-based AWS EC2

The expected growth of ARM processors in data centers has been a hot topic for discussion for quite some time, and we were curious to see how they perform with PostgreSQL. The general availability of ARM-based servers for testing and evaluation was a major obstacle; the icebreaker was AWS announcing their ARM-based processor offering in their cloud in 2018. But we couldn't see much excitement immediately, as many considered it more "experimental" stuff. We were also cautious about recommending it for critical use and never put enough effort into evaluating it. But when the second generation of Graviton2-based instances was announced in May 2020, we wanted to consider it seriously. We decided to take an independent look at the price/performance of the new instances from the standpoint of running PostgreSQL.

Important: Note that while it's tempting to call this a comparison of PostgreSQL on x86 vs. ARM, that would not be correct. These tests compare PostgreSQL on two virtual cloud instances, and that includes way more moving parts than just a CPU. We're primarily focusing on the price-performance of two particular AWS EC2 instances based on two different architectures.

Test Setup

For this test, we picked two similar instances. One is the older m5d.8xlarge, and the other is a new Graviton2-based m6gd.8xlarge. Both instances come with local "ephemeral" storage that we'll be using here. Using very fast local drives should help expose differences in other parts of the system and avoid testing cloud storage. The instances are not perfectly identical, as you'll see below, but are close enough to be considered the same grade. We used an Ubuntu 20.04 AMI and PostgreSQL 13.1 from the pgdg repo. We performed tests with small (in-memory) and large (io-bound) database sizes.

Instances

Specifications and On-Demand pricing of the instances as per the AWS Pricing Information for Linux in the Northern Virginia region. With the currently listed prices, m6gd.8xlarge is 25% cheaper.

Graviton2 (arm) Instance

Instance : m6gd.8xlarge 	
Virtual CPUs : 32
RAM  : 128 GiB 	
Storage : 1 x 1900 NVMe SSD (1.9 TiB)
Price : $1.4464 per Hour

Regular (x86) Instance

Instance : m5d.8xlarge
Virtual CPUs : 32
RAM : 128 GiB
Storage : 2 x 600 NVMe SSD (1.2 TiB)
Price : $1.808 per Hour

OS and PostgreSQL setup

We selected Ubuntu 20.04.1 LTS AMIs for the instances and didn't change anything on the OS side. On the m5d.8xlarge instance, two local NVMe drives were unified in a single raid0 device. PostgreSQL was installed using .deb packages available from the PGDG repository.
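
A sketch of how such a raid0 device can be assembled; the device names, filesystem, and mount point are assumptions, not the exact commands used for the tests:

sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /var/lib/postgresql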

The PostgreSQL version string confirms the OS architecture:

postgres=# select version();
                                                                version                                                                 
----------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

** aarch64 stands for 64-bit ARM architecture

The following PostgreSQL configuration was used for testing.

max_connections = '200'
shared_buffers = '32GB'
checkpoint_timeout = '1h'
max_wal_size = '96GB'
checkpoint_completion_target = '0.9'
archive_mode = 'on'
archive_command = '/bin/true'
random_page_cost = '1.0'
effective_cache_size = '80GB'
maintenance_work_mem = '2GB'
autovacuum_vacuum_scale_factor = '0.4'
bgwriter_lru_maxpages = '1000'
bgwriter_lru_multiplier = '10.0'
wal_compression = 'ON'
log_checkpoints = 'ON'
log_autovacuum_min_duration = '0'

pgbench Tests

First, a preliminary round of tests was done using pgbench, the micro-benchmarking tool available with PostgreSQL. This allows us to test with different combinations of the number of clients and jobs, like:

pgbench -c 16 -j 16 -T 600 -r

Where 16 client connections and 16 pgbench jobs feeding the client connections are used.
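
The dataset itself is created with pgbench's initialization mode; a sketch where the scale factor is purely illustrative and not the one used for the published results:

pgbench -i -s 10000 postgres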

Read-Write Without Checksum

The default load that pgbench creates is a tpcb-like read-write load. We used the same on a PostgreSQL instance which doesn't have checksums enabled.

We could see a 19% performance gain on ARM.

x86 (tps) 28878
ARM (tps) 34409

Read-Write With Checksum

We were curious whether the checksum calculation has any impact on performance due to the architecture difference when PostgreSQL-level checksums are enabled. From PostgreSQL 12 onwards, checksums can be enabled on an existing cluster using the pg_checksums utility as follows:

pg_checksums -e -D $PGDATA
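
pg_checksums operates on a cleanly stopped cluster, so the full sequence looks roughly like the sketch below; the service unit and paths assume the default PGDG .deb layout on Ubuntu:

sudo systemctl stop postgresql@13-main
sudo -u postgres /usr/lib/postgresql/13/bin/pg_checksums -e -D /var/lib/postgresql/13/main
sudo systemctl start postgresql@13-main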

x86 (tps) 29402
ARM (tps) 34701

To our surprise, the results were marginally better! Since the difference is only around 1.7%, we consider it noise. At the very least, we feel it is OK to conclude that enabling checksums doesn't cause any noticeable performance degradation on these modern processors.

Read-Only Without Checksum

Read-only loads are expected to be CPU-centric. Since we selected a database size that fully fits into memory, we could eliminate IO-related overheads.
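
For these rounds we rely on pgbench's built-in select-only script; a sketch mirroring the read-write invocation above:

pgbench -c 16 -j 16 -T 600 -S -r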

x86 (tps) 221436.05
ARM (tps) 288867.44

The results showed a 30% gain in tps for the ARM instance over the x86 instance.

Read-Only With Checksum

We wanted to check whether we could observe any tps change with checksums enabled when the load is purely CPU-centric.

x86 (tps) 221144.3858
ARM (tps) 279753.1082

The results were very close to the previous one, with a 26.5% gain.

In the pgbench tests, we observed that as the load becomes CPU-centric, the difference in performance increases. We couldn't observe any performance degradation with checksums enabled.

Note on checksums

PostgreSQL calculates and writes checksums for pages when they are written out, and verifies them when they are read into the buffer pool. In addition, hint bits are always logged when checksums are enabled, increasing the WAL IO pressure. To correctly validate the overall checksum overhead, we would need longer and larger testing, similar to what we did with sysbench-tpcc.

Testing With sysbench-tpcc

We decided to perform more detailed tests using sysbench-tpcc. We were mainly interested in the case where the database fits into memory. On a side note, while PostgreSQL on the ARM server showed no issues, sysbench was much more finicky than on the x86 one.

Each round of testing consisted of a few steps:

  1. Restore the data directory of the necessary scale (10/200).
  2. Run a 10-minute warmup test with the same parameters as the large test.
  3. Checkpoint on the PG side.
  4. Run the actual test.

In-memory, 16 threads:

In-memory, 16 threads

With this moderate load, the ARM instance shows around 15.5% better performance than the x86 instance. Here and after, the percentage difference is based on the mean tps value.

You might be wondering why there is a sudden drop in performance towards the end of the test. It is related to checkpointing with full_page_writes. Even though we used a pareto distribution for the in-memory testing, a considerable number of pages are going to be written out after each checkpoint. In this case, the instance showing more performance triggered a checkpoint by WAL earlier than its counterpart. These dips are going to be present across all tests performed.

In-memory, 32 threads:

In-memory, 32 threads

When concurrency increased to 32, the difference in performance reduced to nearly 8%.

In-memory, 64 threads:

In-memory, 64 threads

Pushing instances close to their saturation point (remember, both are 32-cpu instances), we see the difference reducing further to 4.5%.

In-memory, 128 threads:

In-memory, 128 threads

When both instances are past their saturation point, the difference in performance becomes negligible, although it's still there at 1.4%. Additionally, we could observe a 6-7% drop in throughput (tps) for ARM and a 4% drop for x86 when concurrency increased from 64 to 128 on these 32-vCPU machines.

Not everything we measured is favorable to the Graviton2-based instance. In the IO-bound tests (~200G dataset, 200 warehouses, uniform distribution), we saw less difference between the two instances, and at 64 and 128 threads, regular m5d instance performed better. You can see this on the combined plots below.

A possible reason for this, especially the significant meltdown at 128 threads for m6gd.8xlarge, is that it lacks the second drive that m5d.8xlarge has. There’s no perfectly comparable couple of instances available currently, so we consider this a fair comparison; each instance type has an advantage. More testing and profiling is necessary to correctly identify the cause, as we expected local drives to negligibly affect the tests. IO-bound testing with EBS can potentially be performed to try and remove the local drives from the equation.

More details of the test setup, results of the tests, scripts used, and data generated during the testing are available from this GitHub repo.

Summary

There were not many cases where the ARM instance was slower than the x86 instance in the tests we performed. The test results were consistent throughout the testing of the last couple of days. While the ARM-based instance is 25 percent cheaper, it is able to show a 15-20% performance gain in most of the tests over the corresponding x86-based instance. So ARM-based instances give conclusively better price-performance in all aspects. We should expect more and more cloud providers to offer ARM-based instances in the future. Please let us know if you wish to see any different type of benchmark tests.

Join the discussion on Hacker News

Jan
19
2021
--

The MySQL Clone Wars: Plugin vs. Percona XtraBackup

MySQL Plugin vs. Percona XtraBackup

Large replication topologies are quite common nowadays, and this kind of architecture often requires a quick method to rebuild a replica from another server.

The Clone Plugin, available since MySQL 8.0.17, is a great feature that allows cloning databases out of the box. It is easy to rebuild a replica or to add new nodes to a cluster using the plugin. Before the release of the plugin, the best open-source alternative was Percona XtraBackup for MySQL Databases.

In this blog post, we compare both alternatives for cloning purposes. If you need to perform backups, Percona XtraBackup is a better tool as it supports compression and incremental backups, among other features not provided by the plugin. The plugin supports compression only for network transmission, not for storage.

But one of the plugin’s strong points is simplicity. Once installed and configured, cloning a database is straightforward. Just issuing a command from the destination database is enough.
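
For illustration, a minimal recipient-side session might look like the following; the donor address, user, and password are placeholders, and the donor must also have the plugin installed and a user with the BACKUP_ADMIN privilege:

mysql -e "INSTALL PLUGIN clone SONAME 'mysql_clone.so';"
mysql -e "SET GLOBAL clone_valid_donor_list = '10.0.0.10:3306';"
mysql -e "CLONE INSTANCE FROM 'clone_user'@'10.0.0.10':3306 IDENTIFIED BY 'secret';"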

Percona XtraBackup, on the other hand, is a more complex tool. The cloning process involves several stages: backup, stream, write, and prepare. These stages can take place in parallel: we can stream the backup to the new server using netcat and, at the same time, write it into the destination directory. The only stage that is sequential is the last one: prepare.
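
A sketch of that pipeline; hosts, ports, and paths are illustrative, and the netcat syntax varies slightly between flavors:

# On the destination: listen and unpack the stream as it arrives.
nc -l 9999 | xbstream -x -C /var/lib/mysql

# On the source: take the backup and stream it in xbstream format to the destination.
xtrabackup --backup --stream=xbstream --parallel=4 --target-dir=/tmp | nc destination-host 9999

# On the destination, once the copy completes: the only sequential stage, prepare.
xtrabackup --prepare --target-dir=/var/lib/mysql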

Test Characteristics

We used sysbench to create 200 tables of 124 MB each, for a total of 24 GB. Both source and replica virtual machines run 4 cores, 8 GB of RAM, and 60 GB of storage. We created the disks on the same datastore.

During the tests, we did not generate additional operations on the database. We measured only the clone process, reducing the benchmark complexity. Otherwise, we would have to take into consideration things like application response time, or the number of transactions executed. This is beyond the scope of this assessment.

We tested different combinations of clone and Percona XtraBackup operations. For XtraBackup, we tested 1 to 4 threads, with and without compression. In the case of compression, we allocated the same number of threads to compression and decompression. For the clone plugin, we tested auto (which lets the server decide how many threads will perform the clone) and 1 to 4 threads. We also tested with and without compression. Finally, we executed all the tests using three different network limits: 500mbps, 1000mbps, and 4000mbps. These make a total of 54 tests, executed 12+1 times each.

All times are in seconds. In the graphs below, lower values are better.

Method

Clone

Out of the required parameters to operate the clone plugin, the following were set up accordingly in the recipient server:

  • clone_max_concurrency=<maximum number of threads> Defines the maximum number of threads used for a remote cloning operation with autotuning enabled. Otherwise, this is the exact number of threads that remote cloning uses.
  • clone_autotune_concurrency If enabled, the clone operation uses up to clone_max_concurrency threads. The default is 16.
  • clone_enable_compression If enabled, the remote clone operation will use compression.

Percona XtraBackup

To stream the backup we used the xbstream format and sent the data to the remote server using netcat. We applied the following parameters:

  • parallel=<number of threads> Xtrabackup and xbstream parameter that defines the number of threads used for backup and restore operations.
  • rebuild-threads The number of threads used for the rebuild (prepare) operation.
  • decompress_threads and compress_threads Xtrabackup and xbstream parameters that define the number of threads used for compression operations.

Some people use additional parameters like innodb-read-io-threads, innodb-write-io-threads, or innodb-io-capacity, but these parameters only affect the behavior of InnoDB background threads. They have no impact during backup and restore operations.

Results

Clone

No compression

For the lower-bandwidth tests, the number of threads used does not make a difference. Once we increase the bandwidth, we see that the time is cut in half when we move from one thread to two. Going beyond that value improves things only slightly; we probably reach the disk I/O limit.

Clone Plugin performance without compression.

The auto option is consistently the fastest one.

Compression

Compression is supposed to improve performance for lower-bandwidth connections, but we see that this is not the case here. Bandwidth has no impact on execution time, and compression makes the clone slower. Again, auto gives the best results, equivalent to 4 threads.
clone plugin with compression

Percona XtraBackup

No Compression

Without compression, we see again that the number of threads does not make any difference in the lower-bandwidth test. When we increase the bandwidth, the number of threads matters, but we quickly reach I/O limits.

Percona Xtrabackup stream without compression

Compression

When using compression, we see that the backup requires less time to complete in almost every case compared with the uncompressed option, even when bandwidth is not the limit.

Percona Xtrabackup stream with compression

Conclusion

We see that, when using compression, the clone plugin is the slower option, while Percona XtraBackup gives great results for all bandwidths. Without compression, the clone plugin is faster when using more than two threads, and XtraBackup is faster with fewer threads.

Xtrabackup vs. Clone plugin - results summary

Below, we have a chart comparing the worst and best results. As expected, the worst results correspond to one thread executions.

The Clone Plugin is a great option for simplicity. Percona XtraBackup is excellent for saving bandwidth and provides better results with fewer threads. With enough threads and bandwidth available, both solutions provide comparable results.

Oct
09
2020
--

Amazon Aurora Multi-Primary First Impression

Amazon Aurora Multi-Primary First Impression

For what reason should I use a real multi-primary setup?

To be clear, not a multi-writer solution where any node can become the active writer in case of need, as with Percona XtraDB Cluster (PXC) or Percona Server for MySQL using Group Replication. No, we are talking about a multi-primary setup where I can write at the same time on multiple nodes. I want to insist on this "why?".

After having excluded the possible solutions mentioned above, both covering the famous 99.995% availability, which is 26.30 minutes of downtime in a year, what is left?

Disaster Recovery? Well, that is something I would love to have, but to be a real DR solution we need to put several kilometers (miles for imperial) in the middle.

And we know (see here and here) that aside from some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA; ok, that is a valid reason. Or I may need to scale the number of writes, which is a valid reason as well. This means, in the end, that I am looking at a multi-primary setup because it should:

  • Scale writes (more nodes, more writes)
  • Provide consistent reads (what I write on A must be visible on B)
  • Give me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB Cluster. But my (virtual) boss was at AWS re:Invent and someone mentioned to him that Aurora Multi-Primary does what I was looking for. This (long) article is my voyage in discovering whether that is true or… not.

Given I am focused on the behavior first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use low-level Aurora instances, and I will perform the tests from two EC2 instances in the same VPC/region as the nodes.

You can find the details about the tests on GitHub here.

Finally, I will test:

  • Connection speed
  • Stale read
  • Write single node for baseline
  • Write on both nodes:
    • Scaling splitting the load by schema
    • Scaling same schema

Test Results

Let us start to have some real fun. The first test is …

Connection Speed

The purpose of this test is to evaluate the time taken in opening a new connection and time taken to close it. The action of the open/close connection can be a very expensive operation, especially if applications do not use a connection pool mechanism.

Amazon Aurora Multi-Primary


As we can see, ProxySQL turns out to be the most efficient way to deal with opening connections, which was expected given the way it is designed to reuse open connections towards the backend.


The close-connection operation is different, and here ProxySQL seems to take a little bit longer.

As a global observation, we can say that by using ProxySQL we have more consistent behavior. Of course, this test is a simplistic one, and we are not checking the scalability (from 1 to N connections) but it is good enough to give us the initial feeling. Specific connection tests will be the focus of the next blog on Aurora MM.

Stale Reads

Aurora multi-primary uses the same mechanism as default Aurora to update the buffer pool:


It uses the page cache update mechanism, just applied in both directions: the buffer pool of Node2 is updated with the modifications performed on Node1 and vice versa.

To verify whether an application would really be able to have consistent reads, I ran this test, which is meant to measure whether, and how many, stale reads we get when writing on one node and reading from the other.

Amazon Aurora multi-primary has two consistency models:

Aurora consistency model
As an interesting fact, the result was that with the default consistency model (INSTANCE_RAW), we got 100% stale reads.
Given that, I focused on identifying the cost of using the other consistency model (REGIONAL_RAW), which allows an application to have consistent reads.

The results indicate an increase of 44% in total execution time, and of 95% (22 times slower) in write execution.

It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models.

The graph below shows in yellow how long the application must wait to see the correct data on the reader node. In blue is the amount of time the application waits to get back the same consistent read because it must wait for the commit on the writer.

lag time in nanoseconds

As you can see, the two are more or less aligned.

Given the performance cost imposed by using REGIONAL_RAW,  all the other tests are done with the default INSTANCE_RAW, unless explicitly stated.

Writing Tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

 sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single Node (Baseline)

Before starting the comparative analysis, I was looking to define the "limit" of traffic/load for this platform.

baseline reads/writes

From the graph above, we can see that this setup scales up to 128 threads and after that, the performance remains more or less steady.

Amazon claims that we can nearly double the performance when using both nodes in write mode and using different schemas to avoid conflicts.

aurora scalability

Once more, remember I am not interested in the absolute numbers here, but I am expecting the same behavior. Given that, our expectation is to see:

expected scalability

Write on Both Nodes, Different Schemas

So AWS recommends this as the scaling solution:


And I diligently followed the advice. I used two EC2 nodes in the same subnet as the Aurora nodes, writing to different schemas (tpcc & tpcc2).

Overview

Let us make it short and go straight to the point. Did we get the expected scalability?

Well, no:


We just had a 26% increase, quite far from the expected 100%. Let us see what happened in detail (if not interested, just skip to the next test).

Node 1

Schema read writes Aurora

Node 2


As you can see, Node1 was (more or less) keeping up with expectations and staying close to the expected performance. But Node2 was just not keeping up, and performance there was just terrible.

The graphs below show what happened.

While Node1 was (again, more or less) scaling up to the baseline expectations (128 threads), Node2 collapsed to its knees at 16 threads and was never able to scale up.

Reads

Node 1


Node1 is scaling the reads as expected, but also here and there we can see performance deterioration.

Node 2


Node2 is not scaling Reads at all.

Writes

Node 1


Same as Read.

Node 2


Same as read.

Now someone may think I made a mistake and was writing to the same schema. I assure you I was not. Check the next test to see what happened when using the same schema.

Write on Both Nodes,  Same Schema

Overview

Now, now, Marco, this is unfair. You know this will cause contention. Yes, I do! But nonetheless, I was curious to see what was going to happen and how the platform would deal with that level of contention.

My expectations were to have a lot of performance degradation and an increased number of locks. About the conflicts, I was not wrong; after the test, node2 reported:

+-------------+---------+-------------------------+
| table       | index   | PHYSICAL_CONFLICTS_HIST |
+-------------+---------+-------------------------+
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |
+-------------+---------+-------------------------+

This is obviously a strong indication that something was not working right. In terms of performance gain, if we compare ONLY the results with 128 threads:


Even with the high level of conflict, we still have a 12% performance gain.

The problem is that, in general, the two nodes behave quite badly. If you check the graph below, you can see that the level of conflict is such that it prevents the nodes not only from scaling but also from acting consistently.

Node 1

Write on Both Nodes,  Same Schema

Node 2


Reads

In the following graphs, we can see how Node1 had issues and actually crashed three times, during the tests with 32/64/512 threads. Node2 was always up, but its performance was very low.

Node 1


Node 2


Writes

Node 1


Node 2


Recovery From Crashed Node

Regarding recovery time, reading the AWS documentation and listening to presentations, I often heard that Aurora Multi-Primary is a zero-downtime solution, or other statements like: "in applications where you can't afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn't use the failover mechanism, because it doesn't need to promote another DB instance to have read/write capability".

To achieve this, the suggested solution I found was to have applications point directly to the node endpoints and not use the cluster endpoint.

In this context, the solution pointing to the Nodes should be able to failover within a second or so, while the cluster endpoint:

Recovery From Crashed Node

Personally, I think that designing an architecture where the application is responsible for the connection to the database and for failover is some kind of throwback to 2001. But if you feel this is the way, well, go for it.

What I did for testing was to use ProxySQL, as plain as possible, with nothing else but its basic native monitor. I then compared the results with the tests using the cluster endpoint. In this way, I followed the advice of pointing directly at the nodes, but did it in a more modern way.

The results are below and they confirm (more or less) the data coming from Amazon.


A downtime of seven seconds is quite a long time nowadays, especially if I am targeting the 5-nines solution, which, remember, allows a maximum of 864 ms of downtime per day. Using ProxySQL gets closer to that, but it is still too long to be called zero downtime.
I also had fail-back issues when using the AWS cluster endpoint, given that it was not able to move the connections to the joining node seamlessly.

Last but not least, when using the consistency level INSTANCE_RAW, I had some data issues as well as PK conflicts:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry ‘18828082’ for key ‘PRIMARY’) 

Conclusions

As stated at the beginning of this long blog, the reasonable expectations for going with a multi-primary solution were:

  • Scale writes (more nodes more writes)
  • Gives me zero downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Honestly, I feel we have completely failed on the scaling point. Probably if I used the largest Aurora instances I would get much better absolute numbers, and it would take more load to encounter the same issues, but I would still encounter them. In any case, if the multi-primary solution is designed to provide that scalability, it should do so at any size.

I did not have zero downtime, but I was able to failover pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is prone to possible negative effects like stale reads. Given that, I consider this requirement not satisfied in full.

Given all the above, I think this solution could eventually be valid only for high availability (close to 5 nines), but given it comes with some limitations, I do not feel comfortable preferring it over others just for that; in the end, default Aurora is already good enough as a highly available solution.

References

AWS re:Invent 2019: Amazon Aurora Multi-Master: Scaling out database write performance

Working with Aurora multi-master clusters

Improving enterprises ha and disaster recovery solutions reviewed

Robust ha solutions with proxysql

Limitations of multi-master clusters

Sep
16
2020
--

MongoDB 4.4 Performance Regression: Overwhelmed by Memory

MongoDB 4.4 Performance Regression

There is a special collection of database bugs where the system starts to perform worse when given more resources. Examples of such bugs for MySQL that I have:

Bug #15815 – This is where InnoDB on an 8-CPU system performed worse than on a 4-CPU system with increased concurrency.

Bug #29847 – This is similar to the bug I will describe today: when given more memory (innodb_buffer_pool_size), the MySQL crash recovery process would take longer than with less memory, as described in our blog Innodb Recovery – Is a large buffer pool always better?

It seems InnoDB Flushing was not optimal with a big amount of memory at that time, and this is what I think is happening with MongoDB 4.4 in the scenario I will describe.

MongoDB 4.4 Load Data Procedures

So when preparing data for my benchmark (Percona Server for MongoDB 4.2 vs 4.4 in Python TPCC Benchmark), I also measured how long it takes to load 1000 warehouses (about 165GB of data in MongoDB), and to have repeatable numbers I repeated the procedure multiple times, as I usually like to do.

What I noticed is that when MongoDB 4.4 starts with an unlimited cache (that is, on a server with 250GB of RAM, it will allocate 125GB for the WiredTiger cache and the rest can be used for the OS cache), it shows interesting behavior.

Let me describe my load procedure, which is quite simple:

  1. Load data into database TPCC1
  2. Sleep 10 mins
  3. Load data into database TPCC3

That is, the second time we load into a different database, and in the background the pages for database TPCC1 should be flushed and evicted.

Before jumping to the problem I see, let's check the numbers for MongoDB 4.2, and by the numbers I mean how long it takes to accomplish step 1 and step 3.

MongoDB 4.2 with limited memory (WiredTiger cache 25GB):

Step 1: 20 min

Step 3: 21 min

MongoDB 4.4 with limited memory (WiredTiger cache 25GB):

Step 1: 24 min

Step 3: 26 min

MongoDB 4.2 with 125GB WiredTiger cache:

Step 1: 18 min

Step 3: 19 min

 

And now to the problem I see:

MongoDB 4.4 with 125GB WiredTiger cache:

Step 1: 19 min

Step 3: 497 min

Notice Step 3 takes almost 8 and a half hours, instead of the usual ~20 mins for all previous cases, and this is when MongoDB has 125GB of WiredTiger cache.

What’s interesting is that I do not see this issue when I limit the WiredTiger cache to 25GB, and also this problem does not exist in MongoDB 4.2.

That’s why I think MongoDB 4.4 starts to behave differently when it is given a lot of memory for WiredTiger cache. Why this happens exactly I do not know yet and I will continue to profile this case. A quick look may indicate this is related to the WiredTiger eviction process (similar to the InnoDB flushing problem in the crash recovery process) and replication flow control which was created in MongoDB 4.2 to keep replicas in sync (but I do not use replicas for this test).



Sep
15
2020
--

Percona Server for MongoDB 4.2 vs 4.4 in Python TPCC Benchmark

Percona Server for MongoDB 4.2 vs. 4.4

Following my previous blogs on the py-tpcc benchmark for MongoDB, Evaluating the Python TPCC MongoDB Benchmark and Evaluating MongoDB Under Python TPCC 1000W Workload, and the recent release of Percona Server for MongoDB 4.4, I wanted to evaluate 4.2 vs. 4.4 in similar scenarios.

Hardware Specs

For the client and server, I will use identical bare metal servers, connected via a 10Gb network.

The node specification:

# Percona Toolkit System Summary Report ######################
        Date | 2020-09-14 16:52:46 UTC (local TZ: EDT -0400)
    Hostname | node3
      System | Supermicro; SYS-2028TP-HC0TR; v0123456789 (Other)
    Platform | Linux
     Release | Ubuntu 20.04.1 LTS (focal)
      Kernel | 5.4.0-42-generic
Architecture | CPU = 64-bit, OS = 64-bit
# Processor ##################################################
  Processors | physical = 2, cores = 28, virtual = 56, hyperthreading = yes
      Models | 56xIntel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
      Caches | 56x35840 KB
# Memory #####################################################
       Total | 251.8G
  Swappiness | 0
 DirtyPolicy | 80, 5
 DirtyStatus | 0, 0

The drive I used for the storage in this benchmark is a Samsung SM863 SATA SSD.

MongoDB Topology

For MongoDB I used:

  • Single node instance without limiting the cache size. As the bare metal server has 250GB of RAM, MongoDB should allocate 125GB of memory for the WiredTiger cache, and the rest will be used for the OS cache. This should produce a more CPU-bound workload.
  • Single node instance with a limited cache size. For the WiredTiger cache I will set a limit of 25GB, and to limit the OS cache I will limit the memory available to the mongodb instance to 50GB, as described in Using Cgroups to Limit MySQL and MongoDB memory usage.
  • However, I did not use cgroups in this case; rather, I used Docker to run the different versions and set the limits.

The script to start Percona Server for MongoDB in docker with memory limits:

> bash startserver.sh 4.4 1
=== script startserver.sh ===
# $1 = Percona Server for MongoDB version tag, $2 = instance number
# Start the container with a 50GB memory limit and a 25GB WiredTiger cache
docker run -d --name db$2 -m 50g  \
          -v /mnt/data/psmdb$2-$1:/data/db \
          --net=host \
          percona/percona-server-mongodb:$1 --replSet "rs$2" --port $(( 27016 + $2 )) \
          --logpath /data/db/server1.log --slowms=10000 --wiredTigerCacheSizeGB=25

# Give mongod a moment to start, then initiate a single-member replica set
sleep 10

mongo mongodb://127.0.0.1:$(( 27016 + $2 )) --eval "rs.initiate( { _id : 'rs$2',  members: [      { _id: 0, host: '172.16.0.3:$(( 27016 + $2 ))' }   ] })"
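
As a quick sanity check that the limits were actually applied, something like this can be used (a sketch; the serverStatus() cache field name is as reported by recent WiredTiger builds, and the port corresponds to instance 1 started above):

# Configured WiredTiger cache size in bytes, and the container memory limit
mongo mongodb://127.0.0.1:27017 --quiet --eval 'print(db.serverStatus().wiredTiger.cache["maximum bytes configured"])'
docker inspect db1 --format '{{.HostConfig.Memory}}'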

MongoDB Versions:

Benchmark Results

Unlimited Memory

The results are in New Order Transactions per Minute (NOTPM), and more is better:

Clients 4.2 4.4
10 541.31 691.89
30 999.89 1105.88
50 1048.50 1171.35
70 1095.72 1335.90
90 1184.38 1433.09
110 1210.18 1521.56
130 1231.38 1575.23
150 1245.31 1680.81
170 1224.13 1668.33
190 1300.11 1641.45
210 1240.86 1619.58
230 1220.89 1575.57
250 1237.86 1545.01

MongoDB 4.4 Unlimited Memory

Limited Memory, 50GB in Total and 25GB for Cache

The results are in New Order Transactions per Minute (NOTPM), and more is better:

Clients 4.2 4.4
10 351.45 377.29
30 483.88 447.22
50 535.34 522.59
70 576.30 574.14
90 604.49 582.10
110 618.59 542.11
130 593.31 386.33
150 386.67 301.75
170 265.91 298.80
190 259.56 301.38
210 254.57 301.88
230 249.47 299.15
250 251.03 300.00

MongoDB 4.2 Limited Memory

Observation

Actually, I wanted to perform more benchmarks on 4.4 vs 4.2, but some interesting behavior in 4.4 made me reconsider my plans, and I got distracted trying to understand the issue, which I share in the post MongoDB 4.4 Performance Regression: Overwhelmed by Memory.

Besides that, in my tests 4.4 outperformed 4.2 in the case of unlimited memory, but I want to take into account the variation of throughput during the benchmark, so we are working on a py-tpcc version that reports data with 1-sec resolution. Also, I want to re-evaluate how 4.4 performs in a long-running benchmark, as the current length of the benchmark is 900 sec.

In the case of limited memory, 4.4 performed identically or worse than 4.2 at concurrency over 100 clients.

Both versions did not handle the increased number of clients well, showing worse results with 150 clients compared to 10 clients.



Aug
27
2020
--

More on Checkpoints in InnoDB MySQL 8

Recently I posted about checkpointing in MySQL, where MySQL showed interesting “wave” behavior.

Soon after, Dimitri posted a solution for how to fix the "waves," and I would like to dig a little deeper into the proposed suggestions, as there is some material to process.

This post will be very heavy on InnoDB configuration, so let's start with the basic configuration for MySQL, but before that, some details on the environment.

I use MySQL version 8.0.21 on the hardware as described here.

As for the storage, I am not using some "old dusty SSD," but a production-available, enterprise-grade Intel SATA SSD D3-S4510. This SSD is able to handle a throughput of 468MiB/sec of random writes, or 30000 IOPS of random writes in 16KiB blocks.

So initial configuration for my test was:

[mysqld]
datadir= /data/mysql8-8.0.21
user=mysql
bind_address = 0.0.0.0

socket=/tmp/mysql.sock
log-error=error.log

ssl=0
performance_schema=OFF

skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000

join_buffer_size=256K
sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=10G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size= 140G
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_purge_threads=4
innodb_lru_scan_depth=512
innodb_log_buffer_size=64M

default_storage_engine=InnoDB

innodb_flush_log_at_trx_commit  = 1
innodb_doublewrite= 1
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0

#innodb_monitor_enable=all
max_prepared_stmt_count=1000000

innodb_adaptive_hash_index=1
innodb_monitor_enable='%'

innodb-buffer-pool-load-at-startup=OFF
innodb_buffer_pool_dump_at_shutdown=OFF

There are a lot of parameters, so let's highlight those most relevant for this test:

innodb_buffer_pool_size= 140G

The buffer pool size is enough to fit all data, which is about 100GB in size.

innodb_adaptive_hash_index=1

Adaptive hash index is enabled (as it comes in default InnoDB config)

innodb_buffer_pool_instances=8

This is what defaults provide, but I will increase it, following my previous post. 

innodb_log_file_size=10G

innodb_log_files_in_group=2

These parameters define a limit of 20GB for our redo logs, and this is important, as our workload will be redo-log bound, as we will see from the results.

innodb_io_capacity=2000

innodb_io_capacity_max=4000

You may ask why I use 2000 and 4000 while the storage can handle 30000 IOPS.

This is a valid point, and as we will see later, these values are not high enough for this workload, but that also does not mean we should push them all the way up to 30000, as we will see from the results.

MySQL Manual says the following about innodb_io_capacity:

“The innodb_io_capacity variable defines the overall I/O capacity available to InnoDB. It should be set to approximately the number of I/O operations that the system can perform per second (IOPS). When innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available for background tasks based on the set value.” 

From this, you may get the impression that if you set innodb_io_capacity to the I/O bandwidth of your storage, you should be fine. Though this part does not say what you should count as an I/O operation. For example, if your storage can perform 500MB/sec, then with 4KB block IO operations that is 125000 IO per second, and with 16KB IO it is about 31000 IO per second.

The MySQL manual leaves it up to your imagination, but as the typical InnoDB page size is 16KB, let's assume we do IO in 16KB blocks.
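
As a quick back-of-the-envelope check (plain shell arithmetic, using decimal MB/KB), the same bandwidth translates into very different IOPS figures depending on the block size:

# 500 MB/sec expressed as IO operations per second for different block sizes
echo $(( 500 * 1000 / 4 ))     # 4KB blocks  -> 125000 IO per second
echo $(( 500 * 1000 / 16 ))    # 16KB blocks -> 31250 IO per second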

However later on that page, we can read:

“Ideally, keep the setting as low as practical, but not so low that background activities fall behind. If the value is too high, data is removed from the buffer pool and change buffer too quickly for caching to provide a significant benefit. For busy systems capable of higher I/O rates, you can set a higher value to help the server handle the background maintenance work associated with a high rate of row changes”

and

“Consider write workload when tuning innodb_io_capacity. Systems with large write workloads are likely to benefit from a higher setting. A lower setting may be sufficient for systems with a small write workload.”

I do not see that the manual provides much guidance about what value I should use, so we will test it.

Initial results

So if we benchmark with initial parameters, we can see the “wave” pattern.

 

As for why this is happening, let’s check Percona Monitoring and Management “InnoDB Checkpoint Age” chart:

Actually, InnoDB Flushing by Type in PMM does not show sync flushing yet, so I had to modify the chart a little to show "sync flushing" as the orange line:

And we immediately see that Uncheckpointed Bytes exceeds the Max Checkpoint Age of 16.61GiB, which is defined by the 20GiB of InnoDB log files. 16.61GiB is less than 20GB because InnoDB reserves some cushion for cases exactly like this, so even if we exceed 16.61GiB, InnoDB still has an opportunity to flush data.

Also, we see that before Uncheckpointed Bytes exceeds Max Checkpoint Age, InnoDB flushes pages at a rate of 4000 IOPS, just as defined by innodb_io_capacity_max.

We should try to avoid the case when Uncheckpointed Bytes exceeds Max Checkpoint Age, because when it happens, InnoDB gets into "emergency" flushing mode, and in fact, this is what causes the waves we see. I should have detected this in my previous post, mea culpa.
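
Outside of PMM, the same picture can be pulled directly from InnoDB metrics; below is a rough sketch (counter names may differ slightly between versions, and innodb_monitor_enable='%' is already set in the config above, so these counters are populated):

# Checkpoint age vs. its async/sync limits, plus sync ("emergency") flushing counters
mysql -e "SELECT NAME, COUNT FROM information_schema.INNODB_METRICS
          WHERE NAME LIKE 'log_lsn_checkpoint_age%'
             OR NAME LIKE 'log_max_modified_age%'
             OR NAME LIKE 'buffer_flush_sync%';"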

So the first conclusion we can make: if InnoDB does not flush fast enough, what if we increase innodb_io_capacity_max? Sure, let's see. For simplicity, in the next experiments I will use

innodb_io_capacity = innodb_io_capacity_max, unless specified otherwise.

Next run with innodb_io_capacity = innodb_io_capacity_max = 7000

Not much improvement, and this is also confirmed by the InnoDB Checkpoint Age chart:

InnoDB tries to flush more pages per second, up to 5600 pages/sec, but it is not enough to avoid exceeding Max Checkpoint Age.

Why is this the case? The answer is the doublewrite buffer.

Even though MySQL improved the doublewrite buffer in MySQL 8.0.20, it does not perform well enough with the proposed defaults.

Well, at least the problem was finally addressed, because previously Oracle ran benchmarks with doublewrite disabled, just to hide and totally ignore the issue with doublewrite. For an example, check this.

But let’s get back to our 8.0.21 and fixed doublewrite.

Dimitri mentions:

“the main config options for DBLWR in MySQL 8.0 are:

  • innodb_doublewrite_files = N
    innodb_doublewrite_pages = M”

Let’s check the manual again:

“The innodb_doublewrite_files variable is intended for advanced performance tuning. The default setting should be suitable for most users.

innodb_doublewrite_pages

The innodb_doublewrite_pages variable (introduced in MySQL 8.0.20) controls the maximum number of doublewrite pages per thread. If no value is specified, innodb_doublewrite_pages is set to the innodb_write_io_threads value. This variable is intended for advanced performance tuning. The default value should be suitable for most users."

Was it wrong to assume that innodb_doublewrite_files and innodb_doublewrite_pages provide values suitable for our use case?

But let's try with the values Dimitri recommended looking into; I will use

innodb_doublewrite_files=2 and innodb_doublewrite_pages=128
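
Both variables are read at startup only, so they go into the config file and require a restart; a minimal sketch, assuming a conf.d-style include directory (the file path and service name are hypothetical):

# Hypothetical include file; both settings take effect only after a restart
cat >> /etc/mysql/conf.d/doublewrite.cnf <<'EOF'
[mysqld]
innodb_doublewrite_files=2
innodb_doublewrite_pages=128
EOF
systemctl restart mysql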

Results with innodb_doublewrite_files=2 and innodb_doublewrite_pages=128

The problem with waves is fixed! 

And InnoDB Checkpoint Age chart:

Now we are able to keep Uncheckpointed Bytes under Max Checkpoint Age, and this is what fixed the "wave" pattern.

We can say that parallel doublewrite is a welcome new improvement, but the fact that one has to change innodb_doublewrite_pages in order to get improved performance is, in my opinion, a design flaw.

But there are still a lot of variations at 1-sec resolution and small drops. Before we get to them, let's take a look at another suggestion: use --innodb_adaptive_hash_index=0 (that is, disable the Adaptive Hash Index). I will use AHI=0 on the charts to mark this setting.
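
For what it's worth, innodb_adaptive_hash_index is a dynamic variable, so it can be toggled on a running server for a quick check before persisting the setting in my.cnf:

# Disable AHI at runtime and verify; persist it in my.cnf for the actual benchmark runs
mysql -e "SET GLOBAL innodb_adaptive_hash_index = OFF;"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';"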

Let's take a look at the results with the improved settings and with --innodb_adaptive_hash_index=0.

Results with --innodb_adaptive_hash_index=0

To see the real improvement with --innodb_adaptive_hash_index=0, let's compare the bar charts:

Or in numeric form:

settings Avg tps, last 2000 sec
io_cap_max=7000,doublewrite=opt 7578.69
io_cap_max=7000,doublewrite=opt,AHI=0 7996.33

So --innodb_adaptive_hash_index=0 really brings some improvement, about 5.5%, and I will use --innodb_adaptive_hash_index=0 for further experiments.

Let's see if an increased innodb_buffer_pool_instances=32 will help to smooth out the periodic variance.

Results with innodb_buffer_pool_instances=32

So indeed, using innodb_buffer_pool_instances=32 gets us less variation, keeping overall throughput about the same: 7936.28 tps for this case.

Now let's review the parameter innodb_change_buffering=none, which Dimitri also suggests.

Results with innodb_change_buffering=none

There is NO practical difference if we disable innodb_change_buffer.

And if we take a look at PMM change buffer chart:

We can see there is NO Change Buffer activity outside of the initial 20 mins. I am not sure why Dimitri suggested disabling it. In fact, Change Buffer can be quite useful, and I will show it in my benchmark for the different workloads.

Now let's take a look at the suggested settings with innodb_io_capacity = innodb_io_capacity_max = 8000. That INCREASES innodb_io_capacity_max, and we compare the results to those with innodb_io_capacity_max = 7000.

Or in tabular form:

settings Avg tps, last 2000 sec
io_cap_max=7000,doublewrite=opt,AHI=0,BPI=32 7936.28
io_cap_max=8000,doublewrite=opt,AHI=0,BPI=32 7693.08

Actually, with innodb_io_capacity_max=8000 the throughput is LESS than with innodb_io_capacity_max=7000.

Can you guess why? 

Let’s compare InnoDB Checkpoint Age.

This is for innodb_io_capacity_max=8000:

And this is for innodb_io_capacity_max=7000:

This is like a child’s game: Find the difference.

The difference is that with  innodb_io_capacity_max=7000
Uncheckpointed Bytes is 13.66 GiB,
and with innodb_io_capacity_max=8000
Uncheckpointed Bytes is 12.51 GiB

What does it mean? It means that with innodb_io_capacity_max=7000, InnoDB HAS to flush FEWER pages and can still stay within Max Checkpoint Age.

In fact, if we try to push even further and use innodb_io_capacity_max=innodb_io_capacity=6500, we get the following InnoDB Checkpoint Age chart:

Here Uncheckpointed Bytes is 15.47 GiB. Does it improve throughput? Absolutely!

settings Avg tps, last 2000 sec
io_cap_max=6500,doublewrite=opt,AHI=0,BPI=32 8233.628
io_cap_max=7000,doublewrite=opt,AHI=0,BPI=32 7936.283
io_cap_max=8000,doublewrite=opt,AHI=0,BPI=32 7693.084

The difference between innodb_io_capacity_max=6500 and innodb_io_capacity_max=8000 is 7%.

Now it becomes clear what the manual means in the part where it says:

“Ideally, keep the setting as low as practical, but not so low that background activities fall behind”

So we really need to increase innodb_io_capacity_max to the level where Uncheckpointed Bytes stays under Max Checkpoint Age, but not by much more; otherwise InnoDB will do more work than is needed, and it will affect the throughput.
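
Conveniently, both settings are dynamic, so this adjustment can be done on a running server while watching the Checkpoint Age chart; the values below are just the ones from this benchmark, not a general recommendation:

# innodb_io_capacity must stay <= innodb_io_capacity_max, so when lowering both,
# lower innodb_io_capacity first (when raising both, raise the max first)
mysql -e "SET GLOBAL innodb_io_capacity = 6500;
          SET GLOBAL innodb_io_capacity_max = 6500;"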

In my opinion, this is a serious design flaw in InnoDB Adaptive Flushing: you actually need to wiggle innodb_io_capacity_max to achieve appropriate results.

Inverse relationship between innodb_io_capacity_max and innodb_log_file_size

To show an even more complicated relation between innodb_io_capacity_max and innodb_log_file_size, let's consider the following experiment.

We will increase innodb_log_file_size from 10GB to 20GB, effectively doubling our redo-log capacity.

And now let’s check InnoDB Checkpoint Age with innodb_io_capacity_max=7000:

We can see there is a lot of space in InnoDB logs which InnoDB does not use. There is only 22.58GiB of Uncheckpointed Bytes, while 33.24 GiB are available.

So what happens if we decrease innodb_io_capacity_max to 4500?

 InnoDB Checkpoint Age with innodb_io_capacity_max=4500:

In this setup, we can push Uncheckpointed Bytes to 29.80 GiB, and it has a positive effect on the throughput.

Let's compare the throughput:

settings Avg tps, last 2000 sec
io_cap_max=4500,log_size=40GB,doublewrite=opt,AHI=0,BPI=32 9865.308
io_cap_max=7000,log_size=40GB,doublewrite=opt,AHI=0,BPI=32 9374.121

So by decreasing innodb_io_capacity_max from 7000 to 4500 we can gain 5.2% in the throughput.

Please note that we can't continue to decrease innodb_io_capacity_max, because in that case Uncheckpointed Bytes risks exceeding Max Checkpoint Age, and this will lead to the negative effect of emergency flushing.

So again, in order to improve throughput, we should be DECREASING innodb_io_capacity_max, but only to a certain threshold. We should not set innodb_io_capacity_max to 30000, which is what the SATA SSD can really provide.

Again, for me, this is a major design flaw in the current InnoDB Adaptive Flushing. Please note this was a static workload; if your workload changes during the day, it is practically impossible to come up with an optimal value.

Conclusions:

Trying to summarize all of the above, I want to highlight:

  • To fix the "wave" pattern, we need to tune innodb_io_capacity_max and innodb_doublewrite_pages.
  • InnoDB parallel doublewrite in MySQL 8.0.20 is definitely a positive improvement, but the default values seem poorly chosen, in contradiction with the manual. I wish Oracle/MySQL shipped features that work out of the box for most users.
  • The InnoDB Adaptive Hash Index is not helping here, and you get better performance by disabling it. I also observed in other workloads that the InnoDB Adaptive Hash Index might be another broken subsystem, which Oracle neglects to fix and simply disables in its benchmarks.
  • InnoDB Change Buffer has no effect on this workload, so you may or may not disable it — there is no difference. But I saw a positive effect from InnoDB Change Buffer in other workloads, so I do not recommend blindly disabling it.
  • Now, about InnoDB Adaptive Flushing. In my opinion, InnoDB Adaptive Flushing relies too much on manual tuning of innodb_io_capacity_max, which in fact has nothing to do with the real storage IO capacity. Often you need to lower innodb_io_capacity_max to get better performance, but not make it too low, because at some point it will hurt performance. The best way to monitor this is to check the InnoDB Checkpoint Age chart in PMM.
  • I would encourage Oracle to fix the broken design of InnoDB Adaptive Flushing so that it detects IO capacity automatically and does not flush aggressively, but instead keeps Uncheckpointed Bytes just under Max Checkpoint Age. Let's hope Oracle fixes this faster than it did the doublewrite buffer, because history shows that to force Oracle to make improvements in the InnoDB IO subsystem, we need to do it first in Percona Server for MySQL, like we did with the parallel doublewrite buffer. For reference, parallel doublewrite was implemented first in Percona Server for MySQL 5.7.11-4, which was released March 15th, 2016. Oracle implemented parallel doublewrite (with non-optimal default settings) in MySQL 8.0.20, which was released four years after Percona Server, on April 4th, 2020.
Aug
14
2020
--

Evaluating Performance Improvements in MariaDB 10.5.5

Recently, I published a series of posts on MySQL and MariaDB, where MariaDB 10.5.4 did not necessarily perform well in some scenarios:

In comments left on How MySQL and MariaDB Perform on NVMe Storage, I was told that MariaDB 10.5.4 is not the best version, as there are already known performance bugs that will be fixed in MariaDB 10.5.5, and it would be better to test MariaDB 10.5.5. And now, MariaDB 10.5.5 comes with fixes:

I made a promise that I will test the new version, so this is me fulfilling my promise.

Instead of posting multiple posts, let's review all scenarios at once. Just to refresh the material, I've tested sysbench-tpcc 1000W, which is about 100GB in size, in the following setups (a rough invocation sketch follows the list):

  • Data is located on SATA SSD, innodb_buffer_pool_size=140GB (CPU-bound scenario)
  • Data is located on NVMe, innodb_buffer_pool_size=140GB (CPU-bound scenario)
  • Data is located on SATA SSD, innodb_buffer_pool_size=25GB (IO-bound scenario)
  • Data is located on NVMe, innodb_buffer_pool_size=25GB (IO-bound scenario)
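
For reference, the runs were done roughly along these lines; this is only a sketch, and the exact option values (tables, scale, threads, duration) are assumptions that depend on how the 1000W dataset was prepared:

# Rough sysbench-tpcc invocation sketch (10 tables x 100 warehouses = 1000W)
./tpcc.lua --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
           --mysql-db=sbt --db-driver=mysql --tables=10 --scale=100 \
           --threads=64 --time=3600 --report-interval=1 run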

Data is located on SATA SSD, innodb_buffer_pool_size=140GB (CPU-bound scenario)

(The original results are here – Checkpointing in MySQL and MariaDB)

For this re-test round, I compare only MariaDB 10.5.4 vs MariaDB 10.5.5:

MariaDB 10.5.4 vs MariaDB 10.5.5

After the difference in the warm-up behavior, the result is about the same, and we can confirm this with boxplots for the last 2500 sec:

MariaDB 10.5.4 vs MariaDB 10.5.5 2500 sec

Data is located on SATA SSD, innodb_buffer_pool_size=25GB (IO-bound scenario)

This one is interesting, and actually this is where we can see improvements in MariaDB 10.5.5.

improvements in MariaDB 10.5.5

To see the individual variations in detail, let's separate the charts:

And then again compare the throughput for the last 2500 sec:

compare the throughput for the last 2500 sec

What’s interesting is not only has MariaDB 10.5.5 improved the results, but it also shows a better average throughput than MySQL 8.0.21, though with a higher variation.

Data is located on NVMe, innodb_buffer_pool_size=140GB (CPU-bound scenario)

There we still see a weird initial drop followed by recovery, and the throughput is similar to MariaDB 10.5.4, but somewhat worse.

Data is located on NVMe, innodb_buffer_pool_size=25GB (IO-bound scenario)


This case is an absolute disaster for MariaDB 10.5.5; there is a bug, https://jira.mariadb.org/browse/MDEV-23399. Most of the time, MariaDB 10.5.5 shows throughput under 100 tps (while MySQL averages above 5000 tps).

I personally think this is a catastrophic bug, and MariaDB 10.5.5 with this performance should never have been released to the public, but instead should have stayed in the internal testing stage.
