Sep
20
2018
--

Prometheus 2 Times Series Storage Performance Analyses

cpu saturation and max core usage

Prometheus 2 time series database (TSDB) is an amazing piece of engineering, offering a dramatic improvement compared to “v2” storage in Prometheus 1 in terms of ingest performance, query performance and resource use efficiency. As we’ve been adopting Prometheus 2 in Percona Monitoring and Management (PMM), I had a chance to look into the performance of Prometheus 2 TSDB. This blog post details my observations.

Understanding the typical Prometheus workload

For someone who has spent their career working with general purpose databases, the typical workload of Prometheus is quite interesting. The ingest rate tends to remain very stable: typically, devices you monitor will send approximately the same amount of metrics all the time, and infrastructure tends to change relatively slowly.

Queries to the data can come from multiple sources. Some of them, such as alerting, tend to be very stable and predictable too. Others, such as users exploring data, can be spiky, though it is not common for this to be largest part of the load.

The Benchmark

In my assessment, I focused on handling an ingest workload. I had deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14)  on Linode using this StackScript.  For a maximally realistic load generation, I spin up multiple MySQL nodes running some real workloads (Sysbench TPC-C Test) , with each emulating 10 Nodes running MySQL and Linux using this StackScript

The observations below are based on a Linode instance with eight virtual cores and 32GB of memory, running  20 load driving simulating the monitoring of 200 MySQL instances. Or, in Prometheus Terms, some 800 targets; 440 scrapes/sec 380K samples ingested per second and 1.7M of active time series.

Design Observations

The conventional approach of traditional databases, and the approach that Prometheus 1.x used, is to limit amount of memory. If this amount of memory is not enough to handle the load, you will have high latency and some queries (or scrapes) will fail. Prometheus 2 memory usage instead is configured by

storage.tsdb.min-block-duration

   which determines how long samples will be stored in memory before they are flushed (the default being 2h). How much memory it requires will depend on the number of time series, the number of labels you have, and your scrape frequency in addition to the raw ingest rate. On disk, Prometheus tends to use about three bytes per sample. Memory requirements, though, will be significantly higher.

While the configuration knob exists to change the head block size, tuning this by users is discouraged. So you’re limited to providing Prometheus 2 with as much memory as it needs for your workload.

If there is not enough memory for Prometheus to handle your ingest rate, then it will crash with out of memory error message or will be killed by OOM killer.

Adding more swap space as a “backup” in case Prometheus runs out of RAM does not seem to work as using swap space causes a dramatic memory usage explosion. I suspect swapping does not play well with Go garbage collection.

Another interesting design choice is aligning block flushes to specific times, rather than to time since start:

head block Prometheus 2

As you can see from this graph, flushes happen every two hours, on the clock. If you change min-block-duration  to 1h, these flushes will happen every hour at 30 minutes past the hour.

(If you want to see this and other graphs for your Prometheus Installation you can use this Dashboard. It has been designed for PMM but can work for any Prometheus installation with little adjustments.)

While the active block—called head block— is kept in memory, blocks containing older blocks are accessed through

mmap()

  This eliminates the need to configure cache separately, but also means you need to allocate plenty of memory for OS Cache if you want to query data older than fits in the head block.

It also means the virtual memory you will see Prometheus 2 using will get very high: do not let it worry you.

Prometheus process memory usage

Another interesting design choice is WAL configuration. As you can see in the storage documentation, Prometheus protects from data loss during a crash by having WAL log. The exact durability guarantees, though, are not clearly described. As of Prometheus 2.3.2, Prometheus flushes the WAL log every 10 seconds, and this value is not user configurable.

Compactions

Prometheus TSDB is designed somewhat similar to the LSM storage engines – the head block is flushed to disk periodically, while at the same time, compactions to merge a few blocks together are performed to avoid need to scan too many blocks for queries

Here is the number of data blocks I observed on my system after a 24h workload:

active data blocks

If you want more details about storage, you can check out the meta.json file which has additional information about the blocks you have, and how they came about.

{
       "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
       "minTime": 1536472800000,
       "maxTime": 1536494400000,
       "stats": {
               "numSamples": 8292128378,
               "numSeries": 1673622,
               "numChunks": 69528220
       },
       "compaction": {
               "level": 2,
               "sources": [
                       "01CPYRY9MS465Y5ETM3SXFBV7X",
                       "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                       "01CPZ6NR4Q3PDP3E57HEH760XS"
               ],
               "parents": [
                       {
                               "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                               "minTime": 1536472800000,
                               "maxTime": 1536480000000
                       },
                       {
                               "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                               "minTime": 1536480000000,
                               "maxTime": 1536487200000
                       },
                       {
                               "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                               "minTime": 1536487200000,
                               "maxTime": 1536494400000
                       }
               ]
       },
       "version": 1
}

Compactions in Prometheus are triggered at the time the head block is flushed, and several compactions may be performed at these intervals:Prometheus 2 compactions

Compactions do not seem to be throttled in any way, causing huge spikes of disk IO usage when they run:

spike in io activity for compactions

And a spike in CPU usage:

spike in CPU usage during compactions

This, of course, can cause negative impact to the system performance. This is also why it is one of the greatest questions in LSM engines: how to run compactions to maintain great query performance, but not cause too much overhead.

Memory utilization as it relates to the compaction process is also interesting:

Memory utilization during compaction process

We can see after compaction a lot of memory changes from “Cached”  to “Free”, meaning potentially valuable data is washed out from memory. I wonder if

fadvice()

 or other techniques to minimize data washout from cache are in use, or if this is caused by the fact that the blocks which were cached are destroyed by the compaction process

Crash Recovery

Crash recovery from the log file takes time, though it is reasonable. For an ingest rate of about 1 mil samples/sec, I observed some 25 minutes recovery time on SSD storage:

level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13OURCE)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."

The problem I observed with recovery is that it is very memory intensive. While the server may be capable of handling the normal load with memory to spare if it crashes, it may not be able to ever recover due to running out of memory.  The only solution I found for this is to disable scraping, let it perform crash recovery, and then restarting the server with scraping enabled

Warmup

Another behavior to keep in mind is the need for warmup – a lower performance/higher resource usage ratio immediately after start. In some—but not all—starts I can observe significantly higher initial CPU and memory usage

cpu usage during warmup

memory usage during warmup

The gaps in the memory utilization graph show that Prometheus is not initially able to perform all the scrapes configured, and as such some data is lost.

I have not profiled what exactly causes this extensive CPU and memory consumption. I suspect these might be happening when new time series entries are created, at head block, and at high rate.

CPU Usage Spikes

Besides compaction—which is quite heavy on the Disk IO—I also can observe significant CPU spikes about every 2 minutes. These are longer with a higher ingest ratio. These seem to be caused by Go Garbage Collection during these spikes: at least some CPU cores are completely saturated

cpu usage spikes maybe during Go Garbage collection

cpu saturation and max core usage

These spikes are not just cosmetic. It looks like when these spikes happen, the Prometheus internal /metrics endpoint becomes unresponsive, thus producing data gaps during the exact time that the spikes occur:

Prometheus 2 process memory usage

We can also see the Prometheus Exporter hitting a one second timeout:

scrape time by job

We can observe this correlates with garbage collection:

garbage collection in Prometheus processing

Conclusion

Prometheus 2 TSDB offers impressive performance, being able to handle a cardinality of millions of time series, and also to handle hundreds of thousands of samples ingested per second on rather modest hardware. CPU and disk IO usage are both very impressive. I got up to 200K/metrics/sec per used CPU core!

For capacity planning purposes you need to ensure that you have plenty of memory available, and it needs to be real RAM. The actual amount of memory I observed was about 5GB per 100K/samples/sec ingest rate, which with additional space for OS cache, makes it 8GB or so.

There is work that remains to be done to avoid CPU and IO usage spikes, though this is not unexpected considering how young Prometheus 2 TSDB is – if we look at InnoDB, TokuDB, RocksDB, WiredTiger all of them had similar problem in their initial releases.

The post Prometheus 2 Times Series Storage Performance Analyses appeared first on Percona Database Performance Blog.

Sep
03
2018
--

40 million tables in MySQL 8.0 with ZFS

40 million tables in MySQL 8

40 million tables in MySQL 8In my previous blog post about millions of table in MySQL 8, I was able to create one million tables and test the performance of it. My next challenge is to create 40 million tables in MySQL 8 using shared tablespaces (one tablespace per schema). In this blog post I’m showing how to do it and what challenges we can expect.

Background

Once again – why do we need so many tables in MySQL, what is the use case? The main reason is: customer isolation. With the new focus on security and privacy (take GDPR for example) it is much easier and more beneficial to create a separate schema (or “database” in MySQL terms) for each customer. That creates a new set of challenges that we will need to solve. Here is the summary:

  1. Too many files. For each table MySQL creates an FRM file. With MySQL 8.0, this is not the case for InnoDB tables (new data dictionary): it does not create FRM files, only creates IBD file.
  2. Too much storage overhead. Just to create 40 million tables we will need to have ~4 – 5 Tb of space. The ZFS filesystem can help here a lot, through compression – see below.
  3. MySQL does not work well with so many tables. We have observed a lot of overhead (MySQL needs to open/close table definition files) and contention (table definitions needs to be stored in memory to avoid performance penalty, which introduce mutex contention)

Challenges

When I approached the task of creating 40 million tables, my first challenge was disk space. Just to create them, I needed at least 5Tb of fast disk storage. The good news is: we have the ZFS filesystem which provides compression out of the box. With compression I was able to use just a 250G drive with ZFS – the compression ratio is > 10x:

# du -sh --apparent-size /var/lib/mysql-data
4.7T    /var/lib/mysql-data
# du -sh /var/lib/mysql-data
131G    /var/lib/mysql-data

The second challenge is how to create those tables in a reasonable amount of time. I created a script to “provision” the databases (create all 40 millions tables). The good new is that the performance regression in “create table” speed and scalability bug was fixed so I was able to use this script to create 40 million tables using shared tablespaces (one tablespace per schema):

#/bin/bash
function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists $db"; return 0; fi;
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..4000000}
do
        for i in {1..40}
        do
                let c=$c+1
                echo $c
                db="sbtest_$c"
                do_db &
        done
        wait
        #if [ $c > 4000000 ]; then exit; fi
done

40 million tables in MySQL 8

Now it’s time for a real test. I’m using the latest MySQL 8 version (at the time of writing): 8.0.12. This implements the new data dictionary.

MySQL config file:

[mysqld]
datadir=/var/lib/mysql-data
socket=/var/lib/mysql-data/mysql.sock
datadir=/var/lib/mysql-data
log-error = /var/lib/mysql-log/error.log
server_id = 12345
log_bin = /var/lib/mysql-log/binlog
relay_log=/var/lib/mysql-log/relay-bin
skip-log-bin=1
innodb_log_group_home_dir = /var/lib/mysql-log
innodb_doublewrite = 0
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=2G
innodb_buffer_pool_size=4G
tablespace_definition_cache = 524288
schema_definition_cache = 524288
table_definition_cache = 524288
table_open_cache=524288
open-files-limit=1000000

Sysbench shell script:

function run_sb() {
conn=" --db-driver=mysql --mysql-socket=/var/lib/mysql-data/mysql.sock  --mysql-db=sbtest_1 --mysql-user=sbtest --mysql-password=abc "
sysbench $conn --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time ./select_custom.lua run | tee -a sysbench_2.txt
}
let db_count=400000
table_count=100
max_time=10000
num_threads=32
run_sb

Sysbench lua script:

pathtest = "/usr/share/sysbench/tests/include/oltp_legacy/"
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end
function thread_init(thread_id)
   set_vars()
end
function event()
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   oltp_db_count = tonumber(oltp_db_count) or 1
   -- local oltp_db_count = 4
   table_name = "sbtest_" .. sb_rand(1, oltp_db_count)..".sbtest".. sb_rand(1, oltp_tables_count)
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
      rs = db_query("SELECT id FROM " .. table_name .." LIMIT 1")
end

Please note that the tables are empty – no data.

Now we can run the benchmark. Unfortunately, we have a serious mutex contention in the data dictionary. Here are the results:

[ 453s ] thds: 32 tps: 1203.96 qps: 1203.96 (r/w/o: 1203.96/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 454s ] thds: 32 tps: 1202.32 qps: 1202.32 (r/w/o: 1202.32/0.00/0.00) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[ 455s ] thds: 32 tps: 1196.74 qps: 1196.74 (r/w/o: 1196.74/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 456s ] thds: 32 tps: 1197.18 qps: 1197.18 (r/w/o: 1197.18/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 457s ] thds: 32 tps: 887.11 qps: 887.11 (r/w/o: 887.11/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 458s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 459s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 460s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 461s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 462s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 463s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 464s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 465s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 466s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 467s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 468s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 469s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 470s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 471s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 472s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 473s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 474s ] thds: 32 tps: 403.96 qps: 403.96 (r/w/o: 403.96/0.00/0.00) lat (ms,95%): 16819.24 err/s: 0.00 reconn/s: 0.00
[ 475s ] thds: 32 tps: 1196.00 qps: 1196.00 (r/w/o: 1196.00/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 476s ] thds: 32 tps: 1208.96 qps: 1208.96 (r/w/o: 1208.96/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 477s ] thds: 32 tps: 1192.06 qps: 1192.06 (r/w/o: 1192.06/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 478s ] thds: 32 tps: 1173.89 qps: 1173.89 (r/w/o: 1173.89/0.00/0.00) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00

As we can see, for ~15 seconds no queries were processed: a complete MySQL stall. That situation – complete stall – happens constantly, every ~25-30 seconds.

Show engine innodb status query shows mutex contention:

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 498635
--Thread 140456572004096 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451898689280 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451896919808 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140456571119360 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457044215552 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140456572299008 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457043035904 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140456571709184 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451897214720 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451896624896 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457042740992 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451899279104 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457042446080 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
OS WAIT ARRAY INFO: signal count 89024
RW-shared spins 11216, rounds 14847, OS waits 3641

I’ve filed a  new MySQL bug: DICT_SYS mutex contention causes complete stall when running with 40 mill tables.

I’ve also tested with pareto distribution in sysbench, and even set the ratio to 0.05 (5%) and 0.01 (1%), and mutex contention is still an issue. I have used the following updated sysbench script:

function run_sb() {
conn=" --db-driver=mysql --mysql-socket=/var/lib/mysql-data/mysql.sock  --mysql-db=sbtest_1 --mysql-user=sbtest --mysql-password=abc "
sysbench $conn --rand-type=$rand_type --rand-pareto-h=$pareto_h --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time $test_name run | tee -a sysbench_2.txt
}
let db_count=400000
table_count=100
max_time=10000
num_threads=32
rand_type="pareto"
pareto_h=0.01
test_name="./select_custom.lua"
echo "Now running $rand_type for $max_time seconds, test=$test_name"
run_sb

And the results with 0.01 (1%) are the following:

[ 55s ] thds: 32 tps: 72465.29 qps: 72465.29 (r/w/o: 72465.29/0.00/0.00) lat (ms,95%): 0.53 err/s: 0.00 reconn/s: 0.00
[ 56s ] thds: 32 tps: 68641.04 qps: 68641.04 (r/w/o: 68641.04/0.00/0.00) lat (ms,95%): 0.61 err/s: 0.00 reconn/s: 0.00
[ 57s ] thds: 32 tps: 70479.82 qps: 70479.82 (r/w/o: 70479.82/0.00/0.00) lat (ms,95%): 0.57 err/s: 0.00 reconn/s: 0.00
[ 58s ] thds: 32 tps: 31395.55 qps: 31395.55 (r/w/o: 31395.55/0.00/0.00) lat (ms,95%): 0.49 err/s: 0.00 reconn/s: 0.00
[ 59s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 61s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 62s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 63s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 64s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 65s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 66s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 67s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 68s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 69s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 71s ] thds: 32 tps: 18879.04 qps: 18879.04 (r/w/o: 18879.04/0.00/0.00) lat (ms,95%): 0.75 err/s: 0.00 reconn/s: 0.00
[ 72s ] thds: 32 tps: 70924.82 qps: 70924.82 (r/w/o: 70924.82/0.00/0.00) lat (ms,95%): 0.48 err/s: 0.00 reconn/s: 0.00
[ 73s ] thds: 32 tps: 72395.57 qps: 72395.57 (r/w/o: 72395.57/0.00/0.00) lat (ms,95%): 0.47 err/s: 0.00 reconn/s: 0.00
[ 74s ] thds: 32 tps: 72483.22 qps: 72484.22 (r/w/o: 72484.22/0.00/0.00) lat (ms,95%): 0.58 err/s: 0.00 reconn/s: 0.00

ZFS

The ZFS filesystem provides compression, which helps tremendously in this case. When MySQL creates an InnoDB table it will create a new blank .ibd file and pre-allocate some pages, which will be blank. I have configured ZFS compression and can see > 10x compression ratio:

# zfs get all | grep compressratio
mysqldata                       compressratio         12.47x                      -
mysqldata                       refcompressratio      1.00x                       -
mysqldata/mysql                 compressratio         12.47x                      -
mysqldata/mysql                 refcompressratio      1.00x                       -
mysqldata/mysql/data            compressratio         12.51x                      -
mysqldata/mysql/data            refcompressratio      12.54x                      -
mysqldata/mysql/log             compressratio         2.79x                       -
mysqldata/mysql/log             refcompressratio      4.57x                       -

Conclusion

It is possible to create 40 million tables with MySQL 8.0 using shared tablespaces. ZFS provides an excellent compression ratio (with gzip) which can help by reducing the overhead of “schema per customer” architecture. Unfortunately, the new data dictionary in MySQL 8.0.12 suffers from the DICT_SYS mutex contention and causes constant “stalls”.

The post 40 million tables in MySQL 8.0 with ZFS appeared first on Percona Database Performance Blog.

Aug
29
2018
--

Scaling IO-Bound Workloads for MySQL in the Cloud

InnoDB / MyRocks throughput on IO1

Is increasing GP2 volumes size or increasing IOPS for IO1 volumes a valid method for scaling IO-Bound workloads? In this post I’ll focus on one question: how much can we improve performance if we use faster cloud volumes? This post is a continuance of previous cloud research posts:

To recap, in Amazon EC2 we can use gp2 and io1 volumes. gp2 performance can be scaled with size, i.e for gp2 volume size of 500GB we get 1500 iops; size 1000GB – 3000 iops; and for 3334GB – 10000 iops (maximal possible value). For io1 volumes we can “buy” throughput up to 30000 iops.

So I wanted to check how both InnoDB and RocksDB storage engines perform on these volumes with different throughput.

Benchmark Scenario

I will use the same datasize that I used in Saving With MyRocks in The Cloud, that is sysbench-tpcc, 50 tables, 100W each, about 500GB datasize in InnoDB and 100GB in RocksDB (compressed with LZ4).

Volumes settings: gp2 volumes from 500GB (1000GB for InnoDB) to 3400GB with 100GB increments (so each increment increases throughput by 300 iops); io1 volumes: 1TB in size, iops from 1000 to 30000 with 1000 increments.

Let’s take look at the results. I will use a slightly different format than usual, but hopefully it represents the results better. You will see density throughout the plots—a higher and narrower chart represents less variance in the throughput. The plot represents the distribution of the throughput.

Results on GP2 volumes:

InnoDB/MyRocks throughput on gp2

It’s quite interesting to see how the result scales with better IO throughput. InnoDB does not improve its throughput after gp2 size 2600GB, while MyRocks continues to scale linearly. The problem with MyRocks is that there is a lot of variance in throughput (I will show a one second resolution chart).

Results on IO1 volumes

InnoDB / MyRocks throughput on IO1

Here MyRocks again shows an impressive growth as as we add more IO capacity, but also shows a lot of variance on high capacity volumes.

Let’s compare how engines perform with one second resolution. GP2 volume, 3400GB:

InnoDB/MyRocks throughput on gp2 3400GB

IO1 volume, 30000 iops:

InnoDB/MyRocks throughput on IO1 30000 IOPS

So for MyRocks there seems to be periodical background activity, which does not allow it to achieve a stable throughput.

Raw results, if you’d like to review them, can be found here: https://github.com/Percona-Lab-results/201808-rocksdb-cloudio

Conclusions

If you are looking to improve throughput in IO-bound workloads, either increasing GP2 volumes size or increasing IOPS for IO1 volumes is a valid method, especially for the MyRocks engine.

The post Scaling IO-Bound Workloads for MySQL in the Cloud appeared first on Percona Database Performance Blog.

Aug
07
2018
--

Resource Usage Improvements in Percona Monitoring and Management 1.13

PMM 1-13 reduction CPU usage by 5x

In Percona Monitoring and Management (PMM) 1.13 we have adopted Prometheus 2, and with this comes a dramatic improvement in resource usage, along with performance improvements!

What does it mean for you? This means you can have a significantly larger number of servers and database instances monitored by the same PMM installation. Or you can reduce the instance size you use to monitor your environment and save some money.

Let’s look at some stats!

CPU Usage

PMM 1.13 reduction in CPU usage by 5x

Percona Monitoring and Management 1.13 reduction in CPU usage after adopting Prometheus 2 by 8x

We can see an approximate 5x and 8x reduction of CPU usage on these two PMM Servers. Depending on the workload, we see CPU usage reductions to range between 3x and 10x.

Disk Writes

There is also less disk write bandwidth required:

PMM 1.13 reduction in disk write bandwidth

On this instance, the bandwidth reduction is “just” 1.5x times. Note this is disk IO for the entire PMM system, which includes more than only the Prometheus component. Prometheus 2 itself promises much more significant IO bandwidth reduction according to official benchmarks

According to the same benchmark, you should expect disk space usage reduction by 33-50% for Prometheus 2 vs Prometheus 1.8. The numbers will be less for Percona Monitoring and Management, as it also stores Query Statistics outside of Prometheus.

Resource usage on the monitored hosts

Also, resource usage on the monitored hosts is significantly reduced:

Percona Monitoring and Management 1.13 reduction of resource usage by Prometheus 2

Why does CPU usage go down on a monitored host with a Prometheus 2 upgrade? This is because PMM uses TLS for the Prometheus to monitored host communication. Before Prometheus 2, a full handshake was performed for every scrape, taking a lot of CPU time. This was optimized with Prometheus 2, resulting in a dramatic CPU usage decrease.

Query performance is also a lot better with Prometheus 2, meaning dashboards visually load a lot faster, though we did not do any specific benchmarks here to share the hard numbers. Note though this improvement only applies when you’re querying the data which is stored in Prometheus 2.

If you’re querying data that was originally stored in Prometheus 1.8, it will be queried through the much slower and less efficient “Remote Read” interface, being quite a bit slower and using a lot more CPU and memory resources.

If you love better efficiency and Performance, consider upgrading to PMM 1.13!

The post Resource Usage Improvements in Percona Monitoring and Management 1.13 appeared first on Percona Database Performance Blog.

Aug
01
2018
--

Saving With MyRocks in The Cloud

The main focus of a previous blog post was the performance of MyRocks when using fast SSD devices. However, I figured that MyRocks would be beneficial for use in cloud workloads, where storage is either slow or expensive.

In that earlier post, we demonstrated the benefits of MyRocks, especially for heavy IO workloads. Meanwhile, Mark wrote in his blog that the CPU overhead in MyRocks might be significant for CPU-bound workloads, but this should not be the issue for IO-bound workloads.

In the cloud the cost of resources is a major consideration. Let’s review the annual cost for the processing and storage resources.

 Resource cost/year, $   IO cost $/year   Total $/year 
c5.9xlarge  7881    7881
1TB io1 5000 IOPS  1500  3900    5400
1TB io1 10000 IOPS  1500  7800    9300
1TB io1 15000 IOPS  1500  11700  13200
1TB io1 20000 IOPS  1500  15600  17100
1TB io1 30000 IOPS  1500  23400  24900
3.4TB GP2 (10000 IOPS)  4800    4800

 

The scenario

The server version is Percona Server 5.7.22

For instances, I used c5.9xlarge instances. The reason for c5 was that it provides high performance Nitro virtualization: Brendan Gregg describes this in his blog post. The rationale for 9xlarge instances was to be able to utilize io1 volumes with a 30000 IOPS throughput – smaller instances will cap io1 throughput at a lower level.

I also used huge gp2 volumes: 3400GB, as this volume provides guaranteed 10000 IOPS even if we do not use io1 volumes. This is a cheaper alternative to io1 volumes to achieve 10000 IOPS.

For the workload I used sysbench-tpcc 5000W (50 tables * 100W), which for InnoDB gave about 471GB in storage used space.

For the cache I used 27GB and 54G buffer size, so the workload is IO-heavy.

I wanted to compare how InnoDB and RocksDB performed under this scenario.

If you are curious I prepared my terraform+ansible deployment files here: https://github.com/vadimtk/terraform-ansible-percona

Before jumping to the results, I should note that for MyRocks I used LZ4 compression for all levels, which in its final size is 91GB. That is five times less than InnoDB size. This alone provides operational benefits—for example to copy InnoDB files (471GB) from a backup volume takes longer than 1 hour, while it is much faster (five times) for MyRocks.

The benchmark results

So let’s review the results.

InnoDB versus MyRocks throughput in the cloud

Or presenting average throughput in a tabular form:

cachesize IOPS engine avg TPS
27 5000 innodb 132.66
27 5000 rocksdb 481.03
27 10000 innodb 285.93
27 10000 rocksdb 1224.14
27 10000gp2 innodb 227.19
27 10000gp2 rocksdb 1268.89
27 15000 innodb 436.04
27 15000 rocksdb 1839.66
27 20000 innodb 584.69
27 20000 rocksdb 2336.94
27 30000 innodb 753.86
27 30000 rocksdb 2508.97
54 5000 innodb 197.51
54 5000 rocksdb 667.63
54 10000 innodb 433.99
54 10000 rocksdb 1600.01
54 10000gp2 innodb 326.12
54 10000gp2 rocksdb 1559.98
54 15000 innodb 661.34
54 15000 rocksdb 2176.83
54 20000 innodb 888.74
54 20000 rocksdb 2506.22
54 30000 innodb 1097.31
54 30000 rocksdb 2690.91

 

We can see that MyRocks outperformed InnoDB in every single combination, but it is also important to note the following:

MyRocks on io1 5000 IOPS showed the performance that InnoDB showed in io1 15000 IOPS.

That means that InnoDB requires three times more in storage throughput. If we take a look at the storage cost, it corresponds to three times more expensive storage. Given that MyRocks requires less storage, it is possible to save even more on storage capacity.

On the most economical storage (3400GB gp2, which will provide 10000 IOPS) MyRocks showed 4.7 times better throughput.

For the 30000 IOPS storage, MyRocks was still better by 2.45 times.

However it is worth noting that MyRocks showed a greater variance in throughput during the runs. Let’s review the charts with 1 sec resolution for GP2 and io1 30000 IOPS storage:Throughput 1 sec resolution for GP2 and io1 30000 IOPS storage MyROCKS versus InnoDB

Such variance might be problematic for workloads that require stable throughput and where periodical slowdowns are unacceptable.

Conclusion

MyRocks is suitable and beneficial not only for fast SSD, but also for cloud deployments. By requiring less IOPS, MyRocks can provide better performance and save on the storage costs.

However, before evaluating MyRocks, make sure that your workload is IO-bound i.e. the working set is much bigger than available memory. For CPU-intensive workloads (where the working set fits into memory), MyRocks will be less beneficial or even perform worse than InnoDB (as described in the blog post A Look at MyRocks Performance)

 

 

 

The post Saving With MyRocks in The Cloud appeared first on Percona Database Performance Blog.

Jul
18
2018
--

Why Consumer SSD Reviews are Useless for Database Performance Use Case

Anandtech Table reviewing consumer SSD performance

If you’re reading consumer SSD reviews and using them to estimate SSD performance under database workloads, you’d better stop. Databases are not your typical consumer applications and they do not use IO in the same way.

Let’s look, for example, at this excellent AnandTech review of Samsung 960 Pro –  a consumer NVMe device that I happen to have in my test lab.

Anandtech Table reviewing consumer SSD performance

The summary table is actually great, showing the performance both at Queue Depth 1 (single threaded) as well as Queue Depth 32 – a pretty heavy concurrent load.

Even at QD1 we see 50K (4K) writes per second, which should be enough for pretty serious database workloads.

In reality, though, you might be in for some disappointing surprises. While “normal” buffered IO is indeed quite fast, this drive really hates fsync() calls, with a single thread fsync() latency of 3.5ms or roughly 300 fsync/sec. That’s just slightly more than your old school spinning drive.

Why is fsync() performance critical for databases?

To achieve Durabilitythe letter “D” of ACIDdatabases tend to rely on a write ahead log (WAL) which is sequentially written. The WAL must be synced to disk on every transaction commit using fsync() or similar measures, such as opening file with O_SYNC flag. These tend to have similar performance implications.

Other database operations use fsync() too, but writing WAL is where it usually hurts the most.    

In a fully durable configuration MySQL tends to be impacted even more by poor fsync() performance. It may need to perform as many as three fsync operations per transaction commit. Group commit reduces the impact on throughput but transaction latency will still be severely impacted

Want more bad news? If the fsync() performance is phenomenal on your consumer SSD it indeed might be too good to be true. Over the years, some consumer SSDs “faked” fsync and accepted possible data loss in the case of power failure. This might not be a big deal if you only use them for testing but it is a showstopper for any real use.

Want to know more about your drive’s fsync() performance?  You can use these sysbench commands:

sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 prepare
sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 run | grep "avg:"

You can also use ioping as described in this blog post

I wish that manufacturers’ tech specifications described fsync latency, along with a clear statement as to whether the drive guarantees no loss of data on power failure. Likewise, I wish folk doing storage reviews could include these in their research.

Interested in fsync() performance for variety of devices?  Yves Trudeau wrote an excellent blog post about fsync() performance on various storage devices  a few months ago.

Other technical resources

Principal Support Escalation Specialist Sveta Smirnova presents Troubleshooting MySQL Concurrency Issues with Load Testing Tools. 

You can download a three part series of eBooks by Principal Consultant Alexander Rubin and me on MySQL Performance.

The post Why Consumer SSD Reviews are Useless for Database Performance Use Case appeared first on Percona Database Performance Blog.

Jul
13
2018
--

On MySQL and Intel Optane performance

MySQL 8 versus Percona Server with heavy IO application on Intel Optane

Recently, Dimitri published the results of measuring MySQL 8.0 on Intel Optane storage device. In this blog post, I wanted to look at this in more detail and explore the performance of MySQL 8, MySQL 5.7 and Percona Server for MySQL using a similar set up. The Intel Optane is a very capable device, so I was puzzled that Dimitri chose MySQL options that are either not safe or not recommended for production workloads.

Since we have an Intel Optane in our labs, I wanted to run a similar benchmark, but using settings that we would recommend our customers to use, namely:

  • use innodb_checksum
  • use innodb_doublewrite
  • use binary logs with sync_binlog=1
  • enable (by default) Performance Schema

I still used

charset=latin1

  (even though the default is utf8mb4 in MySQL 8) and I set a total size of InnoDB log files to 30GB (as in Dimitri’s benchmark). This setting allocates big InnoDB log files to ensure there is no pressure from adaptive flushing. Though I have concerns about how it works in MySQL 8, this is a topic for another research.

So let’s see how MySQL 8.0 performed with these settings, and compare it with MySQL 5.7 and Percona Server for MySQL 5.7.

I used an Intel Optane SSD 905P 960GB device on the server with 2 socket Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz CPUs.

To highlight the performance difference I wanted to show, I used a single case: sysbench 8 tables 50M rows each (which is about ~120GB of data) and buffer pool 32GB. I ran sysbench oltp_read_write in 128 threads.

First, let’s review the results for MySQL 8 vs MySQL 5.7

After achieving a steady state – we can see that MySQL 8 does not have ANY performance improvements over MySQL 5.7.

Let’s compare this with Percona Server for MySQL 5.7

MySQL 8 versus Percona Server with heavy IO application on Intel Optane

Percona Server for MySQL 5.7 shows about 60% performance improvement over both MySQL 5.7 and MySQL 8.

How did we achieve this? All our improvements are described here: https://www.percona.com/doc/percona-server/LATEST/performance/xtradb_performance_improvements_for_io-bound_highly-concurrent_workloads.html. In short:

  1. Parallel doublewrite.  In both MySQL 5.7 and MySQL 8 writes are serialized by writing to doublewrite.
  2. Multi-threaded LRU flusher. We reported and proposed a solution here https://bugs.mysql.com/bug.php?id=70500. However, Oracle have not incorporated the solution upstream.
  3. Single page eviction. This is another problematic area in MySQL’s flushing algorithm. The bug https://bugs.mysql.com/bug.php?id=81376 was reported over 2 years ago, but unfortunately it’s still overlooked.

Summarizing performance findings:

  • For Percona Server for MySQL during this workload, I observed 1.4 GB/sec  reads and 815 MB/sec  writes
  • For MySQL 5.7 and MySQL 8 the numbers are 824 MB/sec reads and  530 MB/sec writes.

My opinion is that Oracle focused on addressing the wrong performance problems in MySQL 8 and did not address the real issues. In this benchmark, using real production settings, MySQL 8 does not show any significant performance benefits over MySQL 5.7 for workloads characterized by heavy IO writes.

With this, I should admit that Intel Optane is a very performant storage. By comparison, on Intel 3600 SSD under the same workload, for Percona Server I am able to achieve only 2000 tps, which is 2.5x times slower than with Intel Optane.

Drawing some conclusions

So there are a few outcomes I can highlight:

  • Intel Optane is a very capable drive, it is easily the fastest of those we’ve tested so far
  • MySQL 8 is not able to utilize all the power of Intel Optane, unless you use unsafe settings (which to me is the equivalent of driving 200 MPH on a highway without working brakes)
  • Oracle has focused on addressing the wrong IO bottlenecks and has overlooked the real ones
  • To get all the benefits of Intel Optane performance, use a proper server—Percona Server for MySQL—which is able to utilize more IOPS from the device.

The post On MySQL and Intel Optane performance appeared first on Percona Database Performance Blog.

Jul
12
2018
--

Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance

Execution map for func1()

MySQL stored procedures, functions and triggers are tempting constructs for application developers. However, as I discovered, there can be an impact on database performance when using MySQL stored routines. Not being entirely sure of what I was seeing during a customer visit, I set out to create some simple tests to measure the impact of triggers on database performance. The outcome might surprise you.

Why stored routines are not optimal performance wise: short version

Recently, I worked with a customer to profile the performance of triggers and stored routines. What I’ve learned about stored routines: “dead” code (the code in a branch which will never run) can still significantly slow down the response time of a function/procedure/trigger. We will need to be careful to clean up what we do not need.

Profiling MySQL stored functions

Let’s compare these four simple stored functions (in MySQL 5.7):

Function 1:

CREATE DEFINER=`root`@`localhost` FUNCTION `func1`() RETURNS int(11)
BEGIN
	declare r int default 0;
RETURN r;
END

This function simply declares a variable and returns it. It is a dummy function

Function 2:

CREATE DEFINER=`root`@`localhost` FUNCTION `func2`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2
    THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

This function calls another function, levenshtein_limit_n (calculates levenshtein distance). But wait: this code will never run – the condition IF 1=2 will never be true. So that is the same as function 1.

Function 3:

CREATE DEFINER=`root`@`localhost` FUNCTION `func3`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select levenshtein_limit_n('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select levenshtein_limit_n('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select levenshtein_limit_n('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

Here there are four conditions and none of these conditions will be true: there are 4 calls of “dead” code. The result of the function call for function 3 will be the same as function 2 and function 1.

Function 4:

CREATE DEFINER=`root`@`localhost` FUNCTION `func3_nope`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select does_not_exit('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select does_not_exit('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select does_not_exit('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select does_not_exit('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

This is the same as function 3 but the function we are running does not exist. Well, it does not matter as the

select does_not_exit

  will never run.

So all the functions will always return 0. We expect that the performance of these functions will be the same or very similar. Surprisingly it is not the case! To measure the performance I used the “benchmark” function to run the same function 1M times. Here are the results:

+-----------------------------+
| benchmark(1000000, func1()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (1.75 sec)
+-----------------------------+
| benchmark(1000000, func2()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (2.45 sec)
+-----------------------------+
| benchmark(1000000, func3()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (3.85 sec)
+----------------------------------+
| benchmark(1000000, func3_nope()) |
+----------------------------------+
|                                0 |
+----------------------------------+
1 row in set (3.85 sec)

As we can see func3 (with four dead code calls which will never be executed, otherwise identical to func1) runs almost 3x slower compared to func1(); func3_nope() is identical in terms of response time to func3().

Visualizing all system calls from functions

To figure out what is happening inside the function calls I used performance_schema / sys schema to create a trace with ps_trace_thread() procedure

  1. Get the thread_id for the MySQL connection:
    mysql> select THREAD_ID from performance_schema.threads where processlist_id = connection_id();
    +-----------+
    | THREAD_ID |
    +-----------+
    |        49 |
    +-----------+
    1 row in set (0.00 sec)
  2. Run ps_trace_thread in another connection passing the thread_id=49:
    mysql> CALL sys.ps_trace_thread(49, concat('/var/lib/mysql-files/stack-func1-run1.dot'), 10, 0, TRUE, TRUE, TRUE);
    +--------------------+
    | summary            |
    +--------------------+
    | Disabled 0 threads |
    +--------------------+
    1 row in set (0.00 sec)
    +---------------------------------------------+
    | Info                                        |
    +---------------------------------------------+
    | Data collection starting for THREAD_ID = 49 |
    +---------------------------------------------+
    1 row in set (0.00 sec)
  3. At that point I switched to the original connection (thread_id=49) and run:
    mysql> select func1();
    +---------+
    | func1() |
    +---------+
    |       0 |
    +---------+
    1 row in set (0.00 sec)
  4. The sys.ps_trace_thread collected the data (for 10 seconds, during which I ran the
    select func1()

     ), then it finished its collection and created the dot file:

    +-----------------------------------------------------------------------+
    | Info                                                                  |
    +-----------------------------------------------------------------------+
    | Stack trace written to /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-----------------------------------------------------------------------+
    1 row in set (9.21 sec)
    +-------------------------------------------------------------------------------+
    | Convert to PDF                                                                |
    +-------------------------------------------------------------------------------+
    | dot -Tpdf -o /tmp/stack_49.pdf /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-------------------------------------------------------------------------------+
    1 row in set (9.21 sec)
    +-------------------------------------------------------------------------------+
    | Convert to PNG                                                                |
    +-------------------------------------------------------------------------------+
    | dot -Tpng -o /tmp/stack_49.png /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-------------------------------------------------------------------------------+
    1 row in set (9.21 sec)
    Query OK, 0 rows affected (9.45 sec)

I repeated these steps for all the functions above and then created charts of the commands.

Here are the results:

Func1()

Execution map for func1()

Func2()

Execution map for func2()

Func3()

Execution map for func3()

 

As we can see there is a sp/jump_if_not call for every “if” check followed by an opening tables statement (which is quite interesting). So parsing the “IF” condition made a difference.

For MySQL 8.0 we can also see MySQL source code documentation for stored routines which documents how it is implemented. It reads:

Flow Analysis Optimizations
After code is generated, the low level sp_instr instructions are optimized. The optimization focuses on two areas:

Dead code removal,
Jump shortcut resolution.
These two optimizations are performed together, as they both are a problem involving flow analysis in the graph that represents the generated code.

The code that implements these optimizations is sp_head::optimize().

However, this does not explain why it executes “opening tables”. I have filed a bug.

When slow functions actually make a difference

Well, if we do not plan to run one million of those stored functions we will never even notice the difference. However, where it will make a difference is … inside a trigger. Let’s say that we have a trigger on a table: every time we update that table it executes a trigger to update another field. Here is an example: let’s say we have a table called “form” and we simply need to update its creation date:

mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.31 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

That is good and fast. Now we create a trigger which will call our dummy func1():

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
	select func1() into r;
END

Now repeat the update. Remember: it does not change the result of the update as we do not really do anything inside the trigger.

mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.90 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Just adding a dummy trigger will add 2x overhead: the next trigger, which does not even run a function, introduces a slowdown:

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.52 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Now, lets use func3 (which has “dead” code and is equivalent to func1):

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
	select func3() into r;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (1.06 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

However, running the code from the func3 inside the trigger (instead of calling a function) will speed up the update:

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select levenshtein_limit_n('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select levenshtein_limit_n('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select levenshtein_limit_n('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.66 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Memory allocation

Potentially, even if the code will never run, MySQL will still need to parse the stored routine—or trigger—code for every execution, which can potentially lead to a memory leak, as described in this bug.

Conclusion

Stored routines and trigger events are parsed when they are executed. Even “dead” code that will never run can significantly affect the performance of bulk operations (e.g. when running this inside the trigger). That also means that disabling a trigger by setting a “flag” (e.g.

if @trigger_disable = 0 then ...

 ) can still affect performance of bulk operations.

The post Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance appeared first on Percona Database Performance Blog.

Jul
11
2018
--

AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD

Ubuntu 16 AMD EPYC

Ever since AMD released their EPYC CPU for servers I wanted to test it, but I did not have the opportunity until recently, when Packet.net started offering bare metal servers for a reasonable price. So I started a couple of instances to test Percona Server for MySQL under this CPU. In this benchmark, I discovered some interesting discrepancies in performance between  AMD and Intel CPUs when running under systemd .

The set up

To test CPU performance, I used a read-only in-memory sysbench OLTP benchmark, as it burns CPU cycles and no IO is performed by Percona Server.

For this benchmark I used Packet.net c2.medium.x86 instances powered by AMD EPYC 7401P processors. The OS is exposed to 48 CPU threads.

For the OS I tried

  • Ubuntu 16.04 with default kernel 4.4 and upgraded to 4.15
  • Ubuntu 18.04 with kernel 4.15
  • Percona Server started from SystemD and without SystemD (for reasons which will become apparent later)

To have some points for comparison, I also ran a similar workload on my 2 socket Intel CPU server, with CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. I recognize this is not most recent Intel CPU, but this was the best I had at the time, and it also gave 48 CPU Threads.

Ubuntu 16

First, let’s review the results for Ubuntu 16

Or in tabular format:

Threads Ubuntu 16, kernel 4.4; systemd Ubuntu 16, kernel 4.4;

NO systemd

Ubuntu 16, kernel 4.15
1 943.44 948.7 899.82
2 1858.58 1792.36 1774.45
4 3533.2 3424.05 3555.94
8 6762.35 6731.57 7010.51
12 10012.18 9950.3 10062.82
16 13063.39 13043.55 12893.17
20 15347.68 15347.56 14756.27
24 16886.24 16864.81 16176.76
30 18150.2 18160.07 17860.5
36 18923.06 18811.64 19528.27
42 19374.86 19463.08 21537.79
48 20110.81 19983.05 23388.18
56 20548.51 20362.31 23768.49
64 20860.51 20729.52 23797.14
72 21123.71 21001.06 23645.95
80 21370 21191.24 23546.03
90 21622.54 21441.73 23486.29
100 21806.67 21670.38 23392.72
128 22161.42 22031.53 23315.33
192 22388.51 22207.26 22906.42
256 22091.17 21943.37 22305.06
512 19524.41 19381.69 19181.71

 

There are few conclusions we can see from this data

  1. AMD EPYC CPU scales quite well to the number of CPU Threads
  2. The recent kernel helps to boost the throughput.

Ubuntu 18.04

Now, let’s review the results for Ubuntu 18.04

Threads Ubuntu 18, systemd
Ubuntu 18, NO systemd
1 833.14 843.68
2 1684.21 1693.93
4 3346.42 3359.82
8 6592.88 6597.48
12 9477.92 9487.93
16 12117.12 12149.17
20 13934.27 13933
24 15265.1 15152.74
30 16846.02 16061.16
36 18488.88 16726.14
42 20493.57 17360.56
48 22217.47 17906.4
56 22564.4 17931.83
64 22590.29 17902.95
72 22472.75 17857.73
80 22421.99 17766.76
90 22300.09 17773.57
100 22275.24 17646.7
128 22131.86 17411.55
192 21750.8 17134.63
256 21177.25 16826.53
512 18296.61 17418.72

 

This is where the result surprised me: on Ubuntu 18.04 with SystemD running Percona Server for MySQL as a service the throughput was up to 24% better than if Percona Server for MySQL is started from a bash shell. I do not know exactly what causes this dramatic difference—systemd uses different slices for services and user commands, and somehow it affects the performance.

Baseline benchmark

To establish a baseline, I ran the same benchmark on my Intel box, running Ubuntu 16, and I tried two kernels: 4.13 and 4.15

Threads Ubuntu 16, kernel 4.13, systemd Ubuntu 16, kernel 4.15, systemd
Ubuntu 16, kernel 4.15, NO systemd
1 820.07 798.42 864.21
2 1563.31 1609.96 1681.91
4 2929.63 3186.01 3338.47
8 6075.73 6279.49 6624.49
12 8743.38 9256.18 9622.6
16 10580.14 11351.31 11984.64
20 12790.96 12599.78 14147.1
24 14213.68 14659.49 15716.61
30 15983.78 16096.03 17530.06
36 17574.46 18098.36 20085.9
42 18671.14 19808.92 21875.84
48 19431.05 22036.06 23986.08
56 19737.92 22115.34 24275.72
64 19946.57 21457.32 24054.09
72 20129.7 21729.78 24167.03
80 20214.93 21594.51 24092.86
90 20194.78 21195.61 23945.93
100 20753.44 21597.26 23802.16
128 20235.24 20684.34 23476.82
192 20280.52 20431.18 23108.36
256 20410.55 20952.64 22775.63
512 20953.73 22079.18 23489.3

 

Here we see the opposite result with SystemD: Percona Server running from a bash shell shows the better throughput compared with the SystemD service. So for some reason, systemd works differently for AMD and Intel CPUs. Please let me know if you have any ideas on how to deal with the impact that systemd has on performance.

Conclusions

So there are some conclusions from these results:

  1. AMD EPYC shows a decent performance scalability; the new kernel helps to improve it
  2. systemd shows different effects on throughput for AMD and Intel CPUs
  3. With AMD the throughput declines for a high concurrent workload with 512 threads, while Intel does not show a decline.

The post AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD appeared first on Percona Database Performance Blog.

Jun
27
2018
--

Scaling PostgreSQL with PgBouncer: You May Need a Connection Pooler Sooner Than You Expect

PostgreSQL connection pools

PostgreSQL connection poolsAs PostgreSQL based applications scale, the need to implement connection pooling can become apparent sooner than you might expect. Since, PostgreSQL to date has no built-in connection pool handler, in this post I’ll explore some of the options for implementing connection pooling. In doing so, we’ll take a look at some of the implications for application performance.

PostgreSQL implements connection handling by “forking” it’s main OS process into a child process for each new connection. An interesting practical consequence of this is that we get a full view of resource utilization per connection in PostgreSQL at the OS level (the output below is from top):

PID USER      PR NI VIRT RES  SHR S %CPU %MEM TIME+  COMMAND             
24379 postgres  20 0 346m 148m 122m R 61.7  7.4 0:46.36 postgres: sysbench sysbench ::1(40120)
24381 postgres  20 0 346m 143m 119m R 62.7  7.1 0:46.14 postgres: sysbench sysbench ::1(40124)
24380 postgres  20 0 338m 137m 121m R 57.7  6.8 0:46.04 postgres: sysbench sysbench ::1(40122)
24382 postgres  20 0 338m 129m 115m R 57.4  6.5 0:46.09 postgres: sysbench sysbench ::1(40126)

This extra visualization comes at some additional cost though: it is more expensive—in terms of time and memory mostly—to fork an OS process than it would be, for example, to spawn a new thread for an existing process. This might be irrelevant if the rate at which connections are opened and closed is low but becomes increasingly important to consider over time as this reality changes. That is possibly one of the reasons why the need for a connection pooling mechanism often manifests itself early in the scaling life of a PostgreSQL-based application.

Scaling connections

When an application server sends a connection request to a PostgreSQL database it will be received by the Postmaster process. Postmaster, observing the limit set by max_connections, will then fork itself, creating a new backend process to handle this new connection. This backend process will live until the connection is closed by the client or terminated by PostgreSQL itself.

If the application was conceived with the database in mind it will make an effective use of connections, reusing existing ones whenever possible while avoiding idle connections from laying around for too long. Unfortunately, that is not always the case. Plus there are legitimate situations where the application popularity/usage increases. These naturally increases the rate at which new connection requests are created. There is a practical limit for the number of connections a server can manage at a given time. Beyond these limits, we start seeing contention in different areas. This in turn affects the server’s capacity to process requests, which can lead to bigger problems.

Putting a cap on the number of connections that may exist at any given time helps somewhat. However, considering the time it takes to properly establish a new connection, if the rate at which new connection requests arrive continues to rise we may soon reach a scaling problem.

Connection pooling for PostgreSQL applications

What if we could instead recycle existing connections to serve new client requests to gain on time (by avoiding the creation and remotion of yet another backend process each time), ultimately increasing transaction throughput, while also making a better use of the resources available? The strategy that revolves around the use of a cache (or pool) of connections that are kept open on the database server and re-used by different client requests is known as connection pooling.

Given there is no built-in handler for PostgreSQL, there are typically two ways of implementing such a mechanism:

  1. On the application side. Some frameworks like Ruby on Rails include their own, built-in connection pool mechanism. There are also libraries that extend database driver functionality to include connection pooling support, such as c3P0.
  2. As an external service, sitting between the application and the database server. The application will then connect to this external service instead, which will relay each request from the application to the database server through one of the connections it maintains in its pool.

An application connection pooler might provide better integration with the application, for example, when it comes to the use of prepared statements and the re-utilization of the cached result set. However, sometimes they may fall short on understanding PostgreSQL protocol and this might result in things like failing to properly clear pre-cached memory. An external service, on the other hand, won’t provide such a tight integration with a particular application but usually allows for greater customization and better maintenance of the pool. This often provides the ability to increase and decrease the number of connections it maintains cached dynamically and according to the demand observed, while also respecting pre-set threshold marks. Probably the most popular connection pooler used with PostgreSQL is PgBouncer.

A simple test

In order to illustrate the impact a connection pooler might have on the performance of a PostgreSQL server, I took advantage of the recent tests we did with sysbench-tpcc on PostgreSQL and repeated them partially by making use of PgBouncer as a connection pooler.

When we first ran the tests our goal was to optimize PostgreSQL for sysbench-tpcc workload running with 56 concurrent clients (threads), with the server having the same amount of CPUs available. The goal this time was to vary the number of concurrent clients (56, 150, 300 and 600) to see how the server would cope with the scaling of connections.

Instead of running each round for 10 hours, however, I ran them for 30 minutes only. This might mask, or at least change, the effects of checkpointing and caching observed in our initial tests.

PgBouncer

I compiled the latest version of PgBouncer (1.8.1) following the instruction on GitHub and installed it in our test box, alongside PostgreSQL.

PgBouncer can be configured with three different types of pooling:

  • Session pooling: once the client gets one of the connections in the pool assigned it will keep it until it disconnects (or a timeout is reached).
  • Transaction pooling: once the client gets a connection from the pool, it keeps it to run a single transaction only.  After that, the connection is returned to the pool. If the client wants to run other transactions it has to wait until it gets another connection assigned to it.
  • Statement pooling: in this mode, PgBouncer will return a connection to its pool as soon as the first query is processed, which means multi-statement transactions would break in this mode.

I went with transaction pooling for my tests as the workload of sysbench-tpcc is composed of several short and single-statement transactions. Here’s the configuration file I used in full, which I named pgbouncer.ini:

[databases]
sbtest = host=127.0.0.1 port=5432 dbname=sbtest
[pgbouncer]
listen_port = 6543
listen_addr = 127.0.0.1
auth_type = md5
auth_file = users.txt
logfile = pgbouncer.log
pidfile = pgbouncer.pid
admin_users = postgres
pool_mode = transaction
default_pool_size=56
max_client_conn=600

Apart from pool_mode, the other variables that matter the most are (definitions below came from PgBouncer’s manual page):

  • default_pool_size: how many server connections to allow per user/database pair.
  • max_client_conn: maximum number of client connections allowed

The users.txt file specified by auth_file contains only a single line with the user and password used to connect to PostgreSQL; more elaborate authentication methods are also supported.

Running the test

I started PgBouncer as a daemon with the following command:

$ pgbouncer -d pgbouncer.ini

Apart from running the benchmark for only 30 minutes and varying the number of concurrent threads each time, I employed the exact same options for sysbench-tpcc we used in our previous tests. The example below is from the first run with threads=56:

$ ./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=1800 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=****** --db-driver=pgsql run > /var/lib/postgresql/Nando/56t.txt

For the tests using the connection pooler I adapted the connection options so as to connect with PgBouncer instead of PostgreSQL directly. Note it remains a local connection:

./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=1800 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=****** --pgsql-port=6543 --db-driver=pgsql run > /var/lib/postgresql/Nando/P056t.txt

After each sysbench-tpcc execution I cleared the OS cache with the following command:

$ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'

It is not like this will have made much of a difference. As discussed in our previous post shared_buffers was set with 75% of RAM, enough to fit all of the “hot data” in memory.

Results

Without further ado here are the results I obtained:

comparing the effects of running sysbench-tpcc with a connection pooler
TPS from sysbench-tpcc: comparing direct connection to PostgreSQL and the use of PgBouncer as a connection pooler

When running sysbench-tpcc with only 56 concurrent clients the use of direct connections to PostgreSQL provided a throughput (TPS stands for transactions per second) 2.5 times higher than that obtained when using PgBouncer. The use of a connection pooler in this case was extremely detrimental to performance. At such small scale there were no gains obtained from a pool of connections, only overhead.

When running the benchmark with 150 concurrent clients, however, we start seeing the benefits of employing a connection pooler. In fact, such benefits most probably materialize much earlier. With hindsight going from 56 concurrent transactions to 150 was too big of an initial jump.

It was somewhat surprising to me to realize at first that such throughput could be sustained with PgBouncer even when doubling and then quadrupling the number of concurrent clients. What happens, in this case, is that instead of flooding PostgreSQL with that many requests at once, they all stop at PgBouncer’s door. PgBouncer only allows the next request to proceed to PostgreSQL once one of the connections in its pool is freed. Remember, I configured it in transaction pooling mode. The process is transparent to PostgreSQL. It has no idea how many requests are waiting at PgBouncer’s door to be processed (and thus it is spared the trouble and doesn’t freak out!). It’s effectively like outsourcing connection management, beyond the ones used by the pool itself, to a contractor. PgBouncer does this hard work so PostgreSQL doesn’t need to.

This strategy seems to work great for sysbench-tpcc. With other workloads, the balance point might lie elsewhere.

Experimenting with bigger and smaller connection pools

For the tests above I set default_pool_size on PgBouncer equal to the number of CPU cores available on this server (56). To explore the tuning of this parameter, I repeated these tests using bigger connection pools (150, 300, 600) as well as a smaller one (14). The following chart summarizes the results obtained:

sysbench-tpcc with pgbouncer: comparing different pool sizes
How the use of PgBouncer impacts throughput on sysbench-tpcc: comparing the use of different pool sizes

Using a much smaller connection pool (14), sized to ¼ of the number of available CPUs, still yielded a result almost as good. That says a lot about how much leveraging connection handling alone to PgBouncer already helps. Maybe some of PgBouncer’s “gatekeeper” qualities could be incorporated by PostgreSQL itself?

Doubling the number of connections in the pool didn’t make any practical difference. But as soon as we extrapolate that number to 600 (which is the number of maximum concurrent threads I’ve tested) throughput becomes comparable to when not using a connection pooler once the number of concurrent threads is greater than the number of available CPUs. That’s true even when running as many concurrent threads as there are connections available in the pool (600). There’s a practical limit to it, which clearly is on the PostgreSQL side. That’s expected, otherwise, the need for a connection pool wouldn’t be as important.

As a starting point, setting the connection pool size equal to the number of CPUs available in the server looks like a good idea. There may be a hard limit for the pool around 150 connections or so, after which further benefits aren’t realized. However, that’s only speculation. Further tests using both different hardware and workloads would be needed to investigate this properly.

Here’s the table that summarizes the results obtained, for a different view:

Do I need a connection pooler ?

The need to couple PostgreSQL with a connection pooler depends on a number of factors including:

  • The number of connections typically established with the server. Take into account that the number usually varies significantly during the day. Think about the average number per hour, during different hours in the day, and particularly when compared with peak time.
  • The effective throughput (TPS) produced by these connections
  • The number of CPUs available in view of the effective throughput
  • The nature of your workload:
    • Whether your application opens a new connection for each new request or tends to leave connections open for longer
    • Whether it is composed of various single-statement transactions (AUTOCOMMIT=ON style) or big and long transactions.

Unfortunately, I don’t have a formula for this. But now that you understand how a connection pooler such as PgBouncer works and where it tends to benefit PostgreSQL’s performance (even if only having sysbench-tpcc’s workload as the sole example here) it’s a matter of investigating the points above and experimenting. Experimenting is the key! Though possibly using reads only on a stand-by replica at first, to stay on the safe side…

In a future post, we will cover how the architecture of a database solution changes with the use of connection poolers: where to place them, how to cope with single point of failure issues and how they can be used alongside load balancers.

The post Scaling PostgreSQL with PgBouncer: You May Need a Connection Pooler Sooner Than You Expect appeared first on Percona Database Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com