Nov
12
2019
--

Watch Out for Disk I/O Performance Issues when Running EXT4

Performance Issues When Running EXT4

Recently, at Percona Live Europe 2019, Dimitri Kravchuk from Oracle mentioned that he observed some unclear drop in performance for MySQL on an ext4 filesystem with the latest Linux kernels. I decided to check this case out on my side and found out that indeed, starting from linux kernel 4.9, there are some cases with notable (up to 2x) performance drops for ext4 filesystem in direct i/o mode.

So what’s wrong with ext4? It started in 2016 from the patch that was pushed to kernel 4.9: “ext4: Allow parallel DIO reads”. The purpose of that patch was to help to improve read scalability in direct i/o mode. However, along with improvements in pure read workloads, it also introduced regression in intense mixed random read/write scenarios. And it’s quite weird, but this issue had not been noticed for 3 years. Only this summer was performance regression reported and discussed in LKML. As a result of this discussion, there is an attempt to fix it, but from my current understanding that fix will be pushed only to upcoming 5.4/5.5 kernels. Below I will describe what this regression looks like, how it affects MySQL workloads, and what workarounds we can apply to mitigate this issue.

ext4 Performance Regression

Let’s start by defining the scope of this ext4 performance regression. It will only have an impact if the setup/workload meets following conditions:
– fast ssd/nvme
– linux kernel>=4.9
– files resides on ext4 file system
– files opened with O_DIRECT flag
– at least some I/O should be synchronous

In the original report to LKML, the issue was observed/reproduced with a mixed random read/write scenario with sync I/O and O_DIRECT. But how do these factors relate to MySQL? The only files opened by InnoDB in O_DIRECT mode are tablespaces (*.ibd files), and I/O pattern for tablespaces consists of following operations:

– reads ibd data in synchronous mode
– writes ibd data in asynchronous mode
– posix_allocate to extend tablespace file followed by a synchronous write
– fsync

There are also extra I/O from WAL log files:

– writes data to log files in synchronous mode
– fsync

So in the case of InnoDB tablespaces that are opened with O_DIRECT, we have a mix of sync reads and async writes and it turned out that such a combination along with sync writes to innodb log file is enough to cause notable performance regression as well. I have sketched the workload for fio tool (see below) that simulates the I/O access pattern for InnoDB and have run it for SSD and NVMe drives for linux kernels 4.4.0, 5.3.0, and 5.3.0 with ext4 scalability fix.

[global]
filename=tablespace1.ibd:tablespace2.ibd:tablespace3.ibd:tablespace4.ibd:tablespace5.ibd
direct=1
bs=16k
iodepth=1

#read data from *.ibd tablespaces
[ibd_sync_read]
rw=randread
ioengine=psync

#write data to *.ibd tavlespaces
[ibd_async_write]
rw=randwrite
ioengine=libaio

#write data to ib* log file
[ib_log_sync_write]
rw=write
bs=8k
direct=0
ioengine=psync
fsync=1
filename=log.ib
numjobs=1

fio results on the chart:

Observations:

– for SATA/SSD drive there is almost no difference in throughtput, and only at 16 threads do we see a drop in reads for ext4/kernel-5.3.0. For ext4/kernel-5.3.0 mounted with dioread_nolock (that enables scalability fixes), we see that IOPS back and even look better.
– for NVMe drive the situation looks quite different – until 8 i/o threads IOPS for both reads and writes are more/less similar but after increasing pressure on i/o we see a notable spike for writes and similar drop for reads. And again mounting ext4 with dioread_nolock helps to get the same throughput as and for kernels < 4.9.

The similar performance data for the original issue reported to LKML (with more details and analysis) can be found in the patch itself.

How it Affects MySQL

O_DIRECT

Now let’s check the impact of this issue on an IO-bound sysbench/OLTP_RW workload in O_DIRECT mode. I ran a test for the following setup:

– filesystem: xfs, ext4/default, ext4/dioread_nolock
– drives: SATA/SSD and NVMe
– kernels: 4.4.0, 5.3.0, 5.3.0+ilock_fix

Observations

– in the case of SATA/SSD, the ext4 scalability issue has an impact on tps rate after 256 threads and drop is 10-15%
– in the case of NVMe and regular ext4 with kernel 5.3.0 causes performance drop in ~30-80%. If we apply a fix by mounting ext4 with dioread_nolock or use xfs,  throughput looks good.

O_DSYNC

As ext4 regression affects O_DIRECT, let’s replace O_DIRECT with O_DSYNC and look at results of the same sysbench/OLTP_RW workload on kernel 5.3.0:

Note: In order to make results between O_DIRECT and O_DSYNC comparable, I have limited available memory for MySQL instance by cgroup.

Observations:

In the case of O_DSYNC and regular ext4, the performance is just 10% less than for O_DIRECT/ext4/dioread_nolock and O_DIRECT/xfs and ~35% better than for O_DIRECT/ext4. That means that O_DSYNC can be used as a workaround for cases when you have fast storage and ext4 as filesystem but can’t switch to xfs or upgrade kernel.

Conclusions/workarounds

If your workload/setup is affected, there are the following options that you may consider as a workaround:

– downgrade linux kernel to 4.8
– install kernel 5.3.0 with fix and mount ext4 with dioread_nolock option
– if O_DIRECT is important, switch to xfs filesystem
– if changing filesystem is not an option,  replace O_DIRECT with O_DSYNC+cgroup

Aug
15
2019
--

A Faster, Lightweight Trigger Function in C for PostgreSQL

Trigger Function in C for PostgreSQL

Trigger Function in C for PostgreSQLWe have been writing blog posts about how to write simple extensions in C language and a little more complex one by Ibrar which were well received by PostgreSQL user community. Then we observed that many PostgreSQL users create simple triggers for small auditing requirements, and then feel the pain of trigger on transactions. So we were discussing how simple/lightweight and faster a trigger function is when written in C. Generally, Trigger functions are written in high-level languages like PlpgSQL, but it has a higher overhead during execution and it can impact the transactions – and thereby application performance.

This blog post is an attempt to create a simple trigger function to address one of the common use-cases of triggers, which is to update auditing columns in a table.

In this post, we are going to introduce SPI (Server Programming Interface) functions for novice users. Towards the end of the blog, we share some of the quick benchmark results for understanding the benefits.

Example of Audit timestamp

Let’s proceed with taking up a case and assume that we have a table to hold transaction details. But auditing requirements say that there should be a timestamp on each tuple when the tuple is inserted and when it was last updated.

CREATE TABLE transdtls(
  transaction_id int,
  cust_id int,
  amount  int,
...
  insert_ts timestamp,
  update_ts timestamp
);

For demonstration purpose, let’s remove and trim the other columns and create a table with only 3 essential columns.

CREATE TABLE transdtls(
  transaction_id int,
  insert_ts timestamp,
  update_ts timestamp
);

Developing Trigger Function

The trigger function can also be developed and packaged as an extension, which we discussed in s previous blog post here. So we are not going to repeat those steps here. The difference is that file names are named as “trgr” instead of “addme” in the previous blog. Makefile is also modified to refer “trgr” files. This need not be same as the function name “trig_test” in the C source detailed below.

In the end, the following files are available in the development folder:

$ ls
Makefile trgr--0.0.1.sql trgr.c trgr.control

The trgr.c is the main source files with the following content:

#include <stdio.h>
#include <time.h>
#include "postgres.h"
#include "utils/rel.h"
#include "executor/spi.h"
#include "commands/trigger.h"
#include "utils/fmgrprotos.h"
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

extern Datum trig_test(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(trig_test);

Datum
trig_test(PG_FUNCTION_ARGS)
{
    TriggerData *trigdata = (TriggerData *) fcinfo->context;
    //TupleDesc   tupdesc;
    HeapTuple   tuple;
    HeapTuple   rettuple;
    int         attnum = 0;
    Datum       datumVal;

    //Get the structure of the tuple in the table.
    //tupdesc = trigdata->tg_relation->rd_att;

    //Make sure that the function is called from a trigger
    if (!CALLED_AS_TRIGGER(fcinfo))
        elog(ERROR, "are you sure you are calling from trigger manager?");

    //If the trigger is part of an UPDATE event
    if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
    {
        //attnum = SPI_fnumber(tupdesc,"update_ts");
        attnum = 3;
        tuple = trigdata->tg_newtuple;
    }
    else   //If the trigger is part of INSERT event
    {
        //attnum = SPI_fnumber(tupdesc,"insert_ts");
        attnum = 2;
        tuple = trigdata->tg_trigtuple;
    }
    //Get the current timestamp using "now"
    datumVal = DirectFunctionCall3(timestamp_in, CStringGetDatum("now"), ObjectIdGetDatum(InvalidOid), Int32GetDatum(-1));

    //Connect to Server and modify the tuple
    SPI_connect();
    rettuple = SPI_modifytuple(trigdata->tg_relation, tuple, 1, &attnum, &datumVal, NULL);
    if (rettuple == NULL)
    {
        if (SPI_result == SPI_ERROR_ARGUMENT || SPI_result == SPI_ERROR_NOATTRIBUTE)
                elog(ERROR, "SPI_result failed! SPI_ERROR_ARGUMENT or SPI_ERROR_NOATTRIBUTE");
         elog(ERROR, "SPI_modifytuple failed!");
    }
    SPI_finish();                           /* don't forget say Bye to SPI mgr */
    return PointerGetDatum(rettuple);
}

and

trgr--0.0.1.sql

  with the following content:

CREATE OR REPLACE FUNCTION trig_test() RETURNS trigger
     AS 'MODULE_PATHNAME','trig_test'
LANGUAGE C STRICT;

Now it is a matter of building, installing, and creating the extension.

$ make
$ sudo make install
psql> create extension trgr;

In case you don’t want to develop it as an extension, you may compile it to generate a shared object file (.so) file. Copy the same to the library folder of PostgreSQL binaries, which on my Ubuntu laptop is : /usr/lib/postgresql/11/lib/, and then define the function. You can even specify the full path of the shared object file like this:

CREATE FUNCTION trig_test() RETURNS trigger     
  AS '/usr/lib/postgresql/11/lib/trgr.so'
LANGUAGE C;

Using Trigger Function

Usage of trigger function is not different from regular PLpgSQL functions. You just need to attach the function to the table for all INSERT and UPDATE events.

CREATE TRIGGER transtrgr
 BEFORE INSERT OR UPDATE ON public.transdtls 
FOR EACH ROW EXECUTE PROCEDURE public.trig_test();

Benchmarking

For a fair comparison with trigger function written in PLpgSQL, a similar function is created as follows:

CREATE OR REPLACE FUNCTION transtrgr_pl()
  RETURNS TRIGGER AS $$
  BEGIN
     if  (TG_OP = 'UPDATE') then
        NEW.update_ts = now();
     else 
        NEW.insert_ts = now();
     end if;
    RETURN NEW;
  END;
  $$ language 'plpgsql';

The number of lines and the readability of the code is in favor of PLpgSQL. The development and debugging time required is much less.

Regarding the performance benchmarking, three cases are compared.

  1. PostgreSQL client/application providing the audit timestamp, so that trigger can be avoided.
  2. Trigger function in C language.
  3. Trigger function in PLpgSQL.

Here are the performance numbers in milliseconds for 1 million bulk inserts, obviously a smaller number is better.


Caveats

  1. The first case where there is no trigger on the database side, it takes less time. But the application and network need to take up the extra load, which is not considered in this test.
  2. The C function is bit hardcoded with an attribute number like
    attnum = 3;

    and if we want a generic trigger function which looks for specific column name, we can use SPI_fnumber function like

    attnum = SPI_fnumber(tupdesc,"update_ts");

    .  Such a generic trigger function can be used in multiple tables. Obviously, this involves more processing. Those lines are commented out in the source code. On repeated tests, the average time of execution increases to 1826.722 ms. Still, we can see that it is considerably faster than the PLpgSQL trigger function.

Discuss on Hacker News

May
09
2019
--

Improving OLAP Workload Performance for PostgreSQL with ClickHouse Database

Improving OLAP Workload Performance for PostgreSQL

Every database management system is not optimized for every workload. Database systems are designed for specific loads, and thereby give better performance for that workload. Similarly, some kinds of queries work better on some database systems and worse on others. This is the era of specialization, where a product is designed for a specific requirement or a specific set of requirements. We cannot achieve everything performance-wise from a single database system. PostgreSQL is one of the finest object databases and performs really well in OLTP types of workloads, but I have observed that its performance is not as good as some other database systems for OLAP workloads. ClickHouse is one of the examples that outperforms PostgreSQL for OLAP.

ClickHouse is an open source, column-based database management system which claims to be 100–1,000x faster than traditional approaches, capable of processing of more than a billion rows in less than a second.

Foreign Data Wrapper (Clickhousedb_fdw)

If we have a different kind of workload that PostgreSQL is not optimized for, what is the solution? Fortunately, PostgreSQL provides a mechanism where you can handle a different kind of workload while using it, by creating a data proxy within PostgreSQL that internally calls and fetches data from a different database system. This is called Foreign Data Wrapper, which is based on SQL-MED. Percona provides a foreign data wrapper for the Clickhousedb database system, which is available at Percona’s GitHub project repository.

Benchmark Machine

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: Samsung  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark Workload

Ontime (On Time Reporting Carrier On-Time Performance) is an openly-available dataset I have used to benchmark. It has a table size of 85GB with 109 number of different types of columns, and its designed queries more closely emulate the real world. 

https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh

Queries

Q# Query Contains Aggregates and Group By
Q1 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q3 SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4 SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
Q5 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier ) a
INNER JOIN (  SELECT   Carrier,count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier)b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q6 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier,  count(*) AS c FROM ontime  WHERE DepDelay>10 AND Year >= 2000 AND
Year <= 2008 GROUP BY Carrier) a INNER JOIN ( SELECT  Carrier, count(*) AS c2 FROM ontime  WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q7 SELECT Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
Q8 SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
Q9 select Year, count(*) as c1 from ontime group by Year;
Q10 SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
Q11 select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month) a;
Q12 SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
Q13 SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
Query Contains Joins
Q14 SELECT a.Year, c1/c2 FROM ( select Year, count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) a
INNER JOIN (select Year, count(*) as c2 from ontime GROUP BY Year ) b on a.Year=b.Year ORDER BY a.Year;
Q15 SELECT a.”Year”, c1/c2 FROM ( select “Year”, count(*)*1000 as c1 FROM fontime WHERE “DepDelay”>10 GROUP BY “Year”) a INNER JOIN (select “Year”, count(*) as c2
FROM fontime GROUP BY “Year” ) b on a.”Year”=b.”Year”;

Table-1: Queries used in the benchmark

 

Benchmark Result

Improving OLAP Workload Performance for PostgreSQL

 

This graph shows the time taken by each query in PostgreSQL, ClickHouse, and Clickhusedb_fdw. You can clearly see that ClickHouse performs really well on its own while PostgreSQL doesn’t. The benchmark has been done on a single node with multiple CPU cores and can scale with multiple nodes. But I think we can see enough performance as ClickHouse is almost 1000x faster in most of the queries. In a previous blog post, I went over some of the reasons why some queries do not perform well in clickhouse_fdw, if you would like to learn more.

Conclusion

The graph above proves my point that PostgreSQL is not fit for every type of workload and query. However, the beauty of PostgreSQL is its ability to use its foreign data wrappers to create a heterogeneous environment with other database systems, which can then optimally handle specific workloads.

The clickhousedb_fdw clearly gives users the power to use ClickHouse for much-improved performance for OLAP workloads whilst still interacting with PostgreSQL only. This has simplified client applications that don’t need to interact with multiple database systems.

May
01
2019
--

Benchmark ClickHouse Database and clickhousedb_fdw

postgres clickhouse fdw

In this research,  I wanted to see what kind of performance improvements could be gained by using a ClickHouse data source rather than PostgreSQL. Assuming that I would see performance advantages with using ClickHouse, would those advantages be retained if I access ClickHouse from within postgres using a foreign data wrapper (FDW)? The FDW in question is clickhousedb_fdw – an open source project from Percona!

The database environments under scrutiny are PostgreSQL v11, clickhousedb_fdw and a ClickHouse database. Ultimately, from within PostgreSQL v11, we are going to issue various SQL queries routed through our clickhousedb_fdw to the ClickHouse database. Then we’ll see how the FDW performance compares with those same queries executed in native PostgreSQL and native ClickHouse.

Clickhouse Database

ClickHouse is an open source column based database management system which can achieve performance of between 100 and 1000 times faster than traditional database approaches, capable of processing more than a billion rows in less than a second.

Clickhousedb_fdw

clickhousedb_fdw, a ClickHouse database foreign data wrapper, or FDW, is an open source project from Percona. Here’s a link for the GitHub project repository:

https://github.com/Percona-Lab/clickhousedb_fdw

I wrote a blog in March which tells you more about our FDW: https://www.percona.com/blog/2019/03/29/postgresql-access-clickhouse-one-of-the-fastest-column-dbmss-with-clickhousedb_fdw/

As you’ll see, this provides for an FDW for ClickHouse that allows you to SELECT from, and INSERT INTO, a ClickHouse database from within a PostgreSQL v11 server.

The FDW supports advanced features such as aggregate pushdown and joins pushdown. These significantly improve performance by utilizing the remote server’s resources for these resource intensive operations.

Benchmark environment

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: Samsung  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark tests

Rather than using some machine generated dataset for this benchmark, we used the “On Time Reporting Carrier On-Time Performance” data from 1987 to 2018. You can access the data using our shell script available here:

https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh

The size of the database is 85GB, providing a single table of 109 columns.

Benchmark Queries

Here are the queries I used to benchmark the ClickHouse, clickhousedb_fdw, and PostgreSQL.

Q# Query Contains Aggregates and Group By
Q1 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q3 SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4 SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
Q5 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier ) a
INNER JOIN (  SELECT   Carrier,count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier)b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q6 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier,  count(*) AS c FROM ontime  WHERE DepDelay>10 AND Year >= 2000 AND
Year <= 2008 GROUP BY Carrier) a INNER JOIN ( SELECT  Carrier, count(*) AS c2 FROM ontime  WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q7 SELECT Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
Q8 SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
Q9 select Year, count(*) as c1 from ontime group by Year;
Q10 SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
Q11 select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month) a;
Q12 SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
Q13 SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
Query Contains Joins
Q14 SELECT a.Year, c1/c2 FROM ( select Year, count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) a
INNER JOIN (select Year, count(*) as c2 from ontime GROUP BY Year ) b on a.Year=b.Year ORDER BY a.Year;
Q15 SELECT a.”Year”, c1/c2 FROM ( select “Year”, count(*)*1000 as c1 FROM fontime WHERE “DepDelay”>10 GROUP BY “Year”) a INNER JOIN (select “Year”, count(*) as c2
FROM fontime GROUP BY “Year” ) b on a.”Year”=b.”Year”;

Table-1: Queries used in benchmark

Query executions

Here are the results from the each of the queries when run in different database set ups: PostgreSQL with and without indexes, native ClickHouse and clickhousedb_fdw. The time is shown in milliseconds. 

Q# PostgreSQL PostgreSQL (Indexed) ClickHouse clickhousedb_fdw
Q1 27920 19634 23 57
Q2 35124 17301 50 80
Q3 34046 15618 67 115
Q4 31632 7667 25 37
Q5 47220 8976 27 60
Q6 58233 24368 55 153
Q7 30566 13256 52 91
Q8 38309 60511 112 179
Q9 20674 37979 31 81
Q10 34990 20102 56 148
Q11 30489 51658 37 155
Q12 39357 33742 186 1333
Q13 29912 30709 101 384
Q14 54126 39913 124 1364212
Q15 97258 30211 245 259

Table-1: Time taken to execute the queries used in benchmark

Reviewing the results

The graph shows the query execution time in milliseconds, the X-axis shows the query number from the tables above, while the Y-axis shows the execution time in milliseconds. The results for ClickHouse and the data accessed from postgres using clickhousedb_fdw are shown. From the table, you can see there is a huge difference between PostgreSQL and ClickHouse, but there is minimal difference between ClickHouse and clickhousedb_fdw.

 

Clickhouse Vs Clickhousedb_fdw (Shows the overhead of clickhousedb_fdw)

This graph shows the difference between ClickhouseDB and clickhousedb_fdw. In most of the queries, the FDW overhead is not that great, and barely significant apart from in Q12. This query involves joins and ORDER BY clause. Because of the ORDER BY clause the GROUP/BY and ORDER BY does not push down to ClickHouse.

In Table-2 we can see the spike of time in query Q12 and Q13. To reiterate, this is caused by the ORDER BY clause. To confirm this, I ran a queries Q-14 and Q-15 with and without the ORDER BY clause.  Without the ORDER BY clause the completion time is  259ms and with ORDER BY clause, it is 1364212. To debug that the query I explain both the queries, and here is the results of explain.

bm=# EXPLAIN VERBOSE SELECT a."Year", c1/c2
     FROM (SELECT "Year", count(*)*1000 AS c1 FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
     INNER JOIN(SELECT "Year", count(*) AS c2 FROM fontime GROUP BY "Year") b ON a."Year"=b."Year";

 

                                                    QUERY PLAN                                                      
Hash Join  (cost=2250.00..128516.06 rows=50000000 width=12)  
Output: fontime."Year", (((count(*) * 1000)) / b.c2)  
Inner Unique: true   Hash Cond: (fontime."Year" = b."Year")  
->  Foreign Scan  (cost=1.00..-1.00 rows=100000 width=12)        
Output: fontime."Year", ((count(*) * 1000))        
Relations: Aggregate on (fontime)        
Remote SQL: SELECT "Year", (count(*) * 1000) FROM "default".ontime WHERE (("DepDelay" > 10)) GROUP BY "Year"  
->  Hash  (cost=999.00..999.00 rows=100000 width=12)        
Output: b.c2, b."Year"        
->  Subquery Scan on b  (cost=1.00..999.00 rows=100000 width=12)              
Output: b.c2, b."Year"              
->  Foreign Scan  (cost=1.00..-1.00 rows=100000 width=12)                    
Output: fontime_1."Year", (count(*))                    
Relations: Aggregate on (fontime)                    
Remote SQL: SELECT "Year", count(*) FROM "default".ontime GROUP BY "Year"(16 rows)

 

bm=# EXPLAIN VERBOSE SELECT a."Year", c1/c2 FROM(SELECT "Year", count(*)*1000 AS c1 FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
     INNER JOIN(SELECT "Year", count(*) as c2 FROM fontime GROUP BY "Year") b  ON a."Year"= b."Year"
     ORDER BY a."Year";

 

                                                          QUERY PLAN
Merge Join  (cost=2.00..628498.02 rows=50000000 width=12)  
Output: fontime."Year", (((count(*) * 1000)) / (count(*)))  
Inner Unique: true   Merge Cond: (fontime."Year" = fontime_1."Year")  
->  GroupAggregate  (cost=1.00..499.01 rows=1 width=12)       
Output: fontime."Year", (count(*) * 1000)        
Group Key: fontime."Year"        
->  Foreign Scan on public.fontime  (cost=1.00..-1.00 rows=100000 width=4)              
Remote SQL: SELECT "Year" FROM "default".ontime WHERE (("DepDelay" > 10))
            ORDER BY "Year" ASC  
->  GroupAggregate  (cost=1.00..499.01 rows=1 width=12)        
Output: fontime_1."Year", count(*)         Group Key: fontime_1."Year"        
->  Foreign Scan on public.fontime fontime_1  (cost=1.00..-1.00 rows=100000 width=4) 
             
Remote SQL: SELECT "Year" FROM "default".ontime ORDER BY "Year" ASC(16 rows)

Conclusion

The results from these experiments show that ClickHouse offers really good performance, and clickhousedb_fdw offers the benefits of ClickHouse performance from within PostgreSQL. While there is some overhead when using clickhousedb_fdw, it is negligible and is comparable to the performance achieved when running natively within the ClickHouse database. This also confirms that the PostgreSQL foreign data wrapper push-down feature provides wonderful results.

Apr
25
2019
--

Creating Custom Sysbench Scripts

sysbench-lua for benchmark tooling

sysbench-lua for benchmark toolingSysbench has long been established as the de facto standard when it comes to benchmarking MySQL performance. Percona relies on it daily, and even Oracle uses it when blogging about new features in MySQL 8. Sysbench comes with several pre-defined benchmarking tests. These tests are written in an easy-to-understand scripting language called Lua. Some of these tests are called: oltp_read_write, oltp_point_select, tpcc, oltp_insert. There are over ten such scripts to emulate various behaviors found in standard OLTP applications.

But what if your application does not fit the pattern of traditional OLTP? How can you continue to utilize the power of load-testing, benchmarking, and results analysis with sysbench? Just write your own Lua script!

For those that want to jump ahead and see the full source, here you go.

Sysbench API

To start off, each Lua script you create must implement three core sysbench-Lua API functions. These are thread_init, thread_done, and event. You can read the comments in the code below for the meaning of each function and what is happening inside.

-- Called by sysbench one time to initialize this script
function thread_init()
  -- Create globals to be used elsewhere in the script
  -- drv - initialize the sysbench mysql driver
  drv = sysbench.sql.driver()
  -- con - represents the connection to MySQL
  con = drv:connect()
end
-- Called by sysbench when script is done executing
function thread_done()
  -- Disconnect/close connection to MySQL
  con:disconnect()
end
-- Called by sysbench for each execution
function event()
  -- If user requested to disable transactions,
  -- do not execute BEGIN statement
  if not sysbench.opt.skip_trx then
    con:query("BEGIN")
  end
  -- Run our custom statements
  execute_selects()
  execute_inserts()
  -- Like above, if transactions are disabled,
  -- do not execute COMMIT
  if not sysbench.opt.skip_trx then
    con:query("COMMIT")
  end
end

That’s all pretty simple and should function as a good template in your scripts. Now let’s take a look at the rest of the script.

Sanity checks and options

Now let’s get into the core code. At the top you’ll find the following sections:

if sysbench.cmdline.command == nil then
   error("Command is required. Supported commands: run")
end
sysbench.cmdline.options = {
  point_selects = {"Number of point SELECT queries to run", 5},
  skip_trx = {"Do not use BEGIN/COMMIT; Use global auto_commit value", false}
}

The first section is a sanity check to make sure the user actually wants to run this test. Other test scripts, mentioned above, support commands like prepare, run, and cleanup. Our script only supports run as the data we are using is pre-populated by our core application.

The second section allows us, the script writer, to let the user pass some options specific to our test script. In the code above, we can see an option for the number of SELECT statements that will be ran on each thread/iteration (default is 5) and the other option allows the user to disable BEGIN/COMMIT if they so desire (default is false). If you want more customization in your script, simply add more options. You’ll see how to reference these parameters later on.

The queries

Now it is time to define the custom queries we want to execute in our script.

-- Array of categories to be use in the INSERTs
local page_types = { "actor", "character", "movie" }
-- Array of COUNT(*) queries
local select_counts = {
  "SELECT COUNT(*) FROM imdb.title"
}
-- Array of SELECT statements that have 1 integer parameter
local select_points = {
  "SELECT * FROM imdb.title WHERE id = %d",
  "SELECT * FROM imdb.comments ORDER BY id DESC limit 10",
  "SELECT AVG(rating) avg FROM imdb.movie_ratings WHERE movie_id = %d",
  "SELECT * FROM imdb.users ORDER BY RAND() LIMIT 1"
}
-- Array of SELECT statements that have 1 string parameter
local select_string = {
  "SELECT * FROM imdb.title WHERE title LIKE '%s%%'"
}
-- INSERT statements
local inserts = {
  "INSERT INTO imdb.users (email_address, first_name, last_name) VALUES ('%s', '%s', '%s')",
  "INSERT INTO imdb.page_views (type, viewed_id, user_id) VALUES ('%s', %d, %d)"
}

The above code defines several arrays/lists of different queries. Why is this necessary? Later on in the code, we will have to parse each SQL statement and populate/replace the various parameters with randomly generated values. It would not do us any good to repeat the same SELECT * FROM fooTable WHERE id = 44 every time, now would it? Certainly not. We want to generate random numbers and have our queries select from the entire dataset.

Some queries have no parameters, some have integer-based, and some string-based. We will handle these differently below, which is why they are in different arrays above. This method also allows for future expansion. When you want to run additional queries within the script, just add another line to each array; no need to change any other code.

Parse and execute

The function below, execute_selects, will be called from the parent function, event, which we discussed earlier in the post. You can see for-loops for each of the three SELECT categories we created above. The comments inline should help explain what is happening. Note the use of the user-provided option –point-selects in the second loop below, which we created previously in the ‘Sanity and Options’ section.

function execute_selects()
  -- Execute each simple, no parameters, SELECT
  for i, o in ipairs(select_counts) do
    con:query(o)
  end
  -- Loop for however many queries the
  -- user wants to execute in this category
  for i = 1, sysbench.opt.point_selects do
    -- select random query from list
    local randQuery = select_points[math.random(#select_points)]
    -- generate random id and execute
    local id = sysbench.rand.pareto(1, 3000000)
    con:query(string.format(randQuery, id))
  end
  -- generate random string and execute
  for i, o in ipairs(select_string) do
    local str = sysbench.rand.string(string.rep("@", sysbench.rand.special(2, 15)))
    con:query(string.format(o, str))
  end
end

Two more things to mention for this code. First, you will notice the use of sysbench.rand.pareto to generate a random number between 1 and 3,000,000. For our dataset, we know that each table referenced in all queries relating to WHERE id = ? has that many number of rows, at minimum. This is specific to our data. Your values will certainly be different. Second, notice the use of sysbench.rand.string, and string.rep. The string.rep segment will generate a string comprised of ‘@’ symbols, between 2 and 15 characters long. That string of ‘@’ symbols will then be passed to sysbench.rand.string, which will swap out each ‘@’ for a random alphanumeric value. For example, ‘@@@@@@’ could be changed to ‘Hk9EdC’ which will then replace the ‘%s’ inside the query string (string.format) and be executed.

Handle inserts

Our INSERT statements require values. Again, sysbench calls the function execute_inserts from event on each iteration. Inside execute_inserts, we generate some fake string data using built-in functions described above.

Those strings are then formatted into the SQL and executed.

function create_random_email()
  local username = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  local domain = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  return username .. "@" .. domain .. ".com"
end
function execute_inserts()
  -- generate fake email/info
  local email = create_random_email()
  local firstname = sysbench.rand.string("first-" .. string.rep("@", sysbench.rand.special(2, 15)))
  local lastname = sysbench.rand.string("last-" .. string.rep("@", sysbench.rand.special(2, 15)))
  -- INSERT for new imdb.user
  con:query(string.format(inserts[1], email, firstname, lastname))
  -- INSERT for imdb.page_view
  local page = page_types[math.random(#page_types)]
  con:query(string.format(inserts[2], page, sysbench.rand.special(2, 500000), sysbench.rand.special(2, 500000)))
end

Example run

$ sysbench imdb_workload.lua \
    --mysql-user=imdb --mysql-password=imdb \
    --mysql-db=imdb --report-interval=1 \
    --events=0 --time=0 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.17 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 1 tps: 15.96 qps: 177.54 (r/w/o: 112.71/31.92/32.91) lat (ms,95%): 158.63 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 1 tps: 15.01 qps: 169.09 (r/w/o: 109.06/30.02/30.02) lat (ms,95%): 137.35 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 1 tps: 26.00 qps: 285.00 (r/w/o: 181.00/52.00/52.00) lat (ms,95%): 108.68 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 15.00 qps: 170.00 (r/w/o: 108.00/32.00/30.00) lat (ms,95%): 164.45 err/s: 0.00 reconn/s: 0.00

And there we have it! Custom queries specific to our application and dataset. Most of the sysbench parameters are self-explanatory, but let me talk about –report-interval=1 which shows statistics every 1 second. Normally sysbench does not output stats until the end of the run, however, the example execution will run forever (–events=0 –time=0) so we need stats to show all the time. You can adjust the parameters to your liking. For instance, if you only want to run a test for 5 minutes, set –events=0 –run-time=300.

Conclusion

Sysbench is a very well designed application that allows you to load-test your MySQL instances using pre-defined and custom queries. Using the Lua scripting language, you can create just about any scenario to fit your needs. The above is just one example that we use within Percona’s Training and Education department. It is by no means an exhaustive example of all of the capabilities of sysbench-Lua.


Photo by Lachlan Donald on Unsplash

Mar
20
2019
--

MongoDB on ARM Processors

reads updates transactions per hour per $

ARM processors have been around for a while. In mid-2015/2016 there were a couple of attempts by the community to port MongoDB to work with this architecture. At the time, the main storage engine was MMAP and most of the available ARM boards were 32-bits. Overall, the port worked, but the fact is having MongoDB running on a Raspberry Pi was more a hack than a setup. The public cloud providers didn’t yet offer machines running with these processors.

The ARM processors are power-efficient and, for this reason, they are used in smartphones, smart devices and, now, even laptops. It was just a matter of time to have them available in the cloud as well. Now that AWS is offering ARM-based instances you might be thinking: “Hmmm, these instances include the same amount of cores and memory compared to the traditional x86-based offers, but cost a fraction of the price!”.

But do they perform alike?

In this blog, we selected three different AWS instances to compare: one powered by  an ARM processor, the second one backed by a traditional x86_64 Intel processor with the same number of cores and memory as the ARM instance, and finally another Intel-backed instance that costs roughly the same as the ARM instance but carries half as many cores. We acknowledge these processors are not supposed to be “equivalent”, and we do not intend to go deeper in CPU architecture in this blog. Our goal is purely to check how the ARM-backed instance fares in comparison to the Intel-based ones.

These are the instances we will consider in this blog post.

Methodology

We will use the Yahoo Cloud Serving Benchmark (YCSB, https://github.com/brianfrankcooper/YCSB) running on a dedicated instance (c5d.4xlarge) to simulate load in three distinct tests:

  1. a load of 1 billion documents in one collection having only the primary key (which we’ll call Inserts).
  2. a workload comprised of exclusively reads (Reads)
  3. a workload comprised of a mix of 75% reads with 5% scans plus 25% updates (Reads/Updates)

We will run each test with a varying number of concurrent threads (32, 64, and 128), repeating each set three times and keeping only the second-best result.

All instances will run the same MongoDB version (4.0.3, installed from a tarball and running with default settings) and operating system, Ubuntu 16.04. We chose this setup because MongoDB offer includes an ARM version for Ubuntu-based machines.

All the instances will be configured with:

  • 100 GB EBS with 5000 PIOPS and 20 GB EBS boot device
  • Data volume formatted with XFS, 4k blocks
  • Default swappiness and disk scheduler
  • Default kernel parameters
  • Enhanced cloud watch configured
  • Free monitoring tier enabled

Preparing the environment

We start with the setup of the benchmark software we will use for the test, YCSB. The first task was to spin up a powerful machine (c5d.4xlarge) to run the software and then prepare the environment:

The YCSB program requires Java, Maven, Python, and pymongo which doesn’t come by default in our Linux version – Ubuntu server x86. Here are the steps we used to configure our environment:

Installing Java

sudo apt-get install java-devel

Installing Maven

wget http://ftp.heanet.ie/mirrors/www.apache.org/dist/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
sudo tar xzf apache-maven-*-bin.tar.gz -C /usr/local
cd /usr/local
sudo ln -s apache-maven-* maven
sudo vi /etc/profile.d/maven.sh

Add the following to maven.sh

export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Installing Python 2.7

sudo apt-get install python2.7

Installing pip to resolve the pymongo dependency

sudo apt-get install python-pip

Installing pymongo (driver)

sudo pip install pymongo

Installing YCSB

curl -O --location https://github.com/brianfrankcooper/YCSB/releases/download/0.5.0/ycsb-0.5.0.tar.gz
tar xfvz ycsb-0.5.0.tar.gz
cd ycsb-0.5.0

YCSB comes with different workloads, and also allows for the customization of a workload to match our own requirements. If you want to learn more about the workloads have a look at https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workload_template

First, we will edit the workloads/workloada file to perform 1 billion inserts (for our first test) while also preparing it to later perform only reads (for our second test):

recordcount=1000000
operationcount=1000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=1
updateproportion=0.0

We will then change the workloads/workloadb file so as to provide a mixed workload for our third test.  We also set it to perform 1 billion reads, but we break it down into 70% of read queries and 30% of updates with a scan ratio of 5%, while also placing a cap on the maximum number of scanned documents (2000) in an effort to emulate real traffic – workloads are not perfect, right?

recordcount=10000000
operationcount=10000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.7
updateproportion=0.25
scanproportion=0.05
insertproportion=0
maxscanlength=2000

With that, we have the environment configured for testing.

Running the tests

With all instances configured and ready, we run the stress test against our MongoDB servers using the following command :

./bin/ycsb [load/run] mongodb -s -P workloads/workload[ab] -threads [32/64/128] \
 -p mongodb.url=mongodb://xxx.xxx.xxx.xxx.:27017/ycsb0000[0-9] \
 -jvm-args="-Dlogback.configurationFile=disablelogs.xml"

The parameters between brackets varied according to the instance and operation being executed:

  • [load/run] load means insert data while run means perform action (update/read)
  • workload[a/b] reference the different workloads we’ve used
  • [32/64/128] indicate the number of concurrent threads being used for the test
  • ycsb0000[0-9] is the database name we’ve used for the tests (for reference only)

Results

Without further ado, the table below summarizes the results for our tests:

 

 

 

Performance cost

Considering throughput alone – and in the context of those tests, particularly the last one – you may get more performance for the same cost. That’s certainly not always the case, which our results above also demonstrate. And, as usual, it depends on “how much performance do you need” – a matter that is even more pertinent in the cloud. With that in mind, we had another look at our data under the “performance cost” lens.

As we saw above, the c5.4xlarge instance performed better than the other two instances for a little over 50% more (in terms of cost). Did it deliver 50% more (performance) as well? Well, sometimes it did even more than that, but not always. We used the following formula to extrapolate the OPS (Operations Per Second) data we’ve got from our tests into OPH (Operations Per Hour), so we could them calculate how much bang (operations) for the buck (US$1) each instance was able to provide:

transactions/hour/US$1 = (OPS * 3600) / instance cost per hour

This is, of course, an artificial metric that aims to correlate performance and cost. For this reason, instead of plotting the raw values, we have normalized the results using the best performer instance as baseline(100%):

 

 

The intent behind these was only to demonstrate another way to evaluate how much we’re getting for what we’re paying. Of course, you need to have a clear understanding of your own requirements in order to make a balanced decision.

Parting thoughts

We hope this post awakens your curiosity not only about how MongoDB may perform on ARM-based servers, but also by demonstrating another way you can perform your own tests with the YCSB benchmark. Feel free to reach out to us through the comments section below if you have any suggestions, questions, or other observations to make about the work we presented here.

Mar
06
2019
--

Settling the Myth of Transparent HugePages for Databases

The concept of Linux HugePages has existed for quite a while: for more than 10 years, introduced to Debian in 2007 with kernel version 2.6.23. Whilst a smaller page size is useful for general use, some memory intensive applications may gain performance by using bigger memory pages. By having bigger memory chunks available to them, they can reduce lookup time as well as improve the performance of read/write operations. To be able to make use of HugePages, applications need to carry the specific code directive, and changing applications across the board is not necessarily a simple task. So enter Transparent HugePages (THP).

By reputation, THPs are said to have a negative impact on performance. For this post, I set out to either prove or debunk the case for the use of THPs for database applications.

The Linux context

On Linux – and for that matter all operating systems that I know of – memory is divided into small chunks called pages. A typical memory page size is set to 4k. You can obtain the value of page size on Linux using getconf.

# getconf PAGE_SIZE
4096

Generally, the latest processors support multiple page sizes. However, Linux defaults to a minimal 4k page size. For a system with 64GB physical memory, this memory will be divided into more than 16 million pages. Linking between these pages and physical memory (which is called page table walking) is undertaken by the CPU’s memory management unit (MMU). To optimize page lookup, CPU maintains a cache of recently used pages called the Table Lookaside Buffer (TLB). The higher the number of pages, the lower the percentage of pages that are maintained in TLB. This translates to a higher cache miss ratio. With every cache miss, a more expensive search must be done via page table walking. In effect, that leads to a degradation in performance.

So what if we could increase the page size? We could then reduce the number of pages accessed, and reduce the cost of page walking. Cache hit ratio might then improve because more relevant data now fits in one page rather than multiple pages.

The Linux kernel will always try to allocate a HugePage (if enabled) and will fall back to the default 4K if a contiguous chunk of the required memory size is not available in the required memory space.

The implication for applications

As mentioned, for an application to make use of HugePages it has to contain an explicit instruction to do so. It’s not always practical to change applications in this way so there’s another option.

Transparent HugePages provides a layer within the Linux kernel – probably since version 2.6.38 – which if enabled can potentially allocate HugePages for applications without them actually “knowing” it; hence the transparency. The expectation is that this will improve application performance.

In this blog, I’ll attempt to find the reasons why THP might help improve database performance. There’s a lot of discussion amongst database experts that classic HugePages give a performance gain, but you’ll see a performance hit with Transparent HugePages. I decided to take up the challenge and perform various benchmarks, with different settings, and with different workloads.

So do Transparent HugePages (THP) improve application performance? More specifically, do they improve performance for database workloads? Most industry standard databases recommend disabling THP and enabling HugePages alone.

So is this a myth or does THP degrade performance for databases? Time to break this myth.

Enabling THP

The current setting can be seen using the command line

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Temporary Change

It can be enabled or disabled using the command line.

# echo never > /sys/kernel/mm/transparent_hugepage/enabled

Permanent Change via grub

Or by setting grub parameter  in 

/etc/default/grub

 . 

You can choose one of the three configurations for THP; enable, disable, or “madvise”. Whilst enable and disable options are self-explanatory, madvise allows applications that are optimized for HugePages to use THP.  Applications can use Transparent HugePages by making the madvise system call.

Why was the madvise option added? We will discuss that in a later section.

Transparent HugePages problems

The khugepaged CPU usage

The allocation of a HugePage can be tricky. Whilst traditional HugePages are reserved in virtual memory, THPs are not. In the background, the kernel attempts to allocate a THP, and if it fails, will default to the standard 4k page. This all happens transparently to the user.

The allocation process can potentially involve a number of kernel processes which may include kswapd, defrag, and kcompactd. All of these are responsible for making space in the virtual memory for a future THP. When required, the allocation is made by another kernel process; khugepaged. This process manages Transparent HugePages.

Spikes

It depends on how khugepaged is configured, but since no memory is reserved beforehand, there is potential for performance degradation. With every attempt to allocate a HugePage, potentially a number of kernel processes are invoked. These carry out certain actions to make enough room in the virtual memory for a THP allocation. Although no notifications are provided to the application, precious resources are spent, and this can lead to spikes in performance with any dips indicating an attempt to allocate THP.

Memory Bloating

HugePages are for not for every application. For example, an application that wants to allocate only one byte of data would be better off using a 4k page rather than a huge one. That way, memory is more efficiently used. To prevent this, one option is to configure THP to “madvise”. By doing this, HugePages are disabled system-wide but are available to applications that make a madvise call to allocate THP in the madvise memory region.

Swapping

Linux kernel keeps track of memory pages and differentiates between pages are that are actively being used and the ones that are not immediately required. It may load or unload a page from active memory to disk if that page is no longer required or vice versa.

When page size is 4k, these memory operations are understandably fast. However, consider a 1GB page size: there will a significant performance hit when such a page is swapped out. When a THP is swapped out, it is split in standard page sizes. Unlike conventional HugePages which are reserved in RAM and are never swapped, THPs are swappable pages. They could, therefore, potentially be swapped causing a dip in performance. Although in recent years, there have been loads of performance improvements around swapping out the THPs process, it still does impact performance negatively.

Benchmark

I decided to benchmark with and without Transparent HugePages enabled. Initially, I used pgbench – a PostgreSQL benchmarking tool based on TPCB – for a duration of ten minutes. The benchmark used a mixed mode of READ/WRITE. The results with and without the Transparent HugePages show no degradation or improvement in the benchmark. To be sure, I repeated the same benchmark for 60 minutes and got almost the same results.  I performed another benchmark with a TPCC workload using the sysbench benchmarking tool. The results are almost the same.

Benchmark Machine

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic
  • PostgreSQL: version 11

Benchmark TPCB (pgbench) – 10 Minute duration

The following graphs show results for two different database sizes; 48GB and 112GB with 64, 128 and 256 clients each. All other settings were kept unchanged for these benchmarks to ensure that our results are comparable. It is evident that both lines — representing execution with or without THP — are almost overlapping one another. This suggests no performance gains.

Figure 1.1 PostgreSQL' s Benchmark, 10 minutes execution time where database workload(48GB) < shared_buffer (64GB)

Figure 1.1 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload(48GB) < shared_buffer (64GB)

 

Figure 1.2 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) > shared_buffer (64GB)

Figure 1.2 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) > shared_buffer (64GB)

 

Figure 1.3 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 1.3 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB) -dTLB-misses

 

Figure 1.4 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 1.4 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)-dTLB-misses

 

Benchmark TPCB (pgbench) – 60 Minute duration

Figure 2.1 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 2.1 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)

 

Figure 2.2 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (112GB) &gt; shared_buffer (64GB)

Figure 2.2 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB)

 

Figure 2.3 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 2.3 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB) -dTLB-misses

 

Figure 2.4 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 2.4 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB) -dTLB-misses

 

Benchmark TPCC (sysbecnch) – 10 Minute duration

Figure 3.1 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) &lt; shared_buffer (64GB)

Figure 3.1 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 3.2 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (112GB) &gt; shared_buffer (64GB)

Figure 3.2 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)

 

Figure 3.3 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 3.3 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB) -dTLB-misses

 

Figure 3.4 PostgreSQL' s Benchmark, 10 minutes execution time where database workload 112GB) > shared_buffer (64GB)

Figure 3.4 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload 112GB) > shared_buffer (64GB) -dTLB-misses

 

Conclusion

I attained these results by running different benchmarking tools and evaluating different OLTP benchmarking standards. The results clearly indicate that for these workloads, THP has a negative impact on the overall database performance. Although the performance degradation is negligible, it is, however, clear that there is no performance gain as one might expect. This is very much in line with all the different databases’ recommendation which suggests disabling the THP.

THP may be beneficial for various applications, but it certainly doesn’t give any performance gains when handling an OLTP workload.

We can safely say that the “myth” is derived from experience and that the rumors are true.

Summary

  • The complete benchmark data is available at GitHub[1]
  • The complete “nmon” reports, which include CPU, memory etc usage can be found at GitHub[2]
  • This whole benchmark is based around OLTP. Watch out for the OLAP benchmark. Maybe THP will have more effect on this type of workload.

[1] – https://github.com/Percona-Lab-results/THP-POSTGRESQL-2019/blob/master/results.xlsx

[2] – https://github.com/Percona-Lab-results/THP-POSTGRESQL-2019/tree/master/results

 

 

Feb
25
2019
--

MySQL Challenge: 100k Connections

thread pools MySQL 100k connections

In this post, I want to explore a way to establish 100,000 connections to MySQL. Not just idle connections, but executing queries.

100,000 connections. Is that really needed for MySQL, you may ask? Although it may seem excessive, I have seen a lot of different setups in customer deployments. Some deploy an application connection pool, with 100 application servers and 1,000 connections in each pool. Some applications use a “re-connect and repeat if the query is too slow” technique, which is a terrible practice. It can lead to a snowball effect, and could establish thousands of connections to MySQL in a matter of seconds.

So now I want to set an overachieving goal and see if we can achieve it.

Setup

For this I will use the following hardware:

Bare metal server provided by packet.net, instance size: c2.medium.x86
Physical Cores @ 2.2 GHz
(1 X AMD EPYC 7401P)
Memory: 64 GB of ECC RAM
Storage : INTEL® SSD DC S4500, 480GB

This is a server grade SATA SSD.

I will use five of these boxes, for the reason explained below. One box for the MySQL server and four boxes for client connections.

For the server I will use Percona  Server for MySQL 8.0.13-4 with the thread pool plugin. The plugin will be required to support the thousands of connections.

Initial server setup

Network settings (Ansible format):

- { name: 'net.core.somaxconn', value: 32768 }
- { name: 'net.core.rmem_max', value: 134217728 }
- { name: 'net.core.wmem_max', value: 134217728 }
- { name: 'net.ipv4.tcp_rmem', value: '4096 87380 134217728' }
- { name: 'net.ipv4.tcp_wmem', value: '4096 87380 134217728' }
- { name: 'net.core.netdev_max_backlog', value: 300000 }
- { name: 'net.ipv4.tcp_moderate_rcvbuf', value: 1 }
- { name: 'net.ipv4.tcp_no_metrics_save', value: 1 }
- { name: 'net.ipv4.tcp_congestion_control', value: 'htcp' }
- { name: 'net.ipv4.tcp_mtu_probing', value: 1 }
- { name: 'net.ipv4.tcp_timestamps', value: 0 }
- { name: 'net.ipv4.tcp_sack', value: 0 }
- { name: 'net.ipv4.tcp_syncookies', value: 1 }
- { name: 'net.ipv4.tcp_max_syn_backlog', value: 4096 }
- { name: 'net.ipv4.tcp_mem', value: '50576   64768 98152' }
- { name: 'net.ipv4.ip_local_port_range', value: '4000 65000' }
- { name: 'net.ipv4.netdev_max_backlog', value: 2500 }
- { name: 'net.ipv4.tcp_tw_reuse', value: 1 }
- { name: 'net.ipv4.tcp_fin_timeout', value: 5 }

These are the typical settings recommended for 10Gb networks and high concurrent workloads.

Limits settings for systemd:

[Service]
LimitNOFILE=1000000
LimitNPROC=500000

And the relevant setting for MySQL in my.cnf:

back_log=3500
max_connections=110000

For the client I will use sysbench version 0.5 and not 1.0.x, for the reasons explained below.

The workload is

sysbench --test=sysbench/tests/db/select.lua --mysql-host=139.178.82.47 --mysql-user=sbtest --mysql-password=sbtest --oltp-tables-count=10 --report-interval=1 --num-threads=10000 --max-time=300 --max-requests=0 --oltp-table-size=10000000 --rand-type=uniform --rand-init=on run

Step 1. 10,000 connections

This one is very easy, as there is not much to do to achieve this. We can do this with only one client. But you may face the following error on the client side:

FATAL: error 2004: Can't create TCP/IP socket (24)

This is caused by the open file limit, which is also a limit of TCP/IP sockets. This can be fixed by setting  

ulimit -n 100000

  on the client.

The performance we observe:

[  26s] threads: 10000, tps: 0.00, reads: 33367.48, writes: 0.00, response time: 3681.42ms (95%), errors: 0.00, reconnects:  0.00
[  27s] threads: 10000, tps: 0.00, reads: 33289.74, writes: 0.00, response time: 3690.25ms (95%), errors: 0.00, reconnects:  0.00

Step 2. 25,000 connections

With 25,000 connections, we hit an error on MySQL side:

Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug

If you try to lookup information on this error you might find the following article:  https://www.percona.com/blog/2013/02/04/cant_create_thread_errno_11/

But it does not help in our case, as we have all limits set high enough:

cat /proc/`pidof mysqld`/limits
Limit                     Soft Limit Hard Limit           Units
Max cpu time              unlimited  unlimited            seconds
Max file size             unlimited  unlimited            bytes
Max data size             unlimited  unlimited            bytes
Max stack size            8388608    unlimited            bytes
Max core file size        0          unlimited            bytes
Max resident set          unlimited  unlimited            bytes
Max processes             500000     500000               processes
Max open files            1000000    1000000              files
Max locked memory         16777216   16777216             bytes
Max address space         unlimited  unlimited            bytes
Max file locks            unlimited  unlimited            locks
Max pending signals       255051     255051               signals
Max msgqueue size         819200     819200               bytes
Max nice priority         0          0
Max realtime priority     0          0
Max realtime timeout      unlimited unlimited            us

This is where we start using the thread pool feature:  https://www.percona.com/doc/percona-server/8.0/performance/threadpool.html

Add:

thread_handling=pool-of-threads

to the my.cnf and restart Percona Server

The results:

[   7s] threads: 25000, tps: 0.00, reads: 33332.57, writes: 0.00, response time: 974.56ms (95%), errors: 0.00, reconnects:  0.00
[   8s] threads: 25000, tps: 0.00, reads: 33187.01, writes: 0.00, response time: 979.24ms (95%), errors: 0.00, reconnects:  0.00

We have the same throughput, but actually the 95% response time has improved (thanks to the thread pool) from 3690 ms to 979 ms.

Step 3. 50,000 connections

This is where we encountered the biggest challenge. At first, trying to get 50,000 connections in sysbench we hit the following error:

FATAL: error 2003: Can't connect to MySQL server on '139.178.82.47' (99)

Error (99) is cryptic and it means: Cannot assign requested address.

It comes from the limit of ports an application can open. By default on my system it is

cat /proc/sys/net/ipv4/ip_local_port_range : 32768   60999

This says there are only 28,231 available ports — 60999 minus 32768 — or the limit of TCP connections you can establish from or to the given IP address.

You can extend this using a wider range, on both the client and the server:

echo 4000 65000 > /proc/sys/net/ipv4/ip_local_port_range

This will give us 61,000 connections, but this is very close to the limit for one IP address (maximal port is 65535). The key takeaway from here is that if we want more connections we need to allocate more IP addresses for MySQL server. In order to achieve 100,000 connections, I will use two IP addresses on the server running MySQL.

After sorting out the port ranges, we hit the following problem with sysbench:

sysbench 0.5:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 50000
FATAL: pthread_create() for thread #32352 failed. errno = 12 (Cannot allocate memory)

In this case, it’s a problem with sysbench memory allocation (namely lua subsystem). Sysbench can allocate memory for only 32,351 connections. This is a problem which is even more severe in sysbench 1.0.x.

Sysbench 1.0.x limitation

Sysbench 1.0.x uses a different Lua JIT, which hits memory problems even with 4000 connections, so it is impossible to go over 4000 connection in sysbench 1.0.x

So it seems we hit a limit with sysbench sooner than with Percona Server. In order to use more connections, we need to use multiple sysbench clients, and if 32,351 connections is the limit for sysbench, we have to use at least four sysbench clients to get up to 100,000 connections.

For 50,000 connections I will use 2 servers (each running separate sysbench), each running 25,000 threads from sysbench.

The results for each sysbench looks like:

[  29s] threads: 25000, tps: 0.00, reads: 16794.09, writes: 0.00, response time: 1799.63ms (95%), errors: 0.00, reconnects:  0.00
[  30s] threads: 25000, tps: 0.00, reads: 16491.03, writes: 0.00, response time: 1800.70ms (95%), errors: 0.00, reconnects:  0.00

So we have about the same throughput (16794*2 = 33588 tps in total), however the 95% response time doubled. This is to be expected as we are using twice as many connections compared to the 25,000 connections benchmark.

Step 3. 75,000 connections

To achieve 75,000 connections we will use three servers with sysbench, each running 25,000 threads.

The results for each sysbench:

[ 157s] threads: 25000, tps: 0.00, reads: 11633.87, writes: 0.00, response time: 2651.76ms (95%), errors: 0.00, reconnects:  0.00
[ 158s] threads: 25000, tps: 0.00, reads: 10783.09, writes: 0.00, response time: 2601.44ms (95%), errors: 0.00, reconnects:  0.00

Step 4. 100,000 connections

There is nothing eventful to achieve75k and 100k connections. We just spin up an additional server and start sysbench. For 100,000 connections we need four servers for sysbench, each shows:

[ 101s] threads: 25000, tps: 0.00, reads: 8033.83, writes: 0.00, response time: 3320.21ms (95%), errors: 0.00, reconnects:  0.00
[ 102s] threads: 25000, tps: 0.00, reads: 8065.02, writes: 0.00, response time: 3405.77ms (95%), errors: 0.00, reconnects:  0.00

So we have the same throughput (8065*4=32260 tps in total) with 3405ms 95% response time.

A very important takeaway from this: with 100k connections and using a thread pool, the 95% response time is even better than for 10k connections without a thread pool. The thread pool allows Percona Server to manage resources more efficiently and provides better response times.

Conclusions

100k connections is quite achievable for MySQL, and I am sure we could go even further. There are three components to achieve this:

  • Thread pool in Percona Server
  • Proper tuning of network limits
  • Using multiple IP addresses on the server box (one IP address per approximately 60k connections)

Appendix: full my.cnf

[mysqld]
datadir {{ mysqldir }}
ssl=0
skip-log-bin
log-error=error.log
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
character_set_server=latin1
collation_server=latin1_swedish_ci
skip-character-set-client-handshake
innodb_undo_log_truncate=off
# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=110000
# files
innodb_file_per_table
innodb_log_file_size=15G
innodb_log_files_in_group=2
innodb_open_files=4000
# buffers
innodb_buffer_pool_size= 40G
innodb_buffer_pool_instances=8
innodb_log_buffer_size=64M
# tune
innodb_doublewrite= 1
innodb_thread_concurrency=0
innodb_flush_log_at_trx_commit= 0
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=10
innodb_lru_scan_depth=2048
innodb_page_cleaners=4
join_buffer_size=256K
sort_buffer_size=256K
innodb_use_native_aio=1
innodb_stats_persistent = 1
#innodb_spin_wait_delay=96
innodb_adaptive_flushing = 1
innodb_flush_neighbors = 0
innodb_read_io_threads = 16
innodb_write_io_threads = 16
innodb_io_capacity=1500
innodb_io_capacity_max=2500
innodb_purge_threads=4
innodb_adaptive_hash_index=0
max_prepared_stmt_count=1000000
innodb_monitor_enable = '%'
performance_schema = ON

Feb
22
2019
--

Measuring Percona Server for MySQL On-Disk Decryption Overhead

benchmark heavy IO percona server for mysql 8 encryption

Percona Server for MySQL 8.0 comes with enterprise grade total data encryption features. However, there is always the question of how much overhead – or performance penalty – comes with the data decryption. As we saw in my networking performance post, SSL under high concurrency might be problematic. Is this the case for data decryption?

To measure any overhead, I will start with a simplified read-only workload, where data gets decrypted during read IO.

MySQL decryption schematic

During query execution, the data in memory is already decrypted so there is no additional processing time. The decryption happens only for blocks that require a read from storage.

For the benchmark I will use the following workload:

sysbench oltp_read_only --mysql-ssl=off --tables=20 --table-size=10000000 --threads=$i --time=300 --report-interval=1 --rand-type=uniform run

The datasize for this workload is about 50GB, so I will use

innodb_buffer_pool_size = 5GB

  to emulate a heavy disk read IO during the benchmark. In the second run, I will use

innodb_buffer_pool_size = 60GB

  so all data is kept in memory and there are NO disk read IO operations.

I will only use table-level encryption at this time (ie: no encryption for binary log, system tablespace, redo-  and undo- logs).

The server I am using has AES hardware CPU acceleration. Read more at https://en.wikipedia.org/wiki/AES_instruction_set

Benchmark N1, heavy read IO

benchmark heavy IO percona server for mysql 8 encryption

Threads encrypted storage no encryption encryption overhead
1 389.11 423.47 1.09
4 1531.48 1673.2 1.09
16 5583.04 6055 1.08
32 8250.61 8479.61 1.03
64 8558.6 8574.43 1.00
96 8571.55 8577.9 1.00
128 8570.5 8580.68 1.00
256 8576.34 8585 1.00
512 8573.15 8573.73 1.00
1024 8570.2 8562.82 1.00
2048 8422.24 8286.65 0.98

Benchmark N2, data in memory, no read IO

benchmark data in memory percona server for mysql 8 encryption

Threads Encryption No encryption
1 578.91 567.65
4 2289.13 2275.12
16 8304.1 8584.06
32 13324.02 13513.39
64 20007.22 19821.48
96 19613.82 19587.56
128 19254.68 19307.82
256 18694.05 18693.93
512 18431.97 18372.13
1024 18571.43 18453.69
2048 18509.73 18332.59

Observations

For a high number of threads, there is no measurable difference between encrypted and unencrypted storage. This is because a lot of CPU resources are spent in contention and waits, so the relative time spend in decryption is negligible.

However, we can see some performance penalty for a low number of threads: up to 9% penalty for hardware decryption. When data fully fits into memory, there is no measurable difference between encrypted and unencrypted storage.

So if you have hardware support then you should see little impact when using storage encryption with MySQL. The easiest way to check if you have support for this is to look at CPU flags and search for ‘aes’ string:

> lscpu | grep aes Flags: ... tsc_deadline_timer aes xsave avx f16c ...

Feb
21
2019
--

MySQL 8 is not always faster than MySQL 5.7

mysql 8 slower than mysql 5.7 sysbench

MySQL 8.0.15 performs worse in sysbench oltp_read_write than MySQL 5.7.25

Initially I was testing group replication performance and was puzzled why MySQL 8.0.15 performs consistently worse than MySQL 5.7.25.

It appears that a single server instance is affected by a performance degradation.

My testing setup

mysql 8 slower than mysql 5.7 sysbenchHardware details:
Bare metal server provided by packet.net, instance size: c2.medium.x86
24 Physical Cores @ 2.2 GHz
(1 X AMD EPYC 7401P)
Memory: 64 GB of ECC RAM

Storage : INTEL® SSD DC S4500, 480GB

This is a server grade SATA SSD.

Benchmark

sysbench oltp_read_write --report-interval=1 --time=1800 --threads=24 --tables=10 --table-size=10000000 --mysql-user=root --mysql-socket=/tmp/mysql.sock run

In the following summary I used these combinations:

  • innodb_flush_log_at_trx_commit=0 or 1
  • Binlog: off or on
  • sync_binlog=1000 or sync_binlog=1

The summary table, the number are transactions per second (tps – the more the better)

+-------------------------------------------+--------------+--------------+-------+
| case                                      | MySQL 5.7.25 | MySQL 8.0.15 | ratio |
+-------------------------------------------+--------------+--------------+-------+
| trx_commit=0, binlog=off                  | 11402 tps    | 9840(*)      | 1.16  |
+-------------------------------------------+--------------+--------------+-------+
| trx_commit=1, binlog=off                  | 8375         | 7974         | 1.05  |
+-------------------------------------------+--------------+--------------+-------+
| trx_commit=0, binlog=on, sync_binlog=1000 | 10862        | 8871         | 1.22  |
+-------------------------------------------+--------------+--------------+-------+
| trx_commit=0, binlog=on, sync_binlog=1    | 7238         | 6459         | 1.12  |
+-------------------------------------------+--------------+--------------+-------+
| trx_commit=1, binlog=on, sync_binlog=1    | 5970         | 5043         | 1.18  |
+-------------------------------------------+--------------+--------------+-------+

Summary: MySQL 8.0.15 is persistently worse than MySQL 5.7.25.

In the worst case with

trx_commit=0

  and

sync_binlog=1000

 , it is worse by 22%, which is huge.

I was looking to use these settings for group replication testing, but these settings, when used with MySQL 8.0.15, provide much worse results than I had with MySQL 5.7.25

(*)  in the case of trx_commit=0, binlog=off, MySQL 5.7.25 performance is very stable, and practically stays at the 11400 tps level. MySQL 8.0.15 varies a lot from 8758 tps to 10299 tps in 1 second resolution measurements

Appendix:

[mysqld]
datadir= /mnt/data/mysql
socket=/tmp/mysql.sock
ssl=0
#innodb-encrypt-tables=ON
#skip-log-bin
log-error=error.log
log_bin = binlog
relay_log=relay
sync_binlog=1000
binlog_format = ROW
binlog_row_image=MINIMAL
server-id=1
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Recommended in standard MySQL setup
# general
 table_open_cache = 200000
 table_open_cache_instances=64
 back_log=3500
 max_connections=4000
# files
 innodb_file_per_table
 innodb_log_file_size=15G
 innodb_log_files_in_group=2
 innodb_open_files=4000
# buffers
 innodb_buffer_pool_size= 40G
 innodb_buffer_pool_instances=8
 innodb_log_buffer_size=64M
# tune
 innodb_doublewrite= 1
 innodb_thread_concurrency=0
 innodb_flush_log_at_trx_commit= 0
 innodb_flush_method=O_DIRECT_NO_FSYNC
 innodb_max_dirty_pages_pct=90
 innodb_max_dirty_pages_pct_lwm=10
 innodb_lru_scan_depth=2048
 innodb_page_cleaners=4
 join_buffer_size=256K
 sort_buffer_size=256K
 innodb_use_native_aio=1
 innodb_stats_persistent = 1
 #innodb_spin_wait_delay=96
# perf special
 innodb_adaptive_flushing = 1
 innodb_flush_neighbors = 0
 innodb_read_io_threads = 16
 innodb_write_io_threads = 16
 innodb_io_capacity=1500
 innodb_io_capacity_max=2500
 innodb_purge_threads=4
 innodb_adaptive_hash_index=0
max_prepared_stmt_count=1000000


Photo by Suzy Hazelwood from Pexels

 

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com