Jan
17
2022
--

Comparing Graviton (ARM) Performance to Intel and AMD for MySQL (Part 3)

Comparing Graviton (ARM) Performance to Intel and AMD for MySQL

Comparing Graviton (ARM) Performance to Intel and AMD for MySQLRecently we published the first part (m5, m5a, m6g) and the second part (C5, C5a, C6g) of research regarding comparing Graviton ARM with AMD and Intel CPU on AWS. We selected general-purpose EC2 instances with the same configurations (amount of vCPU in the first part). In the second part, we compared compute-optimized EC2 instances with the same conditions. The main goal was to see the trend and make a general comparison of CPU types on the AWS platform only for MySQL. We didn’t set the goal to compare the performance of different CPU types. Our expertise is in MySQL performance tuning. We share research “as is” with all scripts, and anyone interested could rerun and reproduce it.
All scripts, raw logs and additional plots are available on GitHub: 

(arm_cpu_comparison_m5, csv_file_with_all_data_m5,

arm_cpu_comparison_c5csv_file_with_all_data_c5

arm_cpu_comparison_m6, csv_file_with_all_data_m6). 

We were happy to see the reactions from our Percona Blog readers to our research. And we are open to any feedback. If anyone has any ideas on updating our methodology, we would be happy to correct it. 

This post is a continuation of research based on our interest in general-purpose EC2 (and, of course, because we saw that our audience wanted to see it). The main inspiration for this research was the feedback of our readers that we compared different generations of instances, especially old AMD instances (m5a.*), and compared it with the latest Graviton instances (m6g.*).  Additionally, we also decided to use the latest Intels instances (m6i.*) too.

Today, we will talk about (AWS) the latest general-purpose EC2: M6i, M6a, M6g (complete list in appendix). 

Short Conclusion:

  1. In most cases for m6i, m6g, and m6a instances, Intel shows better performance in throughput for MySQL read transactions. However, AMD instances are pretty close to Intel’s results.
  2. Sometimes Intel could show a significant advantage — more than almost 200k rps (almost 45% better) than Graviton. However, AMD’s gap wasn’t as significant as in previous results.
    Unfortunately, we compared Graviton with others. So we didn’t concentrate on comparing AMD with Intel. 
  3. If we could say in a few words: m6i instances (with Intel)  are better in their class than other m6a, m6g instances (in performance for MySql). And this advantage starts from 5%-10% and could be up to 45% compared with other CPUs.
  4. But Gravitons instances are still cheaper

Details, or How We Got Our Short Conclusion:

Disclaimer:

  1. Tests were run  on M6i.* (Intel) , M6a.* (AMD),  M6g.*(Graviton) EC2 instances in the US-EAST-1 region. (List of EC2 see in the appendix). It was selected using only the same class of instances without additional upgrades. The main goal is to take the same instances with only differences in CPU types and identify their performance for MySQL.
  2. Monitoring was done with Percona Monitoring and Management (PMM).
  3. OS: Ubuntu 20.04 LTS 
  4. Load tool (sysbench) and target DB (MySQL) installed on the same EC2 instance.
  5. Oracle MySQL Community Server — 8.0.26-0 — installed from official packages (it was installed from Ubuntu repositories).
  6. Load tool: sysbench —  1.0.18
  7. innodb_buffer_pool_size=80% of available RAM
  8. Test duration is five minutes for each thread and then 90 seconds cool down before the next iteration. 
  9. Tests were run four times independently (to smooth outliers / to have more reproducible results). Then results were averaged for graphs. Also, graphs show min and max values that were during the test, which shows the range of variance. 
  10. We are going to use the “high-concurrency” scenario definition for scenarios when the number of threads would be bigger than the number of vCPU. And “low-concurrent” scenario definition with scenarios where the number of threads would be less or equal to a number of vCPU on EC2.
  11. We are comparing MySQL behavior on the same class of EC2, not CPU performance.
  12. We got some feedback regarding our methodology, and we would update it in the next iteration, with a different configuration, but for this particular research we leave previous to have possibility compare “apples to apples”.
  13. The post is not sponsored by any external company. It was produced using only Percona resources. We do not control what AWS uses as CPU in their instances, we only operate with what they offer. 

Test Case:

Prerequisite:

To use only CPU (without disk and network) we decided to use only read queries from memory. To do this we did the following actions. 

1. Create DB with 10 tables with 10 000 000 rows each table

sysbench oltp_read_only --threads=10 --mysql-user=sbtest --mysql-password=sbtest --table-size=10000000 --tables=10 --db-driver=mysql --mysql-db=sbtest prepare

2. Load all data to LOAD_buffer 

sysbench oltp_read_only --time=300 --threads=10 --table-size=1000000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

Test:

3. Run in a loop for same scenario but  different concurrency THREAD (1,2,4,8,16,32,64,128) on each EC2 

sysbench oltp_read_only --time=300 --threads=${THREAD} --table-size=100000 --mysql-user=sbtest --mysql-password=sbtest --db-driver=mysql --mysql-db=sbtest run

Results:

Result reviewing was split into four parts:

  1. For “small” EC2 with 2, 4, and 8 vCPU
  2. For “medium” EC2 with 16  and 32 vCPU
  3. For  “large” EC2 with 48 and 64 vCPU
  4. For all scenarios to see the overall picture.

There would be four graphs for each test:

  1. Throughput (queries per second) that EC2 could perform for each scenario (number of threads)
  2. Latency 95 percentile that  EC2 could perform for each scenario, (number of threads)
  3. Relative comparing Graviton and Intel, Graviton, and AMD
  4. Absolute comparing Graviton and Intel, Graviton and AMD

Validation that all load goes to the CPU, not to DISK I/O or network, was done also using PMM (Percona Monitoring and Management). 

perona monitoring and management

pic 0.1. OS monitoring during all test stages

 

Result for EC2 with 2, 4, and 8 vCPU:

Result for EC2 with 2, 4, and 8 vCPU

Plot 1.1.  Throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.2.  Latencies (95 percentile) during the test for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 1.4.2. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4 and 8 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

OVERVIEW:

  1. Based on plot 1.1. We could say that EC2 with Intel hasn’t an absolute advantage compared with Graviton and AMD. 
  2. Especially Intel and AMD, showing an advantage a little bit over – 20% over Graviton.
  3. In numbers, it is over five thousand and more requests per second. 
  4. AMD showed better results for two vCPU instances. 
  5. And it looks like in M6 class of Gravitons CPUs show the worst result compared with others.

Result for EC2 with 16 and 32 vCPU:

Result for EC2 with 16 and 32 vCPU

Plot 2.1. Throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.2. Latencies (95 percentile) during the test for EC2 with 16 and 32 vCPU for scenarios with 1,2 4,8,16,32,64,128 threads

 

Plot 2.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 2.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 16 and 32 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

OVERVIEW:

  1. Plot 2.1 shows that Intel vCPU is more performance efficient. AMD is the second, and Graviton is the third. 
  2. According to plots 2.3.1 and 2.3.2, Intel is better than Graviton up to 30 % and AMD is better than Graviton up to 20%. Graviton has some exceptional performance advantage over  AMD in some scenarios. But with this configuration and this instance classes, it is an exception according to the plot 2.3.2 scenarios for 8 and 16 concurrent threads. 
  3. In real numbers, Intel could execute up to 140 k read transactions more than Graviton CPUs, and AMD could read more than 70 k read transactions than Graviton. (plot 2.1. , plot 2.4.1.)
  4. In most cases, AMD and Intel are better than Graviton EC2 instances (plot 2.1, plot 2.3.2, plot 2.4.2).

 

Result for EC2 with 48 and 64 vCPU:

Result for EC2 with 48 and 64 vCPU

Plot 3.1. Throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.2. Latencies (95 percentile) during the test for EC2 with 48 and 64 vCPU for scenarios with 1,2 4,8,16,32,64,128 threads

 

Plot 3.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.3.2  Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 3.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

OVERVIEW:

  1. For “Large” instances, Intel is still better than other vCPU. AMD was still in second place, except when Graviton stole some cases. (plot 3.1.)
  2. According to plot 3.3.1. Intel showed an advantage over Graviton up to 45%. On the other hand, AMD showed an advantage over Graviton up to 20% in the same case.
  3. There were two cases when Graviton showed some better results, but it is an exception. 
  4. In real numbers: Intel could generate over 150k-200k read transactions more than Graviton. And AMD could execute more than 70k – 130k read transactions than Graviton.

 

Full Result Overview:

Plot 4.1.1. Throughput (queries per second) – bar plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.1.2. Throughput (queries per second) – line plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.2.1. Latencies (95 percentile) during the test – bar plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.2.2. Latencies (95 percentile) during the test – line plot for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.3.1 Percentage comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.3.2 Percentage comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.4.1. Numbers comparison Graviton and Intel CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.4.2. Numbers comparison Graviton and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.5.1. Percentage comparison INTEL and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

 

Plot 4.5.2. Numbers comparison INTEL and AMD CPU in throughput (queries per second) for EC2 with 2, 4, 8, 16, 32, 48 and 64 vCPU for scenarios with 1,2,4,8,16,32,64,128 threads

Final Thoughts

  1. We compare general-purpose EC2 (m6i, m6a, m6g) instances from the AWS platform and their behavior for MySQL.  
  2. In these competitions, Graviton instances (m6g) do not provide any competitive results for MySQL. 
  3. There was some strange behavior. AMD and Intel showed their best performance when loaded (in the number of threads) are equal numbers of vCPU. According to plot 4.1.2. We could see some jump in performance when the load becomes the same as the amount of vCPU. This point was hard to see on the bar chart. But this is very interesting. However, Graviton worked more slightly without any “jumps”, and that’s why it showed exceptionally better results in some scenarios with AMD.
  4. Last point. Everyone wants to see an AMD vs Intel comparison. Plot 4.5.1 and 4.5.2.  The result – Intel is better in most cases.  And AMD was better only in one case with 2 vCPU. So the advantage of Intel compared with AMD could rise up to 96% for “large instances” (in some cases). It is unbelievable. But in most cases, this advantage is that Intel could run in 30% more MySql read transactions than AMD.
  5. It is still an open question regarding the economic efficiency of all this EC2. We would research this topic and answer this question a little bit later.

APPENDIX:

List of EC2 used in research:

CPU type Cpu info:

Model name

EC2 Memory GB Amount vCPU EC2 price per hour (USD)
AMD AMD EPYC 7R13 Processor 2650 MHz m6a.large 8 2 $0.0864
AMD m6a.xlarge 16 4 $0.1728
AMD m6a.2xlarge 32 8 $0.3456
AMD m6a.4xlarge 64 16 $0.6912
AMD m6a.8xlarge 128 32 $1.3824
AMD m6a.12xlarge 192 48 $2.0736
AMD m6a.16xlarge 256 64 $2.7648
Graviton ARMv8 AWS Graviton2 2500 MHz m6g.large 8 2 $0.077 
Graviton m6g.xlarge 16 4 $0.154
Graviton m6g.2xlarge 32 8 $0.308
Graviton m6g.4xlarge 64 16 $0.616
Graviton m6g.8xlarge 128 32 $1.232
Graviton m6g.12xlarge 192 48 $1.848
Graviton m6g.16xlarge 256 64 $2.464
Intel Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz m6i.large 8 2 $0.096000
Intel m6i.xlarge 16 4 $0.192000
Intel m6i.2xlarge 32 8 $0.384000 
Intel m6i.4xlarge 64 16 $0.768000
Intel m6i.8xlarge 128 32 $1.536000
Intel m6i.12xlarge 192 48 $2.304000
Intel m6i.16xlarge 256 64 $3.072000

 

my.cnf:

[mysqld]

ssl=0

performance_schema=OFF

skip_log_bin

server_id = 7




# general

table_open_cache = 200000

table_open_cache_instances=64

back_log=3500

max_connections=4000

 join_buffer_size=256K

 sort_buffer_size=256K




# files

innodb_file_per_table

innodb_log_file_size=2G

innodb_log_files_in_group=2

innodb_open_files=4000




# buffers

innodb_buffer_pool_size=${80%_OF_RAM}

innodb_buffer_pool_instances=8

innodb_page_cleaners=8

innodb_log_buffer_size=64M




default_storage_engine=InnoDB

innodb_flush_log_at_trx_commit  = 1

innodb_doublewrite= 1

innodb_flush_method= O_DIRECT

innodb_file_per_table= 1

innodb_io_capacity=2000

innodb_io_capacity_max=4000

innodb_flush_neighbors=0

max_prepared_stmt_count=1000000 

bind_address = 0.0.0.0

[client]

 

Apr
25
2019
--

Creating Custom Sysbench Scripts

sysbench-lua for benchmark tooling

sysbench-lua for benchmark toolingSysbench has long been established as the de facto standard when it comes to benchmarking MySQL performance. Percona relies on it daily, and even Oracle uses it when blogging about new features in MySQL 8. Sysbench comes with several pre-defined benchmarking tests. These tests are written in an easy-to-understand scripting language called Lua. Some of these tests are called: oltp_read_write, oltp_point_select, tpcc, oltp_insert. There are over ten such scripts to emulate various behaviors found in standard OLTP applications.

But what if your application does not fit the pattern of traditional OLTP? How can you continue to utilize the power of load-testing, benchmarking, and results analysis with sysbench? Just write your own Lua script!

For those that want to jump ahead and see the full source, here you go.

Sysbench API

To start off, each Lua script you create must implement three core sysbench-Lua API functions. These are thread_init, thread_done, and event. You can read the comments in the code below for the meaning of each function and what is happening inside.

-- Called by sysbench one time to initialize this script
function thread_init()
  -- Create globals to be used elsewhere in the script
  -- drv - initialize the sysbench mysql driver
  drv = sysbench.sql.driver()
  -- con - represents the connection to MySQL
  con = drv:connect()
end
-- Called by sysbench when script is done executing
function thread_done()
  -- Disconnect/close connection to MySQL
  con:disconnect()
end
-- Called by sysbench for each execution
function event()
  -- If user requested to disable transactions,
  -- do not execute BEGIN statement
  if not sysbench.opt.skip_trx then
    con:query("BEGIN")
  end
  -- Run our custom statements
  execute_selects()
  execute_inserts()
  -- Like above, if transactions are disabled,
  -- do not execute COMMIT
  if not sysbench.opt.skip_trx then
    con:query("COMMIT")
  end
end

That’s all pretty simple and should function as a good template in your scripts. Now let’s take a look at the rest of the script.

Sanity checks and options

Now let’s get into the core code. At the top you’ll find the following sections:

if sysbench.cmdline.command == nil then
   error("Command is required. Supported commands: run")
end
sysbench.cmdline.options = {
  point_selects = {"Number of point SELECT queries to run", 5},
  skip_trx = {"Do not use BEGIN/COMMIT; Use global auto_commit value", false}
}

The first section is a sanity check to make sure the user actually wants to run this test. Other test scripts, mentioned above, support commands like prepare, run, and cleanup. Our script only supports run as the data we are using is pre-populated by our core application.

The second section allows us, the script writer, to let the user pass some options specific to our test script. In the code above, we can see an option for the number of SELECT statements that will be ran on each thread/iteration (default is 5) and the other option allows the user to disable BEGIN/COMMIT if they so desire (default is false). If you want more customization in your script, simply add more options. You’ll see how to reference these parameters later on.

The queries

Now it is time to define the custom queries we want to execute in our script.

-- Array of categories to be use in the INSERTs
local page_types = { "actor", "character", "movie" }
-- Array of COUNT(*) queries
local select_counts = {
  "SELECT COUNT(*) FROM imdb.title"
}
-- Array of SELECT statements that have 1 integer parameter
local select_points = {
  "SELECT * FROM imdb.title WHERE id = %d",
  "SELECT * FROM imdb.comments ORDER BY id DESC limit 10",
  "SELECT AVG(rating) avg FROM imdb.movie_ratings WHERE movie_id = %d",
  "SELECT * FROM imdb.users ORDER BY RAND() LIMIT 1"
}
-- Array of SELECT statements that have 1 string parameter
local select_string = {
  "SELECT * FROM imdb.title WHERE title LIKE '%s%%'"
}
-- INSERT statements
local inserts = {
  "INSERT INTO imdb.users (email_address, first_name, last_name) VALUES ('%s', '%s', '%s')",
  "INSERT INTO imdb.page_views (type, viewed_id, user_id) VALUES ('%s', %d, %d)"
}

The above code defines several arrays/lists of different queries. Why is this necessary? Later on in the code, we will have to parse each SQL statement and populate/replace the various parameters with randomly generated values. It would not do us any good to repeat the same SELECT * FROM fooTable WHERE id = 44 every time, now would it? Certainly not. We want to generate random numbers and have our queries select from the entire dataset.

Some queries have no parameters, some have integer-based, and some string-based. We will handle these differently below, which is why they are in different arrays above. This method also allows for future expansion. When you want to run additional queries within the script, just add another line to each array; no need to change any other code.

Parse and execute

The function below, execute_selects, will be called from the parent function, event, which we discussed earlier in the post. You can see for-loops for each of the three SELECT categories we created above. The comments inline should help explain what is happening. Note the use of the user-provided option –point-selects in the second loop below, which we created previously in the ‘Sanity and Options’ section.

function execute_selects()
  -- Execute each simple, no parameters, SELECT
  for i, o in ipairs(select_counts) do
    con:query(o)
  end
  -- Loop for however many queries the
  -- user wants to execute in this category
  for i = 1, sysbench.opt.point_selects do
    -- select random query from list
    local randQuery = select_points[math.random(#select_points)]
    -- generate random id and execute
    local id = sysbench.rand.pareto(1, 3000000)
    con:query(string.format(randQuery, id))
  end
  -- generate random string and execute
  for i, o in ipairs(select_string) do
    local str = sysbench.rand.string(string.rep("@", sysbench.rand.special(2, 15)))
    con:query(string.format(o, str))
  end
end

Two more things to mention for this code. First, you will notice the use of sysbench.rand.pareto to generate a random number between 1 and 3,000,000. For our dataset, we know that each table referenced in all queries relating to WHERE id = ? has that many number of rows, at minimum. This is specific to our data. Your values will certainly be different. Second, notice the use of sysbench.rand.string, and string.rep. The string.rep segment will generate a string comprised of ‘@’ symbols, between 2 and 15 characters long. That string of ‘@’ symbols will then be passed to sysbench.rand.string, which will swap out each ‘@’ for a random alphanumeric value. For example, ‘@@@@@@’ could be changed to ‘Hk9EdC’ which will then replace the ‘%s’ inside the query string (string.format) and be executed.

Handle inserts

Our INSERT statements require values. Again, sysbench calls the function execute_inserts from event on each iteration. Inside execute_inserts, we generate some fake string data using built-in functions described above.

Those strings are then formatted into the SQL and executed.

function create_random_email()
  local username = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  local domain = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  return username .. "@" .. domain .. ".com"
end
function execute_inserts()
  -- generate fake email/info
  local email = create_random_email()
  local firstname = sysbench.rand.string("first-" .. string.rep("@", sysbench.rand.special(2, 15)))
  local lastname = sysbench.rand.string("last-" .. string.rep("@", sysbench.rand.special(2, 15)))
  -- INSERT for new imdb.user
  con:query(string.format(inserts[1], email, firstname, lastname))
  -- INSERT for imdb.page_view
  local page = page_types[math.random(#page_types)]
  con:query(string.format(inserts[2], page, sysbench.rand.special(2, 500000), sysbench.rand.special(2, 500000)))
end

Example run

$ sysbench imdb_workload.lua \
    --mysql-user=imdb --mysql-password=imdb \
    --mysql-db=imdb --report-interval=1 \
    --events=0 --time=0 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.17 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 1 tps: 15.96 qps: 177.54 (r/w/o: 112.71/31.92/32.91) lat (ms,95%): 158.63 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 1 tps: 15.01 qps: 169.09 (r/w/o: 109.06/30.02/30.02) lat (ms,95%): 137.35 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 1 tps: 26.00 qps: 285.00 (r/w/o: 181.00/52.00/52.00) lat (ms,95%): 108.68 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 15.00 qps: 170.00 (r/w/o: 108.00/32.00/30.00) lat (ms,95%): 164.45 err/s: 0.00 reconn/s: 0.00

And there we have it! Custom queries specific to our application and dataset. Most of the sysbench parameters are self-explanatory, but let me talk about –report-interval=1 which shows statistics every 1 second. Normally sysbench does not output stats until the end of the run, however, the example execution will run forever (–events=0 –time=0) so we need stats to show all the time. You can adjust the parameters to your liking. For instance, if you only want to run a test for 5 minutes, set –events=0 –run-time=300.

Conclusion

Sysbench is a very well designed application that allows you to load-test your MySQL instances using pre-defined and custom queries. Using the Lua scripting language, you can create just about any scenario to fit your needs. The above is just one example that we use within Percona’s Training and Education department. It is by no means an exhaustive example of all of the capabilities of sysbench-Lua.


Photo by Lachlan Donald on Unsplash

Jun
15
2018
--

Tuning PostgreSQL for sysbench-tpcc

PostgreSQL benchmark

tuning PostgreSQL performance for sysbench-tpccPercona has a long tradition of performance investigation and benchmarking. Peter Zaitsev, CEO and Vadim Tkachenko, CTO, led their crew into a series of experiments with MySQL in this space. The discussion that always follows on the results achieved is well known and praised even by the PostgreSQL community. So when Avi joined the team and settled at Percona just enough to get acquainted with my colleagues, sure enough one of the first questions they asked him was: “did you know sysbench-tpcc also works with PostgreSQL now ?!“. 

sysbench

sysbench is “a scriptable multi-threaded benchmark tool based on LuaJIT (…) most frequently used for database benchmarks“, created and maintained by Alexey Kopytov. It’s been around for a long time now and has been a main reference for MySQL benchmarking since its inception. One of the favorites of Netflix’ Brendan Gregg, we now know. You may remember Sveta Smirnova and Alexander Korotkov’s report on their experiments in Millions of Queries per Second: PostgreSQL and MySQL’s Peaceful Battle at Today’s Demanding Workloads here. In fact, that post may serve as a nice prelude for the tests we want to show you today. It provides a good starting point as a MySQL vs PostgreSQL performance comparison.

The idea behind Sveta and Alexander’s experiments was “to provide an honest comparison for the two popular RDBMSs“, MySQL and PostgreSQL, using “the same tool, under the same challenging workloads and using the same configuration parameters (where possible)“. Since neither pgbench nor sysbench would work effectively with MySQL and PostgreSQL for both writes and reads they attempted to port pgbench‘s workload as a sysbench benchmark. 

sysbench-tpcc

More recently, Vadim came up with an implementation of the famous TPC-C workload benchmark for sysbench, sysbench-tpcc. He has since published a series of tests using Percona Server and MySQL, and worked to make it compatible with PostgreSQL too. For real now, hence the request that awaited us.

Our goal this time was less ambitious than Sveta and Alexander’s. We wanted to show you how we setup PostgreSQL to perform optimally for sysbench-tpcc, highlighting the settings we tuned the most to accomplish this. We ran our tests on the same box used by Vadim in his recent experiments with Percona Server for MySQL and MySQL.

A valid benchmark – benchmark rules

Before we present our results we shall note there are several ways to speed up database performance. You may for example disable full_page_writes, which would make a server crash unsafe, and use a minimalistic wal_level mode, which would block replication capability. These would speed things up but at the expense of reliability, making the server inappropriate for production usage.

For our benchmarks, we made sure we had all the necessary parameters in place to satisfy the following:

  1. ACID Compliance
  2. Point-in-time-recovery
  3. WALs usable by Replica/Slave for Replication
  4. Crash Recovery
  5. Frequent Checkpointing to reduce time for Crash Recovery
  6. Autovacuum

When we initially prepared sysbench-tpcc with PostgreSQL 10.3 the database size was 118 GB. By the time we completed the test, i.e. after 36000 seconds, the DB size had grown up to 335 GB. We have a total of “only” 256 GB of memory available in this server, however, based on the observations from pg_stat_database, pg_statio_user_tables and pg_statio_user_indexes 99.7% of the blocks were always in-memory:

postgres=# select ((blks_hit)*100.00)/(blks_hit+blks_read) AS “perc_mem_hit” from pg_stat_database where datname like ‘sbtest’;
   perc_mem_hit
---------------------
99.7267224322546
(1 row)

Hence, we consider it to be an in-memory workload with the whole active data set in RAM. In this post we explain how we tuned our PostgreSQL Instance for an in-memory workload, as was the case here.

Preparing the database before running sysbench

In order to run a sysbench-tpcc, we must first prepare the database to load some data. In our case, as mentioned above, this initial step resulted in a 118 GB database:

postgres=# select datname, pg_size_pretty(pg_database_size(datname)) as "DB_Size" from pg_stat_database where datname = 'sbtest';
 datname | DB_Size
---------+---------
 sbtest  | 118 GB
(1 row)

This may change depending on the arguments used. Here is the actual command we used to prepare the PostgreSQL Database for sysbench-tpcc:

$ ./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=120 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --db-driver=pgsql prepare

While we were loading the data, we wanted to see if we could speed-up the process. Here’s the customized PostgreSQL settings we used, some of them directly targeted to accelerate the data load:

shared_buffers = 192GB
maintenance_work_mem = '20GB'
wal_level = 'minimal'
autovacuum = 'OFF'
wal_compression = 'ON'
max_wal_size = '20GB'
checkpoint_timeout = '1h'
checkpoint_completion_target = '0.9'
random_page_cost = 1
max_wal_senders = 0
full_page_writes = ON
synchronous_commit = ON

We’ll discuss most of these parameters in the sections that follow, but we would like to highlight two of them here. We increased maintenance_work_mem to speed  up index creation and max_wal_size to delay checkpointing further, but not too much — this is a write-intensive phase after all. Using these parameters it took us 33 minutes to complete the prepare stage compared with 55 minutes when using the default parameters. 

If you are not concerned about crash recovery or ACID, you could turn off full_page_writes, fsync and synchrnous_commit. That would speed up the data load much more. 

Running a manual VACUUM ANALYZE after sysbench-tpcc’s initial prepare stage

Once we had prepared the database, as it is a newly created DB Instance, we ran a manual VACUUM ANALYZE on the database (in parallel jobs) using the command below. We employed all the 56 vCPUs available in the server since there was nothing else running in the machine:

$ /usr/lib/postgresql/10/bin/vacuumdb -j 56 -d sbtest -z

Having run a vacuum for the entire database we restarted PostgreSQL and cleared the OS cache before executing the benchmark in “run” mode. We repeated this process after each round.

First attempt with sysbench-tpcc

When we ran sysbench-tpcc for the first time, we observed a resulting TPS of 1978.48 for PostgreSQL with the server not properly tuned, running with default settings. We used the following command to run sysbench-tpcc for PostgreSQL for 10 hours (or 36000 seconds) for all rounds:

./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=36000 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=oracle --db-driver=pgsql run

PostgreSQL performance tuning of parameters for sysbench-tpcc (crash safe)

After getting an initial idea of how PostgreSQL performed with the default settings and the actual demands of the sysbench-tpcc workload, we began making progressive adjustments in the settings, observing how they impacted the server’s performance. After several rounds we came up with the following list of parameters (all of these satisfy ACID properties):

shared_buffers = '192GB'
work_mem = '4MB'
random_page_cost = '1'
maintenance_work_mem = '2GB'
wal_level = 'replica'
max_wal_senders = '3'
synchronous_commit = 'on'
seq_page_cost = '1'
max_wal_size = '100GB'
checkpoint_timeout = '1h'
synchronous_commit = 'on'
checkpoint_completion_target = '0.9'
autovacuum_vacuum_scale_factor = '0.4'
effective_cache_size = '200GB'
min_wal_size = '1GB'
bgwriter_lru_maxpages = '1000'
bgwriter_lru_multiplier = '10.0'
logging_collector = 'ON'
wal_compression = 'ON'
log_checkpoints = 'ON'
archive_mode = 'ON'
full_page_writes = 'ON'
fsync = 'ON'

Let’s discuss our reasoning behind the tuning of the most important settings:

shared_buffers

Defines the amount of memory PostgreSQL uses for shared memory buffers. It’s arguably its most important setting, often compared (for better or worse) to MySQL’s innodb_buffer_pool_size. The biggest difference, if we dare to compare shared_buffers to the Buffer Pool, is that InnoDB bypasses the OS cache to directly access (read and write) data in the underlying storage subsystem whereas PostgreSQL do not.

Does this mean PostgreSQL does “double caching” by first loading data from disk into the OS cache to then make a copy of these pages into the shared_buffers area? Yes.

Does this “double caching” makes PostgreSQL inferior to InnoDB and MySQL in terms of memory management? No. We’ll discuss why that is the case in a follow up blog post. For now it suffice to say the actual performance depends on the workload (mix of reads and writes), the size of the “hot data” (the portion of the dataset that is most accessed and modified) and how often checkpointing takes place.

How we chose the setting for shared_buffers to optimize PostgreSQL performance

Due to these factors, the documented suggested formula of setting shared_buffers to 25% of RAM or the magic number of “8GB” is hardly ideal. What seems to be good reasoning, though, is this:

  • If you can fit the whole of your “hot data” in memory, then dedicating most of your memory to shared_buffers pays off nicely, making PostgreSQL behave as close to an in-memory database as possible.
  • If the size of your “hot data” surpasses the amount of memory you have available in the server, then you’re probably better off working with a much smaller shared_buffers area and relying more on the OS cache.

For this benchmark, considering the options we used, we found that dedicating 75% of all the available memory to shared_buffers is ideal. It is enough to fit the entire “hot data” and still leave sufficient memory for the OS to operate, handle connections and everything else.

work_mem

This setting defines the amount of memory that can be used by each query (not session) for internal sort operations (such as ORDER BY and DISTINCT), and hash tables (such as when doing hash-based aggregation). Beyond this, PostgreSQL moves the data into temporary disk files. The challenge is usually finding a good balance here. We want to avoid the use of temporary disk files, which slow down query completion and in turn may cause contention. But we don’t want to over-commit memory, which could even lead to OOM; working with high values for work_mem may be destructive when it is not really needed.

We analyzed the workload produced by sysbench-tpcc and found with some surprise that work_mem doesn’t play a role here, considering the queries that were executed. So we kept the default value of 4MB. Please note that this is seldom the case in production workloads, so it is important to always keep an eye on that parameter.

random_page_cost

This setting stipulates the cost that a non-sequentially-fetched disk page would have, and directly affects the query planner’s decisions. Going with a conservative value is particularly important when using high latency storage, such as spinning disks. This wasn’t our case, hence we could afford to equalize random_page_cost to seq_page_cost. So, we set this parameter to 1 as well, down from the default value of 4.

wal_level, max_wal_senders and archive_mode

To set up streaming replication wal_level needs to be set to at least “replica” and archive_mode must be enabled. This means the amount of WAL data produced increases significantly compared to when using default settings for these parameters, which in turn impacts IO. However, we considered these with a production environment in mind.

wal_compression

For this workload, we observed total WALs produced of size 3359 GB with wal_compression disabled and 1962 GB with wal_compression. We enabled wal_compression to reduce IO — the amount (and, most importantly, the rate) of WAL files being written to disk — at the expense of some additional CPU cycles. This proved to be very effective in our case as we had a surplus of CPU available.

checkpoint_timeout, checkpoint_completion_target and max_wal_size

We set the checkpoint_timeout to 1 hour and checkpoint_completion_target to 0.9. This means a checkpoint is forced every 1 hour and it has 90% of the time before the next checkpoint to spread the writes. However, a checkpoint is also forced when max_wal_size of WAL’s have been generated. With these parameters for a sysbench-tpcc workload, we saw that there were 3 to 4 checkpoints every 1 hour. This is especially because of the amount of WALs being generated.

In production environments we would always recommend you perform a manual CHECKPOINT before shutting down PostgreSQL in order to allow for a faster restart (recovery) time. In this context, issuing a manual CHECKPOINT took us between 1 and 2 minutes, after which we were able to restart PostgreSQL in just about 4 seconds. Please note that in our testing environment, taking time to restart PostgreSQL was not a concern, so working with this checkpoint rate benefited us. However, if you cannot afford a couple of minutes for crash recovery it is always suggested to force checkpointing to take place more often, even at the cost of some degraded performance.

full_page_writes, fsync and synchronous_commit

We set all of these parameters to ON to satisfy ACID properties.

autovacuum

We enabled autovacuum and other vacuum settings to ensure vacuum is being performed in the backend. We will discuss the importance of maintaining autovacuum enabled in a production environment, as well as the danger of doing otherwise, in a separate post. 

Amount of WAL’s (Transaction Logs) generated after 10 hours of sysbench-tpcc

Before we start to discuss the numbers it is important to highlight that we enabled wal_compression before starting sysbench. As we mentioned above, the amount of WALs generated with wal_compression set to OFF was more than twice the amount of WALs generated when having compression enabled. We observed that enabling wal_compression resulted in an increase in TPS of 21%. No wonder, the production of WALs has an important impact on IO: so much so that it is very common to find PostgreSQL servers with a dedicated storage for WALs only. Thus, it is important to highlight the fact wal_compression may benefit write-intensive workloads by sparing IO at the expense of additional CPU usage.

To find out the total amount of WALs generated after 10 Hours, we took note at the WAL offset from before we started the test and after the test completed:

WAL Offset before starting the sysbench-tpcc ? 2C/860000D0
WAL Offset after 10 hours of sysbench-tpcc   ? 217/14A49C50

and subtracted one from the other using pg_wal_lsn_diff, as follows:

postgres=# SELECT pg_size_pretty(pg_wal_lsn_diff('217/14A49C50','2C/860000D0'));
pg_size_pretty
----------------
1962 GB
(1 row)

1962 GB of WALs is a fairly big amount of transaction logs produced over 10 hours, considering we had enabled wal_compression .

We contemplated making use of a separate disk to store WALs to find out by how much more a dedicated storage for transaction logs would benefit overall performance. However, we wanted to keep using the same hardware Vadim had used for his previous tests, so decided against this.

Crash unsafe parameters

Setting full_page_writes, fsync and synchronous_commit to OFF may speed up the performance but it is always crash unsafe unless we have enough backup in place to consider these needs. For example, if you are using a COW FileSystem with Journaling, you may be fine with full_page_writes set to OFF. This may not be true 100% of the time though.

However, we still want to share the results with the crash unsafe parameters mentioned in the paragraph above as a reference.

Results after 10 Hours of sysbench-tpcc for PostgreSQL with default, crash safe and crash unsafe parameters

Here are the final numbers we obtained after running sysbench-tpcc for 10 hours considering each of the scenarios above:

Parameters

TPS

Default / Untuned

1978.48

Tuned (crash safe)

5736.66

Tuned (crash unsafe)

7881.72

Did we expect to get these numbers? Yes and no.

Certainly we expected a properly tuned server would outperform one running with default settings considerably but we can’t say we expected it to be almost three times better (2.899). With PostgreSQL making use of the OS cache it is not always the case that tuning shared_buffers in particular will make such a dramatic difference. By comparison, tuning MySQL’s InnoDB Buffer Pool almost always makes a difference. For PostgreSQL high performance it depends on the workload. In this case for sysbench-tpcc benchmarks, tuning shared_buffers definitely makes a difference.

On the other hand experiencing an additional order of magnitude faster (4x), when using crash unsafe settings, was not much of a surprise.

Here’s an alternative view of the results of our PostgreSQL insert performance tuning benchmarks:

Sysbench-TPCC with PostgreSQL

What did you think about this experiment? Please let us know in the comments section below and let’s get the conversation going.

Hardware spec

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic
  • PostgreSQL: version 10.3
  • sysbench-tpcc: https://github.com/Percona-Lab/sysbench-tpcc

The post Tuning PostgreSQL for sysbench-tpcc appeared first on Percona Database Performance Blog.

Sep
21
2017
--

Percona Live Europe Featured Talks: Modern sysbench – Teaching an Old Dog New Tricks with Alexey Kopytov

Percona Live Europe 2017

Percona Live EuropeWelcome to another post in our series of interview blogs for the upcoming Percona Live Europe 2017 in Dublin. This series highlights a number of talks that will be at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post is with Alexey Kopytov, sofware developer and maintainer of sysbench. His talk is Modern sysbench: Teaching an Old Dog New Tricks. His presentation present new features provided by recent releases and explain how they can be used to create complex benchmark scenarios and collect performance metrics with a simple Lua API. It will also run a live demo of some of the new sysbench features.

In our conversation, we discussed benchmarking your database environment:

Percona: How did you get into database technology? What do you love about it?

Alexey: It was 2003, and I was working as a software developer for a boring company providing hosted VoIP solutions. I was a big fan of the free and open source software philosophy, which was way less popular back then than it is today. I contributed to a number of open source projects in my free time, but I also had a dream of developing open source software as part of my paid job. This looked completely unrealistic at the time, until I came across a job posting on a Russian IT forum about a Swedish company called MySQL AB looking for software developers to work remotely on MySQL! That sounded like my dream job, so I applied.

I knew very little about database internals at the time, so looking back I was giving terrible answers during my job interviews. Nevertheless, I joined the High Performance Group at MySQL AB after a few months, and that has defined my professional life for many years.

I love database technology because it presents the toughest challenges in software development. Most problems and solutions related to ever-evolving hardware, scalability and data processing requirements are discovered first by people from the database world.

Percona: Your talk is called “Modern sysbench: Teaching an Old Dog New Tricks”. What is sysbench used for generally, why is it important and how have you used it in your career? 

Alexey: sysbench was an internal project that I took over as soon as I joined MySQL AB. We used it to troubleshoot customer issues, find performance bottlenecks in MySQL and evaluate new features. Of course it was an open source project, so over the years we’ve got many people from the MySQL community using sysbench for all kinds of performance research like testing new hardware, identifying performance-related issues and comparing MySQL configurations, versions and forks.

Percona: What are some of the important new developments in the latest release?

Alexey: This year sysbench got a major upgrade in terms of features and performance to meet the modern world of many-core CPUs, powerful storage devices and distributed database systems capable of processing millions of transactions per second. Some feature highlights from the latest release include simplified command-line interface, a revamped API which allows creating more complex benchmark scenarios with less code, new performance metrics, customizable reports and more!

Percona: What do you want attendees to take away from your session? Why should they attend?

Alexey: sysbench is quite popular, but most people rarely use it more than a few bundled OLTP-style benchmarks. I’d like to explain its full potential, especially the possibilities provided by the new features. I want people to use it to create their own benchmarks, not necessarily related to MySQL, and hopefully find sysbench useful in areas that I have not even envisioned myself.

Percona: What are you most looking forward to at Percona Live Europe 2017?

Alexey: For me Percona Live conferences have always been the place where I can feel the pulse of the technology and learn from the smartest people in the industry. This is especially true now that Percona Live provides talks on diverse topics from communities and database management technologies other than MySQL. Which makes it an even greater event to share ideas, solutions and expertise.

Want to find out more about Alexey, sysbench and database benchmarking? Register for Percona Live Europe 2017, and see his talk Modern sysbench: Teaching an Old Dog New Tricks. Register now to get the best price! Use discount code SeeMeSpeakPLE17 to get 10% off your registration.

Percona Live Open Source Database Conference Europe 2017 in Dublin is the premier European open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, MariaDB, MongoDB, time series database, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference Europe will be September 25-27, 2017 at the Radisson Blu Royal Hotel, Dublin.

Sep
20
2017
--

sysbench Histograms: A Helpful Feature Often Overlooked

Sysbench Histograms

Sysbench HistogramsIn this blog post, I will demonstrate how to run and use sysbench histograms.

One of the features of sysbench that I often I see overlooked (and rarely used) is its ability to produce detailed query response time histograms in addition to computing percentile numbers. Looking at histograms together with throughput or latency over time provides many additional insights into query performance.

Here is how you get detailed sysbench histograms and performance over time:

sysbench --rand-type=uniform --report-interval=1 --percentile=99 --time=300 --histogram --mysql-password=sbtest oltp_point_select --table_size=400000000 run

There are a few command line options to consider:

  • report-interval=1 prints out the current performance measurements every second, which helps see if performance is uniform, if you have stalls or otherwise high variance
  • percentile=99 computes 99 percentile response time, rather than 95 percentile (the default); I like looking at 99 percentile stats as it is a better measure of performance
  • histogram=on produces a histogram at the end of the run (as shown below)

The first thing to note about this histogram is that it is exponential. This means the width of the buckets changes with higher values. It starts with 0.001 ms (one microsecond) and gradually grows. This design is used so that sysbench can deal with workloads with requests that take small fractions of milliseconds, as well as accommodate requests that take many seconds (or minutes).

Next, we learn some us very interesting things about typical request response time distribution for databases. You might think that this distribution would be close to some to some “academic” distributions, such as normal distribution. In reality, we often observe is something of a “camelback” distribution (not a real term) – and our “camel” can have more than two humps (especially for simple requests such as the single primary key lookup shown here).

Why do request response times tend to have this distribution? It is because requests can take multiple paths inside the database. For example, certain requests might get responses from the MySQL Query Cache (which will result in the first hump). A second hump might come from resolving lookups using the InnoDB Adaptive Hash Index. A third hump might come from finding all the data in memory (rather than the Adaptive Hash Index). Finally, another hump might coalesce around the time (or times) it takes to execute on requests that require disk IO.    

You also will likely see some long-tail data that highlights the fact that MySQL and Linux are not hard, real-time systems. As an example, this very simple run with a single thread (and thus no contention) has an outlier at around 18ms. Most of the requests are served within 0.2ms or less.

As you add contention, row-level locking, group commit and other issues, you are likely to see even more complicated diagrams – which can often show you something unexpected:

Latency histogram (values are in milliseconds)
      value  ------------- distribution ------------- count
      0.050 |                                         1
      0.051 |                                         2
      0.052 |                                         2
      0.053 |                                         54
      0.053 |                                         79
      0.054 |                                         164
      0.055 |                                         883
      0.056 |*                                        1963
      0.057 |*                                        2691
      0.059 |**                                       4047
      0.060 |****                                     9480
      0.061 |******                                   15234
      0.062 |********                                 20723
      0.063 |********                                 20708
      0.064 |**********                               26770
      0.065 |*************                            35928
      0.066 |*************                            34520
      0.068 |************                             32247
      0.069 |************                             31693
      0.070 |***************                          41682
      0.071 |**************                           37862
      0.073 |********                                 22691
      0.074 |******                                   15907
      0.075 |****                                     10509
      0.077 |***                                      7853
      0.078 |****                                     9880
      0.079 |****                                     10853
      0.081 |***                                      9243
      0.082 |***                                      9280
      0.084 |***                                      8947
      0.085 |***                                      7869
      0.087 |***                                      8129
      0.089 |***                                      9073
      0.090 |***                                      8364
      0.092 |***                                      6781
      0.093 |**                                       4672
      0.095 |*                                        3356
      0.097 |*                                        2512
      0.099 |*                                        2177
      0.100 |*                                        1784
      0.102 |*                                        1398
      0.104 |                                         1082
      0.106 |                                         810
      0.108 |                                         742
      0.110 |                                         511
      0.112 |                                         422
      0.114 |                                         330
      0.116 |                                         259
      0.118 |                                         203
      0.120 |                                         165
      0.122 |                                         126
      0.125 |                                         108
      0.127 |                                         87
      0.129 |                                         83
      0.132 |                                         55
      0.134 |                                         42
      0.136 |                                         45
      0.139 |                                         41
      0.141 |                                         149
      0.144 |                                         456
      0.147 |                                         848
      0.149 |*                                        2128
      0.152 |**                                       4586
      0.155 |***                                      7592
      0.158 |*****                                    13685
      0.160 |*********                                24958
      0.163 |*****************                        44558
      0.166 |*****************************            78332
      0.169 |*************************************    98616
      0.172 |**************************************** 107664
      0.176 |**************************************** 107154
      0.179 |****************************             75272
      0.182 |******************                       49645
      0.185 |****************                         42793
      0.189 |*****************                        44649
      0.192 |****************                         44329
      0.196 |******************                       48460
      0.199 |*****************                        44769
      0.203 |**********************                   58578
      0.206 |***********************                  61373
      0.210 |**********************                   58758
      0.214 |******************                       48012
      0.218 |*************                            34533
      0.222 |**************                           36517
      0.226 |*************                            34645
      0.230 |***********                              28694
      0.234 |*******                                  17560
      0.238 |*****                                    12920
      0.243 |****                                     10911
      0.247 |***                                      9208
      0.252 |****                                     10556
      0.256 |***                                      7561
      0.261 |**                                       5047
      0.266 |*                                        3757
      0.270 |*                                        3584
      0.275 |*                                        2951
      0.280 |*                                        2078
      0.285 |*                                        2161
      0.291 |*                                        1747
      0.296 |*                                        1954
      0.301 |*                                        2878
      0.307 |*                                        2810
      0.312 |*                                        1967
      0.318 |*                                        1619
      0.324 |*                                        1409
      0.330 |                                         1205
      0.336 |                                         1193
      0.342 |                                         1151
      0.348 |                                         989
      0.354 |                                         985
      0.361 |                                         799
      0.367 |                                         671
      0.374 |                                         566
      0.381 |                                         537
      0.388 |                                         351
      0.395 |                                         276
      0.402 |                                         214
      0.409 |                                         143
      0.417 |                                         80
      0.424 |                                         85
      0.432 |                                         54
      0.440 |                                         41
      0.448 |                                         29
      0.456 |                                         16
      0.464 |                                         15
      0.473 |                                         11
      0.481 |                                         4
      0.490 |                                         9
      0.499 |                                         4
      0.508 |                                         3
      0.517 |                                         4
      0.527 |                                         4
      0.536 |                                         2
      0.546 |                                         4
      0.556 |                                         4
      0.566 |                                         4
      0.587 |                                         1
      0.597 |                                         1
      0.608 |                                         5
      0.619 |                                         3
      0.630 |                                         2
      0.654 |                                         2
      0.665 |                                         5
      0.677 |                                         26
      0.690 |                                         298
      0.702 |                                         924
      0.715 |*                                        1493
      0.728 |                                         1027
      0.741 |                                         1112
      0.755 |                                         1127
      0.768 |                                         796
      0.782 |                                         574
      0.797 |                                         445
      0.811 |                                         415
      0.826 |                                         296
      0.841 |                                         245
      0.856 |                                         202
      0.872 |                                         210
      0.888 |                                         168
      0.904 |                                         217
      0.920 |                                         163
      0.937 |                                         157
      0.954 |                                         204
      0.971 |                                         155
      0.989 |                                         158
      1.007 |                                         137
      1.025 |                                         94
      1.044 |                                         79
      1.063 |                                         52
      1.082 |                                         36
      1.102 |                                         25
      1.122 |                                         25
      1.142 |                                         16
      1.163 |                                         8
      1.184 |                                         5
      1.205 |                                         7
      1.227 |                                         2
      1.250 |                                         4
      1.272 |                                         3
      1.295 |                                         3
      1.319 |                                         2
      1.343 |                                         2
      1.367 |                                         1
      1.417 |                                         2
      1.791 |                                         1
      1.996 |                                         2
      2.106 |                                         2
      2.184 |                                         1
      2.264 |                                         1
      2.347 |                                         2
      2.389 |                                         1
      2.433 |                                         1
      2.477 |                                         1
      2.568 |                                         2
      2.615 |                                         1
      2.710 |                                         1
      2.810 |                                         1
      2.861 |                                         1
      3.187 |                                         1
      3.488 |                                         1
      3.816 |                                         1
      4.028 |                                         1
      6.913 |                                         1
      7.565 |                                         1
      8.130 |                                         1
     17.954 |                                         1

I hope you give sysbench histograms a try, and see what you can discover!

Jan
26
2017
--

Performance Schema Benchmarks: OLTP RW

Performance Schema Benchmarks

In this blog post, we’ll look at Performance Schema benchmarks for OLTP Read/Write workloads.

I am in love with Performance Schema and talk a lot about it. Performance Schema is a revolutionary MySQL troubleshooting instrument, but earlier versions had performance issues. Many of these issues are fixed now, and the default options work quickly and reliably. However, there is no free lunch! It is expected that the more instruments you turn ON, the more overhead you’ll have.

The advice I give my customers is that when in doubt, only turn ON the instruments that are required to troubleshoot your issue. Many of them ask: what exactly are the overhead costs for one instrumentation or another? I believe the best answer is “test on your system!” No generic benchmark can exactly repeat a workload on your site. But while I was working on the “OpenSource Databases on Big Machines” project, I decided to test the performance of Performance Schema as well.

I only tested a Read/Write workload. I used the same fast machine (144 CPU cores), the same MySQL options and the same SysBench commands that I described in this post. The option 

innodb_flush_method

 was changed to O_DIRECT, because it’s more reasonable for real-life workloads. I also upgraded the MySQL Server version to Oracle’s MySQL 5.7.17. The reason for the upgrade was to test if the issue described in this post is repeatable with latest Oracle MySQL server version. But since I tested Performance Schema, the effect on Percona Server for MySQL should be same.

I tested nine different scenarios:

  1. “All disabled”: Performance Schema is ON, but all instruments and consumers are disabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
  2. “All enabled”: Performance Schema is ON, and all instruments and consumers are enabled.
    update setup_instruments set enabled='yes';
    update setup_consumers set enabled='yes';
  3. “Default”: Performance Schema is ON, and only default instruments and consumers are enabled.
  4. “MDL only”: only Metadata Lock instrumentation is enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name='wait/lock/metadata/sql/mdl';
  5. “Memory only”: only Memory instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'memory%';
  6. “Stages and Statements”: only Stages and Statements instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'statement%';
    update setup_consumers set enabled='yes' where name like 'events_statements%';
    update setup_instruments set enabled='yes' where name like 'stage%';
    update setup_consumers set enabled='yes' where name like 'events_stages%';
  7. “Stages only”: only Stages instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'stage%';
    update setup_consumers set enabled='yes' where name like 'events_stages%';
  8. “Statements only”: only Statements instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'statement%';
    update setup_consumers set enabled='yes' where name like 'events_statements%';
  9. “Waits only”: only Waits instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'wait%' ;
    update setup_consumers set enabled='yes' where name like 'events_waits%';

Here are the overall results.

As you can see, some instrumentation only slightly affects performance, while others affect it a lot. I created separate graphs to make the picture clearer.

As expected, enabling all instrumentation makes performance lower:

Does this mean to use Performance Schema, you need to start the server with it ON and then disable all instruments? No! The default options have very little effect on performance:

The same is true for Metadata Locks, Memory and Statements instrumentation:

Regarding statements, I should note that I used prepared statements (which are instrumented in version 5.7). But it makes sense to repeat the tests without prepared statements.

The Stages instrumentation starts affecting performance:

However, the slowdown is reasonable and it happens only after we reach 32 concurrent threads. It still provides great insights on what is happening during query execution.

The real performance killer is Waits instrumentation:

It affects performance close to the same way as all instruments ON.

Conclusion

Using Performance Schema with the default options, Memory, Metadata Locks and Statements instrumentation doesn’t have a great impact on read-write workload performance. You might notice slowdowns with Stages instrumentation after reaching 32 actively running parallel connections. The real performance killer is Waits instrumentation. And even with it on, you will start to notice a performance drop only after 10,000 transactions per second.

Save

Save

Save

Save

Save

Save

Save

Jan
06
2017
--

Millions of Queries per Second: PostgreSQL and MySQL’s Peaceful Battle at Today’s Demanding Workloads

PostgreSQL and MySQL

This blog compares how PostgreSQL and MySQL handle millions of queries per second.

Anastasia: Can open source databases cope with millions of queries per second? Many open source advocates would answer “yes.” However, assertions aren’t enough for well-grounded proof. That’s why in this blog post, we share the benchmark testing results from Alexander Korotkov (CEO of Development, Postgres Professional) and Sveta Smirnova (Principal Technical Services Engineer, Percona). The comparative research of PostgreSQL 9.6 and MySQL 5.7 performance will be especially valuable for environments with multiple databases.

The idea behind this research is to provide an honest comparison for the two popular RDBMSs. Sveta and Aleksander wanted to test the most recent versions of both MySQL and PostgreSQL with the same tool, under the same challenging workloads and using the same configuration parameters (where possible). However, because both PostgreSQL and MySQL ecosystems evolved independently, with standard testing tools (pgbench and SysBench) used for each database, it wasn’t an easy journey.

The task fell to database experts with years of hands-on experience. Sveta has worked as Senior Principal Technical Support Engineer in the Bugs Verification Group of the MySQL Support Group at Oracle for more than eight years, and since 2015  has worked as a Principal Technical Services Engineer at Percona. Alexander Korotkov is a PostgreSQL major contributor, and the developer of a number PostgreSQL features – including the CREATE ACCESS METHOD command, generic WAL interface, lockfree Pin/UnpinBuffer, index-based search for regular expressions and much more. So we have a pretty decent cast for this particular play!

SvetaDimitri Kravtchuk regularly publishes detailed benchmarks for MySQL, so my main task wasn’t confirming that MySQL can do millions of queries per second. As our graphs will show, we’ve passed that mark already. As a Support Engineer, I often work with customers who have heterogeneous database environments in their shops, and want to know about the impact of migrating jobs from one database to another. So instead, I found the chance to work with the Postgres Professional company and identify both the strong and weak points of the two databases an excellent opportunity.

We wanted to test both databases on the same hardware, using the same tools and tests. We expected to test base functionality, and then work on more detailed comparisons. That way we could compare different real-world use case scenarios and popular options.

Spoiler: We are far from the final results. This is the start of a blog series.

OpenSource Databases on Big Machines, Series 1: “That Was Close…”

PostgreSQL Professional together with Freematiq provided two modern, powerful machines for tests.

Hardware configuration:

Processors: physical = 4, cores = 72, virtual = 144, hyperthreading = yes
Memory: 3.0T
Disk speed: about 3K IOPS
OS: CentOS 7.1.1503
File system: XFS

I also used a smaller Percona machine.

Hardware configuration:

Processors: physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Memory: 251.9G
Disk speed: about 33K IOPS
OS: Ubuntu 14.04.5 LTS
File system: EXT4

Note that machines with smaller numbers of CPU cores and faster disks are more common for MySQL installations than machines with larger numbers of cores.

The first thing we needed to agree on is which tool to use. A fair comparison only makes sense if the workloads are as close as possible.

The standard PostgreSQL tool for performance tests is pgbench, while for MySQL it’s SysBench. SysBench supports multiple database drivers and scriptable tests in the Lua programming language, so we decided to use this tool for both databases.

The initial plan was to convert pgbench tests into SysBench Lua syntax, and then run standard tests on both databases. After initial results, we modified our tests to better examine specific MySQL and PostgreSQL features.

I converted pgbench tests into SysBench syntax, and put the tests into an open-database-bench GitHub repository.

And then we both faced difficulties.

As I wrote already, I also ran the tests on a Percona machine. For this converted test, the results were almost identical:

Percona machine:

OLTP test statistics:
       transactions:                        1000000 (28727.81 per sec.)
       read/write requests:                 5000000 (143639.05 per sec.)
       other operations:                    2000000 (57455.62 per sec.)

Freematiq machine:

OLTP test statistics:
       transactions:                        1000000 (29784.74 per sec.)
       read/write requests:                 5000000 (148923.71 per sec.)
       other operations:                    2000000 (59569.49 per sec.)

I started investigating. The only place where the Percona machine was better than Freematiq’s was disk speed. So I started running the pgbench read-only test, which was identical to SysBench’s point select test with full dataset in memory. But this time SysBench used 50% of available CPU resources:

PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
4585 smirnova  20   0  0,157t 0,041t   9596 S  7226  1,4  12:27.16 mysqld
8745 smirnova  20   0 1266212 629148   1824 S  7126  0,0   9:22.78 sysbench

Alexander, in turn, had issues with SysBench, which could not create a high load on PostgreSQL when prepared statements were used:

93087 korotkov  20   0 9289440 3,718g   2964 S 242,6  0,1   0:32.82 sysbench
93161 korotkov  20   0 32,904g  81612  80208 S   4,0  0,0   0:00.47 postgres
93116 korotkov  20   0 32,904g  80828  79424 S   3,6  0,0   0:00.46 postgres
93118 korotkov  20   0 32,904g  80424  79020 S   3,6  0,0   0:00.47 postgres
93121 korotkov  20   0 32,904g  80720  79312 S   3,6  0,0   0:00.47 postgres
93128 korotkov  20   0 32,904g  77936  76536 S   3,6  0,0   0:00.46 postgres
93130 korotkov  20   0 32,904g  81604  80204 S   3,6  0,0   0:00.47 postgres
93146 korotkov  20   0 32,904g  81112  79704 S   3,6  0,0   0:00.46 postgres

We contacted SysBench author Alexey Kopytov, and he fixed MySQL issue. The solution is:

  • Use SysBench with options
    --percentile=0 --max-requests=0

      (reasonable CPU usage)

  • Use concurrency_kit branch (better concurrency and Lua processing)
  • Rewrite Lua scripts to support prepared statements (pull request: https://github.com/akopytov/sysbench/pull/94)
  • Start both SysBench and mysqld with the jemalloc or tmalloc library pre-loaded

A fix for PostgreSQL is on the way. For now, Alexander converted a standard SysBench test into pgbench format and we stuck with it. Not much new for MySQL, but at least we had a baseline for comparison.

The next difficulty I faced was the default operating system parameters. To make the long story short, I changed them to the recommended ones (described below):

vm.swappiness=1
cpupower frequency-set --governor performance
kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns= 5000000
vm.dirty_background_bytes=67108864
vm.dirty_bytes=536870912
IO scheduler [deadline]

The same parameters were better for PostgreSQL performance as well. Alexander set his machine similarly.

After solving these issues we learned and implemented the following:

  • We cannot use a single tool (for now)
  • Alexander wrote a test for pgbench, imitating the standard SysBench tests
  • We are still not able to write custom tests because we use different tools

But we could use these tests as a baseline. After work done by Alexander, we stuck with the standard SysBench tests. I converted them to use prepared statements, and Alexander converted them into pgbench format.

I should mention that I was not able to get the same results for the Read Only and Point Select tests as Dimitri. They are close, but slightly slower. We need to investigate if this is the result of different hardware, or my lack of performance testing abilities. The results from the Read-Write tests are similar.

Another difference was between the PostgreSQL and MySQL tests. MySQL users normally have many connections. Setting the value of the variable

max_conenctions

, and limiting the total number of parallel connections to thousands is not rare nowadays. While not recommended, people use this option even without the thread pool plugin. In real life, most of these connections are sleeping. But there is always a chance they all will used in cases of increased website activity.

For MySQL I tested up to 1024 connections. I used powers of two and multiplies of the number of cores: 1, 2, 4, 8, 16, 32, 36, 64, 72, 128, 144, 256, 512 and 1024 threads.

For Alexander, it was more important to test in smaller steps. He started from one thread and increased by 10 threads, until 250 parallel threads were reached. So you will see a more detailed graph for PostgreSQL, but no results after 250 threads.

Here are our comparison results.

Point SELECTs

PostgreSQL and MySQL

  • pgsql-9.6 is standard PostgreSQL
  • pgsql-9.6 + pgxact-align is PostgreSQL with this patch (more details can be found in this blog post)
  • MySQL-5.7 Dimitri is Oracle’s MySQL Server
  • MySQL-5.7 Sveta is Percona Server 5.7.15

OLTP RO

PostgreSQL and MySQL

OLTP RW

PostgreSQL and MySQL

Sync commit in PostgreSQL is a feature, similar to

innodb_flush_log_at_trx_commit=1

 in InnoDB, and async commit is similar to

innodb_flush_log_at_trx_commit=2

.

You see that the results are very similar: both databases are developing very fast and work with modern hardware well.

MySQL results which show 1024 threads for reference.

Point SELECT and OLTP RO

PostgreSQL and MySQL

OLTP RW with innodb_flush_log_at_trx_commit set to 1 and 2

PostgreSQL and MySQL

After receiving these results, we did a few feature-specific tests that will be covered in separate blog posts.

More Information

MySQL Options for OLTP RO and Point SELECT tests:

# general
table_open_cache = 8000
table_open_cache_instances=16
back_log=1500
query_cache_type=0
max_connections=4000
# files
innodb_file_per_table
innodb_log_file_size=1024M
innodb_log_files_in_group=3
innodb_open_files=4000
# Monitoring
innodb_monitor_enable = '%'
performance_schema=OFF #cpu-bound, matters for performance
#Percona Server specific
userstat=0
thread-statistics=0
# buffers
innodb_buffer_pool_size=128000M
innodb_buffer_pool_instances=128 #to avoid wait on InnoDB Buffer Pool mutex
innodb_log_buffer_size=64M
# InnoDB-specific
innodb_checksums=1 #Default is CRC32 in 5.7, very fast
innodb_use_native_aio=1
innodb_doublewrite= 1 #https://www.percona.com/blog/2016/05/09/percona-server-5-7-parallel-doublewrite/
innodb_stats_persistent = 1
innodb_support_xa=0 #(We are read-only, but this option is deprecated)
innodb_spin_wait_delay=6 #(Processor and OS-dependent)
innodb_thread_concurrency=0
join_buffer_size=32K
innodb_flush_log_at_trx_commit=2
sort_buffer_size=32K
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=10
innodb_lru_scan_depth=4000
innodb_page_cleaners=4
# perf special
innodb_adaptive_flushing = 1
innodb_flush_neighbors = 0
innodb_read_io_threads = 4
innodb_write_io_threads = 4
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_purge_threads=4
innodb_max_purge_lag_delay=30000000
innodb_max_purge_lag=0
innodb_adaptive_hash_index=0 (depends on workload, always check)

MySQL Options for OLTP RW:

#Open files
table_open_cache = 8000
table_open_cache_instances = 16
query_cache_type = 0
join_buffer_size=32k
sort_buffer_size=32k
max_connections=16000
back_log=5000
innodb_open_files=4000
#Monitoring
performance-schema=0
#Percona Server specific
userstat=0
thread-statistics=0
#InnoDB General
innodb_buffer_pool_load_at_startup=1
innodb_buffer_pool_dump_at_shutdown=1
innodb_numa_interleave=1
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_doublewrite=1
innodb_support_xa=1
innodb_checksums=1
#Concurrency
innodb_thread_concurrency=144
innodb_page_cleaners=8
innodb_purge_threads=4
innodb_spin_wait_delay=12 Good value for RO is 6, for RW and RC is 192
innodb_log_file_size=8G
innodb_log_files_in_group=16
innodb_buffer_pool_size=128G
innodb_buffer_pool_instances=128 #to avoid wait on InnoDB Buffer Pool mutex
innodb_io_capacity=18000
innodb_io_capacity_max=36000
innodb_flush_log_at_timeout=0
innodb_flush_log_at_trx_commit=2
innodb_flush_sync=1
innodb_adaptive_flushing=1
innodb_flush_neighbors = 0
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=10
innodb_lru_scan_depth=4000
innodb_adaptive_hash_index=0
innodb_change_buffering=none #can be inserts, workload-specific
optimizer_switch="index_condition_pushdown=off" #workload-specific

MySQL SysBench parameters.

LD_PRELOAD=/data/sveta/5.7.14/lib/mysql/libjemalloc.so
 /data/sveta/sbkk/bin/sysbench
 [ --test=/data/sveta/sysbench/sysbench/tests/db/oltp_prepared.lua | --test=/data/sveta/sysbench/sysbench/tests/db/oltp_simple_prepared.lua ]
 --db-driver=mysql --oltp-tables-count=8 --oltp-table-size=10000000
--mysql-table-engine=innodb --mysql-user=msandbox --mysql-password=msandbox
 --mysql-socket=/tmp/mysql_sandbox5715.sock
--num-threads=$i --max-requests=0 --max-time=300
--percentile=0 [--oltp-read-only=on --oltp-skip-trx=on]

PostgreSQL pgbench parameters:

$ git clone https://github.com/postgrespro/pg_oltp_bench.git
$ cd pg_oltp_bench
$ make USE_PGXS=1
$ sudo make USE_PGXS=1 install
$ psql DB -f oltp_init.sql
$ psql DB -c "CREATE EXTENSION pg_oltp_bench;"
$ pgbench -c 100 -j 100 -M prepared -f oltp_ro.sql -T 300 -P 1 DB
$ pgbench -c 100 -j 100 -M prepared -f oltp_rw.sql -T 300 -P 1 DB

Features in MySQL 5.7 that significantly improved performance:

Anastasia: The initial findings of this research were announced at Percona Live Amsterdam 2016. More findings were added to the second version of the same talk given at Moscow HighLoad++ 2016. Hopefully the third iteration of this talk will be available at Percona Live Open Source Database Conference 2017 in Santa Clara. Stay tuned: the Percona Live Committee is working on the program!

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

Save

May
13
2016
--

Benchmark MongoDB with sysbench

Benchmark MongoDB with sysbench

Benchmark MongoDB with sysbenchIn this blog post, we’ll discuss how to benchmark MongoDB with sysbench.

In an earlier post, I mentioned our use of sysbench-mongodb (via this fork) to run benchmarks of MongoDB servers. I now want to share our work extending sysbench to make it work with MongoDB.

If you’re not familiar with sysbench, it’s a great project developed by Alexey Kopytov that lets you run different types of benchmarks (referred to as “tests” by the tool), including database benchmarks. The database tests are implemented in Lua scripts, which means you can customize them as needed (or even write new ones from scratch) – something useful for simulating specific workloads.

All of the database tests in sysbench assume an SQL-based database, so instead of trying to shoehorn MongoDB tests into this framework I modified the connect/disconnect functions to handle MongoDB, and then implemented new functions specific for this database.

You can find the work (which is still in progress but usable, and in fact currently used by us in benchmarks) on the dev-mongodb-support-1.0 branch of our sysbench fork.

To use it, you just need to specify the –mongo-url argument (others too, as needed, but this is the one that must be present for sysbench to detect a MongoDB test is requested), and then provide the path to the Lua script you want to run. The following is an example:

sysbench
--mongo-write-concern=1
--mongo-url="mongodb://localhost"
--mongo-database-name=sbtest
--test=sysbench/sysbench/tests/mongodb/oltp.lua
--oltp_table_size=60000000
--oltp_tables_count=16
--num-threads=512
--rand-type=pareto
--report-interval=10
--max-requests=0
--max-time=600
--oltp-point-selects=10
--oltp-simple-ranges=1
--oltp-sum-ranges=1
--oltp-order-ranges=1
--oltp-distinct-ranges=1
--oltp-index-updates=1
--oltp-non-index-updates=1
--oltp-inserts=1 run

To build this branch, you’ll first need to build and install (or otherwise obtain) the mongo-c-driver project, as that is what we use to connect to MongoDB. Once that’s done, building is just a matter of running the following commands from the repo’s root:

./autogen.sh
./configure
make
sudo make install #optionally

The changes should not affect the other database tests in sysbench, though I have only verified that the MySQL ones continue to work.

Right now, the workload from sysbench-mongodb is implemented in Lua scripts (oltp.lua), and work is in progress to allow freeform operations to be created with new Lua scripts (by providing functions that take JSON as the argument). As an alternative, you may want to check out this much-less-tested (and currently unstable) branch based on luamongo. It already supports the creation of arbitrary workloads in Lua. In this case, you also need to build luamongo, which is included.

With either branch, you can add new tests by implementing new Lua scripts (though the dev-mongodb-support-1.0 branch still needs a few functions implemented on the C side to support arbitrary operations from the Lua side).

We think there are still some types of operations needed to improve sysbench’s usefulness for MongoDB, such as queries involving arrays, union, the $in operator, geospatial operators, and in place updates.

We hope you find this useful, and we welcome suggestions and bug reports to improve it.

Happy benchmarking!

Aug
07
2015
--

The MySQL query cache: Worst enemy or best friend?

During the last couple of months I have been involved in an unusually high amount of performance audits for e-commerce applications running with Magento. And although the systems were quite different, they also had one thing in common: the MySQL query cache was very useful. That was counter-intuitive for me as I’ve always expected the query cache to be such a bottleneck that response time is better when the query cache is turned off no matter what. That lead me to run a few experiments to better understand when the query cache can be helpful.

Some context

The query cache is well known for its contentions: a global mutex has to be acquired for any read or write operation, which means that any access is serialized. This was not an issue 15 years ago, but with today’s multi-core servers, such serialization is the best way to kill performance.

However from a performance point of view, any query cache hit is served in a few tens of microseconds while the fastest access with InnoDB (primary lookup) still requires several hundreds of microseconds. Yes, the query cache is at least an order of magnitude faster than any query that goes to InnoDB.

A simple test

To better understand how good or bad the query cache can be, I set up a very simple benchmark:

  • 1M records were inserted in 16 tables.
  • A moderate write load (65 updates/s) was run with a modified version of the update_index.lua sysbench script (see the end of the post for the code).
  • The select.lua sysbench script was run, with several values for the --num-threads option.

Note that the test is designed to be unfavorable to the query cache as the whole dataset fits in the buffer pool and the SELECT statements are very simple. Also note that I configured the query cache to be large enough so that no entry was evicted from the cache due to low memory.

Results – MySQL query cache ON

First here are the results when the query cache is enabled:

qcache_on

This configuration scales well up to 4 concurrent threads, but then the throughput degrades very quickly. With 10 concurrent threads, SHOW PROCESSLIST is enough to show you that all threads spend all their time waiting for the query cache mutex. Okay, this is not a surprise.

Results – MySQL query cache OFF

When the query cache is disabled, this is another story:

qcache_off

Throughput scales well up to somewhere between 10 and 20 threads (for the record the server I was using had 16 cores). But more importantly, even at the higher concurrencies, the overall throughput continued to increase: at 20 concurrent threads, MySQL was able to serve nearly 3x more queries without the query cache.

Conclusion

With Magento, you can expect to have a light write workload, very low concurrency and also quite complex SELECT statements. Given the results of our simple benchmarks, it is finally not that surprising that the MySQL query cache is a good fit in this case.

It is also worth noting that many applications run a database workload where writes are light and concurrency is low: the query cache should then not be discarded immediately. And maybe it is time for Oracle to make plans to improve the query cache as suggested by Peter a few years ago?

Annex: sysbench commands

# Modified update_index.lua
function event(thread_id)
   local table_name
   table_name = "sbtest".. sb_rand_uniform(1, oltp_tables_count)
   rs = db_query("UPDATE ".. table_name .." SET k=k+1 WHERE id=" .. sb_rand(1, oltp_table_size))
   db_query("SELECT SLEEP(0.015)")
end

# Populate the tables
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-table-size=1000000 --oltp-tables-count=16 --num-threads=16 --test=/usr/share/doc/sysbench/tests/db/insert.lua prepare
# Write workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/update_index.lua --max-requests=1000000 run
# Read workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/select.lua --max-requests=10000000 run

The post The MySQL query cache: Worst enemy or best friend? appeared first on MySQL Performance Blog.

Jun
19
2015
--

Q&A: High availability when using MySQL in the cloud

Percona MySQL webinar followup: Q&ALast week I hosted a webinar on using MySQL in the cloud for High Availability (HA) alongside 451 Research analyst Jason Stamper. You can watch the recording and also download the slides (free) here. Just click the “Register” button at the end of that page.

We had several excellent questions and we didn’t have time to get to several of them in the allotted time. I’m posting them here along with the answers. Feel free to ask follow-up questions in the comments below.


Q: Can the TokuDB engine be used in a PXC environment?

A: No, TokuDB cannot currently be used in a PXC environment, the only supported engine in Percona XtraDB Cluster 5.6 is InnoDB.

Q: With Galera replication (PXC), is balancing the load on each node?

A: No, you need to implement your own load balancing and HA layer between your clients and the Percona XtraDB Cluster server.  Examples mentioned in the webinar include HAProxy and F5 BigIP.

Q: What’s the best version of Percona XtraDB Cluster regarding InnoDB performance?

A: In general for best performance you should be using the latest release of Percona XtraDB Cluster 5.6, which is currently 5.6.24, released on June 3rd, 2015.

Q: Can I redirect my writes in Percona XtraDB Cluster to multiple nodes using the HAProxy? While trying with SysBench I can see write-only goes to first nodes in PXC while reads does goes to multiple nodes.

A: Yes you can configure HAProxy to distribute both reads and writes across all of your nodes in a Percona XtraDB Cluster environment. Perhaps SysBench created only one database connection for all writes, and so haproxy kept those confined to only one host. You may want to experiment with parallel_prepare.lua.

Q: What’s the optimal HA for small datasets (db is less than 10gb)?

A: The optimal HA deployment for small datasets would be dependent on your level of recovery required (tolerance for loss of transactions) and time that you can be in an unavailable state (seconds, minutes, hours?).  Unfortunately there isn’t a single answer to your question, however, if you are interested in further discussion on this point Percona would be happy to coordinate a time to speak.  Please feel free to contact me directly and we can continue the conversation at michael.coburn@percona.com.

 Q: Is there a concept of local master vs. remote master with PXC?

A: No there is no concept of local vs remote master.  All nodes in a Percona XtraDB Cluster can now be classified as Master, regardless of their proximity to the clients.

Q: Are there any concerns when considering AWS RDS or AURORA DB for MySQL HA in the Cloud?

A: Regarding AWS RDS, yes this a good option for MySQL HA in the Cloud.  I unfortunately haven’t worked with Aurora DB that much yet so I don’t have an opinion on it’s suitability for HA in the Cloud.

Q: We tried out PXC awhile back and it used to lock everything whenever any ddl was done. Has that changed?

A: We would have to look at the specifics of your environment, however, there have been numerous improvements in the 1½ years of development since Percona XtraDB Cluster went Generally Available (GA) on January 30th, 2014 in version 5.6.15.

Q: Is using the arbitrator a must?

A: No the arbitrator role via the garbd daemon is generally only used when operating in a minimal environment of two nodes that contain the data and you need a third node for quorum but don’t want to store the data a third time.

Q: Can we do a cluster across different zones?

A: Yes you can. However be aware that the latency incurred for all cluster certification operations will be impacted by the round trip time between nodes.

Q: Does PXC also support the MyISAM database?

A: No, Percona XtraDB Cluster does not support any storage engine other than InnoDB as of PXC 5.6.

Q: How do load balancers affect the throughput in a Galera-based setup given that the write would be limited by the slowest node?

A: Load balancers will introduce some measure of additional latency in the form of CPU time in the load balancer layer as it evaluates its own ruleset, and also in network time due to additional hop via load balancer.  Otherwise there should be no perceptible difference in the write throughput of a Percona XtraDB Cluster with and without a load balancer as it relates to the “slowest node” factor.

Q: Have you used MaxScale yet? If so, what are your thoughts?

A: Unfortunately I haven’t used MaxScale however Yves Trudeau, Percona Principal Architect, has recently written about MaxScale in this blog post.

Q: How do you configure timeout and maintain persistent connection to HAProxy?

A: I would encourage you to refer to the HAProxy Documentation.

The post Q&A: High availability when using MySQL in the cloud appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com