Aug
07
2018
--

Resource Usage Improvements in Percona Monitoring and Management 1.13

PMM 1.13 reduces CPU usage by 5x

In Percona Monitoring and Management (PMM) 1.13 we have adopted Prometheus 2, and with this comes a dramatic improvement in resource usage, along with performance improvements!

What does it mean for you? This means you can have a significantly larger number of servers and database instances monitored by the same PMM installation. Or you can reduce the instance size you use to monitor your environment and save some money.

Let’s look at some stats!

CPU Usage

PMM 1.13 reduction in CPU usage by 5x

Percona Monitoring and Management 1.13 reduction in CPU usage by 8x after adopting Prometheus 2

We can see approximately 5x and 8x reductions in CPU usage on these two PMM Servers. Depending on the workload, we see CPU usage reductions ranging from 3x to 10x.

Disk Writes

There is also less disk write bandwidth required:

PMM 1.13 reduction in disk write bandwidth

On this instance, the write bandwidth reduction is “just” 1.5x. Note that this is disk IO for the entire PMM system, which includes more than the Prometheus component alone. Prometheus 2 itself promises a much more significant IO bandwidth reduction, according to the official benchmarks.

According to the same benchmark, you should expect a 33-50% reduction in disk space usage for Prometheus 2 versus Prometheus 1.8. The reduction will be smaller for Percona Monitoring and Management, as it also stores Query Statistics outside of Prometheus.

Resource usage on the monitored hosts

Also, resource usage on the monitored hosts is significantly reduced:

Percona Monitoring and Management 1.13 reduction of resource usage by Prometheus 2

Why does CPU usage go down on a monitored host with a Prometheus 2 upgrade? This is because PMM uses TLS for the Prometheus to monitored host communication. Before Prometheus 2, a full handshake was performed for every scrape, taking a lot of CPU time. This was optimized with Prometheus 2, resulting in a dramatic CPU usage decrease.

Query performance is also a lot better with Prometheus 2, meaning dashboards load visibly faster, though we did not run specific benchmarks to share hard numbers. Note that this improvement only applies when you query data stored in Prometheus 2.

If you query data that was originally stored in Prometheus 1.8, it goes through the “Remote Read” interface, which is quite a bit slower and uses a lot more CPU and memory resources.

If you love better efficiency and performance, consider upgrading to PMM 1.13!


Aug
01
2018
--

Saving With MyRocks in The Cloud

The main focus of a previous blog post was the performance of MyRocks when using fast SSD devices. However, I figured that MyRocks would be beneficial for use in cloud workloads, where storage is either slow or expensive.

In that earlier post, we demonstrated the benefits of MyRocks, especially for heavy IO workloads. Meanwhile, Mark wrote in his blog that the CPU overhead in MyRocks might be significant for CPU-bound workloads, but this should not be an issue for IO-bound workloads.

In the cloud the cost of resources is a major consideration. Let’s review the annual cost for the processing and storage resources.

Resource                  Resource cost, $/year   IO cost, $/year   Total, $/year
c5.9xlarge                7881                    –                 7881
1TB io1, 5000 IOPS        1500                    3900              5400
1TB io1, 10000 IOPS       1500                    7800              9300
1TB io1, 15000 IOPS       1500                    11700             13200
1TB io1, 20000 IOPS       1500                    15600             17100
1TB io1, 30000 IOPS       1500                    23400             24900
3.4TB gp2 (10000 IOPS)    4800                    –                 4800

 

The scenario

The server version is Percona Server 5.7.22

For instances, I used c5.9xlarge instances. The reason for c5 was that it provides high performance Nitro virtualization: Brendan Gregg describes this in his blog post. The rationale for 9xlarge instances was to be able to utilize io1 volumes with a 30000 IOPS throughput – smaller instances will cap io1 throughput at a lower level.

I also used a huge gp2 volume of 3400GB, as at this size it provides a guaranteed 10000 IOPS even without using io1. This is a cheaper alternative to io1 volumes for achieving 10000 IOPS.

For the workload I used sysbench-tpcc 5000W (50 tables * 100W each), which for InnoDB resulted in about 471GB of used storage space.

For the cache, I used buffer pool sizes of 27GB and 54GB, so the workload is IO-heavy.

I wanted to compare how InnoDB and RocksDB performed under this scenario.

If you are curious I prepared my terraform+ansible deployment files here: https://github.com/vadimtk/terraform-ansible-percona

Before jumping to the results, I should note that for MyRocks I used LZ4 compression for all levels, which resulted in a final data size of 91GB. That is five times smaller than the InnoDB size. This alone provides operational benefits: for example, copying the InnoDB files (471GB) from a backup volume takes longer than an hour, while for MyRocks it is about five times faster.
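
If you want to check the per-engine footprint yourself, the server reports approximate sizes in information_schema. The sketch below is only an estimate for both engines, and the schema name sbtest is a placeholder for whatever database holds the sysbench-tpcc tables:

-- Approximate data + index size per storage engine (rough estimates for both InnoDB and MyRocks)
SELECT engine,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'sbtest'  -- placeholder schema name
GROUP BY engine;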

The benchmark results

So let’s review the results.

InnoDB versus MyRocks throughput in the cloud

Or presenting average throughput in a tabular form:

cache size, GB   IOPS   engine   avg TPS
27 5000 innodb 132.66
27 5000 rocksdb 481.03
27 10000 innodb 285.93
27 10000 rocksdb 1224.14
27 10000gp2 innodb 227.19
27 10000gp2 rocksdb 1268.89
27 15000 innodb 436.04
27 15000 rocksdb 1839.66
27 20000 innodb 584.69
27 20000 rocksdb 2336.94
27 30000 innodb 753.86
27 30000 rocksdb 2508.97
54 5000 innodb 197.51
54 5000 rocksdb 667.63
54 10000 innodb 433.99
54 10000 rocksdb 1600.01
54 10000gp2 innodb 326.12
54 10000gp2 rocksdb 1559.98
54 15000 innodb 661.34
54 15000 rocksdb 2176.83
54 20000 innodb 888.74
54 20000 rocksdb 2506.22
54 30000 innodb 1097.31
54 30000 rocksdb 2690.91

 

We can see that MyRocks outperformed InnoDB in every single combination, but it is also important to note the following:

MyRocks on io1 5000 IOPS showed the same performance that InnoDB showed on io1 15000 IOPS.

That means InnoDB requires three times more storage throughput which, looking at the storage cost, corresponds to storage that is three times more expensive. Given that MyRocks also requires less storage capacity, it is possible to save even more.

On the most economical storage (3400GB gp2, which will provide 10000 IOPS) MyRocks showed 4.7 times better throughput.

For the 30000 IOPS storage, MyRocks was still better by 2.45 times.

However, it is worth noting that MyRocks showed a greater variance in throughput during the runs. Let’s review the charts with 1-second resolution for gp2 and io1 30000 IOPS storage:

Throughput at 1-second resolution for gp2 and io1 30000 IOPS storage, MyRocks versus InnoDB

Such variance might be problematic for workloads that require stable throughput and where periodical slowdowns are unacceptable.

Conclusion

MyRocks is suitable and beneficial not only for fast SSDs, but also for cloud deployments. By requiring fewer IOPS, MyRocks can provide better performance and save on storage costs.

However, before evaluating MyRocks, make sure that your workload is IO-bound, i.e. the working set is much bigger than the available memory. For CPU-intensive workloads (where the working set fits into memory), MyRocks will be less beneficial, or may even perform worse than InnoDB (as described in the blog post A Look at MyRocks Performance).


Jun
18
2018
--

Webinar Tues 19/6: MySQL: Scaling and High Availability – Production Experience from the Last Decade(s)

scale high availability

Please join Percona’s CEO, Peter Zaitsev as he presents MySQL: Scaling and High Availability – Production Experience Over the Last Decade(s) on Tuesday, June 19th, 2018 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

Percona is known as the MySQL performance experts. With over 4,000 customers, we’ve studied, mastered and executed many different ways of scaling applications. Percona can help ensure your application is highly available. Come learn from our playbook, and leave this talk knowing your MySQL database will run faster and more optimized than before.

Register Now

About Peter Zaitsev, CEO

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads. Peter lives in North Carolina with his wife and two children. In his spare time, Peter enjoys travel and spending time outdoors.


May
07
2018
--

Webinar Wednesday, May 9, 2018: MySQL Troubleshooting and Performance Optimization with Percona Monitoring and Management (PMM)

MySQL Troubleshooting

Please join Percona’s CEO, Peter Zaitsev, as he presents MySQL Troubleshooting and Performance Optimization with PMM on Wednesday, May 9, 2018, at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

Optimizing MySQL performance and troubleshooting MySQL problems are two of the most critical and challenging tasks for MySQL DBAs. The databases powering your applications must handle heavy traffic loads while remaining responsive and stable so that you can deliver an excellent user experience. Further, DBAs’ bosses expect solutions that are cost-efficient.

In this webinar, Peter discusses how you can optimize and troubleshoot MySQL performance and demonstrates how Percona Monitoring and Management (PMM) enables you to solve these challenges using free and open source software. We will look at specific, common MySQL problems and review the essential components in PMM that allow you to diagnose and resolve them.

Register for the webinar now.

Peter Zaitsev, CEO

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. The Inc. 5000 recognized Percona in 2013, 2014, 2015 and 2016. Peter was an early employee at MySQL AB, eventually leading the company’s High-Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High-Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He was also tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization is one of percona.com’s most popular downloads.


Jan
18
2018
--

Does the Meltdown Fix Affect Performance for MySQL on Bare Metal?

Meltdown Fix Affect Performance

In this blog post, we’ll look at whether the Meltdown fix affects performance for MySQL on bare metal servers.

Since the news about the Meltdown bug broke, there have been many reports on the performance hit from the proposed fixes. We have looked at how the fix affects MySQL (Percona Server for MySQL) under a sysbench workload.

In this case, we used bare metal boxes with the following specifications:

  • Two-socket Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz (in total 56 entries in /proc/cpuinfo)
  • Ubuntu 16.04
  • Memory: 256GB
  • Storage: Samsung SM863 1.9TB SATA SSD
  • Percona Server for MySQL 5.7.20
  • Kernel (vulnerable) 4.13.0-21
  • Kernel (with Meltdown fix) 4.13.0-25

Please note, the current kernel for Ubuntu 16.04 contains only a Meltdown fix, and not one for Spectre.

We performed the validation with the https://github.com/speed47/spectre-meltdown-checker tool. The database size is 100GB, in a sysbench workload with 100 tables of 4 million rows each, using a Pareto distribution.

We used both a Unix socket connection and a TCP host connection to measure possible overhead from the TCP network connection. We also performed read-write and read-only benchmarks.

The results for various numbers of threads are below:

Meltdown Fix Affect Performance

Where

  • Nokpti: kernel without KPTI patch (4.13.0-21)
  • Pti: kernel with KPTI patch (4.13.0-25), with PTI enabled
  • Nopti: kernel with KPTI patch (4.13.0-25), with PTI disabled

 

#  testname  bp  socket  threads  pti  nopti  nokpti  nopti_pct  pti_pct
1 OLTP_RO in-memory tcp_socket 1 709.93 718.47 699.50 -2.64 -1.47
4 OLTP_RO in-memory tcp_socket 8 5473.05 5500.08 5483.40 -0.30 0.19
3 OLTP_RO in-memory tcp_socket 64 21716.18 22036.98 21548.46 -2.22 -0.77
2 OLTP_RO in-memory tcp_socket 128 21606.02 22010.36 21548.62 -2.10 -0.27
 5 OLTP_RO in-memory unix_socket 1 750.41 759.33 776.88 2.31 3.53
8 OLTP_RO in-memory unix_socket 8 5851.80 5896.86 5986.89 1.53 2.31
7 OLTP_RO in-memory unix_socket 64 23052.10 23552.26 23191.48 -1.53 0.60
6 OLTP_RO in-memory unix_socket 128 23215.38 23602.64 23146.42 -1.93 -0.30
9 OLTP_RO io-bound tcp_socket 1 364.03 369.68 370.51 0.22 1.78
12 OLTP_RO io-bound tcp_socket 8 3205.05 3225.21 3210.63 -0.45 0.17
11 OLTP_RO io-bound tcp_socket 64 15324.66 15456.44 15364.25 -0.60 0.26
10 OLTP_RO io-bound tcp_socket 128 17705.29 18007.45 17748.70 -1.44 0.25
13 OLTP_RO io-bound unix_socket 1 421.74 430.10 432.88 0.65 2.64
16 OLTP_RO io-bound unix_socket 8 3322.19 3367.46 3367.34 -0.00 1.36
15 OLTP_RO io-bound unix_socket 64 15977.28 16186.59 16248.42 0.38 1.70
14 OLTP_RO io-bound unix_socket 128 18729.10 19111.55 18962.02 -0.78 1.24
17 OLTP_RW in-memory tcp_socket 1 490.76 495.21 489.49 -1.16 -0.26
20 OLTP_RW in-memory tcp_socket 8 3445.66 3459.16 3414.36 -1.30 -0.91
19 OLTP_RW in-memory tcp_socket 64 11165.77 11167.44 10861.44 -2.74 -2.73
18 OLTP_RW in-memory tcp_socket 128 12176.96 12226.17 12204.85 -0.17 0.23
21 OLTP_RW in-memory unix_socket 1 530.08 534.98 540.27 0.99 1.92
24 OLTP_RW in-memory unix_socket 8 3734.93 3757.98 3772.17 0.38 1.00
23 OLTP_RW in-memory unix_socket 64 12042.27 12160.86 12138.01 -0.19 0.80
22 OLTP_RW in-memory unix_socket 128 12930.34 12939.02 12844.78 -0.73 -0.66
25 OLTP_RW io-bound tcp_socket 1 268.08 270.51 270.71 0.07 0.98
28 OLTP_RW io-bound tcp_socket 8 1585.39 1589.30 1557.58 -2.00 -1.75
27 OLTP_RW io-bound tcp_socket 64 4828.30 4782.42 4620.57 -3.38 -4.30
26 OLTP_RW io-bound tcp_socket 128 5158.66 5172.82 5321.03 2.87 3.15
29 OLTP_RW io-bound unix_socket 1 280.54 282.06 282.35 0.10 0.65
32 OLTP_RW io-bound unix_socket 8 1582.69 1584.52 1601.26 1.06 1.17
31 OLTP_RW io-bound unix_socket 64 4519.45 4485.72 4515.28 0.66 -0.09
30 OLTP_RW io-bound unix_socket 128 5524.28 5460.03 5275.53 -3.38 -4.50

 

As you can see, there is very little difference between the runs (within a 3-4% range), which fits into the variance during the test.

Similar experiments were done on different servers and workloads. There, too, we saw a negligible difference that fits within measurement variance.

Overhead analysis

To understand why we do not see much effect in MySQL (InnoDB workloads), let’s take a look at where we would expect to see overhead from the proposed fix.

The main overhead is expected from system calls, so let’s test syscall execution on the kernel before and after the fix (thanks to Alexey Kopytov for the idea of how to test this with sysbench).

We will use the following script syscall.lua:

-- syscall.lua: each sysbench event issues 10000 raw system calls
local ffi = require("ffi")  -- LuaJIT FFI, available in sysbench Lua scripts

ffi.cdef[[long syscall(long, long, long, long);]]

function event()
  for i = 1, 10000 do
    ffi.C.syscall(0, 0, 0, 0)
  end
end

Basically, we measure the time for executing 10000 system calls (this will be one event).

To run the benchmark:

sysbench syscall.lua --time=60 --report-interval=1 run

 

And the results are as follows:

  • On the kernel without the fix (4.13.0-21): 455 events/sec
  • On the kernel with the fix (4.13.0-26): 250 events/sec

This means that time to execute 10000 system calls increased from 2.197ms to 4ms.

While this increase looks significant, it does not have much effect on MySQL (the InnoDB engine), where most system calls are made for IO or network communication.

We can assume that executing 10000 IO events on fast storage takes about 1000ms, so adding an extra 2ms for the system calls corresponds to about 0.2% overhead (which is practically invisible in MySQL workloads).

I expect the effect to be much more visible with MyISAM tables cached in OS memory, since in that case accessing data in memory still goes through system calls.

Conclusion:

From our results, we do not see a measurable effect from KPTI patches (to mitigate the Meltdown vulnerability) running on bare metal servers with Ubuntu 16.04 and 4.13 kernel series.

Reference commands and configs:

sysbench oltp_read_only.lua   {--mysql-socket=/tmp/mysql.sock|--mysql-host=127.0.0.1} --mysql-user=root
--mysql-db=sbtest100t4M --rand-type=pareto  --tables=100  --table-size=4000000 --num-threads=$threads --report-interval=1
--max-time=180 --max-requests=0  run

RW:

sysbench oltp_read_write.lua   {--mysql-socket=/tmp/mysql.sock|--mysql-host=127.0.0.1} --mysql-user=root
--mysql-db=sbtest100t4M --rand-type=pareto  --tables=100  --table-size=4000000 --num-threads=$threads --report-interval=1
--max-time=180 --max-requests=0  run

mysqld:
Percona Server 5.7.20-19

numactl --physcpubind=all --interleave=all   /usr/bin/env LD_PRELOAD=/data/opt/alexey.s/bin64_5720.ps/lib/mysql/libjemalloc.so.1 ./bin/mysqld
--defaults-file=/data/opt/alexey.s/my-perf57.cnf --basedir=. --datadir=/data/sam/sbtest100t4M   --user=root  --innodb_flush_log_at_trx_commit=1
--innodb-buffer-pool-size=150GB --innodb-log-file-size=10G --innodb-buffer-pool-instances=8  --innodb-io-capacity-max=20000
--innodb-io-capacity=10000 --loose-innodb-page-cleaners=8 --ssl=0

My.cnf file:

[mysqld]
user=root
port=3306
innodb_status_file=0
innodb_data_file_path=ibdata1:100M:autoextend
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = true
innodb_log_buffer_size = 128M
innodb_log_file_size = 10G
innodb_log_files_in_group = 2
innodb_write_io_threads=8
innodb_read_io_threads=8
innodb_io_capacity=15000
innodb_io_capacity_max=25000
innodb_lru_scan_depth=8192
#innodb_buffer_pool_size=${BP}G
innodb_doublewrite=1
innodb_thread_concurrency=0
innodb-checksum-algorithm=strict_crc32
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_purge_threads=8
loose-innodb-page-cleaners=8
innodb_buffer_pool_instances=8
innodb_change_buffering=none
innodb_adaptive_hash_index=OFF
sync_binlog=0
max_connections=5000
table_open_cache=5000
query_cache_type=OFF
thread_cache_size=16
back_log=2000
connect_timeout=15
skip-grant-tables
sort_buffer_size=262144
key_buffer_size=8388608
join_buffer_size=262144
server-id=1
max_connections=50000
skip_name_resolve=ON
max_prepared_stmt_count=1048560
performance_schema=OFF
performance-schema-instrument='wait/synch/%=ON'
innodb_monitor_enable=all
innodb_flush_neighbors=0
metadata_locks_hash_instances=256
table_open_cache_instances=64

Oct
18
2017
--

How to Choose the MySQL innodb_log_file_size

innodb_log_file_size

In this blog post, I’ll provide some guidance on how to choose the MySQL innodb_log_file_size.

Like many database management systems, MySQL uses logs to achieve data durability (when using the default InnoDB storage engine). This ensures that when a transaction is committed, data is not lost in the event of crash or power loss.

MySQL’s InnoDB storage engine uses a fixed-size (circular) Redo log space. The size is controlled by innodb_log_file_size and innodb_log_files_in_group (default 2). You multiply those values to get the total Redo log space that is available to use. While technically it shouldn’t matter whether you change innodb_log_file_size or innodb_log_files_in_group to control the Redo space size, most people just work with innodb_log_file_size and leave innodb_log_files_in_group alone.
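
As a quick check, you can ask the server how much total Redo space the current settings provide; a minimal sketch using the two variables mentioned above:

-- Current redo log settings and the total redo space they provide, in MB
SELECT @@innodb_log_file_size / 1024 / 1024 AS log_file_size_mb,
       @@innodb_log_files_in_group AS log_files_in_group,
       @@innodb_log_file_size * @@innodb_log_files_in_group / 1024 / 1024 AS total_redo_space_mb;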

Configuring InnoDB’s Redo space size is one of the most important configuration options for write-intensive workloads, but it comes with trade-offs. The more Redo space you configure, the better InnoDB can optimize write IO. However, more Redo space also means longer recovery times when the system loses power or crashes for other reasons.

It is not easy or straightforward to predict how much time crash recovery takes for a specific innodb_log_file_size value – it depends on the hardware, the MySQL version and the workload. It can vary widely (a 10x difference or more, depending on the circumstances). However, around five minutes per 1GB of innodb_log_file_size is a decent ballpark number. If this is really important for your environment, I would recommend testing it by simulating a system crash under full load (after the database has completely warmed up).

While recovery time can be a guideline for the limit of the InnoDB Log File size, there are a couple of other ways you can look at this number – especially if you have Percona Monitoring and Management installed.

Check Percona Monitoring and Management’s “MySQL InnoDB Metrics” Dashboard. If you see a graph like this:

innodb_log_file_size

where Uncheckpointed Bytes is pushing very close to the Max Checkpoint Age, you can almost be sure your current innodb_log_file_size is limiting your system’s performance. Increasing it can provide substantial performance improvements.

If you see something like this instead:

innodb_log_file_size 2

where the number of Uncheckpointed Bytes is well below the Max Checkpoint Age, then increasing the log file size won’t give you a significant improvement.

Note: many MySQL settings are interconnected. While a specific log file size might be good enough for smaller innodb_buffer_pool_size, larger InnoDB Buffer Pool values might warrant larger log files for optimal performance.

Another thing to keep in mind: the recovery time we spoke about earlier really depends on the Uncheckpointed Bytes rather than the total log file size. If you do not see recovery time increasing with a larger innodb_log_file_size, check out the InnoDB Checkpoint Age graph – it might be that you just can’t fully utilize large log files with your workload and configuration.

Another way to look at the log file size is in context of log space usage:

innodb_log_file_size 3

This graph shows the amount of Data Written to the InnoDB log files per hour, as well as the total size of the InnoDB log files. In the graph above, we have 2GB of log space and some 12GB written to the Log files per hour. This means we cycle through logs every ten minutes.
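
If you do not have these PMM graphs at hand, you can estimate the same log cycle time manually by sampling the Innodb_os_log_written status counter; a rough sketch:

-- Bytes written to the redo log since server start
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
-- ... wait 60 seconds, then sample again ...
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
-- (second value - first value) * 60 = bytes written per hour;
-- total redo space / hourly write rate = approximate log cycle time in hours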

InnoDB has to flush every dirty page in the buffer pool at least once per log file cycle time.

InnoDB gets better performance when it does that less frequently, and there is less wear and tear on SSD devices. I like to see this number at no less than 15 minutes. One hour is even better.  

Summary

Getting the innodb_log_file_size right is important to achieve a balance between reasonably fast crash recovery time and good system performance. Remember, meeting your recovery time objective is not as trivial as you might imagine. I hope the techniques described in this post help you find the optimal value for your situation!

Jul
21
2017
--

Faster Node Rejoins with Improved IST performance

IST Performance

In this blog, we’ll look at how improvements to Percona XtraDB Cluster improved IST performance.

Introduction

Starting with version 5.7.17-29.20, Percona XtraDB Cluster significantly improved performance. Depending on the workload, the increase in throughput is in the range of 3-10x (more details here). These optimization fixes also helped improve IST (Incremental State Transfer) performance. This blog is aimed at studying the IST impact.

IST

IST stands for Incremental State Transfer. When a node leaves the cluster for a short period of time and then rejoins, it needs to catch up with the cluster state. As part of this sync process, an existing node of the cluster (the DONOR) donates the missing write-sets to the rejoining node (the JOINER). In short, the flow involves applying the missing write-sets on the JOINER, just as during active workload replication.

Percona XtraDB Cluster / Galera already can apply write-sets in parallel using multiple applier threads. Unfortunately, due to commit contention, the commit action was serialized. This was fixed in the above Percona XtraDB Cluster release, allowing commits to proceed in parallel.
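
The number of applier threads is controlled by the wsrep_slave_threads variable in Percona XtraDB Cluster 5.7. A minimal sketch of checking and changing it is below; the value 16 is purely illustrative, and for IST you would normally set it in my.cnf so that it is already in effect when the joiner starts:

-- Check the current number of parallel applier threads
SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';
-- Illustrative value only; tune it to your core count and workload
SET GLOBAL wsrep_slave_threads = 16;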

IST uses the same path for applying write-sets, except that it is more like a batch operation.

IST Performance

Let’s look at IST performance before and now.

Setup

  1. Two node cluster (node-1 and node-2) and gcache is configured large enough to avoid purging as we need IST
  2. Start workload against node-1 for 30 seconds
  3. Shutdown node-2
  4. Start workload that performs 4M requests against node-1. Workload produces ~3.5M write-sets that are cached in gcache and used later for IST
  5. Start node-2 with N-applier threads
  6. Wait until IST is done
  7. Repeat steps 3-6 with different values of N.

Observations:

  • IST is 4x faster with PXC-5.7.17 (compared to previous releases)
  • Improved performance means a quicker node rejoin, and an overall increase in cluster productivity as the joiner node is available to process the workload more quickly

Conclusion

Percona XtraDB Cluster 5.7.17 significantly improved IST performance. A faster node rejoin effectively means better cluster productivity and more flexibility in planning maintenance windows. So what are you waiting for? Upgrade to Percona XtraDB Cluster 5.7.17 or the latest Percona XtraDB Cluster 5.7 release and experience the power!

Jun
22
2017
--

ClickHouse in a General Analytical Workload (Based on a Star Schema Benchmark)

ClickHouse

In this blog post, we’ll look at how ClickHouse performs in a general analytical workload, using the Star Schema Benchmark test.

We have mentioned ClickHouse in some recent posts (ClickHouse: New Open Source Columnar Database, Column Store Database Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark), where it showed excellent results. ClickHouse by itself seems to be an event-oriented RDBMS, as its name suggests (clicks). Its primary use case, Yandex Metrica (a system similar to Google Analytics), also points to an event-based nature. We can also see there is a requirement for date-stamped columns.

It is possible, however, to use ClickHouse in a general analytical workload. This blog post shares my findings. For these tests, I used the Star Schema Benchmark, slightly modified so that it can handle ClickHouse specifics.

First, let’s talk about schemas. We need to adjust to ClickHouse data types. For example, the biggest fact table in SSB is “lineorder”. Below is how it is defined for Amazon RedShift (as taken from https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-create-test-data.html):

CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);

For ClickHouse, the table definition looks like this:

CREATE TABLE lineorderfull (
        LO_ORDERKEY             UInt32,
        LO_LINENUMBER           UInt8,
        LO_CUSTKEY              UInt32,
        LO_PARTKEY              UInt32,
        LO_SUPPKEY              UInt32,
        LO_ORDERDATE            Date,
        LO_ORDERPRIORITY        String,
        LO_SHIPPRIORITY         UInt8,
        LO_QUANTITY             UInt8,
        LO_EXTENDEDPRICE        UInt32,
        LO_ORDTOTALPRICE        UInt32,
        LO_DISCOUNT             UInt8,
        LO_REVENUE              UInt32,
        LO_SUPPLYCOST           UInt32,
        LO_TAX                  UInt8,
        LO_COMMITDATE           Date,
        LO_SHIPMODE             String
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER),8192);

From this we can see that we need to use datatypes like UInt8 and UInt32, which are somewhat unusual compared to common database datatypes.

The second table (RedShift definition):

CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

For ClickHouse, I defined it as:

CREATE TABLE customerfull (
        C_CUSTKEY       UInt32,
        C_NAME          String,
        C_ADDRESS       String,
        C_CITY          String,
        C_NATION        String,
        C_REGION        String,
        C_PHONE         String,
        C_MKTSEGMENT    String,
        C_FAKEDATE      Date
)Engine=MergeTree(C_FAKEDATE,(C_CUSTKEY),8192);

For reference, the full schema for the benchmark is here: https://github.com/vadimtk/ssb-clickhouse/blob/master/create.sql.

For this table, we need to define a rudimentary column C_FAKEDATE Date in order to use ClickHouse’s most advanced engine (MergeTree). I was told by the ClickHouse team that they plan to remove this limitation in the future.

To generate data acceptable by ClickHouse, I made modifications to ssb-dbgen. You can find my version here: https://github.com/vadimtk/ssb-dbgen. The most notable change is that ClickHouse can’t accept dates in CSV files formatted as “19971125”. It has to be “1997-11-25”. This is something to keep in mind when loading data into ClickHouse.

It is possible to do some preformatting during the load, but I don’t have experience with that. A common approach is to create a staging table with datatypes that match the loaded data, and then convert them using SQL functions when inserting into the main table.
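
As a rough sketch of that staging approach (the staging table below is hypothetical and reduced to two columns for brevity), you can keep the raw ‘19971125’-style strings and convert them with string functions on insert:

-- Hypothetical staging table: the date is kept as the raw string from the CSV
CREATE TABLE lineorder_staging (
        LO_ORDERKEY     UInt32,
        LO_ORDERDATE    String
) ENGINE = Log;

-- Convert '19971125' into a proper Date while inserting into the main table
INSERT INTO lineorderfull (LO_ORDERKEY, LO_ORDERDATE)
SELECT
    LO_ORDERKEY,
    toDate(concat(substring(LO_ORDERDATE, 1, 4), '-',
                  substring(LO_ORDERDATE, 5, 2), '-',
                  substring(LO_ORDERDATE, 7, 2)))
FROM lineorder_staging;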

Hardware Setup

One of the goals of this benchmark to see how ClickHouse scales on multiple nodes. I used a setup of one node, and then compared to a setup of three nodes. Each node is 24 cores of “Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz” CPUs, and the data is located on a very fast PCIe Flash storage.

For the SSB benchmark I use a scale factor of 2500, which provides (in raw data):

  • Table lineorder: 15 billion rows, raw size 1.7TB
  • Table customer: 75 million rows

When loaded into ClickHouse, the table lineorder takes 464GB, which corresponds to a 3.7x compression ratio.
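
If you want to verify the compressed size yourself, ClickHouse exposes per-part sizes in the system.parts table. A quick sketch (depending on the ClickHouse version, the size column may be named bytes or bytes_on_disk):

-- Compressed on-disk size of the table, summed over its active parts
SELECT
    table,
    formatReadableSize(sum(bytes)) AS size_on_disk
FROM system.parts
WHERE active AND table = 'lineorderfull'
GROUP BY table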

We compare a one-node (table names lineorderfull, customerfull) setup vs. a three-node (table names lineorderd, customerd) setup.

Single Table Operations

Query:

SELECT
    toYear(LO_ORDERDATE) AS yod,
    sum(LO_REVENUE)
FROM lineorderfull
GROUP BY yod

One node:

7 rows in set. Elapsed: 9.741 sec. Processed 15.00 billion rows, 90.00 GB (1.54 billion rows/s., 9.24 GB/s.)

Three nodes:

7 rows in set. Elapsed: 3.258 sec. Processed 15.00 billion rows, 90.00 GB (4.60 billion rows/s., 27.63 GB/s.)

We see a speed up of practically three times. Handling 4.6 billion rows/s is blazingly fast!

One Table with Filtering

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorderfull
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)

One node:

1 rows in set. Elapsed: 3.175 sec. Processed 2.28 billion rows, 18.20 GB (716.60 million rows/s., 5.73 GB/s.)

Three nodes:

1 rows in set. Elapsed: 1.295 sec. Processed 2.28 billion rows, 18.20 GB (1.76 billion rows/s., 14.06 GB/s.)

It’s worth mentioning that during the execution of this query, ClickHouse was able to use ALL 24 cores on each box. This confirms that ClickHouse is a massively parallel processing system.

Two Tables (Independent Subquery)

In this case, I want to show how ClickHouse handles independent subqueries:

SELECT sum(LO_REVENUE)
FROM lineorderfull
WHERE LO_CUSTKEY IN
(
    SELECT C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
    WHERE C_REGION = 'ASIA'
)

One node:

1 rows in set. Elapsed: 28.934 sec. Processed 15.00 billion rows, 120.00 GB (518.43 million rows/s., 4.15 GB/s.)

Three nodes:

1 rows in set. Elapsed: 14.189 sec. Processed 15.12 billion rows, 121.67 GB (1.07 billion rows/s., 8.57 GB/s.)

We do not see, however, a speedup close to 3x on three nodes, because of the data transfer required to match LO_CUSTKEY with C_CUSTKEY.

Two Tables JOIN

With a subquery using columns to return results, or for GROUP BY, things get more complicated. In this case we want to GROUP BY the column from the second table.

First, ClickHouse doesn’t support traditional subquery syntax, so we need to use JOIN. For JOINs, ClickHouse also strictly prescribes how they must be written (a limitation that is also expected to change in the future). Our JOIN should look like:

SELECT
    C_REGION,
    sum(LO_EXTENDEDPRICE * LO_DISCOUNT)
FROM lineorderfull
ANY INNER JOIN
(
    SELECT
        C_REGION,
        C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
) USING (LO_CUSTKEY)
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)
GROUP BY C_REGION

One node:

5 rows in set. Elapsed: 31.443 sec. Processed 2.35 billion rows, 28.79 GB (74.75 million rows/s., 915.65 MB/s.)

Three nodes:

5 rows in set. Elapsed: 25.160 sec. Processed 2.58 billion rows, 33.25 GB (102.36 million rows/s., 1.32 GB/s.)

In this case the speedup is not even two times. This is a consequence of the random data distribution across the tables lineorderd and customerd. Both tables were defined as:

CREATE TABLE lineorderd AS lineorder ENGINE = Distributed(3shards, default, lineorder, rand());
CREATE TABLE customerd AS customer ENGINE = Distributed(3shards, default, customer, rand());

Here rand() specifies that records are distributed randomly across the three nodes. When we perform a JOIN on LO_CUSTKEY=C_CUSTKEY, matching records might be located on different nodes. One way to deal with this is to co-locate the data by sharding on the join key. For example:

CREATE TABLE lineorderLD AS lineorderL ENGINE = Distributed(3shards, default, lineorderL, LO_CUSTKEY);
CREATE TABLE customerLD AS customerL ENGINE = Distributed(3shards, default, customerL, C_CUSTKEY);

Three Tables JOIN

This is where it becomes very complicated. Let’s consider the query that you would normally write:

SELECT sum(LO_REVENUE),P_MFGR, toYear(LO_ORDERDATE) yod FROM lineorderfull ,customerfull,partfull WHERE C_REGION = 'ASIA' and
LO_CUSTKEY=C_CUSTKEY and P_PARTKEY=LO_PARTKEY GROUP BY P_MFGR,yod ORDER BY P_MFGR,yod;

With ClickHouse’s limitations on JOIN syntax, the query becomes:

SELECT
    sum(LO_REVENUE),
    P_MFGR,
    toYear(LO_ORDERDATE) AS yod
FROM
(
    SELECT
        LO_PARTKEY,
        LO_ORDERDATE,
        LO_REVENUE
    FROM lineorderfull
    ALL INNER JOIN
    (
        SELECT
            C_REGION,
            C_CUSTKEY AS LO_CUSTKEY
        FROM customerfull
    ) USING (LO_CUSTKEY)
    WHERE C_REGION = 'ASIA'
)
ALL INNER JOIN
(
    SELECT
        P_MFGR,
        P_PARTKEY AS LO_PARTKEY
    FROM partfull
) USING (LO_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

By writing queries this way, we force ClickHouse to use the prescribed JOIN order — at this moment there is no optimizer in ClickHouse and it is totally unaware of data distribution.

There is also not much speedup when we compare one node vs. three nodes:

One node execution time:

35 rows in set. Elapsed: 697.806 sec. Processed 15.08 billion rows, 211.53 GB (21.61 million rows/s., 303.14 MB/s.)

Three nodes execution time:

35 rows in set. Elapsed: 622.536 sec. Processed 15.12 billion rows, 211.71 GB (24.29 million rows/s., 340.08 MB/s.)

There is a way to make the query faster for this 3-way JOIN, however. (Thanks to Alexander Zaytsev from https://www.altinity.com/ for help!)

Optimized query:

SELECT
    sum(revenue),
    P_MFGR,
    yod
FROM
(
    SELECT
        LO_PARTKEY AS P_PARTKEY,
        toYear(LO_ORDERDATE) AS yod,
        SUM(LO_REVENUE) AS revenue
    FROM lineorderfull
    WHERE LO_CUSTKEY IN
    (
        SELECT C_CUSTKEY
        FROM customerfull
        WHERE C_REGION = 'ASIA'
    )
    GROUP BY
        P_PARTKEY,
        yod
)
ANY INNER JOIN partfull USING (P_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

Optimized query time:

One node:

35 rows in set. Elapsed: 106.732 sec. Processed 15.00 billion rows, 210.05 GB (140.56 million rows/s., 1.97 GB/s.)

Three nodes:

35 rows in set. Elapsed: 75.854 sec. Processed 15.12 billion rows, 211.71 GB (199.36 million rows/s., 2.79 GB/s.)

That’s an improvement of about 6.5 times compared to the original query. This shows the importance of understanding data distribution, and writing the optimal query to process the data.

Another option for dealing with JOIN complexity, and to improve performance, is to use ClickHouse’s dictionaries. These dictionaries are described here: https://www.altinity.com/blog/2017/4/12/dictionaries-explained.

I will review dictionary performance in future posts.

Another traditional way to deal with JOIN complexity in an analytics workload is to use denormalization. We can move some columns (for example, P_MFGR from the last query) into the fact table (lineorder).
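
As a sketch of what that could look like (the table below is hypothetical and uses a reduced column list), carrying P_MFGR in the fact table removes the JOIN to partfull entirely:

CREATE TABLE lineorder_denorm (
        LO_ORDERKEY     UInt32,
        LO_LINENUMBER   UInt8,
        LO_CUSTKEY      UInt32,
        LO_ORDERDATE    Date,
        LO_REVENUE      UInt32,
        P_MFGR          String
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER),8192);

SELECT
    sum(LO_REVENUE),
    P_MFGR,
    toYear(LO_ORDERDATE) AS yod
FROM lineorder_denorm
WHERE LO_CUSTKEY IN
(
    SELECT C_CUSTKEY
    FROM customerfull
    WHERE C_REGION = 'ASIA'
)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

The trade-off is extra storage and a more involved load process, since the attribute has to be filled in for every fact row when loading the data.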

Observations

  • ClickHouse can handle general analytical queries (it requires special schema design and considerations, however)
  • Linear speedup is possible, but it depends on query design and requires advanced planning — proper speedup depends on data locality
  • ClickHouse is blazingly fast (beyond what I’ve seen before) because it can use all available CPU cores for a query, as shown above: 24 cores on a single server and 72 cores across three nodes
  • Multi-table JOINs are cumbersome and require manual work to achieve better performance, so consider using dictionaries or denormalization
Jun
09
2017
--

Q & A: MySQL In the Cloud – Migration, Best Practices, High Availability, Scaling

MySQL in the Cloud

In this blog, we will provide answers to the Q & A for the MySQL In the Cloud: Migration, Best Practices, High Availability, Scaling webinar.

First, we want to thank everybody for attending the June 7, 2017 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that we were unable to answer during the webinar:

How does Percona XtraDB cluster work with AWS for MySQL clustering?

Percona XtraDB Cluster works especially well in cloud environments, including Amazon EC2. Since Percona XtraDB Cluster requires only one network round trip per write transaction commit and keeps all reads local, it can be deployed as a high-performance multi-AZ and even multi-region cluster. The fact that each Percona XtraDB Cluster node contains all the data allows it to avoid reliance on EBS storage. You can run Percona XtraDB Cluster on NVMe-storage-based i3 EC2 nodes to achieve high performance even with very IO-intensive workloads. Automatic provisioning and cluster self-healing allow you to scale the cluster easily. We have a simple tutorial on how to deploy Percona XtraDB Cluster on AWS – check it out here.

How do you approach master-master model? Are there enough reasons to use the model to implement multi-site scaling?

There are two distinct multi-master modes in existence. A synchronous Master-Master solution, like the one offered by Percona XtraDB Cluster (virtually synchronous to be exact), guarantees there are no data conflicts as you connect to the nodes located at different sites. The downside of this model is that writes can be expensive. As such, it works well in environments with low latency between the different sites, or when high latency for updates can be tolerated. Percona XtraDB Cluster is greatly optimized in that it requires only one network roundtrip to complete a commit transaction. This significantly reduces the added latency compared to many other solutions.

In contrast, asynchronous Master-Master means you can perform writes locally, without waiting on a network round trip. It comes with the downside of possible data conflicts. In MySQL, it can be implemented using MySQL Replication. At this point, however, MySQL Replication only detects conflicts and stops if it finds one; it has no good built-in conflict resolution. Ensuring conflicts do not happen at the application level is hard and error prone, and only recommended in rare cases. Most applications out there do not use active Master-Master, but rather design an architecture where each database replication set operates with only a single writable node.

Do the Percona tools work in the cloud, like in Amazon Aurora?

We try to make Percona software work in the cloud when it makes sense. For example, Percona Toolkit and Percona Monitoring and Management support Amazon RDS and Amazon Aurora. Percona XtraBackup does not, as it requires physical access to the database files (Amazon RDS and Aurora don’t provide that). Having said that, Amazon recently updated its Aurora migration documentation to include the use of XtraBackup: Amazon Aurora supports backups taken by Percona XtraBackup as a way to import data.

What is the fastest way to verify and validate backups created by XtraBackup for databases around 2-3TB?

In the big picture, you test backups by doing some sort of restore and validation. This can be done manually, but is much better if automated. There are three levels of such validation:

  • Basic Validation. Run --apply-log and ensure it completes successfully. Start the MySQL instance and run some basic queries to ensure it works. Running a few queries to check that recent data is present is often a good idea.
  • Consistency Validation. Additionally, run CHECK TABLE on all tables to ensure there is no corruption. This validates the table and index data structures (see the sketch after this list).
  • Full Validation. Restore the backup and connect the restored copy as a MySQL slave (possibly to one of the existing slaves). Let it catch up, and then run pt-table-checksum to validate consistency and ensure that the data in the backup matches what is in the source.
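
For the Consistency Validation step, one way to drive CHECK TABLE across a whole restored instance is to generate the statements from information_schema and feed them back into the server. A simple sketch (run it against the restored copy, not production):

-- Generate CHECK TABLE statements for every user table in the restored backup
SELECT CONCAT('CHECK TABLE `', table_schema, '`.`', table_name, '`;') AS check_stmt
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');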

Running CHECK TABLE on databases on AWS IO-optimized instances takes up to eight hours. Any other suggestions on how to replace CHECK TABLE in validation?

Without knowing the table size, it is hard for me to assess whether eight hours is reasonable for your environment. However, generally speaking you should not run a Full Validation on every backup. Full Validation first and foremost validates the backup and restore pipeline. If you’re not seeing issues, doing it once per month is plenty. You want to do lighter checks on a daily and weekly basis. 

What approach would you recommend for a data warehouse needing about 80,000 IOPS, currently on FusionIO bare metal? Which cloud solution would be my best bet?

This is a complicated question. To answer it properly requires more information: we would need to know what type of operations your database performs. Working with a Percona consultant to do an A&D review for your environment would give you the best answer. In general, though, EBS (even with a large number of provisioned IOPS) would not match FusionIO in IO request latency. I3 high-IO instances with NVMe storage are a closer match. If budget is not a concern, you can look into X1 instances. These can have up to 2TB of memory and often allow keeping all (or a large portion) of the database in memory for even higher performance.

Thanks for attending the MySQL In the Cloud: Migration, Best Practices, High Availability, Scaling webinar! Post any more MySQL in the cloud comments below.

May
22
2017
--

Webinar May 23, 2017: MongoDB Monitoring and Performance for the Savvy DBA

MongoDB Monitoring

Join Percona’s Senior Technical Services Engineer Bimal Kharel on Tuesday, May 23, 2017, as he presents a webinar on MongoDB monitoring called How to Help Your DBA’s Sleep Better at Night at 10:00 am PDT / 1:00 pm EDT (UTC-7).

Are you trying to stay on top of your database before things turn ugly? Between metrics for throughput, database performance, resource utilization, resource saturation, errors (asserts) and many others, how do you know which one needs to be looked at NOW (and which can wait)?

Both DBAs and system admins must stay on top of the systems they manage. But filtering between metrics that need immediate attention and those that should be watched over time is challenging. In this webinar, Bimal narrows down the list of metrics that help you decide whether the on-call DBA gets their recommended eight hours of shuteye, or gets to run on caffeine with no sleep.

Bimal also discusses which graphs relate to each other, with examples from Percona’s Monitoring and Management (PMM) tool, to help you understand how things in MongoDB can impact other areas.

Please register for the webinar here.

Bimal Kharel, Senior Technical Services Engineer, Percona

Bimal is a MongoDB support engineer at Percona. Before Percona he worked as a MongoDB DBA at EA and Charles Schwab. He has been in various roles throughout his career, from graphics to web developer to systems administration. MongoDB was the first database Bimal got into (he used MySQL for some websites but never other relational databases).
