Aug
28
2017
--

Looking at Disk Utilization and Saturation


In this blog post, I will look at disk utilization and saturation.

In my previous blog post, I wrote about CPU utilization and saturation, the practical difference between them, and how CPU utilization and saturation affect response times differently. Now we will look at another critical component of database performance: the storage subsystem. In this post, I will refer to the storage subsystem as “disk” (as a casual catch-all).

The most common tool for command-line IO performance monitoring is iostat, which shows information like this:

root@ts140i:~# iostat -x nvme0n1 5
Linux 4.4.0-89-generic (ts140i)         08/05/2017      _x86_64_        (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.51    0.00    2.00    9.45    0.00   88.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 3555.57 5887.81 52804.15 87440.73    29.70     0.53    0.06    0.13    0.01   0.05  50.71
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.60    0.00    1.06   20.77    0.00   77.57
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.50    0.00    1.26    6.08    0.00   92.16
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7653.20    0.00 113497.60     0.00    29.66     0.99    0.13    0.13    0.00   0.12  93.52

The first line shows the average performance since system start. In some cases, it is useful to compare the current load to the long-term average. In this case, as it is a test system, it can be safely ignored. The following lines show the current performance metrics over five-second intervals (as specified on the command line).

The iostat command reports utilization information in the %util column. You can look at saturation by either looking at the average request queue size (the avgqu-sz column) or at the r_await and w_await columns (which show the average wait for read and write operations). If these go well above “normal” for your system, then the device is over-saturated.
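
If you only want to keep an eye on the saturation-related columns, a simple filter over the iostat output works. This is a minimal sketch assuming the column layout shown above (column positions vary between sysstat versions) and the nvme0n1 device name:

# print the request queue size and per-operation waits for nvme0n1 every 5 seconds
iostat -x nvme0n1 5 | awk '/^nvme0n1/ {print $1, "avgqu-sz=" $9, "r_await=" $11, "w_await=" $12}'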

As in my previous blog post, we’ll perform some Sysbench runs and observe how the iostat command line tool and Percona Monitoring and Management graphs behave.

To focus specifically on the disk, we’re using the Sysbench fileio test. I’m using just one 100GB file, as I’m using DirectIO so all requests hit the disk directly. I’m also using “sync” request submission mode so I can get better control of request concurrency.

I’m using an Intel 750 NVMe SSD in this test (though it does not really matter).
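
Before the run stages shown below, the test file itself has to be created with a prepare step. A minimal sketch matching the file options used in the runs (executed from the same directory, /mnt/data in my case):

# create the single 100GB file the fileio runs will read from
sysbench fileio --file-num=1 --file-total-size=100G prepare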

Sysbench FileIO 1 Thread

root@ts140i:/mnt/data# sysbench  --threads=1 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      7113.16
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  111.14
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0001s
   total number of events:              4267910
Latency (ms):
        min:                                  0.07
        avg:                                  0.14
        max:                                  6.18
        95th percentile:                      0.17

A single thread run is always great as a baseline, as with only one request in flight we should expect the best response time possible (though typically not the best throughput possible).

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68

Disk Latency

The Disk Latency graph confirms the disk IO latency we saw in the iostat command, and it will be highly device-specific. We use it as a baseline to compare against the higher-concurrency runs.

Disk IO Utilization


Disk IO utilization is close to 100% even though we have just one outstanding IO request (queue depth). This is the problem with Linux disk utilization reporting: unlike CPUs, Linux does not have direct visibility into how the IO device is designed. How many “execution units” does it really have? How are they utilized? A single spinning disk can be seen as a single execution unit, while RAID arrays, SSDs and cloud storage (such as EBS) may have more than one.

Disk Load


This graph shows the disk load (or request queue size), which roughly matches the number of threads that are hitting disk as hard as possible.

Saturation (IO Load)


The IO load on the Saturation Metrics graph shows pretty much the same numbers. The only difference is that unlike Disk IO statistics, it shows the summary for the whole system.

Sysbench FileIO 4 Threads

Now let’s increase IO to four concurrent threads and see how disk responds:

sysbench  --threads=4 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      26248.44
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  410.13
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0002s
   total number of events:              15749205
Latency (ms):
        min:                                  0.06
        avg:                                  0.15
        max:                                  8.73
        95th percentile:                      0.21

We can see the number of requests scales almost linearly, while request latency changes very little: 0.14ms vs. 0.15ms. This shows the device has enough execution units internally to handle the load in parallel, and there are no other bottlenecks (such as the connection interface).

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 28808.60    0.00 427668.00     0.00    29.69     4.05    0.14    0.14    0.00   0.03  99.92

Disk Latency


Disk Utilization


Disk Load


Saturation Metrics (IO Load)


These stats and graphs show an interesting picture: we barely see a response time increase for IO requests, while utilization inches closer to 100% (with four threads submitting requests all the time, it is hard to catch a moment when the disk does not have any requests in flight). The load is near four (showing the disk has to handle four requests at a time on average).

Sysbench FileIO 16 Threads

root@ts140i:/mnt/data# sysbench  --threads=16 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      76845.96
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  1200.72
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0003s
   total number of events:              46107727
Latency (ms):
        min:                                  0.07
        avg:                                  0.21
        max:                                  9.72
        95th percentile:                      0.36

Going from four to 16 threads, we again see a good throughput increase with a mild response time increase. If you look at the results closely, you will notice one more interesting thing: the average response time has increased from 0.15ms to 0.21ms (which is a 40% increase), while the 95% response time has increased from 0.21ms to 0.36ms (which is 71%). I also ran a separate test measuring 99% response time, and the difference is even larger: 0.26ms vs. 0.48ms (or 84%).

This is an important observation to make: once saturation starts to happen, the variance is likely to increase and some of the requests will be disproportionately affected (beyond what the average response time shows).

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 82862.20    0.00 1230567.20     0.00    29.70    16.33    0.20    0.20    0.00   0.01 100.00

Disk IO Latency


Disk IO Utilization


Disk Load


Saturation Metrics IO Load


The graphs show an expected figure: the disk load and IO load from saturation are up to about 16, and utilization remains at 100%.

One thing to notice is increased jitter in the graphs. IO utilization jumps to over 100% and disk IO load spikes to 18, when there should not be that many requests in flight. This comes from how the information is gathered. An attempt is made to sample the data every second, but on a loaded system it takes time for this process to work: sometimes we try to get data for a one-second interval but really get data for a 1.05- or 0.95-second interval. When the math is applied to the data, it creates spikes and dips in the graph where there should be none. You can just ignore them if you’re looking at the big picture.
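
To make the sampling issue concrete, here is a minimal sketch of how such a per-second utilization number is derived: sample the cumulative counters in /proc/diskstats twice and divide by the measured wall-clock time (the device name and field position assume a typical modern kernel):

# field 13 of /proc/diskstats is cumulative milliseconds spent doing IO
t1=$(date +%s.%N); m1=$(awk '$3 == "nvme0n1" {print $13}' /proc/diskstats)
sleep 1
t2=$(date +%s.%N); m2=$(awk '$3 == "nvme0n1" {print $13}' /proc/diskstats)
# any error in the measured interval length shows up directly as jitter in the percentage
echo "util%: $(echo "($m2 - $m1) / (($t2 - $t1) * 1000) * 100" | bc -l)"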

Sysbench FileIO 64 Threads

Finally, let’s run sysbench with 64 concurrent threads hitting the disk:

root@ts140i:/mnt/data# sysbench  --threads=64 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      127840.59
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  1997.51
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0014s
   total number of events:              76704744
Latency (ms):
        min:                                  0.08
        avg:                                  0.50
        max:                                  9.34
        95th percentile:                      1.25

We can see the average has risen from 0.21ms to 0.50ms (more than two times), and the 95th percentile almost tripled from 0.36ms to 1.25ms. From a practical standpoint, we can see some saturation starting to happen, but we’re still not seeing a linear response time increase with increasing numbers of parallel operations as we have seen with CPU saturation. I guess this points to the fact that this IO device has a lot of parallel capacity inside and can process requests more effectively (even going from 16 to 64 concurrent threads).

Over the series of tests, as we increased concurrency from one to 64, we saw response times increase from 0.14ms to 0.5ms (approximately 3.5 times). The 95% response time meanwhile grew from 0.17ms to 1.25ms (about seven times). For practical purposes, this is where we start to see the IO device saturation show.

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 138090.20    0.00 2049791.20     0.00    29.69    65.99    0.48    0.48    0.00   0.01 100.24

We’ll skip the rest of the graphs as they basically look the same, just with higher latency and 64 requests in flight.

Sysbench FileIO 256 Threads

root@ts140i:/mnt/data# sysbench  --threads=256 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      131558.79
   writes/s:                     0.00
   fsyncs/s:                     0.00
Throughput:
   read, MiB/s:                  2055.61
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0026s
   total number of events:              78935828
Latency (ms):
        min:                                  0.10
        avg:                                  1.95
        max:                                 17.08
        95th percentile:                      3.89

Iostat
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 142227.60    0.00 2112719.20     0.00    29.71   268.30    1.89    1.89    0.00   0.01 100.00

With 256 threads, we finally see the linear growth of the average response time that indicates overload and queueing of requests. There is no easy way to tell if it is due to IO bus saturation (we’re reading 2GB/sec here) or to the device’s internal processing ability.

As we’ve seen a less than linear increase in response time going from 16 to 64 connections, and a linear increase going from 64 to 256, we can see the “optimal” concurrency for this device: somewhere between 16 and 64 connections. This allows for peak throughput without a lot of queuing.

Before we get to the summary, I want to make an important note about this particular test. The test is a random reads test, which is a very important pattern for many database workloads, but it might not be the dominant load for your environment. You might be write-bound as well, or have mainly sequential IO access patterns (which could behave differently). For those other workloads, I hope this gives you some ideas on how to also analyze them.

Another Way to Think About Saturation

When I asked the Percona staff for feedback on this blog post, my colleague Yves Trudeau provided another way to think about saturation: measure saturation as the percent increase in the average response time compared to the single-user case. Like this:

Threads   Avg Response Time (ms)   Saturation
1         0.14                     –
4         0.15                     1.07x or 7%
16        0.21                     1.5x or 50%
64        0.50                     3.6x or 260%
256       1.95                     13.9x or 1290%
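
The same column can be reproduced with a quick calculation against the 0.14ms single-thread baseline (a minimal sketch; rounding explains the small differences):

# multiplier and percent increase over the single-thread average response time
for rt in 0.14 0.15 0.21 0.50 1.95; do
  awk -v rt=$rt -v base=0.14 'BEGIN {printf "%.2fms -> %.1fx or %d%%\n", rt, rt/base, (rt / base - 1) * 100}'
done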

 

Summary

We can see how understanding disk utilization and saturation is much more complicated than for the CPU:

  • The Utilization metric (as reported by iostat and by PMM) is not very helpful for showing true storage utilization, as it only measures the time when there is at least one request in flight. If you had the same metric for the CPU, it would correspond to something running on at least one of the cores (not very useful for highly parallel systems).

  • Unlike a CPU, Linux tools do not provide us with information about the structure of the underlying storage and how much parallel load it should be able to handle without saturation. Even more so, storage might well have different low-level resources that cause saturation. For example, it could be the network connection, SATA BUS or even the kernel IO stack for older kernels and very fast storage.
  • Saturation as measured by the number of requests in flight is helpful for guessing if there might be saturation, but since we do not know how many requests the device can efficiently process concurrently, just looking at the raw metric doesn’t let us determine that the device is overloaded.
  • Avg Response Time is a great metric for looking at saturation, but you can’t say in isolation what response time is good or bad for a given device. You need to look at it in context and compare it to the baseline. When you’re looking at the Avg Response Time, make sure you’re looking at read request response time vs. write request response time separately, and keep the average request size in mind to ensure you are comparing apples to apples.
Jul
12
2017
--

Gh-ost benchmark against pt-online-schema-change performance


In this blog post, I will run a gh-ost benchmark against the performance of pt-online-schema-change.

When gh-ost came out, I was very excited. As MySQL ROW replication became commonplace, you could use it to track changes instead of triggers. This practice is cleaner and safer compared to Percona Toolkit’s pt-online-schema-change. Since gh-ost doesn’t need triggers, I assumed it would generate lower overhead and work faster. I frequently called it “pt-online-schema-change on steroids” in my talks. Finally, I’ve found some time to check my theoretical claims with some benchmarks.

DISCLAIMER: These benchmarks correspond to one specific ALTER TABLE on the table of one specific structure and hardware configuration. I have not set up a broad set of tests. If you have other results – please comment!

Benchmark Setup Details

  • pt-online-schema-change from Percona Toolkit 3.0.3
  • gh-ost 1.0.36
  • Percona Server 5.7.18 on Ubuntu 16.04 LTS
  • Hardware: 28 CPU cores/56 threads, 128GB memory, Samsung 960 Pro 512GB
  • Sysbench 1.0.7

Prepare the table by running:

sysbench --threads=40 --rate=0 --report-interval=1 --percentile=99 --events=0 --time=0 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 prepare

The table size is about 3GB (fitting completely into the InnoDB buffer pool).

Run the benchmark in “full ACID” mode with:

  • sync_binlog=1
  • innodb_flush_log_at_trx_commit=1
  • innodb_doublewrite=1

This is important as this workload is heavily commit-bound, and extensively relies on group commit.
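
A quick way to confirm the server is actually running in this mode is to read the settings back before starting a run. A minimal sketch, assuming the same sbtest credentials used elsewhere in this post:

# verify the durability-related settings used for the benchmark
mysql -usbtest -psbtest -e "SELECT @@sync_binlog, @@innodb_flush_log_at_trx_commit, @@innodb_doublewrite;"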

This is the pt-online-schema-change command to alter the table:

time pt-online-schema-change --execute --alter "ADD COLUMN c1 INT" D=sbtest,t=sbtest1

This is the gh-ost command to alter the table:

time ./gh-ost  --user="sbtest" --password="sbtest" --host=localhost --allow-on-master --database="sbtest" --table="sbtest1"  --alter="ADD COLUMN c1 INT" --execute

Tests Details

For each test the old sysbench table was dropped and a new one prepared. I tested alter table in three different cases:

  • When nothing else was running (“Idle Load”)   
  • When the system handled about 2% of load it can handle at full capacity (“Light Background Load”)
  • When the system handled about 40% of the possible load, with sysbench injecting about 25% of the transactions/sec the system could handle at full load (“Heavy Background Load”)

I measured the alter table completion times for all cases, as well as the overhead generated by the alter (in other words, how much peak throughput is reduced by running alter table through the tools).

Idle Load

gh-ost benchmark 1

For the Idle Load test, pt-online-schema-change completed nearly twice as fast as gh-ost. This was a big surprise for me. I haven’t looked into the reasons or details yet, though I can see most of the CPU usage for gh-ost is on the MySQL server side. Perhaps the differences relate to the SQL used to perform non-blocking alter tables.

Light Background Load

I generated the Light Background Load by running the sysbench command below. It corresponds to a roughly 4% load, as the system can handle some 2500 transactions/sec at this concurrency under full load. Adjust the --rate value to scale it for your system.

time sysbench --threads=40 --rate=100 --report-interval=1 --percentile=99 --events=0 --time=0 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 run

gh-ost benchmark 2

The numbers changed (as expected), but pt-online-schema-change is still approximately twice as fast as gh-ost.

What is really interesting in this case is how a relatively light background load affects the process completion time. It took both pt-online-schema-change and gh-ost about 2.7 times longer to finish!

Heavy Background Load

I generated the Heavy Background Load by running the sysbench command below. It corresponds to a roughly 40% load, as the system can handle some 2500 transactions/sec at this concurrency under full load. Adjust the --rate value to scale it for your system.

time sysbench --threads=40 --rate=1000 --report-interval=1 --percentile=99 --events=0 --time=0 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 run

gh-ost benchmark 3

What happened in this case? When the load gets higher, gh-ost can’t keep up with binary log processing, and just never finishes at all. While this may be surprising at first, it makes sense if you think more about how these tools work. pt-online-schema-change uses triggers, and while they have a lot of limitations and overhead they can execute in parallel. gh-ost, on the other hand, processes the binary log in a single thread and might not be able to keep up.   

MySQL 5.6 didn’t have parallel replication that can apply writes to the same table in parallel. For that version, the gh-ost limitation probably isn’t as big a deal, as such a heavy load would also cause replication lag. MySQL 5.7 has parallel replication, which makes it much easier to quickly replicate workloads that are too heavy for gh-ost to handle.
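
For reference, this is roughly how logical-clock parallel replication is enabled on a MySQL 5.7 replica (a sketch, not part of this benchmark; the worker count is illustrative):

# the SQL thread must be stopped to change the parallelization type
mysql -e "STOP SLAVE SQL_THREAD;
          SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
          SET GLOBAL slave_parallel_workers = 8;
          START SLAVE SQL_THREAD;"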

I should note that the workload being simulated in this benchmark is a rather extreme case. The table being altered by gh-ost here is at the same time handling a background load so high it can’t be replicated in a single thread.

Future versions of gh-ost could improve this issue by applying binlog events in parallel, similar to what MySQL replicas do.

An excerpt from the gh-ost log shows how it is totally backed up trying to apply the binary log:

root@rocky:/tmp# time ./gh-ost  --user="sbtest" --password="sbtest" --host=localhost --allow-on-master --database="sbtest" --table="sbtest1"  --alter="ADD COLUMN c1 INT" --execute
2017/06/25 19:16:05 binlogsyncer.go:75: [info] create BinlogSyncer with config &{99999 mysql localhost 3306 sbtest sbtest  false false <nil>}
2017/06/25 19:16:05 binlogsyncer.go:241: [info] begin to sync binlog from position (rocky-bin.000018, 640881773)
2017/06/25 19:16:05 binlogsyncer.go:134: [info] register slave for master server localhost:3306
2017/06/25 19:16:05 binlogsyncer.go:568: [info] rotate to (rocky-bin.000018, 640881773)
2017-06-25 19:16:05 ERROR parsing time "" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "" as "2006"
# Migrating `sbtest`.`sbtest1`; Ghost table is `sbtest`.`_sbtest1_gho`
# Migrating rocky:3306; inspecting rocky:3306; executing on rocky
# Migration started at Sun Jun 25 19:16:05 -0400 2017
# chunk-size: 1000; max-lag-millis: 1500ms; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.sbtest.sbtest1.sock
Copy: 0/9872432 0.0%; Applied: 0; Backlog: 0/100; Time: 0s(total), 0s(copy); streamer: rocky-bin.000018:641578191; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 0; Backlog: 100/100; Time: 1s(total), 1s(copy); streamer: rocky-bin.000018:641626699; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 640; Backlog: 100/100; Time: 2s(total), 2s(copy); streamer: rocky-bin.000018:641896215; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 1310; Backlog: 100/100; Time: 3s(total), 3s(copy); streamer: rocky-bin.000018:642178659; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 1920; Backlog: 100/100; Time: 4s(total), 4s(copy); streamer: rocky-bin.000018:642436043; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 2600; Backlog: 100/100; Time: 5s(total), 5s(copy); streamer: rocky-bin.000018:642722777; State:
...
Copy: 0/9872432 0.0%; Applied: 120240; Backlog: 100/100; Time: 3m0s(total), 3m0s(copy); streamer: rocky-bin.000018:694142377; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 140330; Backlog: 100/100; Time: 3m30s(total), 3m30s(copy); streamer: rocky-bin.000018:702948219; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 160450; Backlog: 100/100; Time: 4m0s(total), 4m0s(copy); streamer: rocky-bin.000018:711775662; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 180600; Backlog: 100/100; Time: 4m30s(total), 4m30s(copy); streamer: rocky-bin.000018:720626338; State: migrating; ETA: N/A
Copy: 0/9872432 0.0%; Applied: 200770; Backlog: 100/100; Time: 5m0s(total), 5m0s(copy); streamer: rocky-bin.000018:729509960; State: migrating; ETA: N/A

Online Schema Change Performance Impact

For this test I started the alter table, waited 60 seconds and then ran sysbench at full speed for five minutes. Then I measured how much the performance was impacted by running the tool:

sysbench --threads=40 --rate=0 --report-interval=1 --percentile=99 --events=0 --time=300 --db-ps-mode=auto --mysql-user=sbtest --mysql-password=sbtest  /usr/share/sysbench/oltp_read_write.lua --table_size=10000000 run

gh-ost benchmark 4

As we can see, gh-ost has negligible overhead in this case. pt-online-schema-change, on the other hand, had performance reduced by 12%. It is worth noting though that pt-online-schema-change still makes progress in this case (though slowly), while gh-ost would never complete.

If anything, I was surprised at how little impact the pt-online-schema-change run had on sysbench performance.

It’s important to note that in this case we only measured the overhead for the “copy” stage of the online schema change. Another thing you should worry about is the impact to performance during “table rotation” (which I have not measured).

Summary

While gh-ost introduces a number of design advantages, and gives better results in some situations, I wouldn’t call it always superior to the tried and true pt-online-schema-change. At least in some cases, pt-online-schema-change offers better performance than gh-ost and completes a schema change when gh-ost is unable to keep up. Consider trying out both tools and see what works best in your situation.

Jul
06
2017
--

ClickHouse: One Year!


In this blog, we’ll look at ClickHouse on its one-year anniversary.

It’s been a year already since the Yandex team released ClickHouse as open source software. I’ve had an interest in this project from the very start, as I didn’t think there was an open source analytical database that could compete with industry leaders like Vertica (for example).

This was an exciting year for ClickHouse early adopters. Let’s look at what it accomplished so far.

ClickHouse initially generated interest due to the Yandex name – the most popular search engine in Russia. It wasn’t long before jaw-dropping responses popped up: guys, this thing is crazy fast! Many early adopters who tried ClickHouse were really impressed.

Fast doesn’t mean convenient, though. That was the main community concern about ClickHouse over the past months. Developed as an internal project for an internal customer (Yandex.Metrica), ClickHouse had a lot of limitations for general community use. It took several months before Yandex could restructure the team and mindset, establish proper communication with the community and start addressing external needs. There are still a lot of things that need to be done. The public roadmap is not easily available, for example, but the wheels are pointed in the right direction. The ClickHouse team has added a lot of the features people were screaming for, and more are in ClickHouse’s future plans.

The Yandex team has been actively attending and presenting at international conferences.

They are speaking much more in Russia (no big surprise).

We were very excited by Yandex’s ClickHouse performance claims at Percona, and could not resist making our own benchmarks.

ClickHouse did very well in these benchmarks. There are many other benchmarks by other groups as well, including a benchmark against Amazon RedShift by Altinity.

The first ClickHouse production deployments outside of Yandex started in October-November 2016. Today, Yandex reports that dozens of companies around the world are using ClickHouse in production, with the biggest installations operating with up to several petabytes of data. Hundreds of other enterprises are deploying pilot installations or actively evaluating the software.

There are also interesting reports from Cloudflare (How Cloudflare analyzes 1M DNS queries per second) and from Carto (Geospatial processing with ClickHouse).

There are also various community projects around ClickHouse worth mentioning.

Percona is also working to adapt ClickHouse to our projects. We are using ClickHouse to handle Query Analytics and as a long-term metrics store inside a new version (under development) of Percona Monitoring and Management.

I will also be speaking about ClickHouse at BIG DATA DAY LA 2017 on August 5th, 2017. You are welcome to attend if you are in Los Angeles that day!

ClickHouse has the potential to become one of the leading open source analytical DBMSs – much like MySQL and PostgreSQL are leaders for OLTP workloads. We will see in the next couple of years whether that happens or not. Congratulations to the Yandex team on their one-year milestone!

Jun
22
2017
--

ClickHouse in a General Analytical Workload (Based on a Star Schema Benchmark)


In this blog post, we’ll look at how ClickHouse performs in a general analytical workload using the star schema benchmark test.

We have mentioned ClickHouse in some recent posts (ClickHouse: New Open Source Columnar Database, Column Store Database Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark), where it showed excellent results. ClickHouse by itself seems to be an event-oriented RDBMS, as its name suggests (clicks). Its primary use case, Yandex Metrica (a system similar to Google Analytics), also points to an event-based nature. We can also see there is a requirement for date-stamped columns.

It is possible, however, to use ClickHouse in a general analytical workload. This blog post shares my findings. For these tests, I used a Star Schema Benchmark, slightly modified to handle ClickHouse specifics.

First, let’s talk about schemas. We need to adjust to ClickHouse data types. For example, the biggest fact table in SSB is “lineorder”. Below is how it is defined for Amazon RedShift (as taken from https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-create-test-data.html):

CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);

For ClickHouse, the table definition looks like this:

CREATE TABLE lineorderfull (
        LO_ORDERKEY             UInt32,
        LO_LINENUMBER           UInt8,
        LO_CUSTKEY              UInt32,
        LO_PARTKEY              UInt32,
        LO_SUPPKEY              UInt32,
        LO_ORDERDATE            Date,
        LO_ORDERPRIORITY        String,
        LO_SHIPPRIORITY         UInt8,
        LO_QUANTITY             UInt8,
        LO_EXTENDEDPRICE        UInt32,
        LO_ORDTOTALPRICE        UInt32,
        LO_DISCOUNT             UInt8,
        LO_REVENUE              UInt32,
        LO_SUPPLYCOST           UInt32,
        LO_TAX                  UInt8,
        LO_COMMITDATE           Date,
        LO_SHIPMODE             String
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER),8192);

From this we can see we need to use datatypes like UInt8 and UInt32, which are somewhat unusual in the database world.

The second table (RedShift definition):

CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

For ClickHouse, I defined it as:

CREATE TABLE customerfull (
        C_CUSTKEY       UInt32,
        C_NAME          String,
        C_ADDRESS       String,
        C_CITY          String,
        C_NATION        String,
        C_REGION        String,
        C_PHONE         String,
        C_MKTSEGMENT    String,
        C_FAKEDATE      Date
)Engine=MergeTree(C_FAKEDATE,(C_CUSTKEY),8192);

For reference, the full schema for the benchmark is here: https://github.com/vadimtk/ssb-clickhouse/blob/master/create.sql.

For this table, we need to define a rudimentary column C_FAKEDATE Date in order to use ClickHouse’s most advanced engine (MergeTree). I was told by the ClickHouse team that they plan to remove this limitation in the future.

To generate data acceptable to ClickHouse, I made modifications to ssb-dbgen. You can find my version here: https://github.com/vadimtk/ssb-dbgen. The most notable change is that ClickHouse can’t accept dates in CSV files formatted as “19971125”. It has to be “1997-11-25”. This is something to keep in mind when loading data into ClickHouse.
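
Loading the generated files is then just a matter of piping them into clickhouse-client. A minimal sketch, assuming the modified ssb-dbgen produced a comma-separated file named lineorder.tbl:

# stream the generated data straight into the MergeTree table
cat lineorder.tbl | clickhouse-client --query="INSERT INTO lineorderfull FORMAT CSV"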

It is possible to do some preformatting during the load, but I don’t have experience with that. A common approach is to create a staging table with datatypes that match the loaded data, and then convert them using SQL functions when inserting into the main table.

Hardware Setup

One of the goals of this benchmark is to see how ClickHouse scales on multiple nodes. I used a setup of one node, and then compared it to a setup of three nodes. Each node has 24 cores of “Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz” CPUs, and the data is located on very fast PCIe Flash storage.

For the SSB benchmark I use a scale factor of 2500, which provides (in raw data):

  • Table lineorder: 15 bln rows, raw size 1.7TB
  • Table customer: 75 mln rows

When loaded into ClickHouse, the table lineorder takes 464GB, which corresponds to a 3.7x compression ratio.
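
One way to check the on-disk size after loading is to query the system.parts table (a sketch; column names can differ slightly between ClickHouse versions):

# compressed size per table, counting only active parts
clickhouse-client --query="SELECT table, formatReadableSize(sum(bytes)) AS size_on_disk FROM system.parts WHERE active GROUP BY table"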

We compare a one-node (table names lineorderfull, customerfull) setup vs. a three-node (table names lineorderd, customerd) setup.

Single Table Operations

Query:

SELECT
    toYear(LO_ORDERDATE) AS yod,
    sum(LO_REVENUE)
FROM lineorderfull
GROUP BY yod

One node:

7 rows in set. Elapsed: 9.741 sec. Processed 15.00 billion rows, 90.00 GB (1.54 billion rows/s., 9.24 GB/s.)

Three nodes:

7 rows in set. Elapsed: 3.258 sec. Processed 15.00 billion rows, 90.00 GB (4.60 billion rows/s., 27.63 GB/s.)

We see a speedup of practically three times. Handling 4.6 billion rows/s is blazingly fast!

One Table with Filtering

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorderfull
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)

One node:

1 rows in set. Elapsed: 3.175 sec. Processed 2.28 billion rows, 18.20 GB (716.60 million rows/s., 5.73 GB/s.)

Three nodes:

1 rows in set. Elapsed: 1.295 sec. Processed 2.28 billion rows, 18.20 GB (1.76 billion rows/s., 14.06 GB/s.)

It’s worth mentioning that during the execution of this query, ClickHouse was able to use ALL 24 cores on each box. This confirms that ClickHouse is a massively parallel processing system.

Two Tables (Independent Subquery)

In this case, I want to show how Clickhouse handles independent subqueries:

SELECT sum(LO_REVENUE)
FROM lineorderfull
WHERE LO_CUSTKEY IN
(
    SELECT C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
    WHERE C_REGION = 'ASIA'
)

One node:

1 rows in set. Elapsed: 28.934 sec. Processed 15.00 billion rows, 120.00 GB (518.43 million rows/s., 4.15 GB/s.)

Three nodes:

1 rows in set. Elapsed: 14.189 sec. Processed 15.12 billion rows, 121.67 GB (1.07 billion rows/s., 8.57 GB/s.)

We do not see, however, a speedup close to 3x on three nodes, because of the data transfer required to match LO_CUSTKEY with C_CUSTKEY.

Two Tables JOIN

Things get more complicated when a subquery needs to return columns in the results, or for GROUP BY. In this case, we want to GROUP BY a column from the second table.

First, ClickHouse doesn’t support traditional subquery syntax, so we need to use JOIN. For JOINs, ClickHouse also strictly prescribes how they must be written (a limitation that will also be changed in the future). Our JOIN should look like:

SELECT
    C_REGION,
    sum(LO_EXTENDEDPRICE * LO_DISCOUNT)
FROM lineorderfull
ANY INNER JOIN
(
    SELECT
        C_REGION,
        C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
) USING (LO_CUSTKEY)
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)
GROUP BY C_REGION

One node:

5 rows in set. Elapsed: 31.443 sec. Processed 2.35 billion rows, 28.79 GB (74.75 million rows/s., 915.65 MB/s.)

Three nodes:

5 rows in set. Elapsed: 25.160 sec. Processed 2.58 billion rows, 33.25 GB (102.36 million rows/s., 1.32 GB/s.)

In this case the speedup is not even two times. This is a consequence of the random data distribution for the tables lineorderd and customerd. Both tables were defined as:

CREATE TABLE lineorderd AS lineorder ENGINE = Distributed(3shards, default, lineorder, rand());
CREATE TABLE customerd AS customer ENGINE = Distributed(3shards, default, customer, rand());

Where rand() defines that records are distributed randomly across the three nodes. When we perform a JOIN by LO_CUSTKEY=C_CUSTKEY, matching records might be located on different nodes. One way to deal with this is to distribute the data by the JOIN key, so that matching rows end up on the same node. For example:

CREATE TABLE lineorderLD AS lineorderL ENGINE = Distributed(3shards, default, lineorderL, LO_CUSTKEY);
CREATE TABLE customerLD AS customerL ENGINE = Distributed(3shards, default, customerL, C_CUSTKEY);

Three Tables JOIN

This is where it becomes very complicated. Let’s consider the query that you would normally write:

SELECT sum(LO_REVENUE),P_MFGR, toYear(LO_ORDERDATE) yod FROM lineorderfull ,customerfull,partfull WHERE C_REGION = 'ASIA' and
LO_CUSTKEY=C_CUSTKEY and P_PARTKEY=LO_PARTKEY GROUP BY P_MFGR,yod ORDER BY P_MFGR,yod;

With Clickhouse’s limitations on JOINs syntax, the query becomes:

SELECT
    sum(LO_REVENUE),
    P_MFGR,
    toYear(LO_ORDERDATE) AS yod
FROM
(
    SELECT
        LO_PARTKEY,
        LO_ORDERDATE,
        LO_REVENUE
    FROM lineorderfull
    ALL INNER JOIN
    (
        SELECT
            C_REGION,
            C_CUSTKEY AS LO_CUSTKEY
        FROM customerfull
    ) USING (LO_CUSTKEY)
    WHERE C_REGION = 'ASIA'
)
ALL INNER JOIN
(
    SELECT
        P_MFGR,
        P_PARTKEY AS LO_PARTKEY
    FROM partfull
) USING (LO_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

By writing queries this way, we force ClickHouse to use the prescribed JOIN order — at this moment there is no optimizer in ClickHouse and it is totally unaware of data distribution.

There is also not much speedup when we compare one node vs. three nodes:

One node execution time:

35 rows in set. Elapsed: 697.806 sec. Processed 15.08 billion rows, 211.53 GB (21.61 million rows/s., 303.14 MB/s.)

Three nodes execution time:

35 rows in set. Elapsed: 622.536 sec. Processed 15.12 billion rows, 211.71 GB (24.29 million rows/s., 340.08 MB/s.)

There is a way to make the query faster for this 3-way JOIN, however. (Thanks to Alexander Zaytsev from https://www.altinity.com/ for help!)

Optimized query:

SELECT
    sum(revenue),
    P_MFGR,
    yod
FROM
(
    SELECT
        LO_PARTKEY AS P_PARTKEY,
        toYear(LO_ORDERDATE) AS yod,
        SUM(LO_REVENUE) AS revenue
    FROM lineorderfull
    WHERE LO_CUSTKEY IN
    (
        SELECT C_CUSTKEY
        FROM customerfull
        WHERE C_REGION = 'ASIA'
    )
    GROUP BY
        P_PARTKEY,
        yod
)
ANY INNER JOIN partfull USING (P_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

Optimized query time:

One node:

35 rows in set. Elapsed: 106.732 sec. Processed 15.00 billion rows, 210.05 GB (140.56 million rows/s., 1.97 GB/s.)

Three nodes:

35 rows in set. Elapsed: 75.854 sec. Processed 15.12 billion rows, 211.71 GB (199.36 million rows/s., 2.79 GB/s.)

That’s an improvement of about 6.5 times compared to the original query. This shows the importance of understanding data distribution, and writing the optimal query to process the data.

Another option for dealing with JOIN complexity, and to improve performance, is to use ClickHouse’s dictionaries. These dictionaries are described here: https://www.altinity.com/blog/2017/4/12/dictionaries-explained.

I will review dictionary performance in future posts.

Another traditional way to deal with JOIN complexity in an analytics workload is to use denormalization. We can move some columns (for example, P_MFGR from the last query) to the facts table (lineorder).

Observations

  • ClickHouse can handle general analytical queries (it requires special schema design and considerations, however)
  • Linear speedup is possible, but it depends on query design and requires advanced planning — proper speedup depends on data locality
  • ClickHouse is blazingly fast (beyond what I’ve seen before) because it can use all available CPU cores for a query, as shown above using 24 cores for a single server and 72 cores for three nodes
  • Multi-table JOINs are cumbersome and require manual work to achieve better performance, so consider using dictionaries or denormalization
May
24
2017
--

Percona Software and Roadmap Update with CEO Peter Zaitsev: Q2 2017


This blog post is a summary of the Percona Software and Roadmap Update – Q2 2017 webinar given by Peter Zaitsev on May 4, 2017. This webinar reflects changes and updates since the last update (Q1 2017).

A full recording of this webinar, along with the presentation slide deck, can be found here.

Percona Software

Below are the latest and upcoming features in Percona’s software. All of Percona’s software is 100% free and open source, with no restricted “Enterprise” version. Percona doesn’t restrict users with open core or “open source, eventually” (BSL) licenses.

Percona Server for MySQL 5.7

Latest Improvements

Features About To Be Released 

  • Integration of TokuDB and Performance Schema
  • MyRocks integration in Percona Server
  • Starting to look towards MySQL 8

Percona XtraBackup 2.4

Latest Improvements

Percona Toolkit

Latest Improvements

Percona Server for MongoDB 3.4

Latest Improvements

Percona XtraDB Cluster 5.7

Latest Improvements

Performance Improvement Benchmarks

Below, you can see the benchmarks for improvements to Percona XtraDB Cluster 5.7 performance. You can read about the improvements and benchmark tests in more detail here and here.

Percona Software and Roadmap Update

Percona XtraDB Cluster 5.7 Integrated with ProxySQL 1.3

Percona Monitoring and Management

New in Percona Monitoring and Management

Advanced MariaDB Dashboards in PMM (Links go to PMM Demo)

Percona Q217 Roadmap 4

Improved MongoDB Dashboards in PMM (Links go to PMM Demo)

Percona Q217 Roadmap 7

Percona Q217 Roadmap 9

Percona Q217 Roadmap 10

Check out the PMM Demo

Thanks for tuning in for an update on Percona Software and Roadmap Update – Q2 2017.


May
03
2017
--

Storing UUID and Generated Columns


A lot of things have been said about UUIDs and how to store them in an optimized way. Now that we have generated columns, we can store the information decomposed from the UUID and merge it back together with generated columns. This blog post demonstrates the process.

First, I used a simple table with one char field that I called uuid_char to establish a base case. I used this table with and without a primary key:

CREATE TABLE uuid_char (
uuid char(36) CHARACTER SET utf8 NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE uuid_char_pk (
uuid char(36) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (uuid)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

I performed the tests on a local VM over MySQL 5.7.17 for 30 seconds, with only two threads, because I wanted to just compare the executions:

sysbench \
--oltp-table-size=100000000 \
--test=/usr/share/doc/sysbench/tests/db/insert_uuid_generated_columns.uuid_char.lua \
--oltp-tables-count=4 \
--num-threads=2 \
--mysql-user=root \
--max-requests=5000000 \
--report-interval=5 \
--max-time=30 \
--mysql-db=generatedcolumn \
run

One pair of executions is with the UUID generated by sysbench, which simulates the UUID that comes from the app:

rs = db_query("INSERT INTO uuid_char (uuid) VALUES " .. string.format("('%s')",c_val))

An alternative execution is for when the UUID is generated by the MySQL function uuid():

rs = db_query("INSERT INTO uuid_char (uuid) VALUES (uuid())")

Below we can see the results: 

The inserts are faster without a PK (but only by 5%), and using the uuid() function doesn’t impact performance.

Now, let’s see the alternative method, which is decomposing the UUID. It has four main information sets:

  • Timestamp: this is a number with seven decimals.
  • MAC: the MAC address of the device that creates the UUID
  • Unique value: this value helps avoid duplicates
  • UUID version: this will always be “1”, as we are going to use version 1. If you are going to use another version, you will need to review the functions that I used.

The structure of the table that we’ll use is:

CREATE TABLE `uuid_generated` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

To understand how a UUID is unwrapped, I used this stored procedure (which receives a UUID and inserts it into the table):

delimiter ;;
CREATE PROCEDURE ins_generated_uuid (uuid char(38))
begin
set @hex_timestamp = concat(substring(uuid, 16, 3), substring(uuid, 10, 4), substring(uuid, 1, 8));
set @timestamp = concat(conv(@hex_timestamp,16,10) div 10000000 - (141427 * 24 * 60 * 60),'.',right(conv(@hex_timestamp,16,10),7));
set @mac = conv(right(uuid,12),16,10);
set @temp_uniq = unhex(substring(uuid,20,4));
insert into uuid_generated (timestamp,mac,temp_uniq) values (@timestamp,@mac,@temp_uniq);
end ;;
delimiter ;

Explanation:

  • @hex_timestamp is a temporary variable that collects the timestamp in hexadecimal format from the different sections of the UUID
  • @timestamp transforms the hexadecimal timestamp to a decimal number
  • @mac pulls the last number in the UUID (a MAC address) so we can store it as a bigint
  • @temp_uniq is a value to conserve the uniqueness, which is why we store it as binary and it is at the end of the Primary Key
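
A quick way to exercise the procedure is to feed it a freshly generated UUID (a sketch; MySQL’s uuid() produces version 1 UUIDs, and the generatedcolumn schema matches the sysbench command above):

# unwrap a fresh UUID and check the decomposed row that was stored
mysql generatedcolumn -e "CALL ins_generated_uuid(uuid());
                          SELECT * FROM uuid_generated ORDER BY timestamp DESC LIMIT 1;"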

If I wanted to get the UUID again, I can use these two generated columns:

`hex_timestamp` char(40) GENERATED ALWAYS AS (conv(((`timestamp` * 10000000) + (((141427 * 24) * 60) * 600000000)),10,16)) VIRTUAL,
`uuid_generated` char(38) GENERATED ALWAYS AS (concat(right(`hex_timestamp`,8),'-',substr(`hex_timestamp`,4,4),'-1',left(`hex_timestamp`,3),'-',convert(hex(`temp_uniq`) using utf8),'-',lpad(conv(`mac`,10,16),12,'0'))) VIRTUAL,
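
With the decomposed columns, queries over a date range for one device become straightforward, and the generated column returns the original string. A sketch against the uuid_generated_char_plus table shown further down (the date range and MAC value are made up):

# filter by time window and device MAC, then rebuild the original UUID
mysql generatedcolumn -e "
  SELECT uuid_generated
  FROM uuid_generated_char_plus
  WHERE timestamp BETWEEN UNIX_TIMESTAMP('2017-05-01') AND UNIX_TIMESTAMP('2017-05-02')
    AND mac = conv('0800200c9a66', 16, 10);"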

We performed tests over five scenarios:

  • Without the generated columns, the insert used data generated dynamically
  • Same as before, but we added a char field that stores the UUID
  • With the char field, and adding the generated column
  • We used the store procedure detailed before to insert the data into the table
  • We also tested the performance using triggers

The difference between the Base and the previous table structure with Primary Keys is very small. So, the new basic structure has no impact on performance.

We see that Base and +Char Field have the same performance. So leaving a char field has no performance impact (it just uses more disk space).

Using generated columns impacts performance. This is expected, as the columns are generated to validate the type before the row is inserted.

Finally, the use of triggers and the stored procedure has the same impact on performance.

These are the three table structures:

CREATE TABLE `uuid_generated` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `uuid_generated_char` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
`uuid` char(38) DEFAULT NULL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `uuid_generated_char_plus` (
`timestamp` decimal(18,7) unsigned NOT NULL,
`mac` bigint(20) unsigned NOT NULL,
`temp_uniq` binary(2) NOT NULL,
`uuid` char(38) DEFAULT NULL,
`hex_timestamp` char(40) GENERATED ALWAYS AS (conv(((`timestamp` * 10000000) + (((141427 * 24) * 60) * 600000000)),10,16)) VIRTUAL,
`uuid_generated` char(38) GENERATED ALWAYS AS (concat(right(`hex_timestamp`,8),'-',substr(`hex_timestamp`,4,4),'-1',left(`hex_timestamp`,3),'-',convert(hex(`temp_uniq`) using utf8),'-',lpad(conv(`mac`,10,16),12,'0'))) VIRTUAL,
PRIMARY KEY (`timestamp`,`mac`,`temp_uniq`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

And this is the trigger:

DROP TRIGGER IF EXISTS ins_uuid_generated;
delimiter ;;
CREATE TRIGGER ins_uuid_generated BEFORE INSERT ON uuid_generated
FOR EACH ROW
begin
set @hex_timestamp = concat(substring(NEW.uuid, 16, 3), substring(NEW.uuid, 10, 4), substring(NEW.uuid, 1, 8));
set NEW.timestamp = concat(conv(@hex_timestamp,16,10)div 10000000 - (141427 * 24 * 60 * 60),'.',right(conv(@hex_timestamp,16,10),7));
set NEW.mac = conv(right(NEW.uuid,12),16,10);
set NEW.temp_uniq = unhex(substring(NEW.uuid,20,4));
end ;;
delimiter ;

Conclusions

Decomposing the UUID is an alternative to storing them in order, but it won’t speed up inserts. It is simpler to execute queries over a range of dates, and to look at the rows for a particular device, as you will be able to use the MAC address (it is recommended to add an index for it). Generated columns give you the ability to build the UUID back into a single string.

Apr
10
2017
--

ProxySQL Rules: Do I Have Too Many?

In this blog post we are going to take a closer look at ProxySQL rules. How do they work, and how big is the performance impact of having many rules?

I would like to say thank you to René, who was willing to answer all my questions during my tests.

Overview

ProxySQL is heavily based on query rules. We can set up ProxySQL without rules based only on the host groups, but if we want read/write splitting or sharding (or anything else) we need rules.

ProxySQL knows the SQL protocol and language, so we can easily create rules based on username, schema name and even on the query itself. We can write regular expressions that match the query digest. Let me show you an example:

insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('Testuser',601,1,3,'^SELECT');

This rule matches all the queries starting with “SELECT”, and sends them to host group 601.
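
Once rules are loaded to runtime, a handy sanity check is the per-rule hit counter in the admin interface, which confirms the regex is actually matching traffic (a sketch, assuming the default admin credentials and port):

# hit counters live in the stats schema of the ProxySQL admin interface (port 6032)
mysql -u admin -padmin -h 127.0.0.1 -P 6032 -e "SELECT rule_id, hits FROM stats.stats_mysql_query_rules ORDER BY hits DESC;"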

As of version 1.3.1, the default regex engine was RE2. Starting with version 1.4, the default regex engine will be PCRE.

I would like to highlight three options that can have a bigger impact on your rules than you think: flagIN, flagOUT and apply.

From the manual:

. . .these allow us to create “chains of rules” that get applied one after the other. An input flag value is set to 0, and only rules with flagIN=0 are considered at the beginning. When a matching rule is found for a specific query, flagOUT is evaluated and if NOT NULL the query will be flagged with the specified flag in flagOUT. If flagOUT differs from flagIN, the query will exit the current chain and enters a new chain of rules having flagIN as the new input flag. If flagOUT matches flagIN, the query will be re-evaluated again against the first rule with said flagIN. This happens until there are no more matching rules, or apply is set to 1 (which means this is the last rule to be applied)

You might not be sure what this means, but I will show you later.

As you can see, adding a rule is easy and we can add hundreds of rules. But is there any performance impact?

Test Case

We can write rules based on any part of the query (for example, “userid” or some “sharding key”). In these tests I wrote the rules based on table names because I can easily generate tables with “sysbench”, and run queries against these tables.

I created 1000 tables using sysbench, and I am going to test them with a direct MySQL connection, ProxySQL without rules, with ten rules and with 100 rules.

Time to do some tests to see if adding 100 or more rules has any effect on performance.

I used two c4.4xlarge instances with SSDs, and I am going to share the steps so anybody can repeat my test and share/compare the results. NodeA is running MySQL 5.7.17 server, and NodeB is running ProxySQL 1.3.4 and sysbench. During the test I increased the sysbench threads in the following steps: 1, 2, 4, 8, 12, 16, 20, 24.

I tried to use the simplest ProxySQL configuration possible:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('10.10.10.243',600,3306,1000,0);
INSERT INTO mysql_replication_hostgroups VALUES (600,'','');
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;
insert into mysql_users (username,password,active,default_hostgroup,default_schema) values ('testuser_rw','Testpass1.',1,600,'test');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK; 

Only one server, one host group. I tried to measure the impact the rules had, so in all the tests I sent the queries to the same host group. I only changed the rules (and some ProxySQL settings, as I will explain later).

As I mentioned, I am going to filter based on table names. Here are the 100 rules that I used:

insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest1b');
insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest2b');
...
insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest100b');
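
Typing out 100 nearly identical rules is tedious; a small shell loop can generate and load them through the admin interface instead (a sketch, assuming the default admin credentials and port):

# generate the 100 table-name rules and load them into ProxySQL
for i in $(seq 1 100); do
  echo "insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest${i}b');"
done | mysql -u admin -padmin -h 127.0.0.1 -P 6032
mysql -u admin -padmin -h 127.0.0.1 -P 6032 -e "LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;"

First Test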

First I ran tests with a direct MySQL connection, ProxySQL without rules, ProxySQL with ten rules and ProxySQL with 100 rules.

ProxySQL rules

ProxySQL itself has an impact on the performance, but there is a big difference between 10 and 100 rules. So adding more and more rules can have a negative effect on the performance.

That’s all? Can we do anything to speed things up? I used the default ProxySQL settings. Let’s have a look at what we can tune.

Increasing the Number of Threads

Let’s go step by step. First we can increase the thread number inside ProxySQL (the default is 4). We will increase it to 8:

UPDATE global_variables SET variable_value='8' WHERE variable_name='mysql-threads';
SAVE MYSQL VARIABLES TO DISK;

ProxySQL has to be restarted after this change.

ProxySQL rules

With this simple change, we can improve the performance. As we can see, the difference gets larger and larger as we increase the number of sysbench threads.

Compiling

By compiling our own package, we can gain some extra performance. It is not clear why, so we opened a ticket for further investigation:

ProxySQL rules

I removed some of the columns because the graph got too busy.

ProxySQL 1.4

In ProxySQL 1.4 (which is not GA yet), we can change between the regex engines. However, even using the same engine (RE2) is faster in 1.4:

ProxySQL rules

Apply

As I mentioned, ProxySQL has a few important parameters like “apply”. With apply, if the query matches a rule it won’t check the remaining rules. In an ideal world, if you have 100 rules and 100 queries in random order which match only one rule, you only have to check 50 rules on average.

The new rules:

insert into mysql_query_rules (username,destination_hostgroup,active,retries,match_digest,apply) values('testuser_rw',600,1,3,'(from|into|update|into table) sbtest1b',1);

ProxySQL rules

As you can see, it didn’t help at all. But why? Because in this test we have 1000 tables, and we are running queries against all of them. That means 90% of the queries have to check all the rules anyway. Let’s run a test with 100 tables to see whether “apply” helps or not:

ProxySQL rules

As we can see, with 100 tables we get much better performance. But of course, this is not a valid solution, because we can’t just drop tables, “userids” or “sharding keys”. In the next post I will show you how to use “apply” in a more effective way.

Conclusion

So far, ProxySQL 1.4 with the PCRE engine and eight threads gives us the best performance with 100 rules and 1000 tables. As we can see, both the number of rules and the query distribution matter, and both impact performance. In my next blog post, I will show you how you can add some logic into your rules so that, even with more rules, you get better performance.

Apr
05
2017
--

Evaluation of PMP Profiling Tools

PMP Profiling Tools

In this blog post, we’ll look at some of the available PMP profiling tools.

While debugging or analyzing issues with Percona Server for MySQL, we often need a quick understanding of what’s happening on the server. Percona experts frequently use the pt-pmp tool from Percona Toolkit (inspired by http://poormansprofiler.org).

The

pt-pmp

 tool collects application stack traces with GDB and then post-processes them. From this, you get a condensed, ordered list of the stack traces. The list helps you understand where the application spent most of its time: either running something or waiting for something.

Getting a profile with

pt-pmp

 is handy, but it has a cost: it’s quite intrusive. In order to get stack traces, GDB has to attach to each thread of your application, which results in interruptions. Under high loads, these stops can be quite significant (up to 15, 30 or even 60 seconds). This means that the

pt-pmp

 approach is not really usable in production.

Below I’ll describe how to reduce GDB overhead, and also what other tools can be used instead of GDB to get stack traces.

  • GDB
    By default, the symbol resolution process in GDB is very slow. As a result, getting stack traces with GDB is quite intrusive (especially under high loads). There are two options available that can help notably reduce GDB tracing overhead:

      1. Use readnever patch. RHEL and other distros based on it include GDB with the readnever patch applied. This patch allows you to avoid unnecessary symbol resolving with the 
        --readnever

         option. As a result, you get up to 10 times better speed.

      2. Use gdb_index. This feature was added to address the symbol resolving issue by creating and embedding a special index into the binaries. The index is quite compact: I’ve created and embedded a gdb_index for the Percona Server binary (it increases the size by around 7-8MB). Adding the gdb_index speeds up obtaining stack traces/resolving symbols by two to three times.
    # to check if index already exists:
      readelf -S mysqld | grep gdb_index
    # to generate index:
      gdb -batch mysqld -ex "save gdb-index /tmp" -ex "quit"
    # to embed index:
      objcopy --add-section .gdb_index=/tmp/mysqld.gdb-index --set-section-flags .gdb_index=readonly mysqld mysqld
  • eu-stack (elfutils)
    The eu-stack tool from the elfutils package prints the stack for each thread in a process or core file. Symbol resolving is also not very optimized in eu-stack: by default, if you run it under load, it will take even more time than GDB. But eu-stack allows you to skip resolving completely, so it can grab the stack frames quickly and resolve them later, without any impact on the workload.
  • Quickstack
    Quickstack is a tool from Facebook that gets stack traces with minimal overhead.

Now let’s compare all the above profilers. We will measure how much time each one needs to take all the stack traces from Percona Server for MySQL under a high load (sysbench OLTP_RW with 512 threads).

The results show that eu-stack (without resolving) got all stack traces in less than a second, and that Quickstack and GDB (with the readnever patch) got very close results. For other profilers, the time was around two to five times higher. This is quite unacceptable for profiling (especially in production).

There is one more note regarding the

pt-pmp

 tool. The current version only supports GDB as the profiler. However, there is a development version of this tool that supports GDB, Quickstack, eu-stack and eu-stack with offline symbol resolving. It also allows you to look at stack traces for specific threads (tids). So for instance, in the case of Percona Server for MySQL, we can analyze just the purge, cleaner or IO threads.

Below are the command lines used in testing:

# gdb & gdb+gdb_index
  time gdb  -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld` > /dev/null
# gdb+readnever
  time gdb --readnever -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld` > /dev/null
# eu-stack
  time eu-stack -s -m -p `pidof mysqld` > /dev/null
# eu-stack without resolving
  time eu-stack -q -p `pidof mysqld` > /dev/null
# quickstack - 1 sample
  time quickstack  -c 1 -p `pidof mysqld` > /dev/null
# quickstack - 1000 samples
  time quickstack  -c 1000 -p `pidof mysqld` > /dev/null

Mar
22
2017
--

The Puzzling Performance of the Samsung 960 Pro

samsung 960 pro small

In this blog post, I’ll take a look at the performance of the Samsung 960 Pro SSD NVME.

First, I know the Samsung 960 Pro is a consumer SSD NVME drive, not intended for sustained data center workloads. But the AnandTech review looked good enough that I decided to take it for a test spin to see if it would work well with MySQL benchmarks.

Before that, I decided to do a simple sysbench file IO test to see how the drives handled sustained workloads, and if it would start acting up.

My expectation for a consumer SSD drive is that its write consistency will suffer. Many of these drives can sustain high bursts for short periods of time, but have to slow down to keep up with wear leveling (and the other internal activities SSDs must do). This is not what I saw, however.

I ran the benchmark on an E5-2630L V3 with 64GB of RAM, Ubuntu 16.04 LTS, an XFS filesystem and a Samsung 960 Pro 512GB (FW: 1B6QCXP7):

sysbench --num-threads=64 --max-time=86400 --max-requests=0 --test=fileio --file-num=1 --file-total-size=260G --file-io-mode=async --file-extra-flags=direct --file-test-mode=rndrd run

Note: I used asynchronous direct IO to keep it close to how MySQL (InnoDB) submits IO requests.

This is what the “Read Throughput” graph looks like in Percona Monitoring and Management (PMM):

Samsung 960 Pro

As you can see, in addition to some reasonable ebbs and flows, we have some major dips from about 1.5GB/sec of random reads to around 800MB/sec, which almost halves the performance. We can clearly see two of those dips, with a third one starting just as the test ended.

What is really interesting is that as I did a read-write test, it performed much more uniformly:

sysbench --num-threads=64 --max-time=86400 --max-requests=0 --test=fileio --file-num=1 --file-total-size=260G --file-io-mode=async --file-extra-flags=direct --file-test-mode=rndrw run

Samsung 960 Pro

Any ideas on what the cause of such strange periodic IO performance regression for reads could be?

This does not look like thermal throttling: it is much too regular for that (and I checked the temperature – it wasn’t any different during the performance regression).

One theory I have is “read disturb management”: could the SSD need to rewrite the data after so many reads? By my calculations, every cell is read some 166 times during the eight hours between those gaps. This doesn’t sound like a lot.

What are your thoughts?

Feb
13
2017
--

ClickHouse: New Open Source Columnar Database

Clickhouse

For this blog post, I’ve decided to try ClickHouse: an open source column-oriented database management system developed by Yandex (it currently powers Yandex.Metrica, the world’s second-largest web analytics platform).

In my previous set of posts, I tested Apache Spark for big data analysis and used Wikipedia page statistics as a data source. I’ve used the same data as in the Apache Spark blog post: Wikipedia Page Counts. This allows me to compare ClickHouse’s performance to Spark’s.

ClickHouse

I’ve spent some time testing ClickHouse with relatively large volumes of data (1.2TB uncompressed). Here is a list of the ClickHouse advantages and disadvantages that I saw:

ClickHouse advantages

  • Parallel processing for single query (utilizing multiple cores)
  • Distributed processing on multiple servers
  • Very fast scans (see benchmarks below) that can be used for real-time queries
  • Column storage is great for working with “wide” / “denormalized” tables (many columns)
  • Good compression
  • SQL support (with limitations)
  • Good set of functions, including support for approximated calculations
  • Different storage engines (disk storage format)
  • Great for structured log/event data as well as time series data (the MergeTree engine requires a date field; see the sketch after this list)
  • Index support (primary key only, not all storage engines)
  • Nice command line interface with user-friendly progress bar and formatting

Here is a full list of ClickHouse features
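
The schema of the wikistat table used below is not shown in this post, but to illustrate the date-field requirement, a minimal MergeTree definition could look roughly like this (column names are inferred from the queries later in the post, using the old-style MergeTree syntax that was current at the time):

CREATE TABLE wikistat
(
    date Date,       -- MergeTree requires a Date column; data is partitioned by month on it
    project String,
    path String,
    hits UInt64
) ENGINE = MergeTree(date, (project, path, date), 8192);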

ClickHouse disadvantages

  • No real delete/update support, and no transactions (same as Spark and most of the big data systems)
  • No secondary keys (same as Spark and most of the big data systems)
  • Own protocol (no MySQL protocol support)
  • Limited SQL support, and the joins implementation is different. If you are migrating from MySQL or Spark, you will probably have to re-write all queries with joins.
  • No window functions

Full list of ClickHouse limitations

Group by: in-memory vs. on-disk

Running out of memory is one of the potential problems you may encounter when working with large datasets in ClickHouse:

SELECT
    min(toMonth(date)),
    max(toMonth(date)),
    path,
    count(*),
    sum(hits),
    sum(hits) / count(*) AS hit_ratio
FROM wikistat
WHERE (project = 'en')
GROUP BY path
ORDER BY hit_ratio DESC
LIMIT 10
Progress: 1.83 billion rows, 85.31 GB (68.80 million rows/s., 3.21 GB/s.)  6%
Received exception from server:
Code: 241. DB::Exception: Received from localhost:9000, 127.0.0.1.
DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 1048576 bytes), maximum: 9.31 GiB:
(while reading column hits):

By default, ClickHouse limits the amount of memory for group by (it uses a hash table for group by). This is easily fixed – if you have free memory, increase this parameter:

SET max_memory_usage = 128000000000; #128G

If you don’t have that much memory available, ClickHouse can “spill” data to disk by setting this:

set max_bytes_before_external_group_by=20000000000; #20G
set max_memory_usage=40000000000; #40G

According to the documentation, if you need to use

max_bytes_before_external_group_by

 it is recommended to set

max_memory_usage

 to be ~2x of the size of

max_bytes_before_external_group_by

.

(The reason for this is that the aggregation is performed in two phases: (1) reading the data and building intermediate data, and (2) merging the intermediate data. The spill to disk can only happen during the first phase. If there is no spill, ClickHouse might need the same amount of RAM for stages 1 and 2.)
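
Putting that together with the query that failed above, a session-level sketch would look like this (the memory values are simply the ones used earlier in this section):

SET max_bytes_before_external_group_by = 20000000000; -- allow GROUP BY to spill to disk after ~20GB
SET max_memory_usage = 40000000000;                   -- roughly 2x the spill threshold
SELECT
    min(toMonth(date)),
    max(toMonth(date)),
    path,
    count(*),
    sum(hits),
    sum(hits) / count(*) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY hit_ratio DESC
LIMIT 10;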

Benchmarks: ClickHouse vs. Spark

Both ClickHouse and Spark can be distributed. However, for the purpose of this test I’ve run a single node for both ClickHouse and Spark. The results are quite impressive.

Benchmark summary

 Size / compression          | Spark v. 2.0.2              | ClickHouse
 Data storage format         | Parquet, compressed: snappy | Internal storage, compressed
 Size (uncompressed: 1.2TB)  | 395G                        | 212G

Clickhouse

 

 Test                          | Spark v. 2.0.2           | ClickHouse  | Diff
 Query 1: count (warm)         | 7.37 sec (no disk IO)    | 6.61 sec    | ~same
 Query 2: simple group (warm)  | 792.55 sec (no disk IO)  | 37.45 sec   | 21x better
 Query 3: complex group by     | 2522.9 sec               | 398.55 sec  | 6.3x better

 

ClickHouse vs. MySQL

I wanted to see how ClickHouse compared to MySQL. Obviously, we can’t compare some workloads. For example:

  • Storing terabytes of data and querying (“crunching” would be a better word here) data without an index. It would take weeks (or even months) to load data and build the indexes. That is a much more suitable workload for ClickHouse or Spark.
  • Real-time updates / OLTP. ClickHouse does not support real-time updates / deletes.

Usually, big data systems don’t provide real-time queries. Systems based on map/reduce (i.e., Hive on top of HDFS) are just too slow for real-time queries, as it takes a long time to initialize the map/reduce job and send the code to all nodes.

Potentially, you can use ClickHouse for real-time queries. It does not support secondary indexes, however. This means it will probably scan lots of rows, but it can do it very quickly.

To do this test, I’m using the data from the Percona Monitoring and Management system. The table I’m using has 150 columns, so it is good for column storage. The size in MySQL is ~250G:

mysql> show table status like 'query_class_metrics'\G
*************************** 1. row ***************************
           Name: query_class_metrics
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 364184844
 Avg_row_length: 599
    Data_length: 218191888384
Max_data_length: 0
   Index_length: 18590056448
      Data_free: 6291456
 Auto_increment: 416994305

Scanning the whole table is significantly faster in ClickHouse. Retrieving just ten rows by key is faster in MySQL (especially from memory).
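
The kind of access where MySQL wins is a short indexed lookup. The exact query is not shown here, but it would be something along these lines (a hypothetical sketch reusing the column names from the sparkline query below):

SELECT * FROM query_class_metrics
WHERE query_class_id = 7 AND instance_id = 1259
ORDER BY start_ts
LIMIT 10;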

But what if we only need to scan a limited number of rows and do a group by? In this case, ClickHouse may be faster. Here is an example (a real query used to create sparklines):

MySQL

SELECT
	(1480888800 - UNIX_TIMESTAMP(start_ts)) / 11520 as point,
	FROM_UNIXTIME(1480888800 - (SELECT point) * 11520) AS ts,
	COALESCE(SUM(query_count), 0) / 11520 AS query_count_per_sec,
	COALESCE(SUM(Query_time_sum), 0) / 11520 AS query_time_sum_per_sec,
	COALESCE(SUM(Lock_time_sum), 0) / 11520 AS lock_time_sum_per_sec,
	COALESCE(SUM(Rows_sent_sum), 0) / 11520 AS rows_sent_sum_per_sec,
	COALESCE(SUM(Rows_examined_sum), 0) / 11520 AS rows_examined_sum_per_sec
FROM  query_class_metrics
WHERE  query_class_id = 7 AND instance_id = 1259 AND (start_ts >= '2014-11-27 00:00:00'
AND start_ts < '2014-12-05 00:00:00')
GROUP BY point;
...
61 rows in set (0.10 sec)
# Query_time: 0.101203  Lock_time: 0.000407  Rows_sent: 61  Rows_examined: 11639  Rows_affected: 0
explain SELECT ...
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: query_class_metrics
   partitions: NULL
         type: range
possible_keys: agent_class_ts,agent_ts
          key: agent_class_ts
      key_len: 12
          ref: NULL
         rows: 21686
     filtered: 100.00
        Extra: Using index condition; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: NULL
   partitions: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: NULL
        Extra: No tables used
2 rows in set, 2 warnings (0.00 sec)

It is relatively fast.

ClickHouse (some functions are different, so we will have to rewrite the query):

SELECT
    intDiv(1480888800 - toRelativeSecondNum(start_ts), 11520) AS point,
    toDateTime(1480888800 - (point * 11520)) AS ts,
    SUM(query_count) / 11520 AS query_count_per_sec,
    SUM(Query_time_sum) / 11520 AS query_time_sum_per_sec,
    SUM(Lock_time_sum) / 11520 AS lock_time_sum_per_sec,
    SUM(Rows_sent_sum) / 11520 AS rows_sent_sum_per_sec,
    SUM(Rows_examined_sum) / 11520 AS rows_examined_sum_per_sec,
    SUM(Rows_affected_sum) / 11520 AS rows_affected_sum_per_sec
FROM query_class_metrics
WHERE (query_class_id = 7) AND (instance_id = 1259) AND ((start_ts >= '2014-11-27 00:00:00')
AND (start_ts < '2014-12-05 00:00:00'))
GROUP BY point;
61 rows in set. Elapsed: 0.017 sec. Processed 270.34 thousand rows, 14.06 MB (15.73 million rows/s., 817.98 MB/s.)

As we can see, even though ClickHouse scans more rows (270K vs. 11K – over 20x more), the ClickHouse query executes much faster (0.10 seconds in MySQL compared to 0.017 seconds in ClickHouse). The column store format helps a lot here, as MySQL has to read all 150 columns (stored inside InnoDB pages) and ClickHouse only needs to read seven columns.

Wikipedia trending article of the month

Inspired by the article about finding trending topics using Google Books n-grams data, I decided to implement the same algorithm on top of the Wikipedia page visit statistics data. My goal here is to find the “article trending this month”: the article that has significantly more visits this month compared to the previous month. As I was implementing the algorithm, I came across another ClickHouse limitation: the join syntax is restricted. In ClickHouse, you can only join with the “USING” keyword, which means the fields you are joining on need to have the same name. If the field names differ, we have to rename them in a subquery.
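
As a compact illustration before the full example below (hypothetical tables t1 and t2, not from the dataset): because USING needs identical column names on both sides, a differently named key has to be aliased inside a subquery first:

SELECT
    k,
    v1,
    v2
FROM
(
    SELECT k, v1 FROM t1
) AS a
ANY INNER JOIN
(
    SELECT other_key AS k, v2 FROM t2 -- rename so that USING (k) finds a matching column
) AS b USING (k)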

Below is an example.

First, create a temporary table to aggregate the visits per month per page:

CREATE TABLE wikistat_by_month ENGINE = Memory AS
SELECT
    path,
    mon,
    sum(hits) / total_hits AS ratio
FROM
(
    SELECT
        path,
        hits,
        toMonth(date) AS mon
    FROM wikistat
    WHERE (project = 'en') AND (lower(path) NOT LIKE '%special%') AND (lower(path) NOT LIKE '%page%') AND (lower(path) NOT LIKE '%test%') AND (lower(path) NOT LIKE '%wiki%') AND (lower(path) NOT LIKE '%index.html%')
) AS a
ANY INNER JOIN
(
    SELECT
        toMonth(date) AS mon,
        sum(hits) AS total_hits
    FROM wikistat
    WHERE (project = 'en') AND (lower(path) NOT LIKE '%special%') AND (lower(path) NOT LIKE '%page%') AND (lower(path) NOT LIKE '%test%') AND (lower(path) NOT LIKE '%wiki%') AND (lower(path) NOT LIKE '%index.html%')
    GROUP BY toMonth(date)
) AS b USING (mon)
GROUP BY
    path,
    mon,
    total_hits
ORDER BY ratio DESC
Ok.
0 rows in set. Elapsed: 543.607 sec. Processed 53.77 billion rows, 2.57 TB (98.91 million rows/s., 4.73 GB/s.)

Second, calculate the actual list:

SELECT
    path,
    mon + 1,
    a_ratio AS ratio,
    a_ratio / b_ratio AS increase
FROM
(
    SELECT
        path,
        mon,
        ratio AS a_ratio
    FROM wikistat_by_month
    WHERE ratio > 0.0001
) AS a
ALL INNER JOIN
(
    SELECT
        path,
        CAST((mon - 1) AS UInt8) AS mon,
        ratio AS b_ratio
    FROM wikistat_by_month
    WHERE ratio > 0.0001
) AS b USING (path, mon)
WHERE (mon > 0) AND (increase > 2)
ORDER BY
    mon ASC,
    increase DESC
LIMIT 100
??path?????????????????????????????????????????????????plus(mon, 1)????????????????????ratio?????????????increase??
? Heath_Ledger                                       ?            2 ?  0.0008467223172121601 ?  6.853825241458039 ?
? Cloverfield                                        ?            2 ?  0.0009372609760313347 ?  3.758937474560766 ?
? The_Dark_Knight_(film)                             ?            2 ?  0.0003508532447770276 ? 2.8858100355450484 ?
? Scientology                                        ?            2 ?  0.0003300109101992719 ?   2.52497180013816 ?
? Barack_Obama                                       ?            3 ?  0.0005786473399980557 ?  2.323409928527576 ?
? Canine_reproduction                                ?            3 ?  0.0004836300843539438 ? 2.0058985801174662 ?
? Iron_Man                                           ?            6 ?    0.00036261003907049 ? 3.5301196568303888 ?
? Iron_Man_(film)                                    ?            6 ? 0.00035634745198422497 ? 3.3815325090507193 ?
? Grand_Theft_Auto_IV                                ?            6 ?  0.0004036713142943461 ? 3.2112732008504885 ?
? Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull ?            6 ?  0.0002856570195547951 ?  2.683443198030021 ?
? Tha_Carter_III                                     ?            7 ? 0.00033954377342889735 ?  2.820114216429247 ?
? EBay                                               ?            7 ?  0.0006575000133427979 ? 2.5483158977946787 ?
? Bebo                                               ?            7 ?  0.0003958340022793501 ? 2.3260912792668162 ?
? Facebook                                           ?            7 ?   0.001683658379576915 ?   2.16460972864883 ?
? Yahoo!_Mail                                        ?            7 ?  0.0002190640575012259 ? 2.1075879062784737 ?
? MySpace                                            ?            7 ?   0.001395608643577507 ?  2.103263660621813 ?
? Gmail                                              ?            7 ?  0.0005449834079575953 ? 2.0675919337716757 ?
? Hotmail                                            ?            7 ?  0.0009126863121737026 ?  2.052471735190232 ?
? Google                                             ?            7 ?   0.000601645849087389 ? 2.0155448612416644 ?
? Barack_Obama                                       ?            7 ? 0.00027336526076130943 ? 2.0031305241832302 ?
? Facebook                                           ?            8 ?  0.0007778115183044431 ?  2.543477658022576 ?
? MySpace                                            ?            8 ?   0.000663544314346641 ?  2.534512981232934 ?
? Two-Face                                           ?            8 ? 0.00026975137404447024 ? 2.4171743959768803 ?
? YouTube                                            ?            8 ?   0.001482456447101451 ? 2.3884527929836152 ?
? Hotmail                                            ?            8 ? 0.00044467667764940547 ? 2.2265750216262954 ?
? The_Dark_Knight_(film)                             ?            8 ?  0.0010482536106662156 ?  2.190078096294301 ?
? Google                                             ?            8 ?  0.0002985028319919154 ? 2.0028812075734637 ?
? Joe_Biden                                          ?            9 ? 0.00045067411455437264 ?  2.692262662620829 ?
? The_Dark_Knight_(film)                             ?            9 ? 0.00047863754833213585 ?  2.420864550676665 ?
? Sarah_Palin                                        ?           10 ?  0.0012459220318907518 ?  2.607063205782761 ?
? Barack_Obama                                       ?           12 ?  0.0034487235202817087 ? 15.615409029600414 ?
? George_W._Bush                                     ?           12 ? 0.00042708730873936023 ? 3.6303098900144937 ?
? Fallout_3                                          ?           12 ?  0.0003568429236849597 ? 2.6193094036745155 ?
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????
34 rows in set. Elapsed: 1.062 sec. Processed 1.22 billion rows, 49.03 GB (1.14 billion rows/s., 46.16 GB/s.)

The response times are really good, considering the amount of data ClickHouse needed to scan (the first query scanned 2.57TB of data).

Conclusion

The ClickHouse column-oriented database looks promising for data analytics, as well as for storing and processing structured event data and time series data. ClickHouse can be ~10x faster than Spark for some workloads.

Appendix: Benchmark details

Hardware

  • CPU: 24xIntel(R) Xeon(R) CPU L5639 @ 2.13GHz (physical = 2, cores = 12, virtual = 24, hyperthreading = yes)
  • Disk: 2 consumer grade SSD in software RAID 0 (mdraid)

Query 1

select count(*) from wikistat

ClickHouse:

:)  select count(*) from wikistat;
SELECT count(*)
FROM wikistat
┌─────count()─┐
│ 26935251789 │
└─────────────┘
1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.)

Spark:

spark-sql> select count(*) from wikistat;
26935251789
Time taken: 7.369 seconds, Fetched 1 row(s)

Query 2

select count(*), month(dt) as mon
from wikistat where year(dt)=2008
and month(dt) between 1 and 10
group by month(dt)
order by month(dt);

ClickHouse:

:) select count(*), toMonth(date) as mon from wikistat
where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;
SELECT
    count(*),
    toMonth(date) AS mon
FROM wikistat
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon
┌────count()─┬─mon─┐
│ 2100162604 │   1 │
│ 1969757069 │   2 │
│ 2081371530 │   3 │
│ 2156878512 │   4 │
│ 2476890621 │   5 │
│ 2526662896 │   6 │
│ 2489723244 │   7 │
│ 2480356358 │   8 │
│ 2522746544 │   9 │
│ 2614372352 │  10 │
└────────────┴─────┘
10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.)

Spark:

spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
2100162604      1
1969757069      2
2081371530      3
2156878512      4
2476890621      5
2526662896      6
2489723244      7
2480356358      8
2522746544      9
2614372352      10
Time taken: 792.552 seconds, Fetched 10 row(s)

Query 3

SELECT
path,
count(*),
sum(hits) AS sum_hits,
round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100;

ClickHouse:

:) SELECT
:-]     path,
:-]     count(*),
:-]     sum(hits) AS sum_hits,
:-]     round(sum(hits) / count(*), 2) AS hit_ratio
:-] FROM wikistat
:-] WHERE (project = 'en')
:-] GROUP BY path
:-] ORDER BY sum_hits DESC
:-] LIMIT 100;
SELECT
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100
┌─path───────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
│ Special:Search     │   44795 │ 4544605711 │ 101453.41 │
│ Main_Page          │   31930 │ 2115896977 │  66266.74 │
│ Special:Random     │   30159 │  533830534 │  17700.54 │
│ Wiki               │   10237 │   40488416 │   3955.11 │
│ Special:Watchlist  │   38206 │   37200069 │    973.67 │
│ YouTube            │    9960 │   34349804 │   3448.78 │
│ Special:Randompage │    8085 │   28959624 │    3581.9 │
│ Special:AutoLogin  │   34413 │   24436845 │    710.11 │
│ Facebook           │    7153 │   18263353 │   2553.24 │
│ Wikipedia          │   23732 │   17848385 │    752.08 │
│ Barack_Obama       │   13832 │   16965775 │   1226.56 │
│ index.html         │    6658 │   16921583 │   2541.54 │
…
100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)

Spark:

spark-sql> SELECT
         >     path,
         >     count(*),
         >     sum(hits) AS sum_hits,
         >     round(sum(hits) / count(*), 2) AS hit_ratio
         > FROM wikistat
         > WHERE (project = 'en')
         > GROUP BY path
         > ORDER BY sum_hits DESC
         > LIMIT 100;
...
Time taken: 2522.903 seconds, Fetched 100 row(s)

 
