Nov 01, 2018

How To Best Use Percona Server Column Compression With Dictionary


Very often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much data as possible in the buffer pool, and one of them is data compression.

With regular MySQL, to compress InnoDB data you can either use “Barracuda page compression” or “transparent page compression with punch holes”. The use of the ZFS filesystem is another possibility, but it is external to MySQL and doesn’t help with caching. All these solutions are transparent, but often they also have performance and management implications. If you are using Percona Server for MySQL, you have yet another option, “column compression with dictionary”. This feature is certainly not receiving the attention it merits. I think it is really cool—let me show you why.
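
For reference, here is roughly what those two stock MySQL alternatives look like in SQL. This is only a sketch with made-up table definitions; the Percona Server column compression syntax is shown later in this post.

-- Barracuda page compression: classic InnoDB compressed tables
CREATE TABLE t_barracuda (
  id  INT NOT NULL AUTO_INCREMENT,
  doc TEXT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Transparent page compression with punch holes (MySQL 5.7+)
CREATE TABLE t_punch_hole (
  id  INT NOT NULL AUTO_INCREMENT,
  doc TEXT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB COMPRESSION='zlib';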

We all know what compression means; who has not zipped a file before attaching it to an email? Compression removes redundancy from a file. What about the dictionary? A compression dictionary is a way to seed the compressor with expected patterns, in order to improve the compression ratio. Because you can specify a dictionary, the scope of usefulness of the Percona Server for MySQL column compression feature is greatly increased. In the following sections, we’ll review the impact of a good dictionary, and devise a way to create a good one without any guessing.

A simple use case

A compression algorithm needs a minimal amount of data in order to achieve a reasonable compression ratio. Typically, if the object is below a few hundred bytes, there is rarely enough data to contain repetitive patterns, and once the compression header is added, the compressed data can end up larger than the original.

mysql> select length('Hi!'), length(compress('Hi!'));
+---------------+-------------------------+
| length('Hi!') | length(compress('Hi!')) |
+---------------+-------------------------+
|             3 |                      15 |
+---------------+-------------------------+
1 row in set (0.02 sec)

Compressing a string of three bytes results in a binary object of 15 bytes. That’s counterproductive.
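
For contrast, the same built-in function does help once the input is long and repetitive enough; a quick illustrative check (the repeated string is arbitrary):

-- ~600 bytes of highly repetitive input compresses to far fewer bytes
SELECT LENGTH(REPEAT('Hi!', 200))           AS original_length,
       LENGTH(COMPRESS(REPEAT('Hi!', 200))) AS compressed_length;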

In order to illustrate the potential of the dictionary, I used this dataset:

http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz

It is a set of 100k Jeopardy questions written in JSON. To load the data in MySQL, I created the following table:

mysql> show create table TestColCompression\G
*************************** 1. row ***************************
Table: TestColCompression
Create Table: CREATE TABLE `TestColCompression` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question` text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=79977 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Then, I did some formatting to create insert statements:

zcat JEOPARDY_QUESTIONS1.json.gz | perl -p -e 's/\[\{/\{/g' | perl -p -e 's/\}, \{/\}\n\{/g' | perl -p -e "s/'/''/g" | \
  (while read line; do echo "insert into TestColCompression (question) values ('$line');"; done )

And I executed the inserts. About 20% of the rows had some formatting issues but nevertheless, I ended up with close to 80k rows:

mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78110
Avg_row_length: 316
Data_length: 24690688
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:16:41
Update_time: 2018-10-26 15:40:34
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

The average row length is 316 bytes for a total data size of 23.55MB. The question JSON objects are large enough to matter, but barely large enough for compression. Here are the first five rows:

mysql> select question from TestColCompression limit 5\G
*************************** 1. row ***************************
question: {"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
*************************** 2. row ***************************
question: {"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
*************************** 3. row ***************************
question: {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
*************************** 4. row ***************************
question: {"category": "OLD FOLKS IN THEIR 30s", "air_date": "2009-05-08", "question": "'The district of conservative rep. Patrick McHenry in this state includes Mooresville, a home of NASCAR'", "value": "$800", "answer": "North Carolina", "round": "Jeopardy!", "show_number": "5690"}
*************************** 5. row ***************************
question: {"category": "MOVIES & TV", "air_date": "2009-05-08", "question": "'Tim Robbins played a public TV newsman in "Anchorman: The Legend of" him'", "value": "$800", "answer": "Ron Burgundy", "round": "Jeopardy!", "show_number": "5690"}

Let’s begin with straight column compression, without specifying a dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED;
Query OK, 79976 rows affected (4.25 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table                      | Op      | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status   | OK       |
+----------------------------+---------+----------+----------+
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78995
Avg_row_length: 259
Data_length: 20496384
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:47:56
Update_time: 2018-10-26 15:47:56
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
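
As an aside, the ratios quoted below are simply the compressed figures divided by the uncompressed baseline above (316 bytes average row length). Instead of eyeballing SHOW TABLE STATUS after each ALTER, the same numbers can be pulled from information_schema; a minimal sketch, assuming the schema is named colcomp:

SELECT table_name,
       data_length,
       avg_row_length,
       ROUND(avg_row_length / 316, 2) AS ratio_vs_uncompressed
  FROM information_schema.tables
 WHERE table_schema = 'colcomp'
   AND table_name = 'TestColCompression';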

As expected, the data didn’t compress much. The compression ratio is 0.82, or a saving of about 18%. Since the JSON headers are always the same, and are present in all questions, we should at a minimum use them for the dictionary. Trying a minimal dictionary made of the headers gives:

mysql> SET @dictionary_data = 'category' 'air_date' 'question' 'value' 'answer' 'round' 'show_number' ;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE COMPRESSION_DICTIONARY simple_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY simple_dictionary;
Query OK, 79976 rows affected (4.72 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table                      | Op      | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status   | OK       |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78786
Avg_row_length: 246
Data_length: 19447808
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:58:17
Update_time: 2018-10-26 17:58:17
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

There is a little progress: we now have a compression ratio of 0.79. Obviously we could do more, but without a tool we would have to guess. A compressor like zlib builds a dictionary as part of its compression effort; could we use that? Yes, but only if we can generate it correctly and access the result. That’s not readily available with the common compressors I know of. Fortunately, someone else had the same issue and wrote a compressor able to save its dictionary. Please let me introduce femtozip.

Femtozip to the rescue

The tool, by itself, has no magic algorithm; from what I can understand from the code, it is based on zlib. Anyway, we won’t compress anything with it; we’ll use it to generate a good dictionary. To create a dictionary, the tool looks at a set of files and tries to find patterns shared between them. Using a single big file defeats the purpose. So, I generated one file per question with:

mkdir questions
cd questions
l=1; mysql -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo "$line" > ${l}; let l=l+1; done)

Then, I used the following command to generate a 1024-byte dictionary using all the files whose names start with “1”:

../femtozip/cpp/fzip/src/fzip --model ../questions_1s.mod --build --dictonly --maxdict 1024 1*
Building dictionary...

In about 10s the job was done. I tried with all the 80k files and… I had to kill the process after thirty minutes. Anyway, there are 11111 files starting with “1”, a very decent sample. Our generated dictionary looks like:

cat ../questions_1s.mod
", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "'e", "round": "Jeopardy!", "show_number": "r", "round": "{"cate gory": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "'Jeopardy!", "show_number": "2'", "value": "$1000", "answer": "7", "question": "'The ", "question": "'A'", "value": "$600", "answer": "9", "questi on": "'In ", "question": "'This 3", "question": "'2", "question": "'e'", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"'", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s'", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy !", "show_number": "3", "round": "Jeopardy!", "show_number": "5'", "value": "$200", "answer": "'", "value": "$800", "answer": "'", "value": "$400", "answer": "

With some formatting, I was able to create a dictionary with the above data:

mysql> SET @dictionary_data = '", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "''e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "''Jeopardy!", "show_number": "2''", "value": "$1000", "answer": "7", "question": "''The ", "question": "''A''", "value": "$600", "answer": "9", "question": "''In ", "question": "''This 3", "question": "''2", "question": "''e''", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"''", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s''", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5''", "value": "$200", "answer": "''", "value": "$800", "answer": "''", "value": "$400", "answer": "' ;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
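
As a side note, the quote-doubling above can be avoided entirely by loading the dictionary file server-side; a minimal sketch, assuming the file is readable by mysqld and its location is allowed by secure_file_priv:

-- LOAD_FILE() needs the FILE privilege and a path permitted by secure_file_priv
SET @dictionary_data = LOAD_FILE('/var/lib/mysql-files/questions_1s.mod');
CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data);
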
And then, I altered the table to use the new dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY femtozip_dictionary;
Query OK, 79976 rows affected (4.05 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table                      | Op      | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status   | OK       |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 79861
Avg_row_length: 190
Data_length: 15220736
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:56:09
Update_time: 2018-10-26 17:56:09
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

That’s interesting: we are now achieving a ratio of 0.61, a significant improvement. I pushed my luck and tried a 2048-byte dictionary. It further reduced the ratio to 0.57, but that was about the best I got: larger dictionaries didn’t lower the ratio below 0.57. Zlib supports dictionaries of up to 32KB.
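
To double-check which dictionaries exist and which columns use them, Percona Server 5.7 exposes this metadata in information_schema. A sketch, assuming the XTRADB_ZIP_DICT views of Percona Server 5.7 (the names differ in newer releases):

-- Dictionaries and their raw contents
SELECT * FROM information_schema.XTRADB_ZIP_DICT;

-- Which table/column pairs are bound to which dictionary
SELECT * FROM information_schema.XTRADB_ZIP_DICT_COLS;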

So, to recap:

  • column compression without dictionary, ratio of 0.82
  • column compression with simple dictionary, ratio of 0.79
  • column compression with a 1k dictionary from femtozip, ratio of 0.61
  • column compression with a 2k dictionary from femtozip, ratio of 0.57

The above example stores a JSON document in a text column. MySQL 5.7 includes a JSON datatype which behaves a bit differently regarding the dictionary: delimiting characters like ‘{}’ are removed from the on-disk representation of a JSON column. If you have TBs of data in similar tables, you should really consider column compression, along with a systematic way of determining the dictionary with femtozip. In addition to improving compression, it is likely to be the least performance-impacting solution. Wouldn’t it be interesting to generate a dictionary from existing data with a command like this one?

CREATE COMPRESSION_DICTIONARY_FROM_DATA A_good_dictionary (2048, select questions from TestColCompression limit 10000);

where the dictionary creation process would implicitly include steps similar to the ones I did with femtozip.
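
In the meantime, sampling the data for femtozip can at least be driven from SQL; a rough sketch (the output path and sample size are arbitrary, and the sample still has to be split into one file per document afterwards):

-- Dump a random sample of documents to train a dictionary on
SELECT question
  FROM TestColCompression
 ORDER BY RAND()
 LIMIT 10000
  INTO OUTFILE '/var/lib/mysql-files/question_sample.txt';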

Nov 20, 2017

InnoDB Page Compression: the Good, the Bad and the Ugly

In this blog post, we’ll look at some of the facets of InnoDB page compression.

Somebody recently asked me about the best way to handle JSON data compression in MySQL. I took a quick look at InnoDB page compression and wanted to share my findings.

There is also some great material on this topic that was prepared and presented by Yura Sorokin at Percona Live Europe 2017: https://www.percona.com/live/e17/sessions/percona-xtradb-compressed-columns-with-dictionaries-an-alternative-to-innodb-table-compression. Yura also implemented Compressed Columns in Percona Server.

First, the good part.

InnoDB page compression is actually really easy to use and provides a decent compression ratio. To use it, I just ran

CREATE TABLE commententry (...) COMPRESSION="zlib";

and that’s all. By the way, for my experiment I used the subset of Reddit comments stored in JSON (described here: Big Dataset: All Reddit Comments – Analyzing with ClickHouse).

This method got me a compressed table of 3.9GB, compared to 8.4GB for the uncompressed table: about a 2.15x compression ratio.
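
If the table already exists, the same attribute can be applied after the fact. A sketch; note that the COMPRESSION attribute only affects pages written after the ALTER, so a rebuild is needed to compress the existing data:

ALTER TABLE commententry COMPRESSION='zlib';
-- Only pages written after the ALTER are compressed; rebuild the table so
-- existing pages get rewritten (and hole-punched) in compressed form
OPTIMIZE TABLE commententry;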

Now, the bad part.

As InnoDB page compression uses “hole punching,” the standard Linux utils do not always properly support files created this way. In fact, to see the size “3.9GB” I had to use

du --block-size=1 tablespace_name.ibd

as the standard

ls -l tablespace_name.ibd

shows the wrong size (8.4GB). There is a similar limitation on copying files. The standard way

cp old_file new_file

may not always work, and to be sure I had to use

cp --sparse=always old_file new_file
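
The same discrepancy between apparent and allocated size can also be seen from inside MySQL; a sketch, assuming the MySQL 5.7 information_schema view (renamed INNODB_TABLESPACES in 8.0) and guessing at the tablespace name:

SELECT NAME,
       FILE_SIZE / 1024 / 1024 / 1024      AS apparent_gb,
       ALLOCATED_SIZE / 1024 / 1024 / 1024 AS on_disk_gb
  FROM information_schema.INNODB_SYS_TABLESPACES
 WHERE NAME LIKE '%commententry%';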

Speaking about copying, here’s the ugly part.

The actual time to copy the sparse file was really bad.

On a fairly fast device (a Samsung SM863), copying the sparse file mentioned above in its compressed size of 3.9GB took 52 minutes! That’s shocking, so let me repeat it again: 52 minutes to copy a 3.9GB file on an enterprise SATA SSD.

By comparison, copying the regular 8.4GB file takes 9 seconds. Compare 9 seconds with 52 minutes!

To be fair, the NVMe device (Intel® SSD DC D3600) handles sparse files much better. It took only 12 seconds to copy the same sparse file on this device.

Having considered all this, it is hard to recommend that you use InnoDB page compression for serious production. Well, unless you power your database servers with NVMe storage.

For JSON data, the Compressed Columns feature in Percona Server for MySQL should work quite well, using a dictionary to store the JSON keys. Give it a try!
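
As a concrete illustration of that recommendation, a compressed-column version of the comments table could look something like this. A sketch only: the table layout and the JSON key list used as the dictionary are assumptions, not measured choices.

-- Seed the dictionary with the JSON keys expected in every comment document
SET @reddit_keys = '"id": "parent_id": "author": "subreddit": "body": "created_utc": "score":';
CREATE COMPRESSION_DICTIONARY reddit_keys (@reddit_keys);

CREATE TABLE commententry_cc (
  id  BIGINT NOT NULL AUTO_INCREMENT,
  doc TEXT NOT NULL COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY reddit_keys,
  PRIMARY KEY (id)
) ENGINE=InnoDB;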

Mar 09, 2016

Evaluating Database Compression Methods

Vadim Tkachenko and I have been working with Fractal Tree® storage engines (Fractal Tree engines are available in Percona Server for MySQL and MongoDB as TokuDB and PerconaFT, respectively). While doing so, we’ve become interested in evaluating database compression methods, to see how to make compression algorithms work even better than they do currently.

In this blog post, I will discuss what we found in our compression research.

Introduction

Before we get to evaluating database compression methods, let’s review what compression properties are most relevant to databases in general. The first thing to consider is compression and decompression performance. Databases tend to be very sensitive to decompression performance, as it is often done in the “foreground” – adding to client response latency. Compression performance, on the other hand, is less critical because it can typically run in the background without adding client latency. It can, however, cause an issue if the database fills its data compression queue and “chokes.” The database workload also affects compression performance demands. If the data is loaded only once and essentially becomes read only, it might make sense to spend extra time compressing it – as long as the better compression ratio is achieved without impact to decompression speed.

The next important thing to consider is the compression block size, which can significantly affect compression ratio and performance. In some cases, the compression block size is fixed. Most InnoDB installations, for example, use a 16KB block size. In MySQL 5.7 it is possible to change block size from 4KB to 64KB, but since this setting applies to the whole MySQL instance it isn’t commonly used. TokuDB and PerconaFT allow a much more flexible compression block size configuration. Larger compression block sizes tend to give a better compression ratio and may be more optimal for sequential scan workloads, but if you’re accessing data in a random fashion you may see significant overhead as the complete block typically must be decompressed.
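
For a sense of scale, the relevant knobs look something like this. A sketch: the InnoDB variable is standard, but the TokuDB variable names are quoted from memory, so verify them against your Percona Server build.

-- InnoDB: one page size per instance, fixed when the data directory is initialized
SELECT @@innodb_page_size;

-- TokuDB block sizing is tunable per instance/session
-- (variable names assumed; check SHOW VARIABLES LIKE 'tokudb%block%')
SELECT @@global.tokudb_block_size, @@global.tokudb_read_block_size;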

Of course, compression will also depend on the data you’re compressing, and different algorithms may be more optimal at handling different types of data. Additionally, different data structures in databases may structure data more or less optimally for compression. For example, if a database already implements prefix compression for data in the indexes, indexes are likely to be less compressible with block compression systems.

Let’s examine what choices we have when it comes to the compression algorithm selection and configuration. Typically for a given block size – which is essentially a database configuration setting – you will have a choice to select the compression algorithm (such as zlib), the library version and the compression level.  

Comparing different algorithms was tough until lzbench was introduced. lzbench allows for a simple comparison of different compression libraries through a single interface.

For our test, we loaded different kinds of data in an uncompressed InnoDB table and then used it as a source for lzbench:

./lzbench -equicklz,1/zstd,1/snappy/lzma,1/zlib,1/lz4,1/brotli,1 -o3 -b16 data.ibd

This method is a good way to represent database structures and is likely to be more realistic than testing compression on the source text files.

All results shown here are for “OnTime Air Performance.” We tried a variety of data, and even though the numbers varied the main outcomes are the same. You can see results for our other data types in this document.

The results for compression are heavily CPU dependent. All the data below comes from single-threaded compression benchmarks run on an Intel Xeon E5-2643 v2 @ 3.5GHz.

Below are some of the most interesting results we found.

Comparing Compression Algorithms

[Chart: compression and decompression speed by compression algorithm (16KB blocks, low compression level)]

Using a standard 16KB block size and a low level of compression, we can see that there is a huge variety of compression and decompression speeds. The results ranged from 30MB per second for LZMA to more than 1GB per second for LZ4 for compression, and from 100MB per second to 3.5GB per second for decompression (for the same pair).

Now let’s look at the compression ratios achieved.

[Chart: compression ratio by compression algorithm]

You can see a large variety of outcomes for this data set as well, with ratios ranging from 1.89:1 (LZ4) to  6.57:1 (LZMA). Notice how the fastest and slowest compression libraries achieve the worst and best compression: better compression generally comes at disproportionately more CPU usage. Achieving 3.5 times more compression (LZMA) requires spending 37 times more CPU resources. This ratio, though, is not at all fixed: for example, Brotli provides 2.9 times better compression at 9 times higher CPU cost while Snappy manages to provide 1.9 times better compression than LZ4 with only 1.7 times more CPU cost.

[Chart: ratio of decompression speed to compression speed by algorithm]

Another interesting compression algorithm property is how much faster decompression is than compression. It is interesting to see there is not as large a variance between compression algorithms, which implies that the default compression level is chosen in such a way that compression is 2 to 3.5 times slower than decompression.

Block Size Impact on Compression

Now let’s look at how the compression block size affects compression and decompression performance.


On-Time Performance Data Compression Speed vs Block Size (MB/sec)
Compression Method            4KB      16KB     64KB     128KB    256KB    256KB/4KB
quicklz 1.5.0 level 1         128.62   299.42   467.9    518.97   550.8    4.28
zstd v0.4.1 level 1           177.77   304.16   357.38   396.65   396.02   2.23
snappy 1.1.3                  674.99   644.08   622.24   626.79   629.83   0.93
lzma 9.38 level 1             18.65    30.23    36.43    37.44    38.01    2.04
zlib 1.2.8 level 1            64.73    110.34   128.85   124.74   124.1    1.92
lz4 r131                      996.11   1114.35  1167.11  1067.69  1043.86  1.05
brotli 2015-10-29 level 1     64.92    123.92   170.52   177.1    179.51   2.77

 

[Chart: On-Time Performance Data Decompression Speed vs Block Size (MB/sec)]


If we look at compression and decompression speed versus block size, we can see that there is a difference both for compression and decompression, and that it depends a lot on the compression algorithm. QuickLZ, using these settings, compresses 4.3 times faster with 256KB blocks rather than 4KB blocks. It is interesting that LZ4, which I would consider a “similar” fast compression algorithm, is not at all similar, demonstrating only minimal changes in compression and decompression performance with increased block size.

Snappy is perhaps the most curious compression algorithm of them all. It has lower performance when compressing and decompressing larger blocks.

Let’s examine how compression ratio varies with different block sizes.


On-Time Performance Data Compression Ratio vs Block Size
Compression Method            4KB    16KB   64KB   128KB  256KB  256KB/4KB
quicklz 1.5.0 level 1         3.09   3.91   4.56   4.79   4.97   1.61
zstd v0.4.1 level 1           3.95   5.24   6.41   6.82   7.17   1.82
snappy 1.1.3                  2.98   3.65   4.21   4.21   4.21   1.41
lzma 9.38 level 1             4.86   6.57   7.96   8.43   8.71   1.79
zlib 1.2.8 level 1            3.79   4.73   5.33   5.44   5.50   1.45
lz4 r131                      1.75   1.89   1.99   2.00   2.01   1.15
brotli 2015-10-29 level 1     4.12   5.47   6.61   7.00   7.35   1.78

We can see all the compression libraries perform better with larger block sizes, though how much better varies. LZ4 only benefits a little from larger blocks, with only a 15% better compression ratio between 4KB to 256KB, while Zstd, Brotli and LZMA all get about an 80% better compression ratio with large block sizes. This is another area where I would expect results to be data dependent. With highly repetitive data, gains are likely to be more significant with larger block sizes.

Compression library gains from larger block sizes decrease as the base block sizes increase. For example most compression libraries are able to get at least a 20% better compression ratio going from 4KB to 16KB block size, however going from 64KB to 256KB only allows for a 4-6% better compression ratio – at least for this data set.

Compression Level Impact

Now let’s review what the compression level does to compression performance and ratios.

[Chart: compression speed vs compression level (MB/sec)]

Compression Method     1        2        3        4        5        6        7        8        9        Max
zstd v0.4.1            404.25   415.92   235.32   217.69   207.01   146.96   124.08   94.93    82.43    21.87
lzma 9.38              39.1     37.96    36.52    35.07    30.85    3.69     3.69     3.69     3.69     3.69
zlib 1.2.8             120.25   114.52   84.14    76.91    53.97    33.06    25.94    14.77    6.92     6.92
brotli 2015-10-29      172.97   179.71   148.3    135.66   119.74   56.08    50.13    29.4     35.46    0.39

Note: not every compression algorithm provides level selection, so we’re only looking at the ZSTD, LZMA, ZLIB and BROTLI compression libraries. Also, not every library provides ten compression levels. If more than ten levels were available, the first nine and the maximum compression level were tested. If fewer than ten levels were available (as with LZMA), the result for the maximum compression level was used to fill the gaps.

As you might expect, higher compression levels generally mean slower compression. For most compression libraries, the difference between the fastest and slowest compression level is 10-20 times – with the exception of Brotli where the highest compression level means really slow compression (more than 400 times slower than fastest compression).

[Chart: decompression speed vs compression level (MB/sec)]

Compression Method     1        2        3        4        5        6        7        8        9        Max
zstd v0.4.1            827.61   848.71   729.93   809.72   796.61   904.85   906.55   843.01   894.91   893.31
lzma 9.38              128.91   142.28   148.57   148.72   148.75   157.67   157.67   157.67   157.67   157.67
zlib 1.2.8             386.59   404.28   434.77   415.5    418.28   438.07   441.02   448.56   453.64   453.64
brotli 2015-10-29      476.89   481.89   543.69   534.24   512.68   505.55   513.24   517.55   521.84   499.26

This is where things get really interesting. With a higher compression level, decompression speed doesn’t change much – if anything, it becomes higher. If you think about it, it makes sense: during the compression phase we’re searching for patterns in the data and building some sort of dictionary, and extensive pattern searches can be very slow. Decompression, however, just restores the data using that dictionary and doesn’t spend much time finding data patterns. The smaller the compressed data is, the better the decompression performance should be.

Let’s examine the compression ratio.

[Chart: compression ratio vs compression level]

Compression Method     1       2       3       4       5       6       7       8       9       Max
zstd v0.4.1            7.17    7.20    6.98    7.05    7.11    7.62    7.76    7.89    7.89    8.16
lzma 9.38              8.20    8.71    8.95    8.96    8.96    10.45   10.45   10.45   10.45   10.45
zlib 1.2.8             5.50    5.71    5.97    6.27    6.50    6.80    6.88    7.01    7.09    7.09
brotli 2015-10-29      7.35    7.35    7.41    7.46    7.51    8.70    8.76    8.80    8.83    10.36

As we can see, higher compression levels indeed improve the compression ratio most of the time. The ZSTD library seems to be a strange exception, where a higher compression level does not always mean a better ratio. We can also see that BROTLI’s extremely slow compression mode can produce a significant boost to compression, getting it to the level of compression LZMA achieves – quite an accomplishment.

Different compression levels don’t have the same effect on compression ratios as different compression methods do. While we saw a 3.5 times compression ratio difference between LZ4 and LZMA, the largest compression ratio difference between the fastest and slowest mode is 1.4x (for Brotli), with a 20-30% improvement in compression ratio being more typical.

An important point, however, is that the compression ratio improvement from higher compression levels comes at no decompression slowdown – in contrast to using a more complicated compression algorithm to achieve better compression.

In practice, this means having control over the compression level is very important, especially for workloads where data is written once and read frequently. In that case, you can afford a higher compression level, since the data does not need to be recompressed often. Another factor is that the compression level is very easy to change dynamically, unlike the compression algorithm. In theory, a database engine could dynamically choose the compression level based on the workload: a higher compression level when there are plenty of spare CPU resources, and a lower compression level when the system can’t keep up with compressing data.

Records

It is interesting to note a few records generated from all of these tests. Among all the methods tried, the lowest compression ratio was LZ4 with a 4KB block size, providing a 1.75 compression ratio. The highest ratio was LZMA with a 256KB block size, providing a maximum compression ratio of 10.45.

LZ4 is fastest both in compression and decompression, showing the best compression speed of 1167MB per second with a 64KB block size, and a decompression speed of 3666MB per second with a 16KB block size.   

LZMA is generally the slowest at both compression and decompression, compressing at 0.88MB per second with a 16KB block size and the maximum compression level, and decompressing at 82MB per second with a 4KB block size. Only Brotli at its highest compression level compressed data more slowly (at 0.39MB per second).

Looking at these numbers, we see a three-orders-of-magnitude difference in compression performance, and a 50 times difference in decompression performance. This demonstrates how the right compression algorithm choices and settings can make or break compression for your application.

Recommendations

  • Snappy looks like a great and fast compression algorithm, offering a pretty decent compression with performance that is likely to be good enough for most workloads.
  • Zstd is new and not yet 100% stable, but once it is, it will be a great replacement for zlib as a general purpose compression algorithm. It gives a better compression ratio and better compression and decompression speed. At higher levels of compression, it is able to get close to LZMA’s compression ratio, at least for some kinds of data, while having a much better decompression speed.
  • LZMA remains the choice when you want the highest compression ratio at all costs. However, those costs are real: LZMA is slow both for compression and decompression. More often than not, LZMA is chosen without a clear understanding of how slow it is, leading to performance issues.
  • Generally, it’s better to get the compression ratio you’re looking for by adjusting the compression level rather than by the type of algorithm, as the compression level affects compression performance more – and may even positively impact decompression performance.
  • If your system allows it, choosing larger block sizes can be best for your workload. Larger block sizes generally have a better compression ratio and a better compression and decompression speed. However, if you have many “random” data lookups, constant decompression of large blocks of data is likely to negate all of those benefits.  
  • In our tests, block sizes up to 64K provided the most benefit, with further increases showing minimal impact. It is possible, however, that these diminishing returns on adjusting the block size upward depend significantly on the data type.

Data and Details

Raw results, data sources and additional details for our evaluation of database compression methods are available in Google Docs.
