Very often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much data as possible in the buffer pool, and one of them is data compression.
With regular MySQL, to compress InnoDB data you can either use “Barracuda page compression” or “transparent page compression with punch holes”. The use of the ZFS filesystem is another possibility, but it is external to MySQL and doesn’t help with caching. All these solutions are transparent, but often they also have performance and management implications. If you are using Percona Server for MySQL, you have yet another option, “column compression with dictionary”. This feature is certainly not receiving the attention it merits. I think it is really cool—let me show you why.
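To give a sense of how these options differ, here is roughly what each looks like syntactically. This is only an illustration: the table and column names are made up, and the attributes you would actually pick (KEY_BLOCK_SIZE, compression algorithm, dictionary) depend on your data.

-- Barracuda page compression (regular MySQL/InnoDB)
CREATE TABLE t1 (id INT PRIMARY KEY, doc TEXT)
  ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Transparent page compression with punch holes (regular MySQL/InnoDB)
CREATE TABLE t2 (id INT PRIMARY KEY, doc TEXT)
  ENGINE=InnoDB COMPRESSION='zlib';

-- Column compression, optionally with a dictionary (Percona Server for MySQL)
CREATE TABLE t3 (id INT PRIMARY KEY, doc TEXT COLUMN_FORMAT COMPRESSED)
  ENGINE=InnoDB;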
We all know what compression means; who has not zipped a file before attaching it to an email? Compression removes redundancy from a file. What about the dictionary? A compression dictionary is a way to seed the compressor with expected patterns in order to improve the compression ratio. Because you can specify a dictionary, the usefulness of the Percona Server for MySQL column compression feature is greatly increased. In the following sections, we’ll review the impact of a good dictionary, and devise a way to create a good one without any guessing.
A simple use case
A compression algorithm needs a minimal amount of data in order to achieve a reasonable compression ratio. Typically, if the object is below a few hundred bytes, there is rarely enough repetition to exploit, and once the compression header is added, the compressed data can end up larger than the original.
mysql> select length('Hi!'), length(compress('Hi!'));
+---------------+-------------------------+
| length('Hi!') | length(compress('Hi!')) |
+---------------+-------------------------+
|             3 |                      15 |
+---------------+-------------------------+
1 row in set (0.02 sec)
Compressing a string of three bytes results in a binary object of 15 bytes. That’s counterproductive.
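For contrast, as soon as there is some repetition to work with, compression easily pays off. A quick check with the same COMPRESS() function (I am not quoting an exact compressed size here, but it will be a small fraction of the 300 uncompressed bytes):

select length(repeat('Hi!', 100)), length(compress(repeat('Hi!', 100)));
-- repeat() produces a 300-byte string made of the same 3-byte pattern,
-- so compress() shrinks it dramatically despite the header overhead.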
In order to illustrate the potential of the dictionary, I used this dataset:
http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz
It is a set of 100k Jeopardy questions written in JSON. To load the data in MySQL, I created the following table:
mysql> show create table TestColCompression\G
*************************** 1. row ***************************
       Table: TestColCompression
Create Table: CREATE TABLE `TestColCompression` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `question` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=79977 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Then, I did some formatting to create insert statements:
zcat JEOPARDY_QUESTIONS1.json.gz | perl -p -e 's/\[\{/\{/g' | perl -p -e 's/\}, \{/\}\n\{/g' | perl -p -e "s/'/''/g" | \
(while read line; do echo "insert into TestColCompression (question) values ('$line');"; done )
And I executed the inserts. About 20% of the rows had some formatting issues, but I nevertheless ended up with close to 80k rows:
mysql> show table status\G
*************************** 1. row ***************************
           Name: TestColCompression
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 78110
 Avg_row_length: 316
    Data_length: 24690688
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: 79977
    Create_time: 2018-10-26 15:16:41
    Update_time: 2018-10-26 15:40:34
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)
The average row length is 316 bytes, for a total data size of 23.55MB. The question JSON objects are large enough to matter, but barely large enough to compress well on their own. Here are the first five rows:
mysql> select question from TestColCompression limit 5\G
*************************** 1. row ***************************
question: {"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
*************************** 2. row ***************************
question: {"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
*************************** 3. row ***************************
question: {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
*************************** 4. row ***************************
question: {"category": "OLD FOLKS IN THEIR 30s", "air_date": "2009-05-08", "question": "'The district of conservative rep. Patrick McHenry in this state includes Mooresville, a home of NASCAR'", "value": "$800", "answer": "North Carolina", "round": "Jeopardy!", "show_number": "5690"}
*************************** 5. row ***************************
question: {"category": "MOVIES & TV", "air_date": "2009-05-08", "question": "'Tim Robbins played a public TV newsman in "Anchorman: The Legend of" him'", "value": "$800", "answer": "Ron Burgundy", "round": "Jeopardy!", "show_number": "5690"}
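As an aside, if you prefer a query over eyeballing SHOW TABLE STATUS, the same (estimated) figures are available from information_schema; the colcomp schema name matches the database used throughout this post:

select table_rows, avg_row_length, data_length
from information_schema.tables
where table_schema = 'colcomp' and table_name = 'TestColCompression';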
Let’s begin with straight column compression, without specifying a dictionary:
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED;
Query OK, 79976 rows affected (4.25 sec)
Records: 79976  Duplicates: 0  Warnings: 0

mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table                      | Op      | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status   | OK       |
+----------------------------+---------+----------+----------+

mysql> show table status\G
*************************** 1. row ***************************
           Name: TestColCompression
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 78995
 Avg_row_length: 259
    Data_length: 20496384
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: 79977
    Create_time: 2018-10-26 15:47:56
    Update_time: 2018-10-26 15:47:56
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)
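One way to compute the compression ratio is to divide the new average row length by the original one; using the estimates reported by the two SHOW TABLE STATUS outputs above, the arithmetic is simply:

select 259/316 as compression_ratio;
-- roughly 0.82, in line with the figure discussed below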
As expected, the data didn’t compress much. The compression ratio is 0.82 or, expressed as a saving, about 18%. Since the JSON keys are always the same and are present in all questions, we should at least use them for the dictionary. Trying a minimal dictionary made of those keys gives:
mysql> SET @dictionary_data = 'category' 'air_date' 'question' 'value' 'answer' 'round' 'show_number' ;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE COMPRESSION_DICTIONARY simple_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY simple_dictionary;
Query OK, 79976 rows affected (4.72 sec)
Records: 79976  Duplicates: 0  Warnings: 0

mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table                      | Op      | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status   | OK       |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)

mysql> show table status\G
*************************** 1. row ***************************
           Name: TestColCompression
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 78786
 Avg_row_length: 246
    Data_length: 19447808
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: 79977
    Create_time: 2018-10-26 17:58:17
    Update_time: 2018-10-26 17:58:17
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)
There is a little progress: we now have a compression ratio of 0.79. Obviously, we could do better, but without a tool we would have to guess. A compressor like zlib builds a dictionary as part of its compression effort; could we use that? Yes, but only if we can generate it correctly and access the result. That’s not readily available with the common compressors I know. Fortunately, someone else had the same issue and wrote a compressor able to save its dictionary. Let me introduce femtozip.
Femtozip to the rescue
The tool, by itself, has no magic algorithm. It is based on zlib, from what I understand from the code. Anyway, we won’t compress anything with it; we’ll use it to generate a good dictionary. In order to create a dictionary, the tool looks at a set of files and tries to find patterns shared between them. Using a single big file defeats the purpose. So, I generated one file per question with:
mkdir questions
cd questions
l=1; mysql -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo $line > ${l}; let l=l+1; done)
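One small caveat about that one-liner: by default the mysql client prints a header row when used with -e, so the first generated file contains only the column name. Adding -N (short for --skip-column-names) avoids it; a variant of the same command:

l=1; mysql -N -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo $line > ${l}; let l=l+1; done)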
Then, I used the following command to generate a 1024-byte dictionary using all the files whose names start with “1”:
../femtozip/cpp/fzip/src/fzip --model ../questions_1s.mod --build --dictonly --maxdict 1024 1*
Building dictionary...
In about 10 seconds, the job was done. I also tried with all 80k files and… I had to kill the process after thirty minutes. Anyway, there are 11,111 files starting with “1”, a very decent sample. The generated dictionary looks like this:
cat ../questions_1s.mod
", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "'e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "'Jeopardy!", "show_number": "2'", "value": "$1000", "answer": "7", "question": "'The ", "question": "'A'", "value": "$600", "answer": "9", "question": "'In ", "question": "'This 3", "question": "'2", "question": "'e'", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"'", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s'", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5'", "value": "$200", "answer": "'", "value": "$800", "answer": "'", "value": "$400", "answer": "
With some formatting, I was able to create a dictionary with the above data:
mysql> SET @dictionary_data = '", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "''e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "''Jeopardy!", "show_number": "2''", "value": "$1000", "answer": "7", "question": "''The ", "question": "''A''", "value": "$600", "answer": "9", "question": "''In ", "question": "''This 3", "question": "''2", "question": "''e''", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"''", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s''", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5''", "value": "$200", "answer": "''", "value": "$800", "answer": "''", "value": "$400", "answer": "' ; Query OK, 0 rows affected (0.00 sec) mysql> CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data); Query OK, 0 rows affected (0.00 sec) <\pre> And then, I altered the table to use the new dictionary:
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY femtozip_dictionary;
Query OK, 79976 rows affected (4.05 sec)
Records: 79976  Duplicates: 0  Warnings: 0

mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table                      | Op      | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status   | OK       |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)

mysql> show table status\G
*************************** 1. row ***************************
           Name: TestColCompression
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 79861
 Avg_row_length: 190
    Data_length: 15220736
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: 79977
    Create_time: 2018-10-26 17:56:09
    Update_time: 2018-10-26 17:56:09
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)
That’s interesting: we are now achieving a ratio of 0.61, a significant improvement. I pushed my luck and tried a 2048-byte dictionary. It further reduced the ratio to 0.57, but that was about the best I got; larger dictionaries didn’t lower the ratio below 0.57. Zlib supports dictionaries of up to 32KB.
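For reference, the 2048-byte dictionary was built with the same femtozip invocation, only the --maxdict value changes (the model file name here is simply my choice):

../femtozip/cpp/fzip/src/fzip --model ../questions_2k.mod --build --dictonly --maxdict 2048 1*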
So, to recap:
- column compression without dictionary, ratio of 0.82
- column compression with simple dictionary, ratio of 0.79
- column compression with a 1k dictionary from femtozip, ratio of 0.61
- column compression with a 2k dictionary from femtozip, ratio of 0.57
The above example stores a JSON document in a text column. MySQL 5.7 includes a JSON datatype that behaves a bit differently regarding the dictionary: delimiting characters like ‘{}’ are removed from the on-disk representation of a JSON column, so a dictionary built from the textual form is not a perfect match. If you have TBs of data in similar tables, you should really consider column compression, along with a systematic way of determining the dictionary with femtozip. In addition to improving the compression, it is likely to be the least performance-impacting solution. Would it be interesting to generate a dictionary from existing data with a command like this one?
CREATE COMPRESSION_DICTIONARY_FROM_DATA A_good_dictionary (2048, select question from TestColCompression limit 10000);
where the dictionary creation process would implicitly include steps similar to the ones I went through with femtozip.
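Until something like that exists, the manual workflow above is easy enough to script. Here is a rough sketch under the same assumptions as this post (the blog/blog credentials, the colcomp schema, femtozip built one directory up); the file and dictionary names are arbitrary, and quoting and error handling are deliberately minimal:

#!/bin/bash
# Build a column compression dictionary from a sample of existing rows.

mkdir -p questions && cd questions

# One file per document so femtozip can look for patterns across them;
# -N skips the column-name header row.
l=1
mysql -N -u blog -pblog colcomp -e 'select question from TestColCompression limit 10000' | \
  while read line; do echo "$line" > ${l}; l=$((l+1)); done

# Extract a 2048-byte dictionary from the sample.
../femtozip/cpp/fzip/src/fzip --model ../questions.mod --build --dictonly --maxdict 2048 *

# Turn the dictionary into the SET/CREATE statements used earlier,
# doubling single quotes so the string literal stays valid SQL.
dict=$(sed "s/'/''/g" ../questions.mod)
{
  echo "SET @dictionary_data = '${dict}';"
  echo "CREATE COMPRESSION_DICTIONARY a_good_dictionary (@dictionary_data);"
} > ../create_dictionary.sql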