This blog discusses column compression with an optional predefined dictionary.
Compress once, compress with different algorithms, compress again, compress multiple times: compression is a hot topic in our lives.
In general, testing new things is great if the processes are well-described and easy to follow. Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug friends.”
The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”
Remember: always test! No bugs, no happiness!
When you start to test, the first goal is to get an idea of what is going on. This post demonstrates a test scenario for column compression with an optional predefined dictionary. For background on column compression, read "Compressed columns with dictionaries."
To begin, let’s set up a basic environment:
First, install Percona Server; the process is already documented here -> PS 5.6 installation.
Second, take a look at an existing test for this feature: xtradb_compressed_columns_ibd_sizes.test.
Third, write a simple script to get started:
import mysql.connector

# Connection settings match the local sandbox used below; adjust as needed.
cnx = mysql.connector.connect(user='msandbox', password='msandbox',
                              host='127.0.0.1',
                              database='dbtest',
                              port=22896,
                              autocommit=True)
cursor = cnx.cursor()

# Create a compression dictionary seeded with frequently repeated data.
cursor.execute("CREATE COMPRESSION_DICTIONARY names ('Bartholomew')")

# t1: plain BLOB, t2: compressed column, t3: compressed column with dictionary.
cursor.execute("CREATE TABLE t1(id INT, a BLOB) ENGINE=InnoDB")
cursor.execute("CREATE TABLE t2(id INT, a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB")
cursor.execute("CREATE TABLE t3(id INT, a BLOB COLUMN_FORMAT COMPRESSED "
               "WITH COMPRESSION_DICTIONARY names) ENGINE=InnoDB")

insert_stmt = "INSERT INTO {} VALUES({}, REPEAT('Bartholomew', 128))"
for i in range(100000):
    for table in ('t1', 't2', 't3'):
        cursor.execute(insert_stmt.format(table, i))

cursor.close()
cnx.close()
As you might notice, column compression can be used with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create the dictionary from frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)
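Under the hood, this feature uses zlib with an optional preset dictionary. The effect of seeding the dictionary with frequently used data can be sketched in pure Python (the payload and dictionary strings here are illustrative, not what the server does byte-for-byte):

```python
import zlib

# The preset dictionary holds frequently used data, just like the
# CREATE COMPRESSION_DICTIONARY statement above.
payload = b"Bartholomew said hello to Bartholomew"
preset = b"Bartholomew"

# Compress without a dictionary...
plain = zlib.compress(payload)

# ...and with a preset dictionary.
co = zlib.compressobj(zdict=preset)
with_dict = co.compress(payload) + co.flush()

# Decompression must supply the same preset dictionary.
do = zlib.decompressobj(zdict=preset)
restored = do.decompress(with_dict)

print(len(payload), len(plain), len(with_dict))
```

The dictionary lets zlib back-reference data it has never actually seen in the stream, which is exactly why it pays off for many short values sharing common substrings.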
The result of running this script is:
The script populates three tables with 100,000 rows each:
t1 -> uncompressed
t2 -> compressed column
t3 -> compressed column using the 'names' compression dictionary
Table size difference:
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd
-rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd
-rw-rw---- 1 sh sh 15M Sep 29 23:43 t2.ibd
-rw-rw---- 1 sh sh 14M Sep 29 23:43 t3.ibd
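Instead of eyeballing ls output, the per-table file sizes can be collected with a small helper (ibd_sizes and its datadir argument are an illustrative sketch; point it at your own sandbox datadir):

```python
import os

def ibd_sizes(datadir):
    """Return {table_name: size_in_bytes} for every .ibd file in datadir."""
    return {entry.name[:-4]: entry.stat().st_size
            for entry in os.scandir(datadir)
            if entry.is_file() and entry.name.endswith('.ibd')}

# e.g. datadir = '.../master/data/dbtest' from the listing above
for table, size in sorted(ibd_sizes('.').items()):
    print('{}: {:.1f} MB'.format(table, size / (1024.0 * 1024)))
```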
After running OPTIMIZE TABLE on each of them:
master [localhost] {msandbox} (dbtest) > optimize table t1;
+-----------+----------+----------+-------------------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------+----------+----------+-------------------------------------------------------------------+
| dbtest.t1 | optimize | note | Table does not support optimize, doing recreate + analyze instead |
| dbtest.t1 | optimize | status | OK |
+-----------+----------+----------+-------------------------------------------------------------------+
2 rows in set (1 min 35.88 sec)
master [localhost] {msandbox} (dbtest) > optimize table t2;
+-----------+----------+----------+-------------------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------+----------+----------+-------------------------------------------------------------------+
| dbtest.t2 | optimize | note | Table does not support optimize, doing recreate + analyze instead |
| dbtest.t2 | optimize | status | OK |
+-----------+----------+----------+-------------------------------------------------------------------+
2 rows in set (11.82 sec)
master [localhost] {msandbox} (dbtest) > optimize table t3;
+-----------+----------+----------+-------------------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------+----------+----------+-------------------------------------------------------------------+
| dbtest.t3 | optimize | note | Table does not support optimize, doing recreate + analyze instead |
| dbtest.t3 | optimize | status | OK |
+-----------+----------+----------+-------------------------------------------------------------------+
2 rows in set (7.89 sec)
The resulting sizes:
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd
-rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd
-rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd
I want more. Let's also rebuild the tables with ROW_FORMAT=COMPRESSED:
master [localhost] {msandbox} (dbtest) > alter table t1 row_format=compressed;
Query OK, 0 rows affected (2 min 38.85 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (dbtest) > alter table t2 row_format=compressed;
Query OK, 0 rows affected (14.41 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (dbtest) > alter table t3 row_format=compressed;
Query OK, 0 rows affected (10.74 sec)
Records: 0 Duplicates: 0 Warnings: 0
Using ROW_FORMAT=COMPRESSED requires innodb_file_format to be newer than Antelope (i.e., Barracuda). This requirement does not apply to COLUMN_FORMAT COMPRESSED.
Again, check the size:
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh 76M Sep 29 23:57 t1.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd
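A quick sanity check on the numbers above (sizes in MB as reported by ls, which rounds, so the ratios are approximate):

```python
# Original uncompressed t1.ibd, and the final sizes after
# OPTIMIZE TABLE plus ROW_FORMAT=COMPRESSED, all in MB.
original_t1 = 168.0
final = {'t1': 76, 't2': 4, 't3': 4}

for table in sorted(final):
    ratio = original_t1 / final[table]
    print('{} is {:.1f}x smaller than the original t1'.format(table, ratio))
```

Row compression alone buys roughly a 2x reduction, while the compressed columns end up around 42x smaller than the original uncompressed table.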
Question: How do I get information about compression dictionaries and the tables that use them? Answer: query these information_schema tables:
master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict;
+----+--------+-------------+
| id | name | zip_dict |
+----+--------+-------------+
| 1 | names | Bartholomew |
| 2 | names2 | Bartholomew |
+----+--------+-------------+
2 rows in set (0.00 sec)
master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols;
+----------+------------+---------+
| table_id | column_pos | dict_id |
+----------+------------+---------+
| 67 | 1 | 1 |
+----------+------------+---------+
1 row in set (0.00 sec)
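To map tables to the dictionaries they use, the two tables above can be joined, resolving the table name through INNODB_SYS_TABLES; a sketch assuming the Percona Server 5.6 information_schema layout shown above:

```sql
SELECT t.name AS table_name, c.column_pos, d.name AS dict_name
FROM information_schema.xtradb_zip_dict_cols c
JOIN information_schema.xtradb_zip_dict d ON d.id = c.dict_id
JOIN information_schema.innodb_sys_tables t ON t.table_id = c.table_id;
```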
Question: How do I drop a compression dictionary? Answer: if it is still in use, you will get an error:
master [localhost] {msandbox} (dbtest) > drop COMPRESSION_DICTIONARY `names`;
ERROR 1894 (HY000): Compression dictionary 'names' is in use
Before dropping it, make sure there are no tables using the dictionary. There is an extreme condition where you are unable to drop the dictionary (see #1628824).
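You can check whether any column still references a dictionary before dropping it; a sketch using the information_schema tables shown earlier (names2 is the unused dictionary visible in the output above):

```sql
-- Count columns that still reference the dictionary.
SELECT COUNT(*) AS in_use
FROM information_schema.xtradb_zip_dict_cols c
JOIN information_schema.xtradb_zip_dict d ON d.id = c.dict_id
WHERE d.name = 'names2';

-- A zero count means the drop will succeed.
DROP COMPRESSION_DICTIONARY `names2`;
```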
Question: Great! How about mysqldump? Answer: read here: mysqldump.
I might make this the topic of a dedicated post. Thanks for reading!