The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades.
Migration to utf8mb4 has many advantages including:
- It can store more symbols, including emojis
- It has new collations for Asian languages
- It is faster than utf8mb3
Still, you may wonder how migration affects your existing data. This blog covers multiple aspects of it.
Storage Requirements
As the name suggests, the maximum number of bytes that one character can take with character set utf8mb4 is four bytes. This is larger than the requirements for utf8mb3 which takes three bytes and many other MySQL character sets.
Fortunately, utf8mb3 is a subset of utf8mb4, and migration of existing data does not increase the size of the data stored on disk: each character takes as many bytes as needed. For example, any digit or letter in the Latin alphabet will require one byte. Characters from other alphabets can take up to four bytes. This can be verified with a simple test.
mysql> set names utf8mb4; Query OK, 0 rows affected (0,00 sec) mysql> CREATE TABLE charset_len( name VARCHAR(255), val CHAR(1) ) CHARACTER SET=utf8mb4; Query OK, 0 rows affected (0,03 sec) mysql> INSERT INTO charset_len VALUES('Latin A', 'A'), ('Cyrillic ?', '?'), ('Korean ?', '?'), ('Dolphin ', ''); Query OK, 4 rows affected (0,02 sec) Records: 4 Duplicates: 0 Warnings: 0 mysql> SELECT name, val, HEX(val), BIT_LENGTH(val)/8 FROM charset_len; +--------------+------+----------+-------------------+ | name | val | HEX(val) | BIT_LENGTH(val)/8 | +--------------+------+----------+-------------------+ | Latin A | A | 41 | 1.0000 | | Cyrillic ? | ? | D090 | 2.0000 | | Korean ? | ? | E389BF | 3.0000 | | Dolphin | | F09F90AC | 4.0000 | +--------------+------+----------+-------------------+ 4 rows in set (0,00 sec)
As a result, all your data that uses a maximum of three bytes would not change and you will be able to store characters that require 4-bytes encoding.
Maximum Length of the Column
While the data storage does not change, when MySQL calculates the maximum amount of data that the column can store, it may fail for some column size definitions that work fine for utf8mb3. For example, you can have a table with this definition:
mysql> CREATE TABLE len_test( -> foo VARCHAR(16384) -> ) ENGINE=InnoDB CHARACTER SET utf8mb3; Query OK, 0 rows affected, 1 warning (0,06 sec)
If you decide to convert this table to use the utf8mb4 character set, the operation will fail:
mysql> ALTER TABLE len_test CONVERT TO CHARACTER SET utf8mb4; ERROR 1074 (42000): Column length too big for column 'foo' (max = 16383); use BLOB or TEXT instead
The reason for this is that the maximum number of bytes that MySQL can store in a VARCHAR column is 65,535, and that is 21845 characters for utf8mb3 character set and 16383 characters for the utf8mb4 character set.
Therefore, if you have columns that could contain more than 16383 characters, you will need to convert them to the TEXT or LONGTEXT data type.
You can find all such columns if you run the query:
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH, DATA_TYPE FROM information_schema.columns WHERE CHARACTER_MAXIMUM_LENGTH > 16383 AND DATA_TYPE NOT LIKE '%text%' AND DATA_TYPE NOT LIKE '%blob%' AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema');
For example, in my test environment, it returns:
*************************** 1. row *************************** TABLE_SCHEMA: test TABLE_NAME: setup COLUMN_NAME: value CHARACTER_MAXIMUM_LENGTH: 20000 DATA_TYPE: varchar 1 row in set (0,02 sec
Index Storage Requirement
MySQL does not know in advance which characters you will store in the column when you are creating indexes. Therefore, when it calculates the storage required for the index, it takes the maximum value for the character set chosen. As a result, you may hit the index storage limit when converting from another character set to utf8mb4. For InnoDB, the maximum size of the index is 767 bytes for REDUNDANT and COMPACT row formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats. See The User Reference Manual for details.
That means you need to check if you have indexes that could grow to exceed these values before performing the update. You can do this with the following query:
WITH indexes AS ( WITH tables AS ( SELECT SUBSTRING_INDEX(t.NAME, '/', 1) AS `database`, SUBSTRING_INDEX(t.NAME, '/', -1) AS `table`, i.NAME AS `index`, ROW_FORMAT FROM information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLES t USING(TABLE_ID) ) SELECT `database`, `table`, `index`, ROW_FORMAT, GROUP_CONCAT(kcu.COLUMN_NAME) AS columns, SUM(c.CHARACTER_MAXIMUM_LENGTH) * 4 AS index_len_bytes FROM tables JOIN information_schema.KEY_COLUMN_USAGE kcu ON (`database` = TABLE_SCHEMA AND `table` = kcu.TABLE_NAME AND `index` = kcu.CONSTRAINT_NAME) JOIN information_schema.COLUMNS c ON (kcu.COLUMN_NAME = c.COLUMN_NAME AND `database` = c.TABLE_SCHEMA AND `table` = c.TABLE_NAME) WHERE c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL GROUP BY `database`, `table`, `index`, ROW_FORMAT ORDER BY index_len_bytes ) SELECT * FROM indexes WHERE index_len_bytes >= 768;
Here is the result of running the query in my test environment:
+----------+--------------+---------+------------+------------+-----------------+ | database | table | index | ROW_FORMAT | columns | index_len_bytes | +----------+--------------+---------+------------+------------+-----------------+ | cookbook | hitcount | PRIMARY | Dynamic | path | 1020 | | cookbook | phrase | PRIMARY | Dynamic | phrase_val | 1020 | | cookbook | ruby_session | PRIMARY | Dynamic | session_id | 1020 | +----------+--------------+---------+------------+------------+-----------------+ 3 rows in set (0,04 sec)
Once you have identified such indexes, check the columns and adjust the table definition accordingly.
Note: The query uses CTE, available as of MySQL 8.0. If you are still on version 5.7 or earlier, you will need to rewrite the query.
Temporary Tables
One more issue you can hit after converting to the utf8mb4 character set is an increased size of the implicit temporary tables that MySQL creates to resolve queries. Since utf8mb4 may store more data than other character sets, the column size of such implicit tables will also be bigger. To figure out if you are affected by this issue, watch the global status variable Created_tmp_disk_tables. If this starts significantly increasing after the migration, you may consider updating RAM on your machine and increasing the maximum size of the temporary tables. Note that this issue could be a symptom that some of your queries are poorly optimized.
Conclusion
Converting to the utf8mb4 character set brings you the advantages of better performance, a larger range of characters that you can use, including emojis and new collations (sorting rules). This conversion comes at almost no price, and it can be done smoothly.
Ensure:
- You converted all VARCHAR columns that could store more than 16383 characters to the TEXT or LONGTEXT data type
- You adjusted index definitions that could take more than 767 bytes for the REDUNDANT and COMPACT row formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats after migration.
- You optimized your queries so that they should not start using internal disk-based temporary tables