Mar 29, 2022

Migrating to utf8mb4: Things to Consider

Migrating to utf8mb4

The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades.

Migration to utf8mb4 has many advantages including:

  • It can store more symbols, including emojis
  • It has new collations for Asian languages (see the example after this list)
  • It is faster than utf8mb3
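
If you want to confirm the new defaults and browse the new collations on your own server, here is a minimal sketch for MySQL 8.0 (the exact list of collations depends on your server version):

-- Default character set and collation of the server (utf8mb4 as of MySQL 8.0)
SELECT @@character_set_server, @@collation_server;

-- Some of the new language-specific utf8mb4 collations, e.g. Japanese and Chinese
SHOW COLLATION WHERE Charset = 'utf8mb4' AND Collation LIKE 'utf8mb4_ja%';
SHOW COLLATION WHERE Charset = 'utf8mb4' AND Collation LIKE 'utf8mb4_zh%';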

Still, you may wonder how migration affects your existing data. This blog covers multiple aspects of it.

Storage Requirements

As the name suggests, the maximum number of bytes that one character can take in the utf8mb4 character set is four. This is more than utf8mb3, where a character takes at most three bytes, and more than many other MySQL character sets.

Fortunately, utf8mb3 is a subset of utf8mb4, and migration of existing data does not increase the size of the data stored on disk: each character takes as many bytes as needed. For example, any digit or letter in the Latin alphabet will require one byte. Characters from other alphabets can take up to four bytes. This can be verified with a simple test.

mysql> set names utf8mb4;
Query OK, 0 rows affected (0,00 sec)

mysql> CREATE TABLE charset_len( name VARCHAR(255), val CHAR(1) ) CHARACTER SET=utf8mb4;
Query OK, 0 rows affected (0,03 sec)

mysql> INSERT INTO charset_len VALUES('Latin A', 'A'),  ('Cyrillic А', 'А'), ('Korean ㉿', '㉿'), ('Dolphin 🐬', '🐬');
Query OK, 4 rows affected (0,02 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT name, val, HEX(val), BIT_LENGTH(val)/8 FROM charset_len;
+--------------+------+----------+-------------------+
| name         | val  | HEX(val) | BIT_LENGTH(val)/8 |
+--------------+------+----------+-------------------+
| Latin A      | A    | 41       |            1.0000 |
| Cyrillic А   | А    | D090     |            2.0000 |
| Korean ㉿    | ㉿   | E389BF   |            3.0000 |
| Dolphin 🐬   | 🐬   | F09F90AC |            4.0000 |
+--------------+------+----------+-------------------+
4 rows in set (0,00 sec)

As a result, all of your existing data that uses at most three bytes per character will not change in size, and you will additionally be able to store characters that require four-byte encoding.
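
The conversion itself is typically done per table with ALTER TABLE ... CONVERT TO CHARACTER SET, and per schema so that newly created tables pick up the new default. A minimal sketch, assuming a hypothetical schema mydb and table mytable, and the default utf8mb4_0900_ai_ci collation:

-- Convert an existing table (this rebuilds the table, so plan for the copy)
ALTER TABLE mydb.mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Change the schema default so tables created later use utf8mb4 as well
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;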

Maximum Length of the Column

While the storage of existing data does not change, when MySQL calculates the maximum amount of data that a column can store, it may reject column size definitions that work fine with utf8mb3. For example, you can have a table with this definition:

mysql> CREATE TABLE len_test(
      -> foo VARCHAR(16384)
      -> ) ENGINE=InnoDB CHARACTER SET utf8mb3;
Query OK, 0 rows affected, 1 warning (0,06 sec)

If you decide to convert this table to use the utf8mb4 character set, the operation will fail:

mysql> ALTER TABLE len_test CONVERT TO CHARACTER SET utf8mb4;
ERROR 1074 (42000): Column length too big for column 'foo' (max = 16383); use BLOB or TEXT instead

The reason is that the maximum number of bytes MySQL can store in a VARCHAR column is 65,535, which corresponds to 21845 characters for the utf8mb3 character set but only 16383 characters for utf8mb4.

Therefore, if you have columns that could contain more than 16383 characters, you will need to convert them to the TEXT or LONGTEXT data type.

You can find all such columns if you run the query:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
       CHARACTER_MAXIMUM_LENGTH, DATA_TYPE
FROM information_schema.columns
WHERE CHARACTER_MAXIMUM_LENGTH > 16383 AND
      DATA_TYPE NOT LIKE '%text%' AND 
      DATA_TYPE NOT LIKE '%blob%' AND
      TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema');

For example, in my test environment, it returns:

*************************** 1. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: setup
COLUMN_NAME: value
CHARACTER_MAXIMUM_LENGTH: 20000
DATA_TYPE: varchar
1 row in set (0,02 sec)
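
Once identified, such columns can be changed to TEXT before (or as part of) the conversion. For the column found above, a minimal sketch might look like this (MODIFY rebuilds the table, so schedule it accordingly):

-- `value` was VARCHAR(20000); as TEXT it no longer hits the VARCHAR byte limit
ALTER TABLE test.setup MODIFY `value` TEXT CHARACTER SET utf8mb4;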


Index Storage Requirement

MySQL does not know in advance which characters you will store in a column when you create an index on it. Therefore, when it calculates the storage required for the index, it assumes the maximum width for the chosen character set. As a result, you may hit the index size limit when converting from another character set to utf8mb4. For InnoDB, the maximum index key length is 767 bytes for the REDUNDANT and COMPACT row formats, and 3072 bytes for the DYNAMIC and COMPRESSED row formats. See the MySQL Reference Manual for details.

That means you need to check if you have indexes that could grow to exceed these values before performing the update. You can do this with the following query:

WITH indexes AS (
     WITH tables AS  (
          SELECT SUBSTRING_INDEX(t.NAME, '/', 1) AS `database`, SUBSTRING_INDEX(t.NAME, '/', -1) AS `table`, i.NAME AS `index`, ROW_FORMAT
          FROM information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLES t USING(TABLE_ID)
    )
    SELECT `database`, `table`, `index`, ROW_FORMAT, GROUP_CONCAT(kcu.COLUMN_NAME) AS columns,
           SUM(c.CHARACTER_MAXIMUM_LENGTH) * 4 AS index_len_bytes
    FROM tables JOIN information_schema.KEY_COLUMN_USAGE kcu
         ON (`database` = TABLE_SCHEMA AND `table` = kcu.TABLE_NAME AND `index` = kcu.CONSTRAINT_NAME)
         JOIN information_schema.COLUMNS c ON (kcu.COLUMN_NAME = c.COLUMN_NAME AND `database` = c.TABLE_SCHEMA AND `table` = c.TABLE_NAME)
    WHERE c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
    GROUP BY `database`, `table`, `index`, ROW_FORMAT ORDER BY index_len_bytes
) SELECT * FROM indexes WHERE index_len_bytes >= 768;

Here is the result of running the query in my test environment:

+----------+--------------+---------+------------+------------+-----------------+
| database | table        | index   | ROW_FORMAT | columns    | index_len_bytes |
+----------+--------------+---------+------------+------------+-----------------+
| cookbook | hitcount     | PRIMARY | Dynamic    | path       |            1020 |
| cookbook | phrase       | PRIMARY | Dynamic    | phrase_val |            1020 |
| cookbook | ruby_session | PRIMARY | Dynamic    | session_id |            1020 |
+----------+--------------+---------+------------+------------+-----------------+
3 rows in set (0,04 sec)

Once you have identified such indexes, check the columns and adjust the table definition accordingly.
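
What "adjust" means depends on the table. Note that the rows above are already in the DYNAMIC row format, so their 1020-byte worst-case keys still fit under the 3072-byte limit; they would only become a problem with REDUNDANT or COMPACT. Two common fixes, shown as a hedged sketch against a hypothetical problem_table with an indexed col column:

-- Option 1: shorten the indexed column so the worst case fits the limit
-- (191 characters * 4 bytes = 764 bytes < 767)
ALTER TABLE problem_table MODIFY col VARCHAR(191);

-- Option 2: move the table to the DYNAMIC row format, which raises the
-- index key limit from 767 to 3072 bytes
ALTER TABLE problem_table ROW_FORMAT=DYNAMIC;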

Note: The query uses CTEs (common table expressions), which are available as of MySQL 8.0. If you are still on version 5.7 or earlier, you will need to rewrite it, for example as sketched below.
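
A rough 5.7 equivalent, using the older INNODB_SYS_TABLES / INNODB_SYS_INDEXES dictionary tables and derived tables instead of CTEs (an untested sketch; adjust to your environment):

SELECT `database`, `table`, `index`, ROW_FORMAT,
       GROUP_CONCAT(kcu.COLUMN_NAME) AS columns,
       SUM(c.CHARACTER_MAXIMUM_LENGTH) * 4 AS index_len_bytes
FROM (
      SELECT SUBSTRING_INDEX(t.NAME, '/', 1) AS `database`,
             SUBSTRING_INDEX(t.NAME, '/', -1) AS `table`,
             i.NAME AS `index`, t.ROW_FORMAT
      FROM information_schema.INNODB_SYS_INDEXES i
           JOIN information_schema.INNODB_SYS_TABLES t USING(TABLE_ID)
     ) tbls
     JOIN information_schema.KEY_COLUMN_USAGE kcu
          ON (`database` = kcu.TABLE_SCHEMA AND `table` = kcu.TABLE_NAME AND `index` = kcu.CONSTRAINT_NAME)
     JOIN information_schema.COLUMNS c
          ON (kcu.COLUMN_NAME = c.COLUMN_NAME AND `database` = c.TABLE_SCHEMA AND `table` = c.TABLE_NAME)
WHERE c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
GROUP BY `database`, `table`, `index`, ROW_FORMAT
HAVING index_len_bytes >= 768
ORDER BY index_len_bytes;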

Temporary Tables

One more issue you can hit after converting to the utf8mb4 character set is an increased size of the implicit temporary tables that MySQL creates to resolve queries. Since utf8mb4 may store more data than other character sets, the columns of such implicit tables are sized bigger as well. To figure out if you are affected by this issue, watch the global status variable Created_tmp_disk_tables. If it starts increasing significantly after the migration, you may consider adding RAM to your machine and raising the maximum size of in-memory temporary tables. Note that this issue could also be a symptom of poorly optimized queries.
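
A minimal way to watch for this, and one possible mitigation, assuming you have the memory to spare (the 256M value is only illustrative, and the relevant variables differ slightly between the MEMORY engine and the 8.0 TempTable engine):

-- How many implicit temporary tables were created, and how many spilled to disk
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';

-- Raise the size threshold for in-memory temporary tables (both variables
-- are consulted, so raise them together)
SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;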

Conclusion

Converting to the utf8mb4 character set brings you better performance, a larger range of characters that you can use (including emojis), and new collations (sorting rules). This conversion comes at almost no cost, and it can be done smoothly.

Ensure:

  • You converted all VARCHAR columns that could store more than 16383 characters to the TEXT or LONGTEXT data type
  • You adjusted index definitions that could exceed 767 bytes for the REDUNDANT and COMPACT row formats, or 3072 bytes for the DYNAMIC and COMPRESSED row formats, after migration
  • You optimized your queries so that they do not start using internal disk-based temporary tables

Oct 15, 2013

utf8 data on latin1 tables: converting to utf8 without downtime or double encoding

Here’s a problem some or most of us have encountered: you have a latin1 table defined like the one below, and your application stores utf8 data in the column over a latin1 connection. Obviously, double encoding occurs. Now your development team has decided to use utf8 everywhere, but during the migration you can afford little to no downtime while keeping the stored data valid.

CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!');
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
master> SET NAMES utf8;
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | Â¡CelebraciÃ³n! | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)

One approach, as described in the manual, is to convert the TEXT column to BLOB, then convert the table character set to utf8 and change the c column back to TEXT, like this:

master> ALTER TABLE t CHANGE c c BLOB;
master> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE c c TEXT;
master> SET NAMES utf8;
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)

All good so far. But if the tables are big enough that this conversion would disrupt your application significantly, it becomes a problem. The old trick of using slaves now comes into play. In a nutshell, you first convert the TEXT column to BLOB on a slave, then switch your application to use this slave as its new primary. Any utf8 data written via replication or from the application will be stored and retrieved without issues, whether the connection character set is latin1 or anything else, because BLOB (binary) columns do not have a character set at all. Let me show you:

slave> SET NAMES latin1;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
slave> ALTER TABLE t CHANGE c c BLOB;
slave> SET NAMES latin1;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
2 rows in set (0.00 sec)
slave> SET NAMES utf8;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  5 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
3 rows in set (0.00 sec)

As you can see, while the column is still a BLOB, I have no problem reading or storing utf8 data in it. Now, after your application has been configured to use this slave with a utf8 connection, you can convert the column back to TEXT and the table to the utf8 character set.

slave> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE c c TEXT;
slave> SET NAMES utf8;
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  5 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
3 rows in set (0.00 sec)

Some caveats, though. First, you cannot replicate the BLOB or utf8 data back to the latin1 column on the original master; doing so would just result in double encoding, so the data on the original master has to be discarded. Second, while the column is a BLOB or any other binary type and is indexed, you may get different results when the index is used, because binary data is sorted by its numeric byte values rather than by character collation. Here is an example:

master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` blob,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!'), ('Férrêts being fërøcîóúß'), ('Voyage à Montreal');
master> SELECT c FROM t ORDER BY c;
+---------------------------------+
| c                               |
+---------------------------------+
| ¡Celebración!                   |
| Férrêts being fërøcîóúß         |
| Voyage à Montreal               |
+---------------------------------+
3 rows in set (0.00 sec)
master> ALTER TABLE t CHANGE c c BLOB;
master> SELECT c FROM t ORDER BY c;
+---------------------------------+
| c                               |
+---------------------------------+
| Férrêts being fërøcîóúß         |
| Voyage à Montreal               |
| ¡Celebración!                   |
+---------------------------------+
3 rows in set (0.00 sec)

See how the results are now ordered differently?
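
If you need character-based ordering while the column is still a BLOB, one workaround is to convert on the fly in the query. A hedged sketch (this bypasses the index on c, so expect a filesort on larger tables):

SELECT c FROM t ORDER BY CONVERT(c USING utf8);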

What’s your utf8 horror story? Share it with us in the comments below :-)

UPDATE: This is how the process looks without downtime or the table being blocked for an extended time, but there are other ways. One of them is creating a copy of the original table already converted to utf8 and doing an INSERT INTO ... SELECT using the CAST or CONVERT functions, like below.

master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` mediumtext,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
master> SHOW CREATE TABLE x \G
*************************** 1. row ***************************
       Table: x
Create Table: CREATE TABLE `x` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` mediumtext,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
master> SET NAMES latin1;
master> SELECT * FROM t;
+----+-----------------+
| id | c               |
+----+-----------------+
|  1 | ¡Celebración!   |
|  2 | a               |
|  3 | A               |
|  4 | ??            |
+----+-----------------+
4 rows in set (0.00 sec)
master> INSERT INTO x SELECT id, CONVERT(c USING BINARY) FROM t;
master> SELECT * FROM x;
+----+---------------+
| id | c             |
+----+---------------+
|  1 | ebraci  |
|  2 | a             |
|  3 | A             |
|  4 | ??            |
+----+---------------+
4 rows in set (0.00 sec)
master> SET NAMES utf8;
master> SELECT * FROM x;
+----+-----------------+
| id | c               |
+----+-----------------+
|  1 | ¡Celebración!   |
|  2 | a               |
|  3 | A               |
|  4 | ??            |
+----+-----------------+
4 rows in set (0.00 sec)

Another method is to copy the FRM file of the same table structure, but defined with utf8, over your original table’s FRM file. Since the data is already stored as utf8, you should be able to read it over a utf8 connection. However, you will have to rebuild the indexes on the affected columns, as they were originally sorted as latin1. In my tests, though, there was no difference before and after rebuilding the index, so YMMV. To demonstrate, still using the same two tables as before: on the filesystem I replaced t.frm with a copy of x.frm, then ran FLUSH TABLES; afterwards, t looked like this:

master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` mediumtext,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

Now, attempting to read the data over a latin1 connection causes truncation:

master> SET NAMES latin1;
master> SELECT * FROM t ORDER BY c;
+----+---------------+
| id | c             |
+----+---------------+
|  3 | a             |
|  4 | A             |
|  2 | ebraci  |
|  1 | ??            |
+----+---------------+
4 rows in set (0.00 sec)

But on utf8, I am now able to read it fine:

master> SET NAMES utf8;
master> SELECT * FROM t ORDER BY c;
+----+-----------------+
| id | c               |
+----+-----------------+
|  3 | a               |
|  4 | A               |
|  2 | ¡Celebración!   |
|  1 | ??            |
+----+-----------------+
4 rows in set (0.00 sec)

Rebuilding the secondary key on the c column makes no difference in the results either.

master> ALTER TABLE t DROP KEY c, ADD KEY (c(255));
master> SELECT * FROM t ORDER BY c;
+----+-----------------+
| id | c               |
+----+-----------------+
|  3 | a               |
|  4 | A               |
|  2 | ¡Celebración!   |
|  1 | ??            |
+----+-----------------+
4 rows in set (0.00 sec)

