In this blog, I provide answers to the Q&A from the Troubleshooting Issues with MySQL Character Sets webinar.
First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:
Q: We’ve had some issues converting tables from utf8 to utf8mb4. Our issue was that the collation we wanted to use – utf8mb4_unicode_520_ci – did not distinguish between regular spaces and ideographic (Japanese) spaces, so we were getting unique constraint violations for varchar fields when two entries had the same text with different kinds of spaces. Have you seen this problem, and is there a workaround? We were wondering if this was related to the mother-child character bug with this collation.
A: Unfortunately, this issue exists for many languages. For example, in Russian you cannot distinguish “е” and “ё” if you use utf8 or utf8mb4. However, there is hope for Japanese: Oracle announced that they will implement new language-specific utf8mb4 collations in MySQL 8.0. I already see 21 new collations in my 8.0.0 installation.
mysql> show collation like '%0900%';
+----------------------------+---------+-----+---------+----------+---------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen |
+----------------------------+---------+-----+---------+----------+---------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 |         | Yes      |       8 |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       8 |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       8 |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       8 |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       8 |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       8 |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       8 |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       8 |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       8 |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       8 |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       8 |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       8 |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       8 |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       8 |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       8 |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       8 |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       8 |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       8 |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       8 |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       8 |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       8 |
+----------------------------+---------+-----+---------+----------+---------+
21 rows in set (0,03 sec)
For 8.0.1, they have promised new case-sensitive and Japanese collations. Please see this blog post for details; the note about the planned Japanese support is at the end.
Meanwhile, I can only suggest that you implement your own collation, as described here. You may use the utf8_russian_ci collation from Bug #51976 as an example.

Although the user manual does not list utf8mb4 as a character set for which it is possible to create new collations, you can actually do it. What you need to do is add a record for the utf8mb4 character set and the new collation to Index.xml, then restart the server.
<charset name="utf8mb4">
  <collation name="utf8mb4_russian_ci" id="1033">
    <rules>
      <reset>\u0415</reset><p>\u0451</p><t>\u0401</t>
    </rules>
  </collation>
</charset>

mysql> show collation like 'utf8mb4_russian_ci';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_russian_ci | utf8mb4 | 1033 |         |          |       8 |
+--------------------+---------+------+---------+----------+---------+
1 row in set (0,03 sec)

mysql> create table test_yo(
    ->   gen varchar(100) CHARACTER SET utf8mb4,
    ->   yo varchar(100) CHARACTER SET utf8mb4 collate utf8mb4_russian_ci
    -> ) engine=innodb default character set=utf8mb4;
Query OK, 0 rows affected (0,25 sec)

mysql> set names utf8mb4;
Query OK, 0 rows affected (0,02 sec)

mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,05 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,06 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from test_yo order by gen;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ????     | ????     |
| ????     | ????     |
| ???      | ???      |
| ???      | ???      |
+----------+----------+
6 rows in set (0,00 sec)

mysql> select * from test_yo order by yo;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ???      | ???      |
| ???      | ???      |
| ????     | ????     |
| ????     | ????     |
+----------+----------+
6 rows in set (0,00 sec)
Q: If we receive utf8 data over a latin1 connection, it will be corrupted. Just want to confirm: can you reformat it as utf8 and un-corrupt the data? Also, is there a time limit on how quickly this needs to be done?
A: It will be corrupted only if you store utf8 data in a latin1 column. For example, if you have a table defined as:
create table latin1( f1 varchar(100) ) engine=innodb default charset=latin1;
And then insert into it a word in utf8 format that contains characters not in the latin1 character set:
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> set sql_mode='';
Query OK, 0 rows affected, 1 warning (0,00 sec)

mysql> insert into latin1 values('Sveta'), ('Света');
Query OK, 2 rows affected, 1 warning (0,04 sec)
Records: 2  Duplicates: 0  Warnings: 1
The utf8 data will be corrupted and can never be recovered:
mysql> select * from latin1;
+-------+
| f1    |
+-------+
| Sveta |
| ????? |
+-------+
2 rows in set (0,00 sec)

mysql> select f1, hex(f1) from latin1;
+-------+------------+
| f1    | hex(f1)    |
+-------+------------+
| Sveta | 5376657461 |
| ????? | 3F3F3F3F3F |
+-------+------------+
2 rows in set (0,01 sec)
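This lossy replacement is easy to reproduce outside MySQL. A minimal Python sketch (the sample string "Света" is my own stand-in for any text that has no latin1 representation):

```python
# Characters with no latin1 representation are replaced by '?',
# mirroring what the server does when utf8 data is forced into a
# latin1 column with sql_mode=''.
original = "Света"
stored = original.encode("latin-1", errors="replace")

# The replacement is lossy: only literal question marks remain,
# so the original characters are gone for good.
assert stored == b"?????"
assert stored.decode("latin-1") == "?????"
```

The five `3F` bytes in the `hex(f1)` output above are exactly these literal `?` characters.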
However, if your data is stored in a utf8 column and you use latin1 for the connection, you only get a corrupted result set. The data itself is left untouched:
mysql> create table utf8(f1 varchar(100)) engine=innodb character set utf8;
Query OK, 0 rows affected (0,18 sec)

mysql> insert into utf8 values('Sveta'), ('Света');
Query OK, 2 rows affected (0,15 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> set names latin1;
Query OK, 0 rows affected (0,00 sec)

mysql> select f1, hex(f1) from utf8;
+-------+----------------------+
| f1    | hex(f1)              |
+-------+----------------------+
| Sveta | 5376657461           |
| ????? | D0A1D0B2D0B5D182D0B0 |
+-------+----------------------+
2 rows in set (0,00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> select f1, hex(f1) from utf8;
+------------+----------------------+
| f1         | hex(f1)              |
+------------+----------------------+
| Sveta      | 5376657461           |
| Света      | D0A1D0B2D0B5D182D0B0 |
+------------+----------------------+
2 rows in set (0,00 sec)
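That the stored bytes survive a wrong connection charset can also be checked outside MySQL. A minimal Python sketch using the exact bytes from the `hex(f1)` output above (the client-side decode here is an illustration of one way such corruption appears, not MySQL's own conversion path):

```python
# The utf8 column stores these bytes regardless of the connection charset.
stored_bytes = bytes.fromhex("D0A1D0B2D0B5D182D0B0")

# A wrong connection charset merely *renders* them incorrectly;
# decoding them as latin-1 produces mojibake, but loses nothing.
mojibake = stored_bytes.decode("latin-1")

# Reading them with the correct charset recovers the original text,
# and the mojibake maps back to the identical bytes.
assert stored_bytes.decode("utf-8") == "Света"
assert mojibake.encode("latin-1") == stored_bytes
```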
Q: Can you discuss how charsets affect mysqldump? Specifically, how do we dump a database containing tables with different default charsets?
A: Yes, you can. MySQL can successfully convert data between different character sets, so your only job is to specify the --default-character-set option for mysqldump. Strings in any character set you use will then be converted to the character set specified. For example, if you use cp1251 and latin1, you may set --default-character-set to cp1251, utf8 or utf8mb4. However, you cannot set it to latin1, because Cyrillic characters exist in the cp1251 character set but do not exist in latin1.
The default value for mysqldump is utf8. You only need to change this default if you use values that are outside the range supported by utf8 (for example, the smileys in utf8mb4).
Q: But if you use the --single-transaction option for mysqldump, can you only specify one character set as the default?
A: Yes, and this is OK: all data will be converted into this character set. Then, when you restore the dump, it will be converted back to the character set specified in the column definitions.
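A sketch of such an invocation (the host, database name and credentials are hypothetical, and it needs a running server; --single-transaction and --default-character-set are standard mysqldump options):

```shell
# Dump a database whose tables mix latin1, cp1251 and utf8 columns.
# utf8mb4 is a superset of all of them, so the data converts cleanly;
# the column definitions in the dump keep their original character sets,
# so a restore converts everything back.
mysqldump --single-transaction \
          --default-character-set=utf8mb4 \
          -h 127.0.0.1 -u root -p mydb > mydb.sql
```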
Q: I noticed that MySQL doesn’t support case-sensitive UTF-8 character sets. What do you recommend for implementing case-sensitive UTF-8, if it’s at all possible?

A: In the link I provided earlier, Oracle promises to implement case-sensitive collations for utf8mb4 in version 8.0.1. Until that happens, I recommend implementing your own case-sensitive collation.
Q: How are tools like pt-table-checksum affected by charsets? Is it safe to use a 4-byte charset (like utf8mb4) as the default charset for all comparisons, assuming our tables are a mix of latin1, utf8 and utf8mb4?
A: With this combination, you won’t have any issues: pt-table-checksum uses a complicated set of functions that joins columns and calculates a crc32 checksum on them. In your case, all data will be converted to utf8mb4 and no conflicts will happen.
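Conceptually, the per-row checksum the tool builds can be sketched in a few lines of Python (an illustration of the idea, not the tool's exact implementation; the sample row values are assumptions):

```python
import zlib

# pt-table-checksum concatenates the column values of each row
# (CONCAT_WS('#', ...) in SQL) and feeds the result to CRC32.
# Rows whose columns use incompatible collations make that
# CONCAT_WS fail, which is exactly the error shown below.
row = ("Sveta", "Света")
concat = "#".join(str(col) for col in row)  # roughly CONCAT_WS('#', f1, f2)
crc = zlib.crc32(concat.encode("utf-8"))    # roughly CRC32(...) on the server

print(f"{crc:x}")  # hex form, as in the tool's LOWER(CONV(..., 10, 16))
```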
However, if you use incompatible character sets in a single table, you may get the error "Illegal mix of collations for operation 'concat_ws'":
mysql> create table cp1251(
    ->   f1 varchar(100) character set latin1,
    ->   f2 varchar(100) character set cp1251
    -> ) engine=innodb;
Query OK, 0 rows affected (0,32 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> insert into cp1251 values('Sveta', 'Света');
Query OK, 1 row affected (0,07 sec)

sveta@Thinkie:~/build/mysql-8.0/mysql-test$ ~/build/percona-toolkit/bin/pt-table-checksum h=127.0.0.1,P=13000,u=root,D=test
Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.
03-18T03:51:58 Error executing EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/"] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 11351.
03-18T03:51:58 Error checksumming table db1.cp1251: Error executing checksum query: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*checksum table*/" with ParamValues: 0='db1', 1='cp1251', 2=1, 3=undef, 4=undef, 5=undef] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 10741.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
03-18T03:51:58      2      0        0       1       0   0.003 db1.cp1251
03-18T03:51:58      0      0        2       1       0   0.167 db1.latin1
03-18T03:51:58      0      0        6       1       0   0.198 db1.test_yo
...
The tool continues working and processes the rest of your tables. I reported this behavior as Bug #1674266.
Thanks for attending the Troubleshooting Issues with MySQL Character Sets webinar.