Mar 28, 2017

Troubleshooting Issues with MySQL Character Sets Q & A

In this blog, I will provide answers to the Q & A for the Troubleshooting Issues with MySQL Character Sets webinar.

First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: We’ve had some issues converting tables from utf8 to utf8mb4. Our issue was that the collation we wanted to use – utf8mb4_unicode_520_ci – did not distinguish between spaces and ideographic (Japanese) spaces, so we were getting unique constraint violations for the varchar fields when two entries had the same text with different kinds of spaces. Have you seen this problem and is there a workaround? We were wondering if this was related to the mother-child character bug with this collation.

A: Unfortunately this issue exists for many languages. For example, in Russian you cannot distinguish “е” and “ё” if you use utf8 or utf8mb4. However, there is hope for Japanese: Oracle announced that they will implement new language-specific utf8mb4 collations in MySQL 8.0. I already see 21 new collations in my 8.0.0 installation.

mysql> show collation like '%0900%';
+----------------------------+---------+-----+---------+----------+---------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen |
+----------------------------+---------+-----+---------+----------+---------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 |         | Yes      |       8 |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       8 |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       8 |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       8 |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       8 |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       8 |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       8 |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       8 |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       8 |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       8 |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       8 |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       8 |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       8 |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       8 |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       8 |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       8 |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       8 |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       8 |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       8 |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       8 |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       8 |
+----------------------------+---------+-----+---------+----------+---------+
21 rows in set (0,03 sec)

In 8.0.1 they promised new case-sensitive and Japanese collations. Please see this blog post for details. The note about the planned Japanese support is at the end.

Meanwhile, I can only suggest that you implement your own collation as described here. You may use the utf8_russian_ci collation from Bug #51976 as an example.

Although the user manual does not list utf8mb4 as a character set for which it’s possible to create new collations, you can actually do it. What you need to do is add a record about the utf8mb4 character set and the new collation into Index.xml, then restart the server.

<charset name="utf8mb4">
  <collation name="utf8mb4_russian_ci" id="1033">
    <rules>
      <reset>\u0415</reset><p>\u0451</p><t>\u0401</t>
    </rules>
  </collation>
</charset>
mysql> show collation like 'utf8mb4_russian_ci';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_russian_ci | utf8mb4 | 1033 |         |          |       8 |
+--------------------+---------+------+---------+----------+---------+
1 row in set (0,03 sec)
mysql> create table test_yo(gen varchar(100) CHARACTER SET utf8mb4, yo varchar(100) CHARACTER SET utf8mb4 collate utf8mb4_russian_ci) engine=innodb default character set=utf8mb4;
Query OK, 0 rows affected (0,25 sec)
mysql> set names utf8mb4;
Query OK, 0 rows affected (0,02 sec)
mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,05 sec)
Records: 3  Duplicates: 0  Warnings: 0
mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,06 sec)
Records: 3  Duplicates: 0  Warnings: 0
mysql> select * from test_yo order by gen;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ????     | ????     |
| ????     | ????     |
| ???      | ???      |
| ???      | ???      |
+----------+----------+
6 rows in set (0,00 sec)
mysql> select * from test_yo order by yo;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ???      | ???      |
| ???      | ???      |
| ????     | ????     |
| ????     | ????     |
+----------+----------+
6 rows in set (0,00 sec)

Q: If receiving utf8 on a latin1 charset, it will be corrupted. Just want to confirm that you can reformat as utf8 and un-corrupt the data? Also, is there a time limit on how quickly this needs to be done?

A: It will be corrupted only if you store utf8 data in a latin1 column. For example, if you have a table, defined as:

create table latin1(
  f1 varchar(100)
) engine=innodb default charset=latin1;

And then insert a word in utf8 format into it that contains characters that are not in the latin1 character set:

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)
mysql> set sql_mode='';
Query OK, 0 rows affected, 1 warning (0,00 sec)
mysql> insert into latin1 values('Sveta'), ('Света');
Query OK, 2 rows affected, 1 warning (0,04 sec)
Records: 2  Duplicates: 0  Warnings: 1

The utf8 data will be corrupted and can never be recovered:

mysql> select * from latin1;
+-------+
| f1    |
+-------+
| Sveta |
| ????? |
+-------+
2 rows in set (0,00 sec)
mysql> select f1, hex(f1) from latin1;
+-------+------------+
| f1    | hex(f1)    |
+-------+------------+
| Sveta | 5376657461 |
| ????? | 3F3F3F3F3F |
+-------+------------+
2 rows in set (0,01 sec)
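
The mechanics are easy to reproduce outside MySQL. Here is a quick Python sketch (an illustration of the byte arithmetic, not the server's actual code path), using the Cyrillic word «Света» as an example, that shows why the corrupted column above contains the bytes 3F3F3F3F3F: latin1 has no mapping for Cyrillic at all, so a lossy conversion replaces every character with '?' (0x3F).

```python
word = "Света"

# In UTF-8, each Cyrillic letter takes two bytes.
utf8_bytes = word.encode("utf-8")
print(utf8_bytes.hex())   # d0a1d0b2d0b5d182d0b0

# latin1 cannot represent Cyrillic; a lossy conversion turns every
# character into '?', which is exactly the 3F3F3F3F3F seen above.
lossy = word.encode("latin-1", errors="replace")
print(lossy.hex())        # 3f3f3f3f3f
```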

However, if your data is stored in a utf8 column and you use latin1 for the connection, you will only get a corrupted result set. The data itself will be left untouched:

mysql> create table utf8(f1 varchar(100)) engine=innodb character set utf8;
Query OK, 0 rows affected (0,18 sec)
mysql> insert into utf8 values('Sveta'), ('Света');
Query OK, 2 rows affected (0,15 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> set names latin1;
Query OK, 0 rows affected (0,00 sec)
mysql> select f1, hex(f1) from utf8;
+-------+----------------------+
| f1    | hex(f1)              |
+-------+----------------------+
| Sveta | 5376657461           |
| ????? | D0A1D0B2D0B5D182D0B0 |
+-------+----------------------+
2 rows in set (0,00 sec)
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)
mysql> select f1, hex(f1) from utf8;
+------------+----------------------+
| f1         | hex(f1)              |
+------------+----------------------+
| Sveta      | 5376657461           |
| Света      | D0A1D0B2D0B5D182D0B0 |
+------------+----------------------+
2 rows in set (0,00 sec)
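
Again, this is easy to verify outside MySQL with a short Python sketch (mine, for illustration): the stored bytes – the hex shown above – are valid UTF-8 and are never modified; only the client-side interpretation changes, and decoding them with the correct character set recovers the text.

```python
# The bytes actually stored in the utf8 column, taken from hex(f1) above.
stored = bytes.fromhex("d0a1d0b2d0b5d182d0b0")

# Reading them back with the correct connection charset recovers the word.
print(stored.decode("utf-8"))   # Света
```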

Q: Can you discuss how charsets affect mysqldump? Specifically, how do we dump a database containing tables with different default charsets?

A: Yes, you can. MySQL can successfully convert data that uses different character sets, so your only job is to specify the --default-character-set option for mysqldump. In this case, strings in any character set you use can be converted to the character set specified. For example, if you use cp1251 and latin1, you may set the --default-character-set option to cp1251, utf8 or utf8mb4. However, you cannot set it to latin1 because Cyrillic characters exist in the cp1251 character set, but do not exist in latin1.

The default value for mysqldump is utf8. You only need to change this default if you use values that are outside of the range supported by utf8 (for example, the smileys in utf8mb4).
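
The conversion rule is the same one any encoding library enforces. A short Python sketch (an illustration, not mysqldump itself) of why cp1251 data survives a dump taken in cp1251 or utf8, but not in latin1:

```python
word = "Света"

# cp1251 covers Cyrillic, so a cp1251 dump round-trips the data.
assert word.encode("cp1251").decode("cp1251") == word

# utf8/utf8mb4 cover everything cp1251 does, so utf8 is also safe.
assert word.encode("utf-8").decode("utf-8") == word

# latin1 has no Cyrillic characters: the conversion cannot succeed.
try:
    word.encode("latin-1")
except UnicodeEncodeError as e:
    print("lossy:", e.reason)
```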

Q: But if you use the --single-transaction option for mysqldump, you can only specify one character set in the default?

A: Yes, and this is OK: all data will be converted into this character set. Then, when you restore the dump, it will be converted back to the character set specified in the column definitions.

Q: I noticed that MySQL doesn’t support case-sensitive UTF-8 character sets. What do you recommend for implementing case-sensitive UTF-8, if it’s at all possible?

A: In the link I provided earlier, Oracle promises to implement case-sensitive collations for utf8mb4 in version 8.0.1. Until that happens, I recommend implementing your own case-sensitive collation.

Q: How are tools like pt-table-checksum affected by charsets? Is it safe to use a 4-byte charset (like utf8mb4) as the default charset for all comparisons? Assuming our tables are a mix of latin1, utf8 and utf8mb4.

A: With this combination, you won’t have any issues: pt-table-checksum uses a complicated set of functions that joins columns and calculates a crc32 checksum on them. In your case, all data will be converted to utf8mb4 and no conflicts will happen.

However, if you use incompatible character sets in a single table, you may get the error "Illegal mix of collations for operation 'concat_ws'":

mysql> create table cp1251(f1 varchar(100) character set latin1, f2 varchar(100) character set cp1251) engine=innodb;
Query OK, 0 rows affected (0,32 sec)
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)
mysql> insert into cp1251 values('Sveta', 'Света');
Query OK, 1 row affected (0,07 sec)
sveta@Thinkie:~/build/mysql-8.0/mysql-test$ ~/build/percona-toolkit/bin/pt-table-checksum h=127.0.0.1,P=13000,u=root,D=test
Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.
03-18T03:51:58 Error executing EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/"] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 11351.
03-18T03:51:58 Error checksumming table db1.cp1251: Error executing checksum query: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*checksum table*/" with ParamValues: 0='db1', 1='cp1251', 2=1, 3=undef, 4=undef, 5=undef] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 10741.
TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
03-18T03:51:58      2      0        0       1       0   0.003 db1.cp1251
03-18T03:51:58      0      0        2       1       0   0.167 db1.latin1
03-18T03:51:58      0      0        6       1       0   0.198 db1.test_yo
...

The tool continues working, and will process the rest of your tables. I reported this behavior as Bug #1674266.
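
For reference, the core of the checksum that the tool builds in SQL (visible in the CONCAT_WS/CRC32/BIT_XOR query in the error above) can be sketched in Python. This is a simplified illustration only: the real tool does this server-side, chunk by chunk, with additional NULL bookkeeping.

```python
import zlib

def row_crc(*cols):
    # Join column values with '#', as CONCAT_WS('#', ...) does;
    # NULLs become empty strings for this sketch.
    joined = "#".join("" if c is None else str(c) for c in cols)
    return zlib.crc32(joined.encode("utf-8"))

# Aggregate per-row CRCs with XOR, as BIT_XOR does per chunk.
# The row values here are made up for the example.
checksum = 0
for row in [("Sveta", "Smirnova"), ("John", "Doe")]:
    checksum ^= row_crc(*row)

print(f"{checksum:x}")
```

Because every column value is converted to the connection character set before CONCAT_WS runs, a single compatible charset (such as utf8mb4) makes the comparison consistent across tables.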

Thanks for attending the Troubleshooting Issues with MySQL Character Sets webinar.

Feb 22, 2013

CentOS 5.8 users: your UTF-8 data is in peril with Perl MySQL

CentOS 5.8 and earlier use the Perl module DBD::mysql v3.0007, which has a bug that causes Perl not to flag UTF-8 data as being UTF-8.  Presuming that the MySQL table/column is using UTF-8, and the Perl MySQL connection is also using UTF-8, a correct system returns:

PV = 0x9573840 "\343\203\213 \303\250"\0 [UTF8 "\x{30cb} \x{e8}"]

That’s a Devel::Peek inside a Perl scalar variable which clearly shows that Perl has recognized and flagged the data as UTF-8.  With DBD::mysql v3.0007, however, an incorrect system returns:

PV = 0x92df9a8 "\343\203\213 \303\250"\0

Notice that it’s the same byte sequence (in octal), but there’s no UTF-8 flag.  As far as Perl is concerned, this is Latin1 data.

What does this mean for you?  In general, it means that Perl could corrupt the data by treating UTF-8 data as Latin1 data.  If the program doesn’t alter the data, then the problem is “overlooked” and compensated for by the fact that MySQL knows that the data is UTF-8 and treats it as such.  We have found, however, that a program can modify the data without corrupting it, but this is risky and really only works by luck, so you shouldn’t rely on it.
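
The corruption mechanism is easiest to see in a short sketch. This Python analogue (the actual bug is in Perl's DBD::mysql, but the byte arithmetic is identical) shows what happens when UTF-8 bytes are treated as Latin1 and then re-encoded:

```python
# The bytes MySQL sends for 'è' in UTF-8.
original = "è".encode("utf-8")          # b'\xc3\xa8'

# Without the UTF-8 flag, the program interprets them as Latin1 text.
mistaken = original.decode("latin-1")   # 'Ã¨' - two characters, not one

# Re-encoding that text as UTF-8 double-encodes it: the data is corrupted.
round_trip = mistaken.encode("utf-8")

print(original.hex())     # c3a8
print(round_trip.hex())   # c383c2a8 - no longer the original bytes
```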

I’d like to clarify two things.  First, DBD::mysql v3.0007 was released in September 2006, but this very old problem still exists today because CentOS 5 is still a popular Linux distro.  So this isn’t “breaking news”: Perl and DBD::mysql have handled UTF-8 correctly for nearly a decade.  Second, just a reminder: all Percona Toolkit tools that connect to MySQL have a --charset option and an “A” DSN option for setting the character set.

In conclusion, if you

  1. Run CentOS 5
  2. Have UTF-8 data in MySQL
  3. Use Perl to access that data
  4. Have not upgraded DBD::mysql (perl-DBD-MySQL)

then your UTF-8 data is in peril with Perl.  The solution is simple: upgrade to any newer version of DBD::mysql (except 4.014).

The post CentOS 5.8 users: your UTF-8 data is in peril with Perl MySQL appeared first on MySQL Performance Blog.
