Mar 29, 2022

Migrating to utf8mb4: Things to Consider

The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades.

Migration to utf8mb4 has many advantages including:

  • It can store more symbols, including emojis
  • It has new collations for Asian languages
  • It is faster than utf8mb3

Still, you may wonder how the migration affects your existing data. This blog post covers several aspects of it.

Storage Requirements

As the name suggests, the maximum number of bytes that one character can take with the utf8mb4 character set is four. This is larger than for utf8mb3, which takes at most three bytes per character, and for many other MySQL character sets.

Fortunately, utf8mb3 is a subset of utf8mb4, and migration of existing data does not increase the size of the data stored on disk: each character takes as many bytes as needed. For example, any digit or letter in the Latin alphabet will require one byte. Characters from other alphabets can take up to four bytes. This can be verified with a simple test.

mysql> set names utf8mb4;
Query OK, 0 rows affected (0,00 sec)

mysql> CREATE TABLE charset_len( name VARCHAR(255), val CHAR(1) ) CHARACTER SET=utf8mb4;
Query OK, 0 rows affected (0,03 sec)

mysql> INSERT INTO charset_len VALUES('Latin A', 'A'),  ('Cyrillic А', 'А'), ('Korean ㉿', '㉿'), ('Dolphin 🐬', '🐬');
Query OK, 4 rows affected (0,02 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT name, val, HEX(val), BIT_LENGTH(val)/8 FROM charset_len;
+--------------+------+----------+-------------------+
| name         | val  | HEX(val) | BIT_LENGTH(val)/8 |
+--------------+------+----------+-------------------+
| Latin A      | A    | 41       |            1.0000 |
| Cyrillic А   | А    | D090     |            2.0000 |
| Korean ㉿    | ㉿    | E389BF   |            3.0000 |
| Dolphin 🐬   | 🐬    | F09F90AC |            4.0000 |
+--------------+------+----------+-------------------+
4 rows in set (0,00 sec)

As a result, your existing data, which uses at most three bytes per character, does not change, and you gain the ability to store characters that require four-byte encoding.

Maximum Length of the Column

While the storage of existing data does not change, MySQL calculates the maximum amount of data a column can store using the character set's maximum bytes per character, so some column size definitions that work fine for utf8mb3 become invalid. For example, you can have a table with this definition:

mysql> CREATE TABLE len_test(
      -> foo VARCHAR(16384)
      -> ) ENGINE=InnoDB CHARACTER SET utf8mb3;
Query OK, 0 rows affected, 1 warning (0,06 sec)

If you decide to convert this table to use the utf8mb4 character set, the operation will fail:

mysql> ALTER TABLE len_test CONVERT TO CHARACTER SET utf8mb4;
ERROR 1074 (42000): Column length too big for column 'foo' (max = 16383); use BLOB or TEXT instead

The reason for this is that the maximum number of bytes that MySQL can store in a VARCHAR column is 65,535, which allows for 21,845 characters with the utf8mb3 character set but only 16,383 characters with the utf8mb4 character set.

Therefore, if you have columns that could contain more than 16,383 characters, you will need to convert them to the TEXT or LONGTEXT data type.

You can find all such columns if you run the query:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
       CHARACTER_MAXIMUM_LENGTH, DATA_TYPE
FROM information_schema.columns
WHERE CHARACTER_MAXIMUM_LENGTH > 16383 AND
      DATA_TYPE NOT LIKE '%text%' AND 
      DATA_TYPE NOT LIKE '%blob%' AND
      TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema');

For example, in my test environment, it returns:

*************************** 1. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: setup
COLUMN_NAME: value
CHARACTER_MAXIMUM_LENGTH: 20000
DATA_TYPE: varchar
1 row in set (0,02 sec)
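For the column found above, the remedy would be along these lines (a sketch built on the example output; TEXT columns behave slightly differently from VARCHAR, e.g., for index prefixes and default values, so test on a copy first):

mysql> ALTER TABLE test.setup MODIFY `value` TEXT;

After this change, CONVERT TO CHARACTER SET utf8mb4 will no longer fail because of this column.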


Index Storage Requirement

MySQL does not know in advance which characters you will store in a column when you create an index on it. Therefore, when it calculates the storage required for the index, it assumes the maximum bytes per character of the chosen character set. As a result, you may hit the index storage limit when converting from another character set to utf8mb4. For InnoDB, the maximum size of the index key is 767 bytes for the REDUNDANT and COMPACT row formats, and 3072 bytes for the DYNAMIC and COMPRESSED row formats. See the MySQL Reference Manual for details.

That means you need to check whether you have indexes that could grow beyond these limits before performing the conversion. You can do this with the following query:

WITH indexes AS (
     WITH tables AS  (
          SELECT SUBSTRING_INDEX(t.NAME, '/', 1) AS `database`, SUBSTRING_INDEX(t.NAME, '/', -1) AS `table`, i.NAME AS `index`, ROW_FORMAT
          FROM information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLES t USING(TABLE_ID)
    )
    SELECT `database`, `table`, `index`, ROW_FORMAT, GROUP_CONCAT(kcu.COLUMN_NAME) AS columns,
           SUM(c.CHARACTER_MAXIMUM_LENGTH) * 4 AS index_len_bytes
    FROM tables JOIN information_schema.KEY_COLUMN_USAGE kcu
         ON (`database` = TABLE_SCHEMA AND `table` = kcu.TABLE_NAME AND `index` = kcu.CONSTRAINT_NAME)
         JOIN information_schema.COLUMNS c ON (kcu.COLUMN_NAME = c.COLUMN_NAME AND `database` = c.TABLE_SCHEMA AND `table` = c.TABLE_NAME)
    WHERE c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
    GROUP BY `database`, `table`, `index`, ROW_FORMAT ORDER BY index_len_bytes
) SELECT * FROM indexes WHERE index_len_bytes >= 768;

Here is the result of running the query in my test environment:

+----------+--------------+---------+------------+------------+-----------------+
| database | table        | index   | ROW_FORMAT | columns    | index_len_bytes |
+----------+--------------+---------+------------+------------+-----------------+
| cookbook | hitcount     | PRIMARY | Dynamic    | path       |            1020 |
| cookbook | phrase       | PRIMARY | Dynamic    | phrase_val |            1020 |
| cookbook | ruby_session | PRIMARY | Dynamic    | session_id |            1020 |
+----------+--------------+---------+------------+------------+-----------------+
3 rows in set (0,04 sec)

Once you have identified such indexes, check the columns and adjust the table definition accordingly.
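For illustration (a hypothetical table t with an index idx_name on a VARCHAR(255) column, not one of the tables above): with a COMPACT row format, the index would need up to 255 * 4 = 1020 bytes after conversion, exceeding the 767-byte limit. You could either shrink the column or index only a prefix:

ALTER TABLE t MODIFY name VARCHAR(191);
-- or, keeping the full column length but indexing a prefix:
ALTER TABLE t DROP INDEX idx_name, ADD INDEX idx_name (name(191));

Here 191 * 4 = 764 bytes fits under the 767-byte limit; with the prefix variant, remember that only the first 191 characters participate in the index.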

Note: The query uses CTEs (Common Table Expressions), available as of MySQL 8.0. If you are still on version 5.7 or earlier, you will need to rewrite the query, e.g., as sketched below.
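A rough 5.7-compatible version (an untested sketch: the CTEs are inlined as a derived table, and the InnoDB dictionary tables use their 5.7 names, INNODB_SYS_INDEXES and INNODB_SYS_TABLES):

SELECT `database`, `table`, `index`, ROW_FORMAT, GROUP_CONCAT(kcu.COLUMN_NAME) AS columns,
       SUM(c.CHARACTER_MAXIMUM_LENGTH) * 4 AS index_len_bytes
FROM (
     SELECT SUBSTRING_INDEX(t.NAME, '/', 1) AS `database`, SUBSTRING_INDEX(t.NAME, '/', -1) AS `table`,
            i.NAME AS `index`, t.ROW_FORMAT
     FROM information_schema.INNODB_SYS_INDEXES i JOIN information_schema.INNODB_SYS_TABLES t USING(TABLE_ID)
) `tables`
JOIN information_schema.KEY_COLUMN_USAGE kcu
     ON (`database` = kcu.TABLE_SCHEMA AND `table` = kcu.TABLE_NAME AND `index` = kcu.CONSTRAINT_NAME)
JOIN information_schema.COLUMNS c
     ON (kcu.COLUMN_NAME = c.COLUMN_NAME AND `database` = c.TABLE_SCHEMA AND `table` = c.TABLE_NAME)
WHERE c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
GROUP BY `database`, `table`, `index`, ROW_FORMAT
HAVING index_len_bytes >= 768
ORDER BY index_len_bytes;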

Temporary Tables

One more issue you can hit after converting to the utf8mb4 character set is an increased size of the implicit temporary tables that MySQL creates to resolve queries. Since utf8mb4 may store more data than other character sets, the columns of such implicit tables will also be bigger. To figure out whether you are affected by this issue, watch the global status variable Created_tmp_disk_tables. If this starts increasing significantly after the migration, you may consider adding RAM to your machine and increasing the maximum size of in-memory temporary tables. Note that this issue could also be a symptom that some of your queries are poorly optimized.
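A minimal monitoring-and-tuning sketch (the 64M value is only an example; size it to your workload and available memory):

mysql> SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
mysql> SET GLOBAL tmp_table_size = 64 * 1024 * 1024;
mysql> SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;

MySQL uses the smaller of tmp_table_size and max_heap_table_size as the in-memory limit; in MySQL 8.0, the default TempTable engine is additionally capped by temptable_max_ram.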

Conclusion

Converting to the utf8mb4 character set brings you better performance, a larger range of characters that you can use (including emojis), and new collations (sorting rules). The conversion comes at almost no cost and can be done smoothly.

Ensure:

  • You converted all VARCHAR columns that could store more than 16,383 characters to the TEXT or LONGTEXT data type
  • You adjusted index definitions that, after migration, could take more than 767 bytes for the REDUNDANT and COMPACT row formats, or more than 3072 bytes for the DYNAMIC and COMPRESSED row formats
  • You optimized your queries so that they do not start using internal disk-based temporary tables
Mar 28, 2017

Troubleshooting Issues with MySQL Character Sets Q & A

In this blog, I will provide answers to the Q & A for the Troubleshooting Issues with MySQL Character Sets webinar.

First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: We’ve had some issues converting tables from utf8 to utf8mb4. Our issue was that the collation we wanted to use – utf8mb4_unicode_520_ci – did not distinguish between spaces and ideographic (Japanese) spaces, so we were getting unique constraint violations for the varchar fields when two entries had the same text with different kinds of spaces. Have you seen this problem and is there a workaround? We were wondering if this was related to the mother-child character bug with this collation.

A: Unfortunately this issue exists for many languages. For example, in Russian you cannot distinguish “е” and “ё” if you use utf8 or utf8mb4 with their default collations. However, there is hope for Japanese: Oracle announced that they will implement new language-specific utf8mb4 collations in MySQL 8.0. I already see 21 new collations in my 8.0.0 installation.

mysql> show collation like '%0900%';
+----------------------------+---------+-----+---------+----------+---------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen |
+----------------------------+---------+-----+---------+----------+---------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 |         | Yes      |       8 |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       8 |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       8 |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       8 |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       8 |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       8 |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       8 |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       8 |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       8 |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       8 |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       8 |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       8 |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       8 |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       8 |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       8 |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       8 |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       8 |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       8 |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       8 |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       8 |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       8 |
+----------------------------+---------+-----+---------+----------+---------+
21 rows in set (0,03 sec)

In 8.0.1 they promised new case-sensitive and Japanese collations. Please see this blog post for details. The note about the planned Japanese support is at the end.

Meanwhile, I can only suggest that you implement your own collation as described here. You may use the utf8_russian_ci collation from Bug #51976 as an example.

Although the user manual does not list utf8mb4 as a character set for which it’s possible to create new collations, you can actually do it. What you need to do is add a record about the character set utf8mb4 and the new collation into Index.xml, then restart the server:

<charset name="utf8mb4">
  <collation name="utf8mb4_russian_ci" id="1033">
    <rules>
      <reset>\u0415</reset><p>\u0451</p><t>\u0401</t>
    </rules>
  </collation>
</charset>

mysql> show collation like 'utf8mb4_russian_ci';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_russian_ci | utf8mb4 | 1033 |         |          |       8 |
+--------------------+---------+------+---------+----------+---------+
1 row in set (0,03 sec)
mysql> create table test_yo(gen varchar(100) CHARACTER SET utf8mb4, yo varchar(100) CHARACTER SET utf8mb4 collate utf8mb4_russian_ci) engine=innodb default character set=utf8mb4;
Query OK, 0 rows affected (0,25 sec)
mysql> set names utf8mb4;
Query OK, 0 rows affected (0,02 sec)
mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,05 sec)
Records: 3  Duplicates: 0  Warnings: 0
mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,06 sec)
Records: 3  Duplicates: 0  Warnings: 0
mysql> select * from test_yo order by gen;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ????     | ????     |
| ????     | ????     |
| ???      | ???      |
| ???      | ???      |
+----------+----------+
6 rows in set (0,00 sec)
mysql> select * from test_yo order by yo;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ???      | ???      |
| ???      | ???      |
| ????     | ????     |
| ????     | ????     |
+----------+----------+
6 rows in set (0,00 sec)

Q: If receiving utf8 on latin1 charset it will be corrupted. Just want to confirm that you can reformat as utf8 and un-corrupt the data? Also, is there a time limit on how quickly this needs to be done?

A: It will be corrupted only if you store utf8 data in the latin1 column. For example, if you have a table defined as:

create table latin1(
  f1 varchar(100)
) engine=innodb default charset=latin1;

And then insert a word in utf8 format into it that contains characters that are not in the latin1 character set:

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)
mysql> set sql_mode='';
Query OK, 0 rows affected, 1 warning (0,00 sec)
mysql> insert into latin1 values('Sveta'), ('Света');
Query OK, 2 rows affected, 1 warning (0,04 sec)
Records: 2  Duplicates: 0  Warnings: 1

The data in UTF8 will be corrupted and can never be recovered:

mysql> select * from latin1;
+-------+
| f1    |
+-------+
| Sveta |
| ????? |
+-------+
2 rows in set (0,00 sec)
mysql> select f1, hex(f1) from latin1;
+-------+------------+
| f1    | hex(f1)    |
+-------+------------+
| Sveta | 5376657461 |
| ????? | 3F3F3F3F3F |
+-------+------------+
2 rows in set (0,01 sec)

However, if your data is stored in the UTF8 column and you use latin1 for the connection, you will only get a corrupted result set. The data itself will be left untouched:

mysql> create table utf8(f1 varchar(100)) engine=innodb character set utf8;
Query OK, 0 rows affected (0,18 sec)
mysql> insert into utf8 values('Sveta'), ('Света');
Query OK, 2 rows affected (0,15 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> set names latin1;
Query OK, 0 rows affected (0,00 sec)
mysql> select f1, hex(f1) from utf8;
+-------+----------------------+
| f1    | hex(f1)              |
+-------+----------------------+
| Sveta | 5376657461           |
| ????? | D0A1D0B2D0B5D182D0B0 |
+-------+----------------------+
2 rows in set (0,00 sec)
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)
mysql> select f1, hex(f1) from utf8;
+------------+----------------------+
| f1         | hex(f1)              |
+------------+----------------------+
| Sveta      | 5376657461           |
| Света      | D0A1D0B2D0B5D182D0B0 |
+------------+----------------------+
2 rows in set (0,00 sec)

Q: Can you discuss how charsets affect mysqldump? Specifically, how do we dump a database containing tables with different default charsets?

A: Yes, you can. MySQL can successfully convert data that uses different character sets, so your only job is to specify the --default-character-set option for mysqldump. In this case, strings in any character set you use will be converted to the character set specified. For example, if you use cp1251 and latin1, you may set --default-character-set to cp1251, utf8 or utf8mb4. However, you cannot set it to latin1, because Cyrillic characters exist in the cp1251 character set but do not exist in latin1.

The default value for mysqldump is utf8. You only need to change this default if you use values that are outside of the range supported by utf8 (for example, the smileys in utf8mb4).
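For instance, assuming a database named mydb (a placeholder) with a mix of latin1, utf8 and utf8mb4 tables, a dump-and-restore round trip could look like:

$ mysqldump --default-character-set=utf8mb4 --single-transaction mydb > mydb.sql
$ mysql --default-character-set=utf8mb4 mydb < mydb.sql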

Q: But if you use the --single-transaction option for mysqldump, you can only specify one character set in the default?

A: Yes, and this is OK: all data will be converted into this character set. And then, when you restore the dump, it will be converted back to the character set specified in the column definitions.

Q: I noticed that MySQL doesn’t support case-sensitive UTF-8 character sets. What do you recommend for implementing case-sensitive UTF-8, if it’s at all possible?

A: In the link I provided earlier, Oracle promises to implement case-sensitive collations for utf8mb4 in version 8.0.1. Until that happens, I recommend you implement your own case-sensitive collation.

Q: How are tools like pt-table-checksum affected by charsets? Is it safe to use a 4-byte charset (like utf8mb4) as the default charset for all comparisons? Assuming our tables are a mix of latin1, utf8 and utf8mb4.

A: With this combination, you won’t have any issues: pt-table-checksum uses a complicated set of functions that joins columns and calculates a crc32 checksum on them. In your case, all data will be converted to utf8mb4 and no conflicts will happen.

However, if you use incompatible character sets in a single table, you may get the error "Illegal mix of collations for operation 'concat_ws'":

mysql> create table cp1251(f1 varchar(100) character set latin1, f2 varchar(100) character set cp1251) engine=innodb;
Query OK, 0 rows affected (0,32 sec)
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)
mysql> insert into cp1251 values('Sveta', 'Света');
Query OK, 1 row affected (0,07 sec)
sveta@Thinkie:~/build/mysql-8.0/mysql-test$ ~/build/percona-toolkit/bin/pt-table-checksum h=127.0.0.1,P=13000,u=root,D=test
Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.
03-18T03:51:58 Error executing EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/"] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 11351.
03-18T03:51:58 Error checksumming table db1.cp1251: Error executing checksum query: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*checksum table*/" with ParamValues: 0='db1', 1='cp1251', 2=1, 3=undef, 4=undef, 5=undef] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 10741.
TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
03-18T03:51:58      2      0        0       1       0   0.003 db1.cp1251
03-18T03:51:58      0      0        2       1       0   0.167 db1.latin1
03-18T03:51:58      0      0        6       1       0   0.198 db1.test_yo
...

The tool continues working, and will process the rest of your tables. I reported this behavior as Bug #1674266.
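If you hit this error, one workaround (my suggestion rather than a built-in feature of the tool; take a backup first) is to bring the offending table to a single character set before running the checksum:

mysql> ALTER TABLE db1.cp1251 CONVERT TO CHARACTER SET utf8mb4;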

Thanks for attending the Troubleshooting Issues with MySQL Character Sets webinar.

Jul 05, 2016

MySQL 5.7, utf8mb4 and the load data infile

In this post, I’ll discuss how MySQL 5.7 handles UTF8MB4 and the LOAD DATA INFILE statement.

Many of my clients have told me that they do not like using the LOAD DATA INFILE statement and prefer to manually parse and load the data. The main reason they do it is character set issues, specifically with UTF8MB4 and the load data infile. This was surprising to me, as nowadays everyone uses UTF8. MySQL 5.7 (as well as 5.6) has full support for UTF8MB4, which should fix any remaining issues (i.e., you can now load new emoji, like 🐬).

Last week I was investigating an interesting case where we were loading data and got the following error:

$ mysql -e 'select version()'
+-----------+
| version() |
+-----------+
| 5.7.12    |
+-----------+
$ mysql -vvv testdb < load_data.sql
ERROR 1300 (HY000) at line 1: Invalid utf8mb4 character string: 'Casa N'

The load data statement:

LOAD DATA LOCAL INFILE 'input.psv'
REPLACE INTO TABLE input
CHARACTER SET utf8mb4
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

The table uses the correct character set (the table-level character set applies to all varchar fields):

CREATE TABLE `input` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  ...
  `address` varchar(255) DEFAULT NULL,
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

The string looked like “Casa Nº 24”, so this should be N + U+00BA (MASCULINE ORDINAL INDICATOR, hex code: c2ba). When I do “less input.psv”, it shows N<BA> 24. So why can’t MySQL load it?

After further investigation, we discovered that the original encoding of the file is not UTF8. We found this out by running:

$ file -i input.psv
input.psv: text/plain; charset=iso-8859-1

So the code <BA> was misleading. Also, when I extracted the actual character from the file, it was just one byte (the UTF8 encoding of this character should be two bytes). When MySQL parsed the input file as UTF8, it found only the first part of a multibyte UTF8 sequence and stopped with an error.

The original character in hex is “ba”:

$ xxd -p char_ascii
ba0a

(0a is a newline, and “ba” is the “masculine ordinal indicator”)

The UTF8 equivalent:

$ xxd -p char_utf8
c2ba0a

This is now two bytes (plus a newline): c2ba

To solve the problem, we can simply change CHARACTER SET utf8mb4 to CHARACTER SET latin1 in the LOAD DATA statement, while keeping utf8mb4 in the table definition.
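With that change, the statement from above would look like this (a sketch; only the CHARACTER SET clause differs):

LOAD DATA LOCAL INFILE 'input.psv'
REPLACE INTO TABLE input
CHARACTER SET latin1
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

This fixed the issue: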

Query OK, 2 rows affected (0.00 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> select address from input;
+--------------------------------+
| address                        |
+--------------------------------+
| Casa Nº 24 ................... |
...
+--------------------------------+
2 rows in set (0.00 sec)

Another option would be to detect the character set of the input file (the file command can do it, as shown above) and convert it to UTF8 (iconv can do that).
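A sketch of that conversion, using the encoding that the file command reported above:

$ iconv -f ISO-8859-1 -t UTF-8 input.psv > input_utf8.psv

The converted file can then be loaded with CHARACTER SET utf8mb4 as originally intended.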

But it worked before…?

It worked a bit differently in MySQL 5.6:

$ mysql -e 'select version()'
+-------------+
| version()   |
+-------------+
| 5.6.25-73.0 |
+-------------+
$ mysql -vvv testdb < load_data.sql
...
Query OK, 2 rows affected, 2 warnings (0.00 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 2
--------------
show warnings
--------------
+---------+------+---------------------------------------------------------------------+
| Level   | Code | Message                                                             |
+---------+------+---------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xBA 24 ...' for column 'address' at row 1 |
| Warning | 1366 | Incorrect string value: '\xBA 24 ...' for column 'address' at row 2 |
+---------+------+---------------------------------------------------------------------+
2 rows in set (0.00 sec)

MySQL 5.7 is stricter and doesn’t allow you to insert data in the wrong format. However, it is not 100% consistent: for some characters, MySQL 5.7 only downgrades the error to a warning once strict SQL mode is disabled.

Another character that caused the same issue was \xC9. When loading into MySQL 5.7 with the default sql_mode (ONLY_FULL_GROUP_BY, STRICT_TRANS_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER, NO_ENGINE_SUBSTITUTION), it throws an error:

ERROR 1366 (HY000) at line 1: Incorrect string value: '\xC9' for column 'address' at row 1

With strict mode disabled, the error turns into a warning:

mysql> set global sql_mode = '';
Query OK, 0 rows affected (0.00 sec)
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 1  Deleted: 1  Skipped: 0  Warnings: 1
--------------
show warnings
--------------
+---------+------+--------------------------------------------------------------+
| Level   | Code | Message                                                      |
+---------+------+--------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xC9' for column 'address' at row 1 |
+---------+------+--------------------------------------------------------------+
1 row in set (0.00 sec)

Emoji in MySQL

With UTF8MB4 support (in MySQL 5.6 and 5.7), you can also insert a little dolphin into a MySQL table:

CREATE TABLE `test_utf8mb4` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `v` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into test_utf8mb4 (v) values ('Dolphin:🐬');
Query OK, 1 row affected (0.00 sec)
mysql> select * from test_utf8mb4;
+----+--------------+
| id | v            |
+----+--------------+
|  1 | Dolphin:🐬   |
+----+--------------+
1 row in set (0.00 sec)

This should help you clear up issues with UTF8MB4 and the load data infile. Have fun!
