Aiden, a London-based startup building a machine learning-powered personal assistant to save mobile marketers time and money, closed a $750,000 seed round today from Kima Ventures and a number of angels, including Nicolas Pinto, Pierre Valade and Jonathan Wolf. The team first demoed the capabilities of its service on the stage of TechCrunch Disrupt as a Battlefield finalist.
Alchemist Accelerator has raised $2 million from GE Digital to start a new program for industrial IoT startups. Stanford lecturer Timothy Chou, formerly President of Oracle On Demand, will chair Alchemist’s new IIoT accelerator along with GE Digital’s West Coast group. In the past, enterprise hardware and software startups were seen as capital intensive, with the challenge of…
Cloud Foundry, a massive open source project that allows enterprises to host their own platform-as-a-service for running cloud applications in their own data center or in a public cloud, today announced the launch of its “Cloud Foundry Certified Developer” program. The Cloud Foundry Foundation calls this “the world’s largest cloud-native developer certification…
In the recruiting wars, a cool office is critical — not just for scrappy startups, but for big businesses with regional HQs as well. So while WeWork signs questionable 20-year leases to provide desks for twenty-something engineers, Industrious is taking a classier, more conservative approach to coworking space.
In this blog, I will provide answers to the Q & A for the Troubleshooting Issues with MySQL Character Sets webinar.
First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:
Q: We’ve had some issues converting tables to utf8mb4. Our issue was that the collation we wanted to use – utf8mb4_unicode_ci – did not distinguish between ordinary spaces and ideographic (Japanese) spaces, so we were getting unique constraint violations on those fields when two entries had the same text with different kinds of spaces. Have you seen this problem, and is there a workaround? We were wondering if this was related to the mother-child character bug with this collation.
A: Unfortunately, this issue exists for many languages. For example, in Russian you cannot distinguish “е” and “ё” if you use utf8mb4_unicode_ci. However, there is hope for Japanese: Oracle announced that they will implement new language-specific utf8mb4 collations in MySQL 8.0. I already see 21 new collations in my 8.0.0 installation.
mysql> show collation like '%0900%';
+----------------------------+---------+-----+---------+----------+---------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen |
+----------------------------+---------+-----+---------+----------+---------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 |         | Yes      |       8 |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       8 |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       8 |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       8 |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       8 |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       8 |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       8 |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       8 |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       8 |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       8 |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       8 |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       8 |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       8 |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       8 |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       8 |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       8 |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       8 |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       8 |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       8 |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       8 |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       8 |
+----------------------------+---------+-----+---------+----------+---------+
21 rows in set (0,03 sec)
In 8.0.1 they promised new case-sensitive and Japanese collations. Please see this blog post for details. The note about the planned Japanese support is at the end.
Meanwhile, I can only suggest that you implement your own collation, as described in the user manual. You may use the collation from Bug #51976 as an example. Although the user manual does not list utf8mb4 as a character set for which it’s possible to create new collations, you can actually do it. What you need to do is add a record for the character set utf8mb4 and the new collation into the Index.xml file in the character sets directory, then restart the server.
<charset name="utf8mb4">
  <collation name="utf8mb4_russian_ci" id="1033">
    <rules>
      <reset>\u0415</reset><p>\u0451</p><t>\u0401</t>
    </rules>
  </collation>
</charset>

mysql> show collation like 'utf8mb4_russian_ci';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_russian_ci | utf8mb4 | 1033 |         |          |       8 |
+--------------------+---------+------+---------+----------+---------+
1 row in set (0,03 sec)

mysql> create table test_yo(
    ->   gen varchar(100) CHARACTER SET utf8mb4,
    ->   yo varchar(100) CHARACTER SET utf8mb4 collate utf8mb4_russian_ci
    -> ) engine=innodb default character set=utf8mb4;
Query OK, 0 rows affected (0,25 sec)

mysql> set names utf8mb4;
Query OK, 0 rows affected (0,02 sec)

mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,05 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> insert into test_yo values('??', '??'), ('???', '???'), ('????', '????');
Query OK, 3 rows affected (0,06 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from test_yo order by gen;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ????     | ????     |
| ????     | ????     |
| ???      | ???      |
| ???      | ???      |
+----------+----------+
6 rows in set (0,00 sec)

mysql> select * from test_yo order by yo;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ??       | ??       |
| ??       | ??       |
| ???      | ???      |
| ???      | ???      |
| ????     | ????     |
| ????     | ????     |
+----------+----------+
6 rows in set (0,00 sec)
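The ideographic-space collision from the question can be reproduced outside MySQL. The two strings below differ byte-wise, but Unicode compatibility normalization folds U+3000 (ideographic space) into an ordinary space — a rough analogy for how an accent-insensitive collation assigns both spaces the same weight. A minimal Python sketch (the sample strings are illustrative):

```python
import unicodedata

ascii_space = "東京 タワー"       # U+0020, an ordinary space
ideo_space = "東京\u3000タワー"   # U+3000, an ideographic space

# The strings are different byte sequences...
assert ascii_space != ideo_space

# ...but NFKC compatibility normalization maps U+3000 to U+0020.
# A collation that weights the two spaces equally therefore raises
# duplicate-key errors on a unique index over such values.
assert unicodedata.normalize("NFKC", ascii_space) == \
       unicodedata.normalize("NFKC", ideo_space)
```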
Q: If we receive utf8 data into a latin1 charset, it will be corrupted. We just want to confirm: can you reformat it as utf8 and un-corrupt the data? Also, is there a time limit on how quickly this needs to be done?
A: It will be corrupted only if you store utf8 data in a latin1 column. For example, if you have a table defined as:
create table latin1( f1 varchar(100) ) engine=innodb default charset=latin1;
And then insert a word in utf8 format into it that contains characters that are not in the latin1 character set:
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> set sql_mode='';
Query OK, 0 rows affected, 1 warning (0,00 sec)

mysql> insert into latin1 values('Sveta'), ('Света');
Query OK, 2 rows affected, 1 warning (0,04 sec)
Records: 2  Duplicates: 0  Warnings: 1
The data in the latin1 table will be corrupted and can never be recovered:
mysql> select * from latin1;
+-------+
| f1    |
+-------+
| Sveta |
| ????? |
+-------+
2 rows in set (0,00 sec)

mysql> select f1, hex(f1) from latin1;
+-------+------------+
| f1    | hex(f1)    |
+-------+------------+
| Sveta | 5376657461 |
| ????? | 3F3F3F3F3F |
+-------+------------+
2 rows in set (0,01 sec)
However, if your data is stored in a utf8 column and you use latin1 for the connection, you will only get a corrupted result set. The data itself is left untouched:
mysql> create table utf8(f1 varchar(100)) engine=innodb character set utf8;
Query OK, 0 rows affected (0,18 sec)

mysql> insert into utf8 values('Sveta'), ('Света');
Query OK, 2 rows affected (0,15 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> set names latin1;
Query OK, 0 rows affected (0,00 sec)

mysql> select f1, hex(f1) from utf8;
+-------+----------------------+
| f1    | hex(f1)              |
+-------+----------------------+
| Sveta | 5376657461           |
| ????? | D0A1D0B2D0B5D182D0B0 |
+-------+----------------------+
2 rows in set (0,00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> select f1, hex(f1) from utf8;
+-------+----------------------+
| f1    | hex(f1)              |
+-------+----------------------+
| Sveta | 5376657461           |
| Света | D0A1D0B2D0B5D182D0B0 |
+-------+----------------------+
2 rows in set (0,00 sec)
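The two failure modes above can be mirrored in Python (a sketch; "Света" stands in for any text outside latin1). Storing utf8 data in a latin1 column destroys information, while merely reading utf8 data over a latin1 connection is a reversible mis-decoding:

```python
text = "Света"  # Cyrillic, not representable in latin1

# Case 1: storing utf8 data in a latin1 column. Characters outside
# latin1 are replaced with '?' (as MySQL does with sql_mode='') --
# the original bytes are gone for good.
stored = text.encode("latin-1", errors="replace")
assert stored == b"?????"          # hex 3F3F3F3F3F, as in the example

# Case 2: utf8 data read over a latin1 connection. The bytes on disk
# are intact; the client merely mis-decodes them for display.
utf8_bytes = text.encode("utf-8")  # hex D0A1D0B2D0B5D182D0B0
mojibake = utf8_bytes.decode("latin-1")
# Switching the connection back to utf8 recovers the original text:
assert mojibake.encode("latin-1").decode("utf-8") == text
```

This is also why there is no time limit in the second case: nothing on the server ever changed.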
Q: Can you discuss how charsets affect mysqldump? Specifically, how do we dump a database containing tables with different default charsets?
A: Yes, you can. MySQL can successfully convert data that uses different character sets, so your only job is to specify the --default-character-set option. Strings in any character set you use can then be converted to the specified character set. For example, if your data is in cp1251, you may set --default-character-set=utf8. However, you cannot set it to latin1, because Cyrillic characters exist in the cp1251 character set but do not exist in latin1. The default value for --default-character-set is utf8. You only need to change this default if you use values that are outside of the range supported by utf8 (for example, the smileys in utf8mb4).
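The conversion rule can be sanity-checked with Python's codecs (a sketch; mysqldump performs the analogous conversion on the server side):

```python
word = "Света"

# cp1251 data converts cleanly to utf8: every cp1251 character has a
# utf8 representation, so a utf8 dump target is always safe here.
recoded = word.encode("cp1251").decode("cp1251").encode("utf-8")
assert recoded.decode("utf-8") == word

# Converting the same data to latin1 fails: Cyrillic has no latin1
# code points, which is why latin1 is not a valid dump target.
try:
    word.encode("latin-1")
    raise AssertionError("should not be reachable")
except UnicodeEncodeError:
    pass
```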
Q: But if you use the --default-character-set option, you can only specify one character set as the default?
A: Yes, and this is OK: all data will be converted into this character set. Then, when you restore the dump, it will be converted back to the character set specified in the column definitions.
Q: I noticed that MySQL doesn’t support case-sensitive UTF-8 character sets. What do you recommend for implementing case-sensitive UTF-8, if it’s at all possible?
A: In the link I provided earlier, Oracle promises to implement case-sensitive collations for utf8mb4 in version 8.0.1. Before that happens, I recommend that you implement your own case-sensitive collation.
Q: How are tools like pt-table-checksum affected by charsets? Is it safe to use a 4-byte charset (like utf8mb4) as the default charset for all comparisons, assuming our tables use a mix of character sets?
A: With this combination, you won’t have any issues: pt-table-checksum uses a complicated set of functions that joins the columns and calculates a CRC32 checksum on them. In your case, all data will be converted to utf8mb4 and no conflicts will happen.
However, if you use incompatible character sets in a single table, you may get the error "Illegal mix of collations for operation 'concat_ws'":
mysql> create table cp1251(
    ->   f1 varchar(100) character set latin1,
    ->   f2 varchar(100) character set cp1251
    -> ) engine=innodb;
Query OK, 0 rows affected (0,32 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> insert into cp1251 values('Sveta', 'Света');
Query OK, 1 row affected (0,07 sec)

sveta@Thinkie:~/build/mysql-8.0/mysql-test$ ~/build/percona-toolkit/bin/pt-table-checksum h=127.0.0.1,P=13000,u=root,D=test
Diffs cannot be detected because no slaves were found. Please read the --recursion-method documentation for information.
03-18T03:51:58 Error executing EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/"] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 11351.
03-18T03:51:58 Error checksumming table db1.cp1251: Error executing checksum query: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*checksum table*/" with ParamValues: 0='db1', 1='cp1251', 2=1, 3=undef, 4=undef, 5=undef] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 10741.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
03-18T03:51:58      2      0        0       1       0   0.003 db1.cp1251
03-18T03:51:58      0      0        2       1       0   0.167 db1.latin1
03-18T03:51:58      0      0        6       1       0   0.198 db1.test_yo
...
The tool continues working, and will process the rest of your tables. I reported this behavior as Bug #1674266.
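The checksum pt-table-checksum builds — visible in the error above as CRC32 over CONCAT_WS of the columns, aggregated with BIT_XOR — can be sketched in Python. The sample rows and separator are illustrative; the point is why the aggregate is order-independent:

```python
import zlib
from functools import reduce

def row_crc(row):
    # Mirrors CRC32(CONCAT_WS('#', col1, col2, ...)) from the query above.
    return zlib.crc32("#".join(str(c) for c in row).encode("utf-8"))

def table_checksum(rows):
    # BIT_XOR aggregation: XOR is commutative, so master and slave can
    # compare chunk checksums without sorting the rows identically.
    return reduce(lambda a, b: a ^ b, (row_crc(r) for r in rows), 0)

a = [("Sveta", "Света"), ("f1", "f2")]
b = list(reversed(a))              # same rows, different physical order
assert table_checksum(a) == table_checksum(b)
```

In Python every str is already Unicode, which parallels the answer above: once everything is converted to one character set (utf8mb4), concatenation and checksumming cannot hit an "illegal mix" error.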
Thanks for attending the Troubleshooting Issues with MySQL Character Sets webinar.
AWS continues to add yet more software and services to build out its revenues and touchpoints with businesses that already use its cloud infrastructure for storage and to host and administer services and apps. The latest product, launching today, is Amazon Connect, a cloud-based contact center solution. AWS said it is based on the same tech that Amazon itself has built and uses in-house…
After reading Baron’s predictions on databases, I want to provide my own view on what’s coming up next for SQL databases. I think we live in interesting times, when we can see the beginning of the next generation of RDBMSs.
These next-generation databases share several defining characteristics:
- Auto-scaling. The ability to add and use resources depending on the current load and database size. This is done transparently for users and DBAs.
- Auto-healing. The automatic handling of node failures.
- Multi-regional, cloud-agnostic, geo-distributed. The ability to support multiple data centers and multiple clouds, in different parts of the world.
- Transactional. All the above, with the ability to support multi-statement transactional workloads.
- Strong consistency. The full definition of strong consistency is pretty involved. For simplicity, let’s say it means that reads (in the absence of ongoing writes) will return the same data regardless of which region or data center you read from. A simple counter-example is the famous MySQL asynchronous replication, where (with slave delay) reading the data on a slave can return very outdated data. I am focusing on reads because, in a distributed environment, it is consistent read performance that suffers: network latency (often limited by the speed of light) will define performance.
- SQL language. SQL, despite being old and widely criticized, is not going anywhere. This is a universal language for app developers to access data.
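The asynchronous-replication counter-example above can be made concrete with a toy model (the class and key names are entirely illustrative, not any real replication API): the replica replays the primary's log with a delay, so a read on the replica returns stale data, which is exactly what strong consistency rules out.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []            # replication log (toy, statement-based)

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class AsyncReplica:
    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0         # how far into the log we have replayed

    def catch_up(self, n):
        # Apply at most n pending log events -- this models slave delay.
        for key, value in self.primary.log[self.applied:self.applied + n]:
            self.data[key] = value
        self.applied = min(self.applied + n, len(self.primary.log))

master = Primary()
slave = AsyncReplica(master)

master.write("balance", 100)
slave.catch_up(1)                # replica is current...
master.write("balance", 42)      # ...until the next un-replicated write

# Reads now disagree: the replica serves outdated data.
assert master.data["balance"] == 42
assert slave.data["balance"] == 100
```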
With this in mind, I see the following interesting projects:
- Google Cloud Spanner (https://cloud.google.com/spanner/). Recently announced and still in the Beta stage. Definitely an interesting project, with the obvious limitation of running only in Google Cloud.
- FaunaDB (https://fauna.com/). Also very recently announced, so it is hard to say how it performs. The major downside I see is that it does not provide SQL access, but uses a custom language.
- Two open source projects:
- CockroachDB (https://www.cockroachlabs.com/). This is still in the Beta stage, but definitely an interesting project to follow. Initially, the project planned to support only key-value access, but later they made a very smart decision to provide SQL access via a PostgreSQL-compatible protocol.
- TiDB (https://github.com/pingcap/tidb). Right now in the RC stage, with the target of providing SQL access over a MySQL-compatible protocol (and later the PostgreSQL protocol).
Protocol compatibility is a wise approach, although not strictly necessary. It lowers the entry barrier for existing applications.
Both CockroachDB and TiDB, at the moment of this writing, still have rough edges and can’t be used in serious deployments (in my experience). I expect both projects to make big progress in 2017.
What shared characteristics can we expect from these systems?
As I mentioned above, we may see that the read performance is degraded (as latency increases), and often it will be defined more by network performance than anything else. Storage IO and CPU cycles will be secondary factors. There will be more work on how to understand and tune the network traffic.
We may need to get used to the fact that point selects and small range selects become much slower. Right now, we see very fast point selects in traditional RDBMSs (MySQL, PostgreSQL, etc.).
Heavy writes will be problematic. The problem is that all writes will need to go through the consistency protocol. Write-optimized storage engines will help (both CockroachDB and TiDB use RocksDB in the storage layer).
Long transactions (say, changing 100,000 or more rows) will also be problematic. There are simply too many network round trips and too much housekeeping work on each node, making long transactions an issue for distributed systems.
Another shared property (at least between CockroachDB and TiDB) is the active use of the Raft protocol to achieve consistency. It will be important to understand how this protocol works in order to use these systems effectively. You can find a good overview of the Raft protocol here: http://container-solutions.com/raft-explained-part-1-the-consenus-problem/.
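One Raft idea directly relevant to the write costs discussed above is the commit rule: an entry commits only once a majority of nodes has persisted it, so every write pays at least one network round trip to remote replicas. A toy sketch of just that rule (not real Raft — no terms, elections, or log repair):

```python
def commit_index(match_index, cluster_size):
    # Raft-style commit rule (simplified): the committed index is the
    # highest log index replicated on a strict majority of nodes.
    # The leader's own copy counts as one of the replicas.
    majority = cluster_size // 2 + 1
    return sorted(match_index, reverse=True)[majority - 1]

# A 5-node cluster whose replica logs reach these indexes (leader first):
# a majority (3 nodes) holds everything up to index 7, so 7 is committed,
# while entries 8-10 are still in flight.
assert commit_index([10, 9, 7, 4, 2], 5) == 7
```

This majority round trip is the latency floor for writes in such systems, regardless of how fast the storage engine underneath is.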
There are probably more NewSQL technologies than I have mentioned here, but I do not think any of them has captured critical market- or mind-share. So we are at the beginning of interesting times . . .
What about MySQL? Can MySQL become the database that provides all these characteristics? It is possible, but I do not think it will happen anytime soon. MySQL would need to provide automatic sharding to do this, which will be very hard to implement given the current internal design. It may happen in the future, though it will require a lot of engineering effort to make it work properly.
A startup based in Petah Tikva, Israel, Airobotics, has scored the right to fly drones autonomously for business purposes in Israel. The Civil Aviation Authority of Israel (CAAI) was the first in the world to authorize commercial, fully unmanned drone flights in their nation’s airspace. Read More
Percona announces the availability of Percona Toolkit 3.0.2 on March 27, 2017.
Percona Toolkit is a collection of advanced command-line tools that perform a variety of MySQL and MongoDB server and system tasks too difficult or complex for DBAs to perform manually. Percona Toolkit, like all Percona software, is free and open source.
This release includes the following changes:
- PT-74: Fixed gathering of security settings when running pt-mongodb-summary on a mongod instance that is specified as the host
- PT-75: Changed the default sort order in pt-mongodb-query-digest output to descending
- PT-76: Added support of & and # symbols in passwords for pt-mysql-summary
- PT-77: Updated Makefile to support new MongoDB tools
- PT-89: Fixed pt-stalk to run top more than once to collect useful CPU usage
- PT-93: Fixed pt-mongodb-query-digest to make query ID match query key (Thanks, Kamil Dziedzic)
- PT-94: Fixed pt-online-schema-change to not make duplicate rows in _t_new when updating the primary key. Also, see 1646713.
- PT-101: Fixed pt-table-checksum to correctly use the --slave-user and --slave-password options. Also, see 1651002.
- PT-105: Fixed pt-table-checksum to continue running if a database is dropped in the process
Medical billing is a largely untapped and lucrative industry, potentially pulling in $55 billion globally by 2020. But its inner workings are still very murky — most of the time it’s not clear how much something will cost, and sometimes you don’t even get the (possibly whopping) bill until months down the road. Founder Katelyn Gleason wants to make it easier…