Oct
09
2017
--

MySQL and MariaDB Default Configuration Differences

MySQL and MariaDB Default Configuration

MySQL and MariaDB Default ConfigurationIn this blog post, I’ll discuss some of the MySQL and MariaDB default configuration differences, focusing on MySQL 5.7 and MariaDB 10.2.

MariaDB Server is a general purpose open source database, created by the founders of MySQL. MariaDB Server (referred to as MariaDB for brevity) has similar roots as Percona Server for MySQL, but is quickly diverging from MySQL compatibility and growing on its own. MariaDB has become the default installation for several operating systems (such as Red Hat Enterprise Linux/CentOS/Fedora). Changes in the default variables can make a large difference in the out-of-box performance of the database, so knowing what is different is important.

As MariaDB grows on its own and doesn’t remain 100% compatible with MySQL, the defaults configuration settings might not mean everything or behave the way they used to. It might use different variable names, or implement the same variables in new ways. You also need to take into account that MariaDB uses it’s own Aria storage engine that has many configuration options that do not exist in MySQL.

Note: In this blog, I am looking at variables common to both MySQL or MariaDB, but have different defaults, not variables that are specific to either MySQL or MariaDB (except for the different switches inside the optimizer_switch).

Binary Logs

Variable MariaDB Default MySQL Default
sync_binlog 0 1
binlog_format Mixed Row

 

MySQL has taken a more conservative stance when it comes to the binary log. In the newest versions of MySQL 5.7, they have updated two variables to help ensure all committed data remains intact and identical. Binlog_format was updated to row in MySQL in order to prevent non-deterministic statements from having different results on the slave. Row-based replication also helps when performing a lot of smaller updates. MariaDB defaults to the Mixed format. Mixed uses statement-based format unless certain criteria are met. It hat case, it uses the row format. You can see the detailed criteria for when the row format is used here: https://mariadb.com/kb/en/the-mariadb-library/binary-log-formats/.

The other difference that can cause a significant impact on performance is related to sync_binlog. Sync_binlog controls the number of commit groups to collect before synchronizing the binary log to disk. MySQL has changed this to 1, which means that every transaction is flushed to disk before it is committed. This guarantees that there can never be a committed transaction that is not recorded (even during a system failure). This can create a big impact to performance, as shown by a Roel Van de Paar in his blog: https://www.percona.com/blog/2016/06/03/binary-logs-make-mysql-5-7-slower-than-5-6/

MariaDB utilizes a value of 0 for sync_binlog, which allows the operating system to determine when the binlog needs to be flushed. This provides better performance, but adds the risk that if MariaDB crashes (or power is lost) that some data may be lost.

MyISAM

Variable MariaDB Default MySQL Default
myisam_recover_options BACKUP,QUICK OFF
key_buffer_size 134217728 8388608

 

InnoDB replaced MyISAM as the default storage engine for some time now, but it is still used for many system tables. MySQL has tuned down the MyISAM settings, since it is not heavily used.

When mysqld opens a table, it checks whether the table is marked as crashed, or was not closed properly, and runs a check on it based on the myisam_recover_options settings. MySQL disables this by default, preventing recovery. MariaDB has enabled the BACKUP and QUICK recovery options. BACKUP causes a table_name-datetime.bak file to be created whenever a data file is changed during recovery. QUICK causes mysqld to not check the rows in a table if there are no delete blocks, ensuring recovery can occur faster.

MariaDB 10.2 increased the key_buffer_size. This allows for more index blocks to be stored in memory. All threads use this buffer, so a small buffer can cause information to get moved in and out of it more quickly. MariaDB 10.2 uses a buffer 16 times the size of MySQL 5.7: 134217728 in MariaDB 10.2 vsx 8388608 in MySQL 5.7.

Innodb

Variable MariaDB Default MySQL Default
innodb_max_undo_log_size 10485760(10 MiB) 1073741824(1024 MiB)

 

InnoDB variables have remained primarily unchanged between MariaDB 10.2 and MySQL 5.7. MariaDB has reduced the innodb_max_undo_log_size starting in 10.2.6. This was reduced from MySQL’s default of 1073741824(1024 MiB) to 10485760(10 MiB). These sizes reflect the maximum size an undo tablespace can become before it is marked for truncation. The tablespace doesn’t get truncated unless innodb_undo_log_truncate is enabled, and it is disabled in MySQL 5.7 and MariaDB 10.2 by default.

Logging

Variable MariaDB Default MySQL Default
log_error /var/log/mysqld.log
log_slow_admin_statements ON OFF
log_slow_slave_statements ON OFF
lc_messages_dir /usr/share/mysql

 

Logs are extremely important for troubleshooting any issues so the different choices in logging for MySQL 5.7 and MariaDB 10.2 are very interesting.

The log_error variable allows you to control where errors get logged. MariaDB 10.2 leaves this variable blank, writing all errors to stderr. MySQL 5.7 uses an explicitly created file at: /var/log/mysqld.log.

MariaDB 10.2 has also enabled additional slow statement logging. Log_slow_admin_statements create a record for any administrative statements that are not typically written to the binlog. Log_slow_slave_statements log the replicated statements sent from the master, if they are slow to complete. MySQL 5.7 does not enable logging of these statements by default.

Lc_messages_dir is the directory that contains the error message files for various languages. The variable defaults might be a little misleading in MariaDB 10.2. Lc_messages_dir is left empty by default, although it still uses the same path as MySQL 5.7. The files are located in /usr/share/mysql by default for both databases.

Performance Schema

Variable MariaDB Default MySQL Default
performance_schema OFF ON
performance_schema_setup_actors_size 100 -1 (auto adjusted)
performance_schema_setup_objects_size 100 -1 (auto adjusted)

 

The performance schema is an instrumentation tool that is designed to help troubleshoot various performance concerns. MySQL 5.7 enables the performance schema, and many of its instruments, by default. MySQL even goes so far as to detect the appropriate value for many Performance Schema variables instead of setting a static default. The Performance Schema does come with some overhead, and there are many blogs regarding how much this can impact performance. I think Sveta Smirnova said it best in her blog  Performance Schema Benchmarks OLTP RW: “…test on your system! No generic benchmark can exactly repeat a workload on your site.

MariaDB has disabled the Performance Schema by default, as well as adjusted a couple of the dynamic variables. Note that if you wish to disable or enable the Performance Schema, it requires a restart of the server since these variables are not dynamic. Performance_schema_setup_actors_size and performance_schema_setup_objects_size have both been set to a static 100, instead of the dynamic -1 used in MySQL 5.7. These both limit the number of rows that can be stored in relative tables. This creates a hard limit to the size these tables can grow to, helping to reduce their data footprint.

SSL/TLS

Variable MariaDB Default MySQL Default
ssl_ca ca.pem
ssl_cert server-cert.pem
ssl_key server-key.pem

 

Secure Sockets Layer (SSL) and Transport Layer Security (TLS) are cryptographic protocols that allow for secure communication. SSL is actually the predecessor of TLS, although both are often referred to as SSL. MySQL 5.7 and MariaDB 10.2 support both yaSSL and OpenSSL. The default configurations for SSL/TLS differ only slightly between MySQL 5.7 and MariaDB 10.2. MySQL 5.7 sets a specific file name for ssl_ca, ssl_cert, and ssl_key. These files are created in the base directory, identified by the variable basedir. Each of these variables is left blank in MariaDB 10.2, so you need to set them before using secure connections. These variables are not dynamic, so be sure to set the values before starting your database.

Query Optimizer

MariaDB 10.2 MySQL 5.7 Optimization Meaning Switch
N/A OFF Batched Key Access Controls use of BKA join algorithm batched_key_access
N/A ON Block Nested-Loop Controls use of BNL join algorithm block_nested_loop
N/A ON Condition Filtering Controls use of condition filtering condition_fanout_filter
Deprecated ON Engine Condition Pushdown Controls engine condition pushdown engine_condition_pushdown
ON N/A Engine Condition Pushdown Controls ability to push conditions down into non-mergeable views and derived tables condition_pushdown_for_derived
ON N/A Exists Subquery Allows conversion of in statements to exists statements exists_to_in
ON N/A Exists Subquery Allows conversion of exists statements to in statements in_to_exists
N/A ON Index Extensions Controls use of index extensions use_index_extensions
OFF N/A Index Merge Allows index_merge for non-equality conditions index_merge_sort_intersection
ON N/A Join Algorithms Perform index lookups for a batch of records from the join buffer join_cache_bka
ON N/A Join Algorithms Controls use of BNLH and BKAH algorithms join_cache_hashed
ON N/A Join Algorithms Controls use of incremental algorithms join_cache_incremental
ON N/A Join Algorithms Controls use of block-based algorithms for outer joins outer_join_with_cache
ON N/A Join Algorithms Controls block-based algorithms for use with semi-join operations semijoin_with_cache
OFF N/A Join Buffer Creates the join buffer with an estimated size based on the estimated number of rows in the result optimize_join_buffer_size
ON N/A Materialized Temporary Tables Allows index creation on derived temporary tables derived_keys
ON N/A Materialized Temporary Tables Controls use of the rowid-merge strategy partial_match_rowid_merge
ON N/A Materialized Temporary Tables Controls use of the partial_match_table-scan strategy partial_match_table_scan
OFF ON Multi-Range Read Controls use of the multi-range read strategy mrr
OFF ON Multi-Range Read Controls use of cost-based MRR, if mrr=on mrr_cost_based
OFF N/A Multi-Range Read Enables key ordered scans if mrr=on mrr_sort_keys
ON N/A Order By Considers multiple equalities when ordering results ordery_uses_equalities
ON N/A Query Plan Allows the optimizer to use hidden components of InnoDB keys extended_keys
ON N/A Query Plan Controls the removal of irrelevant tables from the execution plan table_elimination
ON N/A Subquery Stores subquery results and correlation parameters for reuse subquery_cache
N/A ON Subquery Materialization Controls us of cost-based materialization ubquery_materialization_cost_based
N/A ON Subquery Materialization &

Semi-join

Controls the semi-join duplicate weedout strategy duplicateweedout

 

The query optimizer has several variances that not only affect query performance but also how you write SQL statements. The query optimizer is substantially different between MariaDB and MySQL, so even with identical configurations you are likely to see varying performance.

The sql_mode puts restrictions on how you can write queries. MySQL 5.7 has several additional restrictions compared to MariaDB 10.2. Only_full_group_by requires that all fields in any select…group by statement are either aggregated or inside the group by clause. The optimizer doesn’t assume anything regarding the grouping, so you must specify it explicitly.

No_zero_date, and no_zero_in_date both affect how the server interprets 0’s in dates. When no_zero_date is enabled, values of ‘0000-00-00’ are permitted but produce a warning. With strict mode enabled, then the value is not permitted and produces an error. No_zero_in_date is similar, except it applies to any section of the date(month, day, or year). With this disabled, dates with 0 parts, such as ‘2017-00-16’ are allowed as is. When enabled, the date is changed to ‘0000-00-00’ without warning. Strict mode prevents the date being inserted, unless ignore is provided as well. “INSERT IGNORE” and “UPDATE IGNORE” inserts the dates as ‘0000-00-00’. 5.7.4 changed this. No_zero_in_date was consolidated with strict mode, and the explicit option is deprecated.

The query_prealloc_size determines the size of the persistent buffer used for statement parsing and execution. If you regularly use complex queries, it can be useful to increase the size of this buffer, as it does not need to allocate additional memory during the query parsing. MySQL 5.7 has set this buffer to 8192, with a block size of 1024. MariaDB increased this value in 10.1.2 up to 24576.

Query_alloc_block_size dictates the size in bytes of any extra blocks allocated during query parsing. If memory fragmentation is a common problem, you might want to look at increasing this value. MySQL 5.7 uses 8192, while MariaDB 10.2 uses 16384 (twice that). Be careful when adjusting the block sizes: going too high consumes more than the needed amount of memory, and too low causes significant fragmentation.

The optimizer_switch variable contains many different switches that impact how the query optimizer plans and performs different queries. MariaDB 10.2 and MySQL 5.7 have many differences in their enabled options, and even the available options. You can see a brief breakdown of each of the options below. Any options with N/A is not supported in that server.

Miscellaneous

Variable MariaDB Default MySQL Default
default_tmp_storage_engine NULL InnoDB
group_concat_max_len 1048576(1M) 1024(1K)
Lock_wait_timeout 86400 (1 DAY) 31536000 (1 YEAR)
Max_allowed_packet (16777216) 16MB 4194304 (4MB)
Max_write_lock_count 4294967295 18446744073709551615
Old_passwords OFF 0
Open_files_limit 0 dependent on OS
pid_file /var/lib/mysql/ /var/run/mysqld/
secure_file_priv Varies by installation
sort_buffer_size 2097152 262144
table_definition_cache 400 autosized
table_open_cache_instances 8 16
thread_cache_size autosized autosized
thread_stack 292KB 192KB/256KB

 

There are many variables that do not fit well into a group. I will go over those here.

When creating temporary tables, if you do not specify a storage engine then a default is used. In MySQL 5.7 this is set to InnoDB, the same as the default_storage_engine. MariaDB 10.2 also uses InnoDB, but it is not explicitly set. MariaDB sets the default_tmp_storage_engine to NULL, which causes it to use the default_storage_engine. This is important to remember if you change your default storage engine, as it would also change the default for temporary tables. An Important note, in MariaDB this is only relevant to tables created with “CREATE TEMPORARY TABLE”. Internal in-memory temporary tables use the memory storage engine, and internal, on-disk temporary tables use the aria engine by default.

The Group_concat function can cause some very large results if left unchecked. You can restrict the maximum size of results from this function with group_concat_max_len. MySQL 5.7 limits this to 1024(1K). MariaDB increased the value in 10.2.4 up to 1048576(1M).

Lock_wait_timeout controls how long a thread waits as it attempts to acquire a metadata lock. Several statements require a metadata lock, including DDL and DML operations, Lock Tables, Flush Tables with Read Lock and Handler statements. MySQL 5.7 defaults to the maximum possible value (one year), while MariaDB 10.2 has toned this down to one day.

Max_allowed_packet sets a limit to the maximum size of a packet, or a generated/intermediate string. This value is intentionally kept small (4MB) on MySQL 5.7 in order to detect the larger, intentionally incorrect packets. MariaDB has increased this value to 16MB. If using any large BLOB fields, you need to adjust this value to the size of the largest BLOB, in multiples of 1024, or you risk running into errors transferring the results.

Max_write_lock_count controls the number of write locks that can be given before some read lock requests being processed. In extremely heavy write loads your reads can pile up while waiting for the writes to complete. Modifying the max_write_lock_count allows you to tune how many writes can occur before some reads are allowed against the table. MySQL 5.7 keeps this value at the maximum (18446744073709551615), while MariaDB 10.2 lowered this to 4294967295. One thing to note is that this is still the maximum value on MariaDB 10.2.

Old_passwords controls the hashing method used by the password function, create user and grant statements. This variable has undergone several changes in MySQL 5.7. As of 5.7.4 the valid options were MySQL 4.1 native hashing, Pre-4.1 (old) hashing, and SHA-256 hashing. Version 5.7.5 removed the “old” Pre-4.1 method, and in 5.7.6 the variable has been deprecated with the intent of removing it entirely. MariaDB 10.2 uses a simple boolean value for this variable instead of the enumerated one in MySQL 5.7, though the intent is the same. Both default the old_passwords to OFF, or 0, and allow you to enable the older method if necessary.

Open_files_limit restricts the number of file descriptors mysqld can reserve. If set to 0 (the default in MariaDB 10.2) then mysqld reserves max_connections * 5 or max_connections + table_open_cache * 2, whichever is larger. It should be noted that mysqld cannot use an amount larger than the hard limit imposed by the operating system. MySQL 5.7 is also restricted by the operating systems hard limit, but is set at runtime to the real value permitted by the system (not a calculated value).

The pid_file allows you to control where you store the process id file. This isn’t a file you typically need, but it is good to know where it is located in case some unusual errors occur. On MariaDB you can find this inside /var/lib/mysql/, while on MySQL 5.7 you will find it inside /var/run/mysqld/. You will also notice a difference in the actual name of the file. MariaDB 10.2 uses the hostname as the name of the pid, while MySQL 5.7 simply uses the process name (mysqld.pid).

Secure_file_priv is a security feature that allows you to restrict the location of files used in data import and export operations. When this variable is empty, which was the default in MySQL before 5.7.6, there is no restriction. If the value is set to NULL, import and export operations are not permitted. The only other valid value is the directory path where files can be imported from or exported to. MariaDB 10.2 defaults to empty. As of MySQL 5.7.6, the default will depend on the install_layout CMAKE option.

INSTALL_LAYOUT DEFAULT VALUE
STANDALONE,WIN NULL(>=MySQL 5.7.16_,empty(<MySQL 5.7.16)
DEB,RPM,SLES,SVR4 /var/lib/mysql-files
Other Mysql-files under the CMAKE_INSTALL_PREFIX value

 

Mysqld uses a sort buffer regardless of storage engine. Every session that must perform a sort allocates a buffer equal to the value of sort_buffer_size. This buffer should at minimum be large enough to contain 15 tuples. In MySQL 5.7, this defaults to 262144, while MariaDB 10.2 uses the larger value 2097152.

The table_definition_cache restricts the number of table definitions that can be cached. If you have a large number of tables, mysqld may have to read the .frm file to get this information. MySQL 5.7 auto detects the appropriate size to use, while MariaDB 10.2 defaults this value to 400. On my small test VM, MySQL 5.7 chose a value of 1400.

The table_open_cache_instances vary in implementation between MySQL and MariaDB. MySQL 5.7 creates multiple instances of the table_open_cache, each holding a portion of the tables. This helps reduce contention, as a session needs to lock only one instance of the cache for DML statements. In MySQL 5.7.7 the default was a single instance, but this was changed in MySQL 5.7.8 (increased to 16). MariaDB has a more dynamic approach to the table_open_cache. Initially there is only a single instance of the cache, and the table_open_cache_instances variable is the maximum number of instances that can be created. If contention is detected on the single cache, another instance is created and an error logged. MariaDB 10.2 suspects that the maximum eight instances it sets by default should support up to 100 CPU cores.

The thread_cache_size controls when a new thread is created. When a client disconnects the thread is stored in the cache, as long as the maximum number of threads do not exist. Although this is not typically noticeable, if your server sees hundreds of connections per second you should increase this value to so that new connections can use the cache. Thread_cache_size is an automatically detected variable in both MySQL 5.7 and MariaDB 10.2, but their methods to calculate the default vary significantly. MySQL uses a formula, with a maximum of 100: 8+ (max_connections / 100). MariaDB 10.2 uses the smaller value out of 256 or the max_connections size.

The thread_stack is the stack size for each thread. If the stack size is too small, it limits the complexity of SQL statements, the recursion depth of stored procedures and other memory-consuming actions. MySQL 5.7 defaults the stack size to 192KB on 32-bit platforms and 256KB on 64-bit systems. MariaDB 10.2 adjusted this value several times. MariaDB 10.2.0 used 290KB, 10.2.1 used 291KB and 10.2.5 used 292KB.

Conclusion

Hopefully, this helps you with the configurations options between MySQL and MariaDB. Use the comments for any questions.

Oct
11
2016
--

MySQL 8.0: The end of MyISAM

MySQL 8.0 General Tablespaces

MyISAMThis blog discusses the gradual end of MyISAM in MySQL.

The story that started 20 years ago is coming to its end. I’m talking about the old MyISAM storage engine that was the only storage provided by MySQL in 1995, and was available in MySQL for 20+ years. Actually, part of my job as a MySQL consultant for 10+ years was to discover MyISAM tables and advise customers how to convert those to InnoDB.

(Check your MySQL installation, you may still have MyISAM tables).

MySQL 5.7 still used MyISAM storage for the system tables in the MySQL schema.

In MySQL 8.0 (DMR version as of writing), the MyISAM storage engine is still available. But in a very limited scope:

  • After introducing the new data dictionary, the MyISAM tables are gone from the system schema (“mysql” db).
  • Working with MyISAM is harder now (and discouraged): you can’t just copy MyISAM tables into a running MySQL server, they will not be discovered (unlike InnoDB, where you can use “ALTER TABLE … IMPORT TABLESPACE”)
  • However, you can create a table engine=MyISAM, and it will work as before

InnoDB implemented all the older, missing features:

Feature MyISAM InnoDB
Full Text Indexes yes Since MySQL 5.6
Portable tables (tablespaces) yes Since MySQL 5.6
Spatial Indexes/RTREE (GIS) yes Since MySQL 5.7
Last update for table yes Since MySQL 5.7

(http://dev.mysql.com/worklog/task/?id=6658)

Suitable for temp tables yes Since MySQL 5.7

Also complex selects uses InnoDB ondisk temp tables

Faster count(*) yes *Faster in MySQL 5.7 but does not store counter

 

So the only MyISAM advantages left are:

  1. Tables will be smaller on disk compared to uncompressed InnoDB tables.
  2. The count(*) is still much faster in MyISAM:
mysql> select count(*) from a_myisam;
+----------+
| count(*) |
+----------+
|  6291456 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from a_innodb;
+----------+
| count(*) |
+----------+
|  6291456 |
+----------+
1 row in set (2.16 sec)

I would not use MyISAM unless there is a specific case, and for well-known reasons (MyISAM are non-transactional, table level locks, with no crash recovery, etc.)

My colleague Laurynas Biveinis also suggested converting MyISAM to an optional storage engine plugin.

Apr
28
2015
--

Test your knowledge: Percona XtraDB Cluster (PXC) quiz

Test your knowledge: Percona XtraDB Cluster (PXC)I often talk with people who are very interested in the features of Percona XtraDB Cluster (PXC) such as synchronous and parallel replication, multi-node writing and high availability. However some get confused when operating a real PXC cluster because they do not fully realize the implications of these features. So here is a fun way to test your PXC knowledge: try to solve these 12 questions related to PXC! (you will find the answers at the end of the post.)

Workload

1. With Galera 3.x, support for MyISAM is experimental. When can we expect to have full MyISAM support?
a. This will never happen as Galera is designed for transactional storage engines.
b. This is planned for Galera 4.0.

2. Why aren’t all workloads a good fit for PXC?
a. Execution plans can change compared to a regular MySQL server, so performance is sometimes not as good as with a regular MySQL server.
b. Large transactions and write hotspots can create performance issues with Galera.

3. For workloads with a write hot spot, writing on all nodes to distribute the load is a good way to solve the issue.
a. True
b. False

4. Optimistic locking is used in a PXC cluster. What does it mean?
a. When a transaction starts on a node, locks are only set on this node but never on the remote nodes.
b. When a transaction starts on a node, locks are only set on the remote nodes but never on the local node.
c. Write conflict detection is built-in, so there is no need to set locks at all.

Replication

5. Galera implements virtually synchronous replication. What does it mean?
a. A transaction is first committed locally, and then it is committed on all remote nodes at the same exact point in time.
b. Transactions are replicated synchronously, but they are applied asynchronously on remote nodes.
c. Replication is actually asynchronous, but as it is faster than MySQL replication, so marketing decided to name it ‘virtually synchronous’.

6. When the receive queue of a node exceeds a threshold, the node sends flow control messages. What is the goal of these flow control messages?
a. They instruct the other nodes that they must pause processing writes for some time, to allow the slow node to catch up.
b. The other nodes trigger an election and if they have quorum they will evict the slow node.
c. The messages can be used by monitoring systems to detect a slow node, but they have no effect.

7. When you change the state of a node to Donor/Desynced, what happens?
a. The node stops receiving writes from the other nodes.
b. The node intentionally replicates writes at a slower pace, this is roughly equivalent to a delayed replica when using MySQL replication.
c. The node keeps working as usual, but it will not send flow control messages if its receive queue becomes large.

High Availability

8. You should always use an odd number of nodes, because with an even number (say 4 or 6), the failure of one node will create a split-brain situation.
a. True
b. False

9. With a 3-node cluster, what happens if you gracefully stop 2 nodes?
a. The remaining node can process queries normally.
b. The remaining node is up but it stops processing queries as it does not have quorum.

Operations

10. If a node has been stopped for less than 5 minutes, it will always perform an IST.
a. True: SST is only performed after a node crash, never after a regular shutdown.
b. False: it depends on the gcache size.

11. Even with datasets under 5GB, the preferred SST method is xtrabackup-v2 not mysqldump.
a. True
b. False

12. Migration from a master-slave setup to a PXC cluster always involves a downtime to dump and reload the database.
a. True, because MySQL replication and Galera replication are incompatible.
b. False, one node of the PXC cluster can be set up as an asynchronous replica of the old master.

Solutions

1. a      2. b      3. b
4. a      5. b      6. a
7. c      8. b      9. a
10. b    11. a    12. b

The post Test your knowledge: Percona XtraDB Cluster (PXC) quiz appeared first on MySQL Performance Blog.

Oct
23
2014
--

MySQL 5.6 Full Text Search Throwdown: Webinar Q&A

MySQL 5.6 Full Text Search Throwdown: Webinar Q&AYesterday (Oct. 22) I gave a presentation titled “MySQL 5.6 Full Text Search Throwdown.” If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions. I answered as many as we had time for during the session, but here are all the questions with my complete answers:

Q: Does Solr automatically maintain its index against MySQL? Do you have to hit the Solr server with a specific query to keep the index ‘warm’?

There are several strategies for updating a Solr index. In my examples, I showed only a “full import” which is what you would do to create an index by reading all the source data.

You can also perform a “delta import” periodically, to add a subset of the source data to an existing index, for example to add data that has changed since the last time you updated the Solr index. See the documentation for Using delta-import command and also Using query attribute for both full and delta import.

The delta import would typically be something you would invoke from a cron job, perhaps every hour. But that means that a Solr search might not find data that has changed in MySQL more recently than the last delta import. Depending on the application, a delay of up to 60 minutes might be acceptable, or else maybe you have strict requirements that all data must be in sync instantly.

You could also update the Solr index one document at a time using its Java API or web service API. This would require you to write code in your application. Every time you INSERT or UPDATE or DELETE a document in MySQL that you want to be kept in sync with the Solr index, you would write more code to do a similar operation in the Solr index. That way every single text change would be searchable nearly immediately.

Q: Did you test Elasticsearch? (several people asked about this)

I did not test Elasticsearch, but according to their technology overview: “Elasticsearch uses Lucene under the covers.” So I expect that this part of Elasticsearch performs similarly to what I saw from Apache Solr, which also uses Lucene internally.

Q: One question I could not understand, how to maintain Sphinx index in sync with data? Can be it in real time?

The Sphinx Search index does not automatically refresh as your MySQL data changes. You would have to write application code to invoke the indexing process. There’s a page in the Sphinx Search documentation about Live Index Updates, that gives an overview of the two methods, and links to further reading.

This is definitely the most inconvenient aspect of Sphinx Search. Queries are very fast, but it’s expensive to do incremental updates to an index. So it’s ideal for indexing an archive of text that doesn’t change very frequently, but not as easy to use it for indexing rapidly-changing content.

Q: I have over 800,000 PDF documents to index (in several languages), any recommendations?

I said during the webinar that I recalled there exists tools to extract searchable text from a PDF file. I found one such project called Apache PDFBox includes this capability, and they have a page describing a helper class for doing PDF parsing and extraction combined with Lucene indexing. I haven’t used it myself, so I can’t comment on its performance for indexing 800,000 PDF documents, but it seems like you could write a Java program to iterate over your collection of PDF’s, and index them using this class.

Q: What is your suggestion to use Sphinx Search for single column searches?

You can use any SQL query in the sphinx.conf to define the source data to index. You can select one column, multiple columns, or even multiple columns from joined tables. The result from any SQL query you write can be used as the data source.

Q: Which modules did you use with Sphinx Search? Did you use its built-in stemmers and metaphone package, etc.?

I installed the default modules. I don’t know if there is a significant performance difference from using optional packages.

Q: What about quality of results from each solution? I remember reading an article on percona.com several months ago comparing MyISAM fulltext vs InnoDB fulltext, and there were concerns about the results from InnoDB. Did you do any testing on this?

Indeed, here’s a link to the excellent blog post by my colleague Ernie Souhrada in which he found some surprises in the results from InnoDB FTS: InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

I was just doing some comparison for performance in the current MySQL 5.7 milestone. I didn’t compare the query results this time.

Q: Is there any full text search in Percona Server with XtraDB?

Percona Server is based on the upstream MySQL Community Edition of the respective version number. So Percona Server has the builtin FULLTEXT index types for MyISAM and InnoDB, and we have not changed this part of the code. Percona Server does not bundle Sphinx Search, but it’s not too difficult to install Sphinx Search as a complementary technology, just as you would install other packages that are commonly used parts of an application infrastructure, for example Memcached or HA-proxy.

Q: Is MySQL going to improve the built-in InnoDB FTS in the near future?

They are continuing to add features that improve FTS, for example:

  • You can now write your own plugins for fulltext parsing (that is, parsing the input data to identify “words” to index; you may have your own idea about how to split text into words).
  • Both B-tree and full-text types now uses bulk-loading to make it faster and more efficient to build the index.

I’m not aware of any work to improve the performance of fulltext queries significantly.

Q: What is the performance comparison between MyISAM and InnoDB for inline index updating?

I didn’t test performance of incremental index updates this time. I only populated my tables from the StackOverflow data using LOAD XML, and then I created fulltext indexes on the populated tables. But I generally favor moving all important data to InnoDB, and not using MyISAM tables. It’s hard to imagine that the performance of index updates would be so much better that would convince me to use MyISAM. It’s more likely that the accuracy of search results would be a good reason to use MyISAM. Even then, I’d keep the original data in InnoDB and use MyISAM only as a copy of the data, to create a disposable fulltext index.

Thanks again for attending my webinar! For more great content, please join Percona and the MySQL community at our conference events. The next one is Percona Live London 2014 on November 3-4. We also look forward to the Open Stack Live 2015 in Santa Clara, California April 13-14, in the same venue with Percona Live MySQL Conference and Expo 2015, April 13-16.

Also watch more webinars from Percona in the future!

The post MySQL 5.6 Full Text Search Throwdown: Webinar Q&A appeared first on MySQL Performance Blog.

Sep
25
2014
--

More then 1000 columns – get transactional with TokuDB

Recently I encountered a specific situation in which a customer was forced to stay with the MyISAM engine due to a legacy application using tables with over 1000 columns. Unfortunately InnoDB has a limit at this point. I did not expect to hear this argument for MyISAM. It is usually about full text search or spatial indexes functionality that were missing in InnoDB, and which were introduced in MySQL 5.6 and 5.7, respectively, to let people forget about MyISAM. In this case though, InnoDB still could not be used, so I gave the TokuDB a try.

I’ve created a simple bash script to generate a SQL file with CREATE TABLE statement with the number of columns I desired and then tried to load this using different storage engines. Bit surprisingly, InnoDB failed with column count above 1017, so little more then documented maximum of 1000:

mysql> source /home/vagrant/multicol2.sql
ERROR 1117 (HY000): Too many columns

MyISAM let me to create maximum 2410 columns and I could achieve the same result for the TokuDB table! Tried with tinyint or char(10) datatype, same maximum cap applied, not quite sure why it’s exactly 2410 though.

mysql> SELECT tables.TABLE_NAME,count(*) columns,engine,row_format FROM information_schema.columns JOIN information_schema.tables USING (TABLE_NAME) where TABLE_NAME like "multicol%" group by TABLE_NAME;
+-----------------+---------+--------+-------------+
| TABLE_NAME      | columns | engine | row_format  |
+-----------------+---------+--------+-------------+
| multicol_innodb |    1017 | InnoDB | Compact     |
| multicol_myisam |    2410 | MyISAM | Fixed       |
| multicol_toku   |    2410 | TokuDB | tokudb_zlib |
+-----------------+---------+--------+-------------+
3 rows in set (0.31 sec)

So if you have that rare kind of table schema with that many columns and you wish to be able to use a transaction storage engine, you may go with TokuDB, available also with recent Percona Server 5.6 versions.

You can find more details about column number limits in MySQL in this post, “Understanding the maximum number of columns in a MySQL table.”

The post More then 1000 columns – get transactional with TokuDB appeared first on Percona Performance Blog.

Feb
24
2014
--

The MySQL ARCHIVE storage engine – Alternatives

In my previous post I pointed out that the existing ARCHIVE storage engine in MySQL may not be the one that will satisfy your needs when it comes to effectively storing large and/or old data. But are there any good alternatives? As the primary purpose of this engine is to store rarely accessed data in disk space efficient way, I will focus here on data compression abilities rather then on performance.

The InnoDB engine provides compressed row format, but is it’s efficiency even close to the one from that available in archive engine? You can also compress MyISAM tables by using myisampack tool, but that also means a table will be read only after such operation.

Moreover, I don’t trust MyISAM nor Archive when it comes to data durability. Fortunately along came a quite new (open source since April 2013) player into this field – TokuDB! It seems to provide an excellent compression ratios, but also it’s fully ACID compliant, and does not have any of the limitations present in Archive, so it’s functionality is much more like InnoDB! This may allow you also to store production data on SSD drives, which disk space cost is still higher then on traditional disks, where otherwise it could be too expensive.

To better illustrate what choice do we have, I made some very simple disk savings comparison of all the mentioned variants.
I have used an example table with some scientific data fetched from here (no indexes):

CREATE TABLE `table1` (
  `snp_id` int(11) DEFAULT NULL,
  `contig_acc` varchar(32) DEFAULT NULL,
  `contig_ver` tinyint(4) DEFAULT NULL,
  `asn_from` int(11) DEFAULT NULL,
  `asn_to` int(11) DEFAULT NULL,
  `locus_id` int(11) DEFAULT NULL,
  `locus_symbol` varchar(128) DEFAULT NULL,
  `mrna_acc` varchar(128) DEFAULT NULL,
  `mrna_ver` int(11) DEFAULT NULL,
  `protein_acc` varchar(128) DEFAULT NULL,
  `protein_ver` int(11) DEFAULT NULL,
  `fxn_class` int(11) DEFAULT NULL,
  `reading_frame` int(11) DEFAULT NULL,
  `allele` text,
  `residue` text,
  `aa_position` int(11) DEFAULT NULL,
  `build_id` varchar(4) NOT NULL,
  `ctg_id` int(11) DEFAULT NULL,
  `mrna_start` int(11) DEFAULT NULL,
  `mrna_stop` int(11) DEFAULT NULL,
  `codon` text,
  `protRes` char(3) DEFAULT NULL,
  `contig_gi` int(11) DEFAULT NULL,
  `mrna_gi` int(11) DEFAULT NULL,
  `mrna_orien` tinyint(4) DEFAULT NULL,
  `cp_mrna_ver` int(11) DEFAULT NULL,
  `cp_mrna_gi` int(11) DEFAULT NULL,
  `verComp` varchar(7) NOT NULL
)

ARCHIVE storage engine

mysql >show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: ARCHIVE
        Version: 10
     Row_format: Compressed
           Rows: 19829016
 Avg_row_length: 11
    Data_length: 221158267
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: NULL
    Update_time: 2013-12-22 23:58:51
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.28 sec)
-rw-rw----. 1 przemek przemek 211M Dec 22 23:58 table1.ARZ

TokuDB engine, default compression

mysql >show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 19829016
 Avg_row_length: 127
    Data_length: 2518948412
Max_data_length: 9223372036854775807
   Index_length: 0
      Data_free: 6615040
 Auto_increment: NULL
    Create_time: 2013-12-23 00:03:47
    Update_time: 2013-12-23 00:12:14
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.13 sec)
-rwxrwx--x. 1 przemek przemek 284M Dec 23 00:12 _b_tokudb_table1_main_32_1_18_B_0.tokudb

TokuDB engine, highest compression

mysql [localhost] {msandbox} (b_tokudb) > show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_lzma
           Rows: 19829016
 Avg_row_length: 127
    Data_length: 2518948412
Max_data_length: 9223372036854775807
   Index_length: 0
      Data_free: 6950912
 Auto_increment: NULL
    Create_time: 2013-12-23 00:43:47
    Update_time: 2013-12-23 00:49:14
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: row_format=TOKUDB_LZMA
        Comment:
1 row in set (0.01 sec)
-rwxrwx--x. 1 przemek przemek 208M Dec 23 00:49 _b_tokudb_sql_980_2_main_1b92_2_18.tokudb

(btw, did you notice how the file name changed after altering with different compression?
It’s no longer reflecting the real table name, so quite confusing :( )

InnoDB engine, uncompressed

mysql > show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 19898159
 Avg_row_length: 117
    Data_length: 2343567360
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2014-01-01 16:47:03
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.42 sec)
-rw-rw----. 1 przemek przemek 2.3G Jan  1 16:37 table1.ibd

InnoDB engine, compressed with default page size (8kB)

mysql > show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: InnoDB
        Version: 10
     Row_format: Compressed
           Rows: 19737546
 Avg_row_length: 59
    Data_length: 1171783680
Max_data_length: 0
   Index_length: 0
      Data_free: 5767168
 Auto_increment: NULL
    Create_time: 2014-01-01 18:51:22
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: row_format=COMPRESSED
        Comment:
1 row in set (0.31 sec)
-rw-rw----. 1 przemek przemek 1.2G Jan  1 18:51 table1.ibd

InnoDB engine, compressed with 4kB page size

mysql > show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: InnoDB
        Version: 10
     Row_format: Compressed
           Rows: 19724692
 Avg_row_length: 30
    Data_length: 592445440
Max_data_length: 0
   Index_length: 0
      Data_free: 3932160
 Auto_increment: NULL
    Create_time: 2014-01-01 19:41:12
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: row_format=COMPRESSED KEY_BLOCK_SIZE=4
        Comment:
1 row in set (0.03 sec)
-rw-rw----. 1 przemek przemek 584M Jan  1 19:41 table1.ibd

MyISAM engine, uncompressed

mysql > show table status like 'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: MyISAM
        Version: 10
     Row_format: Dynamic
           Rows: 19829016
 Avg_row_length: 95
    Data_length: 1898246492
Max_data_length: 281474976710655
   Index_length: 1024
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2013-12-23 11:02:28
    Update_time: 2013-12-23 11:03:45
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.01 sec)
-rw-rw----. 1 przemek przemek 1.8G Dec 23 11:03 table1.MYD

MyISAM engine, compressed (myisampack)

mysql > show table status like  'table1'\G
*************************** 1. row ***************************
           Name: table1
         Engine: MyISAM
        Version: 10
     Row_format: Compressed
           Rows: 19829016
 Avg_row_length: 42
    Data_length: 848098828
Max_data_length: 281474976710655
   Index_length: 1024
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2013-12-23 11:02:28
    Update_time: 2013-12-23 11:03:45
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: 853535317
 Create_options:
        Comment:
1 row in set (0.00 sec)
-rw-rw----. 1 przemek przemek 809M Dec 23 11:03 table1.MYD

Compression summary table

Engine Compression Table size [MB]
InnoDB  none  2272
InnoDB  KEY_BLOCK_SIZE=8  1144
InnoDB  KEY_BLOCK_SIZE=4  584
MyISAM  none  1810
MyISAM  compressed with myisampack  809
Archive  default  211
TokuDB  ZLIB  284
TokuDB  LZMA  208

So the clear winner is TokuDB, leaving InnoDB far behind. But this is just one test – the results may be very different for your specific data.

To get even better idea, let’s compare several crucial features available in mentioned storage engines

Feature Archive MyISAM (compressed) InnoDB TokuDB
DML only INSERTs no yes yes
Transactions no no yes yes
ACID no no yes yes
Indexes no yes yes yes
Online DDL no no yes * yes **

* – since version 5.6, with some limitations
** – supports add/drop indexes, add/drop/rename columns and expand int, char, varchar and varbinary data types

Summary

TokuDB seems to be an excellent alternative when it comes to disk space usage efficiency, but this is not the only reason why you should try it perhaps.
You may want to check these articles too:

The post The MySQL ARCHIVE storage engine – Alternatives appeared first on MySQL Performance Blog.

May
22
2013
--

MySQL and the SSB – Part 2 – MyISAM vs InnoDB low concurrency

This blog post is part two in what is now a continuing series on the Star Schema Benchmark.

In my previous blog post I compared MySQL 5.5.30 to MySQL 5.6.10, both with default settings using only the InnoDB storage engine.  In my testing I discovered that innodb_old_blocks_time had an effect on performance of the benchmark.  There was some discussion in the comments and I promised to follow up with more SSB tests at a later date.

I also promised more low concurrency SSB tests when Peter blogged about the importance of performance at low concurrency.

The SSB
The SSB tests a database’s ability to optimize queries for a star schema. A star schema presents some unique challenge to the database optimizer. The SSB benchmark consists of four sets of queries. Each set is known as a “flight”. I have labeled each query as Q{FLIGHT_NUMBER}.{QUERY_NUMBER}. In general, each flight examines different time periods or different regions. The flights represent the type of investigations and drill-downs that are common in OLAP analysis.

Each query in each flight (Q1.1 for example) is tested with a cold buffer pool. Then the query is tested again without restarting the database. The first test is described as the cold test, and the second as the hot test. The database software is restarted after the hot test. All OS caches are dropped at this time as well.

These set of queries were tested on the SSB at SCALE FACTOR: 20. This means there is approximately 12GB of data in the largest table.

You can find the individual SSB query definitions in my previous blog post.

Test environment
These tests were done on a relatively fast machine with a Xeon E5-2680 (8 cores, 16 threads) with fast IO (OCZ R4 1.6TB) and 128GB memory. For the hot test, the data fits in the buffer pool and has been loaded by the cold test already. The buffer pool and adaptive hash index are cold for the cold test. All tests were done with no concurrency. The hardware for this test was provided by Adotomi. I will be blogging about raw performance of the OCZ card in another post.

Also, while it is labeled on the graphs, it is important to note that in all cases, lower times are better.

SSB Flight #1
Here you will see the start of an interesting trend. MyISAM is faster when the data is not cached (the cold run) but is slower in the hot (cached) run. I did some investigation during the testing and found that InnoDB does more IO than MyISAM when the database is cold, but uses less CPU time when the database is hot. I am only speculating (and I can investigate further), but I believe the adaptive hash index is improving performance of InnoDB significantly during the hot run, as hash indexes are faster than a b-tree index. Also accessing pages from the buffer pool should be faster than getting them from the OS cache, which is another advantage of InnoDB.

 

 

 


image009
 

 

 

 

 

 


image001
 

 

 

SSB Flight #2
Flight #2 is similar to Flight #1. MyISAM is faster than InnoDB when the database is cold, but the opposite is true when the database is hot.

 

 


image012
 

 

 

 

 

 


image003
 

 

 

 

SSB Flight #3
Here in some cases MyISAM is substantially faster than InnoDB both cold and hot.

 

 


image014
 

 

 

 

 

 


image005
 

 

 

 

SSB Flight #4
There is one query in this flight, Q4.3, which is faster using MyISAM than InnoDB. Like the queries in Flight #3 that are faster using MyISAM, Q4.3 examines very little data. It seems that InnoDB performs better when a larger number of rows must be joined together (Q4.1, Q4.2) but worse when small amounts of data are examined.

 

 


image016
 

 

 

 

 

 


image007
 

 

 

Conclusion

In some cases MyISAM is faster than InnoDB, but usually only when the buffer pool is cold. Please don’t take away that you should be using MyISAM for everything!. MyISAM may be good for raw performance, but there are limitations which MyISAM imposes that are difficult to work with.  MyISAM does not maintain checksum consistency during regular operations and is not ACID compliant. MyISAM and InnoDB may perform differently under concurrency, which this benchmark does not cover. I will make a follow-up post about concurrency in another blog post in this series. Regardless, when the working set fits in memory, InnoDB almost always performs better, at least for this workload.

Notes

MySQL version used: 5.6.11, custom compiled to remove performance_schema

For the InnoDB tests, a 64GB buffer pool was used. O_DIRECT was used so, there was no caching of data at the filesystem level. The InnoDB indexes were built using ALTER TABLE fast index creation (merge sort).

For the MyISAM tests I used a 10GB key buffer. I used ALTER TABLE DISABLE KEYS and built the keys with sort via ALTER TABLE ENABLE KEYS.
my.cnf

[mysqld]
datadir=/mnt/mysql56/data
basedir=/usr/local/mysql
socket=/var/lib/mysql/mysql.sock
user=justin
innodb_buffer_pool_size=64G
innodb_log_file_size=4G
innodb_file_per_table
innodb_stats_on_metadata=off
innodb_file_format=barracuda
innodb_log_buffer_size=32M
innodb_buffer_pool_instances=16
metadata_locks_hash_instances=32
table_open_cache_instances=8
sort_buffer_size=128k
read_rnd_buffer_size=8M
join_buffer_size=8M
default_tmp_storage_engine=myisam
tmpdir=/dev/shm
innodb_undo_logs=32
innodb_old_blocks_time=0
table_open_cache=2048
table_definition_cache=16384
innodb_flush_method=O_DIRECT
key_buffer_size=10G
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
innodb_stats_persistent
innodb_stats_auto_update=off
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

The post MySQL and the SSB – Part 2 – MyISAM vs InnoDB low concurrency appeared first on MySQL Performance Blog.

Mar
04
2013
--

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

This is part 2 in a 3 part series. In part 1, we took a quick look at some initial configuration of InnoDB full-text search and discovered a little bit of quirky behavior; here, we are going to run some queries and compare the result sets. Our hope is that the one of two things will happen; either the results returned from a MyISAM FTS query will be exactly identical to the same query when performed against InnoDB data, OR that the results returned by InnoDB FTS will somehow be “better” (as much as it’s actually possible to do this in a single blog post) than what MyISAM gives us.

Recall that we have two different sets of data, one which is the text of roughly 8000 SEO-stuffed webpage bodies (we call that one SEO) and the other, which we call DIR, that is roughly 800,000 directory records with name, address, and the like. We are using MySQL 5.5.30 and MySQL 5.6.10 with no configuration tuning other than to set innodb_ft_min_token_size to 4 (rather than the default of 3) so that it matches MyISAM’s default ft_min_word_length.

First, MyISAM, with MySQL 5.5, on the SEO data set:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_myisam ORDER BY 3 DESC LIMIT 5;
+------+-----------------------------------------------------------------------+--------------------+
| id   | title                                                                 | score              |
+------+-----------------------------------------------------------------------+--------------------+
|  458 | Free Business Forms for Starting or Maintaining a Successful Business | 1.3383517265319824 |
| 7112 | Download Idaho Tax Forms for Individual or Business Needs             | 0.9273209571838379 |
| 7113 | Illinois Tax Forms for Individuals and Business                       | 0.8827990889549255 |
| 7121 | Massachusetts Tax Forms                                               | 0.8678107261657715 |
|  454 | Business Accounting and Invoice Forms                                 | 0.8668115139007568 |
+------+-----------------------------------------------------------------------+--------------------+

The same query, run against InnoDB on 5.6.10:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_innodb ORDER BY 3 DESC LIMIT 5;
+----+------------------------------------------------------------------------+-------------------+
| id | title                                                                  | score             |
+----+------------------------------------------------------------------------+-------------------+
| 48 | California Free Public Records, Criminal Records And Background Checks | 21.23662567138672 |
| 66 | Michigan Free Public Records, Criminal Records And Background checks   | 5.363710880279541 |
| 44 | Alabama Free Public Records, Criminal Records And Background Checks    | 5.310127258300781 |
| 57 | Illinois Free Public Records, Criminal Records And Background Checks   | 4.569097518920898 |
| 70 | Montana Free Public Records, Criminal Records And Background Checks    | 4.516233444213867 |
+----+------------------------------------------------------------------------+-------------------+

Wow. I’m not sure if I should be concerned so much that the *scores* are different, but the *matches* are COMPLETELY DIFFERENT between 5.5/MyISAM and 5.6/InnoDB. Now, we know that MyISAM FTS does have the caveat with natural language searches whereby a word that’s present in 50% or more of the rows is treated as a stopword, so does that account for our problem? It might, because the word ‘arizona’ appears in over 6900 of the 7150 rows, and the word ‘records’ appears in 7082 of them. So let’s try something else that’s less likely to have that issue. The word “corporation” appears in 143 of the documents; the word “forms” appears in 439 of them, and the word “commission” appears in 130. There might be some overlap here, but even if there isn’t, 143+130+439 < 0.5 * 7150, so none of these should be treated as stopwords in MyISAM.

With 5.5:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_myisam ORDER BY 3 DESC LIMIT 5;
+------+--------------------+--------------------+
| id   | title              | score              |
+------+--------------------+--------------------+
| 7132 | New York Tax Forms |  7.821961879730225 |
| 7127 | Nebraska Tax Forms |  6.882259845733643 |
| 7123 | Free Forms         |  6.748578071594238 |
| 7126 | Montana Tax Forms  | 6.4749345779418945 |
| 7119 | Maine Tax Forms    |  6.400341510772705 |
+------+--------------------+--------------------+

With 5.6:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_innodb ORDER BY 3 DESC LIMIT 5;
+-----+--------------------------------------------------------------------------+--------------------+
| id  | title                                                                    | score              |
+-----+--------------------------------------------------------------------------+--------------------+
|  79 | Ohio Free Public Records, Criminal Records And Background Checks         |  51.76125717163086 |
|  78 | Free North Dakota Public Records, Criminal Records And Background Checks |  30.32522201538086 |
|  19 | Free Public Printable Forms For All Industries                           | 27.557558059692383 |
| 408 | Free Colorado DMV Resources and Driving Records                          | 24.933029174804688 |
|  48 | California Free Public Records, Criminal Records And Background Checks   | 24.224336624145508 |
+-----+--------------------------------------------------------------------------+--------------------+

OK, now I’m starting to get a little worried. The docs do tell us that the default stopword list is substantially different between InnoDB and MyISAM, and as it turns out, there are only 36 stopwords in the default InnoDB list, but there are 543 stopwords in the default MyISAM list. What happens if we take the MyISAM stopwords, insert them into a table, and configure that table to be our stopword list for InnoDB?

This is the table that we’re trying to emulate:

mysql: SHOW CREATE TABLE information_schema.innodb_ft_default_stopword\G
*************************** 1. row ***************************
       Table: INNODB_FT_DEFAULT_STOPWORD
Create Table: CREATE TEMPORARY TABLE `INNODB_FT_DEFAULT_STOPWORD` (
  `value` varchar(18) NOT NULL DEFAULT ''
) ENGINE=MEMORY DEFAULT CHARSET=utf8

The docs tell us that we need to create an *InnoDB* table with a single VARCHAR column named “value”. OK, sounds easy enough:

mysql: SHOW CREATE TABLE innodb_myisam_stopword\G
*************************** 1. row ***************************
       Table: innodb_myisam_stopword
Create Table: CREATE TABLE `innodb_myisam_stopword` (
  `value` varchar(18) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8

But, when we try to use this table, here’s what comes back:

mysql: SET GLOBAL innodb_ft_server_stopword_table='test/innodb_myisam_stopword';
ERROR 1231 (42000): Variable 'innodb_ft_server_stopword_table' can't be set to the value of 'test/innodb_myisam_stopword'

And here’s what appeared in the server’s error log:

InnoDB: invalid column type for stopword table test/innodb_myisam_stopword. Its first column must be of varchar type

Uh… Does this mean that my next blog post should be entitled, “When is a VARCHAR Not Really a VARCHAR?” Thinking that maybe this was a case of GEN_CLUST_INDEX causing me issues, I tried adding a second column to the table which was an integer PK, and in another attempt, I tried just making the “value” column the PK, but neither of those worked. Also, trying to set innodb_ft_user_stopword_table produced the same error. I submitted a bug report (68450), and as you can see from the bug discussion, it turns out that this table is character-set-sensitive. If you’re going to use your own stopword table for InnoDB FTS, at least for the moment, this table must use the latin1 character set.

mysql: CREATE TABLE innodb_ft_list2 ( value VARCHAR(18) NOT NULL DEFAULT '') ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)
mysql: SET GLOBAL innodb_ft_server_stopword_table='test/innodb_ft_list2';
Query OK, 0 rows affected (0.00 sec)
mysql: INSERT INTO innodb_ft_list2 SELECT * FROM innodb_myisam_stopword;
Query OK, 543 rows affected (0.01 sec)
Records: 543  Duplicates: 0  Warnings: 0

As far as I can tell, this little gotcha doesn’t appear to be mentioned anywhere in the MySQL 5.6 documentation; every place where it talks about creating one of these stopword tables, it simply mentions the table engine and the column name/type, so I’m not sure if this is an intentional restriction that just needs to be better documented or if it’s a limitation with the InnoDB FTS feature that will be removed in a later version.

Now that we’ve sorted this out, let’s drop and rebuild our FT index on the InnoDB table and try the above queries one more time. We already know what the MyISAM results are going to be; do our InnoDB results change? No, they are exactly the same, although the scores did change slightly.

What about a Boolean mode query? The docs tell us that if we use Boolean mode, and we put a “+” in front of our search term, then that term *must* appear in the search results. But does it?

With 5.5:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 5;
+-------+--------------------------+-------+
| id    | full_name                | score |
+-------+--------------------------+-------+
| 74717 | James R Peterson         |     1 |
|     1 | Harold Wesley Abbott Iii |     0 |
|     3 | William Doyle Abbott Jr  |     0 |
|     5 | Robert Jack Abraham      |     0 |
|     7 | Mark Allen Abrell        |     0 |
+-------+--------------------------+-------+

And with 5.6:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 5;
+---------+------------------+-------------------+
| id      | full_name        | score             |
+---------+------------------+-------------------+
|   74717 | James R Peterson | 23.63458251953125 |
| 1310720 | Terry Lynn Suter |                 0 |
| 1441792 | Jorge E Morrison |                 0 |
| 1310976 | Oscar Blakemore  |                 0 |
| 1442048 | Donald Simmons   |                 0 |
+---------+------------------+-------------------+

There’s only one row in the table that actually matches all three search terms, and in this case, both MyISAM and InnoDB FTS performed identically and found it. I’m not really concerned about the fact that the next four rows are completely different; the scores are zero, which means “no match.” This looks promising, so let’s explore further. Again, from the docs, if we run a boolean mode query where some of the search terms are prefixed with “+” and others have no prefix, results that have the unprefixed term will be ranked higher than those with out it. So, for example, if we change the above query to be “+james +peterson arizona” then we might expect to get back multiple matches containing the words “James” and “Peterson”, and we should expect the record from Arizona to be towards the top of the list.

With 5.5, this is exactly what happens:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 5;
+--------+------------------------------+--------------------+
| id     | full_name                    | score              |
+--------+------------------------------+--------------------+
|  74717 | James R Peterson             | 1.3333333730697632 |
|  14159 | Christopher Michael James    |                  1 |
|  44427 | James Cyrus Peterson         |                  1 |
|  53501 | James/Rober T Giles/Peterson |                  1 |
| 126373 | Bamish James Peterson        |                  1 |
+--------+------------------------------+--------------------+

With 5.6, we’re not so fortunate.

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 5;
+--------+--------------------------+--------------------+
| id     | full_name                | score              |
+--------+--------------------------+--------------------+
| 741223 | Alphonso Lee Peterson Sr | 59.972904205322266 |
| 925811 | James B Peterson         |  59.37348937988281 |
| 269589 | Michael James Peterson   |  44.82982635498047 |
| 471273 | James Allen Peterson     |  39.58232879638672 |
| 925781 | Anthony James Petersen   |  38.03296661376953 |
+--------+--------------------------+--------------------+

These results aren’t even close to identical. As it turns out, the full record for “Alphonso Lee Peterson Sr” does also contain the name “James”, and the word “Peterson” is listed in there several times, but “Arizona” is not present at all, whereas the record for “James R Peterson” had all three search terms and no significant repetition of any of them. Using this particular query, “James R Peterson” is #15 on the list.

At this point, it’s pretty obvious that the way MyISAM is calculating the scores is much different from the way that InnoDB is doing it, and given what I said earlier about the repetition of words in the “Alphonso Lee Peterson Sr” record versus the “James R Peterson” one, we might argue that InnoDB is actually behaving more correctly than MyISAM. Imagine if we were searching through newspaper articles or something of that sort, and we were looking for queries containing the word “MySQL” – odds are that an article which has 10 instances of “MySQL” might be more desirable to us than an article which only has it mentioned once. So if I look at these results from that persepctive, I can understand the how and the why of it. My concern is that there are likely going to be people who believe that switching to InnoDB FTS is simply a matter of upgrading to 5.6 and running ALTER TABLE foo ENGINE=InnoDB. In theory, yes. In practice, not even close.

I tried one more Boolean search, this time looking for someone’s full name, which I knew to be present only once in the database, and I used double quotes to group the search terms as a single phrase:

With 5.5:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 1;
+-------+----------------+-------+
| id    | full_name      | score |
+-------+----------------+-------+
| 62633 | Thomas B Smith |     1 |
+-------+----------------+-------+

Looks good, there he is. Now what happens under 5.6?

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
+---------+------------------+-------+
| id      | full_name        | score |
+---------+------------------+-------+
| 1310720 | Terry Lynn Suter |     0 |
+---------+------------------+-------+

In the immortal words of Homer J. Simpson, “D’OH!!” Why is MyISAM able to locate this record but InnoDB cannot find it at all? I suspect that the “B” is causing problems for InnoDB, because it’s only a single character and we’ve set innodb_ft_min_token_size to 4. Thus, when InnoDB is parsing the data and building the word list, it’s completely ignoring Mr. Smith’s middle initial. To test this hypothesis, I reset innodb_ft_min_token_size to 1, dropped/rebuilt the InnoDB index, and tried again.

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
+-------+----------------+--------------------+
| id    | full_name      | score              |
+-------+----------------+--------------------+
| 62633 | Thomas B Smith | 31.749061584472656 |
+-------+----------------+--------------------+

Aha, there he is! Based on that result, I would caution anyone designing an application that’s going to use InnoDB FTS to be quite mindful of the types of queries that you’re expecting your users to run. In particular, if you expect or are going to allow users to enter search phrases that include initials, numbers, or any other string of length less than 3 (the default), I think you’re going to be forced to set innodb_ft_min_token_size to 1. Otherwise you’ll run into the same problem as our Mr. Smith here. [This does raise the question of why it works with MyISAM when ft_min_word_length defaults to 4, but that is a topic for another day.]

Note that there may or may not be some performance implications to cranking this value all the way down; that is something I have not yet tested but will be reporting on in part 3 of this series. I can, however, confirm that the on-disk size of my DIR dataset is exactly the same with a setting of 1 versus a setting of 4. This may or may not be the case with multi-byte character sets or with ideographic languages such as Japanese, although Japanese poses its own unique problems for FTS of any kind due to its lack of traditional word boundaries.

In any event, it appears that we’ve solved the Boolean-mode search issue, but we still have vastly different results with the natural-language-mode search. For those of you who are expecting and need to have the MyISAM-style search results, there is at least one potential escape hatch from this rabbit hole. When defining a FULLTEXT KEY, you can use the “WITH PARSER” modifier to specify the name of a UDF which references your own custom-written fulltext parser plugin. Thus I am thinking that it may be possible to take the MyISAM full-text parser code, convert it to a plugin, and use it for InnoDB FT indexes where you’re expecting MyISAM-style results. Verifying or refuting this conjecture is left as an exercise for the reader. :-)

A quick recap of what we’ve learned so far:

  • There are parts of InnoDB FTS configuration which are both letter-case and character-set sensitive. Watch out!
  • When you add your first FULLTEXT KEY to an InnoDB table, be prepared for a table rebuild.
  • Calculation of match score is completely different between the two engines; sometimes this leads to wildly different results.
  • If you were hoping to use InnoDB FTS as a simple drop-in replacement for your current MyISAM FTS, the results may surprise you.

That last point bears particular emphasis, as it also illustrates an important best practice even if FTS isn’t involved. Always test how your application behaves as a result of a major MySQL version upgrade before rolling it into production! Percona has tools (pt-upgrade and Percona Playback) that can help you with this. These tools are free and open source, please use them. You, and your users, will be happy that you did.

In the third and final installment of this series, we will take a look at performance. How does the speed of InnoDB FTS compare to its MyISAM counterpart, and what kinds of tricks can we use to make it go faster? Stay tuned to find out!

The post InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries! appeared first on MySQL Performance Blog.

Feb
26
2013
--

InnoDB Full-text Search in MySQL 5.6 (part 1)

I’ve never been a very big fan of MyISAM; I would argue that in most situations, any possible advantages to using MyISAM are far outweighed by the potential disadvantages and the strengths of InnoDB. However, up until MySQL 5.6, MyISAM was the only storage engine with support for full-text search (FTS). And I’ve encountered many customers for whom the prudent move would be a migration to InnoDB, but due to their use of MyISAM FTS, the idea of a complete or partial migration was, for one reason or another, an impractical solution. So, when FTS for InnoDB was first announced, I thought this might end up being the magic bullet that would help these sorts of customers realize all of the benefits that have been engineered into InnoDB over the past few years and still keep their FTS capability without having to make any significant code changes.

Unfortunately, I think that hope may be premature. While it is true that InnoDB FTS in MySQL 5.6 is syntactically identical to MyISAM FTS, in the sense that the SQL required to run a MATCH .. AGAINST is the same (modulo any new features introduced with InnoDB FTS), that’s largely where the similarities end.

NOTE 1: I was originally planning to cover everything I wanted to discuss with respect to InnoDB FTS in this one post, but I think there’s a lot of interesting stuff here, so I will break it into three pieces instead. The first part (this one) will be a very quick overview of FTS in InnoDB and some observations that I’ve made while getting it configured. The second part will compare query results between MyISAM FTS and InnoDB FTS over the same data sets, and then finally in the third installment, we’ll look at query performance. In the event that a new release of 5.6 appears between now and part 3, I’ll also revisit some of the “quirks” from parts 1 and 2 to see if the behavior has changed.

NOTE 2: For purposes of this discussion, I used two separate data sets. The first one is a set of about 8K very SEO-stuffed web pages, wherein the document title is the page title, and the document body is the HTML-tag-stripped body of the page. We’ll call this data set “SEO” – it’s about 20MB of actual data. The other one is a set of about 790K directory records, each one containing the name, address, and some other public-records-type information about each person. We’ll call this data set “DIR”, and it’s about 155MB of actual data.

NOTE 3: Also, keep in mind that I used the community editions of MySQL 5.5.30 and MySQL 5.6.10 with no tuning whatsoever (with one exception that I’ll explain below) – the idea behind this investigation wasn’t to find out how to make InnoDB FTS blazingly-fast, but simply to get a sense of how it works compared to traditional MyISAM FTS. We’ll get to performance in the third installment. For now, the important number here is to note that the InnoDB buffer pool for my 5.6 instance is 128MB – smaller than the size of my DIR data.

So, with all of that out of the way, let’s get to it.

Here is our basic table definition for the DIR dataset. The table for the SEO dataset looks identical, except that we replace “full_name” with a VARCHAR(255) “title” and “details” with a TEXT “body”.

CREATE TABLE `dir_test_innodb` (
  `id` int(10) unsigned NOT NULL,
  `full_name` varchar(100) DEFAULT NULL,
  `details` text,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `full_name` (`full_name`,`details`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

We also have identical tables created in 5.5.30 where, of course, the only difference is that the engine is MyISAM rather than InnoDB. Loading the data was done via a simple Perl script, inserting one row at a time with AutoCommit enabled – remember, the focus here isn’t on performance just yet.

Having loaded the data, the first thing we notice is that there are a lot of “new” InnoDB tablespace files in our database directory:

-rw-rw----. 1 mysql mysql      8632 Feb 20 15:54 dir_test_innodb.frm
-rw-rw----. 1 mysql mysql 213909504 Feb 20 15:55 dir_test_innodb.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_0000000000000153_DOC_ID.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_ADDED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_BEING_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_BEING_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_CONFIG.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_STOPWORDS.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_0000000000000144_DOC_ID.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_ADDED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_BEING_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:55 FTS_0000000000000114_BEING_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:55 FTS_0000000000000114_CONFIG.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:55 FTS_0000000000000114_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_STOPWORDS.ibd
-rw-rw----. 1 mysql mysql      8618 Feb 20 16:09 seo_test_innodb.frm
-rw-rw----. 1 mysql mysql  37748736 Feb 20 16:29 seo_test_innodb.ibd

By comparison, this is what we see on the MyISAM side:

-rw-rw----. 1 mysql mysql      8632 Feb 19 17:17 dir_test_myisam.frm
-rw-rw----. 1 mysql mysql 155011048 Feb 19 17:17 dir_test_myisam.MYD
-rw-rw----. 1 mysql mysql 153956352 Feb 19 17:18 dir_test_myisam.MYI
-rw-rw----. 1 mysql mysql      8618 Feb 20 16:11 seo_test_myisam.frm
-rw-rw----. 1 mysql mysql  21561096 Feb 20 16:11 seo_test_myisam.MYD
-rw-rw----. 1 mysql mysql  14766080 Feb 20 16:11 seo_test_myisam.MYI

I also observed that if I simply load the data into an InnoDB table that has never had a full-text index on it, and then I create one, the following warning is generated:

mysql> alter table dir_test_innodb ADD FULLTEXT KEY (full_name, details);
Query OK, 0 rows affected, 1 warning (39.73 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+

This doesn’t make a lot of sense to me. Why does InnoDB need to add a hidden column (similar to GEN_CLUST_INDEX when you don’t define a PRIMARY KEY, I assume) when I already have an INT UNSIGNED PK that should suffice as any sort of document ID ? As it turns out, if you create a column called FTS_DOC_ID which is a BIGINT UNSIGNED NOT NULL with a unique index on it, your table doesn’t need to be rebuilt. The most important item to note here – FTS_DOC_ID must be spelled and specified exactly that way – IN ALL CAPS. If you try “fts_doc_id” or any other mixture of lettercase, you’ll get an error:

mysql> CREATE TABLE dir_test_innodb4 (fts_doc_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY);
ERROR 1166 (42000): Incorrect column name 'fts_doc_id'

Various points in the documentation mention the notion of a “document ID” that “might reflect the value of an ID column that you defined for the underlying table, or it can be a sequence value generated by InnoDB when the table does not contain a suitable column,” but there are only a handful of references to FTS_DOC_ID found when searching the MySQL 5.6 manual, and the only page which appears to suggest how using an explicitly-defined column is done is this one, which discusses improving bulk insert performance. At the very bottom, the page claims that you can speed up bulk loading into an InnoDB FT index by declaring a column called FTS_DOC_ID at table creation time of type BIGINT UNSIGNED NOT NULL with a unique index on it, loading your data, and then creating the FT index after the data is loaded.

One obvious problem wih those instructions is that if you define a column and a unique key as they suggest, your data won’t load due to unique key constraint violations unless you also do something to provide some sort of sequence value for that column, whether as an auto_increment value or via some other means, but the bit that troubles me further is the introduction of a column-level case-sensitivity requirement that only seems to actually matter at table creation time. Once I’ve got a table with an explicit FTS_DOC_ID column, however, MySQL apparently has no problem with either of the following statements:

mysql> insert into dir_test_innodb3 (fts_doc_id, full_name) values (1, 'john smith');
mysql> select * from dir_test_innodb3 where fts_doc_id=1;

Philosophically, I find that kind of behavior unsettling. I don’t like case-sensitivity in my table or column names to begin with (I may be one of the few people that likes lower_case_table_names = 1 in /etc/my.cnf), but I think it’s even worse that the case-sensitivity only matters some of the time. That strikes me as a good recipe for DBA frustration.

Now, let’s return to all of those FTS_*.ibd files. What, exactly, are they? In short, the _CONFIG.ibd file contains configuration info about the FT index (the same sort of configuration data that can be queried out of the I_S.INNODB_FT_CONFIG table, as I’ll discuss momentarily), and the others contain document IDs of new rows that are added to or removed from the table and which need to be merged back into or removed from the main index. I’m not entirely sure about the _STOPWORDS.ibd file just yet; I thought it might be somehow related to a custom stopwords table, but that doesn’t seem to be the case (or at least not in the way that I had thought), so I will need to look through the source to figure out what’s going on there.

In any case, for each new FULLTEXT KEY that you create, you’ll get a corresponding FTS_*_DOC_ID.ibd file (but none of the others), and if you drop a FT index, the its corresponding FTS_*_DOC_ID.ibd file will also be removed. HOWEVER, even if you drop all of the FT indexes for a given table, you’re still left with all of the other FTS_*.ibd files, and it appears that the only way to get rid of them is to actually rebuild the table.

-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_ADDED.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_BEING_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_BEING_DELETED.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 19:07 FTS_0000000000000025_CONFIG.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_DELETED.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_STOPWORDS.ibd

Also, while we’re on the subject of adding and dropping FT indexes, it’s entirely possible to DROP multiple FT indexes with InnoDB in the same ALTER TABLE statement, but it’s not possible to CREATE more than one at a time. If you try it, this is what happens:

mysql> alter table dir_test_innodb ADD FULLTEXT KEY (full_name, details), 
       ADD FULLTEXT KEY (details);
ERROR 1795 (HY000): InnoDB presently supports one FULLTEXT index creation at a time

That’s an odd limitation. Do it as two separate ALTER statements, and it appears to work fine.

But here’s where things start to get even weirder. According to the documentation, if we specify the name of a table that has a FT index for the global variable innodb_ft_aux_table, we should be able to get some statistics about the FT indexes on that table by querying the various I_S.INNODB_FT_* tables. In particular, the INNODB_FT_CONFIG table is supposed to “display metadata about the FULLTEXT index and associated processing for an InnoDB table.” The documentation also tells us that we can keep our FT indexes up to date by setting innodb_optimize_fulltext_only = 1 and then running OPTIMIZE TABLE, and that we might have to run OPTIMIZE TABLE multiple times if we’ve had a lot of changes to the table.

This all sounds pretty good, in theory, but at least some part of it doesn’t seem to work. First, let’s check the stats immediately after setting these variables, and then let’s push some additional data into the table, run an OPTIMIZE or two, delete some data, and see what happens:

mysql> set global innodb_ft_aux_table='test/dir_test_innodb';
mysql> set global innodb_optimize_fulltext_only=1;
mysql> select * from information_schema.innodb_ft_config;
+---------------------------+--------+                                                  
| KEY                       | VALUE  |                                                  
+---------------------------+--------+                                                  
| optimize_checkpoint_limit | 180    |                                                  
| synced_doc_id             | 787625 |                                                  
| last_optimized_word       |        |                                                  
| deleted_doc_count         | 0      |                                                  
| total_word_count          |        |                                                  
| optimize_start_time       |        |                                                  
| optimize_end_time         |        |                                                  
| stopword_table_name       |        |                                                  
| use_stopword              | 1      |
| table_state               | 0      |
+---------------------------+--------+
10 rows in set (0.00 sec)

mysql> insert into dir_test_innodb (full_name, details) 
       SELECT reverse(full_name), details FROM dir_test_innodb WHERE id < 500000; 
Query OK, 245051 rows affected (8 min 13.13 sec) 
Records: 245051  Duplicates: 0  Warnings: 0 

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+---------+
| KEY                       | VALUE   |
+---------------------------+---------+
| optimize_checkpoint_limit | 180     |
| synced_doc_id             | 1028261 |
| last_optimized_word       |         |
| deleted_doc_count         | 0       |
| total_word_count          |         |
| optimize_start_time       |         |
| optimize_end_time         |         |
| stopword_table_name       |         |
| use_stopword              | 1       |
| table_state               | 0       |
+---------------------------+---------+
10 rows in set (0.02 sec)

mysql> optimize table dir_test_innodb;
+----------------------+----------+----------+----------+
| Table                | Op       | Msg_type | Msg_text |
+----------------------+----------+----------+----------+
| test.dir_test_innodb | optimize | status   | OK       |
+----------------------+----------+----------+----------+
1 row in set (4.60 sec)

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+---------+
| KEY                       | VALUE   |
+---------------------------+---------+
| optimize_checkpoint_limit | 180     |
| synced_doc_id             | 1032677 |
| last_optimized_word       |         |
| deleted_doc_count         | 0       |
| total_word_count          |         |
| optimize_start_time       |         |
| optimize_end_time         |         |
| stopword_table_name       |         |
| use_stopword              | 1       |
| table_state               | 0       |
+---------------------------+---------+
10 rows in set (0.00 sec)

mysql> delete from dir_test_innodb LIMIT 200000;
Query OK, 200000 rows affected (8.65 sec)

mysql> optimize table dir_test_innodb;           
+----------------------+----------+----------+----------+
| Table                | Op       | Msg_type | Msg_text |
+----------------------+----------+----------+----------+
| test.dir_test_innodb | optimize | status   | OK       |
+----------------------+----------+----------+----------+
1 row in set (8.44 sec)

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+---------+
| KEY                       | VALUE   |
+---------------------------+---------+
| optimize_checkpoint_limit | 180     |
| synced_doc_id             | 1032677 |
| last_optimized_word       |         |
| deleted_doc_count         | 0       |
| total_word_count          |         |
| optimize_start_time       |         |
| optimize_end_time         |         |
| stopword_table_name       |         |
| use_stopword              | 1       |
| table_state               | 0       |
+---------------------------+---------+

I ran OPTIMIZE TABLE several more times, and each execution took between 6 and 8 seconds, but the output of the query against I_S.innodb_ft_config never changed, so it seems like at least some of the diagnostic output isn’t working quite right. Intuitively, I would expect some changes in total_word_count, or optimize_(start|end)_time, and the like. However, if I check some of the other I_S tables, I do find that the number of rows in I_S.innodb_ft_index_table is changing, so it’s pretty clear that I do actually have a FT index available.

At the start of this post, I mentioned that I did make one configuration change to the default InnoDB settings for MySQL 5.6, and that was to change innodb_ft_min_token_size from the default of 3 to a value of 4 so that it would be identical to the MyISAM default. After all, the (naive?) hope here is that when I run an FTS query against both MyISAM and InnoDB I will get back the same results; if this equivalence doesn’t hold, then as a consultant, I might have a hard time recommending this feature to my customers, and as an end user, I might have a hard time using the feature at all, because it could completely alter the behavior of my application unless I also make a nontrivial number of code changes.

In part 2 of this series, we’ll reload and reset our SEO and DIR data sets back to their initial states, run some queries, and compare the output. Stay tuned, it gets rather curious.

The post InnoDB Full-text Search in MySQL 5.6 (part 1) appeared first on MySQL Performance Blog.

Nov
22
2010
--

Moving from MyISAM to Innodb or XtraDB. Basics

I do not know if it is because we’re hosting a free webinar on migrating MyISAM to Innodb or some other reason but recently I see a lot of questions about migration from MyISAM to Innodb.

Webinar will cover the process in a lot more details though I would like to go over basics in advance. You can also check my old post on this topic about Moving MyISAM to Innodb as well as searching the blog – We’ve blogged so much on this topic.

So what are the basics ?

Regression Benchmarks – Make sure to run regression benchmarks in particular in terms of concurrent behavior. You may have hidden dependencies of MyISAM table lock behavior in your applications, also check if your application handled deadlocks well. MyISAM will not produce deadlocks for Innodb you should always see deadlocks as a probability as long as you write to the database. They may be rate but it is rather hard to guaranty you will never run into them.

Performance Benchmarks – Innodb and MyISAM have different performance properties and you can’t really say one is faster than other it is very much workload dependent, and again concurrent tests should be important here. Innodb also may result in different plans for some queries which is rather easy to check with mk-upgrade

Feature Differences – There are some feature differences between MyISAM and Innodb though well it is typically easily spotted by converting tables to Innodb on restored backup. Full text search indexes, GIS, multi-column auto increment keys are great examples. There are also different limits for MyISAM and Innodb – it is possible to have some rows which can be stored in MyISAM but would not fit to Innodb, though this is an exception.

Space Innodb Tables tend to be larger. Again converting schema will show you some of this. Though the best is to take a look at the size after stressing system with load for a while as depending on the schema and usage Innodb tables may increase in size significantly due to fragmentation.

Usage Differences What works well for MyISAM may not work for Innodb and vice versa. You may benefit from different index structure for Innodb, including different primary key setup, as well as you may want to structure workload differently. With MyISAM it is often for people to do updates in small chunks, almost row by row to avoid holding table lock for long time in Innodb you want larger updates to reduce cost of transaction commit. You also may want to avoid excessive SELECT COUNT(*) FROM TBL (with no where clause) which is very fast for MyISAM but does table/index scan for Innodb.

Defaults You need to know two things about defaults for Innodb. First Depending on MySQL version they may be somewhere from suboptimal to absolutely disastrous, you do not want to try to run Innodb or XtraDB with them. Second Innodb is tuned to be ACID by default – if you’re moving from MyISAM often you do not need such strong guarantees and can at least change innodb_flush_log_at_trx_commit=2. It still will be much more secure than storing data in MyISAM. The 3 most important values to check are innodb_flush_log_at_trx_commit, innodb_buffer_pool_size and innodb_log_file_size. There are a lot more options for fine tuning but make sure at least these are right.


Entry posted by peter |
One comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com