Oct
23
2014
--

MySQL 5.6 Full Text Search Throwdown: Webinar Q&A

MySQL 5.6 Full Text Search Throwdown: Webinar Q&AYesterday (Oct. 22) I gave a presentation titled “MySQL 5.6 Full Text Search Throwdown.” If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions. I answered as many as we had time for during the session, but here are all the questions with my complete answers:

Q: Does Solr automatically maintain its index against MySQL? Do you have to hit the Solr server with a specific query to keep the index ‘warm’?

There are several strategies for updating a Solr index. In my examples, I showed only a “full import” which is what you would do to create an index by reading all the source data.

You can also perform a “delta import” periodically, to add a subset of the source data to an existing index, for example to add data that has changed since the last time you updated the Solr index. See the documentation for Using delta-import command and also Using query attribute for both full and delta import.

The delta import would typically be something you would invoke from a cron job, perhaps every hour. But that means that a Solr search might not find data that has changed in MySQL more recently than the last delta import. Depending on the application, a delay of up to 60 minutes might be acceptable, or else maybe you have strict requirements that all data must be in sync instantly.

You could also update the Solr index one document at a time using its Java API or web service API. This would require you to write code in your application. Every time you INSERT or UPDATE or DELETE a document in MySQL that you want to be kept in sync with the Solr index, you would write more code to do a similar operation in the Solr index. That way every single text change would be searchable nearly immediately.

Q: Did you test Elasticsearch? (several people asked about this)

I did not test Elasticsearch, but according to their technology overview: “Elasticsearch uses Lucene under the covers.” So I expect that this part of Elasticsearch performs similarly to what I saw from Apache Solr, which also uses Lucene internally.

Q: One question I could not understand, how to maintain Sphinx index in sync with data? Can be it in real time?

The Sphinx Search index does not automatically refresh as your MySQL data changes. You would have to write application code to invoke the indexing process. There’s a page in the Sphinx Search documentation about Live Index Updates, that gives an overview of the two methods, and links to further reading.

This is definitely the most inconvenient aspect of Sphinx Search. Queries are very fast, but it’s expensive to do incremental updates to an index. So it’s ideal for indexing an archive of text that doesn’t change very frequently, but not as easy to use it for indexing rapidly-changing content.

Q: I have over 800,000 PDF documents to index (in several languages), any recommendations?

I said during the webinar that I recalled there exists tools to extract searchable text from a PDF file. I found one such project called Apache PDFBox includes this capability, and they have a page describing a helper class for doing PDF parsing and extraction combined with Lucene indexing. I haven’t used it myself, so I can’t comment on its performance for indexing 800,000 PDF documents, but it seems like you could write a Java program to iterate over your collection of PDF’s, and index them using this class.

Q: What is your suggestion to use Sphinx Search for single column searches?

You can use any SQL query in the sphinx.conf to define the source data to index. You can select one column, multiple columns, or even multiple columns from joined tables. The result from any SQL query you write can be used as the data source.

Q: Which modules did you use with Sphinx Search? Did you use its built-in stemmers and metaphone package, etc.?

I installed the default modules. I don’t know if there is a significant performance difference from using optional packages.

Q: What about quality of results from each solution? I remember reading an article on percona.com several months ago comparing MyISAM fulltext vs InnoDB fulltext, and there were concerns about the results from InnoDB. Did you do any testing on this?

Indeed, here’s a link to the excellent blog post by my colleague Ernie Souhrada in which he found some surprises in the results from InnoDB FTS: InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

I was just doing some comparison for performance in the current MySQL 5.7 milestone. I didn’t compare the query results this time.

Q: Is there any full text search in Percona Server with XtraDB?

Percona Server is based on the upstream MySQL Community Edition of the respective version number. So Percona Server has the builtin FULLTEXT index types for MyISAM and InnoDB, and we have not changed this part of the code. Percona Server does not bundle Sphinx Search, but it’s not too difficult to install Sphinx Search as a complementary technology, just as you would install other packages that are commonly used parts of an application infrastructure, for example Memcached or HA-proxy.

Q: Is MySQL going to improve the built-in InnoDB FTS in the near future?

They are continuing to add features that improve FTS, for example:

  • You can now write your own plugins for fulltext parsing (that is, parsing the input data to identify “words” to index; you may have your own idea about how to split text into words).
  • Both B-tree and full-text types now uses bulk-loading to make it faster and more efficient to build the index.

I’m not aware of any work to improve the performance of fulltext queries significantly.

Q: What is the performance comparison between MyISAM and InnoDB for inline index updating?

I didn’t test performance of incremental index updates this time. I only populated my tables from the StackOverflow data using LOAD XML, and then I created fulltext indexes on the populated tables. But I generally favor moving all important data to InnoDB, and not using MyISAM tables. It’s hard to imagine that the performance of index updates would be so much better that would convince me to use MyISAM. It’s more likely that the accuracy of search results would be a good reason to use MyISAM. Even then, I’d keep the original data in InnoDB and use MyISAM only as a copy of the data, to create a disposable fulltext index.

Thanks again for attending my webinar! For more great content, please join Percona and the MySQL community at our conference events. The next one is Percona Live London 2014 on November 3-4. We also look forward to the Open Stack Live 2015 in Santa Clara, California April 13-14, in the same venue with Percona Live MySQL Conference and Expo 2015, April 13-16.

Also watch more webinars from Percona in the future!

The post MySQL 5.6 Full Text Search Throwdown: Webinar Q&A appeared first on MySQL Performance Blog.

Mar
04
2013
--

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

This is part 2 in a 3 part series. In part 1, we took a quick look at some initial configuration of InnoDB full-text search and discovered a little bit of quirky behavior; here, we are going to run some queries and compare the result sets. Our hope is that the one of two things will happen; either the results returned from a MyISAM FTS query will be exactly identical to the same query when performed against InnoDB data, OR that the results returned by InnoDB FTS will somehow be “better” (as much as it’s actually possible to do this in a single blog post) than what MyISAM gives us.

Recall that we have two different sets of data, one which is the text of roughly 8000 SEO-stuffed webpage bodies (we call that one SEO) and the other, which we call DIR, that is roughly 800,000 directory records with name, address, and the like. We are using MySQL 5.5.30 and MySQL 5.6.10 with no configuration tuning other than to set innodb_ft_min_token_size to 4 (rather than the default of 3) so that it matches MyISAM’s default ft_min_word_length.

First, MyISAM, with MySQL 5.5, on the SEO data set:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_myisam ORDER BY 3 DESC LIMIT 5;
+------+-----------------------------------------------------------------------+--------------------+
| id   | title                                                                 | score              |
+------+-----------------------------------------------------------------------+--------------------+
|  458 | Free Business Forms for Starting or Maintaining a Successful Business | 1.3383517265319824 |
| 7112 | Download Idaho Tax Forms for Individual or Business Needs             | 0.9273209571838379 |
| 7113 | Illinois Tax Forms for Individuals and Business                       | 0.8827990889549255 |
| 7121 | Massachusetts Tax Forms                                               | 0.8678107261657715 |
|  454 | Business Accounting and Invoice Forms                                 | 0.8668115139007568 |
+------+-----------------------------------------------------------------------+--------------------+

The same query, run against InnoDB on 5.6.10:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_innodb ORDER BY 3 DESC LIMIT 5;
+----+------------------------------------------------------------------------+-------------------+
| id | title                                                                  | score             |
+----+------------------------------------------------------------------------+-------------------+
| 48 | California Free Public Records, Criminal Records And Background Checks | 21.23662567138672 |
| 66 | Michigan Free Public Records, Criminal Records And Background checks   | 5.363710880279541 |
| 44 | Alabama Free Public Records, Criminal Records And Background Checks    | 5.310127258300781 |
| 57 | Illinois Free Public Records, Criminal Records And Background Checks   | 4.569097518920898 |
| 70 | Montana Free Public Records, Criminal Records And Background Checks    | 4.516233444213867 |
+----+------------------------------------------------------------------------+-------------------+

Wow. I’m not sure if I should be concerned so much that the *scores* are different, but the *matches* are COMPLETELY DIFFERENT between 5.5/MyISAM and 5.6/InnoDB. Now, we know that MyISAM FTS does have the caveat with natural language searches whereby a word that’s present in 50% or more of the rows is treated as a stopword, so does that account for our problem? It might, because the word ‘arizona’ appears in over 6900 of the 7150 rows, and the word ‘records’ appears in 7082 of them. So let’s try something else that’s less likely to have that issue. The word “corporation” appears in 143 of the documents; the word “forms” appears in 439 of them, and the word “commission” appears in 130. There might be some overlap here, but even if there isn’t, 143+130+439 < 0.5 * 7150, so none of these should be treated as stopwords in MyISAM.

With 5.5:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_myisam ORDER BY 3 DESC LIMIT 5;
+------+--------------------+--------------------+
| id   | title              | score              |
+------+--------------------+--------------------+
| 7132 | New York Tax Forms |  7.821961879730225 |
| 7127 | Nebraska Tax Forms |  6.882259845733643 |
| 7123 | Free Forms         |  6.748578071594238 |
| 7126 | Montana Tax Forms  | 6.4749345779418945 |
| 7119 | Maine Tax Forms    |  6.400341510772705 |
+------+--------------------+--------------------+

With 5.6:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_innodb ORDER BY 3 DESC LIMIT 5;
+-----+--------------------------------------------------------------------------+--------------------+
| id  | title                                                                    | score              |
+-----+--------------------------------------------------------------------------+--------------------+
|  79 | Ohio Free Public Records, Criminal Records And Background Checks         |  51.76125717163086 |
|  78 | Free North Dakota Public Records, Criminal Records And Background Checks |  30.32522201538086 |
|  19 | Free Public Printable Forms For All Industries                           | 27.557558059692383 |
| 408 | Free Colorado DMV Resources and Driving Records                          | 24.933029174804688 |
|  48 | California Free Public Records, Criminal Records And Background Checks   | 24.224336624145508 |
+-----+--------------------------------------------------------------------------+--------------------+

OK, now I’m starting to get a little worried. The docs do tell us that the default stopword list is substantially different between InnoDB and MyISAM, and as it turns out, there are only 36 stopwords in the default InnoDB list, but there are 543 stopwords in the default MyISAM list. What happens if we take the MyISAM stopwords, insert them into a table, and configure that table to be our stopword list for InnoDB?

This is the table that we’re trying to emulate:

mysql: SHOW CREATE TABLE information_schema.innodb_ft_default_stopword\G
*************************** 1. row ***************************
       Table: INNODB_FT_DEFAULT_STOPWORD
Create Table: CREATE TEMPORARY TABLE `INNODB_FT_DEFAULT_STOPWORD` (
  `value` varchar(18) NOT NULL DEFAULT ''
) ENGINE=MEMORY DEFAULT CHARSET=utf8

The docs tell us that we need to create an *InnoDB* table with a single VARCHAR column named “value”. OK, sounds easy enough:

mysql: SHOW CREATE TABLE innodb_myisam_stopword\G
*************************** 1. row ***************************
       Table: innodb_myisam_stopword
Create Table: CREATE TABLE `innodb_myisam_stopword` (
  `value` varchar(18) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8

But, when we try to use this table, here’s what comes back:

mysql: SET GLOBAL innodb_ft_server_stopword_table='test/innodb_myisam_stopword';
ERROR 1231 (42000): Variable 'innodb_ft_server_stopword_table' can't be set to the value of 'test/innodb_myisam_stopword'

And here’s what appeared in the server’s error log:

InnoDB: invalid column type for stopword table test/innodb_myisam_stopword. Its first column must be of varchar type

Uh… Does this mean that my next blog post should be entitled, “When is a VARCHAR Not Really a VARCHAR?” Thinking that maybe this was a case of GEN_CLUST_INDEX causing me issues, I tried adding a second column to the table which was an integer PK, and in another attempt, I tried just making the “value” column the PK, but neither of those worked. Also, trying to set innodb_ft_user_stopword_table produced the same error. I submitted a bug report (68450), and as you can see from the bug discussion, it turns out that this table is character-set-sensitive. If you’re going to use your own stopword table for InnoDB FTS, at least for the moment, this table must use the latin1 character set.

mysql: CREATE TABLE innodb_ft_list2 ( value VARCHAR(18) NOT NULL DEFAULT '') ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)
mysql: SET GLOBAL innodb_ft_server_stopword_table='test/innodb_ft_list2';
Query OK, 0 rows affected (0.00 sec)
mysql: INSERT INTO innodb_ft_list2 SELECT * FROM innodb_myisam_stopword;
Query OK, 543 rows affected (0.01 sec)
Records: 543  Duplicates: 0  Warnings: 0

As far as I can tell, this little gotcha doesn’t appear to be mentioned anywhere in the MySQL 5.6 documentation; every place where it talks about creating one of these stopword tables, it simply mentions the table engine and the column name/type, so I’m not sure if this is an intentional restriction that just needs to be better documented or if it’s a limitation with the InnoDB FTS feature that will be removed in a later version.

Now that we’ve sorted this out, let’s drop and rebuild our FT index on the InnoDB table and try the above queries one more time. We already know what the MyISAM results are going to be; do our InnoDB results change? No, they are exactly the same, although the scores did change slightly.

What about a Boolean mode query? The docs tell us that if we use Boolean mode, and we put a “+” in front of our search term, then that term *must* appear in the search results. But does it?

With 5.5:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 5;
+-------+--------------------------+-------+
| id    | full_name                | score |
+-------+--------------------------+-------+
| 74717 | James R Peterson         |     1 |
|     1 | Harold Wesley Abbott Iii |     0 |
|     3 | William Doyle Abbott Jr  |     0 |
|     5 | Robert Jack Abraham      |     0 |
|     7 | Mark Allen Abrell        |     0 |
+-------+--------------------------+-------+

And with 5.6:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 5;
+---------+------------------+-------------------+
| id      | full_name        | score             |
+---------+------------------+-------------------+
|   74717 | James R Peterson | 23.63458251953125 |
| 1310720 | Terry Lynn Suter |                 0 |
| 1441792 | Jorge E Morrison |                 0 |
| 1310976 | Oscar Blakemore  |                 0 |
| 1442048 | Donald Simmons   |                 0 |
+---------+------------------+-------------------+

There’s only one row in the table that actually matches all three search terms, and in this case, both MyISAM and InnoDB FTS performed identically and found it. I’m not really concerned about the fact that the next four rows are completely different; the scores are zero, which means “no match.” This looks promising, so let’s explore further. Again, from the docs, if we run a boolean mode query where some of the search terms are prefixed with “+” and others have no prefix, results that have the unprefixed term will be ranked higher than those with out it. So, for example, if we change the above query to be “+james +peterson arizona” then we might expect to get back multiple matches containing the words “James” and “Peterson”, and we should expect the record from Arizona to be towards the top of the list.

With 5.5, this is exactly what happens:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 5;
+--------+------------------------------+--------------------+
| id     | full_name                    | score              |
+--------+------------------------------+--------------------+
|  74717 | James R Peterson             | 1.3333333730697632 |
|  14159 | Christopher Michael James    |                  1 |
|  44427 | James Cyrus Peterson         |                  1 |
|  53501 | James/Rober T Giles/Peterson |                  1 |
| 126373 | Bamish James Peterson        |                  1 |
+--------+------------------------------+--------------------+

With 5.6, we’re not so fortunate.

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 5;
+--------+--------------------------+--------------------+
| id     | full_name                | score              |
+--------+--------------------------+--------------------+
| 741223 | Alphonso Lee Peterson Sr | 59.972904205322266 |
| 925811 | James B Peterson         |  59.37348937988281 |
| 269589 | Michael James Peterson   |  44.82982635498047 |
| 471273 | James Allen Peterson     |  39.58232879638672 |
| 925781 | Anthony James Petersen   |  38.03296661376953 |
+--------+--------------------------+--------------------+

These results aren’t even close to identical. As it turns out, the full record for “Alphonso Lee Peterson Sr” does also contain the name “James”, and the word “Peterson” is listed in there several times, but “Arizona” is not present at all, whereas the record for “James R Peterson” had all three search terms and no significant repetition of any of them. Using this particular query, “James R Peterson” is #15 on the list.

At this point, it’s pretty obvious that the way MyISAM is calculating the scores is much different from the way that InnoDB is doing it, and given what I said earlier about the repetition of words in the “Alphonso Lee Peterson Sr” record versus the “James R Peterson” one, we might argue that InnoDB is actually behaving more correctly than MyISAM. Imagine if we were searching through newspaper articles or something of that sort, and we were looking for queries containing the word “MySQL” – odds are that an article which has 10 instances of “MySQL” might be more desirable to us than an article which only has it mentioned once. So if I look at these results from that persepctive, I can understand the how and the why of it. My concern is that there are likely going to be people who believe that switching to InnoDB FTS is simply a matter of upgrading to 5.6 and running ALTER TABLE foo ENGINE=InnoDB. In theory, yes. In practice, not even close.

I tried one more Boolean search, this time looking for someone’s full name, which I knew to be present only once in the database, and I used double quotes to group the search terms as a single phrase:

With 5.5:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 1;
+-------+----------------+-------+
| id    | full_name      | score |
+-------+----------------+-------+
| 62633 | Thomas B Smith |     1 |
+-------+----------------+-------+

Looks good, there he is. Now what happens under 5.6?

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
+---------+------------------+-------+
| id      | full_name        | score |
+---------+------------------+-------+
| 1310720 | Terry Lynn Suter |     0 |
+---------+------------------+-------+

In the immortal words of Homer J. Simpson, “D’OH!!” Why is MyISAM able to locate this record but InnoDB cannot find it at all? I suspect that the “B” is causing problems for InnoDB, because it’s only a single character and we’ve set innodb_ft_min_token_size to 4. Thus, when InnoDB is parsing the data and building the word list, it’s completely ignoring Mr. Smith’s middle initial. To test this hypothesis, I reset innodb_ft_min_token_size to 1, dropped/rebuilt the InnoDB index, and tried again.

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
+-------+----------------+--------------------+
| id    | full_name      | score              |
+-------+----------------+--------------------+
| 62633 | Thomas B Smith | 31.749061584472656 |
+-------+----------------+--------------------+

Aha, there he is! Based on that result, I would caution anyone designing an application that’s going to use InnoDB FTS to be quite mindful of the types of queries that you’re expecting your users to run. In particular, if you expect or are going to allow users to enter search phrases that include initials, numbers, or any other string of length less than 3 (the default), I think you’re going to be forced to set innodb_ft_min_token_size to 1. Otherwise you’ll run into the same problem as our Mr. Smith here. [This does raise the question of why it works with MyISAM when ft_min_word_length defaults to 4, but that is a topic for another day.]

Note that there may or may not be some performance implications to cranking this value all the way down; that is something I have not yet tested but will be reporting on in part 3 of this series. I can, however, confirm that the on-disk size of my DIR dataset is exactly the same with a setting of 1 versus a setting of 4. This may or may not be the case with multi-byte character sets or with ideographic languages such as Japanese, although Japanese poses its own unique problems for FTS of any kind due to its lack of traditional word boundaries.

In any event, it appears that we’ve solved the Boolean-mode search issue, but we still have vastly different results with the natural-language-mode search. For those of you who are expecting and need to have the MyISAM-style search results, there is at least one potential escape hatch from this rabbit hole. When defining a FULLTEXT KEY, you can use the “WITH PARSER” modifier to specify the name of a UDF which references your own custom-written fulltext parser plugin. Thus I am thinking that it may be possible to take the MyISAM full-text parser code, convert it to a plugin, and use it for InnoDB FT indexes where you’re expecting MyISAM-style results. Verifying or refuting this conjecture is left as an exercise for the reader. :-)

A quick recap of what we’ve learned so far:

  • There are parts of InnoDB FTS configuration which are both letter-case and character-set sensitive. Watch out!
  • When you add your first FULLTEXT KEY to an InnoDB table, be prepared for a table rebuild.
  • Calculation of match score is completely different between the two engines; sometimes this leads to wildly different results.
  • If you were hoping to use InnoDB FTS as a simple drop-in replacement for your current MyISAM FTS, the results may surprise you.

That last point bears particular emphasis, as it also illustrates an important best practice even if FTS isn’t involved. Always test how your application behaves as a result of a major MySQL version upgrade before rolling it into production! Percona has tools (pt-upgrade and Percona Playback) that can help you with this. These tools are free and open source, please use them. You, and your users, will be happy that you did.

In the third and final installment of this series, we will take a look at performance. How does the speed of InnoDB FTS compare to its MyISAM counterpart, and what kinds of tricks can we use to make it go faster? Stay tuned to find out!

The post InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries! appeared first on MySQL Performance Blog.

Feb
26
2013
--

InnoDB Full-text Search in MySQL 5.6 (part 1)

I’ve never been a very big fan of MyISAM; I would argue that in most situations, any possible advantages to using MyISAM are far outweighed by the potential disadvantages and the strengths of InnoDB. However, up until MySQL 5.6, MyISAM was the only storage engine with support for full-text search (FTS). And I’ve encountered many customers for whom the prudent move would be a migration to InnoDB, but due to their use of MyISAM FTS, the idea of a complete or partial migration was, for one reason or another, an impractical solution. So, when FTS for InnoDB was first announced, I thought this might end up being the magic bullet that would help these sorts of customers realize all of the benefits that have been engineered into InnoDB over the past few years and still keep their FTS capability without having to make any significant code changes.

Unfortunately, I think that hope may be premature. While it is true that InnoDB FTS in MySQL 5.6 is syntactically identical to MyISAM FTS, in the sense that the SQL required to run a MATCH .. AGAINST is the same (modulo any new features introduced with InnoDB FTS), that’s largely where the similarities end.

NOTE 1: I was originally planning to cover everything I wanted to discuss with respect to InnoDB FTS in this one post, but I think there’s a lot of interesting stuff here, so I will break it into three pieces instead. The first part (this one) will be a very quick overview of FTS in InnoDB and some observations that I’ve made while getting it configured. The second part will compare query results between MyISAM FTS and InnoDB FTS over the same data sets, and then finally in the third installment, we’ll look at query performance. In the event that a new release of 5.6 appears between now and part 3, I’ll also revisit some of the “quirks” from parts 1 and 2 to see if the behavior has changed.

NOTE 2: For purposes of this discussion, I used two separate data sets. The first one is a set of about 8K very SEO-stuffed web pages, wherein the document title is the page title, and the document body is the HTML-tag-stripped body of the page. We’ll call this data set “SEO” – it’s about 20MB of actual data. The other one is a set of about 790K directory records, each one containing the name, address, and some other public-records-type information about each person. We’ll call this data set “DIR”, and it’s about 155MB of actual data.

NOTE 3: Also, keep in mind that I used the community editions of MySQL 5.5.30 and MySQL 5.6.10 with no tuning whatsoever (with one exception that I’ll explain below) – the idea behind this investigation wasn’t to find out how to make InnoDB FTS blazingly-fast, but simply to get a sense of how it works compared to traditional MyISAM FTS. We’ll get to performance in the third installment. For now, the important number here is to note that the InnoDB buffer pool for my 5.6 instance is 128MB – smaller than the size of my DIR data.

So, with all of that out of the way, let’s get to it.

Here is our basic table definition for the DIR dataset. The table for the SEO dataset looks identical, except that we replace “full_name” with a VARCHAR(255) “title” and “details” with a TEXT “body”.

CREATE TABLE `dir_test_innodb` (
  `id` int(10) unsigned NOT NULL,
  `full_name` varchar(100) DEFAULT NULL,
  `details` text,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `full_name` (`full_name`,`details`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

We also have identical tables created in 5.5.30 where, of course, the only difference is that the engine is MyISAM rather than InnoDB. Loading the data was done via a simple Perl script, inserting one row at a time with AutoCommit enabled – remember, the focus here isn’t on performance just yet.

Having loaded the data, the first thing we notice is that there are a lot of “new” InnoDB tablespace files in our database directory:

-rw-rw----. 1 mysql mysql      8632 Feb 20 15:54 dir_test_innodb.frm
-rw-rw----. 1 mysql mysql 213909504 Feb 20 15:55 dir_test_innodb.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_0000000000000153_DOC_ID.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_ADDED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_BEING_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_BEING_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_CONFIG.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 16:09 FTS_00000000000000ad_STOPWORDS.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_0000000000000144_DOC_ID.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_ADDED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_BEING_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:55 FTS_0000000000000114_BEING_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:55 FTS_0000000000000114_CONFIG.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:55 FTS_0000000000000114_DELETED.ibd
-rw-rw----. 1 mysql mysql     98304 Feb 20 15:54 FTS_0000000000000114_STOPWORDS.ibd
-rw-rw----. 1 mysql mysql      8618 Feb 20 16:09 seo_test_innodb.frm
-rw-rw----. 1 mysql mysql  37748736 Feb 20 16:29 seo_test_innodb.ibd

By comparison, this is what we see on the MyISAM side:

-rw-rw----. 1 mysql mysql      8632 Feb 19 17:17 dir_test_myisam.frm
-rw-rw----. 1 mysql mysql 155011048 Feb 19 17:17 dir_test_myisam.MYD
-rw-rw----. 1 mysql mysql 153956352 Feb 19 17:18 dir_test_myisam.MYI
-rw-rw----. 1 mysql mysql      8618 Feb 20 16:11 seo_test_myisam.frm
-rw-rw----. 1 mysql mysql  21561096 Feb 20 16:11 seo_test_myisam.MYD
-rw-rw----. 1 mysql mysql  14766080 Feb 20 16:11 seo_test_myisam.MYI

I also observed that if I simply load the data into an InnoDB table that has never had a full-text index on it, and then I create one, the following warning is generated:

mysql> alter table dir_test_innodb ADD FULLTEXT KEY (full_name, details);
Query OK, 0 rows affected, 1 warning (39.73 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+

This doesn’t make a lot of sense to me. Why does InnoDB need to add a hidden column (similar to GEN_CLUST_INDEX when you don’t define a PRIMARY KEY, I assume) when I already have an INT UNSIGNED PK that should suffice as any sort of document ID ? As it turns out, if you create a column called FTS_DOC_ID which is a BIGINT UNSIGNED NOT NULL with a unique index on it, your table doesn’t need to be rebuilt. The most important item to note here – FTS_DOC_ID must be spelled and specified exactly that way – IN ALL CAPS. If you try “fts_doc_id” or any other mixture of lettercase, you’ll get an error:

mysql> CREATE TABLE dir_test_innodb4 (fts_doc_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY);
ERROR 1166 (42000): Incorrect column name 'fts_doc_id'

Various points in the documentation mention the notion of a “document ID” that “might reflect the value of an ID column that you defined for the underlying table, or it can be a sequence value generated by InnoDB when the table does not contain a suitable column,” but there are only a handful of references to FTS_DOC_ID found when searching the MySQL 5.6 manual, and the only page which appears to suggest how using an explicitly-defined column is done is this one, which discusses improving bulk insert performance. At the very bottom, the page claims that you can speed up bulk loading into an InnoDB FT index by declaring a column called FTS_DOC_ID at table creation time of type BIGINT UNSIGNED NOT NULL with a unique index on it, loading your data, and then creating the FT index after the data is loaded.

One obvious problem wih those instructions is that if you define a column and a unique key as they suggest, your data won’t load due to unique key constraint violations unless you also do something to provide some sort of sequence value for that column, whether as an auto_increment value or via some other means, but the bit that troubles me further is the introduction of a column-level case-sensitivity requirement that only seems to actually matter at table creation time. Once I’ve got a table with an explicit FTS_DOC_ID column, however, MySQL apparently has no problem with either of the following statements:

mysql> insert into dir_test_innodb3 (fts_doc_id, full_name) values (1, 'john smith');
mysql> select * from dir_test_innodb3 where fts_doc_id=1;

Philosophically, I find that kind of behavior unsettling. I don’t like case-sensitivity in my table or column names to begin with (I may be one of the few people that likes lower_case_table_names = 1 in /etc/my.cnf), but I think it’s even worse that the case-sensitivity only matters some of the time. That strikes me as a good recipe for DBA frustration.

Now, let’s return to all of those FTS_*.ibd files. What, exactly, are they? In short, the _CONFIG.ibd file contains configuration info about the FT index (the same sort of configuration data that can be queried out of the I_S.INNODB_FT_CONFIG table, as I’ll discuss momentarily), and the others contain document IDs of new rows that are added to or removed from the table and which need to be merged back into or removed from the main index. I’m not entirely sure about the _STOPWORDS.ibd file just yet; I thought it might be somehow related to a custom stopwords table, but that doesn’t seem to be the case (or at least not in the way that I had thought), so I will need to look through the source to figure out what’s going on there.

In any case, for each new FULLTEXT KEY that you create, you’ll get a corresponding FTS_*_DOC_ID.ibd file (but none of the others), and if you drop a FT index, the its corresponding FTS_*_DOC_ID.ibd file will also be removed. HOWEVER, even if you drop all of the FT indexes for a given table, you’re still left with all of the other FTS_*.ibd files, and it appears that the only way to get rid of them is to actually rebuild the table.

-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_ADDED.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_BEING_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_BEING_DELETED.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 19:07 FTS_0000000000000025_CONFIG.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_DELETED_CACHE.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_DELETED.ibd
-rw-rw----. 1 mysql mysql 98304 Feb 20 18:57 FTS_0000000000000025_STOPWORDS.ibd

Also, while we’re on the subject of adding and dropping FT indexes, it’s entirely possible to DROP multiple FT indexes with InnoDB in the same ALTER TABLE statement, but it’s not possible to CREATE more than one at a time. If you try it, this is what happens:

mysql> alter table dir_test_innodb ADD FULLTEXT KEY (full_name, details), 
       ADD FULLTEXT KEY (details);
ERROR 1795 (HY000): InnoDB presently supports one FULLTEXT index creation at a time

That’s an odd limitation. Do it as two separate ALTER statements, and it appears to work fine.

But here’s where things start to get even weirder. According to the documentation, if we specify the name of a table that has a FT index for the global variable innodb_ft_aux_table, we should be able to get some statistics about the FT indexes on that table by querying the various I_S.INNODB_FT_* tables. In particular, the INNODB_FT_CONFIG table is supposed to “display metadata about the FULLTEXT index and associated processing for an InnoDB table.” The documentation also tells us that we can keep our FT indexes up to date by setting innodb_optimize_fulltext_only = 1 and then running OPTIMIZE TABLE, and that we might have to run OPTIMIZE TABLE multiple times if we’ve had a lot of changes to the table.

This all sounds pretty good, in theory, but at least some part of it doesn’t seem to work. First, let’s check the stats immediately after setting these variables, and then let’s push some additional data into the table, run an OPTIMIZE or two, delete some data, and see what happens:

mysql> set global innodb_ft_aux_table='test/dir_test_innodb';
mysql> set global innodb_optimize_fulltext_only=1;
mysql> select * from information_schema.innodb_ft_config;
+---------------------------+--------+                                                  
| KEY                       | VALUE  |                                                  
+---------------------------+--------+                                                  
| optimize_checkpoint_limit | 180    |                                                  
| synced_doc_id             | 787625 |                                                  
| last_optimized_word       |        |                                                  
| deleted_doc_count         | 0      |                                                  
| total_word_count          |        |                                                  
| optimize_start_time       |        |                                                  
| optimize_end_time         |        |                                                  
| stopword_table_name       |        |                                                  
| use_stopword              | 1      |
| table_state               | 0      |
+---------------------------+--------+
10 rows in set (0.00 sec)

mysql> insert into dir_test_innodb (full_name, details) 
       SELECT reverse(full_name), details FROM dir_test_innodb WHERE id < 500000; 
Query OK, 245051 rows affected (8 min 13.13 sec) 
Records: 245051  Duplicates: 0  Warnings: 0 

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+---------+
| KEY                       | VALUE   |
+---------------------------+---------+
| optimize_checkpoint_limit | 180     |
| synced_doc_id             | 1028261 |
| last_optimized_word       |         |
| deleted_doc_count         | 0       |
| total_word_count          |         |
| optimize_start_time       |         |
| optimize_end_time         |         |
| stopword_table_name       |         |
| use_stopword              | 1       |
| table_state               | 0       |
+---------------------------+---------+
10 rows in set (0.02 sec)

mysql> optimize table dir_test_innodb;
+----------------------+----------+----------+----------+
| Table                | Op       | Msg_type | Msg_text |
+----------------------+----------+----------+----------+
| test.dir_test_innodb | optimize | status   | OK       |
+----------------------+----------+----------+----------+
1 row in set (4.60 sec)

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+---------+
| KEY                       | VALUE   |
+---------------------------+---------+
| optimize_checkpoint_limit | 180     |
| synced_doc_id             | 1032677 |
| last_optimized_word       |         |
| deleted_doc_count         | 0       |
| total_word_count          |         |
| optimize_start_time       |         |
| optimize_end_time         |         |
| stopword_table_name       |         |
| use_stopword              | 1       |
| table_state               | 0       |
+---------------------------+---------+
10 rows in set (0.00 sec)

mysql> delete from dir_test_innodb LIMIT 200000;
Query OK, 200000 rows affected (8.65 sec)

mysql> optimize table dir_test_innodb;           
+----------------------+----------+----------+----------+
| Table                | Op       | Msg_type | Msg_text |
+----------------------+----------+----------+----------+
| test.dir_test_innodb | optimize | status   | OK       |
+----------------------+----------+----------+----------+
1 row in set (8.44 sec)

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+---------+
| KEY                       | VALUE   |
+---------------------------+---------+
| optimize_checkpoint_limit | 180     |
| synced_doc_id             | 1032677 |
| last_optimized_word       |         |
| deleted_doc_count         | 0       |
| total_word_count          |         |
| optimize_start_time       |         |
| optimize_end_time         |         |
| stopword_table_name       |         |
| use_stopword              | 1       |
| table_state               | 0       |
+---------------------------+---------+

I ran OPTIMIZE TABLE several more times, and each execution took between 6 and 8 seconds, but the output of the query against I_S.innodb_ft_config never changed, so it seems like at least some of the diagnostic output isn’t working quite right. Intuitively, I would expect some changes in total_word_count, or optimize_(start|end)_time, and the like. However, if I check some of the other I_S tables, I do find that the number of rows in I_S.innodb_ft_index_table is changing, so it’s pretty clear that I do actually have a FT index available.

At the start of this post, I mentioned that I did make one configuration change to the default InnoDB settings for MySQL 5.6, and that was to change innodb_ft_min_token_size from the default of 3 to a value of 4 so that it would be identical to the MyISAM default. After all, the (naive?) hope here is that when I run an FTS query against both MyISAM and InnoDB I will get back the same results; if this equivalence doesn’t hold, then as a consultant, I might have a hard time recommending this feature to my customers, and as an end user, I might have a hard time using the feature at all, because it could completely alter the behavior of my application unless I also make a nontrivial number of code changes.

In part 2 of this series, we’ll reload and reset our SEO and DIR data sets back to their initial states, run some queries, and compare the output. Stay tuned, it gets rather curious.

The post InnoDB Full-text Search in MySQL 5.6 (part 1) appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com