Nov
04
2022
--

PostgreSQL: Are All NULLs the Same?


I was recently reading the PostgreSQL 15 release notes and stopped at “Allow unique constraints and indexes to treat NULL values as not distinct”. At first glance, it seems to be a minor change, but this feature may change many designs and make many lives a lot easier.

The first thing here is to understand what NULL is and how it works. Many authors describe NULL as the way missing or unknown values are represented, and some dedicate a good amount of time to explaining and describing it, like Date and Darwen’s “A Guide to the SQL Standard“. Checking Wikipedia, we see that “In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database”.

Okay, NULL indicates that there is an unknown value in a column, but how does it work? Can we operate with NULLs? We can, but we will probably get an unexpected result. For example, arithmetic operations involving NULL return… NULL! If a SQL operation adds 10 to NULL, the result is NULL. If we do “NULL - NULL”, the result is NULL, not zero! Likewise, a boolean comparison involving NULL returns neither TRUE nor FALSE but NULL (unknown). What about comparisons, are some NULLs larger than others? Are some NULLs equal to others? See how tricky this NULL thing can get?
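A quick sketch you can run in psql illustrates this (by default a NULL result simply displays as empty):

SELECT 10 + NULL                      AS add_result,    -- NULL
       NULL::numeric - NULL::numeric  AS sub_result,    -- NULL, not zero
       NULL = NULL                    AS equals_result, -- NULL (unknown), neither TRUE nor FALSE
       NULL IS NOT DISTINCT FROM NULL AS not_distinct;  -- TRUE: the comparison the new feature builds on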

This is why that change caught my attention. When we compare NULLs, they are in essence all different, because we don’t know what they are: if I compare one unknown value with another unknown value, even though the result is unknown, the database treats them as different, making NULLs unique by nature. This is how most databases operate (if not all of them). This brings a problem to many applications that need to operate on NULLs, especially the ones that try to enforce UNIQUE constraints over nullable columns: because all NULLs are different, there will never be a collision and the UNIQUE constraint is effectively bogus. Many application designers have used the ZERO or EMPTY workaround, putting a zero or an empty string ‘’ to represent the absence of a value. Remember, empty isn’t NULL!

This feature changes that behavior. It allows NULLs to be treated as equal for the purpose of the uniqueness check, using “UNIQUE NULLS NOT DISTINCT”. No more workarounds are needed!
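A minimal sketch of the new syntax (the table and column names here are hypothetical):

CREATE TABLE contacts (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email text,
    UNIQUE NULLS NOT DISTINCT (email)
);

INSERT INTO contacts (email) VALUES (NULL);  -- the first NULL is accepted
INSERT INTO contacts (email) VALUES (NULL);  -- now rejected by the unique constraint

But be careful: it is still an unknown value, and all other operations keep behaving exactly as before. For a much better explanation of how NULLs work, check the book mentioned above. You may see databases from another perspective after reading it!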

Jul
07
2022
--

MongoDB Index Building on ReplicaSet and Shard Cluster


We all know how important it is to have proper indexes in a database so that it can do its job effectively. We rely on a kind of indexing in daily life for routine tasks too: without it, everything would still get done, but it would take a lot longer.

The basic working of an index

Imagine that we have tons of information and we want to look up something very particular, but we don’t know where it is. We are going to spend a lot of time finding that particular piece of data.

If we had some kind of information about where each piece of data lives, the job would finish very quickly, because we would know where to look without searching each and every record for one particular piece of data.

Indexes are special data structures that store a small amount of information about the records, making it easy to traverse to the data we want. Indexes can be created in ascending or descending order to support efficient equality matches and range-based query operations.

Index building strategy and consideration

When we think of building an index, many aspects have to be considered: the key data set that is queried most frequently, cardinality, the write ratio on that collection, free memory, and storage.

If there are no indexes on the collection, MongoDB will do a full collection scan, potentially over millions of records, every time any type of query is performed. This will not only slow down that operation but will also increase the wait time for other operations.

With the createIndexes command we can also create multiple indexes at the same time on the same collection, saving the repeated collection scans that separate builds would require.
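For example, the following single call builds two indexes in one operation (the second key pattern is just illustrative):

db.inventory.createIndexes([
    { productId: 1 },             // single-field index, same as the example further down
    { status: 1, orderDate: -1 }  // a hypothetical compound index built in the same pass
])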

Limitations

It is very important to have enough memory to accommodate the working set; it is not necessary for every index to fit in RAM.

The index key must be less than 1024 bytes up to v4.0. Starting with v4.2 (with fcv 4.2), this limit is removed.

The same goes for the index name: it can be up to 127 bytes in databases with fcv 4.0 and below. This limit is removed with db v4.2 and fcv 4.2.

A single collection can have no more than 64 indexes.

Index types in MongoDB

Before seeing various index types, let’s see what the index name looks like.

The default name for an index is the concatenation of the indexed keys and each key’s direction in the index ( i.e. 1 or -1) using underscores as a separator. For example, an index created on { mobile : 1, points: -1 } has the name mobile_1_points_-1.

We can also create a custom, more human-readable name:

db.products.createIndex({ mobile: 1, points: -1 }, { name: "query for rewards points" })

Index type

MongoDB provides various types of indexes to support various data and queries.

Single field index: In a single-field index, an index is created on a single field in a document. MongoDB can traverse it in both directions, so the sort order chosen while creating the index does not matter.

Syntax:

db.collection.createIndex({"<fieldName>" : <1 or -1>})

Here 1 represents the field specified in ascending order and -1 for descending order.

Example:

db.inventory.createIndex({productId:1});

 

Compound index: In a compound index, we can create indexes on multiple fields. The order of fields listed in a compound index has significance. For instance, if a compound index consists of { userid: 1, score: -1 }, the index sorts first by userid and then, within each userid value, sorts by score.

Syntax:

db.collection.createIndex({ <field1>: <1 or -1>, <field2>: <1 or -1>, … })

Example:

db.students.createIndex({ userid: 1, score: -1 })

 

Multikey index: MongoDB uses multikey indexes to index the content stored in arrays. When we create an index on a field that contains an array value, MongoDB automatically creates a separate index key for every element of the array. We do not need to specify the multikey type explicitly; MongoDB automatically decides whether to create a multikey index if the indexed field contains an array value.

Syntax:

db.collection.createIndex({ <field1>: <1 or -1> })

Example:

db.students.createIndex({ "addr.zip":1})

 

Geospatial index: MongoDB provides two special indexes: 2d indexes that use planar geometry when returning results and 2dsphere indexes that use spherical geometry to return results.

Syntax:

db.collection.createIndex({ <location field> : "2dsphere" })

*where the <location field> is a field whose value is either a GeoJSON object or a legacy coordinate pair.

Example:

db.places.createIndex({ loc : "2dsphere" })

 

Text index: With the text index type, MongoDB supports searching for string content in a collection. A collection can only have one text search index, but that index can cover multiple fields.

Syntax:

db.collection.createIndex({ <field1>: "text" })

Example:

db.reviews.createIndex({ comments: "text" })

 

Hash index: MongoDB stores the hash of the indexed field’s value in a hashed index. This type of index is mainly used where we want an even data distribution, e.g., with a hashed shard key in a sharded cluster environment.

Syntax:

db.collection.createIndex({ _id: "hashed"  })

Starting with version 4.4, compound hashed indexes are supported.
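For example (field names here are hypothetical), a compound hashed index combines one hashed field with regular range fields:

db.orders.createIndex({ customerId: "hashed", orderDate: 1 })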

Properties

Unique indexes: When specified, MongoDB will reject duplicate values for the indexed field. It will not allow inserting another document containing the same key-value pair which is indexed.

> db.cust_details.createIndex({Cust_id:1},{unique:true})

{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}

> db.cust_details.insert({"Cust_id":"39772","Batch":"342"})
WriteResult({ "nInserted" : 1 })

> db.cust_details.insert({"Cust_id":"39772","Batch":"452"})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "E11000 duplicate key error collection: student.cust_details index: Cust_id_1 dup key: { Cust_id: \"39772\" }"
    }
})

 

Partial indexes: Partial indexes only index the documents that match the filter criteria.

db.restaurants.createIndex({ cuisine: 1, name: 1 },{ partialFilterExpression: { rating: { $gt: 5 } } })

{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}

 

TTL indexes: TTL indexes are special single-field indexes that can be used to automatically delete documents from a collection after a certain amount of time.

db.eventlog.createIndex({ "lastModifiedDate": 1 }, { expireAfterSeconds: 3600 })

lastModifiedDate_1

 

Sparse indexes: Sparse indexes only contain entries for documents that have the indexed field, even if the index field contains a null value.

db.addresses.createIndex({ "email": 1 }, { sparse: true })

email_1

 

Hidden indexes: Hidden indexes are not visible to the query planner and cannot be used to support a query. Apart from being hidden from the planner, hidden indexes behave like unhidden indexes.

To create a new hidden index:

db.addresses.createIndex({ pincode: 1 },{ hidden: true });

To change an existing index into a hidden one (works only with db having fcv 4.4 or greater):

db.addresses.hideIndex({ pincode: 1 }); // Specify the index key specification document
or
db.addresses.hideIndex( "pincode_1" );  // Specify the index name

To unhide any hidden index:

The index name or key can be used to unhide the index.

db.addresses.unhideIndex({ pincode: 1 }); // Specify the index key specification document
or
db.addresses.unhideIndex( "pincode_1" );  // Specify the index name

Rolling index builds on replica sets

Starting with MongoDB 4.4, index builds happen simultaneously on all data-bearing nodes. For workloads that cannot tolerate the performance impact of an index build, we can follow a rolling index build strategy.

**NOTE**

Unique indexes

To create unique indexes using the following procedure, you must stop all writes to the collection during this procedure.

If you cannot stop all writes to the collection during this procedure, do not use the procedure on this page. Instead, build your unique index on the collection by issuing db.collection.createIndex() on the primary for a replica set.

Oplog size

Ensure that your oplog is large enough to permit the indexing or re-indexing operation to complete without falling too far behind to catch up.
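Before starting, the configured oplog size and the current replication window can be checked from the shell:

rs.printReplicationInfo()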

Procedure

1. Stop one secondary and restart as a standalone on a different port number.

In this process, we stop one secondary node at a time, comment out the replication settings in its configuration file, and set disableLogicalSessionCacheRefresh to true under the setParameter section.

Example

net:
   bindIp: localhost,<hostname(s)|ip address(es)>
   port: 27217
#   port: 27017
#replication:
#   replSetName: myRepl
setParameter:
   disableLogicalSessionCacheRefresh: true

We only need to make changes in the above settings, the rest will remain the same.

Once the above changes are done, save it and restart the process.

mongod --config <path/To/ConfigFile>

OR

sudo systemctl start mongod

Now, the mongod process will start on port 27217 in standalone mode.

2. Build the index

Connect to the mongod instance on port 27217. Switch to the desired database and collection to create an index.

Example:

mongo --port 27217 -u 'username' --authenticationDatabase admin

> use student

switched to db student

> db.studentData.createIndex( { StudentID: 1 } );

{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}

 

3. Restart the process mongod as a replica set member

After the desired index build completes, we can add the node back to the replica set.

Undo the configuration file change made in step one above. Restart the mongod process with the original configuration file.

net:
   bindIp: localhost,<hostname(s)|ip address(es)>
   port: 27017
replication:
   replSetName: myRepl

After saving the configuration file, restart the process and let it become secondary.

mongod --config <path/To/ConfigFile>

OR

sudo systemctl start mongod

4. Repeat the above procedure for the remaining secondaries

Once the node becomes secondary again and there is no lag, repeat the procedure, one node at a time.

  1. Stop one secondary and restart as a standalone.
  2. Build the index.
  3. Restart the mongod process as a replica set member.

5. Index build on primary

Once index build activity finishes up in all the secondary nodes, use the same process as above to create an index on the last remaining node.

  1. Connect to the primary node and issue rs.stepDown(); Once it successfully steps down, it becomes secondary and a new primary is elected. Follow steps from one through three to build the index.
  2. Stop secondary node and restart as a standalone.
  3. Build the index.
  4. Restart the mongod process as a replica set member.

Rolling index builds on sharded clusters

Starting with MongoDB 4.4, index builds happen simultaneously on all data-bearing nodes. For workloads that cannot tolerate the performance impact of an index build, we can follow a rolling index build strategy.

**NOTE**

Unique indexes

To create unique indexes using the following procedure, you must stop all writes to the collection during this procedure.

If you cannot stop all writes to the collection during this procedure, do not use the procedure on this page. Instead, build your unique index on the collection by issuing db.collection.createIndex() on the primary for a replica set.

Oplog size

Ensure that your oplog is large enough to permit the indexing or re-indexing operation to complete without falling too far behind to catch up.

Procedure

1. Stop the balancer

In order to create an index in a rolling fashion in a shard cluster, it is necessary to stop the balancer so that we do not end up with an inconsistent index.

Connect to a mongos instance and run sh.stopBalancer() to disable the balancer.

If there is any active migration going on, the balancer will stop only after the completion of the ongoing migration.

We can check whether the balancer is stopped with the command below:

sh.getBalancerState()

If the balancer is stopped, the output will be false.

2. Determine the distribution of the collection

In order to build indexes in a rolling fashion, it is necessary to know which shards the collection resides on.

Connect to one of the mongos instances and refresh the cache so that we get fresh information about how the collection we want to index is distributed across the shards.

Example:

We want to create an index in the studentData collection in the student database.

We will run the below command to get a fresh distribution of that collection.

db.adminCommand( { flushRouterConfig: "student.studentData" } );

db.studentData.getShardDistribution();

We will get output showing the shards that contain the collection:

Shard shardA at shardA/s1-mongo1.net:27018,s1-mongo2.net:27018,s1-mongo3.net:27018
data : 1KiB docs : 50 chunks : 1
estimated data per chunk : 1KiB
estimated docs per chunk : 50
Shard shardC at shardC/s3-mongo1.net:27018,s3-mongo2.net:27018,s3-mongo3.net:27018
data : 1KiB docs : 50 chunks : 1
estimated data per chunk : 1KiB
estimated docs per chunk : 50
Totals
data : 3KiB docs : 100 chunks : 2
Shard shardA contains 50% data, 50% docs in cluster, avg obj size on shard : 40B
Shard shardC contains 50% data, 50% docs in cluster, avg obj size on shard : 40B

From the above output, we can see that student.studentData exists on shardA and shardC, so we need to build the index on both of those shards.

3. Build indexes on the shards that contain collection chunks

Follow the procedure below on each shard that contains chunks of the collection.

3.1. Stop one secondary and restart as a standalone

For the identified shard, stop one of the secondary nodes and make the following changes.

  • Change the port number to a different port
  • Comment out replication parameters
  • Comment out sharding parameters
  • Under section “setParameter” add skipShardingConfigurationChecks: true and disableLogicalSessionCacheRefresh: true 

Example

net:
   bindIp: localhost,<hostname(s)|ip address(es)>
   port: 27218
#   port: 27018
#replication:
#   replSetName: shardA
#sharding:
#   clusterRole: shardsvr
setParameter:
   skipShardingConfigurationChecks: true
   disableLogicalSessionCacheRefresh: true

After saving the configuration restart the process 

mongod --config <path/To/ConfigFile>

OR

sudo systemctl start mongod

 

3.2. Build the index

Connect to the mongod instance running on standalone mode and start the index build process.

Here, we are building an index on the students collection, on the field StudentID, in ascending order:

> db.students.createIndex( { StudentID: 1 } )

{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}

 

3.3. Restart the MongoDB process as replicaset node

Once the index build activity is finished, shut down the instance, remove the parameters skipShardingConfigurationChecks: true and disableLogicalSessionCacheRefresh: true, and restart it with the original configuration:

net:
   bindIp: localhost,<hostname(s)|ip address(es)>
   port: 27018
replication:
   replSetName: shardA
sharding:
   clusterRole: shardsvr

 

After saving the configuration restart the process 

mongod --config <path/To/ConfigFile>

OR

sudo systemctl start mongod

 

3.4. Repeat the procedure for the remaining secondaries for the shard

Once the node on which the index build completed has been added back to the replica set and is in sync with the other nodes, repeat the above process (3.1 to 3.3) on the remaining nodes.

3.1. Stop one secondary and restart as a standalone

3.2. Build the index

3.3. Restart the MongoDB process as replicaset node

3.5. Index build on primary

Once index build activity finishes up in all the secondary nodes, use the same process as above to create an index on the last remaining node.

  1. Connect to the primary node and issue rs.stepDown(). Once it successfully steps down, it becomes a secondary and a new primary is elected. Follow steps one through three to build the index.
  2. Stop the secondary node and restart it as a standalone
  3. Build the index
  4. Restart the process mongod as a replica set member

4. Repeat for the other affected shards

Once the index build is finished for one of the identified shards, start the process outlined in step three on the next identified shard.

5. Restart the balancer

Once we are done building the index on all identified shards we can start the balancer again.

Connect to a mongos instance in the sharded cluster, and run sh.startBalancer()

sh.startBalancer()

Conclusion

Picking the right keys based on the access pattern and having one good index is better than having multiple bad ones. So, choose your indexes wisely.

There are also other interesting blogs on https://www.percona.com/blog/ which might be helpful to you.

I also recommend trying Percona Server for MongoDB, which provides enterprise-grade MongoDB features without any license cost (it is free). You can learn more about it in the blog MongoDB: Why Pay for Enterprise When Open Source Has You Covered?

Percona also offers more great products for MongoDB, like Percona Backup for MongoDB and Percona Operator for MongoDB, as well as for other technologies and tools: MySQL Software, PostgreSQL Distribution, Percona Operators, and Monitoring & Management.

Sep
01
2021
--

Goodcall picks up $4M, Yelp partnership to answer merchant inbound calls

Even without staffing shortages, local merchants have difficulty answering calls while all hands are busy, and Goodcall wants to alleviate some of that burden from America’s 30 million small businesses.

Goodcall’s free cloud-based conversational platform leverages artificial intelligence to manage incoming phone calls and boost customer service for businesses of all sizes. Former Google executive Bob Summers left Google back in January, where he was working on Area 120 — an internal incubator program for experimental projects — to start Goodcall after recognizing the call problem, noting that in fact 60% of the calls that come into merchants go unanswered.

“It’s frustrating for you and for the person calling,” Summers told TechCrunch. “Every missed call is a lost opportunity.”

Goodcall announced its launch Wednesday with $4 million in seed funding led by strategic investors Neo, Foothill Ventures, Merus Capital, Xoogler Ventures, Verissimo Ventures and VSC Ventures, as well as angel investors including Harry Hurst, founder and co-CEO of Pipe.com, and Zillow co-founder Spencer Rascoff.

Goodcall mobile agent. Image Credits: Goodcall

Restaurants, shops and merchants can set up on Goodcall in a matter of minutes and even establish a local phone number to free up an owner’s mobile number from becoming the business’ main line. The service is initially deployed in English and the company has plans to operate in Spanish, French and Hindi by 2022.

Merchants can choose from six different assistant voices and monitor the call logs and what the calls were about. Goodcall can also capture consumer sentiment, Summers said.

The company offers three options, including its freemium service for solopreneurs and business owners, which includes up to 500 minutes per month of Goodcall services for a single phone line. The Pro level covers up to five additional locations and five staff members for $19 per month, while the Premium level provides unlimited locations and staff for $49 per month.

During the company’s beta period, Goodcall was processing several thousand calls per month. The new funding will be used to continue to offer the free service, hire engineers and continue product development.

In addition to the funding round, Goodcall is unveiling a partnership with Yelp to tap into its database of local businesses so that those owners and managers can easily deploy Goodcall. Yelp data shows that more than 500,000 businesses opened during the pandemic. The company pulls in from Yelp a merchant’s open hours, location, if they offer Wi-Fi and even their COVID policy.

“We are partnering with Yelp, which has the best data on small businesses, and other large distribution channels to get our product to market,” Summers said. “We are bringing technology into an industry that hasn’t innovated since the 1980s and democratizing conversational AI for small businesses that are the main driver of job creation, and we want to help them grow.”

 

Apr
05
2019
--

Peter Kraus dishes on the market

During my recent conversation with Peter Kraus, which was supposed to be focused on Aperture and its launch of the Aperture New World Opportunities Fund, I couldn’t help veering off into tangents about the market in general. Below is Kraus’ take on the availability of alpha generation, the Fed, inflation versus Amazon, housing, the cross-ownership of U.S. equities by a few huge funds and high-frequency trading.

Gregg Schoenberg: Will alpha be more available over the next five years than it has been over the last five?

To think that at some point equities won’t become more volatile and decline 20% to 30%… I think it’s crazy.

Peter Kraus: Do I think it’s more available in the next five years than it was in the last five years? No. Do I think people will pay more attention to it? Yes, because when markets are up to 30 percent, if you get another five, it doesn’t matter. When markets are down 30 percent and I save you five by being 25 percent down, you care.

GS: Is the Fed’s next move up or down?

PK: I think the Fed does zero, nothing. In terms of its next interest rate move, in my judgment, there’s a higher probability that it’s down versus up.

Jul
11
2017
--

Webinar Wednesday July 12, 2017: MongoDB Index Types – How, When and Where Should They Be Used?


Join Percona’s Senior Technical Services Engineer, Adamo Tonete, as he presents MongoDB Index Types: How, When and Where Should They Be Used? on Wednesday, July 12, 2017 at 11:00 am PDT / 2:00 pm EDT (UTC-7).

MongoDB has 12 index types. Do you know how each works, or when you should use each of them? This talk will arm you with this knowledge, and help you understand how indexes impact performance, storage and even the sharding of your data. We will also discuss some solid index operational practices, as well as some settings for things like TTL you might not know exist. The contents of this webinar will make you a Rock Star!

Register for the webinar here.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL database administrator for three years. As the main database admin of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24/7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, working mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three years he has moved to NoSQL technologies (without giving up relational databases).

Apr
11
2017
--

Correct Index Choices for Equality + LIKE Query Optimization


As part of our support services, we do a lot of query optimization. This is where most performance gains come from. Here’s an example of the work we do.

Some days ago a customer arrived with the following table:

CREATE TABLE `infamous_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `member_id` int(11) NOT NULL DEFAULT '0',
  `email` varchar(200) NOT NULL DEFAULT '',
  `msg_type` varchar(255) NOT NULL DEFAULT '',
  `t2send` int(11) NOT NULL DEFAULT '0',
  `flag` char(1) NOT NULL DEFAULT '',
  `sent` varchar(100) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `f` (`flag`),
  KEY `email` (`email`),
  KEY `msg_type` (`msg_type`(5)),
  KEY `t_msg` (`t2send`,`msg_type`(5))
) ENGINE=InnoDB DEFAULT CHARSET=latin1

And a query that looked like this:

SELECT COUNT(*)
  FROM `infamous_table`
 WHERE `t2send` > 1234
   AND `msg_type` LIKE 'prefix%';

The table had an index t_msg that wasn’t helping at all: the EXPLAIN for our 1,000,000-row test table looked like this:

id: 1
select_type: SIMPLE
table: infamous_table
type: range
possible_keys: t_msg
key: t_msg
key_len: 4
ref: NULL
rows: 107478
Extra: Using where

You can see the index used is the one that was expected: “t_msg”. But the key_len is 4. This indicates that only the INT part was used, and that the msg_type(5) part was ignored. This resulted in examining 100k+ rows. If you have MySQL 5.6, you can see it more clearly with EXPLAIN FORMAT=JSON under used_key_parts:

EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "infamous_table",
      "access_type": "range",
      "possible_keys": [
        "t_msg"
      ],
      "key": "t_msg",
      "used_key_parts": [
        "t2send"
      ],
      "key_length": "4",
      "rows": 107478,
      "filtered": 100,
      "index_condition": "(`test`.`infamous_table`.`t2send` > 1234)",
      "attached_condition": "(`test`.`infamous_table`.`msg_type` like 'prefix%')"
    }
  }
}

The customer had multi-valued strings like “PREFIX:INT:OTHER-STRING” stored in the column msg_type, and that made it impossible to convert it to an ENUM or similar field type that would allow changing the LIKE to an equality.

So the solution was rather simple: just like for point and range queries over numeric values, you must define the index with the ranged field as the rightmost part. This means the correct index would have looked like msg_type(5),t2send.
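A sketch of the corresponding ALTER (the exact statement isn’t shown here; the change was applied online, as noted below):

ALTER TABLE `infamous_table`
  ADD KEY `better_multicolumn_index` (`msg_type`(5),`t2send`);  -- ranged column last, 5-char prefix on msg_type

The EXPLAIN for the new index provided the customer with some happiness: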

id: 1
select_type: SIMPLE
table: infamous_table
type: range
possible_keys: t_msg,better_multicolumn_index
key: better_multicolumn_index
key_len: 11
ref: NULL
rows: 4716
Extra: Using where

You can see the key_len is now what we would have expected: four bytes for the INT and another seven bytes for the VARCHAR (five for our chosen prefix + two for prefix length). More importantly, you can notice the rows count decreased by approximately 22 times.

We used pt-online-schema-change on the customer’s environment to apply the ALTER and avoid downtime. This made it an easy and painless solution, and the query effectively executed in under 1/20 of the original time! So, all fine and dandy? Well, almost. We did a further test, and the query looked like this:

SELECT COUNT(*)
  FROM `infamous_table`
 WHERE `t2send` > 1234
   AND `msg_type` LIKE 'abc%';

So where’s the difference? The string used in the LIKE condition is shorter than the prefix length we chose for the VARCHAR part of the index (the customer intended to look up strings with only three chars, so we needed to check this). This query also scanned 100k rows, and EXPLAIN showed the key_len was 4, meaning the VARCHAR part was being ignored once again.

This means the index prefix needed to be shorter. We ALTERed the table and made the prefix four characters long, counting on the fact that the multi-valued strings use “:” to separate the values, so we suggested the customer include the colon in the look-up string for the shortest strings. In this case, 'abc%' becomes 'abc:%' (which is also four characters long).

As a final optimization, we suggested dropping old indexes that were covered by the new better_multicolumn_index, and that were most likely created by the customer while testing optimization.

Conclusion

Just like in point-and-range queries, the right order for multi-column indexes is putting the ranged part last. Equally important is that the length of the string prefix needs to match the length of the shortest string you intend to look-up. Just remember, you can’t make this prefix too short or you’ll lose specificity and the query will end up scanning rows unnecessarily.

Mar
02
2017
--

Ozlo releases a suite of APIs to power your next conversational AI

Building on its promise to give the entrenched a run for their money, conversational AI startup Ozlo is making its meticulously crafted knowledge layer available for purchase today. Ozlo’s new suite of APIs, which includes tools for both expressing knowledge and understanding language, will help to democratize the creation of conversational AI assistants. In the spirit of the expert systems…

Oct
27
2016
--

Thoughts on MySQL 8.0 Invisible Indexes


MySQL 8.0 has a new feature called “invisible indexes,” which allows you to quickly enable/disable indexes from being used by the MySQL optimizer.

I wanted to share some of my first experiences and thoughts about this new feature.

Why is it good for us?

There are a couple of use cases. One of them is if you want to drop an index, but want to know the effect beforehand. You can make it invisible to the optimizer. It is a quick metadata change to make an index invisible. Once you are sure there is no performance degradation, you can then drop the index.

The main point is that the invisible index is unavailable for use by the optimizer, but it is still present and kept up-to-date by write operations. The optimizer won’t use it, even if we try to “FORCE INDEX”. I think we should be able to force it somehow, though. There might be scenarios where this could be useful:

  • We can create a new invisible index, but if we want to test it we have to make it visible. That means all the queries are going to be able to use it, which could have an immediate effect on the application. I don’t think this is the best approach if we just want to test it. You should always test on staging, but not everybody has the same data size or real life data on their staging servers. Forcing invisible indexes could be useful.
  • You have many indexes, but are not sure which one is not in use. You can change one index to invisible to see if there is any performance degradation. If yes, you can change it back immediately.
  • You could have a special case where only one query can use that index. In that case, an invisible index could be a great solution.

How can you create invisible indexes?

There are two options. We can create a table with an invisible index like this:

CREATE TABLE `t1` (
`i` int(11) DEFAULT NULL,
`j` int(11) DEFAULT NULL,
`k` int(11) DEFAULT NULL,
KEY `i_idx` (`i`),
KEY `idx_1` (`i`,`j`,`k`) INVISIBLE
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Or we can use alter table and change the index to be invisible:

ALTER TABLE t1 ALTER INDEX idx_1 INVISIBLE;
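Once we have verified that nothing depends on it, the index can be made visible again or dropped (plain MySQL 8.0 syntax, reusing the same table):

ALTER TABLE t1 ALTER INDEX idx_1 VISIBLE;  -- put it back in play
ALTER TABLE t1 DROP INDEX idx_1;           -- or remove it for good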

Using invisible indexes

If we want to remove an index now, we can change it to invisible. But what about queries that use “FORCE/USE INDEX”? Are they going to throw an error? If you force an index that does not exist, you get an error. You don’t get an error with invisible indexes. The optimizer doesn’t use it, but it knows that it exists.

mysql> show create table t1\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`i` int(11) DEFAULT NULL,
`j` int(11) DEFAULT NULL,
`k` int(11) DEFAULT NULL,
KEY `i_idx` (`i`),
KEY `idx_1` (`i`,`j`,`k`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql> explain select * from t1 force index(idx_1) where i=1 and j=4;
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref         | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | ref  | idx_1         | idx_1 | 10      | const,const |    2 |   100.00 | Using index |
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> alter table t1 alter index idx_1 invisible;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain select * from t1 force index(idx_1) where i=1 and j=4;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   16 |     6.25 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
mysql> explain select * from t1 where i=1 and j=4;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | ref  | i_idx         | i_idx | 5       | const |    2 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

As you can see, if we use “FORCE INDEX” with an invisible index MySQL performs a full table scan because (from mysql.com):

The FORCE INDEX hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the named indexes to find rows in the table.

MySQL won’t throw any errors because the index exists, but it is not visible. Even if there is another usable index, it is going to perform a full table scan. On a large table, that could cause serious performance issues. Even if MySQL doesn’t throw any errors during the query execution, it should log a warning in the error log.

Conclusion

I think invisible indexes are a great new feature that could be useful for many customers. We should be able to use an invisible index if necessary, and be able to log queries that try to use invisible indexes.

You can read more about invisible indexes in this blog post, and in the MySQL Documentation.

Mar
10
2015
--

Advanced JSON for MySQL

What is JSON

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing document-store data where each document can have different attributes/schema. Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to inter-operate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with. Up until recently, JSON was just a TEXT document. I said up until recently, so what has changed? The biggest thing is that there are new JSON UDF by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDF are up to version 0.0.4. While these new UDF are a welcome addition to the MySQL database, they don’t solve the really tough JSON problems we face.

Searching

The JSON UDF provide a number of functions that make working with JSON easier, including the ability to extract portions of a document, or search a document for a particular key. That being said, you can’t use JSON_EXTRACT() or JSON_SEARCH in the WHERE clause, because it will initiate a dreaded full-table-scan (what MongoDB would call a full collection scan). This is a big problem and common wisdom is that JSON can’t be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

Actually, however, I’ve come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is easily indexed. Now, when you think index you think B-TREE or HASH indexes (or bitmap indexes) but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index where words (tokens) point to documents. While text indexes are great, they aren’t normally usable for JSON. The reason is, MySQL splits words on whitespace and non-alphanumeric characters. A JSON document doesn’t end up being usable when the name of the field (the key) can’t be associated with the value. But what if we transform the JSON? You can “flatten” the JSON down into key/value pairs and use a text index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ Rapid JSON library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:

{
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
	"batters":
		{
			"batter":
				[
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
				]
		},
	"topping":
		[
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }
		]
}

Flattened:

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))\G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=donut
name=Cake
ppu=0.55
id=1001
type=Regular
id=1002
type=Chocolate
id=1003
type=Blueberry
id=1004
type=Devil's Food
type=Devil's
type=Food
id=5001
type=None
id=5002
type=Glazed
id=5005
type=Sugar
id=5007
type=Powdered Sugar
type=Powdered
type=Sugar
id=5006
type=Chocolate with Sprinkles
type=Chocolate
type=with
type=Sprinkles
id=5003
type=Chocolate
id=5004
type=Maple
1 row in set (0.00 sec)

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another field (or another table), and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equal sign. We don’t want this as it removes the locality of the keyword and the value. To fix this problem you’ll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added equal (=) to the list of lower case characters as described in the manual. You just have to change two text files, no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.

By the way, I used a UDF, because MySQL FULLTEXT indexes don’t support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:

create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:

alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. Note that you can create a trigger to keep the second column in sync (a quick sketch follows), or you can use Flexviews to maintain a copy in a second table automatically.
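A minimal sketch of such triggers (the trigger names are mine; both simply recompute the flattened column from the document):

-- keep the flattened copy in sync on insert
CREATE TRIGGER json2_flat_bi BEFORE INSERT ON json2
FOR EACH ROW SET NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);

-- and on update of the document
CREATE TRIGGER json2_flat_bu BEFORE UPDATE ON json2
FOR EACH ROW SET NEW.flat = RAPID_FLATTEN_JSON(NEW.doc);

Either way, the rows that already exist need a one-time backfill: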

mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0

Using the index:

mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

The documents I searched for in that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to only index a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and only flatten that.

Aggregating

JSON documents may contain sub-documents as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but it is still a text document. There is no function that can extract ALL of a particular key (like invoice_price) and aggregate the results. So, if you have a document called orders which contains a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDF to aggregate a “total sales” figure from all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, ‘key’). This UDF will extract all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item. This UDF returns each item separated by newline. I created a few stored routines called jsum, jmin, jmax, jcount, and javg. They can process the output of rapid_extract_all and aggregate it. If you want to only RAPID_EXTRACT_ALL from a portion of a document, extract that portion with the MySQL UDF JSON_EXTRACT first, then process that with RAPID_EXTRACT_ALL.

For example:

mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1\G
*************************** 1. row ***************************
ids: 888
889
2312
5869
8702
jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:

mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
+-----------------------------------------+
| sum( jsum(json_extract_all(doc,'id')) ) |
+-----------------------------------------+
|                         296615411.00000 |
+-----------------------------------------+
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.

Where to get the tools:
You can find the UDF in the swanhart-tools github repo. I think you will find these tools very useful in working with JSON documents in MySQL.

(This post was originally posted on my personal blog, swanhart.livejournal.com, but is reposted here for wider distribution.)


Feb
17
2015
--

‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

I recently got back from FOSDEM, in Brussels, Belgium. While I was there I got to see a great talk by Sveta Smirnova, about her MySQL 5.7 Labs release JSON UDF functions. It is important to note that while the UDF come in a 5.7 release it is absolutely possible to compile and use the UDF with earlier versions of MySQL because the UDF interface has not changed for a long time. However, the UDF should still be considered alpha/preview level of quality and should not be used in production yet! For this example I am using Percona Server 5.6 with the UDF.

That being said, the proof-of-concept that I’m about to present here uses only one JSON function (JSON_EXTRACT) and it has worked well enough in my testing to present my idea here. The JSON functions will probably be GA sometime soon anyway, and this is a useful test of the JSON_EXTRACT function.

The UDF let you parse, search and manipulate JSON data inside of MySQL, bringing MySQL closer to the capabilities of a document store.

Since I am using Percona Server 5.6, I needed to compile and install the UDF. Here are the steps I took to compile the plugin:

  1. $ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
  2. $ cmake -DMYSQL_DIR=/usr/local/mysql .
  3. $ sudo make install
  4. $ sudo cp *.so /usr/local/mysql/lib/plugin

JSON UDF are great, but what’s the problem?

The JSON functions work very well for manipulating individual JSON objects, but like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This means the functions are virtually useless for searching through large volumes of JSON data.  If you want to use MySQL as a document store, this is going to limit the usefulness in the extreme as the ability to extract key/value pairs from JSON documents is powerful, but without indexing it can’t scale well.

What can be done to index JSON in MySQL for efficient access?

The JSON UDF provides a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.

  1. Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
  2. Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is a slight delay before the index is populated. This is similar to eventual consistency in a document store.

Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.

What is Flexviews?

Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views are able to be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed fast, it is possible to refresh it frequently and have a low latency index, but not one perfectly in sync with the base table at all times.

The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.

See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql

Demo/POC using materialized view as an index

The first step to creating an incrementally refreshable materialized view with Flexviews is to create a materialized view change log on all of the tables used in the view. The CREATE_MVLOG($schema, $table) function creates the log, and FlexCDC will immediately begin to collect changes into it.

mysql> call flexviews.create_mvlog('ssb_json','json');
Query OK, 1 row affected (0.01 sec)

Next, the materialized view name, and refresh type must be registered with the CREATE($schema, $mvname, $refreshtype) function:

mysql> call flexviews.create('ssb_json','json_idx','INCREMENTAL');
Query OK, 0 rows affected (0.00 sec)
-- save the just generated identifier for the view.  You can use GET_ID($schema,$mvname) later.
mysql> set @mvid := last_insert_id();
Query OK, 0 rows affected (0.00 sec)

Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias,$joinclause) function. This example will use only one table, but Flexviews supports joins too.

mysql> call flexviews.add_table(@mvid, 'ssb_json','json','json',null);
Query OK, 1 row affected (0.00 sec)

Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The function ADD_EXPR($mvid, $expression_type, $expression, $alias) is used to add expressions. Note that JSON_EXTRACT returns a TEXT column, so I’ve CAST the function to integer so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', "cast(json_extract(doc,'D_DateKey') as date)", 'D_DateKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'C_CustomerKey') as unsigned)", 'C_CustomerKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'S_SuppKey') as unsigned)", 'S_SuppKey');
Query OK, 1 row affected (0.01 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'P_PartKey') as unsigned)", 'P_PartKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_OrderKey') as unsigned)", 'LO_OrderKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_LineNumber') as unsigned)", 'LO_LineNumber');
Query OK, 1 row affected (0.00 sec)

I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id');
Query OK, 1 row affected (0.00 sec)

Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.

mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_LineNumber", 'LO_LineNumber_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_OrderKey", 'LO_OrderKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"P_PartKey", 'P_PartKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"S_SuppKey", 'S_SuppKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"D_DateKey", 'D_DateKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"C_CustomerKey", 'C_CustomerKey_Idx');
Query OK, 1 row affected (0.00 sec)

Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDF so they are not as fast as built in functions, and I indexed a lot of things (six different indexes are being populated at once) so it takes some time to build the index:

mysql> call flexviews.enable(@mvid);
Query OK, 2 rows affected (35 min 53.17 sec)

After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.

mysql> show tables;
+--------------------+
| Tables_in_ssb_json |
+--------------------+
| json               |
| json_idx           |
| json_idx_delta     |
+--------------------+
3 rows in set (0.00 sec)

Here is the table definition of json_idx, our materialized view. You can see it is indexed:

CREATE TABLE `json_idx` (
  `mview$pk` bigint(20) NOT NULL AUTO_INCREMENT,
  `D_DateKey` date DEFAULT NULL,
  `C_CustomerKey` bigint(21) unsigned DEFAULT NULL,
  `S_SuppKey` bigint(21) unsigned DEFAULT NULL,
  `P_PartKey` bigint(21) unsigned DEFAULT NULL,
  `LO_OrderKey` bigint(21) unsigned DEFAULT NULL,
  `LO_LineNumber` bigint(21) unsigned DEFAULT NULL,
  `id` bigint(20) NOT NULL DEFAULT '0',
  `mview$hash` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`mview$pk`),
  KEY `LO_LineNumber_Idx` (`LO_LineNumber`),
  KEY `LO_OrderKey_Idx` (`LO_OrderKey`),
  KEY `P_PartKey_Idx` (`P_PartKey`),
  KEY `S_SuppKey_Idx` (`S_SuppKey`),
  KEY `D_DateKey_Idx` (`D_DateKey`),
  KEY `C_CustomerKey_Idx` (`C_CustomerKey`),
  KEY `mview$hash_key` (`mview$hash`)
) ENGINE=InnoDB AUTO_INCREMENT=6029221 DEFAULT CHARSET=latin1;

Here are some sample contents. You can see the integer values extracted out of the JSON:

mysql> select * from json_idx limit 10;
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
|        1 | 1996-08-08 |          6748 |         1 |    178778 |       35620 |             2 |  1 | 3151656687 |
|        2 | 1994-05-20 |          5272 |         1 |     52362 |      102790 |             4 |  2 | 2181615425 |
|        3 | 1995-05-04 |         22870 |         1 |    136407 |      146757 |             3 |  3 |  544130577 |
|        4 | 1996-06-16 |         12217 |         1 |    129103 |      151200 |             1 |  4 | 2969697839 |
|        5 | 1992-07-20 |         21757 |         1 |     35243 |      151745 |             1 |  5 | 1438921571 |
|        6 | 1997-08-16 |         18760 |         1 |    150766 |      159232 |             6 |  6 | 3941775529 |
|        7 | 1994-03-04 |           757 |         1 |     15750 |      188902 |             3 |  7 | 2142628088 |
|        8 | 1993-11-04 |         17830 |         1 |    192023 |      253828 |             5 |  8 | 3480147565 |
|        9 | 1993-07-12 |         16933 |         1 |     59997 |      269062 |             5 |  9 | 3572286145 |
|       10 | 1998-06-16 |         26660 |         1 |     30023 |      298272 |             3 | 10 | 1971966244 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
10 rows in set (0.00 sec)

Now, there needs to be an easy way to use this index in a select statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense to access the data. We’ll call this the index view:

mysql> create view json_idx_v as select * from json natural join json_idx;
Query OK, 0 rows affected (0.00 sec)

And just for completeness, here is the contents of a row from our new index view:

mysql> select * from json_idx_v limit 1\G
*************************** 1. row ***************************
           id: 1
          doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU     5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU     0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
     mview$pk: 1
    D_DateKey: 1996-08-08
C_CustomerKey: 6748
    S_SuppKey: 1
    P_PartKey: 178778
  LO_OrderKey: 35620
LO_LineNumber: 2
   mview$hash: 3151656687
1 row in set (0.00 sec)

Using the UDF to find a document

The UDF forces a full table scan, and each of the two json_extract() calls in the WHERE clause has to parse every document, so all six million documents get parsed TWICE. Unsurprisingly, this is slow:

mysql> select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2\G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (54.49 sec)
mysql> explain select * from json
                where json_extract(doc,'LO_OrderKey') = 35620
                  and json_extract(doc,'LO_LineNumber') = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5233236
        Extra: Using where
1 row in set (0.00 sec)

Using the index view to find a document

mysql> select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2\G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (0.00 sec)
mysql> explain select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json_idx
         type: ref
possible_keys: LO_LineNumber_Idx,LO_OrderKey_Idx
          key: LO_OrderKey_Idx
      key_len: 9
          ref: const
         rows: 4
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: ssb_json.json_idx.id
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)
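The plan above hits LO_OrderKey_Idx on the materialized view and then looks up each matching document by the base table's PRIMARY key. Since json_idx is just an ordinary table, you can inspect the secondary indexes that were created on it with a plain SHOW INDEXES if you are curious (shown here only as a pointer, output omitted):

mysql> show indexes from json_idx;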

Keeping the index in sync

Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a flexviews.refresh($mvid, $mode, $transaction_id) stored procedure for exactly this.
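The $mvid argument is the numeric id Flexviews assigns to each materialized view; it can be looked up by schema and view name with flexviews.get_id(), exactly as the calls later in this post do. A quick sketch (the @mvid user variable is just for illustration):

mysql> select flexviews.get_id('ssb_json','json_idx') into @mvid;
mysql> call flexviews.refresh(@mvid, 'BOTH', NULL);

‘BOTH’ computes and applies any pending changes in a single step; below, the two modes are run separately to show what each one does.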

I am going to remove one document from the table:

mysql> delete from json where id = 10000;
Query OK, 1 row affected (0.01 sec)

Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:

mysql> select * from flexviews.mvlog_f1673fac9814a93508a1c917566ecd4d\G
*************************** 1. row ***************************
    dml_type: -1
      uow_id: 113
fv$server_id: 33
      fv$gsn: 1083
          id: 10000
         doc: {"LO_OrderKey":"3359521","LO_LineNumber":"2","LO_CustKey":"10306","LO_PartKey":"77997","LO_SuppKey":"4","LO_OrderDateKey":"19951010","LO_OrderPriority":"2-HIGH","LO_ShipPriority":"0","LO_Quantity":"43","LO_ExtendedPrice":"8492457","LO_OrdTotalPrice":"27032802","LO_Discount":"2","LO_Revenue":"8322607","LO_SupplyCost":"118499","LO_Tax":"4","LO_CommitDateKey":"19951228","LO_ShipMode":"FOB","C_CustomerKey":"10306","C_Name":"Customer#000010306","C_Address":"4UR9tz8","C_City":"ROMANIA  5","C_Nation":"ROMANIA","C_Region":"EUROPE","C_Phone":"29-361-986-3513","C_MktSegment":"BUILDING","S_SuppKey":"4","S_Name":"Supplier#000000004","S_Address":"qGTQJXogS83a7MB","S_City":"MOROCCO  4","S_Nation":"MOROCCO","S_Region":"AFRICA","S_Phone":"25-128-190-5944","D_DateKey":"19951010","D_Date":"Octorber 10, 1995","D_DayOfWeek":"Wednesday","D_Month":"Octorber","D_Year":"1995","D_YearMonthNum":"199510","D_YearMonth":"Oct1995","D_DayNumInWeek":"4","D_DayNumInMonth":"10","D_DayNumInYear":"283","D_MonthNumInYear":"10","D_WeekNumInYear":"41","D_SellingSeason":"Fall","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"77997","P_Name":"burnished olive","P_MFGR":"MFGR#2","P_Category":"MFGR#24","P_Brand":"MFGR#2426","P_Colour":"orchid","P_Type":"MEDIUM PLATED TIN","P_Size":"16","P_Container":"WRAP PKG"}
1 row in set (0.01 sec)

Now we can verify the materialized view is out of date:

mysql> select * from json_idx where id = 10000;
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id    | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
|    10000 | 1995-10-10 |         10306 |         4 |     77997 |     3359521 |             2 | 10000 | 2937185172 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
1 row in set (2.60 sec)

To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes in one step, but here I am going to run ‘COMPUTE’ on its own first.
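The COMPUTE pass is just another call to the same procedure, something along these lines (a sketch mirroring the APPLY call further down):

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'COMPUTE', NULL);

With the change computed, we can look at what ends up in the delta table: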

mysql> select * from json_idx_delta\G
*************************** 1. row ***************************
     dml_type: -1
       uow_id: 113
       fv$gsn: 1083
    D_DateKey: 1995-10-10
C_CustomerKey: 10306
    S_SuppKey: 4
    P_PartKey: 77997
  LO_OrderKey: 3359521
LO_LineNumber: 2
           id: 10000
   mview$hash: 2937185172
1 row in set (0.00 sec)

Delta tables are similar to materialized view change log tables, except that they contain the insertions and deletions that need to be made to the view contents. In this case you can see dml_type is -1 and id = 10000, so the row in the view corresponding to the row we deleted will be removed when the change is applied.
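Conceptually, applying that delta row boils down to removing the matching row from the materialized view, along the lines of the statement below. This is only an illustration of the idea; Flexviews generates and executes its own SQL internally when you call APPLY.

-- illustration only: roughly what the APPLY step has to achieve for this delta row
delete from json_idx where mview$hash = 2937185172 and id = 10000;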

Now the change can be applied:

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'APPLY',NULL);
Query OK, 2 rows affected (0.47 sec)
mysql> select * from json_idx where id = 10000; -- note, we didn't index id in the MV
Empty set (2.61 sec)

Finally, it makes sense to try to keep the index in sync as quickly as possible using a MySQL event:

DELIMITER ;;
CREATE EVENT IF NOT EXISTS flexviews.refresh_json_idx
ON SCHEDULE EVERY 1 SECOND
DO
BEGIN
  DECLARE v_got_lock tinyint default 0;
  -- Take a non-blocking advisory lock so overlapping executions of the
  -- event never run two refreshes at the same time
  SELECT GET_LOCK('JSON_IDX_LOCK', 0) INTO v_got_lock;
  IF v_got_lock = 1 THEN
    -- COMPUTE and APPLY any pending changes in one step
    CALL flexviews.refresh(flexviews.get_id('ssb_json','json_idx'),'BOTH',NULL);
    SELECT RELEASE_LOCK('JSON_IDX_LOCK') INTO @discard;
  END IF;
END;;
DELIMITER ;
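One operational note, which is standard MySQL behaviour rather than anything Flexviews-specific: events only fire while the event scheduler is running, so it may need to be switched on first:

SET GLOBAL event_scheduler = ON;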

So there you have it: a way to index and quickly search through JSON documents, and to keep the index in sync automatically.

The post ‘Indexing’ JSON documents for efficient MySQL queries over JSON data appeared first on MySQL Performance Blog.
