Oct
06
2021
--

Reminder: TokuDB Storage Engine Will Be Disabled by Default in Percona Server for MySQL 8.0.26

TokuDB Disabled in Percona Server for MySQL

TokuDB Disabled in Percona Server for MySQLAs we’ve communicated in our blog post in May, the TokuDB Storage Engine has been marked as “deprecated” in Percona Server for MySQL 8.0. It will be removed in a future version (Percona Server for MySQL 8.0.28, expected to ship in Q1 2022).

With the release of Percona Server for MySQL 8.0.26, the storage engine will still be included in the binary builds and packages but will be disabled by default. If you are upgrading from a previous version, the TokuDB Storage Engine plugin will fail with an error message at server startup if it is installed.

You will still be able to re-enable it manually so that you can perform the necessary migration steps.

Re-enabling the TokuDB Engine in Percona Server for MySQL 8.0.26 is fairly straightforward. You need to add the options

tokudb_enabled=TRUE

and

tokudb_backup_enabled=TRUE

options to your my.cnf file and restart your server instance.

Once the server is up and running again, you can migrate your data to an alternative storage engine like RocksDB or InnoDB before disabling and removing TokuDB. See the chapter Removing the TokuDB storage engine in our documentation for details.

Let us know if you have any questions or comments about this process!

If you need any assistance with migrating your data or have any questions or concerns about this, don’t hesitate to reach out to us – our experts are here to help!

Complete the 2021 Percona Open Source Data Management Software Survey

Have Your Say!

Jun
22
2021
--

WiredTiger File Forensics Part 3: Viewing all the MongoDB Data

WiredTiger MongoDB

WiredTiger MongoDBThis article continues on from Part 1: Building “wt” and  “Part 2: wt dump” to show how to extract any of your MongoDB documents directly from WiredTiger’s raw data files. It’ll also show how to take a peek into the index files. Lastly, it’ll show how to also look in the WT transaction log to see updates made since the latest checkpoint.

List up the Collections and Indexes

WiredTiger doesn’t name any of its data files according to the MongoDB object names, so as a first step you’ll have to extract a table of WT idents (=identifiers) vs. collection and index names.

The _mdb_catalog.wt file is not the top table in the WiredTiger storage engine’s own hierarchy of data sources. To the MongoDB layer of code though it is the complete definition of collection and index objects in the database. This includes both MongoDB system and user-made collections and indexes.

Dump WT ident vs Collections

As explained in the previous Part 2: wt dump blog post, you should reuse this command to save a tab-delimited file of WiredTiger ident(ifier)s vs collection names. I’ll call this file wt_ident_vs_collection_ns.tsv in this article.

$ #cd to a copy of a MongoDB data directory
$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("md")) | [.ident, .ns] | @tsv' | sort > wt_ident_vs_collection_ns.tsv
$ 
$ head -n 5 wt_ident_vs_collection_ns.tsv
collection-0--4131298130356306083	config.cache.chunks.test.bar
collection-0-5834121039240263510	local.replset.initialSyncId
collection-0-5841128870485063129	local.startup_log
collection-0--6422702119521843596	config.system.sessions
collection-10--4131298130356306083	config.cache.chunks.test.foo

Eg. if an ident is “collection-4-5841128870485063129” then there will be a file collection-4-5841128870485063129.wt that has the data of a MongoDB collection.

In case you have a vague memory of seeing strings like these somewhere whilst using the mongo shell, you probably have. These idents are the same as the ones shown in the “wiredTiger.uri” field of the db.collection.stats() command output.

Optional: Dump WT ident vs Indexes

The index WT idents can be dumped as well. The example below will save them to wt_vs_index_ns.tsv in three columns [WT ident, collection name, index name].

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("idxIdent")) | .ns as $nsT | .idxIdent | to_entries[] | [.value, $nsT, .key] | @tsv' | sort > wt_ident_vs_index_ns.tsv
$
$ head -n 5 wt_vs_index_ns.tsv
index-11--4131298130356306083	config.cache.chunks.test.foo	_id_
index-12--4131298130356306083	config.cache.chunks.test.foo	lastmod_1
index-12-5841128870485063129	local.system.replset	_id_
index-1--4131298130356306083	config.cache.chunks.test.bar	_id_
index-14-5841128870485063129	admin.system.version	_id_

Looking at the Application (= MongoDB) Table Data

collection-*.wt and index-*.wt files contain the data of the MongoDB collections and indexes that you observe as a client connected to a MongoDB instance. Once you’ve identified the WT ident value for the collection or index you want to inspect you can use that in the URI argument to the wt dump command.

wt dump collections to *.bson files

The wt dump output of collection-* tables has a basic numeric type key and BSON data as the values.

Use wt dump -x in your terminal on the WT file for a collection and you will see, after the header section, records of the WT table in alternating lines of key and value.

Eg. in the example below the first key is numeric/binary value 0x81 and the first value is the one starting with binary bytes 0xce000000025f6964001e. The second key is 0x82, and its value starts with 0xce000000025f69640021. The values are BSON which were much longer in reality, I only trimmed it for readability at the moment.

$ wt dump -x collection-X-XXXXXXXXXX
WiredTiger Dump (WiredTiger Version 10.0.0)
Format=hex
Header
file:WiredTiger.wt
access_pattern_hint=none,allocation_size=4KB,app.....
Data
81
68000000025f6964001e000000424d50524f442d425a41342d31...
82
6b000000025f69640021000000424d50524f442d574f524b4552...
...

FYI Using the alternative URI argument syntax “wt dump -x table:collection-X-XXXXXXXXXX” or “wt dump -x file:collection-X-XXXXXXXXXX.wt” will produce the same.

The key value in the collection-* tables isn’t needed to see the document content so print only every second line for the values using the following command:

$ wt dump -x collection-X-XXXXXXXXXX | tail -n +7 | awk 'NR%2 == 0 { print }'
68000000025f6964001e000000424d50524f442d425a41342d31...
6b000000025f69640021000000424d50524f442d574f524b4552....
66000000025f6964001c000000424d50524f442d574f524b4552...
6b000000025f69640021000000424d50524f442d574f524b4552...
64000000025f6964001a000000424d50524f442d44415441312d...

The command above prints binary BSON in a hex string format with a newline separating each record. We can translate that hexadecimal back to the original binary using the xxd command utility with the “-r” and “-p” flags. (Note: Don’t combine as one “-rp” flag. It doesn’t work like most unix command’s short options.)

$ #Look for the WT ident of my test.customerOrder collection
$ grep customerOrder wt_ident_vs_collection_ns.tsv
collection-14--3398103177079662761	test.customerOrder
$ 
$ ls -lh collection-14--3398103177079662761.wt
-rw------- 1 akira akira 40K Jun 10 15:12 collection-14--3398103177079662761.wt
$
$ #dump and convert its values to a plain BSON file:
$ wt dump -x collection-14--3398103177079662761 | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p > test.customerOrder.bson
$ 
$ #Confirm the content using bsondump
$ bsondump --quiet test.customerOrder.bson
{"_id":"123456","_class":"model.customer_order.CustomerOrder","orderReference":{"businessInteractionId":"str ... "orderAttributeValue":"string"}]}]}]}]}}
{"_id":"ORN-billingissue01","_class":"com.ctl.bm.servi ... startDate":"2018-02-02T00:00:00.000Z"}],"existingTN":[]}
2021-06-10T15:12:16.627+0900	2 objects found

“wt read” a single record?

Sorry, you can’t read your MongoDB collection data with the “wt read” shell command. The blocker is trivial – as of WT v10.0.0 the “wt read” command only accepts plain text or its own “r” recordid numeric value as the lookup key value. The keys for the mongodb collections and indexes however are ‘q’ and ‘u’ types respectively. (Documentation: WT Schema Format types.)

In MongoDB, you might know you can use a showRecordId cursor option so it’s tempting to think that this is the same “r” type that wt read can currently accept, but unfortunately it is not. See the key_format=X value in the header of wt dump output samples to confirm.

If wt read (code = utilities/util_read.c) was modified to accept an -x argument so we could pass the hex strings we already see in wt dump output this issue would be solved. But as it isn’t, for now, you have to dump all records even if you just want one.

This is only a limitation in the shell. If you use the API from within a programming language, including the Python SWIG binding available, you should be able to read just a single record.

Looking at MongoDB’s Index Files

wt dump index-*.wt files

index-* WiredTiger tables have two different formats that are easy to see when using wt dump to look inside them – one with just keys, and one with both keys and values. The WT keys are binary values generated from MongoDB’s KeyString structure.

Below is an example of an index on {state: 1, process: 1} on a collection called config.locks. This is a case where there are no values in the WT table for the index. 

$ grep index-15--749209957533832251 wt_ident_vs_index_ns.tsv 
index-15--7492099575338322516	config.locks	state_1_process_1
$ 
$ wt dump -x index-15--7492099575338322516 | tail -n +7
293c436f6e66696753657276657200040008    <keys on odd-numbered line>
   <these even-numbered lines are where the WT value would be>
293c436f6e66696753657276657200040040

293c7072642d6d6f6e2d7065722d73686172642d6130333a32373031383a313538363938343138383a3139393331363932353337323937373139383300040028

....

293c70726f642d6d6f6e676f2d73686172642d6d6f6e676f732d302e6d6f6e676f2e706572636f6e612e636f6d3a32373031373a313535343633323838333a38363534383239393034393534383037333200040078

2b043c436f6e66696753657276657200040060

$

Below is an example of an {_id: 1} index on the collection test.rlru. This is a case when there are both WT keys and values.

$ grep index-1-2871579003788456567 wt_ident_vs_index_ns.tsv 
index-1-2871579003788456567	test.rlru	_id_
$
$ wt dump -x index-1-287157900378845656 | tail -n +7
2904
0008
2b0204
0010
2b0404
0018
...
...
2c0faa04
203eb1
2c0fac04
203eb9
...

Given the point of an index record lookup is to have a value that points to a record in the collection-X-XXXX WT table, you should be asking “How can that first index above be useful without values?”

The answer is the recordid is packed on the end of the key. You’ll notice in the first example they all have 0x04 as the third-last byte. This is how MongoDB packs a recordId when the value is between 0 and 2^10 – 1 I believe. See KeyString appendRecordId() if you want to get into it further.

By the way, an index’s field names are constant and thus irrelevant to sort order, so they’re not part of the keystrings.

Writing about the KeyString format even in just the shortest summary would take a whole blog post. So I’ll just punch out some translations of the binary above as a teaser and stop there.

Keystring 293c436f6e66696753657276657200040008 =>

  • 0x29 = type marker for numeric value 0.
  • 0x3c type marker for (UTF-8?) string
  • 36f6e66696753657276657200 hex of string “ConfigServer” plus tailing null
  • 0x04 type marker for a recordId
  • 0x0008 => binary 000 + binary 0000000001 + binary 000 = (recordId) value 1

Keystring 2b043c436f6e66696753657276657200040060 =>

  • 0x2b = type marker, positive integer in small range (< 2^7?)
  • 04 = binary 0000010 + bit 0 = value 2
  • 0x3c type marker for (UTF-8?) string
  • 36f6e66696753657276657200 hex of string “ConfigServer” plus tailing null
  • 0x04 type marker for a recordId
  • 0x0060 => binary 000 + binary 00.0000.1100 + binary 000 = (recordId) value 12

Keystring 2b0404 =>

  • 0x2b = type marker, positive integer in small range (< 2^7?)
  • 04 = binary 000010 + 00 = value 2, not sure what the tailing two bytes are for.
  • 0x04 type marker to indicate the following value is recordId?

Value 001001 =>

  • 0018 = binary 000 + binary value 00 0000 0011 = 3, plus 3 bits 000 on the end. T.b.h. I don’t always know how to translate this one, but commonly 3 bits at the beginning and end of the recordId format are used as a byte size indicator.

If looking for matching records in the collection-X-XXX table look for key (0x80 + the recordid) value from the index file. Eg. 2 -> 0x82; 12 -> 0x9c. For some reason 0x0 – 0x80 seems to be reserved and so key values in the collection-X-XXX.wt files are all incremented by 0x80 higher than the recordId value in the index-X-XXX.wt records.

In the end though, as we don’t have an ability to use wt read, all this poking around in the indexes from a shell command can only be for satisfying curiosity. Not for making single-record access fast.

Looking at the WT Transaction Log

A WiredTiger checkpoint saves a copy of documents in all collections and indexes as they were at one exact point in time (the time the checkpoint is started). MongoDB will call for a checkpoint to be made once per minute by default.

Without something else being saved to disk, a sudden crash would mean that restores/restarts could only revert to the last checkpoint. The classic database concept of write-ahead log is the solution to this of course. In WiredTiger this is provided by the transaction log, often just called “log” in its own documentation. Or as the documentation also says it adds “commit-level” durability to checkpoint durability.

At restart, whether it is after a perfectly normal shutdown or a crash, WiredTiger will read and replay the writes it finds in its transaction log onto the tables, in memory. In time, when the MongoDB layer requests a new checkpoint be created, the in-memory restored data will be saved to disk.

When you use wt dump it’s not easy to say if you’re looking at the collection (or index) restored as of the last checkpoint or with the transaction log “recovered” (read and applied) as well. If the “-R” global option of the wt command is used then yes log is recovered; if the opposite “-r” option is used then no. But which is in effect if you specify neither is unclear. Also, I’ve seen comments or diagnostic messages that suggest the -r option isn’t perfect.

Two Layers of Recovery

WiredTiger would, by default, keep no transaction log for the tables it manages for the application embedding it. It’s only the request of the application that engages it.

  • If you are using a standalone mongod MongoDB code will enable WT log for every collection and index created
    • Unless the “recoverFromOplogAsStandalone” server parameter is used. This is a trick that is part of point-in-time recovery of hot backups.
  • When replication is enabled (the typical case) only the WT tables for “local” db collections and indexes get “log=(enabled=true)” in their WT config string made by WiredTigerRecordStore::generateCreateString() in wiredtiger_record_store.cpp, and something similar in wiredtiger_index.cpp.

When a mongod node restarts with an existing data directory the WiredTiger library will run recovery. It can’t be stopped (at least not by the MongoDB user). This is not the end of the story though. When replication was used it means only the “local” db collections, in particular the oplog, is restored. The replication system code has a ReplicationRecovery class that is used next, and this will apply updates made since the last durable checkpoint time from the oplog to the user and system collections they’re supposed to be in. Only after that occurs will recovery be complete and the db server will make the data available to clients.

ReplicationRecovery allows MongoDB to do rollbacks (at least to the checkpoint time) and also to trim off ops in the last replication batch application. It’s an ugly point but MongoDB applies replication ops in small batches in parallel to improve performance; this only happens on secondaries usually of course, but also after a restart. If all the writes in the batch finish that is fine, but if they don’t some writes may be missing which is a clear consistency violation. So the unfinished batch’s writes should all be cleared, back to the last saved oplogTruncateAfterPoint value. repl.oplogTruncateAfterPoint is a single-document collection in the local db. It exists only for this purpose as far as I know.

Diagram for Typical Restart Recovery

“Typical” = restarting a node that is a replica. Doesn’t matter if it was a primary or secondary, or even the only node in the replica set.

WiredTiger’s recovery

?

WT table for local.oplog.rs updated

?

ReplicationRecovery

?

Trim oplog docs where “ts” > oplogTruncateAfterPoint

?

oplog application

Pseudo code:
db.oplog.rs.find({ts > durable checkpoint time}).forEach(function(d) {
     //re-apply write to intended collection in “test”, “mydb”, “admin”, etc. db
     applyOps(opdoc);
}

?

Ready!

wt printlog

You can use the wt printlog command to see what is in the log currently in the WiredTigerLog.<nnnnnn> files in the journal/ subdirectory. If you decode the write ops in there you can understand what document versions will be restored by MongoDB when it restarts.

Preparation to Using wt printlog – Map WT idents to fileid

Until now you’ve learned that every MongoDB collection and index is in its own WT table file, which means needing to learn what the WT ident(ifier) is to find the right raw WT file to look into.

There’s another small integer id value, a.k.a. fileid, in WT metadata / WT table config strings for each file too. I suppose it saves space; at any rate, the WT transaction log references files only by this number. This means you’ll have to build a handy mapping or list of which WT ident is which fileid. Use the following command to create a file I call wt_ident_vs_fileid.tsv.

$ wt dump file:WiredTiger.wt | grep -B 1 ',id=' | sed 's/^file:\(.*\)\.wt\\00/\1\t/; s/.*,id=\([0-9][0-9]*\),.*/\1/' | paste -d" " - - > wt_ident_vs_fileid.tsv

$ cat wt_ident_vs_fileid.tsv
WiredTigerHS	 268
_mdb_catalog	 3
collection-0--7000593926523175565	 269
collection-0--7468664482515188034	 196
collection-0--9200920937255444113	 79
...
collection-8-2787176720526139723	 12
collection-83--6694766911154494072	 194
index-1--7000593926523175565	 270
index-1--7468664482515188034	 197
index-1-2787176720526139723	 5
...
index-9--7468664482515188034	 205
index-9-2787176720526139723	 13
sizeStorer	 2

WT idents don’t suggest to the human reader which MongoDB collection or index is involved though, so after that use this shell command to join wt_ident_vs_fileid.tsv to the wt_ident_vs_collection_ns.tsv and wt_ident_vs_index_ns.tsv files made earlier.

$ join <(sort wt_ident_vs_fileid.tsv) <(cat wt_ident_vs_collection_ns.tsv wt_ident_vs_index_ns.tsv | sort) | sort -n -k 2 | sed 's/ /\t/g' > wt_ident_vs_sorted_fileid_and_mdb_name.tsv
$ cat wt_ident_vs_sorted_fileid_and_mdb_name.tsv
collection-0-2787176720526139723	4	local.startup_log
index-1-2787176720526139723	5	local.startup_log	_id_
collection-2-2787176720526139723	6	local.replset.oplogTruncateAfterPoint
index-3-2787176720526139723	7	local.replset.oplogTruncateAfterPoint	_id_
collection-4-2787176720526139723	8	local.replset.minvalid
index-5-2787176720526139723	9	local.replset.minvalid	_id_
...
collection-16-2787176720526139723	20	local.oplog.rs
collection-17-2787176720526139723	21	admin.system.version
index-18-2787176720526139723	22	admin.system.version	_id_
...

Not shown: WiredTiger’s own metadata table WiredTiger.wt is fileid 0. I presume this is hardcoded.

Transaction Log Files

In a brutally trimmed summary, the content of the transaction log is something like this:

{ checkpoint marker }
...
...
{ lsn : [42, 7424], ops: [ {fileid: i, key: x1, value: y1 } ] }
{ lsn : [42, 8448], ops: [ {fileid: i, key: x2, value: y2 }, {...}, {...} ] }
{ lsn : [42, 8960], ops: [ {fileid: j, key: x6, value: y6 } ] }
{ lsn : [43,    0], ops: [ {fileid: i, key: x7, value: y7 }, {...} ] }
{ lsn : [42,  256], ops: [ {fileid: i, key: x9, value: y9 } ] }
...

Going into details:

  • The files are WiredTiger table files.
    • The inserts to the WT table for the MongoDB local.oplog.rs collection are responsible for most of the log by size.
    • When replication is used only “local” db collection and index writes are saved in the WT log. This is saying: WT transaction log will only restore the local db data to what it should be. The replication system in MongoDB, above the storage engine, has a recovery process that reflects those post-last-checkpoint oplog writes to their respective collections outside of the “local” db.
  • The fileid number is an internal one of WiredTiger. It is a simple integer that starts from 1, so you’ll probably only see this as a one or two- or three-digit value. Which WT ident(ifier) (eg. “collection-99-8595782607099944640”) it is for, and then which MongoDB collection (or index) that is for, can be determined by looking in the WiredTiger metadata. (See “Output a tab-delimited table of WT ident(ifier) values vs. collection namespace” in WiredTiger File Forensics Part 2: wt dump. Once you know the ident, use wt dump on that table and look for the “id=…” value in its config string in the header section.)
  • There are no timestamps here that the WiredTiger library uses. It will follow the order exactly whilst replaying these writes though. And that order is as the application (= MongoDB) committed them before.
  • MongoDB writes the oplog “ts” timestamp into the wiredtiger key value shown above when it writes oplog docs. The “ts” value used is transformed from the BSON timestamp value binary format by record_id_helpers::extractKeyOptime(…). (oploghack::extractKey() in <= v4.2).
    This is an exception to what MongoDB puts into the key values of WT tables for other collections (and indexes).
  • The value shown above is just a binary blob to WiredTiger
    • For the tables that hold MongoDB collections, those blobs hold the raw BSON object you can see as a MongoDB user.
    • The blob for mongodb indexes will be a MongoDB KeyString.
  • “lsn” field is transaction log file number (eg. 43 is “WiredTigerLog.0000000043”) plus byte offset. So it is an ever-increasing tuple value, a simple definition of order.

With that explained, let’s look at a real sample now:

$ #To use wt printlog the WiredTiger config string definitely needs to be set.
$ #  See the part 1 of this blog series for details. In this example I'll do it by 
$ #  the -C option, but there are two other ways as well
$
$ wt -C "log=(compressor=snappy,path=journal)" printlog -xu | head -n 100
[
  { "lsn" : [55,128],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "prev_lsn",
        "prev_lsn": [54, 98566272]
      }
    ]
  },
  { "lsn" : [55,256],
    "hdr_flags" : "",
    "rec_len" : 384,
    "mem_len" : 384,
    "type" : "commit",
    "txnid" : 812443,
    "ops": [
      { "optype": "row_put",
        "fileid": 20 0x14,
        "key": "\u00e8`\u0088\u000b\u00c2\u00ff\u00ff\u00e0\u00e7",
        "key-hex": "e860880bc2ffffe0e7",
        "value": "\u00f3\u0000\u0000\u0000\u0002op\u0000\u0002\u0000\u0000\u0000d\u0000\u0002ns\u0000 \u0000\u0000\u0000config.transaction_coordinators\u0000\u0005ui\u0000\u0010\u0000\u0000\u0000\u0004\u00f7z\u001082\u00c6A\r\u0089\nx\u00cb(\u00edD\u0080\u0003o\u0000p\u0000\u0000\u0000\u0003_id\u0000f\u0000\u0000\u0000\u0003lsid\u0000H\u0000\u0000\u0000\u0005id\u0000\u0010\u0000\u0000\u0000\u0004\u00d4\u00dc\u00ac\u00ef\u00e7\u0087@\u00a7\u009d\u00df\u00c6.\u00d6\u00f4\u00f0\u00ec\u0005uid\u0000 \u0000\u0000\u0000\u0000\u00e3\u00b0\u00c4B\u0098\u00fc\u001c\u0014\u009a\u00fb\u00f4\u00c8\u0099o\u00b9$'\u00aeA\u00e4d\u009b\u0093L\u00a4\u0095\u0099\u001bxR\u00b8U\u0000\u0012txnNumber\u0000s\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0011ts\u0000'\u0001\u0000\u0000\u00c3\u000b\u0088`\u0012t\u0000i\u0000\u0000\u0000\u0000\u0000\u0000\u0000\twall\u0000(\u00f5m\u0013y\u0001\u0000\u0000\u0012v\u0000\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "value-hex": "f3000000026f7000020000006400026e730020000000636f6e6669672e7472616e73616374696f6e5f636f6f7264696e61746f727300057569001000000004f77a103832c6410d890a78cb28ed4480036f0070000000035f69640066000000036c7369640048000000056964001000000004d4dcacefe78740a79ddfc62ed6f4f0ec05756964002000000000e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855001274786e4e756d62657200730400000000000000001174730027010000c30b886012740069000000000000000977616c6c0028f56d1379010000127600020000000000000000"
      }
    ]
  },
  { ....

 

The plain “key” and “value” fields are a mixture of printable chars and hex encoding for non-printable chars, unfortunately, encoded as UTF-16 rather than ascii or UTF-8. This is too hard to work with, so I add the -x option to also print the hex strings the same as wt dump -x does.

In this sample fileid 20 is for local.oplog.rs and this typical content for the WT log with MongoDB. Using xxd -r -p we can see what is going into the value:

$ echo  "f3000000026f7000020000006400026e730020000000636f6e6669672e7472616e73616374696f6e5f636f6f7264696e61746f727300057569001000000004f77a103832c6410d890a78cb28ed4480036f0070000000035f69640066000000036c7369640048000000056964001000000004d4dcacefe78740a79ddfc62ed6f4f0ec05756964002000000000e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855001274786e4e756d62657200730400000000000000001174730027010000c30b886012740069000000000000000977616c6c0028f56d1379010000127600020000000000000000" | xxd -r -p | bsondump --quiet | jq  .
{
  "op": "d",
  "ns": "config.transaction_coordinators",
  "ui": {
    "$binary": {  "base64": "93oQODLGQQ2JCnjLKO1EgA==", "subType": "04" }
  },
  "o": {
    "_id": {
      "lsid": {
        "id": {  "$binary": { "base64": "1Nys7+eHQKed38Yu1vTw7A==", "subType": "04" } },
        "uid": {
          "$binary": { "base64": "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=", "subType": "00" }
        }
      },
      "txnNumber": {  "$numberLong": "1139" }
    }
  },
  "ts": {  "$timestamp": {"t": 1619528643, "i": 295 } },
  "t": { "$numberLong": "105" },
  "wall": { "$date": { "$numberLong": "1619528643880" } },
  "v": { "$numberLong": "2" }
}

An insert to the oplog that is for a d(elete) of the document in “config.transaction_coordinators” with “id”: {  “$binary”: { “base64”: “1Nys7+eHQKed38Yu1vTw7A==”, “subType”: “04” } }.

Summary

To look at the MongoDB collections, indexes, and the transaction log (a.k.a. the mongodb journal) you can use the wt dump and wt printlog commands. Before doing any of that you’ll have to make a mapping of WT table ident(ifier)s vs. MongoDB collection and index names, and also a list of fileids for the log. This information can be extracted using wt dump on WiredTiger.wt and _mdb_catalog.wt tables. Well, that and a medium amount of grep, sed, awk, etc.

The collection-*.wt files have key-value pairs of [recordId, BSON binary]. The index-*.wt files can also be read, where you can see the binary of the MongoDB-defined “keystrings”. Due to limitations with wt read described above in the article, there is no practical way currently with the “wt” util command to do index lookups.

Looking at the transaction log with wt printlog command will show the operations that will be applied during restart to progress the data from the state it was at the last durable checkpoint to the moment the last journal commit happened before shutdown.

Jun
07
2021
--

Experimental Feature: $backupCursorExtend in Percona Server for MongoDB

$backupCursorExtend MongoDB

$backupCursorExtend MongoDBPercona Server for MongoDB (PSMDB) has provided a ‘hot’ backup of its underlying data db directory files using WiredTiger library methods since v3.4. When you use the { createBackup: … } command it will copy them to whichever other filesystem directory or object store bucket location you specify.

createBackup partially freezes the WiredTiger *.wt Btree files with an internal backup cursor while the command runs. That’s “partially” because more data may be added to the files whilst the data is copied to the backup storage. The key point is all the file blocks of the last checkpoint will be frozen until the backup cursor is closed.

As well as the Btree-storing *.wt files, some metadata files, config-ish files, and the journal/WiredTigerLog.* files will also be copied. We’ll get into those a little later though.

In PSMDB 4.4.6-8 we’ve added two new aggregation pipeline stages that make the file-by-file mechanics of this fastest backup procedure more visible. They also support cluster-consistent backup and restore.

These aggregation pipeline stages are not real aggregation functions such as group, sort, etc. They’re functions for DBAs to use, like $collStats, $currentOp, $getDiagnosticData, etc.

New Aggregation Stage #1: $backupCursor

When you run db.aggregate([{$backupCursor: {}}] it simply returns a list of files to copy.

testrs:SECONDARY> bkcsr = db.getSiblingDB("admin").aggregate([{$backupCursor: {}}])
{ "metadata" : { 
    "backupId" : UUID("f53a93d5-8a0c-4f09-b654-909ff023a7ed"), 
    "dbpath" : "/data/plain_rs/n1/data", 
    "oplogStart" : { "ts" : Timestamp(1615341658, 18745), "t" : NumberLong(105) }, 
    "oplogEnd" : { "ts" : Timestamp(1622185503, 1), "t" : NumberLong(115) }, 
    "checkpointTimestamp" : Timestamp(1622185473, 1) 
  } 
}
{ "filename" : "/data/plain_rs/n1/data/collection-105-8595782607099944640.wt", "fileSize" : NumberLong(20480) }
{ "filename" : "/data/plain_rs/n1/data/index-1--1329013882440624564.wt", "fileSize" : NumberLong(20480) }
...
...
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000042", "fileSize" : NumberLong(104857600) }
testrs:SECONDARY>

See the “Full instructions for using” section below for a thorough example.

It just lists the files and size without copying them anywhere. This is less convenient than the existing createBackup which does. So why did we add it? Compatibility with the other enterprise version of MongoDB is the main point. But there’s another one – adding this aggregation stage exposes the WiredTiger library’s backup cursor API in an almost 1-to-1 structural match. Which is educational for advanced users, and that’s worth something.

Cursors of … File Names?

You may have heard that in Unix “Everything is file”. Well, in the WiredTiger library API everything is a cursor. Tables and indexes are accessed by a cursor of course, and so is the transaction log. Colgroups are too. Also statistics and metadata. Even files are accessed through a cursor. How did this happen? I guess when you’re a db engine developer …. hammers, nails – you know the saying.

New Aggregation Stage #2: $backupCursorExtend

This extra function returns just a few more file paths to copy. The output is small, as you see. The files it returns are just the list of WiredTiger transaction log files that have been updated or newly added since the $backupCursor was first run.

The WT log files are small, but they have an important benefit: in combination with some restart-time tricks, we can use the WT transaction ops in these files to restore to any arbitrary time between the $backupCursor and $backupCursorExtend execution times.

You must run $backupCursor first. As well as keeping that first cursor open, which is a requirement, it’ll be helpful to save a copy of the “metadata” document that is its first record. The reason is that the “backupId” value in the metadata is a required parameter when creating the $backupCursorExtend cursor – it has to know from which main backup cursor it is meant to extend.

testrs:SECONDARY> var bkcsr = db.getSiblingDB("admin").aggregate([{$backupCursor: {}}])
testrs:SECONDARY> var bkCsrMetadata = bkCsr.next().metadata //first document in $backupCursor result has just one child property, the "metadata" object.
testrs:SECONDARY> // do other things with the main backup cursor
testrs:SECONDARY> # ...
testrs:SECONDARY> var bkExtCsr = db.aggregate([{$backupCursorExtend: {backupId: bkCsrMetadata.backupId, timestamp: new Timestamp(162218619, 1)}}])
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000042" }
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000043" }
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000044" }
testrs:SECONDARY>

The second parameter, the timestamp one, will block the command from returning its result until the node has reached that cluster time. This guarantees that you don’t have, for example, a lagging secondary’s snapshot that ends at a cluster time that is before the starting cluster time of other shard’s backups.

Full Instructions for Using $backupCursor and $backupCursorExtend

N.b. this feature is experimental in PMSDB as of now (v4.4.6), so it may change in the future.

Backup

From one node in each replica set (including the config server replica set) run the following procedure.

It’s OK to use either a primary or a secondary, but using a secondary that has significant lag will oblige you to wait for that much time in between steps 1 and 2.

1. Get a list of all files to copy with $backupCursor

Getting the result response, with the metadata doc and the list of files, will take less than a second, but (important!) this cursor must be kept alive until the end of all the backup steps shown below. So don’t close it.

When the command is completed on all replica sets save somewhere in your script etc, the maximum (latest) “oplogEnd” timestamp amongst the replica set nodes – this will be the target, common time of the backup to restore. In this blog, I’ll call this value the commonBackupTimestamp.

(Pre-2: wait until all complete step 1.)

Step 1 will only take ~1 second to complete, but make sure you finish it for all servers involved before doing the next step.

2. If backing up a cluster also run $backupCursorExtend

Do it as soon as you like, you only need to capture a span of time that is long enough to overlap the same process on the other shards and the config server replica set.

Pass the metadata “backupId” from the $backupCursor on this node as the first parameter, and the commonBackupTimestamp (from all replica sets) as the second parameter. This could be run within seconds if your script/program is fast and is good at parallel execution across servers.

You don’t need to explicitly close the cursor of the $backupCursorExtend, it happens automatically. It doesn’t need to do the pinning of any files because $backupCursor is already doing that.

3. Start a < 10 min loop that runs getmore on backup cursor

This will prevent the automatic close of the backup cursor. This is crucial – if the backup cursor is closed WiredTiger library will start allowing the backup snapshot file blocks to be cleaned up and overwritten.

The $backupCursor MongoDB cursor is a tailable one that will return a successful but empty result, as many times as you have to.

4. Now start copying all the files listed to your backup storage

This is the part that takes hours … or whatever <your-db-file-size>/<transfer-rate> is for you.

5. Terminate the getmore loop

6. Close the cursor of the $backupCursor command

Or just let it time out (10mins).

7. Save the commonBackupTimestamp for restore.

This is a MongoDB timestamp, eg. Timestamp(162218619, 1). If you don’t save it in a handy place you’ll have to peek at all the restored nodes to find some common cluster timestamp they all support.

Restore

Having made backups with the current $backupCursor and $backupCursorExtend some work is required to restore a consistent cluster state. The reason is that $backupCursorExtend, in its current experimental version at least, returns WT log files that probably contain operations beyond the target cluster time.

To fix this we can use a special function of the replication subsystem’s startup process to trim the amount of oplog being restored beyond the last checkpoint to the desired, common cluster time. The “oplogTruncateAfterPoint” value in the local db’s replset.oplogTruncateAfterPoint collection is how you set that limit. (It’s undocumented in the MongoDB Manual, but a critical part of replication being updated all the time in normal replication.)

The big hassle is that more restarts are required – the oplogTruncateAfterPoint value can only be set whilst in standalone mode because it’s a field in a system collection of the local db; enabling replication again is required to get the value to be enacted upon. Saving the final result in a new checkpoint takes another standalone mode restart, with special extra options. And, unrelated to timestamp matters, you’ll probably be fixing the replica set and shard configuration to accept the new hostnames too.

For the general process of how to restore a filesystem snapshot of a cluster please see: Restore a Sharded Cluster. Although it’s not an absolutely required part of this restore process we also recommend setting an ephemeral port in the config file for all restarts, except the final one when rs.initiate() is executed. This prevents accidental connection to the original replica set nodes, wherever that environment is, and prevents interference between config and shard nodes before they’re perfectly restored.

The difference to the documented method above is, for both configsvr and shardsvr type nodes, instead of following the two steps to start the node in standalone mode and run db.getSiblingDB(“local”).drop() do these three steps instead:

  • Start mongod as standalone (i.e. replication and sharding options deleted or commented out)
    • Don’t drop the entire local db. We need to keep the oplog. Drop just the “replset.minvalid”, “replset.oplogTruncateAfterPoint”, “replset.election” and “system.replset” collections.
    • Insert a dummy replset.minvalid timestamp.
      db.getSiblingDB(“local”).replset.minvalid.insert({“_id”: ObjectId(), “t”: NumberLong(-1), “ts”: Timestamp(0,1) });
    • Manually create a temporary replset configuration with this minimal config:

      var tmp_rs_conf = {
      &nbsp;&nbsp;"_id" : THE-REPLSET-NAME,
      &nbsp;&nbsp;"version" : NumberInt(1), "protocolVersion" : NumberInt(1),
      &nbsp;&nbsp;"members" : [
      &nbsp;&nbsp;&nbsp;&nbsp;{ "_id" : NumberInt(0), "host" : "localhost:<port>" }
      &nbsp;&nbsp;], "settings" : {}
      };
      
      //Note: If config server add "configsvr": true as well
      // tmp_rs_conf["configsvr"] = true;
      
      db.getSiblingDB("local").system.replset.insert(tmp_rs_conf);
    • Put the commonBackupTimestamp in the replset.oplogTruncateAfterPoint collection.
      db.getSiblingDB("local").replset.oplogTruncateAfterPoint.insert( {
          "_id": "oplogTruncateAfterPoint",
          "oplogTruncateAfterPoint": <commonBackupTimestamp> })
    • Shutdown instance
  • Start as a single-node replica set to recover the oplog.
    • CSRS can be started with sharding enabled (clusterRole=configsvr) but shard’s replica sets should be started without clusterRole configured
    • At this start, oplog history is truncated to the point set in the oplogTruncateAfterPoint collection
    • Shutdown instance
  • Start as standalone with these parameters so the WT log and then mongodb-layer oplog updates are A) applied and B) saved into a checkpoint.
    • –setParameter recoverFromOplogAsStandalone=true
    • –setParameter takeUnstableCheckpointOnShutdown=true
    • Shutdown instance

I won’t say “… and that’s all there is to it!”, because five restarts is a big job. (Per the usual process in Restore a Sharded Cluster there are another two to fix up replica set and sharding config required when moving hosts.) But this is how we can restore all nodes to exactly the same MongoDB cluster time after using the current experimental status $backupCursorExtend being released in Percona Server for MongoDB v4.4 this week.

Summary

The WiredTiger API exposes ways to list all files that should be copied for backup, and it will freeze those files so you can copy them without them being deleted or necessary parts within them being overwritten. These methods of the WT storage engine are exposed at the MongoDB user level with the $backupCursor and $backupCursorExtend aggregation pipeline stages.

$backupCursorExtend in particular gives us a way to keep on collecting WT transaction log files so that we can guarantee coverage of some common time between all shards and the config server replica set of a sharded cluster.

This makes it possible to achieve cluster-consistent filesystem copy restores of a MongoDB sharded cluster. Distributed transactions, if you are using them, will be restored all-or-nothing style up to the cluster timestamp you specify during restore.

Looking ahead: $backupCursorExtend in PSMDB 4.4.6 (and as far as we can tell, in MongoDB Enterprise edition too) doesn’t limit transaction log file copies to an exact MongoDB cluster time. It only guarantees greater-than-or-equal-to inclusion of log writes of that time. So during restore, we must use extra techniques that the MongoDB replication system provides, in particular the undocumented oplogTruncateAfterPoint trick, to do a clean cut of the oplog up to the same cluster timestamp.

After we’ve heard feedback from the community about these new backup cursor functions we’d like to improve them so let us know how you use them, and what a better solution would be for you.

May
27
2021
--

How InnoDB Handles TEXT/BLOB Columns

InnoDB Handles TEXT BLOB Columns

InnoDB Handles TEXT BLOB ColumnsRecently we had a debate in the consulting team about how InnoDB handles TEXT/BLOB columns. More specifically, the argument was around the Barracuda file format with dynamic rows.

In the InnoDB official documentation, you can find this extract:

When a table is created with ROW_FORMAT=DYNAMIC, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page.

Whether columns are stored off-page depends on the page size and the total size of the row. When a row is too long, the longest columns are chosen for off-page storage until the clustered index record fits on the B-tree page. TEXT and BLOB columns that are less than or equal to 40 bytes are stored in line.

The first paragraph indicates InnoDB can store long columns fully off-page, in an overflow page. A little further down in the documentation, the behavior of short values, the ones with lengths less than 40 bytes, is described. These short values are stored in the row.

Our argument was about what happened for lengths above 40 bytes. Are these values stored in the rows or in overflow pages. I don’t really need a reason to start digging on a topic, imagine when I have one. With simple common tools, let’s experiment and find out.

Experimentation with TEXT/BLOB

For our experiment, we need a MySQL database server, a table with a TEXT column, and the hexdump utility. I spun up a simple LXC instance in my lab with Percona-server 8.0 and created the following table:

CREATE TABLE `testText` (
`id` int(11) unsigned NOT NULL,
`before` char(6) NOT NULL DEFAULT 'before',
`data` text NOT NULL,
`after` char(5) NOT NULL DEFAULT 'after',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The before and after columns are there to help locate the TEXT column in the datafile page. We can insert a few rows in the table, a TEXT value with size being a multiplier of 16 is convenient with hexdump, successive identical lines are replaced by ‘*’.

mysql> insert into testText (id,data) values (1,repeat('0123456789abcdef',1));
Query OK, 1 row affected (0.01 sec)
mysql> insert into testText (id,data) values (2,repeat('0123456789abcdef',2));
Query OK, 1 row affected (0.00 sec)
mysql> insert into testText (id,data) values (3,repeat('0123456789abcdef',3));
Query OK, 1 row affected (0.01 sec)

Then we use the hexdump utility at the shell level, the first leaf page starts at offset 0xc000:

# hexdump -C /var/lib/mysql/test/testText.ibd
...
0000c060  02 00 1b 69 6e 66 69 6d  75 6d 00 04 00 0b 00 00  |...infimum......|
0000c070  73 75 70 72 65 6d 75 6d  10 00 00 10 00 32 00 00  |supremum.....2..|
0000c080  00 01 00 00 00 05 74 3f  d9 00 00 01 5d 01 10 62  |......t?....]..b|
0000c090  65 66 6f 72 65 30 31 32  33 34 35 36 37 38 39 61  |efore0123456789a|
0000c0a0  62 63 64 65 66 61 66 74  65 72 20 00 00 18 00 42  |bcdefafter ....B|
0000c0b0  00 00 00 02 00 00 00 05  74 40 da 00 00 01 5e 01  |........t@....^.|
0000c0c0  10 62 65 66 6f 72 65 30  31 32 33 34 35 36 37 38  |.before012345678|
0000c0d0  39 61 62 63 64 65 66 30  31 32 33 34 35 36 37 38  |9abcdef012345678|
0000c0e0  39 61 62 63 64 65 66 61  66 74 65 72 30 00 00 20  |9abcdefafter0.. |
0000c0f0  ff 7e 00 00 00 03 00 00  00 05 74 45 dd 00 00 01  |.~........tE....|
0000c100  62 01 10 62 65 66 6f 72  65 30 31 32 33 34 35 36  |b..before0123456|
0000c110  37 38 39 61 62 63 64 65  66 30 31 32 33 34 35 36  |789abcdef0123456|
*
0000c130  37 38 39 61 62 63 64 65  66 61 66 74 65 72 00 00  |789abcdefafter..|

Clearly, the values are stored inside the row, even the third one with a length of 48.

Cutoff to an Overflow Page

If we continue to increase the length, the behavior stays the same up to a length of 8080 bytes (505 repetitions). If we add 16 bytes to the length, the row becomes larger than half of the available space on the page. At this point, the TEXT value is moved to an overflow page and replaced by a 20 bytes pointer in the row itself. The pointer in the rows looks like this:

0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......|
0000c070 73 75 70 72 65 6d 75 6d 14 c0 00 00 10 ff f1 00 |supremum........|
0000c080 00 00 01 00 00 00 05 74 80 a4 00 00 01 fe 01 10 |.......t........|
0000c090 62 65 66 6f 72 65 00 00 01 02 00 00 00 04 00 00 |before..........|
0000c0a0 00 26 00 00 00 00 00 00 1f a0 61 66 74 65 72 00 |.&........after.|
0000c0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

The 20 bytes pointer is: 00 00 01 02 00 00 00 04 00 00 00 26 00 00 00 00 00 00 1f a0
Space ID: 00 00 01 02
First overflow page: 00 00 00 04 (4 x 0x4000 = 0x10000)
Offset in overflow page: 00 00 00 26
Version number: 00 00 00 00
Total length of the TEXT value: 00 00 1f a0 (0x1fa0 = 8096 = 16*506)

The overflow page:

00010000 c1 06 3d 24 00 00 00 04 00 00 00 00 00 00 00 00 |..=$............|
00010010 00 00 00 00 74 dd 8e 3f 00 0a 00 00 00 00 00 00 |....t..?........|
00010020 00 00 00 00 01 02 00 00 1f a0 ff ff ff ff 30 31 |..............01|
00010030 32 33 34 35 36 37 38 39 61 62 63 64 65 66 30 31 |23456789abcdef01|
*
00011fc0 32 33 34 35 36 37 38 39 61 62 63 64 65 66 00 00 |23456789abcdef..|
00011fd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

The value header starting @ 0x10026: 00 00 1f a0 ff ff ff ff
Length in that overflow page: 00 00 1f a0 (0x1fa0 = 8096 = 16*506)
Next overflow page: ff ff ff ff (this means it is the last)

From the above, one can make a few interesting observations:

  • The 20 bytes pointer includes a space_id value so in theory, the overflow page could be in another tablespace (ibd file).
  • The total length is using 4 bytes even though this is a TEXT and not a LONGTEXT column. 2 bytes would have been sufficient.
  • The length of an overflow page chunk is also using 4 bytes for length even though the largest possible InnoDB page size is 64KB.

Performance Impacts

TEXT/BLOB columns are the source of a number of performance-related impacts. Let’s review the most obvious ones.

Storage

The first performance impact related to storage inefficiency. TEXT/BLOB values are stored in chunks of 16KB if the default InnoDB page size is used. That means, on average, about 8KB per value is lost when overflow pages are used. This leads to larger data files and less efficient caching.

Reads

The presence of TEXT/BLOB columns can significantly increase the number of read IOPs required for queries. For example, let’s consider the following simple query:

SELECT * from myTable limit 10;

If we ignore caching, without TEXT/BLOB columns, the above query would require only one read IOP per level in the primary key btree of myTable. For a small table, this could be one or two read IOPs. Now, if each row has a 1MB TEXT/BLOB column, the same simple query would require in excess of 640 read IOPs since each TEXT/BLOB value is a chain of 64 pages of 16KB.

Writes

For this section, let’s assume a worst-case scenario with row-based replication enabled and a full row image. Now, if we insert a row with a 1MB value as a longText column in a Percona server 8.0 instance, we have:

+----------------------------------+---------------------------+
| FILE_NAME                        | SUM_NUMBER_OF_BYTES_WRITE |
+----------------------------------+---------------------------+
| /var/lib/mysql/#ib_16384_0.dblwr |                   1572864 |
| /var/lib/mysql/ib_logfile0       |                   1174528 |
| /var/lib/mysql/test/testText.ibd |                   1130496 |
| /var/lib/mysql/binlog.000003     |                   1048879 |
| /tmp/MLfd=41                     |                   1048783 |
| /var/lib/mysql/undo_001          |                    278528 |
| /var/lib/mysql/undo_002          |                    131072 |
| /var/lib/mysql/ibdata1           |                     16384 |
| /var/lib/mysql/mysql.ibd         |                     16384 |
+----------------------------------+---------------------------+

For a total of around 6.4 MB. This is not surprising, there are two log files and the data is also written twice because of the doublewrite buffer. The temporary file is used for the disk binlog cache and unless the transaction is very long, it won’t actually be written to storage.

Anyway, this is just to set the stage for what happens after an update. If I just change the last letter of the longText value, the amount of data written raises to approximately 10 MB.

+----------------------------------+---------------------------+
| FILE_NAME                        | SUM_NUMBER_OF_BYTES_WRITE |
+----------------------------------+---------------------------+
| /var/lib/mysql/#ib_16384_0.dblwr | 2260992                   |
| /var/lib/mysql/test/testText.ibd | 2211840                   |
| /var/lib/mysql/binlog.000003     | 2097475                   |
| /tmp/MLfd=41                     | 2097379                   |
| /var/lib/mysql/ib_logfile0       | 1129472                   |
| /var/lib/mysql/undo_001          | 32768                     |
| /var/lib/mysql/ibdata1           | 16384                     |
+----------------------------------+---------------------------+

The longText value is not modified in place, it is copied to a new set of overflow pages. The new and old overflow pages then need to be flushed to storage. Also, since we use the worst-case scenario of the full row image, the binary log entry has the old and new value stored but the InnoDB log files only have the new version.

I hope this illustrates why storing mutable data in a TEXT/BLOB column is a bad idea.

JSON

Although columns stored using the MySQL JSON data type are stored as a TEXT/BLOB column, MySQL 8.0 added some logic to allow in-place updates. The impact of an update to a large JSON column in 8.0 is not as severe as in 5.7.

How to Best Deal with TEXT/BLOB Columns?

Data Compression

Data compression is a very compelling option for TEXT/BLOB values By definition, those values are normally large and thus, usually compress well. This is nothing new and has been reported in previous previously. PostgreSQL, for one, compresses by default its TEXT/BLOB columns (called TOAST).

The best option for compression is always the application when it is possible of course. As we have just seen, this reduces the write load and spares the database from the CPU burden of compression.

Another option with MySQL is the InnoDB Barracuda table compression. When used, TEXT/BLOB values are compressed before being written to the overflow pages. This is much more efficient than compressing the pages one at a time.

Finally, if you are using Percona Server or MariaDB, you have access to transparent column compression. This is the second-best option, performance-wise, behind compression by the application is not possible.

Avoid Returning TEXT/BLOB Columns

When there are large TEXT/BLOB columns in a table, the cost of accessing those columns is high. Because of this, it is important to select only the columns that are needed and avoid the default use of “select * from”. Unfortunately, many ORM frameworks instantiate objects by grabbing all the columns. When using such frameworks, you should consider storing the TEXT/BLOB in a separate table with a loose 1:1 relationship with the original table. That way, the ORM is able to instantiate the object without necessarily forcing a read of the TEXT/BLOB columns.

Conclusion

I hope this article improved your understanding of TEXT/BLOB values in InnoDB. When used correctly, TEXT/BLOB columns can be useful and have a limited impact on the performance of the database.

May
21
2021
--

Heads-Up: TokuDB Support Changes and Future Removal from Percona Server for MySQL 8.0

TokuDB Support Changes

TokuDB Support ChangesBack in December 2018, when we announced the general availability of Percona Server for MySQL 8.0, we also announced that the TokuDB Storage Engine has been marked as “deprecated” in this release, recommending to use the MyRocks Storage Engine as an alternative. We believe that MyRocks provides similar benefits for the majority of workloads and is better optimized for modern hardware.

Since then, we have continued maintaining the storage engine in the 8.0 release, e.g. by incorporating bug fixes. However, the ongoing amount of changes that are still occurring in the upstream MySQL 8.0 codebase have been a constant challenge and a cause for concern.

Maintaining TokuDB as part of the 8.0 codebase has become increasingly difficult and time-consuming. Based on the number of support requests and conversations on our forums, we have seen very little to no adoption of TokuDB in Percona Server for MySQL 8.0.

We have therefore decided that the TokuDB storage engine will be disabled in future versions of Percona Server for MySQL 8.0. Beginning with Percona Server version 8.0.25, we’ll add a respective warning notice to the release notes, to inform users about this upcoming change.

Timeline

Starting with Percona Server version 8.0.26 (expected in Q3 2021), the storage engine will still be included in the binary builds and packages, but disabled by default. Users upgrading from previous versions will still be able to re-enable it manually, so they can perform the necessary migration steps.

Starting with Percona Server for MySQL version 8.0.28 (expected to ship in Q1 2022), the TokuDB storage will no longer be supported and will be removed from the installation packages. It will still be part of the 8.0 source code, but not enabled in our binary builds.

We intend to fully remove it in the next major version of Percona Server for MySQL (9.0 or whatever version that will be). Note that this change only applies to Percona Server for MySQL version 8.0 – TokuDB remains enabled and supported in Percona Server 5.7 until this release has reached the end of its support period.

In case you’re still using TokuDB in Percona Server 8.0, we recommend switching to InnoDB or MyRocks instead as soon as possible. Please consult our documentation on how to migrate and remove the TokuDB storage engine.

If you need any assistance with migrating your data or if you have any questions or concerns about this, don’t hesitate to reach out to us – our experts are here to help!

May
18
2021
--

WiredTiger File Forensics Part 2: wt dump

wiredtiger file forensics wt dump

wiredtiger file forensics wt dumpThis article contains one normal section – how to print information directly from the raw WiredTiger files using wt dump – followed by really important, otherwise undocumented information about how to get to the MongoDB binary data inside WT tables.

See “WiredTiger File Forensics (Part 1: Building “wt”)” for:

  • How to build the “wt” utility command
  • A short description of its commandline syntax
  • How to set the wiredtiger_open() configuration string it needs

The examples in this page assume the following:

  • WiredTiger.config or WIREDTIGER_CONFIG or the -C option is being used to give “wt” the right config string. Having said that – an empty “wt” config might work OK for you, at least until wt printlog.
  • xxd, bsondump, jq are installed. (See “Shell script formatting tricks” section for why.)

wt dump

The wt dump command can be used to show the full content of a WiredTiger table file. It will output the table object’s header then every key and value.

wt dump has one compulsory argument: uri. A uri is usually “table:” + WT ident string, or “file:” + (relative) file path. Eg. “table:collection-4–9876544321″ or “file:collection-4–9876544321.wt“.

“table:” is automatically prepended to the uri by wt dump if you don’t specify any type yourself.

Documentation at source.wiredtiger.com

The format from wt dump can be:

  • “print”: Keys and values will be strings of printable ascii chars and \-escaped hex values of non-printable bytes. (A.k.a. pretty-print.)
    Eg. \82\00\00\00\03_id\00\0a\00\00\00\ff_id\00\00\03max\00…..
  • “hex”: Keys and values will be show as continuous [0-9af] hexadecimal characters. No escaping or quotes.
    Eg. 82000000035f6964000a000000ff5f696400…
  • “json” Keys will be shown as natural type (only a plain number as of MongoDB 4.4 / WiredTiger 10.0.0, I think). Values will be shown as UTF-16, which doesn’t make sense – it is inserting null bytes every second byte. (I suspect this is a current-version bug.)
    Eg. “\u0082\u0000\u0000\u0000\u0003\u005f\u0069\u0064\u0000\u000a\u0000\u0000\u0000\u00ff\u005f\u0069\u0064\u0000….”

For “print” or “hex” the output has a fixed line format:

Line # Field Sample
1 WT library version string WiredTiger Dump (WiredTiger Version 10.0.0)
2 “wt dump” format type Format=hex
3 Constant string “Header” Header
4 URI table:collection-10-5841128870485063129
5 Config string from the table’s creation (WT_SESSION::create) access_pattern_hint=none,allocation_size=4KB,app_metadata=, … 
6 Constant string “Data” Data
7
8
First key; its value e85d689139ffffdfc1

6400000011747300010000003a91685d12680000000000000000…

9
10
Second key; its value e85d68913bffffdfc1

e700000011747300010000003c91685d12740001000000000000….

(repeat for all kv pairs)

The -j JSON output mode of wt dump is much easier to read by eye, but values are currently encoded in UTF-16 strings which is practically useless. I can’t ingest that using my preferred JSON tools at least. Eg. what is the binary value 82 00 00 00 03 5f 69 64 (eight bytes) will be output as “\u0082\u0000\u0000\u0000\u0003\u005f\u0069\u0064” (sixteen bytes). The binary length has been doubled by null bytes between every intended value byte.

Shell script formatting tricks used in this article

  • xxd -r -p converts plain, [0-9a-fA-Z]* hex strings to raw bytes. (Note “xxd -rp” fails, and fails silently. Keep the -r and -p flags separated in the shell command.)
  • bsondump is used to convert BSON from binary to JSON. Use –quiet.
  • tail -n +7 to skip the header output of wt dump.
  • awk is used to filter out just the keys (line number NR%2 == 1) or values (line number NR%2 == 0) in the pretty-print or hex mode of wt dump.
  • jq is used to filter and/or reformat bsondump and wt printlog output.
    (Ideally jq would be used to ingest the output of wt dump -j directly, but the falsely null-padded, bytes-as-UTF-16 string encoding it currently outputs defeats that.)

Record output is not influenced by the WT binary format

The output of wt dump is unaffected by the WiredTiger binary format used. It will come out the same regardless of table format type, and modifiers such as max page size, key prefixing, etc.

This means wt dump won’t teach you about the ‘physical’ format of the data. The only thing you can see related to this is the config string that was used in WT_SESSION::create when a table was first made. It is shown in the header section every time wt dump is run.

$ wt dump file:collection-0--4131298130356306083.wt 
WiredTiger Dump (WiredTiger Version 10.0.0)
Format=print
Header
file:collection-0--4131298130356306083.wt
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(enabled=false,file_metadata=,repair=false),internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=false),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_custom=(prefix=,start_generation=0,suffix=),merge_max=15,merge_min=0),memory_page_image_max=0,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,readonly=false,source=,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,tiered=(chunk_size=1GB,tiers=),tiered_storage=(auth_token=,bucket=,local_retention=300,name=,object_target_size=10M),type=file,value_format=u,verbose=[],write_timestamp_usage=none

FYI there’s a lot of empty and/or unused options in the config strings, so don’t take the presence of an option to imply it is being used. Eg. above we see lsm options, even though all the tables mongodb makes are btrees (which you can confirm by the “format=btree” option).

The wt verify subcommand with a -d option does show something about the physical layout. Arguably wt stat does too. But these will no be covered here.

Metadata tables to explore in MongoDB data directory

WiredTiger

Yawn. Just a version string. View it as plain text.

WiredTiger.turtle

Fairly boring. Version information, a URI pointing to WiredTiger.wt, a config string to open it. View it as plain text.

The turtle file will be recreated auto-magically if it is absent.

WiredTiger.config

This is not a WiredTiger data file. It is kind of like a resource file for the “wt” utility command. It won’t be created by any program. It will only be made by you, or some nice colleague, to save you from specifying the -C option each time you run the “wt” command. See the previous blog article to learn what to put in it.

WiredTiger.lock

Just a classic lock file. Contains nothing except the plain text “WiredTiger lock file”. Created and overwritten automatically.

WiredTiger.wt

This is the main metadata table of the WiredTiger storage engine. It is also, itself, a WiredTiger table, one the WiredTiger.turtle file points to.

  • The keys in this table are URI strings of:
    • table: Wiredtiger tables
    • file: Files
    • system: A couple of special system values, and 
    • colgroup: Colgroups.
      (Colgroups are 1-to-1 to tables the way MongoDB uses WiredTiger; I tend to forget about their presence as a result.)
  • The values are the table/file/colgroup WiredTiger-flavoured JSON config strings that WiredTiger uses when it opens those objects in other API functions.

The entire content of this file is ascii-printable, keys and values alike. WT identifiers and config values are all print-safe c strings. You won’t have it as easy with the other table files though! They all contain binary numeric type and BSON keys and/or values.

WiredTiger.wt is not the MongoDB collection and index etc. catalog. That is in _mdb_catalog.wt (see later).

There is a special command that is hard-coded to dump only this collection – wt list. By default, it filters the output to just the keys, and furthermore, it excludes the “system:” ones plus the table/file listing for itself. Even if you use the “-v” option it will still exclude itself.

Bonus exercise: Dump the content of the WiredTiger.wt metadata file like it was just any other wt table

Amongst other records, the WiredTiger.wt file includes “table:xxxx” keys that point to every table’s URI. Every table except itself, that is.

So you can run commands such as wt dump collection-0–4131298130356306083, or wt dump index-8-5834121039240263510 or wt dump _mdb_catalog or wt dump sizeStorer, but you can’t run wt dump WiredTiger.

The trick is using the URI with “file:” at the front. You also need the full filename with the “.wt” suffix on the end, because it is a file, not a table URI. (wt dump assumes “table:” as the type prefix if not supplied.)

$ wt dump file:WiredTiger.wt

WiredTiger Dump (WiredTiger Version 10.0.0)
Format=print
Header
file:WiredTiger.wt
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum=uncompressed,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,.....
Data
colgroup:_mdb_catalog\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:_mdb_catalog.wt",type=file\00
colgroup:collection-0--4131298130356306083\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0--4131298130356306083.wt",type=file\00
colgroup:collection-0--6422702119521843596\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0--6422702119521843596.wt",type=file\00
colgroup:collection-0-5834121039240263510\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0-5834121039240263510.wt",type=file\00
colgroup:collection-0-5841128870485063129\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0-5841128870485063129.wt",type=file\00
colgroup:collection-10--4131298130356306083\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-10--4131298130356306083.wt",type=file\00
colgroup:collection-10-5841128870485063129\00
app_metadata=(formatVersion=1,oplogKeyExtractionVersion=1),collator=,columns=,source="file:collection-10-5841128870485063129.wt",type=file\00
colgroup:collection-11-5841128870485063129\0
...
colgroup:sizeStorer\00
app_metadata=,collator=,columns=,source="file:sizeStorer.wt",type=file\00
file:WiredTigerHS.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=uncompressed,...
file:_mdb_catalog.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,....
file:collection-0--4131298130356306083.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,
...
file:sizeStorer.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,...
system:checkpoint\00
checkpoint_timestamp="6088966b00000001"\00
system:checkpoint_base_write_gen\00
base_write_gen=104333\00
system:checkpoint_snapshot\00
snapshot_min=1,snapshot_max=1,snapshot_count=0\00
system:oldest\00
oldest_timestamp="6088966600000001"\00
table:_mdb_catalog\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0--4131298130356306083\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0--6422702119521843596\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0-5834121039240263510\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0-5841128870485063129\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-10--4131298130356306083\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-10-5841128870485063129\00
app_metadata=(formatVersion=1,oplogKeyExtractionVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-11-5841128870485063129\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
...
table:index-9-5841128870485063129\00
app_metadata=(formatVersion=8),colgroups=,collator=,columns=,key_format=u,value_format=u\00
table:sizeStorer\00
app_metadata=,colgroups=,collator=,columns=,key_format=u,value_format=u\00

_mdb_catalog.wt

This is the catalog/schema that the MongoDB user sees.

$ #N.b. URI can be "file:_mdb_catalog.wt", "table:_mdb_catalog", or just "_mdb_catalog"
$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet

{"md":{"ns":"local.startup_log","options":{"uuid":{"$binary":{"base64":"9TLH+oBmQXW7eou0SZFXPQ==","subType":"04"}},"capped":true,"size":{"$numberInt":"10485760"}},"indexes":[{"spec":{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_","ns":"local.startup_log"},"ready":true,"multikey":false,"multikeyPaths":{"_id":{"$binary":{"base64":"AA==","subType":"00"}}},"head":{"$numberLong":"0"},"prefix":{"$numberLong":"-1"},"backgroundSecondary":false,"runTwoPhaseBuild":false,"versionOfBuild":{"$numberLong":"1"}}],"prefix":{"$numberLong":"-1"}},"idxIdent":{"_id_":"index-1-5841128870485063129"},"ns":"local.startup_log","ident":"collection-0-5841128870485063129"}
{"isFeatureDoc":true,"ns":null,"nonRepairable":{"$numberLong":"0"},"repairable":{"$numberLong":"1"}}
{"md":{"ns":"local.replset.oplogTruncateAfterPoint","options":{"uuid":{"$binary":{"base64":"8kuKi7n6TliHYwjFxEEKWw==","subType":"04"}}},"indexes":[{"spec":{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_","ns":"local.replset.oplogTruncateAfterPoint"},"ready":true,"multikey":false,"multikeyPaths":{"_id":{"$binary":{"base64":"AA==","subType":"00"}}},"head":{"$numberLong":"0"},"prefix":{"$numberLong":"-1"},"backgroundSecondary":false,"runTwoPhaseBuild":false,"versionOfBuild":{"$numberLong":"1"}}],"prefix":{"$numberLong":"-1"}},"idxIdent":{"_id_":"index-3-5841128870485063129"},"ns":"local.replset.oplogTruncateAfterPoint","ident":"collection-2-5841128870485063129"}
...
...

The content of each value is BSON, as you can in the deserialized output above. There is just one document that isn’t for a collection, the “isFeatureDoc” (see above). Apart from that single, trivial exception, the fields of each object are:

  • “ns” Eg. “test.foo”, “config.chunks”, “local.oplog.rs”
  • “ident” The unique identifier string for the WT table of this mongodb collection. Eg. “collection-7-5834121039240263510”
  • “idxIdent”: object listing the ident(ifier)s of each index on the collection (Annoyingly they are separated from the index definitions inside the “md.indexes” object; see below.)
  • “md” Metadata of the collection. It has these fields:
    • “ns” (again)
    • “options” Options you would pass to collection create command. Most collections will only have the (automatically-assigned) collection UUID value in this sub-object, but if a collection has a non-default option such as capped: true it is defined here.
    • “indexes” Array of index specs, the ones you’d see if you were running db.collection.getIndexes().

Confirming things match the MongoDB picture

If you want to prove to yourself the information in _mdb_catalog WiredTiger table matches what the result of listDatabases (‘show dbs’), listCollections (‘show collections’), and listIndexes (db.collection.getIndexes()) in MongoDB, you can search the file with these shell commands:

List of collections and their index names, vs. the WT ident(ifier) strings

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq 'select(. | has("md")) | {ns: .ns, ident: .ident, idxIdent: .idxIdent}'
{
  "ns": "local.startup_log",
  "ident": "collection-0-5841128870485063129",
  "idxIdent": {
    "_id_": "index-1-5841128870485063129"
  }
}
{
  "ns": "local.replset.oplogTruncateAfterPoint",
  "ident": "collection-2-5841128870485063129",
  "idxIdent": {
    "_id_": "index-3-5841128870485063129"
  }
}
...
...

List of collections and the options strings they were created with:

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq 'select(. | has("md")) | .md | {ns: .ns, options: .options}'
{
  "ns": "local.startup_log",
  "options": {
    "uuid": { "$binary": {  "base64": "9TLH+oBmQXW7eou0SZFXPQ==", "subType": "04" } },
    "capped": true,
    "size": { "$numberInt": "10485760" }
  }
}
....

Index definitions (a.k.a. index specs):

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq 'select(. | has("md")) | .md | {ns: .ns, indexes: .indexes}'
{
  "ns": "local.startup_log",
  "indexes": [
    {
      "spec": { "v": { "$numberInt": "2" },
        "key": {
          "_id": { "$numberInt": "1" }
        },
        "name": "_id_",
        "ns": "local.startup_log"
      },
      "ready": true,
      "multikey": false,
      "multikeyPaths": {
        "_id": { "$binary": { "base64": "AA==", "subType": "00" }
        }
      },
      "head": { "$numberLong": "0" },
....

Output a tab-delimited table of WT ident(ifier) values vs. collection namespace

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("md")) | [.ident, .ns] | @tsv' | sort
collection-0--4131298130356306083	config.cache.chunks.test.bar
collection-0-5834121039240263510	local.replset.initialSyncId
collection-0-5841128870485063129	local.startup_log
collection-0--6422702119521843596	config.system.sessions
collection-10--4131298130356306083	config.cache.chunks.test.foo
...
...

Output a tab-delimited table of WT ident(ifier) values vs. indexes

The following outputs three columns: WT ident value, collection namespace, and index name.

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("idxIdent")) | .ns as $nsT | .idxIdent | to_entries[] | [.value, $nsT, .key] | @tsv' | sort
index-11--4131298130356306083	config.cache.chunks.test.foo	_id_
index-12--4131298130356306083	config.cache.chunks.test.foo	lastmod_1
index-12-5841128870485063129	local.system.replset	_id_
index-1--4131298130356306083	config.cache.chunks.test.bar	_id_
index-14-5841128870485063129	admin.system.version	_id_
...
...

sizeStorer.wt

Just a table of BSON values. MongoDB stores the sum BSON document data size and record count for each mongodb collection here. The keys are not the mongodb collection ns value though – they are the WT URI/ident values. You’ll have to map to the output of wt dump _mdb_catalog to work out which counts and sizes are for which collections.

$ #print the keys
$ wt dump table:sizeStorer | tail -n +7 | awk 'NR%2 == 1 { print }' | head 
table:_mdb_catalog
table:collection-0--2823106041799591351
table:collection-0--4131298130356306083
table:collection-0--6422702119521843596
...
...

$ #print the bson data values
$ wt dump -x table:sizeStorer | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -c .
{"numRecords":{"$numberLong":"37"},"dataSize":{"$numberLong":"16930"}}
{"numRecords":{"$numberLong":"1"},"dataSize":{"$numberLong":"22"}}
{"numRecords":{"$numberLong":"1"},"dataSize":{"$numberLong":"130"}}
{"numRecords":{"$numberLong":"0"},"dataSize":{"$numberLong":"0"}}
...

storage.bson

Not a part of WiredTiger. A few storage-engine-related options are persisted into a file in the dbPath data directory so MongoDB layer above can restart knowing if the directoryPerDB etc. options were used last time the data was used.

View it using bsondump [–quiet] storage.bson .

Application (= MongoDB) tables

The collection-*.wt and index-*.wt files contain the data of the MongoDB collections and indexes that you observe as a client connected to a MongoDB instance.

These are not WiredTiger metadata so we’ll cover them in the next blog post (WiredTiger File Forensics Part 3: Viewing all the MongoDB data). For starters though you can use the wt dump -x collection-*** | tail -n +7 | awk ‘{NR%2 = 0}’ | bsondump –quiet method shown for _mdb_catalog earlier to see the bsondump of collection-*.wt.

FYI these files do not contain any of their own MongoDB catalog information. You can see the data in them, but you can’t see what indexes are for which collections, or even what a collection’s mongodb ns (namespace) is. You will always need to dump the BSON data from the _mdb_catalog wiredtiger table information to get that.

WiredTiger’s transactional log files

The journal/WiredTigerLog.NNNNNNNN files are WiredTiger’s transactional log files, its write-ahead log. This is the only place on disk that preserves writes that haven’t yet been flushed to collection-*.wt and index-*.wt files during a checkpoint.

You usually don’t need to pay attention to the exact log file numbers as you work with them – they’re treated as a sequence and are automatically iterated from the start.

You view them with wt printlog instead of wt dump. The output is always JSON, but additional hex versions of the keys and values in ops can be included as well by using the -x option.

$ wt printlog | head -n 20
[
  { "lsn" : [72,128],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "prev_lsn",
        "prev_lsn": [1, 0]
      }
    ]
  },
  { "lsn" : [72,256],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "checkpoint_start",
...

This is just a starter for using looking at the wiredtiger transaction log files (a.k.a. mongodb journal files). The meaning of lsn numbers; how to identify collections from ops[].fileid; how to find oplog writes; why there are no writes for many collections, is material for another day.

You can ignore the WiredTigerPreplog.* files, they are empty. They’re pre-allocated in the filesystem to reduce latency that can occur doing the open() syscall; they are simply renamed as the next WiredTigerLog.* file when it is needed.

Summary

wt dump is the main command you will use to look at WiredTiger tables. wt list is a shortcut for dumping a subset of records from the top-most of those tables (WiredTiger.wt).

wt printlog is the command needed to view the (completely different) transaction log files. These are the write-ahead log of WiredTiger, known as journal in MongoDB.

Knowing how to follow the hierarchy of WiredTiger metadata, followed by the MongoDB catalog, is crucial to begin introspection of a MongoDB data directory.

Another crucial understanding is that WiredTiger stores keys and values as raw binary values with any type indicator, so you have to know ahead of time what those are, and/or be a skilled binary detective. The long and the short of it though is: WiredTiger’s metadata is simple, ascii-printable strings; all other tables are usually BSON or plain integer types. (Ok, one extra: A special ‘dehydrated keystring’ packing format, defined by the MongoDB layer, is in the index-*.wt files.)

Wait for the next blog post (WiredTiger File Forensics Part 3: Viewing all the MongoDB data) to see how to put it all together and extract any information, as you think of it as a MongoDB user.

May
18
2021
--

WiredTiger File Forensics Part 1: Building “wt”

wiredtiger file forensics

wiredtiger file forensicsMost of the files in a data directory of a MongoDB server are made by the WiredTiger storage engine. If you want to look at the content inside them you can use the tool “wt” from the WiredTiger library:

https://github.com/wiredtiger/wiredtiger/

http://source.wiredtiger.com/10.0.0/command_line.html

Inspection of the WiredTiger files is not an essential MongoDB DBA skill – it’s just for the curious. Or maybe post-crash data recovery, if you weren’t using a replica set. But for a certain type of person this is fun, and if you voluntarily clicked a page link to “file forensics” then it seems that’s you. ?

Build and Install the “wt” Executable

Documentation: http://source.wiredtiger.com/develop/install.html

  1. git clone https://github.com/wiredtiger/wiredtiger.git
  2. cd wiredtiger
  3. git branch --all | grep mongodb

      #find a branch matching the major version of mongodb you are using

  4. git checkout mongodb-4.4  #Using the WT code matching mongodb 4.4 as an example
  5. sh autogen.sh
  6. ./configure --disable-shared --with-builtins=lz4,snappy,zlib,zstd
  7. make -j $(nproc) # -j $(nproc) to parallelize build to number of cores
  8. sudo make install

Points about the make configuration above:

Use

--disabled-shared

  to force the build of a statically linked “wt” binary.

If you do not disable shared library mode, the final build product will be somewhat different. The “wt” target executable at the top of the source directory (and installed to /usr/local/bin/) will be a script, not a compiled binary. That script will execute a compiled binary built at .libs/wt, but before it can it has to fix the LD_LIBRARY_PRELOAD environment variable so the binary can find the libwiredtiger.so, also in .libs/. Much hacking is required if you then try to move these files elsewhere.

Also use

--with-builtins=lz4,snappy,zlib,zstd

to match what normal mongod binaries expect. But feel free to drop “lz4”, “zlib” and/or “zstd” if there is a hassle about installing those libraries’ dev packages. Only snappy is used in 99%+ of MongoDB deployments.

--with-builtins lz4,snappy,zlib,zstd

, ie. without a “=” between option name and values, will fail with the message “checking build system type… Invalid configuration snappy': machine snappy’ not recognized”. Put the “=” in.

You can use configuration option –enable-XXX (eg. –enable-snappy) to build extensions as standalone libraries instead, but that’s the hard way. You’ll have to continually specify the full path to the standalone extension library as an extra extension in config every time you run a wt program.

Versions

You’ll need to build a different branch of wiredtiger depending on which mongodb version of data you’re looking at. “Log” in the table below means the transaction log, a.k.a. the journal in mongodb.

/**
* Table of MongoDB<->WiredTiger<->Log version numbers:
*
* |                MongoDB | WiredTiger | Log |
* |------------------------+------------+-----|
* |                 3.0.15 |      2.5.3 |   1 |
* |                 3.2.20 |      2.9.2 |   1 |
* |                 3.4.15 |      2.9.2 |   1 |
* |                  3.6.4 |      3.0.1 |   2 |
* |                 4.0.16 |      3.1.1 |   3 |
* |                  4.2.1 |      3.2.2 |   3 |
* |                  4.2.6 |      3.3.0 |   3 |
* | 4.2.6 (blessed by 4.4) |      3.3.0 |   4 |
* |                  4.4.0 |     10.0.0 |   5 |
*/

(This useful table is currently found as a comment in mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp.)

To build WT matching 4.4 do “git checkout mongodb-4.4”, for the latest 4.2 minor versions “git checkout mongodb-4.2”, etc. There are also tags (rather than branches) for a lot of minor versions. Use “git branch -a” and “git tag –list” to see.

A “wt” Build Shortcut

If you already have a compilable source code directory of MongoDB Community or Percona Server for MongoDB, a slightly quicker way to build and install wt it is to run scons with “wt” as the target (instead of “mongod”, or “install-core”, etc.). The WiredTiger library code in the subdirectory src/third_party/wiredtiger/ is already cherry-picked to match whichever version of MongoDB you have checked out at the moment.

The built binary will be created at build/opt/third_party/wiredtiger/wt if building with legacy (<=v4.2 style) install mode. Otherwise, it is installed to scons –prefix option plus “/bin/” I believe.

Executing the wt Command

$ wt
WiredTiger Data Engine (version 10.0)
MongoDB wiredtiger_open configuration: ...
global_options:
    -B  maintain release 3.3 log file compatibility
    -C config
        wiredtiger_open configuration
    -E key
        secret encryption key
    -h home
        database directory
    -L  turn logging off for debug-mode
    -m  run verify on metadata
    -R  run recovery (if recovery configured)
    -r  access the database via a readonly connection
    -S  run salvage recovery (if recovery configured)
    -V  display library version and exit
    -v  verbose
commands:
    alter
        alter an object
    backup
        database backup
    compact
        compact an object
    copyright
        display copyright information
    create
        create an object
    downgrade
        downgrade a database
    drop
        drop an object
    dump
        dump an object
    list

The options above are the global ones, and each subcommand has other options not shown above. Eg. the list subcommand has two options -c and -v (see help output below). The subcommand-specific options (-c or -v in this case) must appear after the subcommand (“list”, “dump”, etc.) on the command line, and the global options (-LmRrSVv, -C, -E) must be before it.

$ wt list --help
wt: illegal option -- -
usage: wt [-LmRrSVv] [-C config] [-E secretkey] [-h home] list [-cv] [uri]
options:
    -c  display checkpoints in human-readable format (by default checkpoints are not displayed)
    -v  display the complete schema table (by default only a subset is displayed)

Set the wiredtiger_open Config String

MongoDB configures WiredTiger so that (by default) the snappy compressor is used, and the WT transaction log files are placed under the “journal/” subdirectory.

Some wt subcommands don’t need to read the transaction logs, so you might survive without setting this. But if you do then “wt: __wt_log_scan, 2114: no log files found: Operation not supported” is the error message you will see, and you should follow the advice in this section.

Neither the “journal” transaction log subdirectory or snappy compression are defaults of the WT library. The current v10.0.0 wt’s help output shows “MongoDB wiredtiger_open configuration: log=(enabled=true,path=journal,compressor=snappy)” as the second or third line of output, but this is misleading. It is just a hint message, not the config string actually in effect ? 

So set this value in a config string so the wiredtiger connection API knows where and how to open the wiredtiger transaction log files. (Option setting method is explained in the “How to put the configuration into effect” section just a little bit further down the page.)

log=(compressor=snappy,path=journal)

As of May 2021: I do not include the log=(enabled=true,…) value because it doesn’t seem to be necessary, and it leads to a “WT_LOGSCAN_FIRST not set” error when using the wt printlog subcommand at least.

Config String Syntax

See http://source.wiredtiger.com/10.0.0/config_strings.html for notes about the supposedly “compatible with JavaScript Object Notation (JSON)” config string format that WT uses.

In the config strings you’ll see in WT or MongoDB code, and the ones used in this article, there are no “{}” brackets or the “:” key-value delimiter. Instead it will be “object=(key1=val1,key2=val2,..),array_val=[val1,val2,val3]” style. It’s weird JSON, but presumably also perfectly conforming to some standard of JSON written somewhere.

These config options are for the WT C API’s wiredtiger_open function which initializes a WT_CONNECTION object. (grep for “wiredtiger_open_configuration” in util_main.cpp if you want to confirm the code.)

How to Put the Configuration into Effect

Lowest precedence: WiredTiger.config file

This is probably the most comfortable option. Put the config string in a file WiredTiger.config in the data directory (i.e. alongside WiredTiger.wt, WiredTiger.turtle, all the collection and index *.wt files, etc.)

$ cat WiredTiger.config&nbsp;
log=(compressor=snappy,path=journal)

http://source.wiredtiger.com/1.4.2/database_config.html#config_file

Middle precedence: WIREDTIGER_CONFIG environment variable

Settings here will override those coming from a WiredTiger.config file.

Eg. this setting will enable verbose log messages from places in the code where __wt_verbose(session, WT_VERB_METADATA, …) is called.

export WIREDTIGER_CONFIG="verbose=[metadata]"

http://source.wiredtiger.com/1.4.2/database_config.html#config_file

Highest precedence: Use the -C option

This will also include values merged from the previous two sources, but otherwise is the final say in what config string is passed into wiredtiger_open(.., .., config_string, &conn).

N.b. The -C value has to be before the subcommand. Eg.

wt -C "log=(compressor=snappy,path=journal)" list

Some other options of the “wt” util will add to this config string as well. Eg. -m adds “verify_metadata=true”, -R adds “log=(recover=on)”, etc.

Config merging rules

… aren’t clear to me. In practical testing, it seems to override sometimes but merge at other times. See the code of wiredtiger_open() if you want to figure it out.

Q. What wiredtiger_open() config does MongoDB use?

As it happens you can easily see the config string mongod uses as it initializes an internal connection to the WiredTiger library. Just search for “Opening WiredTiger” in mongod’s diagnostic log file. In MongoDB 4.4+, with the structured json-style log file, you can alternatively grep for a unique log message identifier number 22315.

$ grep -m 1 ‘Opening WiredTiger’ /data/node_s1n1/mongod.log

{“t”:{“$date”:”2021-04-27T21:58:09.580+09:00″},”s”:”I”,  “c”:”STORAGE”,  “id”:22315,   “ctx”:”initandlisten”,”msg”:”Opening WiredTiger”,”attr”:{“config”:”create,cache_size=7375M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],“}}

You could use this entire config string when starting the wt utility command, but you don’t need to. Some are for asynchronous processes, eg. eviction and file manager timers, which the wt util won’t run. And some I wouldn’t use: Eg. log=(archive=true) will discard old log files after recovery and checkpoint if you somehow manage to make that happen with just the wt command.

Summary

This article was a relatively brief guide on how to build the “wt” utility of the WiredTiger library; plus information regarding the runtime config values it needs when inspecting a data directory created by MongoDB.

Please see the next article “WiredTiger File Forensics (Part 2: wt dump)” to see how to use this tool with a mongod node’s data directory files.

May
04
2021
--

InnoDB File Growth Weirdness

InnoDB File Growth Weirdness

InnoDB File Growth WeirdnessThere is a common pattern in life, you often discover or understand things by accident. Many scientific discoveries fit such a description. In our database world, I was looking to see how BLOB/TEXT columns are allocated using overlay pages and I stumbled upon something interesting and unexpected. Let me present to you my findings, along with my attempt at explaining what is happening.

InnoDB Tablespaces

The first oddity I found is a bunch of free pages in each tablespace it is skipping. Here’s an example from a simple table with only an integer primary key and a char(32) column:

root@LabPS8_1:~/1btr# innodb_space -f /var/lib/mysql/test/t1.ibd space-extents
start_page  page_used_bitmap
0       	#####################################........................... <--- free pages
64      	################################################################
128     	################################################################
...

The innodb_space tool comes from the InnoDB ruby project of Jeremy Cole. If you want to explore InnoDB file formats, you need these tools. As you can see, the first extent has 27 free pages. These free pages are reserved for node pages (non-leaf) and will eventually be all used. At this point, I thought a table with 34 index pages, just starting to use the second extent for the leaf pages, would have 90 free pages (27 + 63) and use 2MB of disk space. While the previous statement proved to be true, I was up for quite a surprise.

InnoDB File Growth

To better illustrate the amount of free space available in an InnoDB data file, I decided to follow the evolution of the tablespace file size as I added index pages. The following figure shows my result.

Show the evolution of the size of an Innodb tablespace as data is added.

As I added rows, more leaf pages were allocated until the file segment of the leaf pages reached 32 pages. At this point, the table has 33 index pages, one root, and 32 leaves. The allocation of another page forces InnoDB to fully allocate the first extent and add the second one for the leaves. At this point, the size on the disk is 2MB. If we keep inserting rows, the following page allocation triggers InnoDB to allocate 7 reserved extents of 1MB each. At this point, the tablespace size on the disk reaches 9MB.

 

Register for Percona Live ONLINE
A Virtual Event about Open Source Databases

 

InnoDB uses the reserved extents for btree maintenance operations. They are not accounted for in the free space of the tablespace. Now, reserving 7 extents of 1MB each in a table containing only 560KB of data is pretty insane. At this point, the InnoDB tablespace has a size of 9MB on disk. This is extremely inefficient, about 8.4MB if just free space filled with “0”. Of course, as the table grows, the size impact of these reserved extents is diluted. The amount of reserved space will grow by about 1MB (1 extent) for every 100MB allocated.

This allocation of reserved extents is far from optimal, especially in a multi-tenants era where it is fairly common to see MySQL servers handling more than 100k tables. 100k tables, each with only 1MB of data in them will use 900GB of disk space. This phenomenon is not new, a bug report was created back in 2013 and is still open. The bug is considered a low priority and non-critical.

A lot of effort has been devoted to improving the capacity of MySQL 8.0 to handle a large number of tables. Until the allocation of reserved extents is fixed, be aware of this issue when planning your storage allocation. Of course, if you are using ZFS, the impacts are more limited…

root@LabPS8_1:~# du -hs /var/lib/mysql/test/t1.ibd
225K /var/lib/mysql/test/t1.ibd
root@LabPS8_1:~# du -hs --apparent-size /var/lib/mysql/test/t1.ibd
9,0M /var/lib/mysql/test/t1.ibd

My lab setup uses ZFS for LXC and KVM instances. LZ4 compression does magic on extents full of “0”, the actual consumed space is reduced to 225KB. If you want to explore the use of ZFS with MySQL, I invite you to read a ZFS post I wrote a few years ago.

Ok, time now to close this parenthesis and go back to the storage of BLOB/TEXT columns. That will be for a future post though!

Apr
05
2021
--

Which Version of MySQL Should I Use for MyRocks?

Version of MySQL Should I Use for MyRocks

Version of MySQL Should I Use for MyRocksAs database footprints continue to explode, many companies are looking for ways to deal with such rapid growth.  One approach is to refactor traditional relational databases to fit into a NoSQL engine, where horizontal scalability is easier.  However, in many cases, this is in no way a trivial undertaking.

Another approach that has been gaining interest is the use of MyRocks as an alternative storage engine to the traditional InnoDB.  While not for everyone, in certain use cases it could be a potential solution.  As with so many things open source, the next standard questions are: which version should I use?  Any differences with the engine if I use MyRocks with MySQL 5.7 vs 8.0?

In this post, I wanted to touch on this and give some high-level thoughts on MyRocks when it comes to the version of MySQL.

Engine Overview

At a high level, RocksDB is an embedded key-value database with data stored in a log-structured merge tree (LSM).  MyRocks is an abstraction layer that allows RocksDB to serve as a MySQL storage engine.  With RocksDB as the underlying storage layer, there are numerous advantages including faster replication, better compression, and faster data loading.

In contrast, InnoDB is B-Tree based structure.  MySQL was designed using a plugin architecture that separates the storage engine logic from the main server functionality.  This allows users to choose which storage engine they want based on their use case.  Historically, this was the MyISAM storage engine.  In recent years, InnoDB replaced MyISAM as the defacto standard and later the default engine.

Which Engine is Better?

Choosing which engine to use is very use case-specific.  Even at Facebook, where MyRocks was first developed, it was not meant to be a universal replacement for InnoDB.  Features like gap locks remain absent from MyRocks.  If a workload is dependent on gap locking, it will be a non-starter with the MyRocks engine.

Also, data access patterns should dictate the engine choice.  If the workload is write-intensive with limited range scans, it may be a good fit for MyRocks.  If you have a traditional OLTP workload with several reporting access patterns, InnoDB would remain a better option.  As mentioned in MyRocks Engine: Things to Know Before You Start:

“MyRocks is not an enhanced InnoDB, nor a one-size-fits-all replacement for InnoDB. It has its own pros/cons just like InnoDB. You need to decide which engine to use based on your application’s data access patterns.”

Which Version of MySQL Should I Use?

Assuming you have done your research and found MyRocks would be a good fit, the next choice is which version to use.  Facebook runs MyRocks on a heavily patched, internal version of MySQL 5.6Percona Server for MySQL includes MyRocks in both 5.7 and 8.0.  So what are the differences?

At a high level, the MyRocks/RocksDB code is essentially the same in all three versions.  Percona Server for MySQL uses the upstream version of RocksDB and only changes the linked version when Facebook MySQL updates the version.  This is due to the fact that it inherits the integration testing between RocksDB and MyRocks from Facebook.

As such, the biggest differences are based solely on the server version and not the server version and MyRocks combination.  If your infrastructure is already running with MySQL 5.7 in production and not ready to finalize a move to 8.0, there would be no discernible difference running the MyRocks engine with your current version.  MyRocks running on Percona Server for MySQL 5.7 is stable and shouldn’t be discarded as an option.  You should still plan to upgrade to 8.0, but there shouldn’t be any unique challenges in the process compared to using InnoDB.  It would just require the standard upgrade process and significant testing.

Moving forward (as 5.7 approaches EOL in 2023), you can expect to see fewer enhancements with MyRocks on Percona Server for MySQL 5.7.  Most new active development will be done against 8.0 while 5.7 will remain primarily in extended support, which includes critical bug fixes and security fixes only.

Conclusion

With all the version combinations and options, it can be overwhelming to pick the best option for your organization.  For a new deployment, I would recommend starting with MyRocks on MySQL 8.0.  With an existing MySQL 5.7 deployment (while you should be actively working towards an upgrade to 8.0), MyRocks is a viable and stable option as well.  The biggest driver for MyRocks should be space concerns combined with the workload.  As nice as it would be to say MyRocks is the silver bullet that works for everything, that just isn’t the case.  You can always reach out to the Professional Services team at Percona and let us help you determine if MyRocks would be a good fit for your team!

Apr
01
2021
--

Working to Validate MyRocks in the Enterprise with Dropbox

MyRocks in the Enterprise with Dropbox

MyRocks in the Enterprise with DropboxPercona Technical Account Managers get the privilege of working with some of our largest enterprise clients day in and day out.  As such, we get to really focus on how to best leverage our technology to generate measurable benefits for our users.  While it is fun to “nerd out” and always strive to use the latest and greatest, we need to stay focused on demonstrating business value and a genuine need.  Over the past few months, I’ve been working with one of my larger clients, Dropbox, along with our professional services team to validate the use of Percona Server for MySQL with the MyRocks storage engine over a large portion of their MySQL infrastructure.

Please note – this is not meant to be a deep dive into the technical details around MyRocks or the implementation.  Rather, it is meant to show how we determined the need, potential solution, and the planning that has started us down this path.  Look for a more detailed case study in the coming months as we hope to push this solution forward!

The Problem

In a single word, space.  When a client reaches a certain scale, space becomes a real concern.  With 10-20 servers, having an extra index or choosing the wrong data type can be noticeable.  However, with 10,000 servers, there is much less margin for error.  The challenge with this client is that the schema has already been sharded and optimized. Even with that, the aggregate dataset is still on the petabyte scale.  Think about what that means:

  • Large storage footprint per server (2+TB)
  • Replication multiplies that footprint for each “cluster”
  • Multiple backups per cluster result in a huge (PB+) backup footprint

There are some additional challenges created at this scale, too.  To mitigate the risk of filling a disk, there is a soft cap of 75 percent full on each server.  When a cluster hits this level, the shard is split to give each half some additional runway.  While this is great operationally, splitting shards results in doubling the cost for the cluster.  Again, on a scale of 1000s of servers, a “split” means an increase in cost in the hundreds of thousands of dollars.  This is not trivial and puts pressure on the team to delay splits as long as possible while maintaining high availability, durability, and performance.

MyRocks Use Case

After much discussion and understanding that space (and, in turn, cost) is a major business driver, we decided to investigate a change to MyRocks as the storage engine.  While standard InnoDB compression provided a small bandaid, the thinking was that MyRocks would give substantial savings.  At a very high level, MyRocks is an LSM tree-based storage engine built on top of RocksDB.  This type of storage has numerous advantages over traditional B+Tree storage engines (like InnoDB), including a smaller disk footprint and reduced write amplification.  This can translate to business value in several ways, including:

  • Decreased cloud/HDFS storage cost for backups
  • Reduced Write Amplification results in a longer disk lifetime, reducing fleet turnover
  • Decreased instance space requires fewer splits and increases runway

After several rounds of discussion around the benefits and any blockers, we decided to begin testing the feasibility of converting from InnoDB to MyRocks.

Initial Testing

While the use case seemed like a solid fit, extensive testing is always needed.  So far, the initial testing looks promising.  We’ve seen a large reduction in space combined with an increase in performance.  One example cluster saw nearly a 75 percent reduction in space, to the point that we were able to test running two replica instances on a single server in parallel.

What makes this interesting is that the test cluster in question has replicas that periodically show lag and generally are close to the replication limit.  In contrast, the two MyRocks replicas showed no lag during the same test period despite running two instances on one physical server.

While this isn’t something that would be done in production, it was impressive to see double the workload operating so efficiently on the same hardware.  I/O and CPU utilization were both noticeably lower than the single replica running InnoDB.  This shows the potential of increased server longevity and less frequent splits that we were hoping to see.  If these numbers and this performance were to hold into production, we could see savings on the order of millions of dollars across the entire fleet.

Note – this early testing has been limited to replicas only so we can’t yet validate this performance at the production level of concurrency of a primary server.  These results are only with four replica threads, so contention has been minor.

Looking Forward

So far, our limited testing has shown some real promise.  There are still some blockers that we need to overcome, but early results are encouraging. Currently, the biggest technical challenges in this project include:

  • MyRocks currently lacks pt-table-checksum support (for replica consistency)
  • Modifying and validating existing backup/restore/clone automation

Once we are able to fully validate this solution and solve the existing challenges (both fixes have been identified and are on the project roadmap) in the coming months, look for a more detailed case study.  While not every organization needs to make such a drastic change, this use case is one that is becoming more common at the enterprise level as data sets continue to explode.

Want to learn more? Check out the Percona MyRocks Introduction page on our website!

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com