Jan
07
2022
--

Configure wiredTiger cacheSize Inside Percona Distribution for MongoDB Kubernetes Operator


Nowadays we are seeing a lot of customers starting to use our Percona Distribution for MongoDB Kubernetes Operator. The Percona Kubernetes Operators are based on best practices for the configuration of a Percona Server for MongoDB replica set or sharded cluster. A key component in MongoDB is the WiredTiger cache, which defines how much memory the storage engine uses, and we can set it based on our load.

In this blog post, we will see how to define the resources’ memory and set the wiredTiger cache for the shard replicaset to improve the performance of the sharded cluster.

The Necessity of WT cache

The parameter storage.wiredTiger.engineConfig.cacheSizeGB limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache. To accommodate the additional consumers of RAM, you may have to set WiredTiger’s internal cache size properly.

Starting from MongoDB 3.4, the default WiredTiger internal cache size is the larger of either:

50% of (RAM - 1 GB), or 256 MB.

For example, on a system with a total of 4GB of RAM the WiredTiger cache will use 1.5GB of RAM (0.5 * (4 GB – 1 GB) = 1.5 GB). Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache because that is more than half of the total RAM minus one gigabyte (0.5 * (1.25 GB – 1 GB) = 128 MB < 256 MB).

WT cacheSize in Kubernetes Operator

The MongoDB WiredTiger cacheSize can be tuned with the parameter storage.wiredTiger.engineConfig.cacheSizeRatio, and its default value is 0.5. As explained above, if the allocated memory limit is too low, the WT cache is set to 256MB; otherwise it is calculated as per the formula.

Prior to PSMDB operator 1.9.0, the cacheSizeRatio could be tuned under the sharding section of the cr.yaml file. This is deprecated from v1.9.0 and unavailable from v1.12.0 onwards, so you have to use the cacheSizeRatio parameter available under the replsets configuration instead. The main thing to check before changing the cacheSize is that the memory limit allocated in the resources section is large enough for your cacheSize requirement, i.e. the below section limiting the memory:

     resources:
       limits:
         cpu: "300m"
         memory: "0.5G"
       requests:
         cpu: "300m"
         memory: "0.5G"

 

https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/psmdb/container.go#L307

Below is the source code that calculates the WiredTiger cache size from mongod.storage.wiredTiger.engineConfig.cacheSizeRatio:

// In normal situations WiredTiger does this default-sizing correctly but under Docker
// containers WiredTiger fails to detect the memory limit of the Docker container. We
// explicitly set the WiredTiger cache size to fix this.
//
// https://docs.mongodb.com/manual/reference/configuration-options/#storage.wiredTiger.engineConfig.cacheSizeGB//

func getWiredTigerCacheSizeGB(resourceList corev1.ResourceList, cacheRatio float64, subtract1GB bool) float64 {
    maxMemory := resourceList[corev1.ResourceMemory]
    var size float64
    if subtract1GB {
        size = math.Floor(cacheRatio * float64(maxMemory.Value()-gigaByte))
    } else {
        size = math.Floor(cacheRatio * float64(maxMemory.Value()))
    }
    sizeGB := size / float64(gigaByte)
    if sizeGB < minWiredTigerCacheSizeGB {
        sizeGB = minWiredTigerCacheSizeGB
    }
    return sizeGB
}
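As a sanity check, the same arithmetic can be reproduced in the shell. This is just a rough sketch assuming a 3G (3*10^9 bytes) memory limit and a cacheSizeRatio of 0.8, mirroring the subtract1GB branch above:

$ awk 'BEGIN { limit = 3*10^9; gb = 2^30; ratio = 0.8;
       printf "%.2f\n", int(ratio * (limit - gb)) / gb }'
1.44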

 

Changing the cacheSizeRatio

For this test, we deployed the PSMDB operator on GCP. You can refer here for the steps – https://www.percona.com/doc/kubernetes-operator-for-psmongodb/gke.html. With the latest operator v1.11.0, the sharded cluster has been started with one shard replicaSet and one config server replicaSet, along with mongos pods.

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-cluster-name-cfg-0 2/2 Running 0 4m9s
my-cluster-name-cfg-1 2/2 Running 0 2m55s
my-cluster-name-cfg-2 2/2 Running 1 111s
my-cluster-name-mongos-758f9fb44-d4hnh 1/1 Running 0 99s
my-cluster-name-mongos-758f9fb44-d5wfm 1/1 Running 0 99s
my-cluster-name-mongos-758f9fb44-wmvkx 1/1 Running 0 99s
my-cluster-name-rs0-0 2/2 Running 0 4m7s
my-cluster-name-rs0-1 2/2 Running 0 2m55s
my-cluster-name-rs0-2 2/2 Running 0 117s
percona-server-mongodb-operator-58c459565b-fc6k8 1/1 Running 0 5m45s

Now log in to the shard and check the default memory allocated to the container and to the mongod instance. Below, the host has about 15GB of memory, but the memory limit for this container is only 476MB:

rs0:PRIMARY> db.hostInfo()
{
"system" : {
"currentTime" : ISODate("2021-12-30T07:16:59.441Z"),
"hostname" : "my-cluster-name-rs0-0",
"cpuAddrSize" : 64,
"memSizeMB" : NumberLong(15006),
"memLimitMB" : NumberLong(476),
"numCores" : 4,
"cpuArch" : "x86_64",
"numaEnabled" : false
},
"os" : {
"type" : "Linux",
"name" : "Red Hat Enterprise Linux release 8.4 (Ootpa)",
"version" : "Kernel 5.4.144+"
},
"extra" : {
"versionString" : "Linux version 5.4.144+ (builder@7d732a1aec13) (Chromium OS 12.0_pre408248_p20201125-r7 clang version 12.0.0 (/var/tmp/portage/sys-devel/llvm-12.0_pre408248_p20201125-r7/work/llvm-12.0_pre408248_p20201125/clang f402e682d0ef5598eeffc9a21a691b03e602ff58)) #1 SMP Sat Sep 25 09:56:01 PDT 2021",
"libcVersion" : "2.28",
"kernelVersion" : "5.4.144+",
"cpuFrequencyMHz" : "2000.164",
"cpuFeatures" : "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities",
"pageSize" : NumberLong(4096),
"numPages" : 3841723,
"maxOpenFiles" : 1048576,
"physicalCores" : 2,
"mountInfo" : [
..
..

 

The WiredTiger cache size (in MB) allocated in the shard is as follows:

rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024
256
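The same check can be run non-interactively from outside the pod. A hypothetical one-liner (add -u/-p with your cluster credentials, since authentication is enabled by default in the operator):

$ kubectl exec my-cluster-name-rs0-0 -c mongod -- \
    mongo --quiet --eval 'db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024'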

The cache size of 256MB is too low for a real environment, so let's see how to tune the memory limit and the WT cache size. Use the cacheSizeRatio parameter to set the WT cache ratio (out of 1) and the resources memory limit to set the memory allocated to the container. To do this, edit the cr.yaml file under the deploy directory of the operator. From PSMDB operator v1.9.0, editing the cacheSizeRatio parameter under the mongod section is deprecated, so set the WT cache limit with the cacheSizeRatio parameter under the "replsets" section and set the memory with resources.limits.memory. Here we give the container 3G and use 80% of the memory for the cache calculation.

deploy/cr.yaml:58

46    configuration: |
47      # operationProfiling:
48      #   mode: slowOp
49      # systemLog:
50      #   verbosity: 1
51      storage:
52        engine: wiredTiger
53        # inMemory:
54        #   engineConfig:
55        #     inMemorySizeRatio: 0.9
56        wiredTiger:
57          engineConfig:
58            cacheSizeRatio: 0.8

 

deploy/cr.yaml:229-232:

226    resources:
227      limits:
228        cpu: "300m"
229        memory: "3G"
230      requests:
231        cpu: "300m"
232        memory: "3G"

 

Apply the new cr.yaml

# kubectl apply -f deploy/cr.yaml
perconaservermongodb.psmdb.percona.com/my-cluster-name configured

The shard pods are re-allocated and you can check the progress as follows:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-cluster-name-cfg-0 2/2 Running 0 36m
my-cluster-name-cfg-1 2/2 Running 0 35m
my-cluster-name-cfg-2 2/2 Running 1 34m
my-cluster-name-mongos-758f9fb44-d4hnh 1/1 Running 0 34m
my-cluster-name-mongos-758f9fb44-d5wfm 1/1 Running 0 34m
my-cluster-name-mongos-758f9fb44-wmvkx 1/1 Running 0 34m
my-cluster-name-rs0-0 0/2 Init:0/1 0 13s
my-cluster-name-rs0-1 2/2 Running 0 60s
my-cluster-name-rs0-2 2/2 Running 0 8m33s
percona-server-mongodb-operator-58c459565b-fc6k8 1/1 Running 0 38m

Now check the new settings of WT cache as follows:

rs0:PRIMARY> db.hostInfo().system
{
"currentTime" : ISODate("2021-12-30T08:37:38.790Z"),
"hostname" : "my-cluster-name-rs0-1",
"cpuAddrSize" : 64,
"memSizeMB" : NumberLong(15006),
"memLimitMB" : NumberLong(2861),
"numCores" : 4,
"cpuArch" : "x86_64",
"numaEnabled" : false
}
rs0:PRIMARY> 
rs0:PRIMARY> 
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]/1024/1024
1474

Here, the memory calculation for WT is done roughly as follows (the memory limit should be more than 1G, otherwise 256MB is allocated by default):

(memory limit – 1G) * cacheSizeRatio

(2861MB – 1024MB) * 0.8 ≈ 1470MB, which is close to the 1474MB reported above.

 

NOTE:

Up to PSMDB operator v1.10.0, the operator applies a change of cacheSizeRatio only if resources.limits.cpu is also set. This was a bug and it got fixed in v1.11.0 – refer to https://jira.percona.com/browse/K8SPSMDB-603. So if you're on an older version, don't be surprised, and make sure resources.limits.cpu is set as well.

https://github.com/percona/percona-server-mongodb-operator/blob/v1.10.0/pkg/psmdb/container.go#L194

if limit, ok := resources.Limits[corev1.ResourceCPU]; ok && !limit.IsZero() {
    args = append(args, fmt.Sprintf(
        "--wiredTigerCacheSizeGB=%.2f",
        getWiredTigerCacheSizeGB(resources.Limits, replset.Storage.WiredTiger.EngineConfig.CacheSizeRatio, true),
    ))
}

From v1.11.0:
https://github.com/percona/percona-server-mongodb-operator/blob/v1.11.0/pkg/psmdb/container.go#L194

if limit, ok := resources.Limits[corev1.ResourceMemory]; ok && !limit.IsZero() {
    args = append(args, fmt.Sprintf(
        "--wiredTigerCacheSizeGB=%.2f",
        getWiredTigerCacheSizeGB(resources.Limits, replset.Storage.WiredTiger.EngineConfig.CacheSizeRatio, true),
    ))
}
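Either way, once the operator has applied the setting, you can verify the flag it passed to mongod by inspecting the pod spec. A rough sketch (the container name "mongod" is an assumption based on this test cluster); for the 3G/0.8 example above you would expect to see --wiredTigerCacheSizeGB=1.44:

$ kubectl get pod my-cluster-name-rs0-0 \
    -o jsonpath='{.spec.containers[?(@.name=="mongod")].args}' | tr ' ' '\n' | grep wiredTigerCacheSizeGB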

 

Conclusion

Based on the application load, you will need to set the WT cacheSize for better performance. You can use the above methods to tune the cache size for the shard replicaSet in the PSMDB operator.

Reference Links :

https://www.percona.com/doc/kubernetes-operator-for-psmongodb/operator.html

https://www.percona.com/doc/kubernetes-operator-for-psmongodb/gke.html

https://www.percona.com/doc/kubernetes-operator-for-psmongodb/operator.html#mongod-storage-wiredtiger-engineconfig-cachesizeratio

MongoDB 101: How to Tune Your MongoDB Configuration After Upgrading to More Memory

Dec
30
2021
--

MongoDB Config Server Upgrade From SCCC to CSRS ReplicaSet


Recently, I had some of our customers upgrading to v4.x from v3.0 (yes, we still see older MongoDB versions, and we suggest everyone upgrade to stay on the current GA!). There were some pain points in the upgrade process; in particular, before migrating to v3.4, you need to change from the SCCC config setup to the CSRS replicaSet setup. I did some tests in that process and would like to share them here. Even though this topic is pretty outdated, it should be beneficial to people out there who still need to migrate from an older version and are stuck at this config server upgrade.

In this blog post, we will see how to migrate a mirrored config server configuration (SCCC) to the replicaSet configuration (CSRS), along with changing the storage engine from MMAPv1 to WiredTiger. From MongoDB 3.2.4, the replicaSet configuration is supported for config servers too, and support for the SCCC (Sync Cluster Connection Configuration) setup is removed in MongoDB 3.4. So it is important to make sure the config servers are configured as a replicaSet before upgrading to MongoDB 3.4. Also, the config servers must run on WiredTiger when configured as a replicaSet. There are two ways to do this – one is to replace the existing config servers in place, and the other is to migrate to a new set of servers/ports. Let's see them in the following topics.

Migrate From SCCC to CSRS on the Different Hosts/Ports:

This section gives a brief on migrating the existing mirrored hosts/ports to different hosts or ports. It is always recommended to back up your config servers before starting and to schedule downtime to avoid any writes during this crucial step. The test sharded cluster setup runs on Percona Server for MongoDB (PSMDB) v3.2.22 with SCCC set up on the following ports:

47080 – mongos
47087, 47088, 47089 – mongoc – config servers with mmapv1 engine (sccc)
47081, 47082, 47083 – shard01 replicaset
47084, 47085, 47086 – shard02 replicaset

New Config servers to migrate (for testing purposes, using localhost but different ports to migrate):

47090, 47091, 47092 – new mongoc (csrs)

Now stop the balancer:

mongos> sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...
WriteResult({ "nMatched" : 0, "nUpserted" : 1, "nModified" : 0, "_id" : "balancer" })
mongos> 
mongos> sh.getBalancerState()
false

 

Initiate the replicaSet in one of the config servers (here using config db running on 47087). Use this member as a base for migrating from SCCC to CSRS:

$ mongo32 localhost:47087/admin
MongoDB shell version: 3.2.21-2-g105acca0d4
connecting to: localhost:47087/admin
configsvr> 
configsvr> rs.initiate({ _id: "configRepl", configsvr: true, version:1, members: [ {_id:0, host: "localhost:47087"} ] } )
{ "ok" : 1 }

 

Start the new config DB instances using the same replicaset name used in the above initiate command:

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/configrepl/rs1/db/ --logpath /home/balaguru/mongodb/testshard32/data/configrepl/rs1/mongod.log --port 47090 --fork --configsvr --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 476923
child process started successfully, parent exiting

 

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/configrepl/rs2/db/ --logpath /home/balaguru/mongodb/testshard32/data/configrepl/rs2/mongod.log --port 47091 --fork --configsvr --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 478193
child process started successfully, parent exiting

 

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/configrepl/rs3/db/ --logpath /home/balaguru/mongodb/testshard32/data/configrepl/rs3/mongod.log --port 47092 --fork --configsvr --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 478254
child process started successfully, parent exiting

 

Now restart 47087 with the --replSet and --configsvrMode options, which allow this mirrored config DB to be part of a replicaSet while still running the MMAPv1 engine.

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongod --dbpath /home/balaguru/mongodb/testshard32/data/config/db --logpath /home/balaguru/mongodb/testshard32/data/config/mongod.log --port 47087 --fork --configsvr --storageEngine mmapv1 --configsvrMode sccc --replSet configRepl
about to fork child process, waiting until server is ready for connections.
forked process: 477682
child process started successfully, parent exiting

 

Add the new config DB instances to this replicaSet (set priority/votes 0 to avoid election):

configRepl:PRIMARY> rs.add({host: "localhost:47090", priority:0, votes:0})
{ "ok" : 1 }

configRepl:PRIMARY> rs.add({host: "localhost:47091", priority:0, votes:0})
{ "ok" : 1 }

configRepl:PRIMARY> rs.add({host: "localhost:47092", priority:0, votes:0})
{ "ok" : 1 }

configRepl:PRIMARY> rs.status()
{
        "set" : "configRepl",
        "date" : ISODate("2021-11-23T08:11:13.065Z"),
        "myState" : 1,
        "term" : NumberLong(1),
        "configsvr" : true,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "localhost:47087",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 154,
                        "optime" : {
                                "ts" : Timestamp(1637655072, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:12Z"),
                        "electionTime" : Timestamp(1637654919, 1),
                        "electionDate" : ISODate("2021-11-23T08:08:39Z"),
                        "configVersion" : 4,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "localhost:47090",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 69,
                        "optime" : {
                                "ts" : Timestamp(1637655069, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:09Z"),
                        "lastHeartbeat" : ISODate("2021-11-23T08:11:11.608Z"),
                       "lastHeartbeatRecv" : ISODate("2021-11-23T08:11:11.611Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "localhost:47087",
                        "configVersion" : 4
                },
                {
                        "_id" : 2,
                        "name" : "localhost:47091",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 6,
                        "optime" : {
                                "ts" : Timestamp(1637655069, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:09Z"),
                        "lastHeartbeat" : ISODate("2021-11-23T08:11:11.608Z"),
                        "lastHeartbeatRecv" : ISODate("2021-11-23T08:11:09.609Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "localhost:47090",
                        "configVersion" : 4
                },
                {
                        "_id" : 3,
                        "name" : "localhost:47092",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 3,
                        "optime" : {
                                "ts" : Timestamp(1637655069, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2021-11-23T08:11:09Z"),
                        "lastHeartbeat" : ISODate("2021-11-23T08:11:11.608Z"),
                        "lastHeartbeatRecv" : ISODate("2021-11-23T08:11:11.620Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "localhost:47091",
                        "configVersion" : 4
                }
        ],
        "ok" : 1
}

 

Shut down one of the other mirrored config servers (either 47088 or 47089). If one of the mirrored DBs is down during the migration, the sharded cluster goes into read-only mode, i.e. DDL statements fail (as in the example below). So it is advised to do this activity during a downtime window to avoid unexpected errors from the application side.

// try to create new db and collection as follows to test when 47088 is down
mongos> use vinodh  
switched to db vinodh
mongos> db.testCreateColl.insert({id:1})
WriteResult({
        "writeError" : {
                "code" : 13663,
                "errmsg" : "unable to target write op for collection vinodh.testCreateColl :: caused by :: Location13663: exception creating distributed lock vinodh/balaguru-UbuntuPC:47080:1637654366:1487823871 :: caused by :: SyncClusterConnection::update prepare failed:  localhost:47088 (127.0.0.1) failed:9001 socket exception [CONNECT_ERROR] server [couldn't connect to server localhost:47088, connection attempt failed] "
        }
})

 

Once the replicaSet is in sync, fetch the current configuration with cfg = rs.conf() and change the priority and votes of the other nodes so they can participate in elections (a scripted alternative is sketched after the transcript):

configRepl:PRIMARY> cfg.members[1].host
localhost:47090
configRepl:PRIMARY> cfg.members[1].priority
0
configRepl:PRIMARY> cfg.members[1].priority = 1
1
configRepl:PRIMARY> cfg.members[1].votes 
0
configRepl:PRIMARY> cfg.members[1].votes = 1
1
configRepl:PRIMARY> cfg.members[2].priority = 1
1
configRepl:PRIMARY> cfg.members[3].priority = 1
1
configRepl:PRIMARY> cfg.members[2].votes = 1
1
configRepl:PRIMARY> cfg.members[3].votes = 1
1
configRepl:PRIMARY> rs.reconfig(cfg)
{ "ok" : 1 }

 

Now move the PRIMARY role to a new member and make 47087 a SECONDARY:

configRepl:PRIMARY> rs.stepDown()
2021-11-23T13:58:37.885+0530 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host 'localhost:47087'  :
DB.prototype.runCommand@src/mongo/shell/db.js:135:1
DB.prototype.adminCommand@src/mongo/shell/db.js:152:1
rs.stepDown@src/mongo/shell/utils.js:1202:12
@(shell):1:1

2021-11-23T13:58:37.887+0530 I NETWORK  [thread1] trying reconnect to localhost:47087 (127.0.0.1) failed
2021-11-23T13:58:37.887+0530 I NETWORK  [thread1] reconnect localhost:47087 (127.0.0.1) ok
configRepl:SECONDARY>

 

Restart the old config member 47087 without the --configsvrMode option. Here 47087 will go into the REMOVED state, which is expected as it still runs the MMAPv1 engine and no longer has the configsvrMode option. Then you can restart mongos with a new --configdb option listing only the new config servers (--configdb "configRepl/localhost:47090,localhost:47091,localhost:47092"):

$ /home/balaguru/mongodb/mongodb-linux-x86_64-3.2.21-2-g105acca0d4/bin/mongos --logpath /home/balaguru/mongodb/testshard32/data/mongos.log --port 47080 --configdb "configRepl/localhost:47090,localhost:47091,localhost:47092" --fork

 

Finally, 47087 is ready to be removed from the replicaSet:

configRepl:PRIMARY> rs.remove("localhost:47087")
{ "ok" : 1 }

You can then check the shard status and the data through mongos. Once everything looks good, stop the other mirrored config DBs on ports 47088 and 47089, and then re-enable the balancer.
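A quick way to do that check from the shell, assuming the mongos port used in this test setup:

$ mongo localhost:47080/admin --eval 'sh.status()'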

 

Migrate From SCCC to CSRS on the Same Hosts:

This section gives a brief on replacing the existing mirrored hosts/ports in place. The steps up to enabling the replicaSet on one of the mirrored config servers are the same as in the previous method. For this test case, the sharded cluster setup runs PSMDB v3.2.22 with SCCC set up on the following ports:

27100 – mongos
27107, 27108, 27109 – mongoc – config servers with mmapv1 engine
27101, 27102, 27103 – shard01 replicaset
27104, 27105, 27106 – shard02 replicaset

Here we use the instance running on port 27107 to start the migration. As in the method above:

  • Initiate the replicaSet
  • Restart the instance with the --replSet and --configsvrMode options

The replicaSet status on the 27107 instance:

csReplSet:PRIMARY> rs.status()
{
 "set" : "csReplSet",
 "date" : ISODate("2021-10-07T09:04:20.266Z"),
 "myState" : 1,
 "term" : NumberLong(1),
 "configsvr" : true,
 "heartbeatIntervalMillis" : NumberLong(2000),
 "members" : [
  {
   "_id" : 0,
   "name" : "localhost:27107",
   "health" : 1,
   "state" : 1,
   "stateStr" : "PRIMARY",
   "uptime" : 9,
   "optime" : {
    "ts" : Timestamp(1633597452, 1),
    "t" : NumberLong(1)
   },
   "optimeDate" : ISODate("2021-10-07T09:04:12Z"),
   "infoMessage" : "could not find member to sync from",
   "electionTime" : Timestamp(1633597451, 1),
   "electionDate" : ISODate("2021-10-07T09:04:11Z"),
   "configVersion" : 1,
   "self" : true
  }
 ],
 "ok" : 1
}

 

Then stop the second config instance (27108) and clear its dbpath to remove the MMAPv1 files. Start it with the replSet + WT engine options like below:

$ rm -rf data/config2/db/*

$ /home/vinodh.krishnaswamy/mongo/mongodb-linux-x86_64-3.2.22/bin/mongod --dbpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config2/db --logpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config2/mongod.log --port 27108 --fork --configsvr  --wiredTigerCacheSizeGB 1  --replSet csReplSet
about to fork child process, waiting until server is ready for connections.
forked process: 42341
child process started successfully, parent exiting

At this point, as mentioned earlier, remember that the sharded cluster goes into read-only mode when an existing mirrored instance is down or unreachable. So don't be surprised if you see DDL errors in this situation.

Now add 27108 to the replicaSet from the PRIMARY (27107) as follows, with priority: 0 and votes: 0 to keep this server from taking part in elections:

$ mongo32 localhost:27107
MongoDB shell version: 3.2.22
connecting to: localhost:27107/test
csReplSet:PRIMARY> rs.add({host: "localhost:27108", priority: 0, votes: 0 })
{ "ok" : 1 }

 

Then repeat the same for the third config server, 27109: remove its DB files, restart it with the --replSet and WT options, and add it to the replicaSet from the PRIMARY:

csReplSet:PRIMARY> rs.add({host: "localhost:27109", priority:0 , votes: 0})
{ "ok" : 1 }

 

Once they are in sync and the replicaSet looks healthy, you can adjust the priority and votes of the added members (27108, 27109) as follows to allow them to participate in elections.

csReplSet:PRIMARY> cfg = rs.conf()
{
 "_id" : "csReplSet",
 "version" : 3,
 "configsvr" : true,
 "protocolVersion" : NumberLong(1),
 "members" : [
  {
   "_id" : 0,
   "host" : "localhost:27107",
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 1,
   "tags" : {
   },
   "slaveDelay" : NumberLong(0),
   "votes" : 1
  },
  {
   "_id" : 1,
   "host" : "localhost:27108",
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 0,
   "tags" : {
   },
   "slaveDelay" : NumberLong(0),
  "votes" : 0
  },
  {
   "_id" : 2,
   "host" : "localhost:27109",
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 0,
   "tags" : {
   },
   "slaveDelay" : NumberLong(0),
   "votes" : 0
  }
 ],

 "settings" : {
  "chainingAllowed" : true,
  "heartbeatIntervalMillis" : 2000,
  "heartbeatTimeoutSecs" : 10,
  "electionTimeoutMillis" : 10000,
  "getLastErrorModes" : {
  },
  "getLastErrorDefaults" : {
   "w" : 1,
   "wtimeout" : 0
  },
  "replicaSetId" : ObjectId("615eb78fdb1ed4092c166786")
 }
}

csReplSet:PRIMARY> cfg.members[1].priority = 1
1
csReplSet:PRIMARY> cfg.members[2].priority = 1
1
csReplSet:PRIMARY> cfg.members[1].votes = 1
1
csReplSet:PRIMARY> cfg.members[2].votes = 1
1
csReplSet:PRIMARY> cfg.members[2].votes
1
csReplSet:PRIMARY> rs.reconfig(cfg)
{ "ok" : 1 }

 

Then go to the first config server, 27107, and step it down so that another member becomes PRIMARY. You can then take it down and change its storage engine to WT as above, this time without the configsvrMode option. When it joins back, it does an initial sync from its syncSource in the replicaSet.

csReplSet:PRIMARY> rs.stepDown()
2021-10-07T05:14:26.617-0400 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host 'localhost:27107'  :
DB.prototype.runCommand@src/mongo/shell/db.js:135:1
DB.prototype.adminCommand@src/mongo/shell/db.js:153:16
rs.stepDown@src/mongo/shell/utils.js:1202:12
@(shell):1:1

2021-10-07T05:14:26.619-0400 I NETWORK  [thread1] trying reconnect to localhost:27107 (127.0.0.1) failed
2021-10-07T05:14:26.619-0400 I NETWORK  [thread1] reconnect localhost:27107 (127.0.0.1) ok
csReplSet:SECONDARY>

 

Keep in mind that if you restart it without the --configsvrMode option but forget to change the storage engine to WT, it will go into the REMOVED state. (Here it was started this way on purpose, as an example.)

$ mongo32 localhost:27107
MongoDB shell version: 3.2.22
connecting to: localhost:27107/test
csReplSet:REMOVED>

 

Now, bring down the DB instance 27107, clear its DB files, and start it back with the WT and replSet options (removing the configsvrMode option this time):

$ rm -rf data/config/db/*

$ /home/vinodh.krishnaswamy/mongo/mongodb-linux-x86_64-3.2.22/bin/mongod --dbpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config/db --logpath /home/vinodh.krishnaswamy/mongo/testshard32/data/config/mongod.log --port 27107 --fork --configsvr --wiredTigerCacheSizeGB 1 --replSet csReplSet
about to fork child process, waiting until server is ready for connections.
forked process: 35202
child process started successfully, parent exiting

 

Then restart the mongos instances with the below --configdb option for the new config servers to take effect:

mongos <options> --configdb csReplSet/localhost:27107,localhost:27108,localhost:27109

From the mongos logfile, you will see the below message if the config server replicaset was connected successfully:

2021-10-07T05:48:25.679-0400 I NETWORK  [conn6] Successfully connected to csReplSet/localhost:27107,localhost:27108,localhost:27109 (1 connections now open to csReplSet/localhost:27107,localhost:27108,localhost:27109 with a 0 second timeout)

Conclusion

We at Percona don't recommend running EOL versions, as they don't get bug fixes and you cannot use recently added features. I hope this blog helps those who are still looking to migrate from older versions.

Nov
22
2021
--

Temporary Tables in MySQL – Never Ending Story?


If you ever had to deal with performance and/or disk space issues related to temporary tables, I bet you eventually found yourself puzzled. There are many possible scenarios depending on the type of temporary table, settings, and MySQL version used. We have observed a pretty long evolution in that matter, for a couple of reasons. One of them was the need to completely eliminate the use of the deprecated MyISAM engine, and at the same time introduce more performant and reliable alternatives. Another set of improvements was related to InnoDB, where it was necessary to lower the overhead of temporary tables using that engine.

For that reason, I decided to gather this information in a summary that may help to troubleshoot temporary table usage. Due to the vast changes between major MySQL releases, I divided the article by version.

MySQL 5.6

(If you are still using that version, you are encouraged to upgrade soon, as it has reached EOL.)

User-Created Temporary Tables

When a table is created using the CREATE TEMPORARY TABLE clause, it uses the engine defined by default_tmp_storage_engine (InnoDB by default) unless explicitly defined otherwise, and it is stored inside the directory defined by the tmpdir variable.

An example may look like this:

mysql > create temporary table tmp1 (id int, a varchar(10));
Query OK, 0 rows affected (0.02 sec)

mysql > show create table tmp1\G
*************************** 1. row ***************************
Table: tmp1
Create Table: CREATE TEMPORARY TABLE `tmp1` (
`id` int(11) DEFAULT NULL,
`a` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

But how do you find the file created on disk that stores this table's data? This query may help:

mysql > select table_id,space,name,path from information_schema.INNODB_SYS_DATAFILES join information_schema.INNODB_SYS_TABLES using (space) where name like '%tmp%'\G
*************************** 1. row ***************************
table_id: 21
space: 7
name: tmp/#sql11765a_2_1
path: /data/sandboxes/msb_5_6_51/tmp/#sql11765a_2_1.ibd
1 row in set (0.00 sec)

We don’t see the original table name here. Even by looking at the buffer pool, we still don’t have the real name:

mysql > select TABLE_NAME from information_schema.INNODB_BUFFER_PAGE where table_name like '%tmp%';
+-------------------------+
| TABLE_NAME              |
+-------------------------+
| `tmp`.`#sql11765a_2_1`  |
+-------------------------+
1 row in set (0.07 sec)

Here comes the extension available in the Percona Server for MySQL 5.6 variant – additional information_schema table: GLOBAL_TEMPORARY_TABLES. With that one, we can craft a query that provides a bit more information:

mysql > select SPACE,TABLE_SCHEMA,TABLE_NAME,ENGINE,g.NAME,PATH from information_schema.GLOBAL_TEMPORARY_TABLES g LEFT JOIN information_schema.INNODB_SYS_TABLES s ON s.NAME LIKE CONCAT('%', g.name, '%') LEFT JOIN information_schema.INNODB_SYS_DATAFILES USING(SPACE)\G
*************************** 1. row ***************************
SPACE: 16
TABLE_SCHEMA: test
TABLE_NAME: tmp1
ENGINE: InnoDB
NAME: #sql12c75d_2_0
PATH: /data/sandboxes/msb_ps5_6_47/tmp/#sql12c75d_2_0.ibd
*************************** 2. row ***************************
SPACE: NULL
TABLE_SCHEMA: test
TABLE_NAME: tmp3
ENGINE: MEMORY
NAME: #sql12c75d_2_2
PATH: NULL
*************************** 3. row ***************************
SPACE: NULL
TABLE_SCHEMA: test
TABLE_NAME: tmp2
ENGINE: MyISAM
NAME: #sql12c75d_2_1
PATH: NULL
3 rows in set (0.00 sec)

So at least for the InnoDB temp table, we can correlate the exact table name with the file path.

Internal Temporary Tables

These are ones created by MySQL in the process of executing a query. We don’t have any access to such tables, but let’s see how we can investigate their usage.

This type is created in memory (using the MEMORY engine) as long as its size doesn't exceed either the tmp_table_size or max_heap_table_size variable and no TEXT/BLOB columns are in use. If such a table has to be stored on disk though, in MySQL 5.6 it uses MyISAM storage, again with tmpdir as the location. A quick example, on a 10M-row sysbench table, of a query producing a big internal temporary table:

mysql > SELECT pad, COUNT(*) FROM sbtest1 GROUP BY pad;

And we can see the related files growing:

$ ls -lh /data/sandboxes/msb_5_6_51/tmp/
total 808M
-rw-rw---- 1 przemek przemek 329M Sep 29 23:24 '#sql_11765a_0.MYD'
-rw-rw---- 1 przemek przemek 479M Sep 29 23:24 '#sql_11765a_0.MYI'

It may be difficult to correlate a particular temp table and its client connection though. The only information I found is:

mysql > select FILE_NAME,EVENT_NAME from performance_schema.file_summary_by_instance where file_name like '%tmp%' \G
*************************** 1. row ***************************
FILE_NAME: /data/sandboxes/msb_5_6_51/tmp/Innodb Merge Temp File
EVENT_NAME: wait/io/file/innodb/innodb_temp_file
*************************** 2. row ***************************
FILE_NAME: /data/sandboxes/msb_5_6_51/tmp/#sql_11765a_0.MYI
EVENT_NAME: wait/io/file/myisam/kfile
*************************** 3. row ***************************
FILE_NAME: /data/sandboxes/msb_5_6_51/tmp/#sql_11765a_0.MYD
EVENT_NAME: wait/io/file/myisam/dfile
3 rows in set (0.00 sec)
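For an aggregate view of how often internal temporary tables are created and how many of them spill to disk, the standard status counters are also worth checking (a plain client call; the values of course depend on your workload):

$ mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%'"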

MySQL 5.7

User-Created Temporary Tables

As earlier, the default_tmp_storage_engine variable decides on the engine used. But two changes happened here. InnoDB temporary tables now use a common dedicated shared tablespace – ibtmp1, unless it is compressed. Moreover, we have an additional information_schema view: INNODB_TEMP_TABLE_INFO. Given that, we can get information like below:

mysql > select name, FILE_NAME, FILE_TYPE, TABLESPACE_NAME, SPACE, PER_TABLE_TABLESPACE, IS_COMPRESSED from INFORMATION_SCHEMA.INNODB_TEMP_TABLE_INFO join INFORMATION_SCHEMA.FILES on FILE_ID=SPACE\G
*************************** 1. row ***************************
name: #sql12cf58_2_5
FILE_NAME: ./ibtmp1
FILE_TYPE: TEMPORARY
TABLESPACE_NAME: innodb_temporary
SPACE: 109
PER_TABLE_TABLESPACE: FALSE
IS_COMPRESSED: FALSE
*************************** 2. row ***************************
name: #sql12cf58_2_4
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_4.ibd
FILE_TYPE: TEMPORARY
TABLESPACE_NAME: innodb_file_per_table_110
SPACE: 110
PER_TABLE_TABLESPACE: TRUE
IS_COMPRESSED: TRUE
2 rows in set (0.01 sec)

But again to correlate with a table name, the Percona Server for MySQL extension needs to be used:

mysql > select g.TABLE_SCHEMA, g.TABLE_NAME, name, FILE_NAME, FILE_TYPE, TABLESPACE_NAME, SPACE, PER_TABLE_TABLESPACE, IS_COMPRESSED from INFORMATION_SCHEMA.INNODB_TEMP_TABLE_INFO join INFORMATION_SCHEMA.FILES on FILE_ID=SPACE join information_schema.GLOBAL_TEMPORARY_TABLES g using (name)\G
*************************** 1. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp1
name: #sql12cf58_2_5
FILE_NAME: ./ibtmp1
FILE_TYPE: TEMPORARY
TABLESPACE_NAME: innodb_temporary
SPACE: 109
PER_TABLE_TABLESPACE: FALSE
IS_COMPRESSED: FALSE
*************************** 2. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp3
name: #sql12cf58_2_4
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_4.ibd
FILE_TYPE: TEMPORARY
TABLESPACE_NAME: innodb_file_per_table_110
SPACE: 110
PER_TABLE_TABLESPACE: TRUE
IS_COMPRESSED: TRUE
2 rows in set (0.01 sec)

Alternatively, to also see MyISAM and the related .frm files, we can use:

mysql > SELECT g.TABLE_SCHEMA, g.TABLE_NAME, NAME, f.FILE_NAME, g.ENGINE, TABLESPACE_NAME, PER_TABLE_TABLESPACE, SPACE FROM information_schema.GLOBAL_TEMPORARY_TABLES g join performance_schema.file_instances f ON FILE_NAME LIKE CONCAT('%', g.name, '%') left join INFORMATION_SCHEMA.INNODB_TEMP_TABLE_INFO using (name) left join INFORMATION_SCHEMA.FILES fl on space=FILE_ID order by table_name\G
*************************** 1. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp1
NAME: #sql12cf58_2_5
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_5.frm
ENGINE: InnoDB
TABLESPACE_NAME: innodb_temporary
PER_TABLE_TABLESPACE: FALSE
SPACE: 109
*************************** 2. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp2
NAME: #sql12cf58_2_6
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_6.MYD
ENGINE: MyISAM
TABLESPACE_NAME: NULL
PER_TABLE_TABLESPACE: NULL
SPACE: NULL
*************************** 3. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp2
NAME: #sql12cf58_2_6
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_6.MYI
ENGINE: MyISAM
TABLESPACE_NAME: NULL
PER_TABLE_TABLESPACE: NULL
SPACE: NULL
*************************** 4. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp2
NAME: #sql12cf58_2_6
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_6.frm
ENGINE: MyISAM
TABLESPACE_NAME: NULL
PER_TABLE_TABLESPACE: NULL
SPACE: NULL
*************************** 5. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp3
NAME: #sql12cf58_2_4
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_4.frm
ENGINE: InnoDB
TABLESPACE_NAME: innodb_file_per_table_110
PER_TABLE_TABLESPACE: TRUE
SPACE: 110
*************************** 6. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tmp3
NAME: #sql12cf58_2_4
FILE_NAME: /data/sandboxes/msb_ps5_7_33/tmp/#sql12cf58_2_4.ibd
ENGINE: InnoDB
TABLESPACE_NAME: innodb_file_per_table_110
PER_TABLE_TABLESPACE: TRUE
SPACE: 110
6 rows in set (0.01 sec)

Internal Temporary Tables

For internal temporary tables in 5.7, the in-memory behavior is similar. But the engine for on-disk temp tables is now defined via a new variable, internal_tmp_disk_storage_engine, which also defaults to InnoDB, and the ibtmp1 tablespace is used to store their contents.
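To confirm which engines and thresholds your 5.7 instance uses for temporary tables, a quick check of the relevant variables is enough; a minimal sketch using the mysql client:

$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('default_tmp_storage_engine','internal_tmp_disk_storage_engine','tmp_table_size','max_heap_table_size')"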

Insight into this shared temporary tablespace is pretty limited. We can check its size and how much free space is currently available. Here is an example view taken while a heavy query was running:

mysql > select FILE_NAME, FILE_TYPE, TABLESPACE_NAME, ENGINE, TOTAL_EXTENTS, FREE_EXTENTS, EXTENT_SIZE/1024/1024 as 'extent in MB', MAXIMUM_SIZE from INFORMATION_SCHEMA.FILES where file_name like '%ibtmp%'\G
*************************** 1. row ***************************
FILE_NAME: ./ibtmp1
FILE_TYPE: TEMPORARY
TABLESPACE_NAME: innodb_temporary
ENGINE: InnoDB
TOTAL_EXTENTS: 588
FREE_EXTENTS: 1
extent in MB: 1.00000000
MAXIMUM_SIZE: NULL
1 row in set (0.00 sec)

And after the query is finished we can see most of the space is freed (FREE_EXTENTS):

mysql > select FILE_NAME, FILE_TYPE, TABLESPACE_NAME, ENGINE, TOTAL_EXTENTS, FREE_EXTENTS, EXTENT_SIZE/1024/1024 as 'extent in MB', MAXIMUM_SIZE from INFORMATION_SCHEMA.FILES where file_name like '%ibtmp%'\G
*************************** 1. row ***************************
FILE_NAME: ./ibtmp1
FILE_TYPE: TEMPORARY
TABLESPACE_NAME: innodb_temporary
ENGINE: InnoDB
TOTAL_EXTENTS: 780
FREE_EXTENTS: 764
extent in MB: 1.00000000
MAXIMUM_SIZE: NULL
1 row in set (0.00 sec)

However, the tablespace won’t be truncated unless MySQL is restarted:

$ ls -lh msb_5_7_35/data/ibtmp*
-rw-r----- 1 przemek przemek 780M Sep 30 19:50 msb_5_7_35/data/ibtmp1

To see the writing activity (which may turn out to be much higher for a single query than total size growth made by it):

mysql > select FILE_NAME, SUM_NUMBER_OF_BYTES_WRITE/1024/1024/1024 as GB_written from performance_schema.file_summary_by_instance where file_name like '%ibtmp%' \G
*************************** 1. row ***************************
FILE_NAME: /data/sandboxes/msb_5_7_35/data/ibtmp1
GB_written: 46.925933837891
1 row in set (0.00 sec)

MySQL 8.0

For simplicity, let’s skip how things worked before 8.0.16 and discuss only how it works since then, as the changes in that matter are quite significant:

  • internal_tmp_disk_storage_engine variable was removed and it is no longer possible to use the MyISAM engine for internal temporary tables
  • shared ibtmp1 table space is no longer used for either temporary table type
  • a pool of new Session Temporary Tablespaces was introduced to handle both user and internal temporary tables on disk and is located by default in the main data directory
  • the new TempTable engine for in-memory tables uses both the space in memory as well as mmapped files on disk

User-Created Temporary Tables

For an example temporary table:

mysql > create temporary table tmp1 (id int, a varchar(10));
Query OK, 0 rows affected (0.00 sec)

mysql > select * from information_schema.INNODB_TEMP_TABLE_INFO;
+----------+----------------+--------+------------+
| TABLE_ID | NAME           | N_COLS | SPACE      |
+----------+----------------+--------+------------+
|     1089 | #sqlbbeb3_a_12 |      5 | 4243767289 |
+----------+----------------+--------+------------+
1 row in set (0.00 sec)

We can correlate which file was used from that pool by looking at the space number:

mysql > select * from INFORMATION_SCHEMA.INNODB_SESSION_TEMP_TABLESPACES ;
+----+------------+----------------------------+-------+----------+-----------+
| ID | SPACE      | PATH                       | SIZE  | STATE    | PURPOSE   |
+----+------------+----------------------------+-------+----------+-----------+
| 10 | 4243767290 | ./#innodb_temp/temp_10.ibt | 81920 | ACTIVE   | INTRINSIC |
| 10 | 4243767289 | ./#innodb_temp/temp_9.ibt  | 98304 | ACTIVE   | USER      |
|  0 | 4243767281 | ./#innodb_temp/temp_1.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767282 | ./#innodb_temp/temp_2.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767283 | ./#innodb_temp/temp_3.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767284 | ./#innodb_temp/temp_4.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767285 | ./#innodb_temp/temp_5.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767286 | ./#innodb_temp/temp_6.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767287 | ./#innodb_temp/temp_7.ibt  | 81920 | INACTIVE | NONE      |
|  0 | 4243767288 | ./#innodb_temp/temp_8.ibt  | 81920 | INACTIVE | NONE      |
+----+------------+----------------------------+-------+----------+-----------+
10 rows in set (0.00 sec)

But again, no way to look for the table name. Fortunately, Percona Server for MySQL still has the GLOBAL_TEMPORARY_TABLES table, so given the three available system views, we can get better information on user-created temporary tables using various engines, like below:

mysql > SELECT SESSION_ID, SPACE, PATH, TABLE_SCHEMA, TABLE_NAME, SIZE, DATA_LENGTH, INDEX_LENGTH, ENGINE, PURPOSE FROM information_schema.GLOBAL_TEMPORARY_TABLES LEFT JOIN information_schema.INNODB_TEMP_TABLE_INFO USING(NAME) LEFT JOIN INFORMATION_SCHEMA.INNODB_SESSION_TEMP_TABLESPACES USING(SPACE)\G
*************************** 1. row ***************************
  SESSION_ID: 10
      SPACE: 4243767290
        PATH: ./#innodb_temp/temp_10.ibt
TABLE_SCHEMA: test
  TABLE_NAME: tmp3
        SIZE: 98304
DATA_LENGTH: 16384
INDEX_LENGTH: 0
      ENGINE: InnoDB
    PURPOSE: USER
*************************** 2. row ***************************
  SESSION_ID: 13
      SPACE: NULL
        PATH: NULL
TABLE_SCHEMA: test
  TABLE_NAME: tmp41
        SIZE: NULL
DATA_LENGTH: 24
INDEX_LENGTH: 1024
      ENGINE: MyISAM
    PURPOSE: NULL
*************************** 3. row ***************************
  SESSION_ID: 13
      SPACE: NULL
        PATH: NULL
TABLE_SCHEMA: test
  TABLE_NAME: tmp40
        SIZE: NULL
DATA_LENGTH: 128256
INDEX_LENGTH: 0
      ENGINE: MEMORY
    PURPOSE: NULL
*************************** 4. row ***************************
  SESSION_ID: 13
      SPACE: 4243767287
        PATH: ./#innodb_temp/temp_7.ibt
TABLE_SCHEMA: test
  TABLE_NAME: tmp33
        SIZE: 98304
DATA_LENGTH: 16384
INDEX_LENGTH: 0
      ENGINE: InnoDB
    PURPOSE: USER
4 rows in set (0.01 sec)

Similar to ibtmp1, these tablespaces are not truncated apart from MySQL restart.

From the above, we can see that user connection 10 has one open InnoDB temporary table, and connection 13 has three temporary tables using three different engines.

Internal Temporary Tables

While a heavy query is running in connection 10, we can get the following views:

mysql > show processlist\G
...
*************************** 2. row ***************************
    Id: 10
  User: msandbox
  Host: localhost
    db: test
Command: Query
  Time: 108
  State: converting HEAP to ondisk
  Info: SELECT pad, COUNT(*) FROM sbtest1 GROUP BY pad

mysql > select * from performance_schema.memory_summary_global_by_event_name where EVENT_NAME like '%temptable%'\G
*************************** 1. row ***************************
                  EVENT_NAME: memory/temptable/physical_disk
                COUNT_ALLOC: 2
                  COUNT_FREE: 0
  SUM_NUMBER_OF_BYTES_ALLOC: 1073741824
    SUM_NUMBER_OF_BYTES_FREE: 0
              LOW_COUNT_USED: 0
          CURRENT_COUNT_USED: 2
            HIGH_COUNT_USED: 2
    LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 1073741824
  HIGH_NUMBER_OF_BYTES_USED: 1073741824
*************************** 2. row ***************************
                  EVENT_NAME: memory/temptable/physical_ram
                COUNT_ALLOC: 12
                  COUNT_FREE: 1
  SUM_NUMBER_OF_BYTES_ALLOC: 1074790400
    SUM_NUMBER_OF_BYTES_FREE: 1048576
              LOW_COUNT_USED: 0
          CURRENT_COUNT_USED: 11
            HIGH_COUNT_USED: 11
    LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 1073741824
  HIGH_NUMBER_OF_BYTES_USED: 1073741824
2 rows in set (0.00 sec)

mysql > select * from INFORMATION_SCHEMA.INNODB_SESSION_TEMP_TABLESPACES where id=10\G
*************************** 1. row ***************************
     ID: 10
  SPACE: 4243767290
   PATH: ./#innodb_temp/temp_10.ibt
   SIZE: 2399141888
  STATE: ACTIVE
PURPOSE: INTRINSIC
*************************** 2. row ***************************
     ID: 10
  SPACE: 4243767289
   PATH: ./#innodb_temp/temp_9.ibt
   SIZE: 98304
  STATE: ACTIVE
PURPOSE: USER
2 rows in set (0.00 sec)

From the above, we can see the query created a huge temporary table that first exceeded the temptable_max_ram variable and continued to grow in an mmapped file (still the TempTable engine), but as temptable_max_mmap was also reached, the table had to be converted to an on-disk InnoDB intrinsic table. The same pool of session temporary tablespaces is used in this case, but the PURPOSE column tells us whether the table is external (user-created) or internal.
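The thresholds involved can be checked directly; a simple sketch using the mysql client (temptable_max_mmap only exists in newer 8.0 releases):

$ mysql -e "SHOW VARIABLES LIKE 'temptable%'"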

The mmapped file is not visible in the file system as it is in a deleted state, but it can be watched with lsof:

mysqld  862655 przemek   52u      REG              253,3  133644288  52764900 /data/sandboxes/msb_ps8_0_23/tmp/mysql_temptable.8YIGV8 (deleted)
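If you want to reproduce that view yourself, lsof can be pointed at the running mysqld. A rough sketch assuming a single mysqld process on the host:

$ lsof -p $(pgrep -x mysqld) | grep mysql_temptable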

It is important to know that as long as the mmapped space has not been exceeded, the Created_tmp_disk_tables counter is not incremented, even though a file is created on disk.

Also, in Percona Server for MySQL's extended slow log, the size of temporary tables is not accounted for when the TempTable engine is used: https://jira.percona.com/browse/PS-5168 – it shows "Tmp_table_sizes: 0".

In some use cases, problems have been reported with the TempTable engine. It is possible to switch back to the old MEMORY engine via the internal_tmp_mem_storage_engine variable if needed.
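For example, a hedged sketch of switching the global default back (this affects only connections opened after the change; add it under [mysqld] in my.cnf to make it permanent):

$ mysql -e "SET GLOBAL internal_tmp_mem_storage_engine = 'MEMORY'"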



Oct
06
2021
--

Reminder: TokuDB Storage Engine Will Be Disabled by Default in Percona Server for MySQL 8.0.26


As we've communicated in our blog post in May, the TokuDB Storage Engine has been marked as "deprecated" in Percona Server for MySQL 8.0. It will be removed in a future version (Percona Server for MySQL 8.0.28, expected to ship in Q1 2022).

With the release of Percona Server for MySQL 8.0.26, the storage engine will still be included in the binary builds and packages but will be disabled by default. If you are upgrading from a previous version, the TokuDB Storage Engine plugin will fail with an error message at server startup if it is installed.

You will still be able to re-enable it manually so that you can perform the necessary migration steps.

Re-enabling the TokuDB Engine in Percona Server for MySQL 8.0.26 is fairly straightforward. You need to add the

tokudb_enabled=TRUE

and

tokudb_backup_enabled=TRUE

options to your my.cnf file and restart your server instance.
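As a sketch, on a typical package installation that could look like the following; the config file path (/etc/my.cnf) and the service name (mysql) are assumptions and may differ on your system:

$ sudo tee -a /etc/my.cnf <<'EOF'
[mysqld]
tokudb_enabled=TRUE
tokudb_backup_enabled=TRUE
EOF
$ sudo systemctl restart mysql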

Once the server is up and running again, you can migrate your data to an alternative storage engine like RocksDB or InnoDB before disabling and removing TokuDB. See the chapter Removing the TokuDB storage engine in our documentation for details.

Let us know if you have any questions or comments about this process!

If you need any assistance with migrating your data or have any questions or concerns about this, don’t hesitate to reach out to us – our experts are here to help!


Jun
22
2021
--

WiredTiger File Forensics Part 3: Viewing all the MongoDB Data


This article continues on from Part 1: Building "wt" and Part 2: wt dump to show how to extract any of your MongoDB documents directly from WiredTiger's raw data files. It will also show how to take a peek into the index files and, lastly, how to look in the WT transaction log to see updates made since the latest checkpoint.

List up the Collections and Indexes

WiredTiger doesn’t name any of its data files according to the MongoDB object names, so as a first step you’ll have to extract a table of WT idents (=identifiers) vs. collection and index names.

The _mdb_catalog.wt file is not the top table in the WiredTiger storage engine’s own hierarchy of data sources. To the MongoDB layer of code though it is the complete definition of collection and index objects in the database. This includes both MongoDB system and user-made collections and indexes.

Dump WT ident vs Collections

As explained in the previous Part 2: wt dump blog post, you should reuse this command to save a tab-delimited file of WiredTiger ident(ifier)s vs collection names. I’ll call this file wt_ident_vs_collection_ns.tsv in this article.

$ #cd to a copy of a MongoDB data directory
$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("md")) | [.ident, .ns] | @tsv' | sort > wt_ident_vs_collection_ns.tsv
$ 
$ head -n 5 wt_ident_vs_collection_ns.tsv
collection-0--4131298130356306083	config.cache.chunks.test.bar
collection-0-5834121039240263510	local.replset.initialSyncId
collection-0-5841128870485063129	local.startup_log
collection-0--6422702119521843596	config.system.sessions
collection-10--4131298130356306083	config.cache.chunks.test.foo

Eg. if an ident is “collection-4-5841128870485063129” then there will be a file collection-4-5841128870485063129.wt that has the data of a MongoDB collection.

In case you have a vague memory of seeing strings like these somewhere whilst using the mongo shell, you probably have. These idents are the same as the ones shown in the “wiredTiger.uri” field of the db.collection.stats() command output.
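If you have a mongod running on a copy of the data, you can confirm the mapping from the client side too. A hypothetical check against the test.customerOrder collection used later in this article (the wiredTiger.uri field contains the collection's ident):

$ mongo --quiet --eval 'db.getSiblingDB("test").customerOrder.stats().wiredTiger.uri'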

Optional: Dump WT ident vs Indexes

The index WT idents can be dumped as well. The example below saves them to wt_ident_vs_index_ns.tsv in three columns: [WT ident, collection name, index name].

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("idxIdent")) | .ns as $nsT | .idxIdent | to_entries[] | [.value, $nsT, .key] | @tsv' | sort > wt_ident_vs_index_ns.tsv
$
$ head -n 5 wt_ident_vs_index_ns.tsv
index-11--4131298130356306083	config.cache.chunks.test.foo	_id_
index-12--4131298130356306083	config.cache.chunks.test.foo	lastmod_1
index-12-5841128870485063129	local.system.replset	_id_
index-1--4131298130356306083	config.cache.chunks.test.bar	_id_
index-14-5841128870485063129	admin.system.version	_id_

Looking at the Application (= MongoDB) Table Data

collection-*.wt and index-*.wt files contain the data of the MongoDB collections and indexes that you observe as a client connected to a MongoDB instance. Once you’ve identified the WT ident value for the collection or index you want to inspect you can use that in the URI argument to the wt dump command.

wt dump collections to *.bson files

The wt dump output of collection-* tables has a basic numeric type key and BSON data as the values.

Use wt dump -x in your terminal on the WT file for a collection and you will see, after the header section, records of the WT table in alternating lines of key and value.

Eg. in the example below the first key is the numeric/binary value 0x81 and the first value is the one starting with binary bytes 0x68000000025f6964001e. The second key is 0x82, and its value starts with 0x6b000000025f69640021. The values are BSON; in reality they were much longer, and I trimmed them here for readability.

$ wt dump -x collection-X-XXXXXXXXXX
WiredTiger Dump (WiredTiger Version 10.0.0)
Format=hex
Header
file:WiredTiger.wt
access_pattern_hint=none,allocation_size=4KB,app.....
Data
81
68000000025f6964001e000000424d50524f442d425a41342d31...
82
6b000000025f69640021000000424d50524f442d574f524b4552...
...

FYI Using the alternative URI argument syntax “wt dump -x table:collection-X-XXXXXXXXXX” or “wt dump -x file:collection-X-XXXXXXXXXX.wt” will produce the same.

The key value in the collection-* tables isn’t needed to see the document content so print only every second line for the values using the following command:

$ wt dump -x collection-X-XXXXXXXXXX | tail -n +7 | awk 'NR%2 == 0 { print }'
68000000025f6964001e000000424d50524f442d425a41342d31...
6b000000025f69640021000000424d50524f442d574f524b4552....
66000000025f6964001c000000424d50524f442d574f524b4552...
6b000000025f69640021000000424d50524f442d574f524b4552...
64000000025f6964001a000000424d50524f442d44415441312d...

The command above prints binary BSON in a hex string format with a newline separating each record. We can translate that hexadecimal back to the original binary using the xxd command utility with the "-r" and "-p" flags. (Note: don't combine them as one "-rp" flag; it doesn't work like most Unix commands' short options.)

$ #Look for the WT ident of my test.customerOrder collection
$ grep customerOrder wt_ident_vs_collection_ns.tsv
collection-14--3398103177079662761	test.customerOrder
$ 
$ ls -lh collection-14--3398103177079662761.wt
-rw------- 1 akira akira 40K Jun 10 15:12 collection-14--3398103177079662761.wt
$
$ #dump and convert its values to a plain BSON file:
$ wt dump -x collection-14--3398103177079662761 | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p > test.customerOrder.bson
$ 
$ #Confirm the content using bsondump
$ bsondump --quiet test.customerOrder.bson
{"_id":"123456","_class":"model.customer_order.CustomerOrder","orderReference":{"businessInteractionId":"str ... "orderAttributeValue":"string"}]}]}]}]}}
{"_id":"ORN-billingissue01","_class":"com.ctl.bm.servi ... startDate":"2018-02-02T00:00:00.000Z"}],"existingTN":[]}
2021-06-10T15:12:16.627+0900	2 objects found

“wt read” a single record?

Sorry, you can’t read your MongoDB collection data with the “wt read” shell command. The blocker is trivial – as of WT v10.0.0 the “wt read” command only accepts plain text or its own “r” recordid numeric value as the lookup key value. The keys for the mongodb collections and indexes however are ‘q’ and ‘u’ types respectively. (Documentation: WT Schema Format types.)

In MongoDB, you might know you can use a showRecordId cursor option so it’s tempting to think that this is the same “r” type that wt read can currently accept, but unfortunately it is not. See the key_format=X value in the header of wt dump output samples to confirm.

If wt read (code = utilities/util_read.c) was modified to accept an -x argument so we could pass the hex strings we already see in wt dump output this issue would be solved. But as it isn’t, for now, you have to dump all records even if you just want one.

This is only a limitation in the shell. If you use the API from within a programming language, including the Python SWIG binding available, you should be able to read just a single record.
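
As a shell-only workaround you can dump the whole table and filter for the one key you care about. A minimal sketch, where both the ident and the key hex string 82 (recordId 2, as seen in the dump above) are placeholders to adjust:

$ #Sketch: extract a single document by its dump key; adjust the ident and the key value
$ wt dump -x collection-X-XXXXXXXXXX | tail -n +7 | awk 'NR%2 == 1 && $0 == "82" { getline; print }' | xxd -r -p | bsondump --quiet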

Looking at MongoDB’s Index Files

wt dump index-*.wt files

index-* WiredTiger tables have two different formats that are easy to see when using wt dump to look inside them – one with just keys, and one with both keys and values. The WT keys are binary values generated from MongoDB’s KeyString structure.

Below is an example of an index on {state: 1, process: 1} on a collection called config.locks. This is a case where there are no values in the WT table for the index. 

$ grep index-15--749209957533832251 wt_ident_vs_index_ns.tsv 
index-15--7492099575338322516	config.locks	state_1_process_1
$ 
$ wt dump -x index-15--7492099575338322516 | tail -n +7
293c436f6e66696753657276657200040008    <keys on odd-numbered line>
   <these even-numbered lines are where the WT value would be>
293c436f6e66696753657276657200040040

293c7072642d6d6f6e2d7065722d73686172642d6130333a32373031383a313538363938343138383a3139393331363932353337323937373139383300040028

....

293c70726f642d6d6f6e676f2d73686172642d6d6f6e676f732d302e6d6f6e676f2e706572636f6e612e636f6d3a32373031373a313535343633323838333a38363534383239393034393534383037333200040078

2b043c436f6e66696753657276657200040060

$

Below is an example of an {_id: 1} index on the collection test.rlru. This is a case when there are both WT keys and values.

$ grep index-1-2871579003788456567 wt_ident_vs_index_ns.tsv 
index-1-2871579003788456567	test.rlru	_id_
$
$ wt dump -x index-1-2871579003788456567 | tail -n +7
2904
0008
2b0204
0010
2b0404
0018
...
...
2c0faa04
203eb1
2c0fac04
203eb9
...

Given the point of an index record lookup is to have a value that points to a record in the collection-X-XXXX WT table, you should be asking “How can that first index above be useful without values?”

The answer is the recordid is packed on the end of the key. You’ll notice in the first example they all have 0x04 as the third-last byte. This is how MongoDB packs a recordId when the value is between 0 and 2^10 – 1 I believe. See KeyString appendRecordId() if you want to get into it further.

By the way, an index’s field names are constant and thus irrelevant to sort order, so they’re not part of the keystrings.

Writing about the KeyString format even in just the shortest summary would take a whole blog post. So I’ll just punch out some translations of the binary above as a teaser and stop there.

Keystring 293c436f6e66696753657276657200040008 =>

  • 0x29 = type marker for numeric value 0.
  • 0x3c = type marker for (UTF-8?) string
  • 436f6e66696753657276657200 = hex of the string "ConfigServer" plus trailing null
  • 0x04 = type marker for a recordId
  • 0x0008 => binary 000 + binary 0000000001 + binary 000 = (recordId) value 1

Keystring 2b043c436f6e66696753657276657200040060 =>

  • 0x2b = type marker, positive integer in a small range (< 2^7?)
  • 04 = binary 0000010 + bit 0 = value 2
  • 0x3c = type marker for (UTF-8?) string
  • 436f6e66696753657276657200 = hex of the string "ConfigServer" plus trailing null
  • 0x04 = type marker for a recordId
  • 0x0060 => binary 000 + binary 0000001100 + binary 000 = (recordId) value 12

Keystring 2b0404 =>

  • 0x2b = type marker, positive integer in a small range (< 2^7?)
  • 04 = binary 000010 + 00 = value 2; I'm not sure what the trailing two bits are for.
  • 0x04 = type marker to indicate the following value is a recordId?

Value 0018 =>

  • 0018 = binary 000 + binary value 0000000011 = 3, plus 3 bits 000 on the end. T.b.h. I don't always know how to translate this one, but commonly the 3 bits at the beginning and end of the recordId format are used as a byte size indicator.

If you are looking for the matching record in the collection-X-XXX table, look for the key (0x80 + recordId) value from the index file. Eg. 2 -> 0x82; 12 -> 0x8c. For some reason the range 0x0 – 0x80 seems to be reserved, so key values in the collection-X-XXX.wt files are all 0x80 higher than the recordId value found in the index-X-XXX.wt records.
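
If you want to double-check that arithmetic, the shell can do the bit work. This is only a sketch of the simple cases shown above (a small recordId packed as 3 + 10 + 3 bits, plus the 0x80 key offset):

$ #recordId packed at the end of the keystring: drop the low 3 bits, keep the next 10
$ echo $(( (0x0008 >> 3) & 0x3ff ))   # -> 1
$ echo $(( (0x0060 >> 3) & 0x3ff ))   # -> 12
$ #matching key in the collection-X-XXX table is recordId + 0x80
$ printf '%x\n' $(( 12 + 0x80 ))      # -> 8c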

In the end though, as we can't use wt read, all this poking around in the indexes from a shell command can only satisfy curiosity; it won't make single-record access fast.

Looking at the WT Transaction Log

A WiredTiger checkpoint saves a copy of documents in all collections and indexes as they were at one exact point in time (the time the checkpoint is started). MongoDB will call for a checkpoint to be made once per minute by default.

Without something else being saved to disk, a sudden crash would mean that restores/restarts could only revert to the last checkpoint. The classic database concept of write-ahead log is the solution to this of course. In WiredTiger this is provided by the transaction log, often just called “log” in its own documentation. Or as the documentation also says it adds “commit-level” durability to checkpoint durability.

At restart, whether it is after a perfectly normal shutdown or a crash, WiredTiger will read and replay the writes it finds in its transaction log onto the tables, in memory. In time, when the MongoDB layer requests a new checkpoint be created, the in-memory restored data will be saved to disk.

When you use wt dump it's not easy to say whether you're looking at the collection (or index) restored as of the last checkpoint only, or with the transaction log "recovered" (read and applied) as well. If the "-R" global option of the wt command is used then the log is recovered; if the opposite "-r" option is used then it is not. Which is in effect if you specify neither is unclear, and I've seen comments and diagnostic messages that suggest the -r option isn't perfect.
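
One way to see the difference for yourself is to dump the same table twice, once with each option, and diff the output. A hedged sketch – the ident is a placeholder, and I'm assuming the same log config string that wt printlog needs (see below) is appropriate here too:

$ wt -C "log=(compressor=snappy,path=journal)" -R dump -x collection-X-XXXXXXXXXX > with_recovery.dump
$ wt -C "log=(compressor=snappy,path=journal)" -r dump -x collection-X-XXXXXXXXXX > without_recovery.dump
$ #any differing lines are records that exist only in the transaction log
$ diff with_recovery.dump without_recovery.dump | head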

Two Layers of Recovery

By default WiredTiger keeps no transaction log for the tables it manages for the application embedding it; logging is only engaged at the application's request.

  • If you are using a standalone mongod MongoDB code will enable WT log for every collection and index created
    • Unless the “recoverFromOplogAsStandalone” server parameter is used. This is a trick that is part of point-in-time recovery of hot backups.
  • When replication is enabled (the typical case) only the WT tables for “local” db collections and indexes get “log=(enabled=true)” in their WT config string made by WiredTigerRecordStore::generateCreateString() in wiredtiger_record_store.cpp, and something similar in wiredtiger_index.cpp.

When a mongod node restarts with an existing data directory the WiredTiger library will run recovery. It can't be stopped (at least not by the MongoDB user). This is not the end of the story though. When replication was used this means only the "local" db collections, in particular the oplog, are restored. The replication system code has a ReplicationRecovery class that is used next, and this will apply updates made since the last durable checkpoint time from the oplog to the user and system collections they're supposed to be in. Only after that occurs is recovery complete, and the db server then makes the data available to clients.

ReplicationRecovery allows MongoDB to do rollbacks (at least to the checkpoint time) and also to trim off ops in the last replication batch application. It's an ugly point, but MongoDB applies replication ops in small batches, in parallel, to improve performance; usually this only happens on secondaries, of course, but it also happens after a restart. If all the writes in the batch finish, that is fine, but if they don't, some writes may be missing, which is a clear consistency violation. So the unfinished batch's writes must all be cleared, back to the last saved oplogTruncateAfterPoint value. replset.oplogTruncateAfterPoint is a single-document collection in the local db. It exists only for this purpose as far as I know.
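
If you're curious you can peek at this collection on a running node (it is maintained by the server itself, so treat this as a read-only look; a small sketch driving the mongo shell from bash):

$ mongo --quiet --eval 'printjson(db.getSiblingDB("local").replset.oplogTruncateAfterPoint.findOne())'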

Diagram for Typical Restart Recovery

“Typical” = restarting a node that is a replica. Doesn’t matter if it was a primary or secondary, or even the only node in the replica set.

WiredTiger's recovery
   ↓
WT table for local.oplog.rs updated
   ↓
ReplicationRecovery
   ↓
Trim oplog docs where "ts" > oplogTruncateAfterPoint
   ↓
oplog application

Pseudo code:
db.oplog.rs.find({ts: {$gt: <durable checkpoint time>}}).forEach(function(opdoc) {
     //re-apply the write to its intended collection in the "test", "mydb", "admin", etc. db
     applyOps(opdoc);
});
   ↓
Ready!

wt printlog

You can use the wt printlog command to see what is in the log currently in the WiredTigerLog.<nnnnnn> files in the journal/ subdirectory. If you decode the write ops in there you can understand what document versions will be restored by MongoDB when it restarts.

Preparation to Using wt printlog – Map WT idents to fileid

Until now you’ve learned that every MongoDB collection and index is in its own WT table file, which means needing to learn what the WT ident(ifier) is to find the right raw WT file to look into.

There’s another small integer id value, a.k.a. fileid, in WT metadata / WT table config strings for each file too. I suppose it saves space; at any rate, the WT transaction log references files only by this number. This means you’ll have to build a handy mapping or list of which WT ident is which fileid. Use the following command to create a file I call wt_ident_vs_fileid.tsv.

$ wt dump file:WiredTiger.wt | grep -B 1 ',id=' | sed 's/^file:\(.*\)\.wt\\00/\1\t/; s/.*,id=\([0-9][0-9]*\),.*/\1/' | paste -d" " - - > wt_ident_vs_fileid.tsv

$ cat wt_ident_vs_fileid.tsv
WiredTigerHS	 268
_mdb_catalog	 3
collection-0--7000593926523175565	 269
collection-0--7468664482515188034	 196
collection-0--9200920937255444113	 79
...
collection-8-2787176720526139723	 12
collection-83--6694766911154494072	 194
index-1--7000593926523175565	 270
index-1--7468664482515188034	 197
index-1-2787176720526139723	 5
...
index-9--7468664482515188034	 205
index-9-2787176720526139723	 13
sizeStorer	 2

WT idents don't tell the human reader which MongoDB collection or index is involved though, so next use this shell command to join wt_ident_vs_fileid.tsv to the wt_ident_vs_collection_ns.tsv and wt_ident_vs_index_ns.tsv files made earlier.

$ join <(sort wt_ident_vs_fileid.tsv) <(cat wt_ident_vs_collection_ns.tsv wt_ident_vs_index_ns.tsv | sort) | sort -n -k 2 | sed 's/ /\t/g' > wt_ident_vs_sorted_fileid_and_mdb_name.tsv
$ cat wt_ident_vs_sorted_fileid_and_mdb_name.tsv
collection-0-2787176720526139723	4	local.startup_log
index-1-2787176720526139723	5	local.startup_log	_id_
collection-2-2787176720526139723	6	local.replset.oplogTruncateAfterPoint
index-3-2787176720526139723	7	local.replset.oplogTruncateAfterPoint	_id_
collection-4-2787176720526139723	8	local.replset.minvalid
index-5-2787176720526139723	9	local.replset.minvalid	_id_
...
collection-16-2787176720526139723	20	local.oplog.rs
collection-17-2787176720526139723	21	admin.system.version
index-18-2787176720526139723	22	admin.system.version	_id_
...

Not shown: WiredTiger’s own metadata table WiredTiger.wt is fileid 0. I presume this is hardcoded.

Transaction Log Files

In a brutally trimmed summary, the content of the transaction log is something like this:

{ checkpoint marker }
...
...
{ lsn : [42, 7424], ops: [ {fileid: i, key: x1, value: y1 } ] }
{ lsn : [42, 8448], ops: [ {fileid: i, key: x2, value: y2 }, {...}, {...} ] }
{ lsn : [42, 8960], ops: [ {fileid: j, key: x6, value: y6 } ] }
{ lsn : [43,    0], ops: [ {fileid: i, key: x7, value: y7 }, {...} ] }
{ lsn : [42,  256], ops: [ {fileid: i, key: x9, value: y9 } ] }
...

Going into details:

  • The files are WiredTiger table files.
    • The inserts to the WT table for the MongoDB local.oplog.rs collection are responsible for most of the log by size.
    • When replication is used only “local” db collection and index writes are saved in the WT log. This is saying: WT transaction log will only restore the local db data to what it should be. The replication system in MongoDB, above the storage engine, has a recovery process that reflects those post-last-checkpoint oplog writes to their respective collections outside of the “local” db.
  • The fileid number is an internal one of WiredTiger. It is a simple integer that starts from 1, so you’ll probably only see this as a one or two- or three-digit value. Which WT ident(ifier) (eg. “collection-99-8595782607099944640”) it is for, and then which MongoDB collection (or index) that is for, can be determined by looking in the WiredTiger metadata. (See “Output a tab-delimited table of WT ident(ifier) values vs. collection namespace” in WiredTiger File Forensics Part 2: wt dump. Once you know the ident, use wt dump on that table and look for the “id=…” value in its config string in the header section.)
  • There are no timestamps here that the WiredTiger library uses; it will simply replay these writes in exactly the order the application (= MongoDB) committed them.
  • MongoDB writes the oplog “ts” timestamp into the wiredtiger key value shown above when it writes oplog docs. The “ts” value used is transformed from the BSON timestamp value binary format by record_id_helpers::extractKeyOptime(…). (oploghack::extractKey() in <= v4.2).
    This is an exception to what MongoDB puts into the key values of WT tables for other collections (and indexes).
  • The value shown above is just a binary blob to WiredTiger.
    • For the tables that hold MongoDB collections, those blobs hold the raw BSON object you can see as a MongoDB user.
    • The blob for mongodb indexes will be a MongoDB KeyString.
  • “lsn” field is transaction log file number (eg. 43 is “WiredTigerLog.0000000043”) plus byte offset. So it is an ever-increasing tuple value, a simple definition of order.

With that explained, let’s look at a real sample now:

$ #To use wt printlog the WiredTiger config string definitely needs to be set.
$ #  See the part 1 of this blog series for details. In this example I'll do it by 
$ #  the -C option, but there are two other ways as well
$
$ wt -C "log=(compressor=snappy,path=journal)" printlog -xu | head -n 100
[
  { "lsn" : [55,128],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "prev_lsn",
        "prev_lsn": [54, 98566272]
      }
    ]
  },
  { "lsn" : [55,256],
    "hdr_flags" : "",
    "rec_len" : 384,
    "mem_len" : 384,
    "type" : "commit",
    "txnid" : 812443,
    "ops": [
      { "optype": "row_put",
        "fileid": 20 0x14,
        "key": "\u00e8`\u0088\u000b\u00c2\u00ff\u00ff\u00e0\u00e7",
        "key-hex": "e860880bc2ffffe0e7",
        "value": "\u00f3\u0000\u0000\u0000\u0002op\u0000\u0002\u0000\u0000\u0000d\u0000\u0002ns\u0000 \u0000\u0000\u0000config.transaction_coordinators\u0000\u0005ui\u0000\u0010\u0000\u0000\u0000\u0004\u00f7z\u001082\u00c6A\r\u0089\nx\u00cb(\u00edD\u0080\u0003o\u0000p\u0000\u0000\u0000\u0003_id\u0000f\u0000\u0000\u0000\u0003lsid\u0000H\u0000\u0000\u0000\u0005id\u0000\u0010\u0000\u0000\u0000\u0004\u00d4\u00dc\u00ac\u00ef\u00e7\u0087@\u00a7\u009d\u00df\u00c6.\u00d6\u00f4\u00f0\u00ec\u0005uid\u0000 \u0000\u0000\u0000\u0000\u00e3\u00b0\u00c4B\u0098\u00fc\u001c\u0014\u009a\u00fb\u00f4\u00c8\u0099o\u00b9$'\u00aeA\u00e4d\u009b\u0093L\u00a4\u0095\u0099\u001bxR\u00b8U\u0000\u0012txnNumber\u0000s\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0011ts\u0000'\u0001\u0000\u0000\u00c3\u000b\u0088`\u0012t\u0000i\u0000\u0000\u0000\u0000\u0000\u0000\u0000\twall\u0000(\u00f5m\u0013y\u0001\u0000\u0000\u0012v\u0000\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "value-hex": "f3000000026f7000020000006400026e730020000000636f6e6669672e7472616e73616374696f6e5f636f6f7264696e61746f727300057569001000000004f77a103832c6410d890a78cb28ed4480036f0070000000035f69640066000000036c7369640048000000056964001000000004d4dcacefe78740a79ddfc62ed6f4f0ec05756964002000000000e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855001274786e4e756d62657200730400000000000000001174730027010000c30b886012740069000000000000000977616c6c0028f56d1379010000127600020000000000000000"
      }
    ]
  },
  { ....

 

The plain "key" and "value" fields are a mixture of printable chars and hex escapes for non-printable chars, unfortunately encoded as UTF-16 rather than ASCII or UTF-8. This is too hard to work with, so I add the -x option to also print plain hex strings, the same as wt dump -x does.

In this sample fileid 20 is for local.oplog.rs, and this is typical content for the WT log with MongoDB. Using xxd -r -p we can see what is going into the value:

$ echo  "f3000000026f7000020000006400026e730020000000636f6e6669672e7472616e73616374696f6e5f636f6f7264696e61746f727300057569001000000004f77a103832c6410d890a78cb28ed4480036f0070000000035f69640066000000036c7369640048000000056964001000000004d4dcacefe78740a79ddfc62ed6f4f0ec05756964002000000000e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855001274786e4e756d62657200730400000000000000001174730027010000c30b886012740069000000000000000977616c6c0028f56d1379010000127600020000000000000000" | xxd -r -p | bsondump --quiet | jq  .
{
  "op": "d",
  "ns": "config.transaction_coordinators",
  "ui": {
    "$binary": {  "base64": "93oQODLGQQ2JCnjLKO1EgA==", "subType": "04" }
  },
  "o": {
    "_id": {
      "lsid": {
        "id": {  "$binary": { "base64": "1Nys7+eHQKed38Yu1vTw7A==", "subType": "04" } },
        "uid": {
          "$binary": { "base64": "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=", "subType": "00" }
        }
      },
      "txnNumber": {  "$numberLong": "1139" }
    }
  },
  "ts": {  "$timestamp": {"t": 1619528643, "i": 295 } },
  "t": { "$numberLong": "105" },
  "wall": { "$date": { "$numberLong": "1619528643880" } },
  "v": { "$numberLong": "2" }
}

An insert to the oplog that is for a d(elete) of the document in “config.transaction_coordinators” with “id”: {  “$binary”: { “base64”: “1Nys7+eHQKed38Yu1vTw7A==”, “subType”: “04” } }.
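
To decode every value in the log rather than one at a time, a rough batch version of the same steps works. A sketch, assuming the same -C config string; values that aren't BSON (eg. index writes) will simply fail to decode and can be ignored:

$ wt -C "log=(compressor=snappy,path=journal)" printlog -xu | grep '"value-hex"' | sed 's/.*"value-hex": "//; s/".*$//' | while read hex; do echo "$hex" | xxd -r -p | bsondump --quiet; done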

Summary

To look at the MongoDB collections, indexes, and the transaction log (a.k.a. the mongodb journal) you can use the wt dump and wt printlog commands. Before doing any of that you’ll have to make a mapping of WT table ident(ifier)s vs. MongoDB collection and index names, and also a list of fileids for the log. This information can be extracted using wt dump on WiredTiger.wt and _mdb_catalog.wt tables. Well, that and a medium amount of grep, sed, awk, etc.

The collection-*.wt files have key-value pairs of [recordId, BSON binary]. The index-*.wt files can also be read, where you can see the binary of the MongoDB-defined “keystrings”. Due to limitations with wt read described above in the article, there is no practical way currently with the “wt” util command to do index lookups.

Looking at the transaction log with wt printlog command will show the operations that will be applied during restart to progress the data from the state it was at the last durable checkpoint to the moment the last journal commit happened before shutdown.

Jun
07
2021
--

Experimental Feature: $backupCursorExtend in Percona Server for MongoDB

$backupCursorExtend MongoDB

Percona Server for MongoDB (PSMDB) has provided a 'hot' backup of its underlying data db directory files using WiredTiger library methods since v3.4. When you use the { createBackup: … } command it will copy them to whichever other filesystem directory or object store bucket location you specify.

createBackup partially freezes the WiredTiger *.wt Btree files with an internal backup cursor while the command runs. That’s “partially” because more data may be added to the files whilst the data is copied to the backup storage. The key point is all the file blocks of the last checkpoint will be frozen until the backup cursor is closed.

As well as the Btree-storing *.wt files, some metadata files, config-ish files, and the journal/WiredTigerLog.* files will also be copied. We’ll get into those a little later though.

In PSMDB 4.4.6-8 we’ve added two new aggregation pipeline stages that make the file-by-file mechanics of this fastest backup procedure more visible. They also support cluster-consistent backup and restore.

These aggregation pipeline stages are not real aggregation functions such as group, sort, etc. They’re functions for DBAs to use, like $collStats, $currentOp, $getDiagnosticData, etc.

New Aggregation Stage #1: $backupCursor

When you run db.aggregate([{$backupCursor: {}}]) it simply returns a list of files to copy.

testrs:SECONDARY> bkcsr = db.getSiblingDB("admin").aggregate([{$backupCursor: {}}])
{ "metadata" : { 
    "backupId" : UUID("f53a93d5-8a0c-4f09-b654-909ff023a7ed"), 
    "dbpath" : "/data/plain_rs/n1/data", 
    "oplogStart" : { "ts" : Timestamp(1615341658, 18745), "t" : NumberLong(105) }, 
    "oplogEnd" : { "ts" : Timestamp(1622185503, 1), "t" : NumberLong(115) }, 
    "checkpointTimestamp" : Timestamp(1622185473, 1) 
  } 
}
{ "filename" : "/data/plain_rs/n1/data/collection-105-8595782607099944640.wt", "fileSize" : NumberLong(20480) }
{ "filename" : "/data/plain_rs/n1/data/index-1--1329013882440624564.wt", "fileSize" : NumberLong(20480) }
...
...
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000042", "fileSize" : NumberLong(104857600) }
testrs:SECONDARY>

See the “Full instructions for using” section below for a thorough example.

It just lists the files and sizes without copying them anywhere. This is less convenient than the existing createBackup command, which does copy them. So why did we add it? Compatibility with the other, enterprise version of MongoDB is the main reason. But there's another one – adding this aggregation stage exposes the WiredTiger library's backup cursor API in an almost 1-to-1 structural match, which is educational for advanced users, and that's worth something.

Cursors of … File Names?

You may have heard that in Unix "everything is a file". Well, in the WiredTiger library API everything is a cursor. Tables and indexes are accessed by a cursor of course, and so is the transaction log. Colgroups are too. Also statistics and metadata. Even files are accessed through a cursor. How did this happen? I guess when you're a db engine developer … hammers, nails – you know the saying.

New Aggregation Stage #2: $backupCursorExtend

This extra function returns just a few more file paths to copy. The output is small, as you can see below. The files it returns are just the WiredTiger transaction log files that have been updated or newly added since $backupCursor was first run.

The WT log files are small, but they have an important benefit: in combination with some restart-time tricks, we can use the WT transaction ops in these files to restore to any arbitrary time between the $backupCursor and $backupCursorExtend execution times.

You must run $backupCursor first. As well as keeping that first cursor open, which is a requirement, it’ll be helpful to save a copy of the “metadata” document that is its first record. The reason is that the “backupId” value in the metadata is a required parameter when creating the $backupCursorExtend cursor – it has to know from which main backup cursor it is meant to extend.

testrs:SECONDARY> var bkCsr = db.getSiblingDB("admin").aggregate([{$backupCursor: {}}])
testrs:SECONDARY> var bkCsrMetadata = bkCsr.next().metadata //first document in the $backupCursor result has just one child property, the "metadata" object.
testrs:SECONDARY> // do other things with the main backup cursor
testrs:SECONDARY> // ...
testrs:SECONDARY> var bkExtCsr = db.aggregate([{$backupCursorExtend: {backupId: bkCsrMetadata.backupId, timestamp: new Timestamp(162218619, 1)}}])
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000042" }
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000043" }
{ "filename" : "/data/plain_rs/n1/data/journal/WiredTigerLog.0000000044" }
testrs:SECONDARY>

The second parameter, the timestamp one, will block the command from returning its result until the node has reached that cluster time. This guarantees that you don’t have, for example, a lagging secondary’s snapshot that ends at a cluster time that is before the starting cluster time of other shard’s backups.

Full Instructions for Using $backupCursor and $backupCursorExtend

N.b. this feature is experimental in PSMDB as of now (v4.4.6), so it may change in the future.

Backup

From one node in each replica set (including the config server replica set) run the following procedure.

It’s OK to use either a primary or a secondary, but using a secondary that has significant lag will oblige you to wait for that much time in between steps 1 and 2.

1. Get a list of all files to copy with $backupCursor

Getting the result response, with the metadata doc and the list of files, will take less than a second, but (important!) this cursor must be kept alive until the end of all the backup steps shown below. So don’t close it.

When the command has completed on all replica sets, save somewhere (in your script, etc.) the maximum (latest) "oplogEnd" timestamp amongst the replica set nodes – this will be the target, common time of the backup to restore. In this blog I'll call this value the commonBackupTimestamp.

(Pre-2: wait until all complete step 1.)

Step 1 will only take ~1 second to complete, but make sure you finish it for all servers involved before doing the next step.

2. If backing up a cluster also run $backupCursorExtend

Do it as soon as you like, you only need to capture a span of time that is long enough to overlap the same process on the other shards and the config server replica set.

Pass the metadata “backupId” from the $backupCursor on this node as the first parameter, and the commonBackupTimestamp (from all replica sets) as the second parameter. This could be run within seconds if your script/program is fast and is good at parallel execution across servers.

You don't need to explicitly close the $backupCursorExtend cursor; it is closed automatically. It doesn't need to pin any files because $backupCursor is already doing that.

3. Start a < 10 min loop that runs getmore on backup cursor

This will prevent the automatic close of the backup cursor. This is crucial – if the backup cursor is closed the WiredTiger library will start allowing the backup snapshot's file blocks to be cleaned up and overwritten.

The $backupCursor MongoDB cursor is a tailable one that will return a successful but empty result as many times as you need.

4. Now start copying all the files listed to your backup storage

This is the part that takes hours … or whatever <your-db-file-size>/<transfer-rate> is for you.

5. Terminate the getmore loop

6. Close the cursor of the $backupCursor command

Or just let it time out (10mins).

7. Save the commonBackupTimestamp for restore.

This is a MongoDB timestamp, eg. Timestamp(162218619, 1). If you don’t save it in a handy place you’ll have to peek at all the restored nodes to find some common cluster timestamp they all support.

Restore

Having made backups with the current $backupCursor and $backupCursorExtend some work is required to restore a consistent cluster state. The reason is that $backupCursorExtend, in its current experimental version at least, returns WT log files that probably contain operations beyond the target cluster time.

To fix this we can use a special function of the replication subsystem’s startup process to trim the amount of oplog being restored beyond the last checkpoint to the desired, common cluster time. The “oplogTruncateAfterPoint” value in the local db’s replset.oplogTruncateAfterPoint collection is how you set that limit. (It’s undocumented in the MongoDB Manual, but a critical part of replication being updated all the time in normal replication.)

The big hassle is that more restarts are required – the oplogTruncateAfterPoint value can only be set whilst in standalone mode because it’s a field in a system collection of the local db; enabling replication again is required to get the value to be enacted upon. Saving the final result in a new checkpoint takes another standalone mode restart, with special extra options. And, unrelated to timestamp matters, you’ll probably be fixing the replica set and shard configuration to accept the new hostnames too.

For the general process of how to restore a filesystem snapshot of a cluster please see: Restore a Sharded Cluster. Although it’s not an absolutely required part of this restore process we also recommend setting an ephemeral port in the config file for all restarts, except the final one when rs.initiate() is executed. This prevents accidental connection to the original replica set nodes, wherever that environment is, and prevents interference between config and shard nodes before they’re perfectly restored.

The difference from the documented method above is that, for both configsvr and shardsvr type nodes, instead of following the two steps to start the node in standalone mode and run db.getSiblingDB("local").drop(), you do these three steps instead:

  • Start mongod as standalone (i.e. replication and sharding options deleted or commented out)
    • Don’t drop the entire local db. We need to keep the oplog. Drop just the “replset.minvalid”, “replset.oplogTruncateAfterPoint”, “replset.election” and “system.replset” collections.
    • Insert a dummy replset.minvalid timestamp.
      db.getSiblingDB("local").replset.minvalid.insert({"_id": ObjectId(), "t": NumberLong(-1), "ts": Timestamp(0,1) });
    • Manually create a temporary replset configuration with this minimal config:

      var tmp_rs_conf = {
        "_id" : THE-REPLSET-NAME,
        "version" : NumberInt(1), "protocolVersion" : NumberInt(1),
        "members" : [
          { "_id" : NumberInt(0), "host" : "localhost:<port>" }
        ], "settings" : {}
      };
      
      //Note: If config server add "configsvr": true as well
      // tmp_rs_conf["configsvr"] = true;
      
      db.getSiblingDB("local").system.replset.insert(tmp_rs_conf);
    • Put the commonBackupTimestamp in the replset.oplogTruncateAfterPoint collection.
      db.getSiblingDB("local").replset.oplogTruncateAfterPoint.insert( {
          "_id": "oplogTruncateAfterPoint",
          "oplogTruncateAfterPoint": <commonBackupTimestamp> })
    • Shutdown instance
  • Start as a single-node replica set to recover the oplog.
    • The CSRS can be started with sharding enabled (clusterRole=configsvr), but the shards' replica sets should be started without clusterRole configured
    • At this start, oplog history is truncated to the point set in the oplogTruncateAfterPoint collection
    • Shutdown instance
  • Start as standalone with these parameters so the WT log and then mongodb-layer oplog updates are A) applied and B) saved into a checkpoint.
    • –setParameter recoverFromOplogAsStandalone=true
    • –setParameter takeUnstableCheckpointOnShutdown=true
    • Shutdown instance

I won't say "… and that's all there is to it!", because five restarts is a big job. (Per the usual process in Restore a Sharded Cluster, another two are needed to fix up the replica set and sharding config when moving hosts.) But this is how we can restore all nodes to exactly the same MongoDB cluster time using the currently experimental $backupCursorExtend released in Percona Server for MongoDB v4.4 this week.

Summary

The WiredTiger API exposes ways to list all files that should be copied for backup, and it will freeze those files so you can copy them without them being deleted or necessary parts within them being overwritten. These methods of the WT storage engine are exposed at the MongoDB user level with the $backupCursor and $backupCursorExtend aggregation pipeline stages.

$backupCursorExtend in particular gives us a way to keep on collecting WT transaction log files so that we can guarantee coverage of some common time between all shards and the config server replica set of a sharded cluster.

This makes it possible to achieve cluster-consistent filesystem copy restores of a MongoDB sharded cluster. Distributed transactions, if you are using them, will be restored all-or-nothing style up to the cluster timestamp you specify during restore.

Looking ahead: $backupCursorExtend in PSMDB 4.4.6 (and as far as we can tell, in MongoDB Enterprise edition too) doesn’t limit transaction log file copies to an exact MongoDB cluster time. It only guarantees greater-than-or-equal-to inclusion of log writes of that time. So during restore, we must use extra techniques that the MongoDB replication system provides, in particular the undocumented oplogTruncateAfterPoint trick, to do a clean cut of the oplog up to the same cluster timestamp.

After we've heard feedback from the community about these new backup cursor functions we'd like to improve them, so let us know how you use them and what a better solution would look like for you.

May
27
2021
--

How InnoDB Handles TEXT/BLOB Columns

InnoDB Handles TEXT BLOB Columns

Recently we had a debate in the consulting team about how InnoDB handles TEXT/BLOB columns. More specifically, the argument was around the Barracuda file format with dynamic rows.

In the InnoDB official documentation, you can find this extract:

When a table is created with ROW_FORMAT=DYNAMIC, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page.

Whether columns are stored off-page depends on the page size and the total size of the row. When a row is too long, the longest columns are chosen for off-page storage until the clustered index record fits on the B-tree page. TEXT and BLOB columns that are less than or equal to 40 bytes are stored in line.

The first paragraph indicates InnoDB can store long columns fully off-page, in an overflow page. A little further down in the documentation, the behavior of short values, the ones with lengths less than 40 bytes, is described. These short values are stored in the row.

Our argument was about what happens for lengths above 40 bytes. Are these values stored in the row or in overflow pages? I don't really need a reason to start digging into a topic, so imagine when I have one. With simple, common tools, let's experiment and find out.

Experimentation with TEXT/BLOB

For our experiment, we need a MySQL database server, a table with a TEXT column, and the hexdump utility. I spun up a simple LXC instance in my lab with Percona-server 8.0 and created the following table:

CREATE TABLE `testText` (
`id` int(11) unsigned NOT NULL,
`before` char(6) NOT NULL DEFAULT 'before',
`data` text NOT NULL,
`after` char(5) NOT NULL DEFAULT 'after',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The before and after columns are there to help locate the TEXT column in the datafile page. We can insert a few rows into the table; a TEXT value whose size is a multiple of 16 is convenient with hexdump, since successive identical lines are replaced by '*'.

mysql> insert into testText (id,data) values (1,repeat('0123456789abcdef',1));
Query OK, 1 row affected (0.01 sec)
mysql> insert into testText (id,data) values (2,repeat('0123456789abcdef',2));
Query OK, 1 row affected (0.00 sec)
mysql> insert into testText (id,data) values (3,repeat('0123456789abcdef',3));
Query OK, 1 row affected (0.01 sec)

Then we use the hexdump utility at the shell level, the first leaf page starts at offset 0xc000:

# hexdump -C /var/lib/mysql/test/testText.ibd
...
0000c060  02 00 1b 69 6e 66 69 6d  75 6d 00 04 00 0b 00 00  |...infimum......|
0000c070  73 75 70 72 65 6d 75 6d  10 00 00 10 00 32 00 00  |supremum.....2..|
0000c080  00 01 00 00 00 05 74 3f  d9 00 00 01 5d 01 10 62  |......t?....]..b|
0000c090  65 66 6f 72 65 30 31 32  33 34 35 36 37 38 39 61  |efore0123456789a|
0000c0a0  62 63 64 65 66 61 66 74  65 72 20 00 00 18 00 42  |bcdefafter ....B|
0000c0b0  00 00 00 02 00 00 00 05  74 40 da 00 00 01 5e 01  |........t@....^.|
0000c0c0  10 62 65 66 6f 72 65 30  31 32 33 34 35 36 37 38  |.before012345678|
0000c0d0  39 61 62 63 64 65 66 30  31 32 33 34 35 36 37 38  |9abcdef012345678|
0000c0e0  39 61 62 63 64 65 66 61  66 74 65 72 30 00 00 20  |9abcdefafter0.. |
0000c0f0  ff 7e 00 00 00 03 00 00  00 05 74 45 dd 00 00 01  |.~........tE....|
0000c100  62 01 10 62 65 66 6f 72  65 30 31 32 33 34 35 36  |b..before0123456|
0000c110  37 38 39 61 62 63 64 65  66 30 31 32 33 34 35 36  |789abcdef0123456|
*
0000c130  37 38 39 61 62 63 64 65  66 61 66 74 65 72 00 00  |789abcdefafter..|

Clearly, the values are stored inside the row, even the third one with a length of 48.

Cutoff to an Overflow Page

If we continue to increase the length, the behavior stays the same up to a length of 8080 bytes (505 repetitions). If we add 16 bytes to the length, the row becomes larger than half of the available space on the page. At this point, the TEXT value is moved to an overflow page and replaced by a 20 bytes pointer in the row itself. The pointer in the rows looks like this:

0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......|
0000c070 73 75 70 72 65 6d 75 6d 14 c0 00 00 10 ff f1 00 |supremum........|
0000c080 00 00 01 00 00 00 05 74 80 a4 00 00 01 fe 01 10 |.......t........|
0000c090 62 65 66 6f 72 65 00 00 01 02 00 00 00 04 00 00 |before..........|
0000c0a0 00 26 00 00 00 00 00 00 1f a0 61 66 74 65 72 00 |.&........after.|
0000c0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

The 20 bytes pointer is: 00 00 01 02 00 00 00 04 00 00 00 26 00 00 00 00 00 00 1f a0
Space ID: 00 00 01 02
First overflow page: 00 00 00 04 (4 x 0x4000 = 0x10000)
Offset in overflow page: 00 00 00 26
Version number: 00 00 00 00
Total length of the TEXT value: 00 00 1f a0 (0x1fa0 = 8096 = 16*506)

The overflow page:

00010000 c1 06 3d 24 00 00 00 04 00 00 00 00 00 00 00 00 |..=$............|
00010010 00 00 00 00 74 dd 8e 3f 00 0a 00 00 00 00 00 00 |....t..?........|
00010020 00 00 00 00 01 02 00 00 1f a0 ff ff ff ff 30 31 |..............01|
00010030 32 33 34 35 36 37 38 39 61 62 63 64 65 66 30 31 |23456789abcdef01|
*
00011fc0 32 33 34 35 36 37 38 39 61 62 63 64 65 66 00 00 |23456789abcdef..|
00011fd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

The value header starting @ 0x10026: 00 00 1f a0 ff ff ff ff
Length in that overflow page: 00 00 1f a0 (0x1fa0 = 8096 = 16*506)
Next overflow page: ff ff ff ff (this means it is the last)
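
The size arithmetic in these two breakdowns is easy to verify from the shell; a quick sketch using the same numbers:

$ #first overflow page: page number x page size (16KB = 0x4000)
$ printf '0x%x\n' $(( 4 * 0x4000 ))   # -> 0x10000, the offset of the overflow page above
$ #length stored in the pointer and repeated in the overflow page header
$ echo $(( 0x1fa0 ))                  # -> 8096 = 16 * 506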

From the above, one can make a few interesting observations:

  • The 20 bytes pointer includes a space_id value so in theory, the overflow page could be in another tablespace (ibd file).
  • The total length is using 4 bytes even though this is a TEXT and not a LONGTEXT column. 2 bytes would have been sufficient.
  • The length of an overflow page chunk is also using 4 bytes for length even though the largest possible InnoDB page size is 64KB.

Performance Impacts

TEXT/BLOB columns are the source of a number of performance-related impacts. Let’s review the most obvious ones.

Storage

The first performance impact is related to storage inefficiency. TEXT/BLOB values are stored in chunks of 16KB if the default InnoDB page size is used. That means, on average, about 8KB per value is lost when overflow pages are used. This leads to larger data files and less efficient caching.

Reads

The presence of TEXT/BLOB columns can significantly increase the number of read IOPs required for queries. For example, let’s consider the following simple query:

SELECT * from myTable limit 10;

If we ignore caching, without TEXT/BLOB columns, the above query would require only one read IOP per level in the primary key btree of myTable. For a small table, this could be one or two read IOPs. Now, if each row has a 1MB TEXT/BLOB column, the same simple query would require in excess of 640 read IOPs since each TEXT/BLOB value is a chain of 64 pages of 16KB.

Writes

For this section, let’s assume a worst-case scenario with row-based replication enabled and a full row image. Now, if we insert a row with a 1MB value as a longText column in a Percona server 8.0 instance, we have:

+----------------------------------+---------------------------+
| FILE_NAME                        | SUM_NUMBER_OF_BYTES_WRITE |
+----------------------------------+---------------------------+
| /var/lib/mysql/#ib_16384_0.dblwr |                   1572864 |
| /var/lib/mysql/ib_logfile0       |                   1174528 |
| /var/lib/mysql/test/testText.ibd |                   1130496 |
| /var/lib/mysql/binlog.000003     |                   1048879 |
| /tmp/MLfd=41                     |                   1048783 |
| /var/lib/mysql/undo_001          |                    278528 |
| /var/lib/mysql/undo_002          |                    131072 |
| /var/lib/mysql/ibdata1           |                     16384 |
| /var/lib/mysql/mysql.ibd         |                     16384 |
+----------------------------------+---------------------------+

For a total of around 6.4 MB. This is not surprising: there are two log files, and the data is also written twice because of the doublewrite buffer. The temporary file is used for the disk binlog cache and, unless the transaction is very long, it won't actually be written to storage.
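
The per-file write counters above look like they come from performance_schema; here is a hedged sketch of one way to collect similar numbers yourself (reset the file summary table, run the statement you want to measure, then read the counters back):

$ mysql -e "TRUNCATE TABLE performance_schema.file_summary_by_instance;"
$ #...run the INSERT or UPDATE being measured...
$ mysql -e "SELECT file_name, sum_number_of_bytes_write FROM performance_schema.file_summary_by_instance WHERE sum_number_of_bytes_write > 0 ORDER BY sum_number_of_bytes_write DESC;"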

Anyway, this is just to set the stage for what happens after an update. If I just change the last letter of the longText value, the amount of data written rises to approximately 10 MB.

+----------------------------------+---------------------------+
| FILE_NAME                        | SUM_NUMBER_OF_BYTES_WRITE |
+----------------------------------+---------------------------+
| /var/lib/mysql/#ib_16384_0.dblwr | 2260992                   |
| /var/lib/mysql/test/testText.ibd | 2211840                   |
| /var/lib/mysql/binlog.000003     | 2097475                   |
| /tmp/MLfd=41                     | 2097379                   |
| /var/lib/mysql/ib_logfile0       | 1129472                   |
| /var/lib/mysql/undo_001          | 32768                     |
| /var/lib/mysql/ibdata1           | 16384                     |
+----------------------------------+---------------------------+

The longText value is not modified in place, it is copied to a new set of overflow pages. The new and old overflow pages then need to be flushed to storage. Also, since we use the worst-case scenario of the full row image, the binary log entry has the old and new value stored but the InnoDB log files only have the new version.

I hope this illustrates why storing mutable data in a TEXT/BLOB column is a bad idea.

JSON

Although columns using the MySQL JSON data type are stored as TEXT/BLOB columns under the hood, MySQL 8.0 added some logic to allow in-place updates. The impact of an update to a large JSON column in 8.0 is not as severe as in 5.7.
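
As a hedged illustration (table and column names are hypothetical): an update that goes through JSON functions such as JSON_SET() and doesn't grow the value is the kind that can be applied partially in 8.0, and with row-based replication the binlog_row_value_options=PARTIAL_JSON setting can also shrink the binlog entry:

$ #only the touched path is a candidate for a partial, in-place update
$ mysql test -e 'UPDATE jdoc SET doc = JSON_SET(doc, "$.status", "done") WHERE id = 1;'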

How to Best Deal with TEXT/BLOB Columns?

Data Compression

Data compression is a very compelling option for TEXT/BLOB values. By definition, those values are normally large and thus usually compress well. This is nothing new and has been reported previously. PostgreSQL, for one, compresses its TEXT/BLOB columns (TOAST storage) by default.

The best place to compress is always the application, when that is possible of course. As we have just seen, this reduces the write load and spares the database from the CPU burden of compression.

Another option with MySQL is the InnoDB Barracuda table compression. When used, TEXT/BLOB values are compressed before being written to the overflow pages. This is much more efficient than compressing the pages one at a time.

Finally, if you are using Percona Server or MariaDB, you have access to transparent column compression. This is the second-best option, performance-wise, when compression by the application is not possible.

Avoid Returning TEXT/BLOB Columns

When there are large TEXT/BLOB columns in a table, the cost of accessing those columns is high. Because of this, it is important to select only the columns that are needed and avoid the default use of "SELECT * FROM …". Unfortunately, many ORM frameworks instantiate objects by grabbing all the columns. When using such a framework, you should consider storing the TEXT/BLOB in a separate table with a loose 1:1 relationship with the original table. That way, the ORM can instantiate the object without necessarily forcing a read of the TEXT/BLOB columns.
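
A minimal sketch of that layout, with all names hypothetical (the lead table keeps only the small, frequently read columns; the payload table is read only when the large value is genuinely needed):

$ mysql test <<'EOF'
-- hypothetical split of "myTable": hot columns stay here
CREATE TABLE myTable (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  status VARCHAR(20) NOT NULL
) ENGINE=InnoDB;
-- the large value lives in its own table, loosely 1:1, read only when needed
CREATE TABLE myTable_payload (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  payload LONGTEXT NOT NULL,
  CONSTRAINT fk_myTable_payload FOREIGN KEY (id) REFERENCES myTable (id)
) ENGINE=InnoDB;
EOF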

Conclusion

I hope this article improved your understanding of TEXT/BLOB values in InnoDB. When used correctly, TEXT/BLOB columns can be useful and have a limited impact on the performance of the database.

May
21
2021
--

Heads-Up: TokuDB Support Changes and Future Removal from Percona Server for MySQL 8.0

TokuDB Support Changes

Back in December 2018, when we announced the general availability of Percona Server for MySQL 8.0, we also announced that the TokuDB Storage Engine has been marked as "deprecated" in this release, recommending to use the MyRocks Storage Engine as an alternative. We believe that MyRocks provides similar benefits for the majority of workloads and is better optimized for modern hardware.

Since then, we have continued maintaining the storage engine in the 8.0 release, e.g. by incorporating bug fixes. However, the ongoing amount of changes still occurring in the upstream MySQL 8.0 codebase has been a constant challenge and a cause for concern.

Maintaining TokuDB as part of the 8.0 codebase has become increasingly difficult and time-consuming. Based on the number of support requests and conversations on our forums, we have seen very little to no adoption of TokuDB in Percona Server for MySQL 8.0.

We have therefore decided that the TokuDB storage engine will be disabled in future versions of Percona Server for MySQL 8.0. Beginning with Percona Server version 8.0.25, we’ll add a respective warning notice to the release notes, to inform users about this upcoming change.

Timeline

Starting with Percona Server version 8.0.26 (expected in Q3 2021), the storage engine will still be included in the binary builds and packages, but disabled by default. Users upgrading from previous versions will still be able to re-enable it manually, so they can perform the necessary migration steps.

Starting with Percona Server for MySQL version 8.0.28 (expected to ship in Q1 2022), the TokuDB storage engine will no longer be supported and will be removed from the installation packages. It will still be part of the 8.0 source code, but not enabled in our binary builds.

We intend to fully remove it in the next major version of Percona Server for MySQL (9.0 or whatever version that will be). Note that this change only applies to Percona Server for MySQL version 8.0 – TokuDB remains enabled and supported in Percona Server 5.7 until this release has reached the end of its support period.

In case you’re still using TokuDB in Percona Server 8.0, we recommend switching to InnoDB or MyRocks instead as soon as possible. Please consult our documentation on how to migrate and remove the TokuDB storage engine.

If you need any assistance with migrating your data or if you have any questions or concerns about this, don’t hesitate to reach out to us – our experts are here to help!

May
18
2021
--

WiredTiger File Forensics Part 2: wt dump

wiredtiger file forensics wt dump

This article contains one normal section – how to print information directly from the raw WiredTiger files using wt dump – followed by really important, otherwise undocumented information about how to get to the MongoDB binary data inside WT tables.

See “WiredTiger File Forensics (Part 1: Building “wt”)” for:

  • How to build the “wt” utility command
  • A short description of its commandline syntax
  • How to set the wiredtiger_open() configuration string it needs

The examples in this page assume the following:

  • WiredTiger.config or WIREDTIGER_CONFIG or the -C option is being used to give “wt” the right config string. Having said that – an empty “wt” config might work OK for you, at least until wt printlog.
  • xxd, bsondump, jq are installed. (See “Shell script formatting tricks” section for why.)

wt dump

The wt dump command can be used to show the full content of a WiredTiger table file. It will output the table object’s header then every key and value.

wt dump has one compulsory argument: a uri. A uri is usually "table:" + the WT ident string, or "file:" + a (relative) file path. Eg. "table:collection-4-9876544321" or "file:collection-4-9876544321.wt".

“table:” is automatically prepended to the uri by wt dump if you don’t specify any type yourself.

Documentation at source.wiredtiger.com

The format from wt dump can be:

  • “print”: Keys and values will be strings of printable ascii chars and \-escaped hex values of non-printable bytes. (A.k.a. pretty-print.)
    Eg. \82\00\00\00\03_id\00\0a\00\00\00\ff_id\00\00\03max\00…..
  • “hex”: Keys and values will be show as continuous [0-9af] hexadecimal characters. No escaping or quotes.
    Eg. 82000000035f6964000a000000ff5f696400…
  • "json": Keys will be shown as their natural type (only a plain number as of MongoDB 4.4 / WiredTiger 10.0.0, I think). Values will be shown as UTF-16, which doesn't make sense – it inserts null bytes every second byte. (I suspect this is a current-version bug.)
    Eg. “\u0082\u0000\u0000\u0000\u0003\u005f\u0069\u0064\u0000\u000a\u0000\u0000\u0000\u00ff\u005f\u0069\u0064\u0000….”
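
To compare the three formats on the same table, just vary the flag (the ident below is a placeholder); the default output is "print", -x selects hex, and -j selects JSON:

$ wt dump    table:collection-4-9876544321 | head -n 9    #print (the default)
$ wt dump -x table:collection-4-9876544321 | head -n 9    #hex
$ wt dump -j table:collection-4-9876544321 | head -n 9    #json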

For “print” or “hex” the output has a fixed line format:

  • Line 1: WT library version string. Eg. "WiredTiger Dump (WiredTiger Version 10.0.0)"
  • Line 2: "wt dump" format type. Eg. "Format=hex"
  • Line 3: The constant string "Header"
  • Line 4: URI. Eg. "table:collection-10-5841128870485063129"
  • Line 5: Config string from the table's creation (WT_SESSION::create). Eg. "access_pattern_hint=none,allocation_size=4KB,app_metadata=, …"
  • Line 6: The constant string "Data"
  • Lines 7, 8: First key; its value. Eg. "e85d689139ffffdfc1" and "6400000011747300010000003a91685d12680000000000000000…"
  • Lines 9, 10: Second key; its value. Eg. "e85d68913bffffdfc1" and "e700000011747300010000003c91685d12740001000000000000…"
  • (repeat for all key-value pairs)

The -j JSON output mode of wt dump is much easier to read by eye, but values are currently encoded as UTF-16 strings, which is practically useless. I can't ingest that using my preferred JSON tools, at least. Eg. the binary value 82 00 00 00 03 5f 69 64 (eight bytes) will be output as "\u0082\u0000\u0000\u0000\u0003\u005f\u0069\u0064" (sixteen bytes). The binary length has been doubled by null bytes between every intended value byte.

Shell script formatting tricks used in this article

  • xxd -r -p converts plain [0-9a-fA-F]* hex strings to raw bytes. (Note that "xxd -rp" fails, and fails silently. Keep the -r and -p flags separate in the shell command.)
  • bsondump is used to convert BSON from binary to JSON. Use –quiet.
  • tail -n +7 to skip the header output of wt dump.
  • awk is used to filter out just the keys (line number NR%2 == 1) or values (line number NR%2 == 0) in the pretty-print or hex mode of wt dump.
  • jq is used to filter and/or reformat bsondump and wt printlog output.
    (Ideally jq would be used to ingest the output of wt dump -j directly, but the falsely null-padded, bytes-as-UTF-16 string encoding it currently outputs defeats that.)

Record output is not influenced by the WT binary format

The output of wt dump is unaffected by the WiredTiger binary format used. It will come out the same regardless of table format type, and modifiers such as max page size, key prefixing, etc.

This means wt dump won’t teach you about the ‘physical’ format of the data. The only thing you can see related to this is the config string that was used in WT_SESSION::create when a table was first made. It is shown in the header section every time wt dump is run.

$ wt dump file:collection-0--4131298130356306083.wt 
WiredTiger Dump (WiredTiger Version 10.0.0)
Format=print
Header
file:collection-0--4131298130356306083.wt
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(enabled=false,file_metadata=,repair=false),internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=false),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_custom=(prefix=,start_generation=0,suffix=),merge_max=15,merge_min=0),memory_page_image_max=0,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,readonly=false,source=,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,tiered=(chunk_size=1GB,tiers=),tiered_storage=(auth_token=,bucket=,local_retention=300,name=,object_target_size=10M),type=file,value_format=u,verbose=[],write_timestamp_usage=none

FYI there are a lot of empty and/or unused options in the config strings, so don't take the presence of an option to imply it is being used. Eg. above we see lsm options, even though all the tables mongodb makes are btrees (which you can confirm by the "format=btree" option).

The wt verify subcommand with a -d option does show something about the physical layout. Arguably wt stat does too. But these will not be covered here.

Metadata tables to explore in MongoDB data directory

WiredTiger

Yawn. Just a version string. View it as plain text.

WiredTiger.turtle

Fairly boring. Version information, a URI pointing to WiredTiger.wt, a config string to open it. View it as plain text.

The turtle file will be recreated auto-magically if it is absent.

WiredTiger.config

This is not a WiredTiger data file. It is kind of like a resource file for the “wt” utility command. It won’t be created by any program. It will only be made by you, or some nice colleague, to save you from specifying the -C option each time you run the “wt” command. See the previous blog article to learn what to put in it.

WiredTiger.lock

Just a classic lock file. Contains nothing except the plain text “WiredTiger lock file”. Created and overwritten automatically.

WiredTiger.wt

This is the main metadata table of the WiredTiger storage engine. It is also, itself, a WiredTiger table, one the WiredTiger.turtle file points to.

  • The keys in this table are URI strings of:
    • table: Wiredtiger tables
    • file: Files
    • system: A couple of special system values, and 
    • colgroup: Colgroups.
      (Colgroups are 1-to-1 to tables the way MongoDB uses WiredTiger; I tend to forget about their presence as a result.)
  • The values are the table/file/colgroup WiredTiger-flavoured JSON config strings that WiredTiger uses when it opens those objects in other API functions.

The entire content of this file is ascii-printable, keys and values alike. WT identifiers and config values are all print-safe c strings. You won’t have it as easy with the other table files though! They all contain binary numeric type and BSON keys and/or values.

WiredTiger.wt is not the MongoDB collection and index etc. catalog. That is in _mdb_catalog.wt (see later).

There is a special command that is hard-coded to dump only this collection – wt list. By default, it filters the output to just the keys, and furthermore, it excludes the “system:” ones plus the table/file listing for itself. Even if you use the “-v” option it will still exclude itself.
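
For example, a hedged sketch (assuming you run it from the data directory with a wiredtiger_open config in place, e.g. via the WiredTiger.config file described in the previous article, and that your idents will differ from the ones in this data directory):

$ #keys only; "system:" entries and WiredTiger.wt's own listing are excluded
$ wt list | head
$ #-v shows the complete schema table (it still excludes WiredTiger.wt itself)
$ wt list -v | head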

Bonus exercise: Dump the content of the WiredTiger.wt metadata file like it was just any other wt table

Amongst other records, the WiredTiger.wt file includes “table:xxxx” keys that point to every table’s URI. Every table except itself, that is.

So you can run commands such as wt dump collection-0--4131298130356306083, or wt dump index-8-5834121039240263510 or wt dump _mdb_catalog or wt dump sizeStorer, but you can’t run wt dump WiredTiger.

The trick is using the URI with “file:” at the front. You also need the full filename with the “.wt” suffix on the end, because it is a file, not a table URI. (wt dump assumes “table:” as the type prefix if not supplied.)

$ wt dump file:WiredTiger.wt

WiredTiger Dump (WiredTiger Version 10.0.0)
Format=print
Header
file:WiredTiger.wt
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum=uncompressed,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,.....
Data
colgroup:_mdb_catalog\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:_mdb_catalog.wt",type=file\00
colgroup:collection-0--4131298130356306083\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0--4131298130356306083.wt",type=file\00
colgroup:collection-0--6422702119521843596\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0--6422702119521843596.wt",type=file\00
colgroup:collection-0-5834121039240263510\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0-5834121039240263510.wt",type=file\00
colgroup:collection-0-5841128870485063129\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-0-5841128870485063129.wt",type=file\00
colgroup:collection-10--4131298130356306083\00
app_metadata=(formatVersion=1),collator=,columns=,source="file:collection-10--4131298130356306083.wt",type=file\00
colgroup:collection-10-5841128870485063129\00
app_metadata=(formatVersion=1,oplogKeyExtractionVersion=1),collator=,columns=,source="file:collection-10-5841128870485063129.wt",type=file\00
colgroup:collection-11-5841128870485063129\0
...
colgroup:sizeStorer\00
app_metadata=,collator=,columns=,source="file:sizeStorer.wt",type=file\00
file:WiredTigerHS.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=uncompressed,...
file:_mdb_catalog.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,....
file:collection-0--4131298130356306083.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,
...
file:sizeStorer.wt\00
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,...
system:checkpoint\00
checkpoint_timestamp="6088966b00000001"\00
system:checkpoint_base_write_gen\00
base_write_gen=104333\00
system:checkpoint_snapshot\00
snapshot_min=1,snapshot_max=1,snapshot_count=0\00
system:oldest\00
oldest_timestamp="6088966600000001"\00
table:_mdb_catalog\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0--4131298130356306083\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0--6422702119521843596\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0-5834121039240263510\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-0-5841128870485063129\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-10--4131298130356306083\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-10-5841128870485063129\00
app_metadata=(formatVersion=1,oplogKeyExtractionVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
table:collection-11-5841128870485063129\00
app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\00
...
table:index-9-5841128870485063129\00
app_metadata=(formatVersion=8),colgroups=,collator=,columns=,key_format=u,value_format=u\00
table:sizeStorer\00
app_metadata=,colgroups=,collator=,columns=,key_format=u,value_format=u\00

_mdb_catalog.wt

This is the catalog/schema that the MongoDB user sees.

$ #N.b. URI can be "file:_mdb_catalog.wt", "table:_mdb_catalog", or just "_mdb_catalog"
$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet

{"md":{"ns":"local.startup_log","options":{"uuid":{"$binary":{"base64":"9TLH+oBmQXW7eou0SZFXPQ==","subType":"04"}},"capped":true,"size":{"$numberInt":"10485760"}},"indexes":[{"spec":{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_","ns":"local.startup_log"},"ready":true,"multikey":false,"multikeyPaths":{"_id":{"$binary":{"base64":"AA==","subType":"00"}}},"head":{"$numberLong":"0"},"prefix":{"$numberLong":"-1"},"backgroundSecondary":false,"runTwoPhaseBuild":false,"versionOfBuild":{"$numberLong":"1"}}],"prefix":{"$numberLong":"-1"}},"idxIdent":{"_id_":"index-1-5841128870485063129"},"ns":"local.startup_log","ident":"collection-0-5841128870485063129"}
{"isFeatureDoc":true,"ns":null,"nonRepairable":{"$numberLong":"0"},"repairable":{"$numberLong":"1"}}
{"md":{"ns":"local.replset.oplogTruncateAfterPoint","options":{"uuid":{"$binary":{"base64":"8kuKi7n6TliHYwjFxEEKWw==","subType":"04"}}},"indexes":[{"spec":{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_","ns":"local.replset.oplogTruncateAfterPoint"},"ready":true,"multikey":false,"multikeyPaths":{"_id":{"$binary":{"base64":"AA==","subType":"00"}}},"head":{"$numberLong":"0"},"prefix":{"$numberLong":"-1"},"backgroundSecondary":false,"runTwoPhaseBuild":false,"versionOfBuild":{"$numberLong":"1"}}],"prefix":{"$numberLong":"-1"}},"idxIdent":{"_id_":"index-3-5841128870485063129"},"ns":"local.replset.oplogTruncateAfterPoint","ident":"collection-2-5841128870485063129"}
...
...

The content of each value is BSON, as you can see in the deserialized output above. There is just one document that isn’t for a collection, the “isFeatureDoc” one (see above). Apart from that single, trivial exception, the fields of each object are:

  • “ns” Eg. “test.foo”, “config.chunks”, “local.oplog.rs”
  • “ident” The unique identifier string for the WT table of this mongodb collection. Eg. “collection-7-5834121039240263510”
  • “idxIdent”: object listing the ident(ifier)s of each index on the collection (Annoyingly they are separated from the index definitions inside the “md.indexes” object; see below.)
  • “md” Metadata of the collection. It has these fields:
    • “ns” (again)
    • “options” Options you would pass to collection create command. Most collections will only have the (automatically-assigned) collection UUID value in this sub-object, but if a collection has a non-default option such as capped: true it is defined here.
    • “indexes” Array of index specs, the ones you’d see if you were running db.collection.getIndexes().

Confirming things match the MongoDB picture

If you want to prove to yourself that the information in the _mdb_catalog WiredTiger table matches the results of listDatabases (‘show dbs’), listCollections (‘show collections’), and listIndexes (db.collection.getIndexes()) in MongoDB, you can search the file with these shell commands:

List of collections and their index names, vs. the WT ident(ifier) strings

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq 'select(. | has("md")) | {ns: .ns, ident: .ident, idxIdent: .idxIdent}'
{
  "ns": "local.startup_log",
  "ident": "collection-0-5841128870485063129",
  "idxIdent": {
    "_id_": "index-1-5841128870485063129"
  }
}
{
  "ns": "local.replset.oplogTruncateAfterPoint",
  "ident": "collection-2-5841128870485063129",
  "idxIdent": {
    "_id_": "index-3-5841128870485063129"
  }
}
...
...

List of collections and the options strings they were created with:

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq 'select(. | has("md")) | .md | {ns: .ns, options: .options}'
{
  "ns": "local.startup_log",
  "options": {
    "uuid": { "$binary": {  "base64": "9TLH+oBmQXW7eou0SZFXPQ==", "subType": "04" } },
    "capped": true,
    "size": { "$numberInt": "10485760" }
  }
}
....

Index definitions (a.k.a. index specs):

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq 'select(. | has("md")) | .md | {ns: .ns, indexes: .indexes}'
{
  "ns": "local.startup_log",
  "indexes": [
    {
      "spec": { "v": { "$numberInt": "2" },
        "key": {
          "_id": { "$numberInt": "1" }
        },
        "name": "_id_",
        "ns": "local.startup_log"
      },
      "ready": true,
      "multikey": false,
      "multikeyPaths": {
        "_id": { "$binary": { "base64": "AA==", "subType": "00" }
        }
      },
      "head": { "$numberLong": "0" },
....

Output a tab-delimited table of WT ident(ifier) values vs. collection namespace

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("md")) | [.ident, .ns] | @tsv' | sort
collection-0--4131298130356306083	config.cache.chunks.test.bar
collection-0-5834121039240263510	local.replset.initialSyncId
collection-0-5841128870485063129	local.startup_log
collection-0--6422702119521843596	config.system.sessions
collection-10--4131298130356306083	config.cache.chunks.test.foo
...
...

Output a tab-delimited table of WT ident(ifier) values vs. indexes

The following outputs three columns: WT ident value, collection namespace, and index name.

$ wt dump -x table:_mdb_catalog | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -r 'select(. | has("idxIdent")) | .ns as $nsT | .idxIdent | to_entries[] | [.value, $nsT, .key] | @tsv' | sort
index-11--4131298130356306083	config.cache.chunks.test.foo	_id_
index-12--4131298130356306083	config.cache.chunks.test.foo	lastmod_1
index-12-5841128870485063129	local.system.replset	_id_
index-1--4131298130356306083	config.cache.chunks.test.bar	_id_
index-14-5841128870485063129	admin.system.version	_id_
...
...

sizeStorer.wt

Just a table of BSON values. MongoDB stores the total BSON document data size and the record count of each MongoDB collection here. The keys are not the MongoDB collection ns values though – they are the WT URI/ident values. You’ll have to map them to the output of wt dump _mdb_catalog to work out which counts and sizes are for which collections.

$ #print the keys
$ wt dump table:sizeStorer | tail -n +7 | awk 'NR%2 == 1 { print }' | head 
table:_mdb_catalog
table:collection-0--2823106041799591351
table:collection-0--4131298130356306083
table:collection-0--6422702119521843596
...
...

$ #print the bson data values
$ wt dump -x table:sizeStorer | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -c .
{"numRecords":{"$numberLong":"37"},"dataSize":{"$numberLong":"16930"}}
{"numRecords":{"$numberLong":"1"},"dataSize":{"$numberLong":"22"}}
{"numRecords":{"$numberLong":"1"},"dataSize":{"$numberLong":"130"}}
{"numRecords":{"$numberLong":"0"},"dataSize":{"$numberLong":"0"}}
...
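
To pair each key with its decoded value on one line, here is a hedged one-liner sketch (it assumes bash, for the process substitution, and the same wt/xxd/bsondump/jq tools used above):

$ paste <(wt dump table:sizeStorer | tail -n +7 | awk 'NR%2 == 1 { print }') \
        <(wt dump -x table:sizeStorer | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | jq -c .)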

storage.bson

Not a part of WiredTiger. A few storage-engine-related options are persisted into this file in the dbPath data directory so the MongoDB layer above can restart knowing whether the directoryPerDB etc. options were used the last time the data files were written.

View it using bsondump [--quiet] storage.bson.
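
For example, a minimal sketch (run it from the dbPath directory; the fields in the output will vary by version and configuration):

$ bsondump --quiet storage.bson | jq .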

Application (= MongoDB) tables

The collection-*.wt and index-*.wt files contain the data of the MongoDB collections and indexes that you observe as a client connected to a MongoDB instance.

These are not WiredTiger metadata so we’ll cover them in the next blog post (WiredTiger File Forensics Part 3: Viewing all the MongoDB data). For starters though you can use the wt dump -x collection-* | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet method shown for _mdb_catalog earlier to see the bsondump of collection-*.wt.
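
As a concrete, hedged example, here is that pipeline applied to the local.startup_log collection of this data directory (substitute whichever ident your _mdb_catalog maps to the collection you are interested in):

$ wt dump -x table:collection-0-5841128870485063129 | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet | head -n 1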

FYI these files do not contain any of their own MongoDB catalog information. You can see the data in them, but you can’t see which indexes belong to which collections, or even what a collection’s MongoDB ns (namespace) is. You will always need to dump the BSON data from the _mdb_catalog WiredTiger table to get that.

WiredTiger’s transactional log files

The journal/WiredTigerLog.NNNNNNNN files are WiredTiger’s transactional log files, its write-ahead log. This is the only place on disk that preserves writes that haven’t yet been flushed to collection-*.wt and index-*.wt files during a checkpoint.

You usually don’t need to pay attention to the exact log file numbers as you work with them – they’re treated as a sequence and are automatically iterated from the start.

You view them with wt printlog instead of wt dump. The output is always JSON, but additional hex versions of the keys and values in ops can be included as well by using the -x option.

$ wt printlog | head -n 20
[
  { "lsn" : [72,128],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "prev_lsn",
        "prev_lsn": [1, 0]
      }
    ]
  },
  { "lsn" : [72,256],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "checkpoint_start",
...

This is just a starter for looking at the WiredTiger transaction log files (a.k.a. the MongoDB journal files). The meaning of the lsn numbers, how to identify collections from ops[].fileid, how to find oplog writes, and why there are no writes for many collections is material for another day.

You can ignore the WiredTigerPreplog.* files, they are empty. They’re pre-allocated in the filesystem to reduce latency that can occur doing the open() syscall; they are simply renamed as the next WiredTigerLog.* file when it is needed.

Summary

wt dump is the main command you will use to look at WiredTiger tables. wt list is a shortcut for dumping a subset of records from the top-most of those tables (WiredTiger.wt).

wt printlog is the command needed to view the (completely different) transaction log files. These are the write-ahead log of WiredTiger, known as journal in MongoDB.

Knowing how to follow the hierarchy of WiredTiger metadata, followed by the MongoDB catalog, is crucial to begin introspection of a MongoDB data directory.

Another crucial understanding is that WiredTiger stores keys and values as raw binary values without any type indicator, so you have to know ahead of time what those are, and/or be a skilled binary detective. The long and the short of it though is: WiredTiger’s metadata is simple, ascii-printable strings; all other tables are usually BSON or plain integer types. (Ok, one extra: A special ‘dehydrated keystring’ packing format, defined by the MongoDB layer, is used in the index-*.wt files.)

Wait for the next blog post (WiredTiger File Forensics Part 3: Viewing all the MongoDB data) to see how to put it all together and extract any information, as you think of it as a MongoDB user.

May
18
2021
--

WiredTiger File Forensics Part 1: Building “wt”

wiredtiger file forensics

Most of the files in a data directory of a MongoDB server are made by the WiredTiger storage engine. If you want to look at the content inside them you can use the tool “wt” from the WiredTiger library:

https://github.com/wiredtiger/wiredtiger/

http://source.wiredtiger.com/10.0.0/command_line.html

Inspection of the WiredTiger files is not an essential MongoDB DBA skill – it’s just for the curious. Or maybe post-crash data recovery, if you weren’t using a replica set. But for a certain type of person this is fun, and if you voluntarily clicked a page link to “file forensics” then it seems that’s you.

Build and Install the “wt” Executable

Documentation: http://source.wiredtiger.com/develop/install.html

  1. git clone https://github.com/wiredtiger/wiredtiger.git
  2. cd wiredtiger
  3. git branch --all | grep mongodb

      #find a branch matching the major version of mongodb you are using

  4. git checkout mongodb-4.4  #Using the WT code matching mongodb 4.4 as an example
  5. sh autogen.sh
  6. ./configure --disable-shared --with-builtins=lz4,snappy,zlib,zstd
  7. make -j $(nproc) # -j $(nproc) to parallelize build to number of cores
  8. sudo make install
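
An optional sanity check after step 8 (assuming the default install prefix of /usr/local):

$ which wt
$ wt -V   #prints the library version and exits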

Points about the make configuration above:

Use

--disable-shared

  to force the build of a statically linked “wt” binary.

If you do not disable shared library mode, the final build product will be somewhat different. The “wt” target executable at the top of the source directory (and installed to /usr/local/bin/) will be a script, not a compiled binary. That script executes a compiled binary built at .libs/wt, but before it can, it has to set the LD_LIBRARY_PATH environment variable so the binary can find libwiredtiger.so, which is also in .libs/. Much hacking is required if you then try to move these files elsewhere.

Also use

--with-builtins=lz4,snappy,zlib,zstd

to match what normal mongod binaries expect. But feel free to drop “lz4”, “zlib” and/or “zstd” if there is a hassle about installing those libraries’ dev packages. Only snappy is used in 99%+ of MongoDB deployments.

--with-builtins lz4,snappy,zlib,zstd

, i.e. without a “=” between the option name and values, will fail with the message “checking build system type... Invalid configuration `snappy': machine `snappy' not recognized”. Put the “=” in.

You can use configuration option --enable-XXX (eg. --enable-snappy) to build extensions as standalone libraries instead, but that’s the hard way. You’ll have to continually specify the full path to the standalone extension library as an extra extension in config every time you run a wt program.
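
To illustrate “the hard way”, a hedged sketch only (the extension library path below is an assumption; adjust it to wherever your build actually installed libwiredtiger_snappy.so):

$ wt -C 'extensions=["/usr/local/lib/libwiredtiger_snappy.so"],log=(compressor=snappy,path=journal)' list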

Versions

You’ll need to build a different branch of wiredtiger depending on which mongodb version of data you’re looking at. “Log” in the table below means the transaction log, a.k.a. the journal in mongodb.

/**
* Table of MongoDB<->WiredTiger<->Log version numbers:
*
* |                MongoDB | WiredTiger | Log |
* |------------------------+------------+-----|
* |                 3.0.15 |      2.5.3 |   1 |
* |                 3.2.20 |      2.9.2 |   1 |
* |                 3.4.15 |      2.9.2 |   1 |
* |                  3.6.4 |      3.0.1 |   2 |
* |                 4.0.16 |      3.1.1 |   3 |
* |                  4.2.1 |      3.2.2 |   3 |
* |                  4.2.6 |      3.3.0 |   3 |
* | 4.2.6 (blessed by 4.4) |      3.3.0 |   4 |
* |                  4.4.0 |     10.0.0 |   5 |
*/

(This useful table is currently found as a comment in mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp.)

To build WT matching 4.4 do “git checkout mongodb-4.4”, for the latest 4.2 minor versions “git checkout mongodb-4.2”, etc. There are also tags (rather than branches) for a lot of minor versions. Use “git branch -a” and “git tag --list” to see them.
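
For example, listing what is available (the branches and tags present depend on when you cloned):

$ git branch -a | grep mongodb
$ git tag --list   #pipe through grep or less to find the minor version you need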

A “wt” Build Shortcut

If you already have a compilable source code directory of MongoDB Community or Percona Server for MongoDB, a slightly quicker way to build and install wt is to run scons with “wt” as the target (instead of “mongod”, or “install-core”, etc.). The WiredTiger library code in the subdirectory src/third_party/wiredtiger/ is already cherry-picked to match whichever version of MongoDB you have checked out at the moment.

The built binary will be created at build/opt/third_party/wiredtiger/wt if building with the legacy (<=v4.2 style) install mode. Otherwise, I believe it is installed to the path given by the scons --prefix option plus “/bin/”.
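
A hedged sketch of that shortcut (the exact scons wrapper and flags differ between MongoDB versions; buildscripts/scons.py is the wrapper shipped in the MongoDB source tree, and r4.4.6 is only an example tag):

$ cd mongo    #or your percona-server-mongodb source checkout
$ git checkout r4.4.6
$ python3 buildscripts/scons.py wt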

Executing the wt Command

$ wt
WiredTiger Data Engine (version 10.0)
MongoDB wiredtiger_open configuration: ...
global_options:
    -B  maintain release 3.3 log file compatibility
    -C config
        wiredtiger_open configuration
    -E key
        secret encryption key
    -h home
        database directory
    -L  turn logging off for debug-mode
    -m  run verify on metadata
    -R  run recovery (if recovery configured)
    -r  access the database via a readonly connection
    -S  run salvage recovery (if recovery configured)
    -V  display library version and exit
    -v  verbose
commands:
    alter
        alter an object
    backup
        database backup
    compact
        compact an object
    copyright
        display copyright information
    create
        create an object
    downgrade
        downgrade a database
    drop
        drop an object
    dump
        dump an object
    list

The options above are the global ones; each subcommand has further options not shown above. Eg. the list subcommand has two options, -c and -v (see the help output below). The subcommand-specific options (-c or -v in this case) must appear after the subcommand (“list”, “dump”, etc.) on the command line, and the global options (-LmRrSVv, -C, -E) must come before it. An example combining the two kinds follows the help output.

$ wt list --help
wt: illegal option -- -
usage: wt [-LmRrSVv] [-C config] [-E secretkey] [-h home] list [-cv] [uri]
options:
    -c  display checkpoints in human-readable format (by default checkpoints are not displayed)
    -v  display the complete schema table (by default only a subset is displayed)
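
For example, combining global and subcommand options (a hedged invocation; /data/db is a placeholder for your data directory, and the -C config string is explained in the next section):

$ wt -h /data/db -C 'log=(compressor=snappy,path=journal)' list -v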

Set the wiredtiger_open Config String

MongoDB configures WiredTiger so that (by default) the snappy compressor is used, and the WT transaction log files are placed under the “journal/” subdirectory.

Some wt subcommands don’t need to read the transaction logs, so you might survive without setting this. But if a subcommand does need them, “wt: __wt_log_scan, 2114: no log files found: Operation not supported” is the error message you will see, and you should follow the advice in this section.

Neither the “journal” transaction log subdirectory nor snappy compression is a default of the WT library. The current v10.0.0 wt’s help output shows “MongoDB wiredtiger_open configuration: log=(enabled=true,path=journal,compressor=snappy)” as the second or third line of output, but this is misleading. It is just a hint message, not the config string actually in effect.

So set this value in a config string so the wiredtiger connection API knows where and how to open the wiredtiger transaction log files. (Option setting method is explained in the “How to put the configuration into effect” section just a little bit further down the page.)

log=(compressor=snappy,path=journal)

As of May 2021: I do not include the log=(enabled=true,…) value because it doesn’t seem to be necessary, and it leads to a “WT_LOGSCAN_FIRST not set” error when using the wt printlog subcommand at least.

Config String Syntax

See http://source.wiredtiger.com/10.0.0/config_strings.html for notes about the supposedly “compatible with JavaScript Object Notation (JSON)” config string format that WT uses.

In the config strings you’ll see in WT or MongoDB code, and the ones used in this article, there are no “{}” brackets or the “:” key-value delimiter. Instead it will be “object=(key1=val1,key2=val2,..),array_val=[val1,val2,val3]” style. It’s weird JSON, but presumably also perfectly conforming to some standard of JSON written somewhere.

These config options are for the WT C API’s wiredtiger_open function which initializes a WT_CONNECTION object. (grep for “wiredtiger_open_configuration” in util_main.cpp if you want to confirm the code.)

How to Put the Configuration into Effect

Lowest precedence: WiredTiger.config file

This is probably the most comfortable option. Put the config string in a file WiredTiger.config in the data directory (i.e. alongside WiredTiger.wt, WiredTiger.turtle, all the collection and index *.wt files, etc.)

$ cat WiredTiger.config
log=(compressor=snappy,path=journal)

http://source.wiredtiger.com/1.4.2/database_config.html#config_file

Middle precedence: WIREDTIGER_CONFIG environment variable

Settings here will override those coming from a WiredTiger.config file.

Eg. this setting will enable verbose log messages from places in the code where __wt_verbose(session, WT_VERB_METADATA, …) is called.

export WIREDTIGER_CONFIG="verbose=[metadata]"

http://source.wiredtiger.com/1.4.2/database_config.html#config_file

Highest precedence: Use the -C option

This will also include values merged from the previous two sources, but otherwise is the final say in what config string is passed into wiredtiger_open(.., .., config_string, &conn).

N.b. The -C value has to be before the subcommand. Eg.

wt -C "log=(compressor=snappy,path=journal)" list

Some other options of the “wt” util will add to this config string as well. Eg. -m adds “verify_metadata=true”, -R adds “log=(recover=on)”, etc.

Config merging rules

… aren’t clear to me. In practical testing, it seems to override sometimes but merge at other times. See the code of wiredtiger_open() if you want to figure it out.

Q. What wiredtiger_open() config does MongoDB use?

As it happens you can easily see the config string mongod uses as it initializes an internal connection to the WiredTiger library. Just search for “Opening WiredTiger” in mongod’s diagnostic log file. In MongoDB 4.4+, with the structured json-style log file, you can alternatively grep for a unique log message identifier number 22315.

$ grep -m 1 'Opening WiredTiger' /data/node_s1n1/mongod.log

{"t":{"$date":"2021-04-27T21:58:09.580+09:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=7375M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}

You could use this entire config string when starting the wt utility command, but you don’t need to. Some of the settings are for asynchronous processes, eg. eviction and file manager timers, which the wt util won’t run. And some I wouldn’t use: eg. log=(archive=true) will discard old log files after recovery and checkpoint, if you somehow manage to make that happen with just the wt command.
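
If you only want the config string itself, here is a hedged one-liner for the 4.4+ structured log format (the log path is the same placeholder as above):

$ grep '"id":22315' /data/node_s1n1/mongod.log | head -n 1 | jq -r '.attr.config'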

Summary

This article was a relatively brief guide on how to build the “wt” utility of the WiredTiger library, plus information about the runtime config values it needs when inspecting a data directory created by MongoDB.

Please see the next article “WiredTiger File Forensics (Part 2: wt dump)” to see how to use this tool with a mongod node’s data directory files.
