Jan 15, 2021

MySQL 8.0.22: SHOW PROCESSLIST Version 2 – Now Available From PERFORMANCE_SCHEMA

The “SHOW PROCESSLIST” command is well known and very useful for MySQL DBAs. It helps you understand ongoing thread activities and their current states. By default, the “show processlist” output is collected from the thread manager, which requires holding a global mutex. From MySQL 8.0.22, we have an alternative way to get the process details from the PERFORMANCE_SCHEMA, which doesn’t need the global mutex.

Note: We also have the non-blocking SYS schema views “processlist” and “x$processlist”, which provide more complete information than the SHOW PROCESSLIST statement and the INFORMATION_SCHEMA.PROCESSLIST and PERFORMANCE_SCHEMA.PROCESSLIST tables. However, they can’t be integrated with the “SHOW PROCESSLIST” command.

In this blog, I am going to explain the complete details about the new processlist implementation using PERFORMANCE_SCHEMA.

“SHOW PROCESSLIST” Using Thread Manager (default)

  • This is the default method.
  • The default “show processlist” implementation iterates across active threads from within the thread manager while holding a global mutex.
  • This negatively impacts performance, particularly badly on busy systems.
  • INFORMATION_SCHEMA.PROCESSLIST is another source of process information; it also uses the thread manager to collect its metrics.
  • By default, “mysqladmin processlist” also uses the thread manager to get the details.

The following statements are equivalent:

SHOW FULL PROCESSLIST;
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST;
mysqladmin processlist --verbose

“SHOW PROCESSLIST” Using Performance Schema

  • Available from MySQL 8.0.22.
  • It collects the thread details from the PERFORMANCE_SCHEMA.PROCESSLIST table.
  • The global mutex is not needed.
  • It helps avoid the performance impact of querying “show processlist”, particularly on busy systems.
  • The implementation also applies to “mysqladmin processlist”.

The following statements are equivalent:

SHOW FULL PROCESSLIST;
SELECT * FROM PERFORMANCE_SCHEMA.PROCESSLIST;
mysqladmin processlist --verbose

The “PERFORMANCE_SCHEMA.PROCESSLIST” table has similar columns to “INFORMATION_SCHEMA.PROCESSLIST”:

mysql> desc performance_schema.processlist;
+---------+-----------------+------+-----+---------+-------+
| Field   | Type            | Null | Key | Default | Extra |
+---------+-----------------+------+-----+---------+-------+
| ID      | bigint unsigned | NO   | PRI | NULL    |       |
| USER    | varchar(32)     | YES  |     | NULL    |       |
| HOST    | varchar(255)    | YES  |     | NULL    |       |
| DB      | varchar(64)     | YES  |     | NULL    |       |
| COMMAND | varchar(16)     | YES  |     | NULL    |       |
| TIME    | bigint          | YES  |     | NULL    |       |
| STATE   | varchar(64)     | YES  |     | NULL    |       |
| INFO    | longtext        | YES  |     | NULL    |       |
+---------+-----------------+------+-----+---------+-------+
8 rows in set (0.00 sec)

mysql> desc information_schema.processlist;
+---------+-----------------+------+-----+---------+-------+
| Field   | Type            | Null | Key | Default | Extra |
+---------+-----------------+------+-----+---------+-------+
| ID      | bigint unsigned | NO   |     |         |       |
| USER    | varchar(32)     | NO   |     |         |       |
| HOST    | varchar(261)    | NO   |     |         |       |
| DB      | varchar(64)     | YES  |     |         |       |
| COMMAND | varchar(16)     | NO   |     |         |       |
| TIME    | int             | NO   |     |         |       |
| STATE   | varchar(64)     | YES  |     |         |       |
| INFO    | varchar(65535)  | YES  |     |         |       |
+---------+-----------------+------+-----+---------+-------+
8 rows in set (0.00 sec)

Implementation

  • Make sure the PERFORMANCE_SCHEMA is enabled at the server startup.
  • Make sure MySQL was configured and built with the thread instrumentations enabled.

MySQL provides a variable “performance_schema_show_processlist” to enable this feature. Once we enable the variable, the “SHOW PROCESSLIST” command will start to show the details from the “PERFORMANCE_SCHEMA.PROCESSLIST” table instead of the thread manager.

The variable has global scope, and there is no need to restart the MySQL server.

mysql> show global variables like 'performance_schema_show_processlist';
+-------------------------------------+-------+
| Variable_name                       | Value |
+-------------------------------------+-------+
| performance_schema_show_processlist | OFF   |
+-------------------------------------+-------+
1 row in set (0.08 sec)

mysql> set global performance_schema_show_processlist='ON';
Query OK, 0 rows affected (0.00 sec)

mysql> \r
Connection id:    23
Current database: *** NONE ***

mysql> show global variables like 'performance_schema_show_processlist';
+-------------------------------------+-------+
| Variable_name                       | Value |
+-------------------------------------+-------+
| performance_schema_show_processlist | ON    |
+-------------------------------------+-------+
1 row in set (0.00 sec)

Yes, the PERFORMANCE_SCHEMA.PROCESSLIST table is now used for “SHOW PROCESSLIST”.
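
If you also want the setting to survive a restart, MySQL 8.0’s SET PERSIST can write it to mysqld-auto.cnf. A minimal sketch, assuming a client login path or credentials are already configured:

mysql -e "SET PERSIST performance_schema_show_processlist = 'ON';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'performance_schema_show_processlist';"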

“SHOW PROCESSLIST” output from “PERFORMANCE_SCHEMA”:

mysql> show processlist\G
*************************** 1. row ***************************
     Id: 5
   User: event_scheduler
   Host: localhost
     db: NULL
Command: Daemon
   Time: 2461
  State: Waiting on empty queue
   Info: NULL
*************************** 2. row ***************************
     Id: 23
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: executing
   Info: show processlist
2 rows in set (0.00 sec)

You can also query the “performance_schema.processlist” table to get the thread information.

mysql> select * from performance_schema.processlist\G
*************************** 1. row ***************************
     ID: 5
   USER: event_scheduler
   HOST: localhost
     DB: NULL
COMMAND: Daemon
   TIME: 2448
  STATE: Waiting on empty queue
   INFO: NULL
*************************** 2. row ***************************
     ID: 23
   USER: root
   HOST: localhost
     DB: NULL
COMMAND: Query
   TIME: 0
  STATE: executing
   INFO: select * from performance_schema.processlist
2 rows in set (0.00 sec)

“mysqladmin processlist” output from “performance_schema”:

[root@mysql8 vagrant]# mysqladmin processlist
+----+-----------------+-----------+----+---------+------+------------------------+------------------+
| Id | User            | Host      | db | Command | Time | State                  | Info             |
+----+-----------------+-----------+----+---------+------+------------------------+------------------+
| 5  | event_scheduler | localhost |    | Daemon  | 2631 | Waiting on empty queue |                  |
| 24 | root            | localhost |    | Query   | 0    | executing              | show processlist |
+----+-----------------+-----------+----+---------+------+------------------------+------------------+

Recommendations

  • To avoid having some threads ignored, leave the “performance_schema_max_thread_instances” and “performance_schema_max_thread_classes” system variables at their default value (-1, meaning the parameter is autosized at server startup).
  • To avoid having some STATE column values be empty, leave the “performance_schema_max_stage_classes” system variable at its default (-1, autosized at server startup). A quick way to check these settings is shown below.
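
A minimal check of the variables mentioned above (note that when autosizing is in effect, the server may report the computed value rather than -1):

mysql -e "SHOW GLOBAL VARIABLES LIKE 'performance_schema_max_thread%';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'performance_schema_max_stage_classes';"
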
Jan 14, 2021

How to Store MySQL Audit Logs in MongoDB in a Maintenance-Free Setup

I was once helping one of our customers load MySQL audit logs into a MySQL database and analyze them. But I immediately thought: “Hey, this is not the most efficient solution! MySQL, or a typical RDBMS in general, was not really meant to store logs after all.”

So, I decided to explore an alternative – which seemed more sensible to me – and use MongoDB as the storage for logs, for three main reasons:

  • its schema-less nature fits the audit log well, where different types of events may use different fields
  • it speaks JSON natively, and the audit plugin can write logs in JSON format
  • its capped collections feature avoids additional maintenance overhead

Just to mention, audit logging is available in MySQL Enterprise Edition, but a similar, yet free, solution is available in Percona Server for MySQL. In both cases, it works by installing the audit log plugin.

Ad Hoc Import

The simplest scenario is to just set the audit log format to JSON:

audit_log_format = JSON

And as soon as it collects some data, import the log file into a MongoDB collection via the mongoimport command, like this:

# mongoimport --username percona --password P3rc0n4 --host 10.95.83.225 --port 27017 --db auditlogs --collection audit1 --file /var/lib/mysql/audit.log
2020-12-31T16:24:43.782+0000 connected to: 10.95.83.225:27017
2020-12-31T16:24:44.316+0000 imported 25462 documents

mongo > db.audit1.countDocuments({})
25462

Of course, this works, but I prefer an automated solution, so I looked at available options for live-streaming the logs.

Syslog

The first thing that looked useful is the ability to send the audit log directly to syslog instead of a file. Knowing that both rsyslog and syslog-ng have MongoDB output modules, it felt like a very easy approach. So I installed the rsyslog-mongodb module package on my test Ubuntu VM running Percona Server for MySQL, and configured the audit log with:

[mysqld]
audit_log_handler = syslog
audit_log_format = JSON

Here is an example rsyslog (version 8.2) configuration:

# cat /etc/rsyslog.d/49-ship-syslog.conf
action(type="ommongodb"
uristr="mongodb://percona:P3rc0n4@10.95.83.225:27017/?authSource=auditlogs"
db="auditlogs" collection="mysql_node1_log")

This worked; however, the inserted documents looked like this:

mongo > db.mysql_node1_log.findOne().pretty()
{
"_id" : ObjectId("5fece941f17f487c7d1d158b"),
"msg" : " {\"audit_record\":{\"name\":\"Connect\",\"record\":\"7_1970-01-01T00:00:00\",\"timestamp\":\"2020-12-30T20:55:29Z\",\"connection_id\":\"9\",\"status\":0,\"user\":\"root\",\"priv_user\":\"root\",\"os_login\":\"root\",\"proxy_user\":\"\",\"host\":\"localhost\",\"ip\":\"\",\"db\":\"\"}}"
}

Basically, because syslog escapes the double quote symbols, the whole audit record appears as a single string inside the MongoDB collection instead of as a JSON object. No matter what I tried, like custom templates and property values in rsyslog, I could not disable the escaping. Therefore, although feeding MongoDB with audit logs works this way, it becomes pretty useless when it comes to analyzing the logs later. The same issue applies to syslog-ng and the syslog-ng-mod-mongodb module. And since MongoDB does not offer before-insert triggers, I could not easily “fix” the inserted data on the fly.

Fluentd For The Rescue!

This forced me to look for alternative solutions. One of them would be to use a FIFO file, tail the audit log continuously to feed it, and then read from it to insert the logs into MongoDB. I wanted a more robust way, though, and decided to try Fluentd instead. It was created as a versatile log collector, is highly flexible, works with many different applications out of the box, and, most importantly, is an open source project that speaks JSON natively. Making it do the job I wanted turned out to be easier than I expected.

Here is what I did:

  • Installed the Fluentd package (I chose the td-agent variant here for an even easier user experience)
  • Installed the MongoDB plugin for Fluentd (don’t use the usual ‘gem install’ here):
td-agent-gem install fluent-plugin-mongo

  • Configured audit log as a source and output directive for MongoDB:
# cat /etc/td-agent/td-agent.conf
####
...
<source>
 @type tail
 path /var/lib/mysql/audit.log
 pos_file /var/log/td-agent/audit.access_log.pos
 <parse>
  @type json
 </parse>
 tag mongo.audit.log
</source>
<match mongo.audit.log>
 @type mongo
 database auditlogs #(required)
 collection audit_log #(optional; default="untagged")
 capped
 capped_size 100m
 host 10.95.83.225 #(optional; default="localhost")
 port 27017 #(optional; default=27017)
 user percona
 password P3rc0n4
 <buffer>
  flush_interval 1s
 </buffer>
</match>

  • Added the user used by Fluentd to the mysql group so it can read the audit log, and pointed the audit plugin back to a rotated JSON file (both snippets below):
# id td-agent
uid=114(td-agent) gid=121(td-agent) groups=121(td-agent)
# usermod -a -G mysql td-agent
# id td-agent
uid=114(td-agent) gid=121(td-agent) groups=121(td-agent),120(mysql)

[mysqld]
audit_log_handler = file
audit_log_format = JSON
audit_log_file = audit.log
audit_log_rotate_on_size = 10M
audit_log_rotations = 3

  • Restarted both services to apply changes:
# systemctl restart mysql
# systemctl restart td-agent

  • Checked the Fluentd log to see if it reads the audit log as expected, also for when Percona Server for MySQL rotates it:
# tail -f /var/log/td-agent/td-agent.log
2020-12-31 02:41:39 +0000 [info]: adding match pattern="mongo.audit.log" type="mongo"
...
2020-12-31 02:41:40 +0000 [info]: #0 following tail of /var/lib/mysql/audit.log
...
2020-12-31 02:52:14 +0000 [info]: #0 detected rotation of /var/lib/mysql/audit.log; waiting 5 seconds
2020-12-31 02:52:14 +0000 [info]: #0 following tail of /var/lib/mysql/audit.log

  • Ran sysbench against MySQL instance and verified the new collection in MongoDB gets updated:
mongo > db.audit_log.countDocuments({})
281245

mongo > db.audit_log.stats()
{
 "ns" : "auditlogs.audit_log",
 "size" : 104857293,
 "count" : 281245,
 "avgObjSize" : 372,
 "storageSize" : 26357760,
 "capped" : true,
 "max" : -1,
 "maxSize" : 104857600,
(...)

Yay, it works like a charm! Not only are the audit logs rotated automatically on Percona Server for MySQL, but the destination collection’s size cap on MongoDB works as well, so I am safe when it comes to disk space on both hosts!

Here, there is a little caveat: if for some reason you drop the destination collection manually on MongoDB, incoming inserts will re-create it without the capped setting! Therefore, either let the collection be created by Fluentd on its service startup or create it manually with the capped setting, and don’t drop it later.
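
If you prefer to pre-create the collection yourself, a minimal sketch with the legacy mongo shell, reusing the database, collection name, and credentials from the td-agent.conf above and a 100 MB cap:

# pre-create the capped destination collection so a later manual drop cannot silently lose the cap
mongo "mongodb://percona:P3rc0n4@10.95.83.225:27017/auditlogs" \
  --eval 'db.createCollection("audit_log", {capped: true, size: 100*1024*1024})'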

Now, we can try some example aggregations to get some useful audit stats:

mongo > db.audit_log.aggregate([ { $group: { _id: {name: "$audit_record.name", command: "$audit_record.command_class"}, count: {$sum:1}}}, { $sort: {count:-1}} ])
{ "_id" : { "name" : "Execute", "command" : "error" }, "count" : 267086 }
{ "_id" : { "name" : "Query", "command" : "begin" }, "count" : 14054 }
{ "_id" : { "name" : "Close stmt", "command" : "error" }, "count" : 76 }
{ "_id" : { "name" : "Query", "command" : "show_variables" }, "count" : 7 }
{ "_id" : { "name" : "Query", "command" : "select" }, "count" : 6 }
{ "_id" : { "name" : "Quit" }, "count" : 5 }
{ "_id" : { "name" : "Query", "command" : "show_tables" }, "count" : 4 }
{ "_id" : { "name" : "Init DB", "command" : "error" }, "count" : 2 }
{ "_id" : { "name" : "Field List", "command" : "show_fields" }, "count" : 2 }
{ "_id" : { "name" : "Query", "command" : "show_databases" }, "count" : 2 }
{ "_id" : { "name" : "Connect" }, "count" : 1 }

mongo > db.audit_log.aggregate([ { $match: { "audit_record.status": {$gt: 0} } }, { $group: { _id: {command_class: "$audit_record.command_class", status: "$audit_record.status"}, count: {$sum:1}}}, { $sort: {count:-1}} ])
{ "_id" : { "command_class" : "error", "status" : 1049 }, "count" : 2 }
{ "_id" : { "command_class" : "show_tables", "status" : 1046 }, "count" : 2 }
{ "_id" : { "command_class" : "create_table", "status" : 1050 }, "count" : 2 }
{ "_id" : { "command_class" : "drop_table", "status" : 1051 }, "count" : 2 }
{ "_id" : { "command_class" : "drop_table", "status" : 1046 }, "count" : 2 }
{ "_id" : { "command_class" : "create_table", "status" : 1046 }, "count" : 1 }
{ "_id" : { "command_class" : "create_table", "status" : 1113 }, "count" : 1 }

References

https://www.percona.com/doc/percona-server/LATEST/management/audit_log_plugin.html
https://dev.mysql.com/doc/refman/8.0/en/audit-log.html
https://www.rsyslog.com/doc/v8-stable/configuration/modules/ommongodb.html
https://docs.fluentd.org/output/mongo

Jan 14, 2021

Thinking About Deploying MongoDB? Read This First.

Are you thinking about deploying MongoDB? Is it the right choice for you?

Choosing a database is an important step when designing an application. A wrong choice can have a negative impact on your organization in terms of development and maintenance. Also, the wrong choice can lead to poor performance.

Generally speaking, any kind of database can manage any kind of workload, but every database has specific workloads that fit it better than others.

You shouldn’t consider MongoDB just because it’s cool and a lot of companies are already using it. You need to understand whether it fits properly with your workload and expectations. So, choose the right tool for the job.

In this article, we are going to discuss a few things you need to know before choosing and deploying MongoDB.

MongoDB Manages JSON-style Documents and Developers Appreciate That

The basic component of a MongoDB database is a JSON-style document. Technically it is BSON, which adds some extra data types (e.g., datetime) that aren’t legal JSON.

We can consider the document the same as a record for a relational database. The documents are put into a collection, the same concept as a relational table.

JSON-style documents are widely used by a lot of programmers worldwide to implement web services, applications, and exchange data. Having a database that is able to manage that data natively is really effective.

MongoDB is often appreciated by developers because they can start using it without having specific knowledge about database administration and design and without studying a complex query language. Indeed, the MongoDB query language is also represented by JSON documents.

Developers can create, save, retrieve, and update their JSON-style documents with ease. This usually leads to a significant reduction in development time.

MongoDB is Schemaless

Are you familiar with relational databases? For sure you are, as relational databases have been used and studied at school and university for a long time. They are still the most widely used databases in the market nowadays.

You know that a relational schema needs a predefined and fixed structure for its tables. Any time you add or change a column, you need to run a DDL query, and additional time is needed to change your application code to manage the new structure. In the case of a massive change that requires multiple column changes and/or the creation of new tables, the application changes can be considerable. MongoDB’s lack of schema enforcement means none of that is required. You just insert a document into a collection and that’s all. Suppose you have a collection with user data. If at some point you need to add, for example, a new “date_of_birth” field, you simply start inserting the new JSON documents with the additional field. That’s all. No need to change anything in the schema.
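
As a small illustration of that point, here is a sketch using the legacy mongo shell; the database and “users” collection names are hypothetical:

mongo --quiet mydb --eval '
  // older document, no date_of_birth field
  db.users.insertOne({name: "Bob", email: "bob@example.com"});
  // newer document simply adds the field; no DDL and no migration needed
  db.users.insertOne({name: "Carol", email: "carol@example.com", date_of_birth: ISODate("1990-05-01")});
  printjson(db.users.find({}, {name: 1, date_of_birth: 1}).toArray());
'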

You can even insert completely different JSON documents, representing different entities, into the same collection. This is technically feasible, but it is not recommended.

MongoDB greatly shortens the cycle of application development for a non-technology reason – it removes the need to coordinate a schema change migration project with the DBA team. There is no need to wait until the DBA team does a QA dress-rehearsal and then the production release (with rollback plans) that, often as not, requires some production downtime.

MongoDB Has No Foreign Keys, Stored Procedures, or Triggers. Joins Are Supported, but Untypical.

A relational database design requires SQL queries that join multiple tables on specific fields. It may also require foreign keys to assure the consistency of the data and to run automatic changes on semantically connected fields.

What about stored procedures? They can be useful to embed into the database some application logic to simplify some tasks or to improve the security.

And what about triggers? They are useful to automatically “trigger” changes on the data based on specific events, like adding/changing/deleting a row. They help to manage the consistency of the data and, in some cases, to simplify the application code.

Well, none of them is available on MongoDB. So, be aware of that.

Note: to be honest, there is an aggregation stage ($lookup) that can implement the equivalent of a LEFT JOIN, but this is the only case.

How to survive without JOINs?

Managing JOINs must be done in your application code. If you need to join two collections, you read the first one, select the join field, use it to query the second collection, and so on. This seems expensive in terms of application development, and it could also lead to more queries being executed. Indeed it is, but the good news is that in many cases you don’t have to manage joins at all.

Remember that MongoDB is a schemaless database; it doesn’t require normalization. If you design your collections properly, you can embed and duplicate data in a single collection without creating an additional one. This way you won’t need to run any joins, because all the data you need is already in one collection.

Foreign keys are not available, but as long as you can embed multiple documents into the same collection, you don’t really need them.

Stored procedures can be implemented easily as external scripts written in your preferred language. Triggers can be implemented externally the same way, with the help of the Change Streams API connected to a collection.
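
A minimal sketch of such a trigger-like watcher built on Change Streams (this assumes a replica set, which Change Streams require, and the same hypothetical “users” collection as above):

mongo --quiet mydb --eval '
  // react to every new document, roughly what an AFTER INSERT trigger would do;
  // this loop blocks and runs until the shell is interrupted
  var cursor = db.users.watch([{ $match: { operationType: "insert" } }]);
  while (cursor.hasNext()) {
    var event = cursor.next();
    print("new user inserted: " + event.fullDocument.name);
  }
'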

If you have a lot of collections with referenced fields, you have to implement a lot of joins in your code, or you have to do a lot of checks to assure consistency. This is possible, but at a higher cost in terms of development. MongoDB could be the wrong choice in such a case.

MongoDB Replication and Sharding Are Easy to Deploy

MongoDB was not designed as a standalone application. It was designed instead to be a piece of a larger puzzle. A mongod server is able to work together with other mongod instances in order to implement replication and sharding efficiently, without the need for any additional third-party tool.

A Replica Set is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability by design. With caveats regarding potentially stale data, you also get read scalability for free. It should be the basis for all production deployments.

The Sharding Cluster is deployed as a group of several Replica Sets with the capability to split and distribute the data evenly on them. The Sharding Cluster provides write scalability in addition to redundancy, high availability, and read scalability. The sharding topology is suitable for the deployment of very large data sets. The number of shards you can add is, in theory, unlimited.

Both the topologies can be upgraded at any time by adding more servers and shards. More importantly, no changes are required for the application since each topology is completely transparent from the application perspective.

Finally, the deployment of such topologies is straightforward. You need to spend some time at the beginning to understand a few basic concepts, but then, in a matter of a few hours, you can deploy even a very large sharded cluster. With several servers, instead of doing everything manually, you can automate a lot of the work using Ansible playbooks or other similar tools.

Further readings:

Deploy a MongoDB Replica Set with Transport Encryption (Part 1)

MongoDB Sharding 101 Webinar

MongoDB Has Indexes and They Are Really Important

MongoDB allows you to create indexes on the JSON documents’ fields. Indexes are used the same way as in a relational database. They are useful to resolve queries faster and to decrease the usage of machine resources: memory, CPU time, and disk IOPS.

You should create all the indexes that will help any of the regularly executed queries, updates, or deletes from your application.

MongoDB has really advanced indexing capabilities. It provides TTL indexes, geospatial indexes, indexes on array elements, and partial and sparse indexes. If you need more details about the available index types, you can take a look at the following articles:

MongoDB Index Types and MongoDB explain() (part 1)

Using Partial and Sparse Indexes in MongoDB

Create all the indexes you need for your collections. They will help you a lot in improving the overall performance of your database.
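
A few hedged examples of these index types in the mongo shell; the “events” collection and its fields are hypothetical:

mongo --quiet mydb --eval '
  // TTL index: documents expire 24 hours after created_at
  db.events.createIndex({created_at: 1}, {expireAfterSeconds: 86400});
  // partial index: only index documents that actually carry an email field
  db.events.createIndex({email: 1}, {partialFilterExpression: {email: {$exists: true}}});
  // 2dsphere index for geospatial queries on a GeoJSON "location" field
  db.events.createIndex({location: "2dsphere"});
'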

MongoDB is Memory Intensive

MongoDB is memory intensive; it needs a lot. This is the same for many other databases. Memory is the most important resource, most of the time.

MongoDB uses RAM for caching the most frequently and recently accessed data and indexes. The larger this cache, the better the overall performance will be, because MongoDB is able to retrieve a lot of data faster. Also, by default, MongoDB writes are only committed to memory before client confirmation is returned. Writes to disk are done asynchronously: first to the journal file (typically within 50ms), and later into the normal data files (once per minute).

The storage engine used by MongoDB nowadays is WiredTiger. In the past there was MMAPv1, but it is no longer available in more recent versions. The WiredTiger storage engine uses a sizeable memory cache (the WiredTiger cache) for caching data and indexes.

Other than using the WTCache, MongoDB relies on the OS file system caches for accessing the disk pages. This is another important optimization, and significant memory may be required also for that.

In addition, MongoDB needs memory for managing other things like client connections, in-memory sorting, and temporary data when executing aggregation pipelines, among other minor items.

In the end, be prepared to provide enough memory to MongoDB.

But how much memory do I need? The rule of thumb is to evaluate the “working set” size.

The “working set” is the amount of data that is most frequently requested by your application. Usually, an application needs a limited amount of data; it doesn’t need to read the entire data set during normal operations. For example, in the case of time-series data, most probably you need to read only the last few hours’ or the last few days’ entries. Only on a few occasions will you need to read legacy data. In such a case, your working set is the amount of memory that can hold just those few days of data.

Let’s suppose your data set is 100GB and you evaluated your working set is around 20%, then you need to provide at least 20GB for the WTCache.

Since MongoDB uses by default 50% of the RAM for the WTCache (we usually suggest not to increase it significantly), you should provide around 40GB of memory in total for your server.

Every case is different, and sometimes it can be difficult to evaluate the working set size correctly. Anyway, the main recommendation is to spend a significant part of your budget on providing as much memory as you can. For sure, this will be beneficial for MongoDB.
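
Following the sizing example above, here is a hedged sketch of how the WiredTiger cache could be pinned explicitly; the data path is a placeholder, and the same value can also be set via storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf:

# ~20 GB working set on a ~40 GB RAM server: pin the WiredTiger cache at 20 GB
mongod --dbpath /var/lib/mongodb --wiredTigerCacheSizeGB 20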

What Are the Suitable Use Cases for MongoDB?

Actually, a lot. I have seen MongoDB deployed on a wide variety of environments.

For example, MongoDB is suitable for:

  • events logging
  • content management
  • gaming applications
  • payment applications
  • real-time analytics
  • Internet Of Things applications
  • content caching
  • time-series data applications

And many others.

We can say that you can use MongoDB for basically everything; it is a general-purpose database. The key point is instead the way you use it.

For example, if you plan to use MongoDB the same way as a relational database, with data normalized, a lot of collections around, and a myriad of joins to be managed by the application, then MongoDB is not the right choice for sure. Use a relational database.

The best way to use MongoDB is to adhere to a few best practices and to model the collections keeping in mind some basic rules, like embedding documents instead of creating multiple referenced collections.

Percona Server for MongoDB: The Enterprise-Class Open Source Alternative

Percona develops and deploys its own open source version of MongoDB: the Percona Server for MongoDB (PSMDB).

PSMDB is a drop-in replacement for MongoDB Community and it is 100% compatible. The great advantage provided by PSMDB is that you can get enterprise-class features for free, like:

  • encryption at rest
  • audit logging
  • LDAP authentication
  • LDAP authorization
  • log redaction
  • Kerberos authentication
  • hot backup
  • in-memory storage engine

Without PSMDB all these advanced features are available only in the MongoDB Enterprise subscription.

Please take a look at the following links for more details about PSMDB:

Percona Server for MongoDB Feature Comparison

Percona Server for MongoDB

Remember you can get in touch with Percona at any time for any details or for getting help.

Conclusion

Let’s have a look at the following list of the most important things you need to check before choosing MongoDB as the backend database for your applications. Three colored flags indicate whether MongoDB is a good choice: red means it’s not, orange means it could be a good choice but with some limitations or potential bottlenecks, and green means it’s a very good choice.

  • Your applications primarily deal with JSON documents
  • Your data has unpredictable and frequent schema changes over time
  • You have several collections with a lot of external references for assuring consistency, and the majority of the queries need joins
  • You need to replicate the stored procedures and triggers you have in your relational database
  • You need HA and read scalability
  • You need to scale your data to a very large size
  • You need to scale because of a huge amount of writes

 

And finally, remember the following:

Take a look at Percona Server for MongoDB 

Jan 13, 2021

Running Kubernetes on the Edge

What is the Edge?

Edge is a buzzword that, behind the curtain, means moving private or public clouds closer to the end devices. End devices, such as the Internet of Things (from a doorbell to a VoIP station), become more complex and require more computational power.  There is a constant growth of connected devices and by the end of 2025, there will be 41.6 billion of them, generating 69.4 Zettabytes of data.

Latency, data processing speed, or security concerns sometimes do not allow computation to happen in the cloud. Instead, businesses rely on edge computing or micro clouds, which can run closer to the end devices. All of this constitutes the Edge.

How Kubernetes Helps Here

Containers are portable and are quickly becoming the de facto standard for shipping software. Kubernetes is a container orchestrator with robust built-in scaling capabilities. Together they give businesses the perfect toolset to shape their Edge computing with ease and without changing existing processes.

The cloud-native landscape has various small Kubernetes distributions that were designed and built for the Edge: k3s, microk8s, minikube, k0s, and the newly released EKS Distro. They are lightweight, can be deployed with a few commands, and are fully conformant. Projects like KubeEdge bring even more simplicity and standardization into the Kubernetes ecosystem on the Edge.

Running Kubernetes on the Edge also poses the challenge of managing hundreds or thousands of clusters. Google Anthos, Azure Arc, and VMware Tanzu allow you to run your clusters anywhere and manage them through a single interface with ease.

Topologies

We are going to review various topologies that Kubernetes provides for the Edge to bring computation and software closer to the end devices.

The end device is a Kubernetes cluster


Some devices run complex software and require multiple components to operate – web servers, databases, built-in data-processing, etc. Using packages is an option, but compared to containers and automated orchestration, it is slow and sometimes turns the upgrade process into a nightmare. In such cases, it is possible to run a Kubernetes cluster on each end device and manage software and infrastructure components using well-known primitives.

The drawback of this solution is the overhead that comes from running etcd and the control plane components on every device.

The end device is a node


In this case, you can manage each end device through a single Kubernetes control plane. Deploying software to support and run your phones, printers or any other devices can be done through standard Kubernetes primitives.

Micro-clouds


This topology is all about moving computational power closer to the end devices by creating micro-clouds on the Edge. A micro-cloud is formed by Kubernetes nodes on a server farm on the customer’s premises. Running your AI/ML (like Kubeflow) or any other resource-heavy application in your own micro-cloud is done with Kubernetes and its primitives.

How Percona Addresses Edge Challenges

We at Percona continue to invest in the Kubernetes ecosystem and expand our partnership with the community. Our Kubernetes Operators for Percona XtraDB Cluster and MongoDB are open source and enable anyone to run production-ready MySQL and MongoDB databases on the Edge.

Check out how easy it is to deploy our operators on Minikube or EKS Distro (which is similar to microk8s). We are working on further simplifying Day 2 operations, and in future blog posts you will see how to deploy and manage databases on multiple Kubernetes clusters with KubeApps.
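
As a rough sketch of what such a deployment looks like on any conformant cluster (file names follow the operator repository layout at the time of writing and may differ in the version you use):

# clone the Percona XtraDB Cluster operator repository and apply its bundled CRDs, RBAC, and deployment
git clone https://github.com/percona/percona-xtradb-cluster-operator
cd percona-xtradb-cluster-operator
kubectl apply -f deploy/bundle.yaml
# then create the database cluster itself from the sample custom resource
kubectl apply -f deploy/cr.yaml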

Jan 12, 2021

MySQL Backup and Recovery Best Practices

In this blog, we will review all the backup and restore strategies for MySQL, the cornerstones of any application. There are a few options, depending on your topology, MySQL versions, etc. And based on that, there are some questions we need to ask ourselves to make sure we make the right choices.

How many backups do we need to keep safe, or what’s the best retention policy for us?

This means the number of backups to safeguard, whether local or remote (external fileserver, cloud). The retention policy can be daily, weekly, or monthly, depending on the free space available.

What is the Recovery Time Objective?

The Recovery Time Objective (RTO) refers to the amount of time that may pass during a disruption before it exceeds the maximum allowable threshold specified in the Business Continuity Plan.

The key question related to RTO is, “How quickly must the data on this system be restored?”

What is the Recovery Point Objective?

The Recovery Point Objective (RPO) is the maximum amount of data loss, measured in time, that a business process can tolerate after a disaster before it causes unacceptable consequences associated with a break in continuity.

The key question related to RPO is, “How much data can we lose?”

Different Types of Backups

There are two backup types: physical and logical.

  • Physical (Percona XtraBackup, RDS/LVM snapshots, MySQL Enterprise Backup); you can also use cp or rsync to copy the datadir, as long as MySQL is down/stopped.
  • Logical (mysqldump, mydumper, mysqlpump, and MySQL Shell, the latter for MySQL 8.0 only)

It is also recommended to keep a copy of the binlog files. Why? Because they allow us to recover up to the last transaction.

Why are backups needed?

Backups are needed in case of multiple problems:

  • Host Failure: we can run into problems with stalled or broken disks. Also, with cloud services, our DB instance can break and become inaccessible.
  • Corrupted Data: this can happen after a power outage, when MySQL wasn’t able to write correctly and close the file; sometimes MySQL cannot start again due to corrupted data and the crash recovery process cannot fix it.
  • Inconsistent Data: a human mistake that deletes/updates erroneous data on the primary or a replica node.
  • Data Center Failure: power outage or internet provider issues.
  • Legislation/Regulation: providing consistent business value and customer satisfaction.

Now let me explain the different types of backups mentioned above. But before I continue, note that it’s important to configure a new, dedicated replica node for backup purposes (AKA the backup server), because of the high CPU load backups cause, to avoid any issue on the other replica nodes.

Logical Backup

This is a dump of the logical database structure (CREATE DATABASE, CREATE TABLE statements) and content (INSERT statements). It is recommended for smaller amounts of data. The disadvantage of this method is that it is slower (both backup and restore) compared with physical backups. Using mydumper you can back up and restore a single database or a single table if needed, which is useful to copy some data to a different environment to run tests. Also, mydumper can take a consistent backup (as long as all the tables use the InnoDB engine) and provides accurate master and slave log positions.

The output is larger than for a physical backup, particularly when saved in text format, but it can be compressed on the fly depending on the software you are using: mydumper can compress its output, while mysqldump’s output needs to be piped to gzip, for example.

Logical backups are used to address data corruption or the need to restore a subset of tables.
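
A hedged sketch of both tools (the database name, credentials, and paths are placeholders, and compression flags vary between mydumper versions):

# consistent mysqldump of one database, compressed on the fly
mysqldump --single-transaction --routines --triggers mydb | gzip > /backups/mydb.sql.gz

# mydumper with built-in compression, one directory per backup
mydumper --database=mydb --outputdir=/backups/mydumper/$(date +%F) --compress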

Physical (Raw) Backup

In short, this consists of exact copies of the database directories and files. It can be a copy of all or part of the MySQL datadir. This kind of backup is most often used to restore or create a new replica node easily and quickly, and it addresses host failure. It’s recommended to restore using the same MySQL version. I recommend using Percona XtraBackup because it can include related files such as configuration (cnf) files.
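
A minimal sketch with Percona XtraBackup 8.0 (the target directory is a placeholder; credentials come from a configured login path or from command-line options):

# take a full physical backup, then prepare it so it is consistent and ready to restore
xtrabackup --backup --target-dir=/backups/full
xtrabackup --prepare --target-dir=/backups/full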

Snapshot Backups

Some file system implementations enable “snapshots” to be taken. These provide logical copies of the file system at a given point in time, without requiring a physical copy of the entire file system. MySQL itself does not provide the capability for taking file system snapshots but it is available using third-party solutions such as LVM or ZFS.

The disadvantage is that sometimes physical backups do not compress much, because data is usually in a binary format and sometimes the table is already compressed.

Binary Log Backups

Binlog backups specifically address RPO. Binary log files contain records of each SQL query executed that made changes.

From MySQL 5.6 on, you can use mysqlbinlog to stream binary logs from a remote server. You can combine binlog backups with Percona XtraBackup or mydumper backups to allow restoration up to the end of the most recently backed-up binary log.
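
A hedged sketch of that streaming setup (the host, user, and first binlog name are placeholders; run it from the backup server and keep the process supervised):

# continuously pull binlogs from the source and store them as raw files on the backup server
mysqlbinlog --read-from-remote-server --host=source-db --user=backup_user --password \
  --raw --stop-never mysql-bin.000001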

Incremental / Differential Backups

An incremental backup is a backup of everything that has changed since the last backup (a binary log backup is a special case of an incremental backup). This is a very good option if the dataset size is huge, as you can take a full backup at the beginning of the week and run incremental backups per day. Also, the backup size is smaller than the full backup.

The main risks associated with incremental backups are:

– A single corrupt incremental backup may invalidate all the others

– Incremental backups typically negatively affect the RTO

A differential backup copies the changes since your last full backup. The advantage is that a lot of data does not change from one backup to the next, so the result can be significantly smaller backups, which saves disk space.

Percona XtraBackup supports both incremental and differential backups.
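
A hedged sketch of a weekly full plus daily incremental cycle with Percona XtraBackup (directories are placeholders):

# Sunday: full backup
xtrabackup --backup --target-dir=/backups/full
# Monday: incremental against the full backup
xtrabackup --backup --target-dir=/backups/inc1 --incremental-basedir=/backups/full
# Tuesday: incremental against Monday's incremental
xtrabackup --backup --target-dir=/backups/inc2 --incremental-basedir=/backups/inc1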

Offsite Storage

It’s highly recommended to copy the backups, whatever method you use, to another place, like the cloud or an external file server, so that in case of host failure or data center failure you have another copy.

Not all the backup files need to be uploaded to the cloud; sometimes the time you would spend downloading them is longer than the time needed for the recovery process itself.

A good approach is to keep 1-7 days locally on the backup server in case a fast recovery is needed, and this depends on your business regulations.

Encryption

Backups contain sensitive data, so it’s highly recommended to encrypt them, especially for offsite storage. This adds more time when you need to restore a backup, but it keeps your data safe.

GPG is a good option to encrypt backups, and if you use it or some other alternative, don’t forget to keep a copy of the keys/passphrase. If you lose it, your backups will be useless.
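
A minimal sketch with GPG symmetric encryption (file names are placeholders; store the passphrase somewhere safe and separate from the backups):

# encrypt the compressed dump before shipping it offsite
gpg --symmetric --cipher-algo AES256 /backups/mydb.sql.gz
# decrypt when a restore is needed
gpg --output /backups/mydb.sql.gz --decrypt /backups/mydb.sql.gz.gpg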

Restore Testing

Depending on your business, it’s highly recommended to test your backups at least once per month. This validates that your backups are not corrupted and provides critical metrics on recovery time. The process should be automated: take the full backup, restore it, and finally configure the restored server as a replica of the current primary or of another replica. This also validates that the replication process has no errors.

Many customers are using this methodology to refresh their QA/STG environment to have fresh data from production backups.

In addition to the above, it is recommended to create a manual or automated restore documentation process to keep all the steps together, so in case of disaster, you can follow it without wasting time.

Retention Requirements

Last but not least, it is very important to keep multiple copies of different backup types.

Our best recommendation is:

  • One or two physical backups locally on the backup server (as long as space allows it).
  • Seven daily and four weekly logical backups locally on the backup server.
  • 30 days of binlog backups locally on the backup server.
  • For offsite backups (like S3, Google Cloud, etc.), keep monthly backups for one year or more.

For local backups, keep in mind you will need a minimum of 2.5 times the current dataset size as free disk space to meet these retention policies. And don’t forget to encrypt all the backup types!

Legal or regulatory requirements may also dictate how long data must be archived.

Percona Can Help

Percona can help you choose, implement, and optimize the most appropriate MySQL backup and recovery solution for your MySQL ecosystem. If your current solution unexpectedly fails, we can facilitate your recovery with onsite, remote, or emergency consulting services. We can also help you take steps to prevent another occurrence. Every situation is unique and we will work with you to create the most effective solution for your business.

Contact Us

Jan 11, 2021

A Snippet of Features: Reviewing PostgreSQL Version 13

This latest and greatest version of PostgreSQL, version 13, has been available since last summer and continues the tradition of gains in performance, administration, application development, and security enhancements.

Of the many new features and enhancements that have been made, some of the more notable ones include:

  • Space savings and performance gains from de-duplication of B-tree index entries
  • Improved performance for queries that use aggregates or partitioned tables
  • Better query planning when using extended statistics
  • Parallelized vacuuming of indexes
  • Incremental sorting

TIP: More detailed information can be found in the Release Notes here.

I learned long ago that it can be something of a challenge to keep up to date with all this new stuff. Therefore, I’d like to cover a small subset of these new advances by demonstrating simple use cases, which I hope you will find constructive, related to pgbench, logical replication, streaming replication fail-over, and re-provisioning a deprecated PRIMARY as a STANDBY using pg_rewind.

About pgbench

As you may recall, pgbench is a simple program for running benchmark tests on PostgreSQL by executing the same sequence of SQL commands over and over in multiple concurrent database sessions. As the name implies, pgbench finds its most common use as a benchmarking tool, testing hardware and configuration runtime parameters for the purposes of performance tuning.

The latest iteration of pgbench has a number of new capabilities and includes:

  • A new set of initialization options, ‘g’ and ‘G’, chooses data generation either on the client or on the server side. Whereas previous versions generated data on the pgbench client and then sent it to the server, one now has the ability to test data generation exclusively on the server itself without being affected by network latency.
  • The ability to create a partitioned “accounts” table using either range or hash partitioning, although the default is range partitioning.
  • A new option, ‘--show-script’, which echoes the actual code of any built-in script name on stderr (see the short examples after this list).
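
Hedged examples of these options, using the built-in script names and init steps documented for PostgreSQL 13 (connection settings are taken from the environment, as elsewhere in this post):

# print the code of the built-in tpcb-like script on stderr and exit
pgbench --show-script=tpcb-like

# initialize with server-side data generation ('G') and a hash-partitioned accounts table
pgbench -i -I dtGvp --partition-method=hash --partitions=5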

Example 1: Hash and Range Partitioning

The first step is, of course, to initialize the benchmarking; in this case, we start out using hash partitioning:

# hash partitioning table public.pgbench_accounts
pgbench -i --partition-method=hash --partitions=5

...
creating tables...
creating 5 partitions...
generating data (client-side)...
100000 of 100000 tuples (100%) done (elapsed 0.04 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 0.79 s (drop tables 0.02 s, create tables 0.17 s, client-side generate 0.13 s, vacuum 0.26 s, primary keys 0.22 s).

Here’s what it should look like:

Schema |        Name        |       Type        |  Owner   | Persistence |  Size   
--------+--------------------+-------------------+----------+-------------+---------+
 public | pgbench_accounts   | partitioned table | postgres | permanent   | 0 bytes 
 public | pgbench_accounts_1 | table             | postgres | permanent   | 2656 kB 
 public | pgbench_accounts_2 | table             | postgres | permanent   | 2656 kB 
 public | pgbench_accounts_3 | table             | postgres | permanent   | 2656 kB 
 public | pgbench_accounts_4 | table             | postgres | permanent   | 2656 kB 
 public | pgbench_accounts_5 | table             | postgres | permanent   | 2656 kB 
 public | pgbench_branches   | table             | postgres | permanent   | 40 kB   
 public | pgbench_history    | table             | postgres | permanent   | 0 bytes 
 public | pgbench_tellers    | table             | postgres | permanent   | 40 kB

And here’s the partitioned pgbench_accounts table definition:

Partitioned table "public.pgbench_accounts"
  Column  |     Type      | Collation | Nullable | Default | Storage  | Stats target  Description
----------+---------------+-----------+----------+---------+----------+--------------
 aid      | integer       |           | not null |         | plain    |              
 bid      | integer       |           |          |         | plain    |              
 abalance | integer       |           |          |         | plain    |              
 filler   | character(84) |           |          |         | extended |              
Partition key: HASH (aid)
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
Foreign-key constraints:
    "pgbench_accounts_bid_fkey" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
Referenced by:
    TABLE "pgbench_history" CONSTRAINT "pgbench_history_aid_fkey" FOREIGN KEY (aid) REFERENCES pgbench_accounts(aid)
Partitions: pgbench_accounts_1 FOR VALUES WITH (modulus 5, remainder 0),
            pgbench_accounts_2 FOR VALUES WITH (modulus 5, remainder 1),
            pgbench_accounts_3 FOR VALUES WITH (modulus 5, remainder 2),
            pgbench_accounts_4 FOR VALUES WITH (modulus 5, remainder 3),
            pgbench_accounts_5 FOR VALUES WITH (modulus 5, remainder 4)

 

Alternatively, we can just as easily partition using range partitioning:

# range partitioning public.pgbench_accounts
pgbench -i --partition-method=range --partitions=5

Partitioned table "public.pgbench_accounts"
  Column  |     Type      | Collation | Nullable | Default | Storage  | Stats target | Description
----------+---------------+-----------+----------+---------+----------+--------------+
 aid      | integer       |           | not null |         | plain    |              |
 bid      | integer       |           |          |         | plain    |              |
 abalance | integer       |           |          |         | plain    |              |
 filler   | character(84) |           |          |         | extended |              |
Partition key: RANGE (aid)
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
Partitions: pgbench_accounts_1 FOR VALUES FROM (MINVALUE) TO (20001),
            pgbench_accounts_2 FOR VALUES FROM (20001) TO (40001),
            pgbench_accounts_3 FOR VALUES FROM (40001) TO (60001),
            pgbench_accounts_4 FOR VALUES FROM (60001) TO (80001),
            pgbench_accounts_5 FOR VALUES FROM (80001) TO (MAXVALUE)

For those curious to see the actual commands creating the partitions, just update the runtime parameter and look in your postgres log:

postgres=# alter system set log_statement = 'ddl';
ALTER SYSTEM
postgres=# select pg_reload_conf();
pg_reload_conf
----------------
t

postgres=# show log_statement;
log_statement
---------------
ddl

 

LOG:  statement: drop table if exists pgbench_accounts, pgbench_branches, pgbench_history, pgbench_tellers
LOG:  statement: create table pgbench_history(tid int,bid int,aid    int,delta int,mtime timestamp,filler char(22))
LOG:  statement: create table pgbench_tellers(tid int not null,bid int,tbalance int,filler char(84)) with (fillfactor=100)
LOG:  statement: create table pgbench_accounts(aid    int not null,bid int,abalance int,filler char(84)) partition by range (aid)
LOG:  statement: create table pgbench_branches(bid int not null,bbalance int,filler char(88)) with (fillfactor=100)
LOG:  statement: create table pgbench_accounts_1
         partition of pgbench_accounts
         for values from (minvalue) to (20001) with (fillfactor=100)
LOG:  statement: create table pgbench_accounts_2
         partition of pgbench_accounts
         for values from (20001) to (40001) with (fillfactor=100)
LOG:  statement: create table pgbench_accounts_3
         partition of pgbench_accounts
         for values from (40001) to (60001) with (fillfactor=100)
LOG:  statement: create table pgbench_accounts_4
         partition of pgbench_accounts
         for values from (60001) to (80001) with (fillfactor=100)
LOG:  statement: create table pgbench_accounts_5
         partition of pgbench_accounts
         for values from (80001) to (maxvalue) with (fillfactor=100)
LOG:  statement: alter table pgbench_branches add primary key (bid)
LOG:  statement: alter table pgbench_tellers add primary key (tid)
LOG:  statement: alter table pgbench_accounts add primary key (aid)

 

Example 2: Benchmarking

Part I

Let’s initialize a database with the following conditions:

  • install pg_stat_statements to see our results
  • initialize benchmarking
  • generate data solely on the server-side
  • create primary keys
  • create foreign key constraints
  • range partition table “accounts” creating five child tables

Open postgresql.conf and edit the following runtime parameters. Restart the server and add the extension to any database using the command CREATE EXTENSION pg_stat_statements.

# pg_conftool 13 main postgresql.conf edit
    shared_preload_libraries = 'pg_stat_statements'
    pg_stat_statements.max = 10000
    pg_stat_statements.track = all

systemctl restart postgresql@13-main

Initialize an empty database:

export PGHOST=pg1-POC13 PGPORT=5432 PGDATABASE=db01 PGUSER=postgres PGPASSWORD=mypassword
createdb db01
pgbench -i -s 10 -I dtGvpf -F 90 --partition-method=hash --partitions=5

This query should return a nice summary of the commands thus far executed:

SELECT query, calls, total_exec_time::int, rows, (100.0 * shared_blks_hit /
       nullif(shared_blks_hit + shared_blks_read, 0))::int AS hit_percent
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

 

-[ RECORD 1 ]                ----------------+------------------------------------------------------------------------------------
query           | SELECT abalance FROM pgbench_accounts WHERE aid = $1
calls           | 1560162
total_exec_time | 73958
rows            | 1560162
hit_percent     | 94
-[ RECORD 2 ]
----------------+------------------------------------------------------------------------------------
query           | insert into pgbench_accounts(aid,bid,abalance,filler) select aid, (aid - $1) / $2 + $3, $4, $5 from generate_series($7, $8) as aid
calls           | 1
total_exec_time | 2250
rows            | 1000000
hit_percent     | 100
-[ RECORD 3 ]
----------------+------------------------------------------------------------------------------------
query           | create database db03
calls           | 1
total_exec_time | 2092
rows            | 0
hit_percent     | 90
-[ RECORD 4 ]
----------------+------------------------------------------------------------------------------------
query           | vacuum analyze pgbench_accounts
calls           | 1
total_exec_time | 1591
rows            | 0
hit_percent     | 92
-[ RECORD 5 ]
----------------+------------------------------------------------------------------------------------
query           | alter table pgbench_accounts add primary key (aid)
calls           | 1
total_exec_time | 1086
rows            | 0
hit_percent     | 59

Part II

Execute the benchmark for a duration of five minutes. To make things clearer, you can reset the stats in the pg_stat_statements view first.

# clear the stats before starting the benchmarking
#    SELECT pg_stat_statements_reset();
pgbench -c 40 -j 7 -T 300 -b tpcb-like db01 -P 60

New SQL statements representing the DML operations are now listed:

SELECT query, calls, total_exec_time::int, rows, (100.0 * shared_blks_hit /
       nullif(shared_blks_hit + shared_blks_read, 0))::int AS hit_percent
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

 

-[ RECORD 1 ]
---+-----------------------------------------------------------------------------------------------------------
query           | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2
calls           | 42767
total_exec_time | 6203809
rows            | 42767
hit_percent     | 100
-[ RECORD 2 ]
---+-----------------------------------------------------------------------------------------------------------
query           | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2
calls           | 42767
total_exec_time | 3146175
rows            | 42767
hit_percent     | 100
-[ RECORD 3 ]
---+-----------------------------------------------------------------------------------------------------------
query           | UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2
calls           | 42767
total_exec_time | 28281
rows            | 42767
hit_percent     | 95
-[ RECORD 4 ]
---+-----------------------------------------------------------------------------------------------------------
query           | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP)
calls           | 42767
total_exec_time | 22797
rows            | 42767
hit_percent     | 100
-[ RECORD 5 ]
---+-----------------------------------------------------------------------------------------------------------
query           | SELECT $2 FROM ONLY "public"."pgbench_branches" x WHERE "bid" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
calls           | 42767
total_exec_time | 2347
rows            | 42767
hit_percent     | 100

Notice how the child tables have grown in size:

List of relations
 Schema |        Name        |       Type        |  Owner   | Persistence |  Size   | Description
--------+--------------------+-------------------+----------+-------------+---------+
 public | pgbench_accounts   | partitioned table | postgres | permanent   | 0 bytes |
 public | pgbench_accounts_1 | table             | postgres | permanent   | 28 MB   |
 public | pgbench_accounts_2 | table             | postgres | permanent   | 29 MB   |
 public | pgbench_accounts_3 | table             | postgres | permanent   | 28 MB   |
 public | pgbench_accounts_4 | table             | postgres | permanent   | 28 MB   |
 public | pgbench_accounts_5 | table             | postgres | permanent   | 28 MB   |
 public | pgbench_branches   | table             | postgres | permanent   | 168 kB  |
 public | pgbench_history    | table             | postgres | permanent   | 2384 kB |
 public | pgbench_tellers    | table             | postgres | permanent   | 272 kB  |

About Logical Replication

We now explore another very useful addition in PostgreSQL version 13: the transparent replication of partitioned tables via logical decoding. It is no longer necessary to spend valuable time manually adding the triggers and supplementary instructions needed to replicate the child tables.

Example

Using the aforementioned pgbench example, execute the following. Note that you must first add a primary key to the pgbench_history table; otherwise, replication is not possible for it:

# UPDATE TABLE, ADD PK
alter table public.pgbench_history add primary key (tid,bid,aid,mtime);

# CREATE DATABASE ON SUBSCRIBER NODE
createdb -h pg4-POC13 -U postgres db01

# COPY DATABASE SCHEMA
pg_dump -h pg1-POC13 -U postgres -s db01 | psql 'host=pg4-POC13 user=postgres dbname=db01'

#
# PROVIDER pg1-POC13: DB01
#
psql 'host=pg1-POC13 user=postgres password=mypassword dbname=db01' <<_eof_
   \set ON_ERROR_STOP
    create publication publication1 for all tables;
_eof_

#
# SUBSCRIBER pg4-POC13: DB01
#
psql 'host=pg4-POC13 user=postgres password=mypassword dbname=db01' <<_eof_
   \set ON_ERROR_STOP
    create subscription subscript_set1
        connection 'host=pg1-POC13 dbname=db01 user=postgres password=mypassword'
        publication publication1
        with (copy_data=true, create_slot=true, enabled=true, slot_name=myslot1);
_eof_

And here we see the child accounts tables have been correctly replicated:

List of relations
 Schema |        Name        |       Type        |  Owner   | Persistence |    Size    | Description
--------+--------------------+-------------------+----------+-------------+------------+
 public | pgbench_accounts   | partitioned table | postgres | permanent   | 0 bytes    |
 public | pgbench_accounts_1 | table             | postgres | permanent   | 28 MB      |
 public | pgbench_accounts_2 | table             | postgres | permanent   | 29 MB      |
 public | pgbench_accounts_3 | table             | postgres | permanent   | 28 MB      |
 public | pgbench_accounts_4 | table             | postgres | permanent   | 28 MB      |
 public | pgbench_accounts_5 | table             | postgres | permanent   | 28 MB      |
 public | pgbench_branches   | table             | postgres | permanent   | 8192 bytes |
 public | pgbench_history    | table             | postgres | permanent   | 2352 kB    |
 public | pgbench_tellers    | table             | postgres | permanent   | 8192 bytes |
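
To confirm the initial table copy and the ongoing streaming on the subscriber, a minimal check is to query the standard catalog views (run on pg4-POC13 against db01):

-- overall subscription activity
select * from pg_stat_subscription;
-- per-table synchronization state: 'r' means the initial copy is done and the table is ready
select srrelid::regclass, srsubstate from pg_subscription_rel;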

About Streaming Replication

Because of the large number of new features added to streaming replication, I’m limiting the focus on two enhancements:

  • Restarting a STANDBY in order to point it to a newly promoted PRIMARY, after updating primary_conninfo and primary_slot_name, is no longer necessary as a reload will suffice.
  • pg_rewind has a new option, --write-recovery-conf, that simplifies the steps of reprovisioning a failed PRIMARY as a viable STANDBY, similar to the option of the same name in pg_basebackup.

Recall the three-node streaming replication cluster we’ve been using:

pg1-POC13: PRIMARY (read-write)
pg2-POC13: REPLICA (read-only, streaming)
pg3-POC13: REPLICA (read-only, streaming)

Example 1: STANDBY, pg3, Points to New PRIMARY pg2

Suppose a failover promotes pg2. This new feature adds robustness to the entire database cluster. REPLICA pg3 continues service without interruption as it is redirected from pg1 to pg2. Previous versions of postgres required host pg3 to be restarted in order to effect new primary_conninfo parameters.

Step 1

pg3-POC13; run a benchmark of SELECTS on host pg3

pgbench -h pg3-POC13 -U postgres -c 40 -j 7 -T 300 -b select-only db01 -P 5

Step 2

pg2-POC13; promote the host while the benchmarking on pg3 is active

-- create a new slot for pg3
select * from pg_create_physical_replication_slot('pg3');
-- confirm slot is inactive
select * from pg_get_replication_slots();
-- promote host
select pg_promote();
-- confirm read-write state
select pg_is_in_recovery();

TIP: it is assumed that the runtime parameter wal_keep_size is non-zero and set high enough to retain the WALs that pg3 needs while it is redirected away from pg1 towards pg2.
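
For example, a quick way to verify the setting on the new primary and, if needed, raise it (the 1GB value below is only an illustrative assumption; size it for your own workload):

-- on pg2-POC13
show wal_keep_size;
alter system set wal_keep_size = '1GB';
select pg_reload_conf();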

Step 3

pg3-POC13;  point to pg2-POC13

-- make the updates
alter system set primary_conninfo = 'user=postgres host=10.231.38.112';
alter system set primary_slot_name = 'pg3';
select pg_reload_conf();
-- confirm replication is active
select * from pg_stat_wal_receiver;

Step 4

pg2-POC13; validate replication from pg2->pg3

-- confirm slot is active
select * from pg_get_replication_slots();
-- confirm replication is active
select * from pg_stat_replication;

Example 2

When re-provisioning a failed PRIMARY, i.e. pg1, as a new STANDBY, it used to be necessary to edit the requisite recovery runtime configuration parameters after executing pg_rewind but before starting the host.

PostgreSQL version 13 now simplifies the exercise by providing the switch --write-recovery-conf, which updates postgresql.auto.conf with the correct runtime parameters.

TIP: pg_rewind requires the runtime parameter wal_log_hints=on before re-provisioning takes place on the failed PRIMARY.
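
A minimal way to confirm the settings (run while the server is still up; if wal_log_hints is off and data checksums are not enabled, pg_rewind will refuse to run):

show wal_log_hints;
show data_checksums;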

Step 1

pg2-POC13; add a slot on PRIMARY pg2 for the new STANDBY pg1

-- as postgres create a new slot
select * from pg_create_physical_replication_slot('pg1');
select * from pg_get_replication_slots();

Step 2

pg1-POC13;  re-provision pg1

# as postgres, perform the rewind
/usr/lib/postgresql/13/bin/pg_rewind \
    --target-pgdata=/var/lib/postgresql/13/main \
    --source-server="host=pg2-POC13 port=5432 user=postgres dbname=postgres " \
    --write-recovery-conf \
    --progress

You should get messaging similar to the following:

pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/6CDCEA88 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/6CDCE9D8 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 206 MB (total source directory size is 422 MB)
211594/211594 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

Now you can restart host pg1-POC13 and bring it back into service:

# as root, restart the server
systemctl start postgresql@13-main

Login pg1 and confirm replication:

# confirm replication is active
select * from pg_stat_wal_receiver;

Caveat

In the case of failure, check the following:

  • pg2-POC13: As the PRIMARY, check that the replication slot is installed and ready before starting up the reprovisioned pg1 as a new standby (see the query sketch after this list)
  • pg1-POC13:
    • check whether the server started up, i.e. ‘netstat -tlnp’
    • check postgresql.auto.conf if you get a FATAL message saying the server is starting up
    • check the slot name
    • check that the connection info is correct
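
A minimal pair of checks for those caveats (pg_get_replication_slots() is the same function used earlier; pg_settings is the standard catalog view):

-- on pg2-POC13 (PRIMARY): confirm the 'pg1' slot exists and becomes active once pg1 connects
select * from pg_get_replication_slots();

-- on pg1-POC13 (new STANDBY): confirm the settings written by pg_rewind --write-recovery-conf
select name, setting from pg_settings
 where name in ('primary_conninfo', 'primary_slot_name');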

Conclusion

Although not germane to features and capabilities, I was a little curious to see the differences between PostgreSQL 13 and earlier versions. Here's a table of metrics for creating a data cluster, creating a database, and initializing pgbench. It's interesting to see how the timings have improved over previous versions:

ver   initdb time   datacluster size (bytes)   binaries size (bytes)   createdb time   pgbench -i time
9.6   0m0.889s      38,877,134                 38,844,934              0m0.311s        0m0.236s
10    0m0.729s      39,598,542                 42,054,339              0m0.725s        0m0.240s
11    0m0.759s      40,844,747                 41,336,566              0m0.683s        0m0.212s
12    0m0.592s      41,560,196                 43,853,282              0m0.179s        0m0.213s
13    0m0.502s      41,266,877                 65,652,665              0m0.188s        0m0.168s

Frankly, I find it amazing how the size of both binaries and the data cluster has remained so compact over the years. I remember, way back in 2000, I was working with a proprietary RDBMS that upon initialization created an empty instance of 1GB in size – I wonder how big it gets these days??

Happy Hacking!

References:

PostgreSQL 13 Released!

PostgreSQL 13 Documentation

PostgreSQL Versioning Policy



Jan
08
2021
--

Streaming MySQL Backups with Percona XtraBackup – Another Alternative

Today we are going to present you with another way to make a database copy between servers using our Percona XtraBackup utility. What is the difference from the many others already on the web? Not many, but the differences are interesting in terms of performance and availability.

We will combine the xbstream utility with the power of pigz and socat to take advantage of multi-processing when several processors are available and, at the same time, reduce network bandwidth usage in cases where the network is the bottleneck. Let's explain each component:

socat: This stands for SOcket CAT. It is a utility for data transfer between two addresses.

What makes socat so versatile is the fact that an address can represent a network socket, any file descriptor, a Unix domain datagram or stream socket, TCP and UDP (over both IPv4 and IPv6), SOCKS 4/4a over IPv4/IPv6, SCTP, PTY, datagram and stream sockets, named and unnamed pipes, raw IP sockets, OpenSSL, or on Linux even any arbitrary network device.

pigz: This stands for parallel implementation of gzip, and it is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data.

xbstream: the Percona XtraBackup streaming utility, used here with parallelism to process several files in parallel.

Packages required: pigz, socat, and, of course, Percona XtraBackup
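
As a quick sanity check before streaming a full backup, you can verify that the two hosts can reach each other on the chosen port and that pigz round-trips cleanly (port 4444 and the hostname "Target" are simply the values used in the steps below; adjust them to your environment):

# on Target: listen once and write whatever arrives to a test file
Target # socat -u TCP-LISTEN:4444,reuseaddr stdio > /tmp/socat_test.gz

# on Source: compress a short message and send it across
Source # echo "hello from source" | pigz -c | socat -u stdio TCP:Target:4444

# back on Target: decompress the test file; you should see the message
Target # pigz -dc /tmp/socat_test.gz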

Servers Involved in the Task

Source: the source database where the backup will be taken (MySQL 5.7 installed on CentOS 7.8)

Target: the destination where the backup will be sent (MySQL 5.7 installed on CentOS 7.8)

Steps

1. Install the required packages in the Source and Target nodes:

Source # yum install -y pigz socat
Target # yum install -y pigz socat

If you haven’t installed Percona XtraBackup (in both servers), please follow the instructions explained in https://www.percona.com/doc/percona-xtrabackup/2.4/index.html#installation

Make sure you have a user with the proper privileges to take the backup on the Source db:

+---------------------------------------------------------------------------+
| Grants for bkpuser@localhost                                              |
+---------------------------------------------------------------------------+
| GRANT RELOAD, PROCESS, REPLICATION CLIENT ON *.* TO 'bkpuser'@'localhost' |
+---------------------------------------------------------------------------+
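
If that user does not exist yet, a minimal sketch to create it on the Source (the password matches the one used in the xtrabackup command later in this post; substitute your own):

mysql> CREATE USER 'bkpuser'@'localhost' IDENTIFIED BY 'Bkpuser123!';
mysql> GRANT RELOAD, PROCESS, REPLICATION CLIENT ON *.* TO 'bkpuser'@'localhost';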

2. On Target, execute:

Stop current database service (if any):

Target # systemctl stop mysqld

Delete datadir contents (assuming it was installed with default settings) and make sure you’re logged in to the Target server!!

Target # rm -rf /var/lib/mysql/*

Finally, we’re going to execute the command to receive the backup from the source (Source):

Target # socat -u TCP-LISTEN:4444,reuseaddr stdio | pigz -dc -p 4 - | xbstream --parallel=4 -x -C /var/lib/mysql

3. On Source, execute the command to send the backup to target (Target).

Source # xtrabackup --defaults-file=/etc/my.cnf --backup --user=bkpuser --password=Bkpuser123! --stream=xbstream --parallel 4 --no-timestamp  --target-dir=/tmp | pigz -k -1 -p4 - | socat -u stdio TCP:Target:4444

You will see an output like this:

xtrabackup: recognized server arguments: --datadir=/var/lib/mysql --server-id=1 --log_bin=/var/lib/mysql/mysql-bin --innodb_log_file_size=200M --innodb_log_files_in_group=2 --open_files_limit=65535 --parallel=4
xtrabackup: recognized client arguments: --backup=1 --user=bkpuser --password=* --stream=xbstream --target-dir=/tmp
200822 11:10:16  version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup' as 'bkpuser'  (using password: YES).
200822 11:10:16  version_check Connected to MySQL server
200822 11:10:16  version_check Executing a version check against the server...
200822 11:10:16  version_check Done.
200822 11:10:16 Connecting to MySQL server host: localhost, user: bkpuser, password: set, port: not set, socket: not set
Using server version 5.7.30-log
xtrabackup version 2.4.20 based on MySQL server 5.7.26 Linux (x86_64) (revision id: c8b4056)
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 65535, set to 65535
xtrabackup: using the following InnoDB configuration:
xtrabackup:   innodb_data_home_dir = .
xtrabackup:   innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup:   innodb_log_group_home_dir = ./
xtrabackup:   innodb_log_files_in_group = 2
xtrabackup:   innodb_log_file_size = 209715200
InnoDB: Number of pools: 1
200822 11:10:16 >> log scanned up to (6724690490)
xtrabackup: Generating a list of tablespaces
InnoDB: Allocated tablespace ID 2 for mysql/plugin, old maximum was 0
xtrabackup: Starting 4 threads for parallel data files transfer
200822 11:10:16 [01] Streaming ./ibdata1
...etc

4. Once finished with step 3, you will see an output like this on the Target node:

...
MySQL binlog position: filename 'mysql-bin.000091', position '102205647'
200822 11:10:21 [00] Streaming <STDOUT>
200822 11:10:21 [00]        ...done
200822 11:10:21 [00] Streaming <STDOUT>
200822 11:10:21 [00]        ...done
xtrabackup: Transaction log of lsn (4308505553) to (4308505562) was copied.
200822 11:10:21 completed OK!

At that point, step 2 will have finished too, so you must execute the following commands on the Target node:

Target # xtrabackup --prepare --use-memory=1G --target-dir=/var/lib/mysql/

From the documentation:

After you made a backup with the xtrabackup –backup option, you’ll first need to prepare it in order to restore it. Data files are not point-in-time consistent until they’ve been prepared, because they were copied at different times as the program ran, and they might have been changed while this was happening. If you try to start InnoDB with these data files, it will detect corruption and crash itself to prevent you from running on damaged data. The xtrabackup –prepare step makes the files perfectly consistent at a single instant in time, so you can run InnoDB on them.

Target # chown -R mysql:mysql  /var/lib/mysql/*
Target # systemctl start mysqld

And you have cloned a new database!

You can, of course, set up this new database as a replica with these additional steps executed in the Target node:

Look at the content of the file xtrabackup_binlog_info; it will be something like:

Target # cat /var/lib/mysql/xtrabackup_binlog_info

mysql-bin.000091 102205647

(We're assuming the following user/grants already exist in the Source db; if not, create them as shown in the sketch after this listing)

mysql> show grants for replicator;
+----------------------------------------------------+
| Grants for replicator@%                            |
+----------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%' |
+----------------------------------------------------+
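
If it does not exist, a minimal sketch to create it on the Source (the password is the one used in the CHANGE MASTER statement below; substitute your own):

mysql> CREATE USER 'replicator'@'%' IDENTIFIED BY 'R3pl1c4t10n!';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';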

Connect to the database, and run:

Target # mysql -u root -p

mysql> change master to master_host='Source',master_port=3306,master_user='replicator',master_password='R3pl1c4t10n!',master_log_file='mysql-bin.000091',master_log_pos=102205647;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> pager egrep -i "Master_Host|Master_User|Master_Port|file|behind"

mysql> show slave status\G
                  Master_Host: master
                  Master_User: replicator
                  Master_Port: 3306
              Master_Log_File: mysql-bin.000091
               Relay_Log_File: relay.000001
        Relay_Master_Log_File: mysql-bin.000091
               Until_Log_File:
           Master_SSL_CA_File:
        Seconds_Behind_Master: 0
             Master_Info_File: mysql.slave_master_info

1 row in set (0.00 sec)

And that’s all.  Happy streaming!

Jan
08
2021
--

Updated MySQL OSMetrics Plugins

It has been some time since I have posted updates to my plugins. After the initial version, I decided to split the plugins into categories of metrics. This will allow users to choose whether they want to install all of the plugins or only the ones they care about.

Since the installation process is unfamiliar to many users, I also expanded the instructions to make them a little easier to follow. Moreover, I added a Makefile.

I have also reformatted the output of some plugins to be either horizontal or vertical in orientation.  There is still more work to do in this area as well.

Where to Get The MySQL Plugins

You can get the plugins from GitHub at https://github.com/toritejutsu/osmetrics but they will have to be compiled from source. As mentioned above, you can choose whether to install all of them or only selected ones. If you are interested, feel free to contribute to the code; more input and real-world usage would make the plugins much more useful.

What Are The Plugins?

These are a collection of MySQL plugins for displaying Operating System metrics in INFORMATION_SCHEMA. This would allow monitoring tools, such as Percona Monitoring and Management (PMM), to retrieve these values remotely via the MySQL interface.

Values are pulled via standard C library calls, and some are read from the /proc filesystem, so overhead is minimal. I originally added a couple of libraries to show that even Windows and other variants of UNIX could be supported, but commented them out to keep things simple for now.

Many variables were added to show what was possible. Some of these may not be of interest. I just wanted to see what kind of stuff was possible and would tweak these over time.

Also, none of the calculations were rounded. This was done just to keep precision for the graphing of values but could easily be changed later. If there is interest, this could be expanded to add more metrics and unnecessary ones removed. Just looking for feedback.

Keep in mind that my C programming skills are rusty and I am sure the code could be cleaned up.

Make sure you have the source code for MySQL and have done a cmake on it. This will be necessary to compile the plugin as well as some SQL install scripts.

Preparing The Environment

Below is the way that I compiled the plugin. You will obviously need to make changes to match your environment. You will also need to have the Percona Server for MySQL source code on your server:

wget https://www.percona.com/downloads/Percona-Server-5.7/Percona-Server-5.7.17-13/source/tarball/percona-server-5.7.17-13.tar.gz

Uncompress the file and go into the directory:

tar -zxvf percona-server-5.7.17-13.tar.gz
cd percona-server-5.7.17-13

I also had to add a few utilities:

sudo yum install cmake
sudo yum install boost
sudo yum install ncurses-devel
sudo yum install readline-devel
cmake -DDOWNLOAD_BOOST=1 -DWITH_BOOST=..

Compiling The Plugins

First, you will need to put the plugin code in the plugin directory of the source code you downloaded. For me, this was “/home/ec2-user/percona-server-5.7.17-13/plugin” and I named the directory “osmetrics”. Of course, you can just use “git” to retrieve it to your server, or download it as a zip file and decompress it (see the sketch below). Just make sure it is placed into the “plugin” directory of the source code as noted above.
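
For instance, a minimal sketch assuming the source tree location mentioned above:

# place the plugin source inside the server source tree's plugin directory
cd /home/ec2-user/percona-server-5.7.17-13/plugin
git clone https://github.com/toritejutsu/osmetrics osmetrics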

Next, you will need to know where your MySQL plugin directory is located. You can query that with the following SQL:

mysql> SHOW GLOBAL VARIABLES LIKE "%plugin_dir%";
+---------------+-------------------------+
| Variable_name | Value                   |
+---------------+-------------------------+
| plugin_dir    | /jet/var/mysqld/plugin/ |
+---------------+-------------------------+
1 row in set (0.01 sec)

You will then need to edit the Makefile and define this path there. Once that is complete, you can compile the plugins:

make clean
make
make install

Installing The Plugins

Finally, you can log in to MySQL and activate the plugins:

mysql> INSTALL PLUGIN OS_CPU SONAME 'osmetrics-cpu.so';
mysql> INSTALL PLUGIN OS_CPUGOVERNOR SONAME 'osmetrics-cpugovernor.so';
mysql> INSTALL PLUGIN OS_CPUINFO SONAME 'osmetrics-cpuinfo.so';
mysql> INSTALL PLUGIN OS_IOSCHEDULER SONAME 'osmetrics-ioscheduler.so';
mysql> INSTALL PLUGIN OS_DISKSTATS SONAME 'osmetrics-diskstats.so';
mysql> INSTALL PLUGIN OS_LOADAVG SONAME 'osmetrics-loadavg.so';
mysql> INSTALL PLUGIN OS_MEMINFO SONAME 'osmetrics-meminfo.so';
mysql> INSTALL PLUGIN OS_MEMORY SONAME 'osmetrics-memory.so';
mysql> INSTALL PLUGIN OS_MISC SONAME 'osmetrics-misc.so';
mysql> INSTALL PLUGIN OS_MOUNTS SONAME 'osmetrics-mounts.so';
mysql> INSTALL PLUGIN OS_NETWORK SONAME 'osmetrics-network.so';
mysql> INSTALL PLUGIN OS_STAT SONAME 'osmetrics-stat.so';
mysql> INSTALL PLUGIN OS_SWAPINFO SONAME 'osmetrics-swapinfo.so';
mysql> INSTALL PLUGIN OS_VERSION SONAME 'osmetrics-version.so';    
mysql> INSTALL PLUGIN OS_VMSTAT SONAME 'osmetrics-vmstat.so';

Alternatively, you can run the install SQL script:

mysql> SOURCE /path/to/install_plugins.sql

Verify Installation

If all went well, you should see several new plugins available. Just make sure the status is “ACTIVE.”

mysql> SHOW PLUGINS;
+-----------------------------+----------+--------------------+----------------------------+---------+
| Name                        | Status   | Type               | Library                    | License |
+-----------------------------+----------+--------------------+----------------------------+---------+
...
| OS_CPU                      | ACTIVE   | INFORMATION SCHEMA | osmetrics-cpu.so           | GPL     |
| OS_GOVERNOR                 | ACTIVE   | INFORMATION SCHEMA | osmetrics-cpugovernor.so   | GPL     |
| OS_CPUINFO                  | ACTIVE   | INFORMATION SCHEMA | osmetrics-cpuinfo.so       | GPL     |
| OS_DISKSTATS                | ACTIVE   | INFORMATION SCHEMA | osmetrics-diskstats.so     | GPL     |
| OS_IOSCHEDULER              | ACTIVE   | INFORMATION SCHEMA | osmetrics-diskscheduler.so | GPL     |    
| OS_LOADAVG                  | ACTIVE   | INFORMATION SCHEMA | osmetrics-loadavg.so       | GPL     |
| OS_MEMINFO                  | ACTIVE   | INFORMATION SCHEMA | osmetrics-meminfo.so       | GPL     |
| OS_MEMORY                   | ACTIVE   | INFORMATION SCHEMA | osmetrics-memory.so        | GPL     |
| OS_MISC                     | ACTIVE   | INFORMATION SCHEMA | osmetrics-misc.so          | GPL     |
| OS_MOUNTS                   | ACTIVE   | INFORMATION SCHEMA | osmetrics-mounts.so        | GPL     |
| OS_NETWORK                  | ACTIVE   | INFORMATION SCHEMA | osmetrics-network.so       | GPL     |
| OS_STAT                     | ACTIVE   | INFORMATION SCHEMA | osmetrics-stat.so          | GPL     |
| OS_SWAPINFO                 | ACTIVE   | INFORMATION SCHEMA | osmetrics-swapinfo.so      | GPL     |
| OS_VERSION                  | ACTIVE   | INFORMATION SCHEMA | osmetrics-version.so       | GPL     |
| OS_VMSTAT                   | ACTIVE   | INFORMATION SCHEMA | osmetrics-vmstat.so        | GPL     |
+-----------------------------+----------+--------------------+----------------------------+---------+

Querying The Plugins

Let’s look at some example output from each of the plugins below:

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_CPU;
+---------------+------------+--------------------------------------------------------------------+
| name          | value      | comment                                                            |
+---------------+------------+--------------------------------------------------------------------+
| numcores      |          1 | Number of virtual CPU cores                                        |
| speed         |   2299.892 | CPU speed in MHz                                                   |
| bogomips      |    4600.08 | CPU bogomips                                                       |
| user          |          0 | Normal processes executing in user mode                            |
| nice          |       4213 | Niced processes executing in user mode                             |
| sys           |     610627 | Processes executing in kernel mode                                 |
| idle          |        524 | Processes which are idle                                           |
| iowait        |          0 | Processes waiting for I/O to complete                              |
| irq           |          9 | Processes servicing interrupts                                     |
| softirq       |        765 | Processes servicing Softirqs                                       |
| guest         |          0 | Processes running a guest                                          |
| guest_nice    |          0 | Processes running a niced guest                                    |
| intr          |     200642 | Count of interrupts serviced since boot time                       |
| ctxt          |     434493 | Total number of context switches across all CPUs                   |
| btime         | 1595891204 | Ttime at which the system booted, in seconds since the Unix epoch  |
| processes     |       9270 | Number of processes and threads created                            |
| procs_running |          3 | Total number of threads that are running or ready to run           |
| procs_blocked |          0 | Number of processes currently blocked, waiting for I/O to complete |
| softirq       |        765 | Counts of softirqs serviced since boot time                        |
| idle_pct      |       0.09 | Average CPU idle time                                              |
| util_pct      |      99.91 | Average CPU utilization                                            |
| procs         |        120 | Number of current processes                                        |
| uptime_tv_sec |          1 | User CPU time used (in seconds)                                    |
| utime_tv_usec |     943740 | User CPU time used (in microseconds)                               |
| stime_tv_sec  |          1 | System CPU time (in seconds)                                       |
| stime_tv_usec |     315574 | System CPU time (in microseconds)                                  |
| utime         |    1.94374 | Total user time                                                    |
| stime         |   1.315574 | Total system time                                                  |
| minflt        |      34783 | Page reclaims (soft page faults)                                   |
| majflt        |          0 | Page faults                                                        |
| nvcsw         |        503 | Number of voluntary context switches                               |
| nivcsw        |        135 | Number of involuntary context switches                             |
+---------------+------------+--------------------------------------------------------------------+
32 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_CPUGOVERNOR;
+--------+-------------+
| name   | governor    |
+--------+-------------+
| cpu0   | performance |
+--------+-------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_CPUINFO;
+-----------+--------------+------------+-------+-------------------------------------------+----------+-----------+----------+------------+-------------+----------+---------+-----------+--------+----------------+-----+---------------+-------------+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+----------+--------------+-----------------+-----------------------------------+------------------+
| processor | vendor_id    | cpu_family | model | model_name                                | stepping | microcode | cpu_MHz  | cache_size | physical_id | siblings | core_id | cpu_cores | apicid | initial_apicid | fpu | fpu_exception | cpuid_level | wp  | flags                                                                                                                                                                                                                                                                   | bugs                                                                               | bogomips | clflush_size | cache_alignment | address_sizes                     | power_management |
+-----------+--------------+------------+-------+-------------------------------------------+----------+-----------+----------+------------+-------------+----------+---------+-----------+--------+----------------+-----+---------------+-------------+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+----------+--------------+-----------------+-----------------------------------+------------------+
| 0         | GenuineIntel | 6          | 63    | Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz | 2        | 0x43      | 2400.005 | 30720 KB   | 0           | 1        | 0       | 1         | 0      | 0              | yes | yes           | 13          | yes | fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave a | cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit | 4800.11  | 64           | 64              | 46 bits physical, 48 bits virtual |                  |
+-----------+--------------+------------+-------+-------------------------------------------+----------+-----------+----------+------------+-------------+----------+---------+-----------+--------+----------------+-----+---------------+-------------+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+----------+--------------+-----------------+-----------------------------------+------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_IOSCHEDULER;
+--------+-----------+
| device | scheduler |
+--------+-----------+
| xvda   | [noop]    |
+--------+-----------+
1 row in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_DISKSTATS;
+-----------+-----------+--------+---------------+--------------+--------------+---------------+----------------+---------------+-----------------+----------------+-----------------+------------------+---------------------------+
| major_num | minor_num | device | reads_success | reads_merged | sectors_read | time_reads_ms | writes_success | writes_merged | sectors_written | time_writes_ms | ios_in_progress | time_doing_io_ms | weighted_time_doing_io_ms |
+-----------+-----------+--------+---------------+--------------+--------------+---------------+----------------+---------------+-----------------+----------------+-----------------+------------------+---------------------------+
|       202 |         0 | xvda   |         10286 |           10 |       472913 |          7312 |           4137 |          2472 |          351864 |          14276 |               0 |             4452 |                     21580 |
|       202 |         1 | xvda1  |         10209 |           10 |       468929 |          7280 |           4137 |          2472 |          351864 |          14276 |               0 |             4436 |                     21548 |
|       202 |         2 | xvda2  |            40 |            0 |         3504 |            24 |              0 |             0 |               0 |              0 |               0 |               24 |                        24 |
+-----------+-----------+--------+---------------+--------------+--------------+---------------+----------------+---------------+-----------------+----------------+-----------------+------------------+---------------------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_LOADAVG;
+--------+-------+------------------------+
| name   | value | comment                |
+--------+-------+------------------------+
| 1_min  |  0.09 | 1 minute load average  |
| 5_min  |  0.02 | 5 minute load average  |
| 15_min |  0.01 | 15 minute load average |
+--------+-------+------------------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_MEMINFO;
+-----------------+------------+
| name            | value      |
+-----------------+------------+
| MemTotal        | 2090319872 |
| MemFree         | 1658920960 |
| MemAvailable    | 1762938880 |
| Buffers         |   22573056 |
| Cached          |  206209024 |
| SwapCached      |          0 |
| Active          |  284999680 |
| Inactive        |  100868096 |
| Active(anon)    |  156110848 |
| Inactive(anon)  |      53248 |
| Active(file)    |  128888832 |
| Inactive(file)  |  100814848 |
| Unevictable     |          0 |
| Mlocked         |          0 |
| SwapTotal       |          0 |
| SwapFree        |          0 |
| Dirty           |     811008 |
| Writeback       |          0 |
| AnonPages       |  157085696 |
| Mapped          |   53223424 |
| Shmem           |      65536 |
| Slab            |   29102080 |
| SReclaimable    |   18337792 |
| SUnreclaim      |   10764288 |
| KernelStack     |    2162688 |
| PageTables      |    3444736 |
| NFS_Unstable    |          0 |
| Bounce          |          0 |
| WritebackTmp    |          0 |
| CommitLimit     | 1045159936 |
| Committed_AS    |  770662400 |
| VmallocTotal    | 4294966272 |
| VmallocUsed     |          0 |
| VmallocChunk    |          0 |
| AnonHugePages   |          0 |
| ShmemHugePages  |          0 |
| ShmemPmdMapped  |          0 |
| HugePages_Total |          0 |
| HugePages_Free  |          0 |
| HugePages_Rsvd  |          0 |
| HugePages_Surp  |          0 |
| Hugepagesize    |    2097152 |
| DirectMap4k     |   60817408 |
| DirectMap2M     | 2086666240 |
+-----------------+------------+
44 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_MEMORY;
+----------------+-----------------------+--------------------------------------+
| name           | value                 | comment                              |
+----------------+-----------------------+--------------------------------------+
| total_ram      |            2090319872 | Total usable main memory size        |
| free_ram       |            1452339200 | Available memory size                |
| used_ram       |             637980672 | Used memory size                     |
| free_ram_pct   |                 69.48 | Available memory as a percentage     |
| used_ram_pct   |                 30.52 | Free memory as a percentage          |
| shared_ram     |                 61440 | Amount of shared memory              |
| buffer_ram     |             108040192 | Memory used by buffers               |
| total_high_ram |                     0 | Total high memory size               |
| free_high_ram  |                     0 | Available high memory size           |
| total_low_ram  |            2090319872 | Total low memory size                |
| free_low_ram   |            1452339200 | Available low memory size            |
| maxrss         |       140308942222128 | Maximum resident set size            |
+----------------+-----------------------+--------------------------------------+
12 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_MISC;
+-----------------------+-------------------+-------------------------------------------------+
| name                  | value             | comment                                         |
+-----------------------+-------------------+-------------------------------------------------+
| datadir_size          |        8318783488 | MySQL data directory size                       |
| datadir_size_free     |        2277470208 | MySQL data directory size free space            |
| datadir_size_used     |        6041313280 | MySQL data directory size used space            |
| datadir_size_used_pct |             72.62 | MySQL data directory used space as a percentage |
| uptime                |            100026 | Uptime (in seconds)                             |
| uptime_days           |                 1 | Uptime (in days)                                |
| uptime_hours          |                27 | Uptime (in hours)                               |
| procs                 |               122 | Number of current processes                     |
| swappiness            |                60 | Swappiness setting                              |
+-----------------------+-------------------+-------------------------------------------------+
9 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_MOUNTS;
+------------+--------------------------+------------------+---------------------------+
| device     | mount_point              | file_system_type | mount_options             |
+------------+--------------------------+------------------+---------------------------+
| proc       | /proc                    | proc             | rw,relatime               |
| sysfs      | /sys                     | sysfs            | rw,relatime               |
| devtmpfs   | /dev                     | devtmpfs         | rw,relatime,size=1010060k |
| devpts     | /dev/pts                 | devpts           | rw,relatime,gid=5,mode=62 |
| tmpfs      | /dev/shm                 | tmpfs            | rw,relatime               |
| /dev/xvda1 | /                        | ext4             | rw,noatime,data=ordered   |
| devpts     | /dev/pts                 | devpts           | rw,relatime,gid=5,mode=62 |
| none       | /proc/sys/fs/binfmt_misc | binfmt_misc      | rw,relatime               |
+------------+--------------------------+------------------+---------------------------+
8 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_NETWORK;
+-----------+------------+------------+----------+----------+------------+------------+-----------+-----------+
| interface | tx_packets | rx_packets | tx_bytes | rx_bytes | tx_dropped | rx_dropped | tx_errors | rx_errors |
+-----------+------------+------------+----------+----------+------------+------------+-----------+-----------+
| lo        |      26528 |      26528 |  1380012 |  1380012 |          0 |          0 |         0 |         0 |
| eth0      |     102533 |     144031 | 16962983 | 23600676 |          0 |          0 |         0 |         0 |
+-----------+------------+------------+----------+----------+------------+------------+-----------+-----------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_STAT;
+------+----------+-------+------+------+---------+--------+-------+---------+--------+---------+--------+---------+-------+-------+--------+--------+----------+------+-------------+-------------+-----------+------------+-------+-----------------------+-----------+----------+-----------------+---------+---------+--------+---------+-----------+----------+-------+-------+--------+-------------+-----------+-------------+--------+-----------------------+------------+-------------+------------+----------+-----------+-----------------+-----------------+-----------------+-----------------+-----------+
| pid  | comm     | state | ppid | pgrp | session | tty_nr | tpgid | flags   | minflt | cminflt | majflt | cmajflt | utime | stime | cutime | cstime | priority | nice | num_threads | itrealvalue | starttime | vsize      | rss   | rsslim                | startcode | endcode  | startstack      | kstkeep | kstkeip | signal | blocked | sigignore | sigcatch | wchan | nswap | cnswap | exit_signal | processor | rt_priority | policy | delayacct_blkio_ticks | guest_time | cguest_time | start_data | end_data | start_brk | arg_start       | arg_end         | env_start       | env_end         | exit_code |
+------+----------+-------+------+------+---------+--------+-------+---------+--------+---------+--------+---------+-------+-------+--------+--------+----------+------+-------------+-------------+-----------+------------+-------+-----------------------+-----------+----------+-----------------+---------+---------+--------+---------+-----------+----------+-------+-------+--------+-------------+-----------+-------------+--------+-----------------------+------------+-------------+------------+----------+-----------+-----------------+-----------------+-----------------+-----------------+-----------+
| 6656 | (mysqld) | S     | 2030 | 1896 |    1896 |      0 |    -1 | 4194304 |  34784 |       0 |      0 |       0 |    96 |    55 |      0 |      0 |       20 |    0 |          29 |           0 |    965078 | 1153900544 | 37324 | 1.8446744073709552e19 |   4194304 | 27414570 | 140728454321408 |       0 |       0 |      0 |  540679 |     12294 |     1768 |     0 |     0 |      0 |          17 |         0 |           0 |      0 |                     4 |          0 |           0 |   29511728 | 31209920 |  36462592 | 140728454327797 | 140728454328040 | 140728454328040 | 140728454328281 |         0 |
+------+----------+-------+------+------+---------+--------+-------+---------+--------+---------+--------+---------+-------+-------+--------+--------+----------+------+-------------+-------------+-----------+------------+-------+-----------------------+-----------+----------+-----------------+---------+---------+--------+---------+-----------+----------+-------+-------+--------+-------------+-----------+-------------+--------+-----------------------+------------+-------------+------------+----------+-----------+-----------------+-----------------+-----------------+-----------------+-----------+
1 row in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_SWAPINFO;
+---------------+-------+--------------------------------------+
| name          | value | comment                              |
+---------------+-------+--------------------------------------+
| total_swap    |     0 | Total swap space size                |
| free_swap     |     0 | Swap space available                 |
| used_swap     |     0 | Swap space used                      |
| free_swap_pct |     0 | Swap space available as a percentage |
| used_swap_pct |     0 | Swap space used as a percentage      |
+---------------+-------+--------------------------------------+
5 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_VMSTAT;
+---------------------------+----------+
| name                      | value    |
+---------------------------+----------+
| nr_free_pages             |   354544 |
| nr_zone_inactive_anon     |       13 |
| nr_zone_active_anon       |    38257 |
| nr_zone_inactive_file     |    30742 |
| nr_zone_active_file       |    70367 |
| nr_zone_unevictable       |        0 |
| nr_zone_write_pending     |        9 |
| nr_mlock                  |        0 |
| nr_page_table_pages       |      854 |
| nr_kernel_stack           |     2176 |
| nr_bounce                 |        0 |
| nr_zspages                |        0 |
| nr_free_cma               |        0 |
| numa_hit                  | 14011351 |
| numa_miss                 |        0 |
| numa_foreign              |        0 |
| numa_interleave           |    15133 |
| numa_local                | 14011351 |
| numa_other                |        0 |
| nr_inactive_anon          |       13 |
| nr_active_anon            |    38257 |
| nr_inactive_file          |    30742 |
| nr_active_file            |    70367 |
| nr_unevictable            |        0 |
| nr_slab_reclaimable       |     9379 |
| nr_slab_unreclaimable     |     2829 |
| nr_isolated_anon          |        0 |
| nr_isolated_file          |        0 |
| workingset_refault        |        0 |
| workingset_activate       |        0 |
| workingset_nodereclaim    |        0 |
| nr_anon_pages             |    38504 |
| nr_mapped                 |    13172 |
| nr_file_pages             |   100873 |
| nr_dirty                  |        9 |
| nr_writeback              |        0 |
| nr_writeback_temp         |        0 |
| nr_shmem                  |       15 |
| nr_shmem_hugepages        |        0 |
| nr_shmem_pmdmapped        |        0 |
| nr_anon_transparent_hugep |        0 |
| nr_unstable               |        0 |
| nr_vmscan_write           |        0 |
| nr_vmscan_immediate_recla |        0 |
| nr_dirtied                |   389218 |
| nr_written                |   381326 |
| nr_dirty_threshold        |    87339 |
| nr_dirty_background_thres |    43616 |
| pgpgin                    |   619972 |
| pgpgout                   |  2180908 |
| pswpin                    |        0 |
| pswpout                   |        0 |
| pgalloc_dma               |        0 |
| pgalloc_dma32             | 14085334 |
| pgalloc_normal            |        0 |
| pgalloc_movable           |        0 |
| allocstall_dma            |        0 |
| allocstall_dma32          |        0 |
| allocstall_normal         |        0 |
| allocstall_movable        |        0 |
| pgskip_dma                |        0 |
| pgskip_dma32              |        0 |
| pgskip_normal             |        0 |
| pgskip_movable            |        0 |
| pgfree                    | 14440053 |
| pgactivate                |    55703 |
| pgdeactivate              |        1 |
| pglazyfree                |      249 |
| pgfault                   | 14687206 |
| pgmajfault                |     1264 |
| pglazyfreed               |        0 |
| pgrefill                  |        0 |
| pgsteal_kswapd            |        0 |
| pgsteal_direct            |        0 |
| pgscan_kswapd             |        0 |
| pgscan_direct             |        0 |
| pgscan_direct_throttle    |        0 |
| zone_reclaim_failed       |        0 |
| pginodesteal              |        0 |
| slabs_scanned             |        0 |
| kswapd_inodesteal         |        0 |
| kswapd_low_wmark_hit_quic |        0 |
| kswapd_high_wmark_hit_qui |        0 |
| pageoutrun                |        0 |
| pgrotated                 |       44 |
| drop_pagecache            |        0 |
| drop_slab                 |        0 |
| oom_kill                  |        0 |
| numa_pte_updates          |        0 |
| numa_huge_pte_updates     |        0 |
| numa_hint_faults          |        0 |
| numa_hint_faults_local    |        0 |
| numa_pages_migrated       |        0 |
| pgmigrate_success         |        0 |
| pgmigrate_fail            |        0 |
| compact_migrate_scanned   |        0 |
| compact_free_scanned      |        0 |
| compact_isolated          |        0 |
| compact_stall             |        0 |
| compact_fail              |        0 |
| compact_success           |        0 |
| compact_daemon_wake       |        0 |
| compact_daemon_migrate_sc |        0 |
| compact_daemon_free_scann |        0 |
| htlb_buddy_alloc_success  |        0 |
| htlb_buddy_alloc_fail     |        0 |
| unevictable_pgs_culled    |     1300 |
| unevictable_pgs_scanned   |        0 |
| unevictable_pgs_rescued   |      266 |
| unevictable_pgs_mlocked   |     2626 |
| unevictable_pgs_munlocked |     2626 |
| unevictable_pgs_cleared   |        0 |
| unevictable_pgs_stranded  |        0 |
| thp_fault_alloc           |        0 |
| thp_fault_fallback        |        0 |
| thp_collapse_alloc        |        0 |
| thp_collapse_alloc_failed |        0 |
| thp_file_alloc            |        0 |
| thp_file_mapped           |        0 |
| thp_split_page            |        0 |
| thp_split_page_failed     |        0 |
| thp_deferred_split_page   |        0 |
| thp_split_pmd             |        0 |
| thp_split_pud             |        0 |
| thp_zero_page_alloc       |        0 |
| thp_zero_page_alloc_faile |        0 |
| thp_swpout                |        0 |
| thp_swpout_fallback       |        0 |
| swap_ra                   |        0 |
| swap_ra_hit               |        0 |
+---------------------------+----------+
130 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.OS_VERSION;
+------------+---------------------------+
| name       | value                     |
+------------+---------------------------+
| sysname    | Linux                     |
| nodename   | ip-172-31-25-133          |
| release    | 4.14.181-108.257.amzn1.x8 |
| version    | #1 SMP Wed May 27 02:43:0 |
| machine    | x86_64                    |
| domainname | (none)                    |
+------------+---------------------------+
6 rows in set (0.00 sec)

Uninstalling The Plugins

To uninstall the plugins, you can remove them with the following SQL commands. To remove them completely, you will also need to delete the plugin libraries from your plugin directory.

mysql> UNINSTALL PLUGIN OS_CPU;
mysql> UNINSTALL PLUGIN OS_CPUGOVERNOR;
mysql> UNINSTALL PLUGIN OS_CPUINFO;
mysql> UNINSTALL PLUGIN OS_IOSCHEDULER;
mysql> UNINSTALL PLUGIN OS_DISKSTATS;
mysql> UNINSTALL PLUGIN OS_LOADAVG;
mysql> UNINSTALL PLUGIN OS_MEMINFO;
mysql> UNINSTALL PLUGIN OS_MEMORY;
mysql> UNINSTALL PLUGIN OS_MISC;
mysql> UNINSTALL PLUGIN OS_MOUNTS;
mysql> UNINSTALL PLUGIN OS_NETWORK;
mysql> UNINSTALL PLUGIN OS_STAT;
mysql> UNINSTALL PLUGIN OS_SWAPINFO;
mysql> UNINSTALL PLUGIN OS_VERSION;
mysql> UNINSTALL PLUGIN OS_VMSTAT;

Alternatively, you can run the uninstall SQL script:

mysql> SOURCE /path/to/uninstall_plugins.sql

What’s Next?

Who knows! If there is enough interest, I would be happy to expand the plugins. First, I need to do some more code cleanup and performance test them. I do not expect them to have a significant performance impact, but one never knows until you test…

Jan
08
2021
--

MongoDB 101: How to Tune Your MongoDB Configuration After Upgrading to More Memory

In this post, we will discuss what to do when you add more memory to your MongoDB deployment, a common practice when you are scaling resources.

Why Might You Need to Add More Memory?

Scaling is how you add more resources to your environment. There are two main ways to accomplish it: vertical scaling and horizontal scaling.

  • Vertical scaling is increasing hardware capacity for a given instance, thus having a more powerful server.
  • Horizontal scaling is when you add more servers to your architecture.   A pretty standard approach for horizontal scaling, especially for databases,  is load balancing and sharding.

As your application grows, its working set gets bigger, and we start to see bottlenecks as data that doesn’t fit into memory has to be retrieved from disk. Reading from disk is a costly operation, even with modern NVMe drives, so we will need to deal with either of the scaling solutions we mentioned.

In this case, we will discuss adding more RAM, which is usually the fastest and easiest way to scale hardware vertically, and how having more memory can be a major help for MongoDB performance.

How to Calculate Memory Utilization in MongoDB

Before we add memory to our MongoDB deployment, we need to understand our current Memory Utilization.  This is best done by querying serverStatus and requesting data on the WiredTiger cache.

Since MongoDB 3.2, MongoDB has used WiredTiger as its default storage engine. By default, MongoDB reserves 50% of (available memory minus 1 GB) for the WiredTiger cache, or 256 MB, whichever is greater.

For example, a system with 16 GB of RAM would have a WiredTiger cache size of 7.5 GB:

0.5 * (16 - 1) = 7.5

The size of this cache is important to ensure WiredTiger is performant. It’s worth taking a look to see if you should alter it from the default. A good rule is that the size of the cache should be large enough to hold the entire application working set.

How do we know whether to alter it? Let’s look at the cache usage statistics:

db.serverStatus().wiredTiger.cache
{
"application threads page read from disk to cache count" : 9,
"application threads page read from disk to cache time (usecs)" : 17555,
"application threads page write from cache to disk count" : 1820,
"application threads page write from cache to disk time (usecs)" : 1052322,
"bytes allocated for updates" : 20043,
"bytes belonging to page images in the cache" : 46742,
"bytes belonging to the history store table in the cache" : 173,
"bytes currently in the cache" : 73044,
"bytes dirty in the cache cumulative" : 38638327,
"bytes not belonging to page images in the cache" : 26302,
"bytes read into cache" : 43280,
"bytes written from cache" : 20517382,
"cache overflow score" : 0,
"checkpoint blocked page eviction" : 0,
"eviction calls to get a page" : 5973,
"eviction calls to get a page found queue empty" : 4973,
"eviction calls to get a page found queue empty after locking" : 20,
"eviction currently operating in aggressive mode" : 0,
"eviction empty score" : 0,
"eviction passes of a file" : 0,
"eviction server candidate queue empty when topping up" : 0,
"eviction server candidate queue not empty when topping up" : 0,
"eviction server evicting pages" : 0,
"eviction server slept, because we did not make progress with eviction" : 735,
"eviction server unable to reach eviction goal" : 0,
"eviction server waiting for a leaf page" : 2,
"eviction state" : 64,
"eviction walk target pages histogram - 0-9" : 0,
"eviction walk target pages histogram - 10-31" : 0,
"eviction walk target pages histogram - 128 and higher" : 0,
"eviction walk target pages histogram - 32-63" : 0,
"eviction walk target pages histogram - 64-128" : 0,
"eviction walk target strategy both clean and dirty pages" : 0,
"eviction walk target strategy only clean pages" : 0,
"eviction walk target strategy only dirty pages" : 0,
"eviction walks abandoned" : 0,
"eviction walks gave up because they restarted their walk twice" : 0,
"eviction walks gave up because they saw too many pages and found no candidates" : 0,
"eviction walks gave up because they saw too many pages and found too few candidates" : 0,
"eviction walks reached end of tree" : 0,
"eviction walks started from root of tree" : 0,
"eviction walks started from saved location in tree" : 0,
"eviction worker thread active" : 4,
"eviction worker thread created" : 0,
"eviction worker thread evicting pages" : 902,
"eviction worker thread removed" : 0,
"eviction worker thread stable number" : 0,
"files with active eviction walks" : 0,
"files with new eviction walks started" : 0,
"force re-tuning of eviction workers once in a while" : 0,
"forced eviction - history store pages failed to evict while session has history store cursor open" : 0,
"forced eviction - history store pages selected while session has history store cursor open" : 0,
"forced eviction - history store pages successfully evicted while session has history store cursor open" : 0,
"forced eviction - pages evicted that were clean count" : 0,
"forced eviction - pages evicted that were clean time (usecs)" : 0,
"forced eviction - pages evicted that were dirty count" : 0,
"forced eviction - pages evicted that were dirty time (usecs)" : 0,
"forced eviction - pages selected because of too many deleted items count" : 0,
"forced eviction - pages selected count" : 0,
"forced eviction - pages selected unable to be evicted count" : 0,
"forced eviction - pages selected unable to be evicted time" : 0,
"forced eviction - session returned rollback error while force evicting due to being oldest" : 0,
"hazard pointer blocked page eviction" : 0,
"hazard pointer check calls" : 902,
"hazard pointer check entries walked" : 25,
"hazard pointer maximum array length" : 1,
"history store key truncation calls that returned restart" : 0,
"history store key truncation due to mixed timestamps" : 0,
"history store key truncation due to the key being removed from the data page" : 0,
"history store score" : 0,
"history store table insert calls" : 0,
"history store table insert calls that returned restart" : 0,
"history store table max on-disk size" : 0,
"history store table on-disk size" : 0,
"history store table out-of-order resolved updates that lose their durable timestamp" : 0,
"history store table out-of-order updates that were fixed up by moving existing records" : 0,
"history store table out-of-order updates that were fixed up during insertion" : 0,
"history store table reads" : 0,
"history store table reads missed" : 0,
"history store table reads requiring squashed modifies" : 0,
"history store table remove calls due to key truncation" : 0,
"history store table writes requiring squashed modifies" : 0,
"in-memory page passed criteria to be split" : 0,
"in-memory page splits" : 0,
"internal pages evicted" : 0,
"internal pages queued for eviction" : 0,
"internal pages seen by eviction walk" : 0,
"internal pages seen by eviction walk that are already queued" : 0,
"internal pages split during eviction" : 0,
"leaf pages split during eviction" : 0,
"maximum bytes configured" : 8053063680,
"maximum page size at eviction" : 376,
"modified pages evicted" : 902,
"modified pages evicted by application threads" : 0,
"operations timed out waiting for space in cache" : 0,
"overflow pages read into cache" : 0,
"page split during eviction deepened the tree" : 0,
"page written requiring history store records" : 0,
"pages currently held in the cache" : 24,
"pages evicted by application threads" : 0,
"pages queued for eviction" : 0,
"pages queued for eviction post lru sorting" : 0,
"pages queued for urgent eviction" : 902,
"pages queued for urgent eviction during walk" : 0,
"pages read into cache" : 20,
"pages read into cache after truncate" : 902,
"pages read into cache after truncate in prepare state" : 0,
"pages requested from the cache" : 33134,
"pages seen by eviction walk" : 0,
"pages seen by eviction walk that are already queued" : 0,
"pages selected for eviction unable to be evicted" : 0,
"pages selected for eviction unable to be evicted as the parent page has overflow items" : 0,
"pages selected for eviction unable to be evicted because of active children on an internal page" : 0,
"pages selected for eviction unable to be evicted because of failure in reconciliation" : 0,
"pages walked for eviction" : 0,
"pages written from cache" : 1822,
"pages written requiring in-memory restoration" : 0,
"percentage overhead" : 8,
"tracked bytes belonging to internal pages in the cache" : 5136,
"tracked bytes belonging to leaf pages in the cache" : 67908,
"tracked dirty bytes in the cache" : 493,
"tracked dirty pages in the cache" : 1,
"unmodified pages evicted" : 0
}

 

There’s a lot of data here about WiredTiger’s cache, but we can focus on the following fields:

  • wiredTiger.cache.maximum bytes configured – This is the current maximum cache size.
  • wiredTiger.cache.bytes currently in the cache – This is the size of the data currently in the cache. It is typically around 80% of your cache size plus the amount of “dirty” cache that has not yet been written to disk, and it should not be greater than the maximum bytes configured. A value equal to or greater than the maximum bytes configured is a strong indicator that you should already have scaled out.
  • wiredTiger.cache.tracked dirty bytes in the cache – This is the size of the dirty data in the cache. It should stay below five percent of your cache size and can be another indicator that you need to scale out. Once it exceeds five percent of your cache size, WiredTiger becomes more aggressive about evicting data from the cache and, in some cases, may force your client to evict data from the cache before it can successfully write to it.
  • wiredTiger.cache.pages read into cache – This is the cumulative number of pages read into the cache; its per-second average tells you how much data is coming into your cache.
  • wiredTiger.cache.pages written from cache – This is the number of pages written from the cache to disk. This is especially heavy just before checkpoints. If this value keeps increasing, your checkpoints will keep getting longer.
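
If you want to repeat these checks quickly, a small mongo shell sketch along the following lines (a sketch only, using the field names from the output above) pulls out just these values and expresses the cache usage against the configured maximum:

// Sketch: pull the cache fields discussed above and show usage vs. the maximum.
var c = db.serverStatus().wiredTiger.cache;
var maxBytes   = c["maximum bytes configured"];
var inCache    = c["bytes currently in the cache"];
var dirtyBytes = c["tracked dirty bytes in the cache"];
print("cache used : " + (100 * inCache / maxBytes).toFixed(2) + "% of maximum");
print("cache dirty: " + (100 * dirtyBytes / maxBytes).toFixed(2) + "% of maximum");
print("pages read into cache    : " + c["pages read into cache"]);
print("pages written from cache : " + c["pages written from cache"]);

Run over time, the two percentages make it easy to see when the cache is approaching the thresholds described above.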

Looking at the above values, we can determine whether we need to increase the size of the WiredTiger cache for our instance. We might also look at WiredTiger concurrency read and write ticket usage. It’s fine for some tickets to be in use, but if the number keeps growing toward the number of CPU cores, you are approaching CPU saturation. You can check ticket usage in Percona Monitoring and Management (PMM) or by running the following query:

 

db.serverStatus().wiredTiger.concurrentTransactions
{
    "write" : {
        "out" : 0,
        "available" : 128,
        "totalTickets" : 128
    },
    "read" : {
        "out" : 1,
        "available" : 127,
        "totalTickets" : 128
    }
}
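
If you would rather watch a single, compact readout, a small sketch like this (assuming the same serverStatus() structure shown above) prints tickets in use against the totals:

// Sketch: show how many read/write tickets are currently in use.
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("read tickets in use : " + t.read.out + " / " + t.read.totalTickets);
print("write tickets in use: " + t.write.out + " / " + t.write.totalTickets);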

 

The wiredTiger.cache.pages read into cache value can also indicate a problem for read-heavy applications. If this value consistently makes up a large portion of your cache size, increasing your memory may improve overall read performance.
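
One rough way to estimate the per-second rate of pages coming into the cache, sketched here under the assumption that a short sampling window is acceptable, is to read the counter twice and divide by the interval:

// Sketch: sample the counter twice, 60 seconds apart, and compute the rate.
var before = db.serverStatus().wiredTiger.cache["pages read into cache"];
sleep(60 * 1000); // mongo shell sleep() takes milliseconds
var after = db.serverStatus().wiredTiger.cache["pages read into cache"];
print("pages read into cache per second: " + ((after - before) / 60).toFixed(1));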

Example

For our example starting point, suppose the cache statistics show that the cache is small and there is definitely memory pressure on the cache.

We are also using the default WiredTiger cache size, and we know the system has 16 GB of memory, so the default cache works out to (0.5 * (16 - 1)) = 7.5 GB. Based on our knowledge of our (imaginary) application, we know the working set is 16 GB, so we want the cache to be larger than that number. To give ourselves room for additional growth, since our working set will only continue to grow, we could resize our server’s RAM from 16 GB to 48 GB.

If we stick with the default settings, this would increase our WiredTiger cache to (0.5 * (48 - 1)) = 23.5 GB, leaving 24.5 GB of RAM for the OS and its filesystem cache. If we wanted to choose the size of the WiredTiger cache ourselves, we would set storage.wiredTiger.engineConfig.cacheSizeGB to the desired value. For example, say we want to allocate 30 GB to the WiredTiger cache to really avoid any reads from disk in the near term, leaving 18 GB for the OS and its filesystem cache. We would add the following to our mongod.conf file:

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 30

For either the default sizing or an explicit cacheSizeGB setting to recognize the added memory and take effect, the mongod process must be restarted.
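
After the restart, a quick sanity check against the same serverStatus() output confirms the new size took effect; with the 30 GB example above, “maximum bytes configured” should report roughly 32212254720 bytes:

// Sketch: confirm the configured WiredTiger cache maximum after restart.
var maxBytes = db.serverStatus().wiredTiger.cache["maximum bytes configured"];
print("WiredTiger cache maximum: " + (maxBytes / 1024 / 1024 / 1024).toFixed(1) + " GB");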

Also note that unlike a lot of other database systems where the database cache is typically sized closer to 80-90% of system memory, MongoDB’s sweet spot is in the 50-70% range.  This is because MongoDB only uses the WiredTiger cache for uncompressed pages, while the operating system caches the compressed pages and writes them to the database files.  By leaving free memory to the operating system, we increase the likelihood of getting the page from the OS cache instead of needing to do a disk read.

Summary

In this article, we’ve gone over how to update your MongoDB configuration after you’ve upgraded to more memory.   We hope that this helps you tune your MongoDB configuration so that you can get the most out of your increased RAM.   Thanks for reading!

Additional Resources:

MongoDB Best Practices 2020 Edition

Tuning MongoDB for Bulk Loads

Jan
07
2021
--

TAM Enterprise Experiences – Data Encryption


In previous TAM Enterprise Experiences posts, we have outlined typical aspects of utilizing MySQL in an Enterprise environment. One thing we have not yet covered is the topic of database encryption, both from the standpoint of business requirements and in terms of some of the more technical aspects of encryption.

In this post, we will cover:

  • Common enterprise compliance requirements
  • Types of MySQL encryption
  • Choosing the right encryption
  • Vault

Common Compliance Requirements

Beyond the obvious security concerns with sensitive data, most enterprise businesses also need to meet various compliance requirements, with the compliance requirement(s) dependent on the country the business is located in, the type of business, and the type of data being stored. Note that in all cases, the onus is on the business to protect the data based on these compliance requirements. Some of the more common enterprise compliance requirements are:

  • GDPR
    • Applies to businesses that process the personal data of individuals in the EU, regardless of where the business itself is located.
    • Article 32(1) of the General Data Protection Regulation requires the implementation of appropriate technical and organizational measures to secure personal data.
    • The GDPR deliberately does not define which specific technical and organizational measures are considered suitable in each case, in order to accommodate individual factors.
    • Source: https://gdpr-info.eu/issues/encryption/
  • HIPAA
    • Applies to the medical industry within the United States. 
    • The HIPAA encryption requirements for transmission security state that covered entities should “implement a mechanism to encrypt personal health information (PHI) whenever deemed appropriate.”
    • Source: https://www.hipaajournal.com/hipaa-encryption-requirements/
  • PCI DSS
    • Applies to businesses that store, process, or transmit payment card data.
    • PCI encryption Requirement 3 of the Payment Card Industry’s Data Security Standard (PCI DSS) is to “protect stored cardholder data.” 
    • The public assumes merchants and financial institutions will protect data on payment cards to thwart theft and prevent unauthorized use.
    • Encryption of cardholder data with strong cryptography is an acceptable method of rendering the data unreadable in order to meet PCI DSS Requirement 3.4.
    • Source: https://www.pcisecuritystandards.org/pdfs/pci_fs_data_storage.pdf

Outside of compliance, there are of course other critical reasons for an enterprise business to encrypt and protect data. A security breach could result in a major negative business impact at best, and complete ruin at worst. Protecting business secrets from the competition, along with the overall ethical and moral responsibility to safeguard information and data, are further reasons that data security and encryption should always be taken seriously, regardless of business size or industry.

MySQL Encryption

There are several types of MySQL encryption:

  • Encryption At Rest
    • Full disk encryption
    • Encrypted database files
  • Encryption In Transit
    • TLS + Enforcement of SSL for TCP/IP user and replication accounts
    • Use of a UNIX socket connection instead of a TCP/IP MySQL connection
  • Encryption In Use
    • Applications encrypt data before storing it and decrypt it once retrieved.
    • The application takes responsibility for data security.

Choosing the Right Encryption

No matter the circumstances, encryption in transit should be used, at a bare minimum, to protect data in flight. All replication accounts and all user accounts should be using TLS, with SSL enforced.
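
As a minimal sketch of what that enforcement can look like (the account names here are hypothetical, and require_secure_transport requires MySQL 5.7.8 or later):

-- Require TLS for an application account and a replication account (hypothetical names):
ALTER USER 'app_user'@'%' REQUIRE SSL;
ALTER USER 'repl_user'@'%' REQUIRE SSL;

-- Optionally reject any unencrypted TCP/IP connection server-wide:
SET GLOBAL require_secure_transport = ON;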

Perhaps at some point in the future MySQL will mature to the point where encryption in use no longer needs to be handled by the application, and that last bullet point can be dropped from the list. For now, however, tools such as strace can expose unencrypted data in memory on the MySQL server. Adding this additional layer of application-side encryption ensures that data held in memory on the database server remains encrypted and protected.

Encryption at the Volume/OS/Block Level

Strengths:

  • Simple to encrypt the volume or disk
  • MySQL isn’t aware of any change
  • The application isn’t aware of a change
  • Inexpensive

Weaknesses:

  • Doesn’t protect from insider threats
  • Centralized key storage and compliance can be problematic

Encryption at the Database Level

Strengths:

  • Protects from insider threats
  • Can encrypt across volumes you don’t control
  • Backups and restores are automatically encrypted
  • Lower overhead (3-5% performance hit)
  • DBA controlled
  • Centralized key storage
  • Compliance ready

Weaknesses:

  • Still vulnerable to in-memory attacks
  • More setup / complications
  • Loss of keys would be catastrophic
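
To make the database-level option concrete, here is a minimal sketch using InnoDB data-at-rest encryption; the table names are hypothetical, and it assumes a keyring plugin (such as keyring_file or keyring_vault) is already configured:

-- Create a new table with its tablespace encrypted (hypothetical table):
CREATE TABLE customer_pii (
    id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(128)
) ENCRYPTION='Y';

-- An existing, unencrypted table (hypothetical) can be converted in place:
ALTER TABLE legacy_orders ENCRYPTION='Y';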

Encryption at the Application Level

Strengths:

  • Database servers are protected at all levels automatically since the data is unusable without decryption
  • Most flexible
  • Little to no overhead on databases

Weaknesses:

  • Many applications are not built with this in mind and are not easy to change
  • Full-text and partial-text search can be problematic
  • Application shoulders all the responsibility for key security

What is HashiCorp’s Vault?

HashiCorp’s Vault is software for securely managing secrets. In this case, a secret is anything you want to tightly control access to, such as API keys, passwords, or certificates. Vault is flexible to administer and can be controlled via a web GUI or the command line. There is also a robust HTTP API, usable via curl, with various ways to authenticate, and HashiCorp regularly pushes out updates to keep Vault current.

Vault Strengths:

  • One easily managed, centralized location for all keys
  • No backups of keyring file needed
  • Better security as the key is nowhere on the MySQL server itself
  • Powerful auditing capabilities

Conclusion

In a perfect world, your data would never be at risk and would always be protected. In reality, however, it is up to the data owners/administrators to protect their data using the methods outlined above.
