Oct
19
2020
--

PMM 101: Troubleshooting MongoDB with Percona Monitoring and Management

Troubleshooting MongoDB with Percona Monitoring and Management

Troubleshooting MongoDB with Percona Monitoring and ManagementPercona Monitoring and Management (PMM) is an open-source tool developed by Percona that allows you to monitor and manage your MongoDB, MySQL, and PostgreSQL databases.  This blog will give you an overview of troubleshooting your MongoDB deployments with PMM.

Let’s start with a basic understanding of the architecture of PMM. PMM has two main architectural components:

  1. PMM Client – Client that lives on each database host in your environment.  Collects server metrics, system metrics, and database metrics used for Query Analytics
  2. PMM Server – Central part of PMM that the clients report all of their metric data to.   Also presents dashboards, graphs, and tables of that data in its web interface for visualization of your metric data.

For more details on the architecture of PMM, check out our docs.

Query Analytics

PMM Query Analytics (“QAN”) allows you to analyze MongoDB query performance over periods of time.  In the below screenshot you can see that the most longest-running query was against the testData collection.

Percona Monitoring and Management query analytics

If we drill deeper by clicking on the query in PMM we can see exactly what it was running. In this case, the query was searching in the testData collection of the mpg database looking for records where the value of x is 987544.

Percona Monitoring and Management

This is very helpful in determining what each query is doing, how much it is running, and which queries make up the bulk of your load.

The output is from db.currentOp(), and I agree it may not be clear at a glance what the application-side (or mongo shell) command was. This is a limitation of the MongoDB API in general – the drivers will send the request with perfect functional accuracy but it does not necessarily resemble what the user typed (or programmed).  But with an understanding of this, and focusing first on what the “command” field contains it is not too hard to picture a likely original format. For example the example above could have been sent by running “use mpg; db.testData.find({“x”: { “$lte”: …, “$gt”: … }).skip(0)” in the shell. The last “.skip(0)” is optional as it is 0 by default.

Additionally, you can see the full explain plan for your query just as you would by adding .explain() to your query.   In the below example we can see that the query did a full collection scan on the mpg.testData collection and we should think about adding an index to the ‘x’ field to improve the performance of this query.

Metrics Monitor

Metrics Monitor allows you to monitor, alert, and visualize different metrics related to your database overall, its internal metrics, and the systems they are running on.

Overall System Performance View

The first view that is helpful is your overall system performance view.   Here you can see at a high level, how much CPU and memory are being used, the amount of writes and reads from disk, network bandwidth, # of database connections, database queries per second, RAM, and the uptime for both the host and the database.   This view can often lead you to the problematic node(s) if you’re experiencing any issues and can also give you a high level of the overall health of your monitored environment.

Percona Monitoring and Management system overview

WiredTiger Metrics

Next, we’ll start digging into some of the database internal metrics that are helpful for troubleshooting MongoDB.  These metrics are mostly from the WiredTiger Storage Engine that is the default storage engine for MongoDB since MongoDB 3.0.  In addition to the metrics I cover, there are more documented here.

The WiredTiger storage engine uses tickets as a way to handle concurrency, The default is for WiredTiger to have 128 read and 128 write tickets.  PMM allows you to alert when your available tickets are getting low.  You can also correlate with other metrics as to why so many tickets are being utilized. The graph sample below shows a low-load situation – only ~1 ticket out of 128 was checked out at any time.

Percona Monitoring and Management wiredtiger

One of the metrics that could be causing you to use a large number of tickets is if your checkpoint time is high.   WiredTiger, by default, does a full checkpoint at least every 60 seconds, this is controlled by the WiredTiger parameter checkpoint=(wait=60)).  Checkpointing flushes all the dirty pages to disk. (By the way ‘dirty’ is not as bad as it sounds – it’s just a storage engine term meaning ‘not committed to disk yet’.)  High checkpointing times can lead to more tickets being in use.

Finally, we have WiredTiger Cache Activity metrics. WiredTiger Cache activity indicates the level of data that is being read into or written from the cache.  These metrics can help you baseline your normal cache activity, so you can notice if you have a large amount of data being read into the cache, perhaps from a poorly tuned query, or a lot of data being written from the cache.

WiredTiger Cache Activity

Database Metrics

PMM also has database metrics that are not WiredTiger specific.   Here we can see the uptime for the node, queries per second, latency, connections, and number of cursors.   These are higher-level metrics which can be indicative of a larger problem such as connection storms, storage latency, and excessive queries per second.  These can help you hone in on potential issues for your database.

Percona Monitoring and Management database metrics

Node Overview Metrics

System metrics can point you towards an issue at the O/S level that may or may not correlate to your database.  CPU, CPU saturation, core usage, DISK I/O, Swap Activity, and Network Traffic are some of the metrics that can help you find issues that may start at the O/S level or below.  Additional metrics to the below can be found in our documentation.

Node Overview Metrics

Takeaways

In this blog, we’ve discussed how PMM can help you troubleshoot your MongoDB deployment, whether you’re looking at the WiredTiger specific metrics, system-level metrics, or database level metrics PMM has you covered and can help you troubleshoot your MongoDB deployment.  Thanks for reading!

Additional Resources:

Download Percona Monitoring and Management

PMM for MongoDB Quick Start Guide

PMM Blog Topics

MongoDB Blog Topics

Oct
19
2020
--

5 Things Developers Should Know Before Deploying MongoDB

Developers Should Know Before Deploying MongoDB

Developers Should Know Before Deploying MongoDBMongoDB is one of the most popular databases and is one of the easiest NoSQL databases to set up. Oftentimes, developers want a quick environment to just test out an idea they have for an application or to try and figure out a good data model for their data without waiting for their Operations team to spin up the infrastructure.  What can sometimes happen is these quick, one-off instances grow, and before you know it that little test DB is your production DB supporting your new app. For anyone who finds themselves in this situation, I encourage you to check out our Percona blogs as we have lots of great information for those both new and experienced with MongoDB.  Don’t let the ease of installing MongoDB fool you into a false sense of security, there are things you need to consider as a developer before deploying MongoDB.  Here are five things developers should know before deploying MongoDB in production.

1) Enable Authentication and Authorization

Security is of utmost importance to your database.  While gone are the days when security was disabled by default for MongoDB, it’s still easy to start MongoDB without security.  Without security and with your database bound to a public IP, anyone can connect to your database and steal your data.   By simply adding some important security configuration options to your configuration file, you can ensure that your data is protected.  You can also configure MongoDB to utilize native LDAP or Kerberos for authentication.  Setting up authentication and authorization is one of the simplest ways to ensure that your MongoDB database is secure.  The most important configuration option is turning on authorization which enables users and roles and requires you to authenticate and have the proper roles to access your data.

security:
  authorization: enabled
  keyfile: /path/to/our.keyfile

 

2) Connect to a Replica Set/Multiple Mongos, Not Individual Nodes

MongoDB’s drivers all support connecting directly to a standalone node, a replica set, or a mongos for sharded clusters.   Sometimes your database starts off with one specific node that is always your primary.  It’s easy to set your connection string to only connect to that one node.   But what happens when that one node goes down?   If you don’t have a highly available connection string in your application configuration, then you’re missing out on a key advantage of MongoDB replica sets. Connect to the primary no matter which node it is.  All of MongoDB’s supported language drivers support the MongoDB URI connection string format and implement the failover logic.  Here are some examples of connection strings for PyMongo, MongoDB’s Python Driver, of a standalone connection string, a replica set, and an SRV record connection string.  If you have the privilege to set up SRV DNS records, it allows you to standardize your connection string to point to an address without needing to worry about the underlying infrastructure getting changed.

Standalone Connection String:

client = MongoClient('mongodb://hostabc.example.com:27017/?authSource=admin')

 

Replica Set Connection String:

client = MongoClient('mongodb://hostabc.example.com:27017,hostdef:27017,hostxyz.example.com/?replicaSet=foo&authSource=admin')

 

SRV Connection String:

client = MongoClient('mongodb+srv://host.example.com/')

Post-script for clusters: If you’re just starting you’re usually not setting up a sharded cluster. But if it is a cluster then instead of using a replicaset connection you will connect to a mongos node. To get automatic failover in the event of mongos node being restarted (or otherwise being down) start them on multiple hosts and put them, comma-concatenated, in your connection string’s host list. As with replicasets you can use SRV records for these too.

3) Sharding Can Help Performance But Isn’t Always a Silver Bullet

Sharding is how MongoDB handles the partitioning of data.  This practice is used to distribute load across more replicasets for a variety of reasons such as write performance, low-latency geographical writes, and archiving data to shards utilizing slower and cheaper storage.   These sharding approaches are helpful in keeping your working set in memory because it lowers the amount of data each shard has to deal with.

As previously mentioned, sharding can also be used to reduce latency by separating your shards by geographic region, a common example if having a US-based shard, an EU-based shard, and a shard in Asia where the data is kept local to its origin.  Although it is not the only application for shard zoning “Geo-sharding” like this is a common one. This approach can also help applications comply with various data regulations that are becoming more important and more strict throughout the world.

While sharding can oftentimes help write performance, that sometimes comes at the detriment of read performance.  An easy example of a poor read performance would be if we needed to run a query to find all of the orders regardless of their origin. This find query would need to be sent to the US shard, the EU shard, and the shard in Asia, with all the network latency that comes with reading from the non-local regions, and then it would need to sort all the returned records on the mongos query router before returning them to the client. This kind of give and take should help you determine what approach you take to choosing a shard key and weighing its impact on your typical query patterns.

4) Replication ? Backups

MongoDB Replication, while powerful and easy to set up, is not a substitution for a good backup strategy.  Some might think that their replica set members in a DR data center will be sufficient to keep them up in a data loss scenario.   While a replica set member in a DR center will surely help you in a DR situation, it will not help you if you accidentally drop a database or a collection in production as that delete will quickly be replicated to your secondary in your DR data center.

Other common misconceptions are that delayed replica set members keep you safe.   Delayed members still rely on you finding the issue you want to restore from before it gets applied to your delayed member.  Are your processes that rock-solid that you can guarantee that you’ll find the issue before it reaches your delayed member?

Backups are just as important with MongoDB as they were with any other database.  There are tools like mongodump, mongoexport, Percona Backup for MongoDB, and Ops Manager (Enterprise Edition only) that support Point In Time Recovery, Oplog backups, Hot Backups, full and Incremental Backups.  As mentioned, Backups can be run from any node in your replica set.  The best practice is to run your backup from a secondary node so you don’t put unnecessary pressure on your primary node.   In addition to the above methods, you can also take snapshots of your data, this is possible as long as you pause writes to the node that you’re snapshotting by freezing the file system to ensure a consistent snapshot of your MongoDB database.

5) Schemaless is a Myth, Schemas Still Matter

MongoDB was originally touted as a schemaless database, this was attractive to developers who had long struggled to update and maintain their schemas in relational databases.   But these schemas succeeded for good reasons in the early days of databases and while MongoDB allowed you the flexibility to not set up your schema and create it on the fly, this often led to some poor-performing schema designs and anti-patterns.   There are lots of stories out in the wild of users not enforcing any structured schema on their MongoDB data models and running into various performance problems as their schema began to become unwieldy.  Today, MongoDB supports JSON schema and schema validation.  These approaches allow you to apply as much or as little structure to your schemas as is needed, so you still have the flexibility of MongoDB’s looser schema structure while still enforcing schema rules that will keep your application performing well and your data model consistent.

Another aspect that is affected by poor schema design in MongoDB is its aggregation framework.   The aggregation framework lets you do more analytical query patterns such as sorting, grouping, and some useful things such as unwinding of arrays and supporting joins and a whole lot more.  Without a good schema, these sorts of queries can really suffer poor performance.

MongoDB was also popular due to its lack of support for joins. Joins can be expensive and avoiding them allowed MongoDB to run quite fast.  Though MongoDB has since added $lookup to support left outer joins, embedded documents are a typical workaround to this approach.   This approach comes with its pros and cons.  As with relational databases, embedding documents is essentially creating a One-to-N relationship, this is covered in greater detail in this blog.  In MongoDB, the value of N matters, if it’s One-to-few (2-10), one-to-many,(10-1000) this can still be a good schema design as long as your indexes support your queries.   When you get to one-to-tons(10000+) this is where you need to consider things like MongoDB’s 16 MB limit per document or using references to the parent document.

Examples of each of these approaches:

One-to-Few, consider having multiple phone numbers for a user:

{  "_id" : ObjectId("1234567890"),
  "name" :  "John Doe",
  "phone" : [     
     { "type" : "mobile", "number" : "+1-585-555-5555" }, 
     { "type" : "work", "number" : "+1-585-555-1111"}  
            ]
}

One-to-Many, consider a parts list for a product with multiple items:

{ "_id" : ObjectId("123"),
 “Item” : “Widget”,
 “Price” : 100 
}

{  "_id" : ObjectId("0123456789"), 
   "manufacturer" : "Percona",
   "catalog_number" : 123456,
   "parts" : [    
      { “item”: ObjectID("123")},  
      { “item”: ObjectID("456")},
      { “item”: ObjectID("789")},
       ...  
              ] 
}

One-to-Tons, consider a social network type application:

{  "_id" : ObjectId("123"),
   "username" : "Jane Doe" 
}

{  "_id" : ObjectId("456"),
   "username" : "Eve DBA"
 }

{  "_id" : ObjectId("9876543210"),
   "username" : "Percona",
   "followers" : [     
                    ObjectID("123"),
                    ObjectID("456"),
                    ObjectID("789"),
                    ...  
                 ]
}

 

Bonus Topic: Transactions

MongoDB supports multi-document transactions since MongoDB 4.0 (replica sets) and MongoDB 4.2 (sharded clusters).  Transactions in MongoDB work quite similarly to how they work in relational databases.   That is to say that either all actions in the transaction succeed or they all fail.  Here’s an example of a transaction in MongoDB:

rs1:PRIMARY> session.startTransaction() 
rs1:PRIMARY> session.getDatabase("percona").test.insert({today : new Date()})
WriteResult({ "nInserted" : 1 })
rs1:PRIMARY> session.getDatabase("percona").test.insert({some_value : "abc"})
WriteResult({ "nInserted" : 1 }) 
rs1:PRIMARY> session.commitTransaction()

Transactions can be quite powerful if they are truly needed for your application, but do realize the performance implications as all queries in a transaction will wait to finish until the whole transaction succeeds or fails.

Takeaways:

While MongoDB is easy to get started with and has a lower barrier to entry, just like any other database there are some key things that you, as a developer, should consider before deploying MongoDB.   We’ve covered enabling authentication and authorization to ensure you have a secure application and don’t leak data.   We’ve highlighted using Highly available connection strings, whether to your replica set, a mongos node list, or utilizing SRV, to ensure you’re always connecting to the appropriate nodes.  The balancing act of ensuring that when you select your shard key you consider the impact to both reads and writes and understand the tradeoffs that you are making.   The importance of backups and to not rely on replication as a backup method was also covered.  Finally, we covered the fact that schemas still matter with MongoDB, but you still have flexibility in defining how rigid it is. We hope this helps you have a better idea about things to consider when deploying MongoDB for your next application and to be able to understand it better.  Thanks for reading!

Oct
07
2020
--

Webinar October 27: Disaster Recovery and High Availability – The Concepts, The Mistakes, and How To Properly Plan For Failure

Percona Disaster Recovery and High Availability

Percona Disaster Recovery and High AvailabilityAny good system must be built to expect the unexpected. None are perfect and at some point, something WILL happen to render the system non-operational causing failure.

Join Dimitri Vanoverbeke, Senior Percona Engineer, as he discusses the concepts of High Availability, Disaster Recovery, common missteps that happen along the way, and how to ultimately prepare for failure.

Please join Dimitri Vanoverbeke on Tuesday, October 27th, at 1 pm EDT for his webinar “Disaster Recovery and High Availability – The Concepts, The Mistakes, and How To Properly Plan For Failure“.

Register for Webinar

If you can’t attend, sign up anyway and we’ll send you the slides and recording afterward.

Oct
07
2020
--

Our Approach to Percona Monitoring and Management Upgrade Testing

Percona Monitoring and Management Upgrade

Percona Monitoring and Management UpgradeHey Community! This is my first blog post so let me introduce myself. My name is Vasyl Yurkovych. I am a QA Automation Engineer at Percona and I have solid experience in software testing. 

Software quality is one of the main focuses of Percona so we put a lot of energy into it (hopefully so you don’t have to!). Our biggest challenge in testing isn’t necessarily running the routines against the current version, but the fact that users can decide to upgrade at any time, from pretty much any supported version.  So I’d like to share with you our upgrade testing approach for Percona Monitoring and Management (PMM) in hopes that this might be useful to others who create software that users install!

PMM is released every month and our users do not reinstall each new version – they just perform the upgrade operation when there’s a compelling feature or fix. Also, these PMM instances are not “empty”; they are full of various settings, monitored DBs, etc.,  all of which need to be preserved.

Taking into account all of that, we want to make sure that: 

  • the user will not suffer after the upgrade and will enjoy the new version of PMM
  • user’s instances will still be under monitoring without missing vital data during the upgrade
  • settings will be preserved
  • new features will work as expected

We decided to automate this process and select critical automation scenarios to execute. They were split into 3 stages: Pre Upgrade, Upgrade, and Post Upgrade.

Pre Upgrade (UI and API Scenarios)

  1. “Fill” PMM with monitored instances. We add each supported DB type for monitoring. Along with this, we ensure that the monitoring status of each added instance is ”RUNNING”. For these scenarios, we use corresponding PMM API endpoints.
  2. Apply custom Settings.
  3. Check that the Upgrade widget at the Home Dashboard is present and contains an Update button, the available version of PMM is correct and What’s new link is there.

Upgrade

  1. To fully simulate a user’s behavior we use the UI upgrade option.

Post Upgrade (UI Scenarios Only)

  1. After the upgrade itself was successful we check that the Upgrade widget at the Home Dashboard indicates that the PMM has the latest version.
  2. Also, we check the PMM Inventory to confirm that all the previously monitored DB instances exist and have “RUNNING” statuses.
  3. We check that Query Analytics Dashboard has corresponding DB filters (filters only exist when the DB specific queries exist).
  4. We confirm that all our custom Settings were properly upgraded and remain intact.

Also, we test upgrades from the older PMM versions, so we created a CI job that runs on the weekends and during the release testing phase. This gives us the ability to check the upgrade from any available version to the latest one. All we have to do is to specify the version we want to upgrade from. Check out this screenshot as an example of a failed upgrade.

Currently, we use this approach for testing Docker-based PMM Server upgrade, because PMM docker images are most commonly used. But we plan to implement the same upgrade job for OVF and AMI based PMM-Server soon. 

This is the final piece of our upgrade testing approach which alerts us immediately if some version has a problem upgrading to the latest one and allows us to react at the same time.

While this may seem an obvious tactic to ensure software quality, it’s amazing how often we’ll discover something that only impacts a single version.  

You are now safe to upgrade your Percona Monitoring and Management version, as PMM CI is watching your back!

Oct
02
2020
--

How to Use CHECK Constraint in MySQL 8

CHECK Constraint MySQL 8

CHECK Constraint MySQL 8Hello everyone, in this little post we will review a new feature in MySQL 8.

What is “CHECK Constraint”?

This is a new feature to specify a condition to check the value before INSERT or UPDATE into a row. The constraint could return an error if the result of a search condition is FALSE for any row of the table (but not if the result is UNKNOWN or TRUE).

This feature starts working on MySQL 8.0.16, and in previous versions, we could create it, but it doesn’t work, meaning the syntax is supported but it is not working,

There are some rules to keep in mind…

– AUTO_INCREMENT columns are not permitted
– Refer to another column in another table is not permitted
– Stored functions and user-defined functions are not permitted (you can not call a function or any user-defined functions)
– Stored procedure and function parameters are not permitted (you cannot call a procedure and function parameters)
– Subqueries are not permitted
– Columns used in foreign key for the next actions (ON UPDATE, ON DELETE) are not permitted
– This CHECK is evaluated for the next statements INSERT, UPDATE, REPLACE, LOAD DATA, and LOAD XML. Also, CHECK constraint is evaluated for INSERT IGNORE, UPDATE IGNORE, LOAD DATA … IGNORE, and LOAD XML … IGNORE. For those statements, a warning occurs if a constraint evaluates to FALSE. The insert or update is skipped.

Let’s See Some Examples

I created the next table to test this functionality. This is super easy as you can see in examples:

CREATE TABLE users (
id int not null auto_increment,
firstname varchar(50) not null,
lastname varchar(50) not null,
age TINYINT unsigned not null CONSTRAINT `check_1` CHECK (age > 15),
gender ENUM('M', 'F') not null,
primary key (id)
) engine = innodb;

In this simple test, we can write or update rows only if the “age” column value is more than 15.

Let’s see an example trying to INSERT rows with the “age” column less than 15:

mysql> INSERT INTO users SET firstname = 'Name1', lastname = 'LastName1', age = 10, gender = 'M';
ERROR 3819 (HY000): Check constraint 'check_1' is violated.

To DROP, use the next example:

ALTER TABLE users DROP CHECK check_1;

Let’s see another example adding more logic into it. I altered the table with the next CHECKs:

ALTER TABLE users
ADD CONSTRAINT gender_male
CHECK (
    CASE
        WHEN gender = 'M'
        THEN
            CASE
                WHEN age >= 21
                THEN 1
                ELSE 0
            END
        ELSE 1
    END = 1
);

ALTER TABLE users
ADD CONSTRAINT gender_female
CHECK (
    CASE
        WHEN gender = 'F'
            THEN
                CASE
                    WHEN age >= 18
                    THEN 1
                    ELSE 0
                END
        ELSE 1
    END = 1
);

We added more logic, and now it depends on the “gender” and “age” columns. A CHECK constraint is satisfied if, and only if, the specified condition evaluates to TRUE or UNKNOWN(for NULL column value) for the row of the table. The constraint is violated otherwise.

Let see an example from the previous logic.

mysql> INSERT INTO users SET firstname = 'Name2', lastname = 'LastName2', age = 10, gender = 'F';
ERROR 3819 (HY000): Check constraint 'gender_female' is violated.

mysql> INSERT INTO users SET firstname = 'Name3', lastname = 'LastName3', age = 10, gender = 'M';
ERROR 3819 (HY000): Check constraint 'gender_male' is violated.

As you can see in the ERROR message, MySQL is showing the CHECK constraint name. This is good to use from the application source code to debug the error and to know from which CHECK is failing.

Finally, this is the table structure:

CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(50) NOT NULL,
`lastname` varchar(50) NOT NULL,
`age` tinyint(3) unsigned NOT NULL,
`gender` enum('M','F') NOT NULL,
PRIMARY KEY (`id`),
CONSTRAINT `gender_female` CHECK (((case when (`gender` = 'F') then (case when (`age` > 18) then 1 else 0 end) else 1 end) = 1)),
CONSTRAINT `gender_male` CHECK (((case when (`gender` = 'M') then (case when (`age` > 21) then 1 else 0 end) else 1 end) = 1))
) ENGINE=InnoDB AUTO_INCREMENT=4;

We can add more logic in the table using this feature, but from my previous experience as a programmer, I don’t recommend adding logic in the tables because it is difficult to find or debug errors unless you do not have access to the application code.

Oct
02
2020
--

Rate Limit (Throttle) for MySQL with ProxySQL

Rate Limit (Throttle) for MySQL with ProxySQL

Rate Limit (Throttle) for MySQL with ProxySQLMaybe one of the more “obscure” operations when dealing with replica lag, or, in general, when one needs to control writes to the database, is the Rate limit. It’s also lately one of the most popular conversations around the community.

But what is it? In plain words: holding up queries for a while, giving air to the replicas to breath and catch up. Something similar to the Galera’s Flow Control mechanism, although flow control, when it kicks in, stops all the writes while the nodes catch up. With a throttle no write is stopped, just delayed.

There are several ways to do this. A popular tool is Freno but this is also something that can be achieved with ProxySQL. Let’s see how.

Delay

ProxySQL has a variable called “mysql-default_query_delay”  that is pretty self-explanatory. It will add a delay, in milliseconds, to all the queries passing through ProxySQL.

Now, the trick is to monitor the replica lag and apply the delay. Something that in this case is solved with a small bash script:

#/bin/bash

set -o pipefail

proxysqlhost=127.0.0.1
proxysqladmport=6032
proxysqluser=admin
proxysqlpass=admin

replicahost=mysql2
replicauser=pmm
replicapass=pmm

lagThreshold=3


function check_lag() {
         lag=$(mysql -h$replicahost -u$replicauser -p$replicapass -e "SHOW SLAVE STATUS\G" | grep  "Seconds_Behind_Master" | awk -F\: '{print $2}' 2>&1)
}

function modifyProxysqlVariable () {
        out=$(mysql -u$proxysqluser -p$proxysqlpass -h$proxysqlhost -P6032 -e"set mysql-default_query_delay = $delay ; LOAD MYSQL VARIABLES TO RUNTIME;" 2>&1)
        echo $out
}

function destructor () {
        delay=0
        modifyProxysqlVariable
        echo "bye"
}

trap destructor EXIT INT TERM

while true; do

        check_lag
        echo $lag
        if [ $lag -ge $lagThreshold ]; then
                delay=1
                modifyProxysqlVariable
        elif [ $lag -eq 0 ]; then
                delay=0
                modifyProxysqlVariable
        fi
        sleep 0.001
done

Replica lag will be monitored on a loop, and when it’s bigger than the threshold, 1mS of delay will be added overall. And after lag is under control, the delay is removed.

For testing purposes, I have set up a basic Primary/Replica environment, running with sysbench on high concurrency setup to make the replica lag on purpose. I collected some graphs with Percona Monitoring and Management,  and here’s how it looks:

Rate Limit MySQL

The above graph shows the lag constantly growing until the script starts to run and the replica eventually catches up. Fantastic! Until you see the Questions graphs and it looks very….flat

throttle MySQL

So what happened is that until the lag reached the threshold (in this case, 3 seconds) all the traffic was delayed. Once the lag is under control, traffic is back but not at 100% since the throttle script continues to run. So it maintains the replica up to date at a cost of rate-limiting the writes.

And the above graph shows the same (questions) but from the ProxySQL dashboard perspective.

Now, what’s the alternative?

Delay on Rules

ProxySQL query rules can be set with delays per rule. Since the idea is to add less overhead, a good approach is to check which query rule is the least used and see if by adding a delay to it we can control the lag.

Admin> SELECT hits, mysql_query_rules.rule_id, match_digest, match_pattern, replace_pattern, cache_ttl, apply, delay FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id;
+-----------+---------+--------------+-----------------------+-----------------+-----------+-------+-------+
| hits      | rule_id | match_digest | match_pattern         | replace_pattern | cache_ttl | apply | delay |
+-----------+---------+--------------+-----------------------+-----------------+-----------+-------+-------+
| 0         | 1       | NULL         | ^SELECT .* FOR UPDATE | NULL            | NULL      | 0     | NULL  |
| 609897074 | 2       | NULL         | ^SELECT .*            | NULL            | NULL      | 0     | NULL  |
| 87128148  | 3       | NULL         | ^UPDATE .*            | NULL            | NULL      | 0     | 0     |
| 43561905  | 4       | NULL         | ^INSERT .*            | NULL            | NULL      | 0     | 0     |
| 43562935  | 5       | NULL         | ^DELETE .*            | NULL            | NULL      | 0     | 0     |
| 0         | 6       | NULL         | ^REPLACE .*           | NULL            | NULL      | 0     | NULL  |
+-----------+---------+--------------+-----------------------+-----------------+-----------+-------+-------+
6 rows in set (0.01 sec)

From that query, we can see that the rule with id 5, the one that routes the DELETEs, is the least used from all the rules that hit the Primary.

We just need to make a small adjustment to the script to modify the mysql_query_rules table, by adding a proper function. The full script with the modification looks like this:

#/bin/bash

set -o pipefail

proxysqlhost=127.0.0.1
proxysqladmport=6032
proxysqluser=admin
proxysqlpass=admin

replicahost=mysql2
replicauser=pmm
replicapass=pmm

lagThreshold=3

ruleId=5

function destructor () {
        delay=0
        modifyRuleDelay
        echo "bye"
}

function check_lag() {
         lag=$(mysql -h$replicahost -u$replicauser -p$replicapass -e "SHOW SLAVE STATUS\G" | grep  "Seconds_Behind_Master" | awk -F\: '{print $2}' 2>&1)
}

function modifyRuleDelay () {
        out=$(mysql -u$proxysqluser -p$proxysqlpass -h$proxysqlhost -P6032 -e"update mysql_query_rules set delay = $delay where rule_id = $ruleId ; LOAD MYSQL QUERY RULES TO RUNTIME;" 2>&1)
}

trap destructor EXIT INT TERM

while true; do

        check_lag
        echo $lag
        if [ $lag -ge $lagThreshold ]; then
                delay=1
                modifyRuleDelay
        elif [ $lag -eq 0 ]; then
                delay=0
                modifyRuleDelay
        fi
        sleep 0.001
done

Now, running the script we can see that the same behavior happens. Lag is controlled:

rate limit for Proxsql

What about the Questions? Well, now the drop is not complete, and some traffic can still be routed and when the replica catches up, the behavior is the same as a small reduction, being that the cost of keeping the lag controlled.

ProxySQL reports the same.

Now, we have talked about controlling lag but pretty much any status variable can be monitored. Perhaps your issues are with contention (Threads_running) or with IO-bound issues (InnoDB checkpoint age). Regardless of the variable, rate limit can be achievable with ProxySQL.

Sep
29
2020
--

Using Security Threat Tool and Alertmanager in Percona Monitoring and Management

security threat tool percona monitoring and management

security threat tool percona monitoring and managementWith version 2.9.1 of Percona Monitoring and Management (PMM) we delivered some new improvements to its Security Threat Tool (STT).

Aside from an updated user interface, you now have the ability to run STT checks manually at any time, instead of waiting for the normal 24 hours check cycle. This can be useful if, for example, you want to see an alert gone after you fixed it. Moreover, you can now also temporarily mute (for 24 hours) some alerts you may want to work on later.

But how do these actions work?

Alertmanager

In a previous article, we briefly explained how the STT back end publishes alerts to Alertmanager so they appear in the STT section of PMM.

Now, before we uncover the details of that, please bear in mind that PMM’s built-in Alertmanager is still under development. We do not recommend you use it directly for your own needs, at least not for now.

With that out of the way, let’s see the details of the interaction with Alertmanager.

To retrieve the current alerts, the interface calls an Alertmanager’s API, filtering for non-silenced alerts:

GET /alertmanager/api/v2/alerts?silenced=false[...]

This call returns a list of active alerts, which looks like this:

[
  {
    "annotations": {
      "description": "MongoDB admin password does not meet the complexity requirement",
      "summary": "MongoDB password is weak"
    },
    "endsAt": "2020-09-30T14:39:03.575Z",
    "startsAt": "2020-04-20T12:08:48.946Z",
    "labels": {
      "service_name": "mongodb-inst-rpl-1",
      "severity": "warning",
      ...
    },
    ...
  },
  ...
]

Active alerts have a

startsAt

timestamp at the current time or in the past, while the

endsAt

 timestamp is in the future. The other properties contain descriptions and the severity of the issue the alert is about.

labels

, in particular, uniquely identify a specific alert and are used by Alertmanager to deduplicate alerts. (There are also other “meta” properties, but they are out of the scope of this article.)

Force Check

Clicking on “Run DB checks” will trigger an API call to the PMM server, which will execute the checks workflow on the PMM back end (you can read more about it here). At the end of that workflow, alerts are sent to Alertmanager through a POST call to the same endpoint used to retrieve active alerts. The call payload has the same structure as shown above.

Note that while you could create alerts manually this way, that’s highly discouraged, since it could negatively impact STT alerts. If you want to define your own rules for Alertmanager, PMM can integrate with an external Alertmanager, independent of STT. You can read more in Percona Monitoring and Management, Meet Prometheus Alertmanager.

Silences

Alertmanager has the concept of Silences. To temporarily mute an alert, the front end generates a “silence” payload starting from the metadata of the alert the user wants to mute and calls the silence API on Alertmanager:

POST /alertmanager/api/v2/silences

An example of a silence payload:

{
  "matchers": [
    { "name": "service_name", "value": "mongodb-inst-rpl-1", "isRegex": false },
    { "name": "severity", "value": "warning", "isRegex": false },
    ...
  ],
  "startsAt": "2020-09-14T20:24:15Z",
  "endsAt": "2020-09-15T20:24:15Z",
  "createdBy": "someuser",
  "comment": "reason for this silence",
  "id": "a-silence-id"
}

As a confirmation of success, this API call will return a

silenceID

:

{ "silenceID": "1fcaae42-ec92-4272-ab6b-410d98534dfc" }

 

Conclusion

From this quick overview, you can hopefully understand how simple it is for us to deliver security checks. Alertmanager helps us a lot in simplifying the final stage of delivering security checks to you in a reliable way. It allows us to focus more on the checks we deliver and the way you can interact with them.

We’re constantly improving our Security Threat Tool, adding more checks and features to help you protect your organization’s valuable data. While we’ll try to make our checks as comprehensive as possible, we know that you might have very specific needs. That’s why for the future we plan to make STT even more flexible, adding scheduling of checks (since some need to run more/less frequently than others), disabling of checks, and even the ability to let you add your own checks! Keep following the latest releases as we continue to iterate on STT.

For now, let us know in the comments: what other checks or features would you like to see in STT? We love to hear your feedback!

Check out our Percona Monitoring and Management Demo site or download Percona Monitoring and Management today and give it a try!

Sep
21
2020
--

MongoDB 101: 5 Configuration Options That Impact Security (And How to Set Them)

MongoDB Security

MongoDB SecurityAs with any database platform, MongoDB security is of paramount importance to keeping your data safe.  MongoDB and other data platforms like Redis and Elasticsearch are often in the news for data breaches because of misconfigured settings in the database.  So how do you keep you and your company’s data from being compromised and from becoming another statistic?

We’ll show you five configuration options, as well as others that are required to go along with them, for your MongoDB deployment that will help keep your data secure while allowing use by users and applications with least-privileged access using modern authentication methods, keeping your data encrypted on disk and over the wire, and to see who is accessing your data as well. These configuration options are across the following areas in security: authentication, authorization, encryption, and auditing.

MongoDB Security Overview

MongoDB security is composed of four main areas of focus, authentication (who), authorization(what), encryption (how), and auditing (when).  This section is intended to give you a high-level overview of the different security focus areas for MongoDB.

Authentication is how you identify yourself to MongoDB.  There are many ways to authenticate oneself to a MongoDB database, from standard username and password using the SCRAM (Salted Challenge Response Authentication Mechanism) protocol, certificate-based authentication to tying into an identity management solution such as LDAP (Lightweight Directory Access Protocol), Active Directory and Kerberos.

Authorization is how MongoDB determines what you, as an authenticated user, can do.  MongoDB supports authorization using the RBAC (Role-Based Access Control) method.  MongoDB lets you create roles which are groupings of privileges that any user granted that role can do.  MongoDB comes with a comprehensive set of built-in roles as well as giving you the flexibility to create your own custom roles.  Common roles like read-only and write are there of course, but also ones useful for monitoring any node, backup and restore, and user administration. Additionally, MongoDB also supports LDAP authorization which allows you to sync LDAP groups with roles to simplify management.

Encryption is how your data can be encrypted while in flight (Transport) and while on disk (At Rest).   Transport encryption keeps your data encrypted while it is sent to and from your application to MongoDB.  MongoDB supports TLS/SSL encryption for data in-flight using x.509 Certificates, and here’s an example of setting up Transport EncryptionEncryption at Rest keeps your data safe from an external party who might get a copy of your data files as they’ll be completely unreadable in their encrypted form.

Auditing shows you when users connected, when privileges were changed, various admin events, users attempt something they shouldn’t, etc. based on filter criteria you can set.  This is helpful in compliance situations where you have to be able to show what who was on the database at what time, what privileges they had, when privileges were changed, etc.

Tip: Don’t confuse auditing as a way to track users’ activities in real-time, but rather think of it as a way to create a tamper-proof, append-only log file that you can go back to that shows what was happening and by whom during a specific time.

Tip:  Auditing is an expensive operation and will impact performance, be sure that you’re getting value from it and your IT Compliance team is able to actively use it, before setting it up.

Configuration Options

So while knowing the important areas of MongoDB Security is great, how do we ensure they are enabled or set up correctly?  And which ones are the most important?  We’ll now go through 5 configuration options that will help you secure your MongoDB environment!  We’ll also list some required configuration options that will work in conjunction with our 5 most important configuration options to keep your data safe.  We’ll break these configuration options into their security focus areas.

MongoDB uses a configuration file in the YAML file format.   The configuration file is usually found in the following locations, depending on your Operating System:

DEFAULT CONFIGURATION FILE

  • On Linux, a default /etc/mongod.conf configuration file is included when using a package manager to install MongoDB.
  • On Windows, a default <install directory>/bin/mongod.cfg configuration file is included during the installation.
  • On macOS, a default /usr/local/etc/mongod.conf configuration file is included when installing from MongoDB’s official Homebrew tap.

Authentication:

Our first configuration option, security.authorization, is perhaps the most important for enabling security on your MongoDB Deployment.  This configuration option not only enforces MongoDB using authentication so that a user must at least login using credentials but it also simultaneously engages role-based access control which limits what a user can do.

security:
  authorization: enabled

 

Tip:  If you set this configuration option up before creating a user in MongoDB, you could use the localhost exception in order to create your first user.  This is especially helpful in cases of automation or other situations where you want to have all your configuration options configured only once and then come in and add users.

There are several other authentication configuration options that are required for your MongoDB deployment:

  • security.clusterAuthMode – The authentication mode used between replica set or sharded cluster nodes to authenticate.  This will typically be either keyFile or x509.  Acceptable values are:
    • keyFile – default, only accepts keyfiles
    • x509 – uses only x509 certificates for cluster authentication
    • sendKeyFile – only used when transitioning from keyFile to x509 certificate authentication.   Accepts keyFiles and x509 certificates
    • sendX509 – only used when transitioning from x509 certificate authentication to keyFile authentication.   Accepts x509 certificates and keyFiles
  • security.keyFile – sets the destination of the keyFile if using keyFile based authentication.  Note that the user MongoDB is running as must have read-only or read/write level permissions on the keyfile, with no permissions granted to other users.

Authorization:

The security.authorization configuration option that enabled authentication is also the most important configuration option for setting up authorization since it also gives us roles that allow us to authorize users to have specific permissions.   While there are some authorization configuration options for the inbuilt authorization system, most of the options are useful for integrating LDAP or other external authentication mechanisms with your roles.  Read more about setting up LDAP Authorization, as well as a great blog post discussing how to set it up.

Encryption:

Transport Encryption ensures that your data is encrypted between your application and the database, it also can be used to encrypt data between members of your replica set and sharded cluster.  The most important configuration option here is net.tls.mode. This configuration option is new in MongoDB 4.2, previous to MongoDB 4.2, this configuration option is named net.ssl.mode.   This configuration option decides how strictly you want to enforce TLS encryption.   The options for this configuration option are:

  • Disabled – signifies that there is no encryption whatsoever.
  • requireTLS – signifies that all traffic, regardless of origin, is encrypted.  requireTLS is the most secure setting for this configuration option.
  • allowTLS – signifies that there is no encryption going on between members of the replica set or sharded cluster, but the DB server will accept both encrypted and non-encrypted traffic from the application hosts.  Only used for transitioning between disabled to requireTLS in a rolling restart fashion.
  • preferTLS – signifies that there is encryption going on between members of the replica set or sharded cluster and that the DB server will accept both encrypted and non-encrypted traffic from the application hosts. Only used for transitioning between disabled to requireTLS in a rolling restart fashion.
net:
  tls:
    mode: requireTL

 

Additional required configuration options for transport encryption are:

  • net.tls.certificateKeyFile – location of the .pem file with the certificate and it’s key to be used for application connections.  Note that the user MongoDB is running as must have read permissions on this file.
  • net.tls.clusterFile – location of the .pem file used for transport encryption between replica set or cluster members.   Note that the user MongoDB is running as must have read permissions on this file.
  • net.tls.CAFile – location of the .pem file with the root certificate chain from the Certificate Authority.  Note that the user MongoDB is running as must have read permissions on this file.

Data at Rest Encryption ensures that your data can’t be read by someone who steals your database’s data files unless they also steal the key.  This prevents someone from reading your MongoDB data files at the file system level.  Here the most important configuration option is security.enableEncryption.

security:
  enableEncryption: true

Additional required configuration options for Data At Rest Encryption are:

  • security.encryptionCipherMode – form of encryption to use, options are AES256-CBC and AES256-GCM

Percona Server for MongoDB Specific Configuration Options:

Percona Server for MongoDB has integration with HashiCorp Vault for secret management for your Data at Rest Encryption. Read the documentation for Vault and Using Vault to Store the Master Key for Data at Rest Encryption on Percona Server for MongoDB.

Important configuration options for the Vault Integration are:

  • security.vault.serverName – server name that your vault server is on
  • Security.vault.port – port for vault connectivity
  • security.vault.tokenFile – location of file with vault token
  • Security.vault.secret – location for secrets, since these are set up per node, this should have a distinguishing characteristic such as node name in it
  • security.vault.serverCAFile – location of CAFile (Certificate Authority) on your local mongodb node
  • security.vault.rotateMasterKey – only used to rotate the master key

MongoDB Enterprise Specific Data At Rest Encryption Configuration Options:

Currently, MongoDB Enterprise does not have Vault Integration for Encryption at rest except in MongoDB Atlas.  MongoDB Enterprise does support the KMIP protocol and you can integrate MongoDB with any Key Management tool that utilizes the KMIP protocol.  Documentation can be found here.

Important configuration options to support Key Management through the KMIP protocol are:

  • security.kmip.serverName – server name where your Key Management tool resides
  • security.kmip.port – port for your key management tool
  • security.kmip.serverCAfile – path on your MongoDB hosts of a CA file (Certificate Authority) for secure connection to your Key Management Tool
  • security.kmip.clientCertificateFile – path to the client certificate used for authentication to your Key Management tool
  • security.kmip.rotateMasterKey – only used to rotate the master key

Auditing:

Auditing allows IT Security Compliance teams to track and log activities that are run against the MongoDB database.  There are several important auditing configuration options for MongoDB,  auditLog.filter is the most important as it decides what exactly you are setting up in your auditing log.

auditLog:
  filter:  { <field1>: <expression1>, ... }

 

For example, if we only wanted to have an audit log entry created every time someone created or removed a collection, we would set the auditLog.Filter as such:

auditLog:
  filter: { atype: { $in: [ "createCollection", "dropCollection" ] } }

 

If we wanted to audit everyone with a specific role, we could set the auditFilter as such:

auditLog:
  filter:{ roles: { role: "readWrite", db: "appdb" } }

 

Additional required configuration options for auditing are:

  • auditLog.destination – whether the audit log will be written to a file, to the console, or to the syslog
  • auditLog.path – if outputting to a file, the destination directory, and file name of the audit log.   Note that the user MongoDB runs as must have read and write permissions to this directory.
  • auditLog.format – the format the audit log is output to, options are JSON and BSON, with JSON being the more commonly used format.

Finally, while auditing is important to track and log activity in your database, including accessing PII or other sensitive data, you don’t want to expose PII in your auditing or other log files.  To accomplish this you must set up log redaction on your MongoDB Replica Set or Sharded Cluster.   The important configuration option for log redaction is security.redactClientLogData.   Acceptable values for this configuration option are true and false.

security:
  redactClientLogData: true

Summary

In this blog post, we’ve gone over five important MongoDB configuration options to ensure you have a more secure MongoDB deployment as well as some other configuration options that help the five keep your data secure.  We hope that these configuration options will help you build more secure MongoDB deployments and avoid being a statistic of a data breach.   Thanks for reading!

Additional Resources:

MongoDB Top Five Security Concerns

Securing MongoDB Webinar


Our latest resource, Using Open Source Software to Ensure the Security of Your MongoDB Database, documents how to deploy a secure, enterprise-grade, MongoDB deployment without worrying about license fees, giving organizations the flexibility to deploy consistent models across their entire infrastructure.

Download “Using Open Source Software to Ensure the Security of Your MongoDB Database”

Sep
11
2020
--

How Much Memory Does the Process Really Take on Linux?

Memory Process takes on Linux

Memory Process takes on LinuxOne of the questions you often will be faced with operating a Linux-based system is managing memory budget. If a program uses more memory than available you may get swapping to happen, oftentimes with a terrible performance impact, or have Out of Memory (OOM) Killer activated, killing process altogether.

Before adjusting memory usage, either by configuration, optimization, or just managing the load, it helps to know how much memory a given program really uses.

If your system runs essentially a single user program (there is always a bunch of system processes) it is easy.  For example, if I run a dedicated MySQL server on a system with 128GB of RAM I can use “used” as a good proxy of what is used and “available” as what can still be used.

root@rocky:/mnt/data2/mysql# free -h
              total        used        free      shared  buff/cache   available
Mem:          125Gi        88Gi       5.2Gi       2.0Mi        32Gi        36Gi
Swap:          63Gi        33Mi        63Gi

There is just swap to keep in mind but if the system is not swapping heavily, even there is swap space used, it usually keeps “unneeded junk” which does not need to factor in the calculation.

If you’re using Percona Monitoring and Management you will see it in Memory Utilization:

Percona Monitoring and Management memory utilization

And in the Swap Activity graphs in the “Node Summary” dashboard:

swap activity

If you’re running multiple processes that share system resources, things get complicated because there is no one-to-map mapping between “used”  memory and process.

Let’s list just some of those complexities:

  • Copy-on-Write semantics for “fork” of the process – All processes would share the same “used” memory until process modifies data, only in this case it gets its own copy.
  • Shared memory –  As the name says, shared memory is a memory that is shared across different processes.
  • Shared Libraries –  Libraries are mapped into every process which uses it and are part of its memory usage, though they are shared among all processes which use the same library.
  • Memory Mapped Files and Anonymous mmap()  –  There are a lot of complicated details here. For an example, check out Memory Mapped Files for more details. This discussion on StackExchange also has some interesting bits.

With that complexity in mind let’s look at the “top” output, one the most common programs to look at current load on Linux. By default, “top” sorts processes by CPU usage so we’ll press “Shift-M” to sort it by (resident) memory usage instead.

The first thing you will notice is that this system, which has only 1GB of physical memory, has a number of processes which has virtual memory (VIRT) in excess of 1GB.

For various reasons, modern memory allocators and programming languages (i.e. GoLang) can allocate a lot of virtual memory which they do not really use so virtual memory usage has little value to understand how much real memory a process needs to operate.

Now there is resident memory (RES) which shows us how much physical memory the process really uses.  This is good… but there is a problem. Memory can be non-resident either because it was not really “used” and exists as virtual memory only, or because it was swapped out.

If we look into the stats kernel actually provides for the process we’ll see there is more data available:

root@PMM2Server:~# cat /proc/3767/status
Name:   prometheus
Umask:  0022
State:  S (sleeping)
Tgid:   3767
Ngid:   0
Pid:    3767
PPid:   3698
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 1000
NStgid: 3767    17
NSpid:  3767    17
NSpgid: 3767    17
NSsid:  3698    1
VmPeak:  3111416 kB
VmSize:  3111416 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    608596 kB
VmRSS:    291356 kB
RssAnon:          287336 kB
RssFile:            4020 kB
RssShmem:              0 kB
VmData:  1759440 kB
VmStk:       132 kB
VmExe:     26112 kB
VmLib:         8 kB
VmPTE:      3884 kB
VmSwap:   743116 kB
HugetlbPages:          0 kB
CoreDumping:    0
Threads:        11
SigQ:   0/3695
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe3bfa3a00
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        2
Speculation_Store_Bypass:       vulnerable
Cpus_allowed:   1
Cpus_allowed_list:      0
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        346
nonvoluntary_ctxt_switches:     545

VmSwap is a particularly interesting data point, as it shows the amount of memory used by this process which was swapped out.

VmRSS+VmSwap is a much better indication of the “physical” memory the process needs. In the case above, it will be 1010MB, which is a lot higher than 284MB of resident set size but also a lot less than the 3038MB “virtual memory” size for this process.

The problem with the swapped out part, though we do not know if it was swapped out “for good” being dead weight, is, for example, some code or data which is not used in your particular program use case or if it was swapped out due to memory pressure, and we really would need to have it in real memory (RAM) for optimal performance – but we do not have enough memory available.

The helpful data point to look at in this case is major page faults. It is not in the output above but is available in another file –  /proc/[pid]/stat. Here is some helpful information on Stack Overflow.

A high number of major page faults indicates the stuff program needs is not in physical memory. Swap activity will be included here but also references to currently unmapped code in the shared library or references to the data in memory-mapped files, which is currently not in RAM. In any case, a high rate of major page faults often indicates RAM pressure.

Let’s go to our friend “top”, though, and see if we can get more helpful information displayed in it. You can use the “F” keyboard shortcut to select fields you want to be displayed.

You can add SWAP,  Major Faults Delta, and USED columns to display all the items we spoke about!

Looking at this picture we can see a large portion of “prometheus” process is swapped out and it has 2K major page faults/sec happening, pointing out the fact it is likely suffering.

The “clickhouse-serv” process is another interesting example, as it has over 4G “resident size” but has relatively little memory used and a lot less major page faults.

Finally, let’s look at “percona-qan-api” process which has a very small portion swapped out but shows 2K major page faults as well.   I’m honestly not quite sure what it is, but it does not seem to be swap-IO related.

Summary

Want to see how much memory process is using?  Do not look at virtual memory size or resident memory size but look at “used” memory defined as resident memory size + swap usage. Want to see if there is actual memory pressure? Check out system-wide swap input/output statistics, as well as major page fault rate, for a process you’re investigating. 

Aug
31
2020
--

Percona Monthly Bug Report: August 2020

Percona Monthly Bug Report August 2020

Percona Monthly Bug Report August 2020At Percona, we operate on the premise that full-transparency makes a product better. We strive to build the best open-source database products, but also to help you manage any issues that arise in any of the databases that we support. And, in true open-source form, report back on any issues or bugs you might encounter along the way.

We constantly update our bug reports and monitor other boards to ensure we have the latest information, but we wanted to make it a little easier for you to keep track of the most critical ones. This monthly post is a central place to get information on the most noteworthy open and recently resolved bugs. 

In this edition of our monthly bug report, we have the following list of bugs:

Percona Server/MySQL Bugs

PS-7264(MySQL#83799):  [ERROR] InnoDB: dict_load_foreigns() returned 38 for ALTER TABLE

Affects Version/s: 5.6,5.7  [Tested/Reported version 5.6.49,5.7.30 ]

Critical since a user can easily hit this bug when performing ALTER on the table with FK for specified cases in the bug report. With debug build, it crashes. 

MySQL#91977: Dropping Large Table Causes Semaphore Waits; No Other Work Possible

Affects Version/s: 5.7,8.0  [Tested/Reported version 5.7.23, 8.0.12]

Running DROP/ALTER on the large table could lead to this bug. DROP/ALTER query will be stuck in ‘checking permissions’ state  and later it may crash mysqld due to long semaphore

Wait. It’s critical since it can result in unplanned downtime. The issue also evident with the pt-online-schema-change tool while performing ALTER operations on large tables.

PS-6990(MySQL#100118): Server doesn’t restart because of too many gaps in the mysql.gtid_executed table

Affects Version/s: 8.0  [Tested/Reported version 8.0.19, 8.0.20,8.0.21]

Fixed Version/s: 8.0.21 (Next release)

This issue introduces due to the Clone plugin replication coordinates worklog (WL#9211) changes for updating the mysql.gtid_executed table. With this bug, Replica not usable, as mysql.gtid_executed is getting huge, crashes, and never comes back because of the way how the GTID compression works.

PS-7203:  audit plugin memory leak on replicas when opening tables.

Affects Version/s: 5.7,8.0  [Tested/Reported version 5.7.26-29, 5.7.29-32]

Fixed Version/s: 5.7.31, 8.0.21 (Next release)

Audit plugin could leak memory resulting in OOM on the slave if STATEMENT base replication is used.

PS-7163(MySQL#99286):Concurrent update cause crash in row_search_mvcc

Affects Version/s: 5.7  [Tested/Reported version 5.7.29]

Concurrent updates on the same record (DELETE/INSERT and UPDATE from different sessions) could lead to this crash. This bug is critical since such operations are common for many database environments.

Percona Xtradb Cluster

PXC-3373: [ERROR] WSREP: Certification exception: Unsupported key prefix: ^B: 71 (Protocol error) at galera/src/key_set.cpp:throw_bad_prefix():152

Affects Version/s: 5.7   [Tested/Reported 5.7.30]

IST to the lower version node will fail with this error when the write load is running on the Donor node. Rolling upgrade is the most commonly used method resulting in a different version of pxc nodes for while and in such cases, the user can experience this issue easily.

Possible workaround,

  • Upgrade pxc node to have the same version across the cluster.
  • Stop write load on the donor node while IST is running. 

PXC-3352: Unexpected ERROR 1205 modifying a child table in an FK relationship

Affects Version/s: 5.7   [Tested/Reported 5.7.28]

Fixed Version/s: 5.7.31, 8.0.20 (Next release)

Affected users using foreign keys with PXC, When deleting/updating from a child table in a FK relationship if the parent table’s referenced rows are locked, the operation on a child table failed with lock wait timeout error when the parent table is unlocked.  

Percona Xtrabackup

PXB-2237: PXB crashes during a backup when an encrypted table is updated

Affects Version/s:  2.4  [Tested/Reported version 2.4.20]

Databases with encrypted tables are affected by this bug. As a workaround, Taking backup in non-peak hours could avoid this crash.

PXB-2178: Restoring datadir from partial backup results in an inconsistent data dictionary

 Affects Version/s:  8.0  [Tested/Reported version 8.0.11

As a result of this bug after restored, you will see additional database/s which were not part of a partial backup. Issue evident only in Xtrabackup 8.0 due to new data dictionary implementation in MySQL 8.0 version, this issue is not reproducible with xtrabackup 2.4. 

The workaround for this issue is to use “DROP DATABASE IF EXISTS” for dropping unwanted extra database/s. 

Percona Toolkit

PT-1747: pt-online-schema-change was bringing the database into a broken state when applying the “rebuild_constraints” foreign keys modification method if any of the child tables were blocked by the metadata lock.

Affects Version/s:  3.x   [Tested/Reported version 3.0.13]

Critical bug since it can cause data inconsistency in the user environment. It potentially affects who rebuilds tables with foreign keys.

PT-1853: pt-online-schema-change doesn’t handle self-referencing foreign keys properly

When using pt-osc to change a table that has a self FK pt-osc creates the FK pointing to the old table instead of pointing to the _new table. Because of it pt-osc needs to rebuild the FK after swapping the tables (DROP the FK and recreating it again pointing to the _new table). This can cause issues because the INPLACE algorithm is supported when foreign_key_checks is disabled. Otherwise, only the COPY algorithm is supported.

Affects Version/s:  3.x  [Tested/Reported version 3.2.0]

Fixed Version/s: 3.2.1

Affects who rebuild tables with a self-referencing foreign key.

PMM  [Percona Monitoring and Management]

PMM-4547: MongoDB dashboard replication lag count incorrect

Affects Version/s:  2.9  [Tested/Reported version 2.0,2.9.0]

Replica lag takes the lag value from the delayed node which makes it unreliable for all nodes. So PMM cannot be used to monitor delays.

PMM-6336:   Incompatible pmm-admin options: ‘–disable-queryexamples’ and ‘–query-source=perfschema’

In case the user doesn’t want to show a query example that has “Sensitive Data” on PMM Query Analytics UI  they can disable it with the help of –disable-queryexamples option in pmm-admin.

But this option is not working as expected when mysql instance added using –query-source perfschema option and on PMM Query Analytics UI  query examples is visible with original data.

Workaround: we do not see this issue when using –query-source slowlog 

Affects Version/s:  2.9  [Tested/Reported version 2.9]

Fixed version/s: 2.10.0

PMM-5823: pmm-server log download and api to get the version failed with timeout

Affects Version/s:  2.8  [Tested/Reported version 2.2]

Occurring at irregular intervals and only affected to pmm docker installation with no external internet access from pmm-server docker container. The issue only visible for a while(around 5-10 mins) after starting pmm-server later you will not see this issue.

Summary

We welcome community input and feedback on all our products. If you find a bug or would like to suggest an improvement or a feature, learn how in our post, How to Report Bugs, Improvements, New Feature Requests for Percona Products.

For the most up-to-date information, be sure to follow us on Twitter, LinkedIn, and Facebook.

Quick References:

Percona JIRA  

MySQL Bug Report

Report a Bug in a Percona Product

___

About Percona:

As the only provider of distributions for all three of the most popular open source databases—PostgreSQL, MySQL, and MongoDB—Percona provides expertise, software, support, and services no matter the technology.

Whether its enabling developers or DBAs to realize value faster with tools, advice, and guidance, or making sure applications can scale and handle peak loads, Percona is here to help.

Percona is committed to being open source and preventing vendor lock-in. Percona contributes all changes to the upstream community for possible inclusion in future product releases.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com