Jan 10, 2022
--

MySQL 8.0 Functional Indexes

Working with hundreds of different customers, I often face similar problems around running queries. One very common problem when trying to optimize a database environment is index usage. A query that cannot use an index is usually long-running, consumes more memory, and triggers more disk IOPS.

A very common case is when a query uses a filter condition against a column that is wrapped in some kind of functional expression: an index on that column cannot be used.

Starting from MySQL 8.0.13, functional indexes are supported. In this article, I’m going to show what they are and how they work.

The Well-Known Problem

As already mentioned, a very common problem with index usage arises when you have a filter condition against one or more columns involved in some kind of functional expression.

Let’s see a simple example.

You have a table called products containing the details of your products, including a create_time TIMESTAMP column. If you would like to calculate the average price of your products for a specific month, you can do the following:

mysql> SELECT AVG(price) FROM products WHERE MONTH(create_time)=10;
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

The query returns the right value, but take a look at the EXPLAIN:

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 99015
     filtered: 100.00
        Extra: Using where

The query triggers a full scan of the table. Let’s create an index on create_time and check again:

mysql> ALTER TABLE products ADD INDEX(create_time);
Query OK, 0 rows affected (0.71 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> explain SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 99015
     filtered: 100.00
        Extra: Using where

A full scan again. The index we created is not effective: any time an indexed column is wrapped in a function, the index cannot be used.

To optimize the query, the workaround is to rewrite it so that the indexed column is isolated from the function.

Let’s test the following query, which is equivalent for this data set since all rows fall in 2019:

mysql> SELECT AVG(price) FROM products WHERE create_time BETWEEN '2019-10-01' AND '2019-11-01';
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE create_time BETWEEN '2019-10-01' AND '2019-11-01'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: range
possible_keys: create_time
          key: create_time
      key_len: 5
          ref: NULL
         rows: 182
     filtered: 100.00
        Extra: Using index condition

Cool, now the index is used. Rewriting the query like this used to be the typical suggestion.

Quite a simple solution, but it is not always possible to change the application code, for many valid reasons. So, what to do then?

MySQL 8.0 Functional Indexes

Starting from version 8.0.13, MySQL supports functional indexes. Instead of indexing a simple column, you can create the index on the result of any function applied to a column or multiple columns.

Long story short, now you can do the following:

mysql> ALTER TABLE products ADD INDEX((MONTH(create_time)));
Query OK, 0 rows affected (0.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

Be aware of the double parentheses. The syntax is correct: the expression must be enclosed in parentheses to distinguish it from a column or a column prefix.

Indeed, the following returns an error:

mysql> ALTER TABLE products ADD INDEX(MONTH(create_time));
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'create_time))' at line 1

Let’s now check our original query and see what happens to the EXPLAIN:

mysql> SELECT AVG(price) FROM products WHERE MONTH(create_time)=10;
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ref
possible_keys: functional_index
          key: functional_index
      key_len: 5
          ref: const
         rows: 182
     filtered: 100.00
        Extra: NULL

The query is no longer a full scan and runs faster: the functional_index is used, with only 182 rows examined. Awesome.

Thanks to the functional index, we are no longer forced to rewrite the query.

Which Functional Indexes Are Permitted

We have seen an example involving a simple function applied to a column, but you are allowed to create more complex indexes.

A functional index may contain any kind of expression, not only a single function. The following patterns are valid functional indexes:

INDEX( ( col1 + col2 ) )
INDEX( ( FUNC(col1) + col2 - col3 ) )

You can use ASC or DESC as well:

INDEX( ( MONTH(col1) ) DESC )

You can have multiple functional parts, each one included in parentheses:

INDEX( ( col1 + col2 ), ( FUNC(col2) ) )

You can mix functional with nonfunctional parts:

INDEX( (FUNC(col1)), col2, (col2 + col3), col4 )

There are also limitations you should be aware of:

  • A functional key part cannot consist solely of a column name. The following is not permitted:
    INDEX( (col1), (col2) )
  • The primary key cannot include a functional key part
  • A foreign key cannot include a functional key part
  • SPATIAL and FULLTEXT indexes cannot include functional key parts
  • A functional key part cannot refer to a column prefix

Finally, remember that a functional index is useful only to optimize queries that use the exact same expression. An index created on nonfunctional parts, by contrast, can serve many different queries.

For example, the following conditions cannot rely on the functional index we have created:

WHERE YEAR(create_time) = 2019

WHERE create_time > '2019-10-01'

WHERE create_time BETWEEN '2019-10-01' AND '2019-11-01'

WHERE MONTH(create_time + INTERVAL 1 YEAR) = 10

All these will trigger a full scan.
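If a pattern like the first one also matters for your workload, nothing stops you from adding another functional index to cover it, for example:

mysql> ALTER TABLE products ADD INDEX((YEAR(create_time)));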

Functional Index Internals

Functional indexes are implemented as hidden virtual generated columns. For this reason, you can emulate the same behavior even on MySQL 5.7 by explicitly creating the virtual column. We can test this, starting by dropping the indexes we have created so far.

mysql> SHOW CREATE TABLE products\G
*************************** 1. row ***************************
       Table: products
Create Table: CREATE TABLE `products` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `description` longtext,
  `price` decimal(8,2) DEFAULT NULL,
  `create_time` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `create_time` (`create_time`),
  KEY `functional_index` ((month(`create_time`)))
) ENGINE=InnoDB AUTO_INCREMENT=149960 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

mysql> ALTER TABLE products DROP INDEX `create_time`, DROP INDEX `functional_index`;
Query OK, 0 rows affected (0.03 sec)

We can now create the virtual generated column:

mysql> ALTER TABLE products ADD COLUMN create_month TINYINT GENERATED ALWAYS AS (MONTH(create_time)) VIRTUAL;
Query OK, 0 rows affected (0.04 sec)

Create the index on the virtual column:

mysql> ALTER TABLE products ADD INDEX(create_month);
Query OK, 0 rows affected (0.55 sec)

mysql> SHOW CREATE TABLE products\G
*************************** 1. row ***************************
       Table: products
Create Table: CREATE TABLE `products` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `description` longtext,
  `price` decimal(8,2) DEFAULT NULL,
  `create_time` timestamp NULL DEFAULT NULL,
  `create_month` tinyint GENERATED ALWAYS AS (month(`create_time`)) VIRTUAL,
  PRIMARY KEY (`id`),
  KEY `create_month` (`create_month`)
) ENGINE=InnoDB AUTO_INCREMENT=149960 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

We can now try our original query. We expect to see the same behavior as the functional index.

mysql> SELECT AVG(price) FROM products WHERE MONTH(create_time)=10;
+------------+
| AVG(price) |
+------------+
| 202.982582 |
+------------+

mysql> EXPLAIN SELECT AVG(price) FROM products WHERE MONTH(create_time)=10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
   partitions: NULL
         type: ref
possible_keys: create_month
          key: create_month
      key_len: 2
          ref: const
         rows: 182
     filtered: 100.00
        Extra: NULL

Indeed, the behavior is the same. The index on the virtual column can be used and the query is optimized.

The good news is that you can use this workaround to emulate a functional index even on 5.7 and get the same benefits. The advantage of MySQL 8.0 is that it is completely transparent: there is no need to create the virtual column yourself.

Since the functional index is implemented as a hidden virtual column, no additional space is needed for the data; only the index space is added to the table.

By the way, this is the same technique used for creating indexes on the fields of JSON documents.
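As a short illustration (the table and JSON path are invented for the example), a functional index on a JSON field uses the same double-parentheses syntax, with a CAST to give the key part a string type and a deterministic collation:

mysql> CREATE TABLE employees (
    ->   data JSON,
    ->   INDEX idx_name ((CAST(data->>'$.name' AS CHAR(30)) COLLATE utf8mb4_bin))
    -> );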

Conclusion

Functional index support is an interesting improvement in MySQL 8.0. Some queries that previously required rewriting to be optimized no longer need it. Just remember that only queries using the exact same filter expression can rely on the functional index; for other search patterns you need to create additional regular or functional indexes.

The same feature can be implemented on MySQL 5.7 with the explicit creation of a virtual generated column and the index.

For more detailed information, read the following page:

https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts

Jan 05, 2022
--

In Application and Database Design, Small Things Can Have a Big Impact

With modern application design, systems are becoming more diverse and varied, with more components than ever before. Developers are often forced to become master chefs, adding ingredients from dozens of different technologies and blending them together to create something tasty and amazing. But with so many different ingredients, it is often difficult to understand how they interact with each other. The more diverse the application, the more likely it is that some seemingly insignificant combination of technology will cause cascading effects.

Many people I talk to have hundreds, if not thousands, of different libraries, APIs, components, and services making up the systems they support. In this type of environment, it is very difficult to know which small thing could add up to something much bigger. Look at some of the recent big cloud or application outages: they often have their root cause in something rather small.

Street Fight: Python 3.10 -vs- 3.9.7

Let me give you an example of something I ran across recently. I noticed that two different nodes running the same hardware and application code were performing drastically differently from one another. The app server was running close to 100% CPU on one server, while the other was around 40%. Each had the same workload, the same database, etc. It turned out that one server was using Python 3.10.0 and the other was running 3.9.7. Combined with MySQL Connector/Python, this led to almost a 50% reduction in database throughput (the 3.10.0 release saw a regression). However, this performance change was not seen in my PostgreSQL testing or when testing the mysqlclient connector. It happened only when running the pure Python version of the MySQL connector. See the results below:

Python 3.10 -vs- 3.9.7

Note: This workload used a warmed buffer pool, with the workload running for over 24 hours before the test. I cycled the application server when changing either the version of Python or the MySQL library. These tests are repeatable regardless of the length of the run. All data fits into memory here. I am not trying to make an authoritative statement on a specific technology, merely pointing out the complexity of layers of technology.

Looking at this purely from the user perspective, this particular benchmark simulates certain types of users. Let’s look at the number of users who could complete their actions per second. I will also add another pure Python MySQL driver to the mix.

Note: There appears to be a significant regression in 3.10. The application server was significantly busier when using Python 3.10 with one of the pure Python drivers than when running the same test on 3.9 or earlier.

The main difference between MySQL Connector and mysqlclient is that mysqlclient uses the C libmysqlclient library. Oddly, the official MySQL Connector says it should switch between the pure Python and C versions if available, but I was not seeing that behavior (so I have to look into it). The result: page load time for the app on Python 3.10.0 was 0.05 seconds, up from 0.03 seconds on Python 3.9.7. The key thing I wanted to highlight, however, is that sometimes seemingly small or insignificant changes can lead to a drastic difference in performance and stability. You could be running along fine for months or even years before an upgrade of something you would not expect drastically impacts performance.
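For what it’s worth, here is a minimal sketch (connection parameters are placeholders) of pinning the Connector/Python implementation explicitly with the use_pure connection argument, rather than relying on its automatic selection:

import mysql.connector

# use_pure=False asks for the C-extension build of Connector/Python;
# use_pure=True forces the pure Python implementation.
# host/user/password/database below are placeholders for the example.
cnx = mysql.connector.connect(
    host="127.0.0.1",
    user="app",
    password="secret",
    database="test",
    use_pure=False,
)
print(cnx.get_server_info())  # confirm the connection works
cnx.close()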

Can This Be Fixed With Better Testing?

You may think this is a poster child for testing before upgrading components, and while testing is a requirement, it won’t necessarily prevent this type of issue. Technology combinations, upgrades, and releases can often have odd side effects, and the issues they introduce often remain hidden and don’t manifest until some outside influence pops up. Note: these generally happen at the worst time possible, like during a significant marketing campaign or event. The most common way I see nasty bugs get exposed is not through a release or in testing, but with a change in workload. Workload changes can hide or expose bottlenecks at the worst times. Let’s take a look at the above combination of Python version and different connectors with a different workload:

Here the combination of reporting and read/write workload pushes all the app nodes and database nodes to the redline. They look fairly similar in terms of performance, but the workload is hiding the issues I mentioned above. A system pushed to the red will behave differently than it will in the real world. If you tested upgrading Python on your app servers to 3.10.0 by pushing your systems to the max, you might see the above small regression as within acceptable limits. In reality, however, the upgrade could mean a 50% decrease in throughput when moved to production.

Do We Care?

Depending on how your application is built, many people won’t notice the above-mentioned performance regression after the upgrade happens. First, most people do not run their servers even close to 100% load, so adding more load to the boxes may not immediately impact their users’ performance. Adding 0.02 seconds to a page load time may be imperceptible unless the system is under heavy load (which would increase that load time further). The practical impact is to bring forward the point at which you need to either add more nodes or upgrade your instances.

Second, scaling application nodes automatically is almost a requirement in most modern cloud-native environments. Reaching a point where you need to add more nodes and more processing power will come with increases in users on your application, so it is easily explained away.

Suppose users won’t immediately notice and the system will automatically expand as needed (preventing you from knowing or getting involved). In that case, do we, or should we, care about adding more nodes or servers? Adding nodes may be cheap, but it is not free.

First, there is a direct cost to you. Take your hosting costs for your application servers and double them in the case above. What is that? $10K, $100K, $1M a year? That is money that is wasted. Look no further than the recent news lamenting the ever-increasing costs of the cloud.

Second, there is a bigger cost that comes with complexity. Observability is such a huge topic because everything we do in modern environments is done en masse. The more nodes and servers you have, the more potential exists for issues or for one node behaving badly. While our goal is to create a system where everything is replaceable and can be torn down and rebuilt to overcome problems, this is often not the reality. Instead, we end up replicating bad code and underlying bottlenecks, making a single problem a problem at 100x the scale.

We need to care. Application workload is a living entity that grows, shrinks, and expands with users. While modern systems need to scale quickly up and down to meet demand, that should not preclude us from looking out for hidden issues and bottlenecks. It is vitally important that we understand our applications’ workloads and look for deviations from normal patterns. We need to ask why something changed and dig in to find the answer. Just because we can build automation to mitigate a problem does not mean we should get complacent and lazy about fixing and optimizing our systems.

Dec 29, 2021
--

Q & A on Webinar “MySQL Performance for DevOps”

First, I want to thank everyone who attended my November 16, 2021 webinar “MySQL Performance for DevOps“. The recording and slides are available on the webinar page.

Here are the answers to the participants’ questions that I was not able to provide during the webinar.

Q: Hi! We have trouble with DELETE queries. We have to remove some data periodically (hourly, daily), and we see short-term server stalls during these DELETEs. The server is running on modern NVMe drives, so we wonder why we have this situation. The DELETEs are not so large, like 10,000–15,000 records, but the tables on which the DELETEs are performed are updated frequently.

A: I would test if a similar DELETE statement is slow when you run it on a development server in an isolated environment, while no other session is connected to the MySQL server instance. If it is slow in this case too, check if MySQL uses indexes to resolve the WHERE condition of the DELETE statement. You can use the EXPLAIN statement on the DELETE, or convert the DELETE into a similar SELECT query and experiment.

If the DELETE statement runs fast when called in the isolated environment, check how parallel sessions affect its performance. If the tables you are deleting from are updated frequently, DELETE statements could cause, and be affected by, locking conflicts. To resolve this situation, study how MySQL works with locks. A great presentation about InnoDB locks, “InnoDB Locking Explained with Stick Figures”, can be found at https://www.slideshare.net/billkarwin/innodb-locking-explained-with-stick-figures. Then you need to optimize the DELETE and UPDATE statements so they finish faster. Alternatively, you can separate them in time, so they have less effect on each other. You may also split DELETE statements, so they update fewer records at a time.
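As a sketch of that last point (the table and column names are invented for the example), a large purge can be split into short transactions that are far less likely to stall concurrent updates:

mysql> DELETE FROM events WHERE created_at < NOW() - INTERVAL 30 DAY LIMIT 10000;

Repeat the statement until it affects zero rows; each iteration holds its locks only briefly.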

Q: Question 2. We have innodb_buffer_pool_size set to around 260Gb on a dedicated server with about 320Gb of total RAM. Still, memory is 99.9% full, and there are no other large memory consumers, only MySQL (Percona 8.0.23). The server starts, and within around 3 hours it takes all available memory regardless of the innodb_buffer_pool_size setting. We never had something like this with 5.7. Do you have any ideas?

A: MySQL uses memory not only for the InnoDB buffer pool but also for other data, such as session-based and operation-based buffers. For example, if you have 100 connections that use underlying temporary tables to resolve queries, and you set the size of the internal temporary table to 100MB, you will use around 10G of additional memory for these tables. Query the memory digest tables in Performance Schema and the views on these tables in the sys schema to find the operations that allocate memory in your MySQL server.
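For example, a quick way to see the top global memory consumers through the sys schema:

mysql> SELECT event_name, current_alloc
         FROM sys.memory_global_by_current_bytes
         LIMIT 10;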

Q: Can we get a copy of this presentation?

A: You should have received a copy of the slides. If you did not, they are attached to this blog post: DevOps_Perf_202111

Q: buffer_pool_size should be what percentage of the host RAM?

A: The percentage of the host RAM is a very rough estimation of the ideal amount of memory you need to allocate for the InnoDB buffer pool. For example, the MySQL user manual in the past recommended an InnoDB buffer pool size of up to 80% of the available RAM. But 80% of RAM is very different if the host has, say, 8G or 1024G. In the former case, 80% is 6.4G, and the host will have 1.6G for other MySQL buffers and the operating system, which may not be enough. In the latter case, 80% is 819.2G, and the host will have 204.8G for other needs; depending on your workload, that could be a huge waste of resources. I recommend reading this blog post: https://www.percona.com/blog/2015/06/02/80-ram-tune-innodb_buffer_pool_size/ and following the links at the end, then choosing a size appropriate for your data set and workload.

Q: How can we fit RAM size to data size?

Example: if I have 1G of data, how much RAM do I need to get 100 QPS, and if I have 100G of data, how much RAM do I need to get 100 QPS?

A: RAM size, dataset size, and the number of queries per second that your server can handle are not directly related. You need to test your queries and follow how they are executed. For example, if you select everything from an InnoDB table holding either 1G or 100G of data, and you do not access any other table on the server, the very first run will be slower than the following ones because InnoDB reads data into the buffer pool. After that, performance and the number of queries per second will be limited only by the network speed and bandwidth between your client and server, provided you can allocate about 100G for your buffer pool. The cached size will stay almost the same as the table size no matter how many connections you have; your MySQL server will only use a small amount of memory for new connection buffers.

In another case, however, you may have a comparatively small table that you access with a quite complicated query. For example, if you try to repeat the test case for the still-valid https://bugs.mysql.com/bug.php?id=29423, a single query on the 184M table will run for a much longer time than you expect. In this case, the number of queries per second will also be very low.

Q: Do you have a recommended parameter list for MySQL RDS on AWS?

A: It is the same as for a dedicated MySQL server, but you may not be able to change some of the options.

Q: If you know you have SSD’s, but ROTA = 1, what has to be configured to make use of the SSDs?

A: For SSDs, ROTA should be 0. If you are sure you have SSDs but they are shown as rotational disks, your storage is configured incorrectly. Depending on the configuration, you may still have the same performance as if the disks were recognized properly. If this is not the case, check your storage, RAID controller, and system configuration.

MySQL just sends system commands for reading, writing, and syncing data. It does not care whether the disk is rotational or not. For MySQL performance, the value of ROTA does not really matter.

Q: If you believed you tuned both master and slave for the best performance, but seconds behind master continues to increase, you decide to split the shard, but xtrabackup fails with log wrap.  But even if you were to get a good backup, once it is online, the slave will never catch up.  The Kobayashi Maru, a no win situation – have you been there?  What did you do?

A: First, make sure you have configured a multi-threaded replica. If you use parallel type LOGICAL_CLOCK, study the option binlog_transaction_dependency_tracking, particularly how it works when set to WRITESET or to WRITESET_SESSION. To avoid log wrap during backup, increase the redo log file size. If you can stop the source server, stop it and set up a replica by copying the datadir: it is faster than using XtraBackup, because you would not need to copy changes in the redo log files while the backup is running.
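For reference, a hedged sketch of the WRITESET-based setup mentioned above (8.0.26+ variable names are shown; on older versions use the slave_* equivalents; the dependency tracking variable belongs on the source, the rest on the replica):

-- On the source:
mysql> SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';

-- On the replica (stop the applier before changing applier settings):
mysql> STOP REPLICA SQL_THREAD;
mysql> SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
mysql> SET GLOBAL replica_parallel_workers = 4;
mysql> SET GLOBAL replica_preserve_commit_order = ON;
mysql> START REPLICA SQL_THREAD;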

Q: In MySQL 5.7, the tmp tablespace is now InnoDB; how can you tune tmp to take advantage of RAM and not use disk?

A: The tablespace file on disk is used only when an in-memory temporary table is converted into a disk-based table. Otherwise, temporary tables stay in memory.

Q: What are the top 6 variables to get the best performance? How can you verify how effective their settings are by looking at the global status, and how can you know when those variables can be increased to get the best utilization from CPUs/RAM/disk/network?

A: While I showed variables that can improve performance in most cases on my “Conclusion” slides, I recommend starting from the issue you are trying to solve and adjusting variables only when you understand what you are doing.

Some of these variables can be measured for effectiveness. For example, if the number of free buffers in the output of SHOW ENGINE INNODB STATUS is small, and the buffer pool hit rate shows that the number of disk accesses is consistently greater than the number of buffer pool hits, it indicates that the buffer pool size may be too small for your workload and data.

Regarding CPU: if the number of active threads is high and you see a performance drop when concurrency increases while the operating system shows low CPU usage, it may be a symptom that either:

– you have limited the upper number of active engine threads, or

– the disk does not support so many parallel operations, and active threads are waiting for IO.

Another issue with CPU performance can happen if the upper limit on the number of active engine threads is not set or is too high, and threads spend time doing nothing while waiting in the priority queue.

The only option that directly limits IO activity is innodb_io_capacity, which limits the speed of background InnoDB operations. If set too low, InnoDB may underuse your fast disks; if set too high, InnoDB could start writing too fast, so that each write request wastes time waiting in its queue.

Q: What was the last InnoDB setting, the one which should be set up to the number of CPU cores?

A: This is innodb_thread_concurrency, which limits the number of InnoDB threads that can run in parallel. You should set it either to 0 or to the number of CPU cores.
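For example (the value shown is illustrative; 0 means no limit):

mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_thread_concurrency';
mysql> SET GLOBAL innodb_thread_concurrency = 16;  -- e.g., on a host with 16 CPU cores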

Q: Which is more secure and faster: Community MySQL, Percona MySQL, or AWS RDS?

A: Percona MySQL has performance and diagnostic improvements, as well as Enterprise-level features, available as open source. AWS RDS supports hardware scaling on demand and physical replication that uses InnoDB redo log files instead of binary logs. However, it does not give you the same control over the server as your own physical instance. Community MySQL works on a higher number of platforms and thus uses function calls that work on all of them, where Percona MySQL or AWS RDS may use optimized variants. So each of them has its own advantages and disadvantages.

Q: In the case where open tables >>> open_files (and you cannot change open_files), how should you set table_open_cache? “As big as possible”?

A: The status variable Open_files is “the number of files that are open. This count includes regular files opened by the server. It does not include other types of files such as sockets or pipes. Also, the count does not include files that storage engines open using their own internal functions rather than asking the server level to do so.” (https://dev.mysql.com/doc/refman/8.0/en/server-status-variables.html#statvar_Open_files) The status variable Open_tables is “the number of tables that are open”. They are not related to each other. You need to watch that the value of Opened_tables (“the number of tables that have been opened”) is not continuously growing compared to Open_tables.

There is an operating system limit, “open files”, that is visible if you run the command ulimit -n. This limit should be greater than the maximum number of files that your MySQL instance can simultaneously open. Speaking about Open_tables: you cannot have this value set to a number that is larger than the operating system “open files” limit unless your tables are stored in the shared or general tablespace.

Q: How can we tell if we should tune join_buffer_size? Are there wait events anywhere?

A: Tune it if you use JOIN queries that do not use indexes and they perform slowly because of this. Start with regular query tuning using the slow query log, Performance Schema, and Query Analyzer in PMM to find queries that require optimization. In Query Analyzer, add the “Full Join” column to your query list. In the Performance Schema, search for statements where the value of SELECT_FULL_JOIN is greater than 0 in the events_statements_* tables.
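As a sketch, the digest summary table gives an aggregated view of such statements (column names as in the MySQL 8.0 Performance Schema):

mysql> SELECT digest_text, count_star, sum_select_full_join
         FROM performance_schema.events_statements_summary_by_digest
        WHERE sum_select_full_join > 0
        ORDER BY sum_select_full_join DESC
        LIMIT 10;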

Check also my “Introduction to MySQL Query Tuning for Dev[Op]s” webinar.

Q: How to measure memory consumption of table_open_cache? 15K/table? FRM-related? some way to estimate?

A: This is the event “memory/sql/TABLE_SHARE::mem_root”. Check also this blog post.

Q: Hello guys!

Do we need to prepare different optimizations depending on the MySQL engine, e.g., XtraDB vs. InnoDB? If yes, could you please explain the differences?

Best regards,

Oleg Stelmach

A: XtraDB is an enhanced version of InnoDB in Percona Server: https://www.percona.com/doc/percona-server/8.0/percona_xtradb.html. So the differences are the added features in Percona Server, namely the options that exist in Percona Server and do not exist in upstream Community MySQL.

Q: Regarding threads: is it better to use hyperthreading/multithreading for a MySQL instance, or do we need to turn this function off?

Best regards,

Oleg Stelmach

A: You do not have to turn this option off, but you may see that MySQL performance is not linearly predictable in highly concurrent workloads. I recommend checking this blog post with hyperthreading benchmarks on MySQL, and the comments on it, for a better understanding of how hyperthreading can affect MySQL performance.

Q: Besides setting OS swappiness correctly, would you also recommend enabling memlock in my.cnf?

A: Normally you do not need it.

Dec 14, 2021
--

High Availability and Disaster Recovery Recipes for PostgreSQL on Kubernetes

Percona Distribution for PostgreSQL Operator allows you to deploy and manage highly available and production-grade PostgreSQL clusters on Kubernetes with minimal manual effort. In this blog post, we are going to look deeper into High Availability, Disaster Recovery, and Scaling of PostgreSQL clusters.

High Availability

Our default custom resource manifest deploys a highly available (HA) PostgreSQL cluster. Key components of HA setup are:

  • Kubernetes Services that point to pgBouncer and replica nodes
  • pgBouncer – a lightweight connection pooler for PostgreSQL
  • Patroni – HA orchestrator for PostgreSQL
  • PostgreSQL nodes – we have one primary and 2 replica nodes in hot standby by default

Kubernetes Service is the way to expose your PostgreSQL cluster to applications or users. We have two services:

  • clusterName-pgbouncer – Exposes your PostgreSQL cluster through the pgBouncer connection pooler. Both reads and writes are sent to the Primary node.
  • clusterName-replica – Exposes replica nodes directly. It should be used for reads only. Also, keep in mind that connections to this service are not pooled. We are working on a better solution, where the user would be able to leverage both connection pooling and read-scaling through a single service.
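For illustration, assuming a cluster named cluster1 deployed in the pgo namespace (the user and database names below are placeholders), an application inside the Kubernetes cluster would reach the pooled endpoint through the standard Service DNS name:

psql "host=cluster1-pgbouncer.pgo.svc.cluster.local port=5432 user=appuser dbname=appdb"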

By default we use the ClusterIP service type, but you can change it in pgBouncer.expose.serviceType or pgReplicas.hotStandby.expose.serviceType, respectively.

Every PostgreSQL container has Patroni running. Patroni monitors the state of the cluster, and in case of Primary node failure it switches the role of Primary to one of the Replica nodes. pgBouncer always knows where the Primary is.

As you see, we distribute PostgreSQL cluster components across different Kubernetes nodes. This is done with Affinity rules, applied by default to ensure that a single node failure does not cause database downtime.

Multi-Datacenter with Multi-AZ

Good architecture design is to run your Kubernetes cluster across multiple datacenters. Public clouds have a concept of availability zones (AZ) which are data centers within one region with a low-latency network connection between them. Usually, these data centers are at least 100 kilometers away from each other to minimize the probability of regional outage. You can leverage multi-AZ Kubernetes deployment to run cluster components in different data centers for better availability.

To ensure that PostgreSQL components are distributed across availability zones, you need to tweak affinity rules. For now, this is only possible by editing the Deployment resources directly:

$ kubectl edit deploy cluster1-repl2
…
-            topologyKey: kubernetes.io/hostname
+            topologyKey: topology.kubernetes.io/zone

Scaling

Scaling PostgreSQL to meet the demand at peak hours is crucial for high availability. Our Operator provides you with tools to scale PostgreSQL components both horizontally and vertically.

Vertical Scaling

Scaling vertically is all about adding more power to a PostgreSQL node. The recommended way is to change resources in the Custom Resource (instead of changing them in Deployment objects directly). For example, change the following in cr.yaml to get 256 MBytes of RAM for all PostgreSQL Replica nodes:

  pgReplicas:
    hotStandby:
      resources:
        requests:
-         memory: "128Mi"
+         memory: "256Mi"

Apply cr.yaml:

$ kubectl apply -f cr.yaml

Use the same approach to tune other components in their corresponding sections.

You can also leverage the Vertical Pod Autoscaler (VPA) to react to load spikes automatically. We create a Deployment resource for the Primary and each Replica node, and VPA objects should target these deployments. The following example tracks one of the replica Deployment resources of cluster1 and scales it automatically:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: pxc-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:     cluster1-repl1  
    namespace:  pgo
  updatePolicy:
    updateMode: "Auto"

Please read more about VPA and its capabilities in its documentation.

Horizontal Scaling

Adding more replica nodes or pgBouncers can be done by changing the size parameters in the Custom Resource. Make the following change in the default cr.yaml:

  pgReplicas:
    hotStandby:
-      size: 2
+      size: 3

Apply the change to get one more PostgreSQL Replica node:

$ kubectl apply -f cr.yaml

Starting from release 1.1.0 it is also possible to scale the cluster using the kubectl scale command. Execute the following to have two PostgreSQL replica nodes in cluster1:

$ kubectl scale --replicas=2 perconapgcluster/cluster1
perconapgcluster.pg.percona.com/cluster1 scaled

In the latest release it is not yet possible to use the Horizontal Pod Autoscaler (HPA); we will have it supported in the next one. Stay tuned.

Disaster Recovery

It is important to understand that Disaster Recovery (DR) is not High Availability. DR’s goal is to ensure business continuity in the case of a massive disaster, such as a full region outage. Recovery in such cases can be of course automated, but not necessarily – it strictly depends on the business requirements.

Backup and Restore

I think this is the most common Disaster Recovery protocol: take the backup, store it in some third-party location, and restore it to another data center if needed.

This approach is simple, but comes with a long recovery time, especially if the database is big. Use this method only if it meets your Recovery Time Objective (RTO).

Our Operator handles backup and restore for PostgreSQL clusters. The disaster recovery is built around pgBackrest and looks like the following:

  1. Configure pgBackrest to upload backups to S3 or GCS (see our documentation for details).
  2. Create the backup manually (through pgTask) or ensure that a scheduled backup was created. 
  3. Once the Main cluster fails, create the new cluster in the Disaster Recovery data center. The cluster must be running in standby mode and pgBackrest must be pointing to the same repository as the main cluster:
spec:
  standby: true
  backup:
  # same config as on original cluster

Once data is recovered, the user can turn off standby mode and switch the application to the DR cluster.

Continuous Restoration

This approach is quite similar to the above: pgBackrest instances continuously synchronize data between two clusters through object storage. This approach minimizes RTO and allows you to switch the application traffic to the DR site almost immediately. 

Configuration here is similar to the previous case, but we always run a second PostgreSQL cluster in the Disaster Recovery data center. In case of a main site failure, just turn off standby mode:

spec:
  standby: false

You can use a similar setup to migrate the data to and from Kubernetes. Read more about it in the Migrating PostgreSQL to Kubernetes blog post.

Conclusion

Kubernetes Operators provide ready-to-use service, and in the case of Percona Distribution for PostgreSQL Operator, the user gets a production-grade, highly available database cluster. In addition, the Operator provides day-2 operation capabilities and automates day-to-day routine.

We encourage you to try out our operator. See our GitHub repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions, please raise the topic in the community forum.

Are you a developer looking to contribute? Please read our CONTRIBUTING.md and send a Pull Request.

Dec 08, 2021
--

Percona Distribution for PostgreSQL Operator 1.1.0 – Notable Features

In 2021, Percona is heavily invested in making the PostgreSQL ecosystem better and contributing to it from different angles:

With this in mind, let me introduce to you Percona Distribution for PostgreSQL Operator version 1.1.0 and its notable features:

  • Smart Update – forget about manual and error-prone database upgrades
  • System Users management – add and modify system users with ease with a single Kubernetes Secret resource
  • PostgreSQL 14 support – leverage the latest and greatest by running Percona Distribution for PostgreSQL on Kubernetes

Full release notes can be found here.

Smart Update Feature

Updating databases and their components is always a challenge. In our Operators for MySQL and MongoDB we have simplified and automated upgrade procedures, and now it’s time for PostgreSQL. In version 1.1.0 we ship this feature as a Technical Preview, with a plan to promote it to GA in the next release.

This feature consists of two parts:

  • Version Service – get the latest or recommended version of the database or other component (PMM for example)
  • Smart Update – apply new version without downtime

Version Service

This feature answers the question: which PostgreSQL/pgBackRest/pgBouncer version should I be running with this Operator? It is important to note that Version Service and Smart Update can only perform minor version upgrades (e.g., from 13.1 to 13.4). Major version upgrades are manual for now and will be automated in the Operator soon.

The way it works is well depicted in the following diagram:

Percona Distribution for PostgreSQL Operator

Version Service is an open source tool; see the source code on GitHub. Percona hosts check.percona.com, and Operators use it by default, but users can run their own self-hosted Version Service.

Users who worked with our Operators for MySQL and MongoDB will find the configuration of Version Service and Smart Update quite familiar:

  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: recommended
    schedule: "0 2 * * *"

  • Define the Version Service endpoint
  • Define the PostgreSQL version – the Operator will automatically figure out the component versions
  • Schedule defines the time when the rollout of newer versions takes place. It is good practice to set this time outside of peak hours.

Smart Update

Okay, now the Operator knows the versions that should be used. It is time to apply them with minimal downtime. This is where the Smart Update feature kicks in.

The heart of Smart Update is the smartUpdateCluster function. The goal here is to switch container image versions for database components in a specific order and minimize downtime. Once the image is changed, Kubernetes does the magic: for the Deployment resources we use in our Operator, Kubernetes first spins up a Pod with the new image and then terminates the old one. This provides minimal downtime. The update itself looks like this:

  1. Upgrade the pgBackRest image in the Deployment object in Kubernetes
  2. Start upgrading PostgreSQL itself
    1. Percona Monitoring and Management, which runs as a sidecar, gets the new version here as well
    2. Same for pgBadger
    3. We must upgrade the replica nodes first; if we upgrade the primary node first, the cluster will not recover. The tricky part is that in the event of a failover, the Primary node can be somewhere in the pgReplicas Deployment, so we need to verify where the Primary is first and only then change the image. See the Smart Update sequence diagram for more details.
  3. Last, but not least, change the image for pgBouncer. To minimize downtime here, we recommend running at least two pgBouncer nodes. By default, pgBouncer.size is set to 3.

As a result, the user gets the latest, most secure, and performant PostgreSQL and its components automatically with minimal downtime.

System Users Management

Our Operator has multiple system users to manage the cluster and ensure its health. Our users raised two main concerns:

  • it is not possible to change a system user password with the Operator after cluster deployment
  • it is confusing that there is a Secret object per user

In this release, we are moving all system users to a single Secret. A change in the Secret resource triggers the update of the passwords in PostgreSQL automatically.

If the cluster is created from scratch, the Secret with system users is created automatically and the passwords are randomly generated. By default the Secret name is <clusterName>-users; it can be changed with the spec.secretsName variable in the Custom Resource:

spec:
  secretsName: my-custom-secret

When upgrading from 1.0.0 to 1.1.0, if you want to keep the old passwords, please create the Secret resource manually. Otherwise, the passwords for the system users are going to be generated randomly and updated by the Operator.
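For illustration only, such a Secret might look like the sketch below; the key names here are invented, so check the Operator documentation for the exact system user names it expects:

apiVersion: v1
kind: Secret
metadata:
  name: cluster1-users      # or the name set in spec.secretsName
type: Opaque
stringData:
  pguser: strongpassword    # one key per system user; names are illustrative
  postgres: anotherpassword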

PostgreSQL 14 Support

PostgreSQL 14 provides an extensive set of new features and enhancements to security, performance, usability for client applications, and more.

Most notable of them include the following:

  • Expired B-tree index entries can now be detected and removed between vacuum runs. This results in a smaller number of page splits and reduces index bloat.
  • The vacuum process now deletes dead tuples in a single cycle, as opposed to the previous two-step approach of first marking tuples as deleted and then actually freeing up space in the next run. This speeds up free space cleanup.
  • Support for subscripts in JSON is added to simplify data retrieval using a commonly recognized syntax (see the short example after this list).
  • Stored procedures can accept OUT parameters.
  • The libpq library now supports pipeline mode. Previously, client applications waited for a transaction to be completed before sending the next one. Pipeline mode allows applications to send multiple transactions at the same time, thus boosting performance.
  • Large transactions are now streamed to subscribers while in progress, increasing performance. This improvement applies to logical replication.
  • LZ4 compression is added for TOAST operations. This speeds up large data processing and also improves the compression ratio.
  • SCRAM is made the default authentication mechanism. This mechanism improves security and simplifies regulatory compliance for data security.
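As a short illustration of the JSON subscripting item above (the table and column are invented for the example):

-- Read a nested jsonb field with array-style syntax
SELECT details['attributes']['color'] FROM products;

-- Update a single jsonb field in place
UPDATE products SET details['price'] = '199.99';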

In version 1.1.0 of Percona Distribution for PostgreSQL Operator, we enable our users to run the latest and greatest PostgreSQL 14. PostgreSQL 14 is the default version since this release, but you can still use versions 12 and 13.

Conclusion

Kubernetes Operators are mainly seen as a tool to automate the deployment and management of applications. With this Percona Distribution for PostgreSQL Operator release, we simplify PostgreSQL management even more and enable users to leverage the latest version 14.

We encourage you to try out our operator. See our GitHub repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions, please raise the topic in the community forum.

Are you a developer looking to contribute? Please read our CONTRIBUTING.md and send a Pull Request.

Nov 26, 2021
--

Percona Monthly Bug Report: November 2021

Here at Percona, we operate on the premise that full transparency makes a product better. We strive to build the best open-source database products, but also to help you manage any issues that arise in any of the databases that we support. And, in true open source form, we report back on any issues or bugs you might encounter along the way.

We constantly update our bug reports and monitor other boards to ensure we have the latest information, but we wanted to make it a little easier for you to keep track of the most critical ones. These posts are a central place to get information on the most noteworthy open and recently resolved bugs. 

In this November 2021 edition of our monthly bug report, we have the following list of bugs:

Percona Server for MySQL/MySQL Bugs

PS-7919: Percona Server crashes during CREATE TABLE statements. The problem was seen when tables had many partitions and there were already many tablespaces present in the instance.

Affects Version/s: 8.0 [Tested/Reported version 8.0.25]

Fixed Version/s: 8.0.26

PS-7866 (MySQL#104961): MySQL crashes when running a complex query with JOINs. The MySQL error log contains only a single line, “Segmentation fault (core dumped)”, with no further details such as backtraces. Further analysis of the core dump shows that something happened in CreateIteratorFromAccessPath.

Affects Version/s: 8.0 [Tested/Reported version 8.0.23, 8.0.25]

PS-1653/PS-4935 (MySQL#78612, MySQL#97001): Optimizer issue for queries with ORDER BY + LIMIT. In some cases, the optimizer chooses a full table scan instead of an index range scan for a query that orders by primary key with a limit.

  • It happens only when we use ORDER BY and LIMIT together.
  • The issue is reproducible for tables with and without a primary key.

Affects Version/s: 5.7,8.0 [Tested/Reported version 5.7.27]

Fixed Version/s: 8.0.21, 5.7.33

MySQL 8.0.27 was recently released with some major changes and improvements. The following are a few noticeable changes:

  • Previously, MySQL user accounts were authenticated to the server using a single authentication method. MySQL now supports multifactor authentication (MFA), which makes it possible to create accounts that have up to three authentication methods.
  • The default_authentication_plugin variable is deprecated as of MySQL 8.0.27. Its replacement is the authentication_policy system variable, introduced in MySQL 8.0.27 with the multifactor authentication feature.
  • MySQL MTS Replication: Multithreading is now enabled by default for replica servers.

The default values for the MTS variables are as follows:

replica_parallel_workers=4

replica_preserve_commit_order=1

replica_parallel_type=LOGICAL_CLOCK

Percona XtraDB Cluster

PXC-3724: PXC node crashes with a long semaphore wait. This issue occurs due to locking when writing on multiple nodes in a PXC cluster. It is critical as it blocks all nodes from performing any transactions and finally crashes the PXC node.

Affects Version/s: 8.0  [Tested/Reported version 8.0.22,8.0.25]

PXC-3729: PXC node fails with two applier threads due to a conflict. With a certain combination of a table having both a primary key and a (multi-column) unique key, and a highly concurrent write workload updating similar UK values, parallel appliers on secondary nodes may conflict with each other, causing node aborts with an inconsistent state. This can be seen even when writes go to a single writer node only. As the transactions were already certified and replicated, they should never fail on the appliers.

Affects Version/s: 8.0  [Tested/Reported version 8.0.21]

Fixed Version/s: 8.0.25

PXC-3449: When ALTER TABLE (TOI) is executed in a user session, it sometimes conflicts (on MDL) with a high-priority transaction, which causes a BF-BF abort and server termination.

Affects Version/s: 8.0  [Tested/Reported version 8.0.21]

Fixed Version/s: 8.0.25

PXC-3387: The server hits an assertion while a query performs an intermediate commit during an update of table stats in the Data Dictionary. You will see this issue when autocommit=0.

Affects Version/s: 8.0  [Tested/Reported version 8.0.21]

Fixed Version/s: 8.0.25

Percona XtraBackup

PXB-2629: Downloading a backup using xbcloud on Debian 10 does not work. The download gets stuck at some point forever, or fails with a Segmentation fault message at the end.

The issue was traced to the curl version: the default on Debian 10 is curl 7.64.0 libcurl/7.64.0, and upgrading it to curl 7.74.0 (x86_64-pc-linux-gnu) libcurl/7.74.0 fixes the issue.

Affects Version/s: 8.0  [Tested/Reported version 8.0.21]

Fixed Version/s: 8.0.25

Percona Toolkit

PT-1889: Incorrect output when using pt-show-grants for users based on MySQL roles. As a result, the grants cannot be applied back properly on the MySQL server. Because of this, pt-show-grants cannot be used for MySQL roles until this issue is fixed.

Affects Version/s:  3.2.1

PT-1747: pt-online-schema-change brought the database into a broken state when applying the “rebuild_constraints” foreign key modification method if any of the child tables were blocked by a metadata lock.

Affects Version/s:  3.0.13

Fixed Version: 3.4.0

PMM  [Percona Monitoring and Management]

PMM-7116: MongoDB ReplSet Summary Dashboard shows incorrect replset member’s state: STARTUP instead of PRIMARY.

Affects Version/s: 2.x  [Tested/Reported version 2.12,2.20]

Fixed Version:  2.25.0

PMM-9085: After upgrading to 2.22, the pmm-managed component crashes with the following error: panic: interface conversion: agentpb.AgentResponsePayload is nil, not *agentpb.GetVersionsResponse

Affects Version/s: 2.x  [Tested/Reported version 2.22]

Fixed Version:  2.25.0

PMM-9156: The pmm-agent paths-base option does not work for the pmm2-client binary installation in PMM 2.23.0. Starting the pmm-agent process logs: level=error msg="Error reading textfile collector directory"

Affects Version/s: 2.x  [Tested/Reported version 2.23]

PMM-7846: Adding a MongoDB instance via pmm-admin with the tls option does not work, failing with the error: Connection check failed: timeout (context deadline exceeded).

Affects Version/s: 2.x  [Tested/Reported version 2.13, 2.16]

Percona Server for MongoDB

PSMDB-892: RWC defaults pollute the logs with duplicate “Refreshed RWC defaults” messages; as a result, the log is saturated with the message in the title.

Affects Version/s:  4.4.6

Fixed Version: 4.4.8

WT-7984:  This issue in MongoDB 4.4.8 causes a checkpoint thread to read and persist an incomplete version of data to disk. Data in memory remains correct unless the server crashes or experiences an unclean shutdown. Then, the inconsistent checkpoint is used for recovery and introduces corruption.

The bug is triggered on cache pages that receive an update during a running checkpoint and which are evicted during the checkpoint.

Affects Version/s:  4.4.8

Fixed Version: 4.4.9

PSMDB-671: createBackup returns ok:1 for archived backup even when there is no disk space available.

Affects Version/s: 4.0.12-6, 4.2.1-1, 3.6.15-3.5

Fixed Version: 3.6.19-7.0, 4.0.20-13, 4.2.9-9

Percona Distribution for PostgreSQL

DISTPG-317: When installing Percona PostgreSQL 13 from its repository, the package dependencies are such that an installed PostgreSQL Community 12 is going to be removed.

Affects Version/s: 13.4

Summary

We welcome community input and feedback on all our products. If you find a bug or would like to suggest an improvement or a feature, learn how in our post, How to Report Bugs, Improvements, New Feature Requests for Percona Products.

For the most up-to-date information, be sure to follow us on Twitter, LinkedIn, and Facebook. 

Quick References:

Percona JIRA

MySQL Bug Report

Report a Bug in a Percona Product

MySQL 8.0.27 Release notes

https://jira.mongodb.org/

___

About Percona:

As the only provider of distributions for all three of the most popular open source databases—PostgreSQL, MySQL, and MongoDB—Percona provides expertise, software, support, and services no matter the technology.

Whether it's enabling developers or DBAs to realize value faster with tools, advice, and guidance, or making sure applications can scale and handle peak loads, Percona is here to help.

Percona is committed to being open source and preventing vendor lock-in. Percona contributes all changes to the upstream community for possible inclusion in future product releases.

Nov 23, 2021
--

Multi-Tenant Kubernetes Cluster with Percona Operators

There are cases where multiple teams, customers, or applications run in the same Kubernetes cluster. Such an environment is called multi-tenant, and it requires some preparation and management. Multi-tenant Kubernetes deployments allow you to utilize the economy-of-scale model on various levels:

  • Smaller compute footprint – one control plane, dense container deployments
  • Ease of management – one cluster, not hundreds

In this blog post, we are going to review multi-tenancy best practices and recommendations, and see how Percona Kubernetes Operators can be deployed and managed in such Kubernetes clusters.

Multi-Tenancy

Generic

Multi-tenancy usually means a lot of Pods and workloads in a single cluster. You should always remember that there are certain limits when designing your infrastructure. For vanilla Kubernetes, these limits are quite high and hard to reach:

  • 5000 nodes
  • 10 000 namespaces
  • 150 000 pods

Managed Kubernetes services have their own limits that you should keep in mind. For example, GKE allows a maximum of 110 Pods per node on a standard cluster and only 32 on GKE Autopilot nodes.

The older AWS EKS CNI plugin limited the number of Pods per node to the number of IP addresses an EC2 instance can have. With prefix assignment enabled in the CNI, you are still going to hit a limit of 110 pods per node.

Namespaces

Kubernetes Namespaces provide a mechanism for isolating groups of resources within a single cluster. Kubernetes objects are either cluster-scoped or namespace-scoped. Objects which are accessible across all namespaces, like ClusterRole, are cluster-scoped, while those which are accessible only in a single namespace, like Deployments, are namespace-scoped.

kubernetes namespaces

Deploying a database with Percona Operators creates pods that are namespace scoped. This provides interesting opportunities to run workloads on different namespaces for different teams, projects, and potentially, customers too. 

Example: Percona Distribution for MongoDB Operator and Percona Server for MongoDB can be run on two different namespaces by adding namespace metadata fields. Snippets are as follows:

# Team 1 DB running in team1-db namespace
apiVersion: psmdb.percona.com/v1-11-0
kind: PerconaServerMongoDB
metadata:
 name: team1-server
 namespace: team1-db

# Team 1 deployment running in team1-db namespace
apiVersion: apps/v1
kind: Deployment
metadata:
 name: percona-server-mongodb-operator-team1
 namespace: team1-db


# Team 2 DB running in team2-db namespace
apiVersion: psmdb.percona.com/v1-11-0
kind: PerconaServerMongoDB
metadata:
 name: team2-server
 namespace: team2-db

# Team 2 deployment running in team2-db namespace
apiVersion: apps/v1
kind: Deployment
metadata:
 name: percona-server-mongodb-operator-team2
 namespace: team2-db

Suggestions:

  1. Avoid using the standard namespaces like kube-system or default.

  2. It’s always better to run independent workloads in different namespaces unless there is a specific requirement to share one.

Namespaces can be used per team, per application environment, or any other logical structure that fits the use case.

Resources

The biggest problem in any multi-tenant environment is this – how can we ensure that a single bad apple doesn’t spoil the whole bunch of apples?

ResourceQuotas

Thanks to ResourceQuotas, we can restrict the resource utilization of a namespace. ResourceQuotas also allow you to restrict the number of k8s objects that can be created in a namespace.

Example of the YAML manifest with resource quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team1-quota         
  namespace: team1-db    # Namespace where operator is deployed
spec:
  hard:
    requests.cpu: "10"     # Cumulative CPU requests of all k8s objects in the namespace cannot exceed 10vcpu
    limits.cpu: "20"       # Cumulative CPU limits of all k8s objects in the namespace cannot exceed 20 vcpu
    requests.memory: 10Gi  # Cumulative memory requests of all k8s objects in the namespace cannot exceed 10Gi
    limits.memory: 20Gi    # Cumulative memory limits of all k8s objects in the namespace cannot exceed 20Gi
    requests.ephemeral-storage: 100Gi  # Cumulative ephemeral storage request of all k8s objects in the namespace cannot exceed 100Gi
    limits.ephemeral-storage: 200Gi    # Cumulative ephemeral storage limits of all k8s objects in the namespace cannot exceed 200Gi
    requests.storage: 300Gi            # Cumulative storage requests of all PVC in the namespace cannot exceed 300Gi
    persistentvolumeclaims: 5          # Maximum number of PVC in the namespace is 5
    count/statefulsets.apps: 2         # Maximum number of statefulsets in the namespace is 2
    # count/psmdb: 2                   # Maximum number of PSMDB objects in the namespace is 2, replace the name with proper Custom Resource

Please refer to the Resource Quotas documentation and apply quotas that are required for your use case.
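As a usage sketch, assuming the manifest above is saved as a hypothetical team1-quota.yaml file, it can be applied and its current consumption inspected with kubectl:

kubectl apply -f team1-quota.yaml
kubectl describe resourcequota team1-quota -n team1-db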

If resource quotas are applied to a namespace, it is required to set containers’ requests and limits, otherwise, you are going to have an error similar to the following:

Error creating: pods "my-cluster-name-rs0-0" is forbidden: failed quota: my-cpu-memory-quota: must specify limits.cpu,requests.cpu

All Percona Operators provide the capability to fine-tune the requests and limits. The following example sets CPU and memory requests for Percona XtraDB Cluster containers:

spec:
  pxc:
    resources:
      requests:
        memory: 4G
        cpu: 2

LimitRange

With ResourceQuotas we can control the cumulative resources in a namespace, but if we want to enforce constraints on individual Kubernetes objects, LimitRange is a useful option.

For example, if teams 1, 2, and 3 each get a namespace to run workloads, a ResourceQuota will ensure that no team can exceed its allocated quota and over-utilize the cluster… but what if a badly configured workload (say, an operator run by team 1 with a higher priority class) uses up all the resources allocated to the team?

LimitRange can be used to enforce constraints on compute, memory, ephemeral storage, and PVC storage for individual objects. The example below highlights some of the possibilities.

apiVersion: v1
kind: LimitRange
metadata:
  name: lr-team1
  namespace: team1-db
spec:
  limits:
  - type: Pod                      
    max:                            # Maximum resource limit of all containers combined. Consider setting default limits
      ephemeral-storage: 100Gi      # Maximum ephemeral storage cannot exceed 100GB
      cpu: "800m"                   # Maximum CPU limits of the Pod is 800mVCPU
      memory: 4Gi                   # Maximum memory limits of the Pod is 4 GB
    min:                            # Minimum resource request of all containers combined. Consider setting default requests
      ephemeral-storage: 50Gi       # Minimum ephemeral storage should be 50GB
      cpu: "200m"                   # Minimum CPU request is  200mVCPU
      memory: 2Gi                   # Minimum memory request is 2 GB
  - type: PersistentVolumeClaim
    max:
      storage: 2Gi                  # Maximum PVC storage limit
    min:
      storage: 1Gi                  # Minimum PVC storage request

Suggestions:

  1. When it’s feasible, apply ResourceQuotas and LimitRanges to the namespaces where the Percona Operator is running. This ensures that tenants are not over-utilizing the cluster.

  2. Set alerts to monitor objects and resource usage in namespaces. Automating ResourceQuota changes may also be useful in some scenarios.

  3. It is advisable to leave a buffer above the maximum expected utilization when setting the ResourceQuotas.

  4. Set LimitRanges to ensure workloads are not over-utilizing resources in individual namespaces.

Roles and Security

Kubernetes provides several modes to authorize an API request; Role-Based Access Control (RBAC) is a popular one. There are four important objects for granting access:

  • ClusterRole: a set of permissions across the cluster (cluster-scoped)
  • Role: a set of permissions within a namespace (namespace-scoped)
  • ClusterRoleBinding: grants permissions to subjects across the cluster (cluster-scoped)
  • RoleBinding: grants permissions to subjects within a namespace (namespace-scoped)

Subjects in a RoleBinding/ClusterRoleBinding can be users, groups, or service accounts. Every Pod running in the cluster has an identity and a service account attached (the “default” service account of the same namespace is attached if none is specified explicitly). The permissions granted to that service account with a RoleBinding/ClusterRoleBinding dictate the access the Pods will have.

Following the principle of least privilege, it is always advisable to use a Role with the smallest set of permissions and bind it to a service account with a RoleBinding. This service account can be used to run the Operator or Custom Resource, ensuring proper access and restricting the blast radius.

Avoid granting cluster-level access unless there is a strong use case to do it.

Example: RBAC in the MongoDB Operator uses a Role and a RoleBinding, restricting the service account’s access to a single namespace. The same service account is used for both the Custom Resource and the Operator.
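As a minimal sketch of this pattern (the names and rules below are hypothetical; the actual RBAC manifests ship with the Operator bundle), a namespace-scoped Role and RoleBinding could look like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: psmdb-operator-role        # hypothetical name
  namespace: team1-db
rules:
- apiGroups: ["psmdb.percona.com"] # access limited to the Operator's own resources
  resources: ["perconaservermongodbs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psmdb-operator-rolebinding # hypothetical name
  namespace: team1-db
subjects:
- kind: ServiceAccount
  name: percona-server-mongodb-operator
  namespace: team1-db              # permissions apply only inside this namespace
roleRef:
  kind: Role
  name: psmdb-operator-role
  apiGroup: rbac.authorization.k8s.io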

Network Policies

Network isolation provides additional security to applications and customers in a multi-tenant environment. Network policies are Kubernetes resources that allow you to control the traffic between Pods, CIDR blocks, and network endpoints, but the most common approach is to control the traffic between namespaces:

kubernetes network policies

Most Container Network Interface (CNI) plugins support the implementation of network policies; however, if a plugin does not and a NetworkPolicy is created, the resource is silently ignored. For example, the AWS CNI does not support network policies, but AWS EKS can run the Calico CNI, which does.

It is good practice to follow the least-privilege principle, whereby traffic is denied by default and access is granted granularly:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app1-db
spec:
  podSelector: {}
  policyTypes:
  - Ingress

Allow traffic from Pods in namespace app1 to namespace app1-db:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app1-ingress
  namespace: app1-db
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: app1
  policyTypes:
  - Ingress
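Note that namespaceSelector matches namespace labels, not namespace names, so the policy above assumes the app1 namespace carries a matching label, for example:

kubectl label namespace app1 name=app1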

Policy Enforcement

In a multi-tenant environment, policy enforcement plays a key role: it ensures that k8s objects pass the quality gates set by administrators and teams. Some examples of policy enforcement could be:

  1. All the workloads have proper labels 
  2. Proper network policies are set for DB
  3. Unsafe configurations are not allowed (Example)
  4. Backups are always enabled (Example)

The K8s ecosystem offers a wide range of options to achieve this. Some of them are listed below:

  1. Open Policy Agent (OPA) is a CNCF graduated project which provides a high-level declarative language to author and enforce policies across k8s objects (a minimal sketch follows this list; the examples from Google and the OPA repo can also be helpful).

  2. Mutating webhooks can be used to modify API calls before they reach the API server. This can be used to set required properties on k8s objects. (Example: a mutating webhook that adds a NetworkPolicy for Pods created in production namespaces.)

  3. Validating webhooks can be used to check whether a k8s API request follows the required policy; any request which does not follow the policy is rejected. (Example: a validating webhook that ensures huge pages of 1GB are not used in a Pod.)
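As a minimal sketch of the declarative style, assuming OPA Gatekeeper and a K8sRequiredLabels ConstraintTemplate (as in the Gatekeeper documentation) are installed, a constraint requiring a team label on Pods could look like this:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["team1-db"]   # hypothetical tenant namespace
  parameters:
    labels: ["team"]           # every Pod must carry a "team" label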

Cluster-Wide

Percona Distribution for MySQL Operator and Percona Distribution for PostgreSQL Operator both support a cluster-wide mode, which allows a single Operator to deploy and manage databases across multiple namespaces (support for cluster-wide mode in Percona Operator for MongoDB is on the roadmap). It is also possible to have an Operator per namespace:

Operator per namespace

For example, a single deployment of Percona Distribution for MySQL Operator can monitor multiple namespaces in cluster-wide mode. The user can specify them in the WATCH_NAMESPACE environment variable in the cw-bundle.yaml file:

    spec:
      containers:
      - command:
        - percona-xtradb-cluster-operator
        env:
        - name: WATCH_NAMESPACE
          value: "namespace-a, namespace-b"

Which approach to pick in a multi-tenant environment depends on the amount of freedom you want to give the tenants. When the tenants are highly trusted (for instance, internal teams), it is fine to choose the namespace-scoped deployment, where each team can deploy and manage the Operator themselves.

Conclusion

It is important to remember that Kubernetes is not a multi-tenant system out of the box. This blog post described various levels of isolation that help you run your applications and databases securely and ensure operational stability.

We encourage you to try out our Operators.

CONTRIBUTING.md in every repository is there for those of you who want to contribute your ideas, code, and docs.

For general questions please raise the topic in the community forum.

Nov
18
2021
--

Should I Create an Index on Foreign Keys in PostgreSQL?

Index on Foreign Keys in PostgreSQL

Welcome to a weekly blog where I get to answer (like, really answer) some of the questions I’ve seen in the webinars I’ve presented lately. If you missed the latest one, PostgreSQL Performance Tuning Secrets, it might be helpful to give some of it a listen before or after you read this post. Each week, I’ll dive deep into one question. Let me know what you think in the comments. 

We constantly hear that indexes improve read performance, and it’s usually true, but we also know that they always have an impact on writes. What we don’t hear about often enough is that in some cases an index may not give any performance improvement at all. This happens more than we want, and perhaps more than we even notice, and foreign keys (FKs) are a great example. I’m not saying that all FK indexes are bad, but most of the ones I’ve seen are simply unnecessary, only adding load to the system.

For example, take the 1:N relationship below between the table “Supplier” and the table “Product”:

foreign keys index

If we pay close attention to the FK in this example, the child table won’t see a high number of lookups on the FK column, “SupplierID”, compared with the number of lookups using “ProductID” and probably “ProductName”. The major usage will be to keep the relationship consistent and to search in the other direction: finding the supplier of a certain product. In this circumstance, adding an index on the FK child column without an access pattern that requires it only adds the extra cost of updating the index every time we update the “Product” table.

Another point we need to pay attention to is index cardinality. If the index cardinality is too low, Postgres won’t use the index, and it will just be ignored. One may ask why that happens, and whether it wouldn’t still be cheaper for the database to go through, say, half of the index instead of doing a full table scan. The answer is no, especially for databases that use heap tables, like Postgres. Access to a heap table in Postgres is mostly sequential, which is faster than random access on spinning HDD disks and still a bit faster on SSDs, while B-tree index access is random by nature.

When Postgres uses an index, it needs to open the index file, find the records it needs, and then open the table file and look up the page addresses it got from the index, changing the access pattern from sequential to random. Depending on the data distribution, it will probably access the majority of the table pages, ending up in what is effectively a full table scan performed with random access, which is much more expensive. If we have columns with low cardinality and we really need to index them, we need an alternative to B-tree indexes, for example a GIN index, but that is a topic for another discussion.
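As a quick sketch of that alternative (with hypothetical table and column names; indexing scalar types with GIN requires the btree_gin contrib extension):

CREATE EXTENSION IF NOT EXISTS btree_gin;
-- GIN index on a hypothetical low-cardinality "status" column
CREATE INDEX idx_product_status ON product USING gin (status);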

With all of that, we may think that FK indexes are always evil and should never be used on a child table, right? Well, that’s not true either; there are many circumstances where they are useful and needed. For example, the picture below adds another two tables, “Customer” and “Order”:

FK child table

It can be handy to have an index on the child table’s “Order->CustomerId” column, as it’s common to show all orders of a certain customer, so the column “CustomerId” on the table “Order” will be used quite frequently as the lookup key.
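A sketch of such an index, assuming the table and column names from the picture:

CREATE INDEX idx_order_customer_id ON "Order" ("CustomerId");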

Another good example is providing a faster method to validate referential integrity. If one needs to change the parent table (update or delete any parent key), the children need to be checked to make sure the relationship isn’t broken, and having an index on the child’s side helps to improve that check’s performance. Whether the index is worth it, however, is load-dependent. If the parent key sees many deletes, it might be a case to consider; however, if it’s a mostly static table, or one that mostly receives inserts or updates to columns other than the parent key, then it’s not a good candidate for an index on the child tables.

There are many other examples that can be given to explain why an index on the child table might be useful and worth the extra write cost/penalty.

Conclusion

The takeaway here is that we should not indiscriminately create indexes on all FKs, because many of them will just not be used, or be used so rarely that they aren’t worth the cost. It’s better to initially design the database with the FKs but without the indexes, and add indexes as the database grows and we understand the workload. It’s possible that at some point we find we need an index on “Product->SupplierId” due to our workload, while the index on “Order->CustomerId” is no longer necessary. Workloads and data distributions change, and indexes should follow them rather than be treated as immutable entities.


As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Nov
10
2021
--

Community Meetup Schedule Till the End of the Year

percona community meetups

This fall we started to run live-streamed online meetups accompanied by a live chat with the audience. Every week we invite experts for a real-time talk and devote this meeting to a specific database.

But the holiday season is coming soon! So, we slightly changed the meetup schedule and would like to announce it. Here are the dates of the upcoming events:

  1. November 17th – Meetup for PostgreSQL
  2. December 1st – Meetup for Percona Monitoring and Management (live stream from AWS re:Invent)
  3. December 8th – Meetup for MySQL
  4. December 15th – Meetup for PostgreSQL
  5. December 29th – Meetup for MongoDB

What do meetups look like? On Wednesdays we gather in a virtual studio at 11 am EST / 5 pm CET and start streaming on YouTube and Twitch. Each event is dedicated to one open source database or technology. Experts share their knowledge with you and answer questions that you can ask in the chat or on Discord. It is that simple, just come and join us!

Community Meetups Percona

Our next meetup will take place on November 17th. We will discuss autovacuum in PostgreSQL with Lead Senior Support Engineer Sergey Kuzmichev. Check out the event details on our Community website. You will learn more about autovacuum and get answers to your questions live. The Percona expert could even advise you on your particular issue during the live stream.

Normally guest speakers prepare some topics for discussion but things do not always go according to plan, and since it is a live stream, your question or suggestion to our amazing host Matt Yonkovit can change the direction of conversation dynamically.

The current meetup agenda for November 17th:

  1. Why do we need the autovacuum at all?
  2. How to stop worrying and start to love the autovacuum?
  3. Monitoring for related issues.

Percona MeetUp for PostgreSQL November 2021 - Autovacuum

Also, we have already run nine meetups, and if you missed them, you can catch up on the recordings.

All the recordings, their transcripts, and also publications about future meetups are available on our Community website: https://percona.community/events/percona-meetups/.

Come and talk about open source with us! If you feel inspired to participate in meetups not only as a listener but as a speaker, we would love to see you. Just contact us at community-team@percona.com to discuss your involvement.

Percona Open Source Community Meetups

Nov
09
2021
--

What if … MySQL’s Repeatable Reads Cause You to Lose Money?

MySQL Repeatable Reads

Well, let me say that if that happens, it’s because there is a logic mistake in your application. But you need to know and understand what happens in MySQL to be able to avoid the problem. 

In short, the WHY of this article is to inform you about possible pitfalls and how to prevent them from causing you damage.

Let us start with a short introduction to what repeatable reads are about. Given that I am extremely lazy, I am going to borrow (a lot) from the existing MySQL documentation.

Transaction isolation is one of the foundations of database processing. Isolation is the I in the acronym ACID; the isolation level is the setting that fine-tunes the balance between performance and reliability, consistency, and reproducibility of results when multiple transactions are making changes and performing queries at the same time.

InnoDB offers all four transaction isolation levels described by the SQL:1992 standard: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE. The default isolation level for InnoDB is REPEATABLE READ.

  • REPEATABLE READ
    This is the default isolation level for InnoDB. Consistent reads within the same transaction read the snapshot established by the first read. This means that if you issue several plain (nonlocking) SELECT statements within the same transaction, these SELECT statements are consistent also with respect to each other.
  • READ COMMITTED
    Each consistent read, even within the same transaction, sets and reads its own fresh snapshot.

And about Consistent Non-Blocking reads:

A consistent read means that InnoDB uses multi-versioning to present to a query a snapshot of the database at a point in time. The query sees the changes made by transactions that committed before that point in time, and no changes made by later or uncommitted transactions. The exception to this rule is that the query sees the changes made by earlier statements within the same transaction. This exception causes the following anomaly: If you update some rows in a table, a SELECT sees the latest version of the updated rows, but it might also see older versions of any rows. If other sessions simultaneously update the same table, the anomaly means that you might see the table in a state that never existed in the database.

Ok, but what does all this mean in practice? To understand, let us simulate this scenario:

I have a shop and I decide to grant a bonus discount to a selected number of customers that:

  • Have an active account to my shop
  • Match my personal criteria to access the bonus

My application is set to perform batch operations in a moment of less traffic and unattended. This includes reactivating dormant accounts that customers ask to reactivate.  

What we will do is to see what happens, by default, and then see what we can do to avoid pitfalls.

The Scenario

I will use three different connections to connect to the same MySQL 8.0.27 instance. The only relevant setting I have modified is innodb_lock_wait_timeout, which I set to 50 seconds. 

  • Session 1 will simulate a process that should activate the bonus feature for the selected customers
  • Session 2 is an independent process that reactivates the given list of customers
  • Session 3 is used to collect lock information

For this simple test I will use the customer table in the sakila schema modified as below:

CREATE TABLE `customer` (
  `customer_id` smallint unsigned NOT NULL AUTO_INCREMENT,
  `store_id` tinyint unsigned NOT NULL,
  `first_name` varchar(45) NOT NULL,
  `last_name` varchar(45) NOT NULL,
  `email` varchar(50) DEFAULT NULL,
  `address_id` smallint unsigned NOT NULL,
  `active` tinyint(1) NOT NULL DEFAULT '1',
  `create_date` datetime NOT NULL,
  `last_update` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `bonus` int NOT NULL DEFAULT '0',
  `activate_bonus` varchar(45) NOT NULL DEFAULT '0',
  PRIMARY KEY (`customer_id`),
  KEY `idx_fk_store_id` (`store_id`),
  KEY `idx_fk_address_id` (`address_id`),
  KEY `idx_last_name` (`last_name`),
  KEY `idx_bonus` (`bonus`),
  CONSTRAINT `fk_customer_address` FOREIGN KEY (`address_id`) REFERENCES `address` (`address_id`) ON DELETE RESTRICT ON UPDATE CASCADE,
  CONSTRAINT `fk_customer_store` FOREIGN KEY (`store_id`) REFERENCES `store` (`store_id`) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=600 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

As you can see I have added the bonus and activate_bonus fields plus the idx_bonus index.

To be able to trace the locks, these are the thread IDs by session:

session 1 17439
session 2 17430
session 3 17443

To collect the lock information:

SELECT 
    index_name, lock_type, lock_mode, lock_data, thread_id
FROM
    performance_schema.data_locks
WHERE
    object_schema = 'sakila'
        AND object_name = 'customer'
        AND lock_type = 'RECORD'
        AND thread_id IN (17439 , 17430)
ORDER BY index_name , lock_data DESC;

Ok, ready? Let’s start!

The Run…

While the following steps could be done in a more compressed way, I prefer a more verbose approach to keep them human-readable.

First, let us set the environment:

session1 >set transaction_isolation = 'REPEATABLE-READ';
Query OK, 0 rows affected (0.07 sec)
session1 >Start Transaction;
Query OK, 0 rows affected (0.07 sec)

Then let see the list of the customers we will modify:

session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 order by last_name;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|         383 |        1 | MARTIN     | BALES     |     1 | 0              |
|         539 |        1 | MATHEW     | BOLIN     |     1 | 0              |
|         441 |        1 | MARIO      | CHEATHAM  |     1 | 0              |
|         482 |        1 | MAURICE    | CRAWLEY   |     1 | 0              |
|         293 |        2 | MAE        | FLETCHER  |     1 | 0              |
|          38 |        1 | MARTHA     | GONZALEZ  |     1 | 0              |
|         444 |        2 | MARCUS     | HIDALGO   |     1 | 0              |
|         252 |        2 | MATTIE     | HOFFMAN   |     1 | 0              |
|         256 |        2 | MABEL      | HOLLAND   |     1 | 0              |
|         226 |        2 | MAUREEN    | LITTLE    |     1 | 0              |
|         588 |        1 | MARION     | OCAMPO    |     1 | 0              |
|         499 |        2 | MARC       | OUTLAW    |     1 | 0              |
|         553 |        1 | MAX        | PITT      |     1 | 0              |
|         312 |        2 | MARK       | RINEHART  |     1 | 0              |
|          80 |        1 | MARILYN    | ROSS      |     1 | 0              |
|         583 |        1 | MARSHALL   | THORN     |     1 | 0              |
|         128 |        1 | MARJORIE   | TUCKER    |     1 | 0              |
|          44 |        1 | MARIE      | TURNER    |     1 | 0              |
|         267 |        1 | MARGIE     | WADE      |     1 | 0              |
|         240 |        1 | MARLENE    | WELCH     |     1 | 0              |
|         413 |        2 | MARVIN     | YEE       |     1 | 0              |
+-------------+----------+------------+-----------+-------+----------------+
21 rows in set (0.08 sec)

As you can see we have 21 customers fitting our criteria. How much money is involved in this exercise?

session1 >SELECT
    ->     SUM(amount) income,
    ->     SUM(amount) * 0.90 income_with_bonus,
    ->    (SUM(amount) - (SUM(amount) * 0.90)) loss_because_bonus
    -> FROM
    ->     sakila.customer AS c
    ->         JOIN
    ->     sakila.payment AS p ON c.customer_id = p.customer_id
    ->     where active = 1 and bonus =1 ;
+---------+-------------------+--------------------+
| income  | income_with_bonus | loss_because_bonus |
+---------+-------------------+--------------------+
| 2416.23 |         2174.6070 |           241.6230 |
+---------+-------------------+--------------------+

This exercise is going to cost me ~242 dollars. Keep this number in mind. What locks do I have at this point?

session3 >select index_name, lock_type, lock_mode,lock_data from performance_schema.data_locks where  object_schema = 'sakila' and object_name = 'customer' and lock_type = 'RECORD'  and
thread_id in (17439,17430) order by index_name, lock_data desc;
Empty set (0.00 sec)

The answer is none. 

Meanwhile, we have the other process that needs to reactivate the customers:

session2 >set transaction_isolation = 'REPEATABLE-READ';
Query OK, 0 rows affected (0.00 sec)

session2 >Start Transaction;
Query OK, 0 rows affected (0.00 sec)

session2 >SELECT * FROM sakila.customer where bonus = 1 and active =0 ;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|           1 |        1 | MARY       | SMITH     |     1 | 0              |
|           7 |        1 | MARIA      | MILLER    |     1 | 0              |
|           9 |        2 | MARGARET   | MOORE     |     1 | 0              |
|         178 |        2 | MARION     | SNYDER    |     1 | 0              |
|         236 |        1 | MARCIA     | DEAN      |     1 | 0              |
|         246 |        1 | MARIAN     | MENDOZA   |     1 | 0              |
|         254 |        2 | MAXINE     | SILVA     |     1 | 0              |
|         257 |        2 | MARSHA     | DOUGLAS   |     1 | 0              |
|         323 |        2 | MATTHEW    | MAHAN     |     1 | 0              |
|         408 |        1 | MANUEL     | MURRELL   |     1 | 0              |
+-------------+----------+------------+-----------+-------+----------------+
10 rows in set (0.00 sec)

In this case, the process needs to reactivate 10 users.

session2 >update sakila.customer set active = 1 where bonus = 1 and active =0 ;
Query OK, 10 rows affected (0.00 sec)
Rows matched: 10  Changed: 10  Warnings: 0
session2 >commit;
Query OK, 0 rows affected (0.00 sec)

All good, right? But before going ahead let’s double-check session 1:

session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 order by last_name;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|         383 |        1 | MARTIN     | BALES     |     1 | 0              |
|         539 |        1 | MATHEW     | BOLIN     |     1 | 0              |
|         441 |        1 | MARIO      | CHEATHAM  |     1 | 0              |
|         482 |        1 | MAURICE    | CRAWLEY   |     1 | 0              |
|         293 |        2 | MAE        | FLETCHER  |     1 | 0              |
|          38 |        1 | MARTHA     | GONZALEZ  |     1 | 0              |
|         444 |        2 | MARCUS     | HIDALGO   |     1 | 0              |
|         252 |        2 | MATTIE     | HOFFMAN   |     1 | 0              |
|         256 |        2 | MABEL      | HOLLAND   |     1 | 0              |
|         226 |        2 | MAUREEN    | LITTLE    |     1 | 0              |
|         588 |        1 | MARION     | OCAMPO    |     1 | 0              |
|         499 |        2 | MARC       | OUTLAW    |     1 | 0              |
|         553 |        1 | MAX        | PITT      |     1 | 0              |
|         312 |        2 | MARK       | RINEHART  |     1 | 0              |
|          80 |        1 | MARILYN    | ROSS      |     1 | 0              |
|         583 |        1 | MARSHALL   | THORN     |     1 | 0              |
|         128 |        1 | MARJORIE   | TUCKER    |     1 | 0              |
|          44 |        1 | MARIE      | TURNER    |     1 | 0              |
|         267 |        1 | MARGIE     | WADE      |     1 | 0              |
|         240 |        1 | MARLENE    | WELCH     |     1 | 0              |
|         413 |        2 | MARVIN     | YEE       |     1 | 0              |
+-------------+----------+------------+-----------+-------+----------------+
21 rows in set (0.08 sec)

Perfect! My repeatable read still sees the same snapshot. Let me apply the changes:

session1 >update sakila.customer set activate_bonus=1 where bonus = 1 and active =1 ;
Query OK, 31 rows affected (0.06 sec)
Rows matched: 31  Changed: 31  Warnings: 0

Wait, what? My list reports 21 entries, but I have modified 31! And if I check the cost:

session1 >SELECT
    ->     SUM(amount) income,
    ->     SUM(amount) * 0.90 income_with_bonus,
    ->    (SUM(amount) - (SUM(amount) * 0.90)) loss_because_bonus
    -> FROM
    ->     sakila.customer AS c
    ->         JOIN
    ->     sakila.payment AS p ON c.customer_id = p.customer_id
    ->     where active = 1 and bonus =1 ;
+---------+-------------------+--------------------+
| income  | income_with_bonus | loss_because_bonus |
+---------+-------------------+--------------------+
| 3754.01 |         3378.6090 |           375.4010 |
+---------+-------------------+--------------------+

Well, now the cost of this operation is 375 dollars, not 242. In this case we are talking about peanuts, but imagine what could happen if we did something similar across thousands of users. 

Anyhow, let us first:

session1 >rollback;
Query OK, 0 rows affected (0.08 sec)

And cancel the operation.

So what happened? Is this a bug?

No, it is not! It is how repeatable reads work in MySQL. The snapshot applies to the read operations only; given the absence of locks at the moment of the reads, another session is still able to write, and the subsequent update operation will touch any row in the table matching its conditions. 

As shown above, this can be very dangerous. But only if you don’t do the right things in the code. 

How Can I Prevent This From Happening?

When coding, hope for the best and plan for the worst, always! Especially when dealing with databases. That approach may save you from spending nights trying to fix the impossible. 

So how can this be prevented? You have three options, all simple, each with positive and negative consequences.

  1. Add a simple clause to the select statement: FOR SHARE
  2. Add another simple clause to the select statement: FOR UPDATE
  3. Use READ-COMMITTED

Solution one is easy and clean, with no other change in the code, BUT it creates locks; if your application is lock-sensitive, this may be an issue for you.

Solution two is also easy, but it locks more aggressively. 

On the other hand, solution three does not add locks, but it requires modifications in the code and still leaves some space for problems.

Let us see them in detail.

Solution One

Let us repeat the same steps:

session1 >set transaction_isolation = 'REPEATABLE-READ';
session1 >Start Transaction;
session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 order by last_name for share;

Now we check the locks (for brevity I have cut some entries and kept a few as a sample):

session3 >select index_name, lock_type, lock_mode,lock_data,thread_id from performance_schema.data_locks where
    -> object_schema = 'sakila'
    -> and object_name = 'customer'
    -> and lock_type = 'RECORD'
    -> and thread_id = 17439
    -> order by index_name, lock_data desc;
+------------+-----------+---------------+------------------------+-----------+
| index_name | lock_type | lock_mode     | lock_data              | thread_id |
+------------+-----------+---------------+------------------------+-----------+
| idx_bonus  | RECORD    | S             | supremum pseudo-record |     17439 |
| idx_bonus  | RECORD    | S             | 1, 9                   |     17439 |
<snip>
| idx_bonus  | RECORD    | S             | 1, 128                 |     17439 |
| idx_bonus  | RECORD    | S             | 1, 1                   |     17439 |
| PRIMARY    | RECORD    | S,REC_NOT_GAP | 9                      |     17439 |
<snip>
| PRIMARY    | RECORD    | S,REC_NOT_GAP | 1                      |     17439 |
+------------+-----------+---------------+------------------------+-----------+
63 rows in set (0.00 sec)

This time we can see that the SELECT has acquired a number of locks, all in S (shared) mode. 

For brevity, I am not repeating the results, which are exactly the same as in the first exercise.

If we go ahead and try with session two:

session2 >set transaction_isolation = 'READ-COMMITTED';
session2 >Start Transaction;
session2 >SELECT * FROM sakila.customer where bonus = 1 and active =0 ;
session2 >update sakila.customer set active = 1 where bonus = 1 and active =0 ;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Here we go! The attempt to change the values in the customer table is now on hold, waiting to acquire the lock. If the wait exceeds innodb_lock_wait_timeout, the execution is interrupted and the application must have a try-catch mechanism to retry the operation. This last one is a best practice that you should already have in place in your code; if not, well, add it now!

At this point, session one can proceed and complete the operations. After that and before the final commit, we will be able to observe:

+------------+-----------+---------------+------------------------+-----------+
| index_name | lock_type | lock_mode     | lock_data              | thread_id |
+------------+-----------+---------------+------------------------+-----------+
| idx_bonus  | RECORD    | S             | supremum pseudo-record |     17439 |
| idx_bonus  | RECORD    | X             | supremum pseudo-record |     17439 |
| idx_bonus  | RECORD    | S             | 1, 9                   |     17439 |
| idx_bonus  | RECORD    | X             | 1, 9                   |     17439 |
<snip>
| idx_bonus  | RECORD    | X             | 1, 1                   |     17439 |
| idx_bonus  | RECORD    | S             | 1, 1                   |     17439 |
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 9                      |     17439 |
| PRIMARY    | RECORD    | S,REC_NOT_GAP | 9                      |     17439 |
<snip>|
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 1                      |     17439 |
| PRIMARY    | RECORD    | S,REC_NOT_GAP | 1                      |     17439 |
+------------+-----------+---------------+------------------------+-----------+
126 rows in set (0.00 sec)

As you can see we now have two different lock types, the Shared one from the select, and the exclusive lock (X) from the update.

Solution Two

In this case, we just change the kind of lock we set on the select. In solution one we opt for a shared lock, which allows other sessions to eventually acquire another shared lock. Here we go for an exclusive lock that will put all the other operations on hold.

session1 >set transaction_isolation = 'REPEATABLE-READ';
session1 >Start Transaction;
session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 order by last_name for UPDATE;

If now we check the locks (again, for brevity I have cut some entries and kept a few as a sample):

session3 >select index_name, lock_type, lock_mode,lock_data,thread_id from performance_schema.data_locks where
    -> object_schema = 'sakila'
    -> and object_name = 'customer'
    -> and lock_type = 'RECORD'
    -> and thread_id = 17439
    -> order by index_name, lock_data desc;
+------------+-----------+---------------+------------------------+-----------+
| index_name | lock_type | lock_mode     | lock_data              | thread_id |
+------------+-----------+---------------+------------------------+-----------+
| idx_bonus  | RECORD    | X             | supremum pseudo-record |     17439 |
| idx_bonus  | RECORD    | X             | 1, 9                   |     17439 |
<snip>
| idx_bonus  | RECORD    | X             | 1, 128                 |     17439 |
| idx_bonus  | RECORD    | X             | 1, 1                   |     17439 |
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 9                      |     17439 |
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 80                     |     17439 |
<snip>
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 128                    |     17439 |
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 1                      |     17439 |
+------------+-----------+---------------+------------------------+-----------+
63 rows in set (0.09 sec)

In this case, the kind of lock is X as exclusive, so the other operations when requesting a lock must wait for this transaction to complete.

session2 >#set transaction_isolation = 'READ-COMMITTED';
session2 >
session2 >Start Transaction;
Query OK, 0 rows affected (0.06 sec)

session2 >SELECT * FROM sakila.customer where bonus = 1 and active =0 for update;

And it will wait until either innodb_lock_wait_timeout expires or session one commits.

Note the lock request on hold:

+------------+-----------+---------------+------------------------+-----------+
| index_name | lock_type | lock_mode     | lock_data              | thread_id |
+------------+-----------+---------------+------------------------+-----------+
| idx_bonus  | RECORD    | X             | supremum pseudo-record |     17439 |
<snip>
| idx_bonus  | RECORD    | X             | 1, 128                 |     17439 |
| idx_bonus  | RECORD    | X             | 1, 1                   |     17430 |<--
| idx_bonus  | RECORD    | X             | 1, 1                   |     17439 |
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 9                      |     17439 |
<snip>
| PRIMARY    | RECORD    | X,REC_NOT_GAP | 1                      |     17439 |
+------------+-----------+---------------+------------------------+-----------+
64 rows in set (0.09 sec)

Session two is trying to acquire an exclusive lock on idx_bonus but cannot proceed.

Once session one goes ahead, we have:

session1 >update sakila.customer set activate_bonus=1 where bonus = 1 and active =1 ;
Query OK, 0 rows affected (0.06 sec)
Rows matched: 21  Changed: 0  Warnings: 0

session1 >
session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 ;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|          38 |        1 | MARTHA     | GONZALEZ  |     1 | 1              |
|          44 |        1 | MARIE      | TURNER    |     1 | 1              |
|          80 |        1 | MARILYN    | ROSS      |     1 | 1              |
|         128 |        1 | MARJORIE   | TUCKER    |     1 | 1              |
|         226 |        2 | MAUREEN    | LITTLE    |     1 | 1              |
|         240 |        1 | MARLENE    | WELCH     |     1 | 1              |
|         252 |        2 | MATTIE     | HOFFMAN   |     1 | 1              |
|         256 |        2 | MABEL      | HOLLAND   |     1 | 1              |
|         267 |        1 | MARGIE     | WADE      |     1 | 1              |
|         293 |        2 | MAE        | FLETCHER  |     1 | 1              |
|         312 |        2 | MARK       | RINEHART  |     1 | 1              |
|         383 |        1 | MARTIN     | BALES     |     1 | 1              |
|         413 |        2 | MARVIN     | YEE       |     1 | 1              |
|         441 |        1 | MARIO      | CHEATHAM  |     1 | 1              |
|         444 |        2 | MARCUS     | HIDALGO   |     1 | 1              |
|         482 |        1 | MAURICE    | CRAWLEY   |     1 | 1              |
|         499 |        2 | MARC       | OUTLAW    |     1 | 1              |
|         539 |        1 | MATHEW     | BOLIN     |     1 | 1              |
|         553 |        1 | MAX        | PITT      |     1 | 1              |
|         583 |        1 | MARSHALL   | THORN     |     1 | 1              |
|         588 |        1 | MARION     | OCAMPO    |     1 | 1              |
+-------------+----------+------------+-----------+-------+----------------+
21 rows in set (0.08 sec)

session1 >SELECT      SUM(amount) income,     SUM(amount) * 0.90 income_with_bonus,    (SUM(amount) - (SUM(amount) * 0.90)) loss_because_bonus FROM     sakila.customer AS c         JOIN     sakila.payment AS p ON c.customer_id = p.customer_id     where active = 1 and bonus =1;
+---------+-------------------+--------------------+
| income  | income_with_bonus | loss_because_bonus |
+---------+-------------------+--------------------+
| 2416.23 |         2174.6070 |           241.6230 |
+---------+-------------------+--------------------+
1 row in set (0.06 sec)

Now my update matches the expectations and the other operations are on hold.

After session one commits:

session2 >SELECT * FROM sakila.customer where bonus = 1 and active =0 for update;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|           1 |        1 | MARY       | SMITH     |     1 | 1              |
|           7 |        1 | MARIA      | MILLER    |     1 | 1              |
|           9 |        2 | MARGARET   | MOORE     |     1 | 1              |
|         178 |        2 | MARION     | SNYDER    |     1 | 1              |
|         236 |        1 | MARCIA     | DEAN      |     1 | 1              |
|         246 |        1 | MARIAN     | MENDOZA   |     1 | 1              |
|         254 |        2 | MAXINE     | SILVA     |     1 | 1              |
|         257 |        2 | MARSHA     | DOUGLAS   |     1 | 1              |
|         323 |        2 | MATTHEW    | MAHAN     |     1 | 1              |
|         408 |        1 | MANUEL     | MURRELL   |     1 | 1              |
+-------------+----------+------------+-----------+-------+----------------+
10 rows in set (52.72 sec) <-- Note the time!

session2 >update sakila.customer set active = 1 where bonus = 1 and active =0 ;
Query OK, 10 rows affected (0.06 sec)
Rows matched: 10  Changed: 10  Warnings: 0

Session two is able to complete, BUT it was on hold for 52 seconds waiting for session one. As said, this solution is a good one if you can afford locks and waiting time.

Solution Three

In this case, we will use a different isolation model that will allow session one to see what session two has modified.

session1 >set transaction_isolation = 'READ-COMMITTED';
session1 >Start Transaction;
session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 order by last_name;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|         383 |        1 | MARTIN     | BALES     |     1 | 0              |
|         539 |        1 | MATHEW     | BOLIN     |     1 | 0              |
|         441 |        1 | MARIO      | CHEATHAM  |     1 | 0              |
|         482 |        1 | MAURICE    | CRAWLEY   |     1 | 0              |
|         293 |        2 | MAE        | FLETCHER  |     1 | 0              |
|          38 |        1 | MARTHA     | GONZALEZ  |     1 | 0              |
|         444 |        2 | MARCUS     | HIDALGO   |     1 | 0              |
|         252 |        2 | MATTIE     | HOFFMAN   |     1 | 0              |
|         256 |        2 | MABEL      | HOLLAND   |     1 | 0              |
|         226 |        2 | MAUREEN    | LITTLE    |     1 | 0              |
|         588 |        1 | MARION     | OCAMPO    |     1 | 0              |
|         499 |        2 | MARC       | OUTLAW    |     1 | 0              |
|         553 |        1 | MAX        | PITT      |     1 | 0              |
|         312 |        2 | MARK       | RINEHART  |     1 | 0              |
|          80 |        1 | MARILYN    | ROSS      |     1 | 0              |
|         583 |        1 | MARSHALL   | THORN     |     1 | 0              |
|         128 |        1 | MARJORIE   | TUCKER    |     1 | 0              |
|          44 |        1 | MARIE      | TURNER    |     1 | 0              |
|         267 |        1 | MARGIE     | WADE      |     1 | 0              |
|         240 |        1 | MARLENE    | WELCH     |     1 | 0              |
|         413 |        2 | MARVIN     | YEE       |     1 | 0              |
+-------------+----------+------------+-----------+-------+----------------+
21 rows in set (0.08 sec)

Same result as before. Now let us execute commands in session two:

session2 >set transaction_isolation = 'READ-COMMITTED';
Query OK, 0 rows affected (0.00 sec)

session2 >
session2 >Start Transaction;
Query OK, 0 rows affected (0.00 sec)

session2 >SELECT * FROM sakila.customer where bonus = 1 and active =0 ;
+-------------+----------+------------+-----------+-------+----------------+
| customer_id | store_id | first_name | last_name | bonus | activate_bonus |
+-------------+----------+------------+-----------+-------+----------------+
|           1 |        1 | MARY       | SMITH     |     1 | 0              |
|           7 |        1 | MARIA      | MILLER    |     1 | 0              |
|           9 |        2 | MARGARET   | MOORE     |     1 | 0              |
|         178 |        2 | MARION     | SNYDER    |     1 | 0              |
|         236 |        1 | MARCIA     | DEAN      |     1 | 0              |
|         246 |        1 | MARIAN     | MENDOZA   |     1 | 0              |
|         254 |        2 | MAXINE     | SILVA     |     1 | 0              |
|         257 |        2 | MARSHA     | DOUGLAS   |     1 | 0              |
|         323 |        2 | MATTHEW    | MAHAN     |     1 | 0              |
|         408 |        1 | MANUEL     | MURRELL   |     1 | 0              |
+-------------+----------+------------+-----------+-------+----------------+
10 rows in set (0.00 sec)
session2 >update sakila.customer set active = 1 where bonus = 1 and active =0 ;
Query OK, 10 rows affected (0.00 sec)
Rows matched: 10  Changed: 10  Warnings: 0

session2 >commit;
Query OK, 0 rows affected (0.00 sec)

All seems the same, but if we check again in session one:

session1 >SELECT * FROM sakila.customer where bonus = 1 and active =1 order by last_name;
+-------------+----------+------------+-----------+------+----------------+
| customer_id | store_id | first_name | last_name |bonus | activate_bonus |
+-------------+----------+------------+-----------+------+----------------+
|         383 |        1 | MARTIN     | BALES     |    1 | 1              |
|         539 |        1 | MATHEW     | BOLIN     |    1 | 1              |
|         441 |        1 | MARIO      | CHEATHAM  |    1 | 1              |
|         482 |        1 | MAURICE    | CRAWLEY   |    1 | 1              |
|         236 |        1 | MARCIA     | DEAN      |    1 | 0              |
|         257 |        2 | MARSHA     | DOUGLAS   |    1 | 0              |
|         293 |        2 | MAE        | FLETCHER  |    1 | 1              |
|          38 |        1 | MARTHA     | GONZALEZ  |    1 | 1              |
|         444 |        2 | MARCUS     | HIDALGO   |    1 | 1              |
|         252 |        2 | MATTIE     | HOFFMAN   |    1 | 1              |
|         256 |        2 | MABEL      | HOLLAND   |    1 | 1              |
|         226 |        2 | MAUREEN    | LITTLE    |    1 | 1              |
|         323 |        2 | MATTHEW    | MAHAN     |    1 | 0              |
|         246 |        1 | MARIAN     | MENDOZA   |    1 | 0              |
|           7 |        1 | MARIA      | MILLER    |    1 | 0              |
|           9 |        2 | MARGARET   | MOORE     |    1 | 0              |
|         408 |        1 | MANUEL     | MURRELL   |    1 | 0              |
|         588 |        1 | MARION     | OCAMPO    |    1 | 1              |
|         499 |        2 | MARC       | OUTLAW    |    1 | 1              |
|         553 |        1 | MAX        | PITT      |    1 | 1              |
|         312 |        2 | MARK       | RINEHART  |    1 | 1              |
|          80 |        1 | MARILYN    | ROSS      |    1 | 1              |
|         254 |        2 | MAXINE     | SILVA     |    1 | 0              |
|           1 |        1 | MARY       | SMITH     |    1 | 0              |
|         178 |        2 | MARION     | SNYDER    |    1 | 0              |
|         583 |        1 | MARSHALL   | THORN     |    1 | 1              |
|         128 |        1 | MARJORIE   | TUCKER    |    1 | 1              |
|          44 |        1 | MARIE      | TURNER    |    1 | 1              |
|         267 |        1 | MARGIE     | WADE      |    1 | 1              |
|         240 |        1 | MARLENE    | WELCH     |    1 | 1              |
|         413 |        2 | MARVIN     | YEE       |    1 | 1              |
+-------------+----------+------------+-----------+------+----------------+
31 rows in set (0.09 sec)

We now can see all the 31 customers and also the value calculation:

+---------+-------------------+--------------------+
| income  | income_with_bonus | loss_because_bonus |
+---------+-------------------+--------------------+
| 3754.01 |         3378.6090 |           375.4010 |
+---------+-------------------+--------------------+

It is reporting the right values.

At this point, we can add logic to the process to check whether the changes are within an acceptable tolerance, and have the process either complete the update operations or stop and exit.

As mentioned, this requires additional coding and more logic.

To be fully honest, this solution still leaves space for some possible interference, but at least it allows the application to be informed about what is happening. Still, it should be used only if you cannot afford a higher level of locking, but that is another story/article.

Conclusion

When writing applications that interact with an RDBMS, you must be very careful in what you do and HOW you do it. While data-access layers such as Object Relational Mappers (ORMs) seem to make your life easier, in reality you may lose control of a few crucial aspects of the application’s interaction with the database. So be very careful when choosing how to access your data.

About the case reported above, we can summarize a few pitfalls:

  1. First and most important, never ever run processes that may interfere with one another at the same time. Be very careful when you design them, and even more careful when planning their executions.
  2. Use options such as FOR SHARE or FOR UPDATE in your select statements. Use them carefully to avoid useless locks, but use them.
  3. If you have a long process that modifies data, need to be able to check whether other processes have altered the status of that data, and at the same time cannot afford long lock waits, use a mix: set READ-COMMITTED as the isolation level so your application can check, and add FOR SHARE or FOR UPDATE to the select statement immediately before the DML and the commit (a minimal sketch follows this list). That allows you to prevent writes while you are finalizing, and also to significantly reduce the locking time.
  4. Keep in mind that long-running processes and long-running transactions can be the source of a lot of pain, especially when using REPEATABLE-READ. Whenever you can, split the operations into small chunks.
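A minimal sketch of the mix described in point 3, reusing the table from this article:

SET transaction_isolation = 'READ-COMMITTED';
START TRANSACTION;
-- Nonlocking read: inspect the current state without blocking other sessions
SELECT customer_id, active, bonus FROM sakila.customer WHERE bonus = 1 AND active = 1;
-- ... application-side validation of the result set goes here ...
-- Lock the rows only immediately before the DML, keeping the locking window short
SELECT customer_id FROM sakila.customer WHERE bonus = 1 AND active = 1 FOR UPDATE;
UPDATE sakila.customer SET activate_bonus = 1 WHERE bonus = 1 AND active = 1;
COMMIT;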

Finally, when developing, remember that DBAs are friends, and are there to help you to do things at your best. It may seem they are giving you a hard time, but that is because their point of view is different, focusing on data consistency, availability, and durability. But they can help you to save a lot of time after the code is released, especially when you try to identify why something has gone wrong. 

So involve them early in the design phase and make them part of the process.

References

https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html

https://dev.mysql.com/doc/refman/8.0/en/innodb-consistent-read.html

https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-model.html

 

Complete the 2021 Percona Open Source Data Management Software Survey

Have Your Say!
