Oct
24
2018
--

PostgreSQL locking, part 2: heavyweight locks

Locking in PostgreSQL

Locking in PostgreSQLPostgreSQL locking visibility for application developers and DBAs is in most cases related to heavyweight locks. Complex database locking operations require full instrumentation using views from the system catalog. It should be clear which object is locked by a specific database “backend” process. An alternative name for any lock is “bottleneck”. In order to make database operations parallel we should split a single bottleneck into multiple operation specific tasks.

This is the second part of three blog posts related to table level locks. The previous post was about row-level locks, and a subsequent post reviews the latches protecting internal database structures.

Example environment

A simple table with several rows:

CREATE TABLE locktest (c INT);
INSERT INTO locktest VALUES (1), (2);

Helper view

In order to check different types of these locks, let’s create a helper view as suggested by Bruce Momjian in his presentation:

CREATE VIEW lockview AS SELECT pid, virtualtransaction AS vxid, locktype AS lock_type,
mode AS lock_mode, granted,
CASE
WHEN virtualxid IS NOT NULL AND transactionid IS NOT NULL
THEN virtualxid || ' ' || transactionid
WHEN virtualxid::text IS NOT NULL
THEN virtualxid
ELSE transactionid::text                                                                                                                                                                                           END AS xid_lock, relname,
page, tuple, classid, objid, objsubid
FROM pg_locks LEFT OUTER JOIN pg_class ON (pg_locks.relation = pg_class.oid)
WHERE -- do not show our view’s locks
pid != pg_backend_pid() AND
-- no need to show self-vxid locks                                                                                                                                                                                 virtualtransaction IS DISTINCT FROM virtualxid
-- granted is ordered earlier
ORDER BY 1, 2, 5 DESC, 6, 3, 4, 7;

RowShareLock (ROW SHARE)

Many applications use the read-modify-write paradigm. For instance, the application fetches single object fields from a table, modifies the data, and saves the changes back to the database. In a multi-user environment, different users could modify the same rows in the course of this transaction. We can get inconsistent data with just a plain select. In response to user demands, almost all SQL databases have SELECT … FOR SHARE locking. This feature prevents the application entity from making data modifications until the locker transaction commits or rolls back.

For example:

  1. There is a user with multiple bank accounts stored in an accounts table, with total_amount stored in a bank_clients table.
  2. In order to update the total_amount field, we should prevent modification of all rows related to the specific bank client.
  3. It would be better to use a single update statement to calculate total_amount and select it from the accounts table. If the update requires external data, or some action from the user, then several statements are required
START TRANSACTION;
SELECT * FROM accounts WHERE client_id = 55 FOR SHARE;
SELECT * FROM bank_clients WHERE client_id=55 FOR UPDATE;
UPDATE bank_clients SET total_amount=38984.33, client_status='gold' WHERE client_id=55;
COMMIT;

The SELECT FOR SHARE statement creates a “RowShareLock” lock on the relation locktest.

Here’s exactly the same lock created with an SQL statement:

BEGIN;
LOCK TABLE locktest IN ROW SHARE MODE;

A single heavyweight RowShareLock is required regardless of the number of rows locked by the query.

This is illustrated with an unfinished transaction in the following example. Start the unfinished transaction, and select from lockview in a second connection to the database:

BEGIN;
SELECT * FROM locktest FOR SHARE;
-- In second connection:
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |   lock_mode   | granted | xid_lock | relname
-------+------+---------------+---------------+---------+----------+----------
 21144 | 3/13 | transactionid | ExclusiveLock | t       | 586      |
 21144 | 3/13 | relation      | RowShareLock  | t       |          | locktest

RowExclusiveLock (ROW EXCLUSIVE)

Real queries that modify rows also require heavyweight locks on tables, one per table.

The next example uses a DELETE query, but an UPDATE will have the same effect.

All commands that modify data in a table obtain a ROW EXCLUSIVE lock.

BEGIN;
DELETE FROM locktest;
-- second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |    lock_mode     | granted | xid_lock | relname
-------+------+---------------+------------------+---------+----------+----------
 10997 | 3/6  | transactionid | ExclusiveLock    | t       | 589      |
 10997 | 3/6  | relation      | RowExclusiveLock | t       |          | locktest

This new lock is incompatible with the previous FOR SHARE example.

SELECT * FROM locktest FOR SHARE

  waits for the delete transaction to finish or abort:

postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname,page,tuple FROM lockview;
  pid  | vxid |   lock_type   |    lock_mode     | granted | xid_lock | relname  | page | tuple
-------+------+---------------+------------------+---------+----------+----------+------+-------
 10997 | 3/6  | transactionid | ExclusiveLock    | t       | 589      |          |      |
 10997 | 3/6  | relation      | RowExclusiveLock | t       |          | locktest |      |
 11495 | 5/9  | relation      | RowShareLock     | t       |          | locktest |      |
 11495 | 5/9  | tuple         | RowShareLock     | t       |          | locktest |    0 |     1
 11495 | 5/9  | transactionid | ShareLock        | f       | 589      |          |      |

Queries modifying table content also lock all indexes, even if the index does not contain modified fields.

-- preparation
CREATE INDEX c_idx2 ON locktest (c);
ALTER TABLE locktest ADD COLUMN c2 INT;
CREATE INDEX c2_idx ON locktest(c2);
-- unfinished example transaction
BEGIN;
UPDATE locktest SET c=3 WHERE c=1;
-- second connection
postgres=# SELECT * FROM lockview;
 pid  |  vxid  | lock_type  |    lock_mode     | granted | xid_lock | relname  | page | tuple | classid | objid | objsubid
------+--------+------------+------------------+---------+----------+----------+------+-------+---------+-------+----------
 3998 | 3/7844 | virtualxid | ExclusiveLock    | t       | 3/7844   |          |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | c2_idx   |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | c_idx    |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | c_idx2   |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | locktest |      |       |         |       |

ShareLock (SHARE)

The non-concurrent version of CREATE INDEX prevents table updates, e.g. DROP TABLE or INSERT or DELETE, with ShareLock.

BEGIN;
CREATE INDEX c_idx ON locktest (c);
-- second connection
postgres=# SELECT * FROM lockview;
 pid  |  vxid  |   lock_type   |      lock_mode      | granted | xid_lock | relname  | page | tuple | classid | objid | objsubid
------+--------+---------------+---------------------+---------+----------+----------+------+-------+---------+-------+----------
 3998 | 3/7835 | virtualxid    | ExclusiveLock       | t       | 3/7835   |          |      |       |         |       |
 3998 | 3/7835 | transactionid | ExclusiveLock       | t       | 564      |          |      |       |         |       |
 3998 | 3/7835 | relation      | AccessExclusiveLock | t       |          |          |      |       |         |       |
 3998 | 3/7835 | relation      | ShareLock           | t       |          | locktest |      |       |         |       |

You can execute multiple CREATE INDEX queries in parallel unless the index name is exactly the same. The wait happens on the row lock (ShareLock with “transactionid” type) in the pg_class table.

Note that there is also AccessExclusiveLock lock with type “relation”, but it’s not a table level one.

ShareUpdateExclusiveLock (SHARE UPDATE EXCLUSIVE)

These database maintenance operations need to take a ShareUpdateExclusiveLock:

  • ANALYZE table
  • VACUUM (without full) runs.
  • CREATE INDEX CONCURRENTLY

The 

ANALYZE tablename;

  statement updates table statistics. The query planner/optimizer is able to provide the best plans for query execution only if the statistics are up to date.

BEGIN;
ANALYZE locktest;
-- in second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |        lock_mode         | granted | xid_lock | relname
-------+------+---------------+--------------------------+---------+----------+----------
 10997 | 3/7  | transactionid | ExclusiveLock            | t       | 591      |
 10997 | 3/7  | relation      | ShareUpdateExclusiveLock | t       |          | locktest

There is no conflict between RowExclusiveLock and ShareUpdateExclusiveLock. UPDATE/DELETE/INSERT could still modify rows during ANALYZE.

VACUUM and CREATE INDEX CONCURRENTLY can be executed only outside a transaction. To see the effects of these statements in lockview, execute a conflicting transaction first e.g. run ANALYZE in a transaction, or run VACUUM against a huge table.

CREATE INDEX CONCURRENTLY locking can be confusing. SHARE UPDATE EXCLUSIVE lock does not conflict with a ROW EXCLUSIVE lock that’s used for DELETES, INSERT and UPDATES. Unfortunately, CREATE INDEX CONCURRENTLY waits until active transactions are finished twice due to full table scans:

In a concurrent index build, the index is actually entered into the system catalogs in one transaction, then two table scans occur in two more transactions. Before each table scan, the index build must wait for existing transactions that have modified the table to terminate.” PostgreSQL Documentation

AccessExclusiveLock  (ACCESS EXCLUSIVE)

This lock conflicts with any other locks and is used by these statements:

  • CREATE RULE
  • DROP TABLE
  • DROP INDEX
  • TRUNCATE
  • VACUUM FULL
  • LOCK TABLE (default mode)
  • CLUSTER
  • REINDEX
  • REFRESH MATERIALIZED VIEW (without CONCURRENTLY)
BEGIN;
CREATE RULE r_locktest AS ON INSERT TO locktest DO INSTEAD NOTHING;
-- second connection
postgres=# select pid,vxid,lock_type,lock_mode,granted,xid_lock,relname from lockview;
  pid  | vxid |   lock_type   |      lock_mode      | granted | xid_lock | relname
-------+------+---------------+---------------------+---------+----------+----------
 10997 | 3/19 | transactionid | ExclusiveLock       | t       | 596      |
 10997 | 3/19 | relation      | AccessExclusiveLock | t       |          | locktest

More importantly, drop index requires access exclusive locks for both table and index:

BEGIN;
DROP INDEX c_idx;
-- second connection
postgres=# SELECT * FROM lockview;
 pid  |  vxid  |   lock_type   |      lock_mode      | granted | xid_lock | relname  | page | tuple | classid | objid | objsubid
------+--------+---------------+---------------------+---------+----------+----------+------+-------+---------+-------+----------
 3998 | 3/7839 | virtualxid    | ExclusiveLock       | t       | 3/7839   |          |      |       |         |       |
 3998 | 3/7839 | transactionid | ExclusiveLock       | t       | 569      |          |      |       |         |       |
 3998 | 3/7839 | relation      | AccessExclusiveLock | t       |          | c_idx    |      |       |         |       |
 3998 | 3/7839 | relation      | AccessExclusiveLock | t       |          | locktest |      |       |         |       |

Note: This is the most dangerous type of lock. Avoid running queries requiring access exclusive lock in production, or at least put the application in maintenance mode.

ExclusiveLock

Meanwhile, SQL commands don’t use ExclusiveLock, except for with the general LOCK TABLE statement. This lock prevents all requests except for a non-locking select (i.e. without FOR SHARE/UPDATE).

BEGIN;
LOCK TABLE locktest IN EXCLUSIVE MODE;
-- second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid | lock_type |   lock_mode   | granted | xid_lock | relname
-------+------+-----------+---------------+---------+----------+----------
 10997 | 3/21 | relation  | ExclusiveLock | t       |          | locktest

Savepoints

Savepoint produces an additional ExclusiveLock of transactionid type with new xid value.

BEGIN;
SELECT * FROM locktest FOR SHARE;
SAVEPOINT s1;
SELECT * FROM locktest FOR UPDATE;
-- second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |    lock_mode    | granted | xid_lock | relname
-------+------+---------------+-----------------+---------+----------+----------
 10997 | 3/37 | transactionid | ExclusiveLock   | t       | 602      |
 10997 | 3/37 | transactionid | ExclusiveLock   | t       | 603      |
 10997 | 3/37 | relation      | AccessShareLock | t       |          | c_idx
 10997 | 3/37 | relation      | RowShareLock    | t       |          | locktest

pg_advisory_lock

Sometimes application developers require synchronization between processes. In such systems, the application creates and removes locks frequently. Systems with a row-based locks implementation tend to cause table bloat.

There are many functions related to advisory locks:

  • per session or per transaction
  • wait if lock is not available or immediately return false
  • exclusive or shared
  • 64-bit or two 32-bit integers resource identifiers

Imagine that we have several cron jobs and that the application should prevent simultaneous runs of the same script. Next, each script can check if a lock is available in PostgreSQL for specific integer job identifier:

postgres=# SELECT pg_try_advisory_lock(10);
 pg_try_advisory_lock
----------------------
 t
-- second connection
postgres=# SELECT * FROM lockview;
 pid  | vxid | lock_type |   lock_mode   | granted | xid_lock | relname | page | tuple | classid | objid | objsubid
------+------+-----------+---------------+---------+----------+---------+------+-------+---------+-------+----------
 3998 | 3/0  | advisory  | ExclusiveLock | t       |          |         |      |       |       0 |    10 |        1
-- other connections
SELECT pg_try_advisory_lock(10);
 pg_try_advisory_lock
----------------------
 f

The query produces ExclusiveLock with type advisory.

Deadlocks

Any system with multiple locks tends to have a deadlock situation when queries will never finish. The only way to resolve such issues: kill one of blocked statements. More importantly, deadlock detection is an expensive procedure in PostgreSQL. A check for deadlock only happens when a transaction is locked for deadlock_timeout milliseconds—after one second by default.

Here is an illustration of a deadlock situation for two different connections A and B:

Any deadlock starting from lock wait.

A: BEGIN; SELECT c FROM locktest WHERE c=1 FOR UPDATE;
B: BEGIN; SELECT c FROM locktest WHERE c=2 FOR UPDATE; SELECT c FROM locktest WHERE c=1 FOR UPDATE;

You are not alone with the identification of deadlocks, as the pg_stat_activity system view helps you to find statements and transactions causing lock waits:

postgres=# SELECT pg_stat_activity.pid AS pid,
query, wait_event, vxid, lock_type,
lock_mode, granted, xid_lock
FROM lockview JOIN pg_stat_activity ON (lockview.pid = pg_stat_activity.pid);
  pid  |          query             |  wait_event   | vxid |   lock_type   |      lock_mode      | granted | xid_lock
-------+----------------------------+---------------+------+---------------+---------------------+---------+----------
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | transactionid | ExclusiveLock       | t       | 605
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | advisory      | ExclusiveLock       | t       |
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | relation      | AccessShareLock     | t       |
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | relation      | RowShareLock        | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | transactionid | ExclusiveLock       | t       | 606
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | advisory      | ExclusiveLock       | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | relation      | AccessShareLock     | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | relation      | RowShareLock        | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | tuple         | AccessExclusiveLock | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | transactionid | ShareLock           | f       | 605

The SELECT FOR UPDATE on c=2 row causes a deadlock:

SELECT c FROM locktest WHERE c=2 FOR UPDATE;

Afterwards, PostgreSQL reports in server log:

2018-08-02 08:46:07.793 UTC [10997] ERROR:  deadlock detected
2018-08-02 08:46:07.793 UTC [10997] DETAIL:  Process 10997 waits for ShareLock on transaction 606; blocked by process 11495.
Process 11495 waits for ShareLock on transaction 605; blocked by process 10997.
Process 10997: select c from locktest where c=2 for update;
Process 11495: select c from locktest where c=1 for update;
2018-08-02 08:46:07.793 UTC [10997] HINT:  See server log for query details.
2018-08-02 08:46:07.793 UTC [10997] CONTEXT:  while locking tuple (0,3) in relation "locktest"
2018-08-02 08:46:07.793 UTC [10997] STATEMENT:  SELECT c FROM locktest WHERE c=2 FOR UPDATE;
ERROR:  deadlock detected
DETAIL:  Process 10997 waits for ShareLock on transaction 606; blocked by process 11495.
Process 11495 waits for ShareLock on transaction 605; blocked by process 10997.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (0,3) in relation "locktest"

As you can see, the database server aborts one blocked transaction automatically.

Multi-way deadlocks

Normally there are just two transactions creating deadlocks. However, in complex cases, an application could caused deadlocks with multiple transactions forming a dependency circle.

A: locks row1, B locks row2, C locks row3

Step2

A: trying to get row3, B: trying to get row1, C: trying to get row2

Summary

  • Do not put DDL statements in long transactions.
  • Please avoid DDL on during high load for frequently updated tables
  • CLUSTER command requires exclusive access to the table and all it’s indexes
  • Monitor postgresql log for deadlock-related messages

Photo by shy sol from Pexels

Nov
09
2017
--

MySQL and Linux Context Switches

Context Switches

In this blog post, I’ll look at MySQL and Linux context switches and what is the normal number per second for a database environment.

You might have heard many times about the importance of looking at the number of context switches to indicate if MySQL is suffering from the internal contention issues. I often get the question of what is a “normal” or “acceptable” number, and at what point should you worry about the number of context switches per second?

First, let’s talk about what context switches are in Linux. This StackOverflow Thread provides a good discussion, with a lot of details, but basically it works like this:  

The process (or thread in MySQL’s case) is running its computations. Sooner or later, it has to do some blocking operation: disk IO, network IO, block waiting on a mutex or yield. The execution switches to the other process, and this is called voluntary context switch.On the other hand, the process/thread may need to be preempted by the scheduler because it used an allotted amount of CPU time (and now other tasks need to run) or because it is required to run high priority task. This is called involuntary context switches. When all the process in the system are added together and totaled, this is the system-wide number of context switches reported (using, for example, vmstat):

root@nuc2:~# vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
17  0      0 12935036 326152 2387388    0    0     0     5     0      1  9  0 91  0  0
20  0      0 12933936 326152 2387384    0    0     0     3 32228 124791 77 22  1  0  0
17  0      0 12933348 326152 2387364    0    0     0    11 33212 124575 78 22  1  0  0
16  0      0 12933380 326152 2387364    0    0     0    78 32470 126100 78 22  1  0  0

This is a global number. In many cases, however, it is better to look at it as context switches per CPU logical core. This is because cores execute tasks independently. As such, they have mostly independent causes for context switches. If you have a large number of cores, there can be quite a difference:

MySQL Context Switches

The number of context switches per second on this system looks high (at more than 1,000,000). Considering it has 56 logical cores, however, it is only about 30,000 per second per logical core (which is not too bad).

So how do we judge if the number of context switches is too high in your system? One answer is that it is too high if you’re wasting too much CPU on context switches. This brings up the question: how many context switches can the system handle if it is only doing context switches?

It is easy to find this out!  

Sysbench has a “threads” test designed specifically to measure this. For example:

sysbench --thread-locks=128 --time=7200 --threads=1024 threads run

Check the vmstat output or the Context Switches PMM graph:

MySQL Context Switches 1

We can see this system can handle up to 35 million context switches per second in total (or some 500K per logical CPU core on average).

I don’t recommend using more than 10% of CPU resources on context switching, so I would try to keep the number of the context switches at no more than 50K per logical CPU core.

Now let’s think about context switches from the other side: how many context switches do we expect to have at the very minimum for given load? Even if all the stars align and your query to MySQL doesn’t need any disk IO or context switches due to waiting for mutexes, you should expect at least two context switches: one to the client thread which processes the query and one for the query response sent to the client.    

Using this logic, if we have 100,000 queries/sec we should expect 200,000 context switches at the very minimum.

In the real world, though, I would not worry about contention being a big issue if you have less than ten context switches per query.

It is worth noting that in MySQL not every contention results in a context switch. InnoDB implements its own mutexes and RW-locks, which often try to “spin” to wait for a resource to become available. This wastes CPU time directly rather than doing a context switch.

Summary:

  • Look at the number of context switches per logical core rather than the total for easier-to-compare numbers
  • Find out how many context switches your system can handle per second, and don’t get too concerned if your context switches are no more than 10% of that number
  • Think about the number of context switches per query: the minimum possible is two, and values less than 10 make contention an unlikely issue
  • Not every MySQL contention results in a high number of context switches
Apr
19
2017
--

How We Made Percona XtraDB Cluster Scale

Percona XtraDB Cluster SST Traffic Encryption

Percona XtraDB ClusterIn this blog post, we’ll look at the actions and efforts Percona experts took to scale Percona XtraDB Cluster.

Introduction

When we first started analyzing Percona XtraDB Cluster performance, it was pretty bad. We would see contention even with 16 threads. Performance was even worse with sync binlog=1, although the same pattern was observed even with the binary log disabled. The effect was not only limited to OLTP workloads, as even other workloads (like update-key/non-key) were also affected in a wider sense than OLTP.

That’s when we started analyzing the contention issues and found multiple problems. We will discuss all these problems and the solutions we adapted. But before that, let’s look at the current performance level.

Check this blog post for more details.

The good news is Percona XtraDB Cluster is now optimized to scale well for all scenarios, and the gain is in the range of 3x-10x.

Understanding How MySQL Commits a Transaction

Percona XtraDB Cluster contention is associated mainly with Commit Monitor contention, which comes into the picture during commit time. It is important to understand the context around it.

When a commit is invoked, it proceeds in two phases:

  • Prepare phase: mark the transaction as PREPARE, updating the undo segment to capture the updated state.
    • If bin-log is enabled, redo changes are not persisted immediately. Instead, a batch flush is done during Group Commit Flush stage.
    • If bin-log is disabled, then redo changes are persisted immediately.
  • Commit phase: Mark the transaction commit in memory.
    • If bin-log is enabled, Group Commit optimization kicks in, thereby causing a flush of redo-logs (that persists changes done to db-objects + PREPARE state of transaction) and this action is followed by a flush of the binary logs. Since the binary logs are flushed, redo log capturing of transaction commit doesn’t need to flush immediately (Saving fsync)
    • If bin-log is disabled, redo logs are flushed on completion of the transaction to persist the updated commit state of the transaction.

What is a Monitor in Percona XtraDB Cluster World?

Monitors help maintain transaction ordering. For example, the Commit Monitor ensures that no transaction with a global-seqno greater than the current commit-processing transaction’s global seqno is allowed to proceed.

How Percona XtraDB Cluster Commits a Transaction

Percona XtraDB Cluster follows the existing MySQL semantics of course, but has its own step to commit the transaction in the replication world. There are two important themes:

  1. Apply/Execution of transaction can proceed in parallel
  2. Commit is serialized based on cluster-wide global seqno.

Let’s understand the commit flow with Percona XtraDB Cluster involved (Percona XtraDB Cluster registers wsrep as an additional storage engine for replication).

  • Prepare phase:
    • wsrep prepare: executes two main actions:
      • Replicate the transaction (adding the write-set to group-channel)
      • Entering CommitMonitor. Thereby enforcing ordering of transaction.
    • binlog prepare: nothing significant (for this flow).
    • innobase prepare: mark the transaction in PREPARE state.
      • As discussed above, the persistence of the REDO log depends on if the binlog is enabled/disabled.
  • Commit phase
    • If bin-log is enabled
      • MySQL Group Commit Logic kicks in. The semantics ensure that the order of transaction commit is the same as the order of them getting added to the flush-queue of the group-commit.
    • If bin-log is disabled
      • Normal commit action for all registered storage engines is called with immediate persistence of redo log.
    • Percona XtraDB Cluster then invokes the post_commit hook, thereby releasing the Commit Monitor so that the next transaction can make progress.

With that understanding, let’s look at the problems and solutions:

PROBLEM-1:

Commit Monitor is exercised such that the complete commit operation is serialized. This limits the parallelism associated with the prepare-stage. With log-bin enabled, this is still ok since redo logs are flushed at group-commit flush-stage (starting with 5.7). But if log-bin is disabled, then each commit causes an independent redo-log-flush (in turn probable fsync).

OPTIMIZATION-1:

Split the replication pre-commit hook into two explicit actions: replicate (add write-set to group-channel) + pre-commit (enter commit-monitor).

The replicate action is performed just like before (as part of storage engine prepare). That will help complete the InnoDB prepare action in parallel (exploring much-needed parallelism in REDO flush with log-bin disabled).

On completion of replication, the pre-commit hook is called. That leads to entering the Commit Monitor for enforcing the commit ordering of the transactions. (Note: Replication action assigns the global seqno. So even if a transaction with a higher global seqno finishes the replication action earlier (due to CPU scheduling) than the transaction with a lower global seqno, it will wait in the pre-commit hook.)

Improved parallelism in the innodb-prepare stage helps accelerate log-bin enabled flow, and the same improved parallelism significantly helps in the log-bin disabled case by reducing redo-flush contention, thereby reducing fsyncs.


PROBLEM-2:

MySQL Group Commit already has a concept of ordering transactions based on the order of their addition to the GROUP COMMIT queue (FLUSH STAGE queue to be specific). Commit Monitor enforces the same, making the action redundant but limiting parallelism in MySQL Group Commit Logic (including redo-log flush that is now delayed to the flush stage).

With the existing flow (due to the involvement of Commit Monitor), only one transaction can enter the GROUP COMMIT Queue, thereby limiting optimal use of Group Commit Logic.

OPTIMIZATION-2:

Release the Commit Monitor once the transaction is successfully added to flush-stage of group-commit. MySQL will take it from there to maintain the commit ordering. (We call this interim-commit.)

Releasing the Commit Monitor early helps other transactions to make progress and real MySQL Group Commit Leader-Follower Optimization (batch flushing/sync/commit) comes into play.

This also helps ensure batch REDO log flushing.


PROBLEM-3:

This problem is specific to when the log-bin is disabled. Percona XtraDB Cluster still generates the log-bin, as it needs it for forming a replication write-set (it just doesn’t persist this log-bin information). If disk space is not a constraint, then I would suggest operating Percona XtraDB Cluster with log-bin enabled.

With log-bin disabled, OPTIMIZATION-1 is still relevant, but OPTIMIZATION-2 isn’t, as there is no group-commit protocol involved. Instead, MySQL ensures that the redo-log (capturing state change of transaction) is persisted before reporting COMMIT as a success. As per the original flow, the Commit Monitor is not released till the commit action is complete.

OPTIMIZATION-3:

The transaction is already committed to memory and the state change is captured. This is about persisting the REDO log only (REDO log modification is already captured by mtr_commit). This means we can release the Commit Monitor just before the REDO flush stage kicks in. Correctness is still ensured as the REDO log flush always persists the data sequentially. So even if trx-1 loses its slots before the flush kicks in, and trx-2 is allowed to make progress, trx-2’s REDO log flush ensures that trx-1’s REDO log is also flushed.


Conclusion

With these three main optimizations, and some small tweaks, we have tuned Percona XtraDB Cluster to scale better and made it fast enough for the growing demands of your applications. All of this is available with the recently released Percona XtraDB Cluster 5.7.17-29.20. Give it a try and watch your application scale in a multi-master environment, making Percona XtraDB Cluster the best fit for your HA workloads.

Oct
17
2013
--

InnoDB scalability issues due to tables without primary keys

Each day there is probably work done to improve performance of the InnoDB storage engine and remove bottlenecks and scalability issues. Hence there was another one I wanted to highlight:

Scalability issues due to tables without primary keys

This scalability issue is caused by the usage of tables without primary keys. This issue typically shows itself as contention on the InnoDB dict_sys mutex. Now the dict_sys mutex controls access to the data dictionary. This mutex is used at various places. I will only mention a few of them:

  • During operations such as opening and closing table handles, or
  • When accessing I_S tables, or
  • During undo of a freshly inserted row, or
  • During other data dictionary modification operations such as CREATE TABLE, or
  • Within the “Persistent Stats” subsystem, among other things.

Of course this list is not exhaustive but should give you a good picture of how heavily it is used.

But the thing is when you are mainly debugging contention related to a data dictionary control structure, you start to look off at something that is directly related to data dictionary modifications. You look for execution of CREATE TABLE, DROP TABLE, TRUNCATE TABLE, etc. But what if none of that is actually causing the contention on the dict_sys mutex? Are you aware when generating “row-id” values, for tables without explicit primary keys, or without non-nullable unique keys, dict_sys mutex is acquired. So INSERTs to tables with implicit primary keys is a InnoDB system-wide contention point.

Let’s also take a look at the relevant source code.

Firstly, below is the function that does the row-id allocation which is defined in the file storage/innobase/row/row0ins.cc

3060 /***********************************************************//**
3061 Allocates a row id for row and inits the node->index field. */
3062 UNIV_INLINE
3063 void
3064 row_ins_alloc_row_id_step(
3065 /*======================*/
3066         ins_node_t*     node)   /*!< in: row insert node */ 3067 { 3068         row_id_t        row_id; 3069  3070         ut_ad(node->state == INS_NODE_ALLOC_ROW_ID);
3071
3072         if (dict_index_is_unique(dict_table_get_first_index(node->table))) {
3073
3074                 /* No row id is stored if the clustered index is unique */
3075
3076                 return;
3077         }
3078
3079         /* Fill in row id value to row */
3080
3081         row_id = dict_sys_get_new_row_id();
3082
3083         dict_sys_write_row_id(node->row_id_buf, row_id);
3084 }

Secondly, below is the function that actually generates the row-id which is defined in the file storage/innobase/include/dict0boot.ic

26 /**********************************************************************//**
 27 Returns a new row id.
 28 @return the new id */
 29 UNIV_INLINE
 30 row_id_t
 31 dict_sys_get_new_row_id(void)
 32 /*=========================*/
 33 {
 34         row_id_t        id;
 35
 36         mutex_enter(&(dict_sys->mutex));
 37
 38         id = dict_sys->row_id;
 39
 40         if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
 41
 42                 dict_hdr_flush_row_id();
 43         }
 44
 45         dict_sys->row_id++;
 46
 47         mutex_exit(&(dict_sys->mutex));
 48
 49         return(id);
 50 }

Finally, I would like to share results of a few benchmarks that I conducted in order to show you how this affects performance.

Benchmarking affects of non-presence of primary keys

First off all, let me share information about the host that was used in the benchmarks. I will also share the MySQL version and InnoDB configuration used.

Hardware

The host was a “hi1.4xlarge” Amazon EC2 instance. The instance comes with 16 vCPUs and 60.5GB of memory. The instance storage consists of 2×1024 SSD-backed storage volumes, and the instance is connected to a 10 Gigabit ethernet network. So the IO performance is very decent. I created a RAID 0 array from the 2 instance storage volumes and created XFS filesystem on the resultant software RAID 0 volume. This configuration would allows us to get the best possible IO performance out of the instance.

MySQL

The MySQL version used was 5.5.34 MySQL Community Server, and the InnoDB configuration looked as follows:

innodb-flush-method            = O_DIRECT
innodb-log-files-in-group      = 2
innodb-log-file-size           = 512M
innodb-flush-log-at-trx-commit = 2
innodb-file-per-table          = 1
innodb-buffer-pool-size        = 42G
innodb-buffer-pool-instances   = 8
innodb-io-capacity             = 10000
innodb_adaptive_hash_index     = 1

I conducted two different types of benchmarks, and both of them were done by using sysbench.

First one involved benchmarking the performance of single-row INSERTs for tables with and without explicit primary keys. That’s what I would be showing first.

Single-row INSERTs

The tables were generated as follows for the benchmark involving tables with primary keys:

sysbench --test=/root/sysbench/sysbench/tests/db/insert.lua --oltp-tables-count=64 --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test prepare

This resulted in the following table being created:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB

While the tables without primary keys were generated as follows:

sysbench --test=/root/sysbench/sysbench/tests/db/insert.lua --oltp-tables-count=64 --oltp-table-size=1000000 --oltp-secondary --mysql-table-engine=innodb --mysql-user=root --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test prepare

This resulted in the tables being created with the following structure:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  KEY `xid` (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB

The actual benchmark for the table with primary keys was run as follows:

sysbench --test=/root/sysbench/sysbench/tests/db/insert.lua --oltp-tables-count=64 --oltp-table-size=1000000 --oltp-dist-type=uniform --mysql-table-engine=innodb --mysql-user=root --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test --max-time=300 --num-threads=16 --max-requests=0 --report-interval=1 run

While the actual benchmark for the table without primary keys was run as follows:

sysbench --test=/root/sysbench/sysbench/tests/db/insert.lua --oltp-tables-count=64 --oltp-table-size=1000000 --oltp-secondary --oltp-dist-type=uniform --mysql-table-engine=innodb --mysql-user=root --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test --max-time=300 --num-threads=16 --max-requests=0 --report-interval=1 run

Note that the benchmarks were run with three variations in the number of concurrent threads used by sysbench: 16, 32 and 64.
Below are how the graphs look like for each of these benchmarks.

Writes per second 16 threads
Writes per second 32 threads
Writes per second 64 threads

Some key things to note from the graphs are that the throughput of the INSERTs to the tables without explicit primary keys never goes above 87% of the throughput of the INSERTs to the tables with primary keys defined. Furthermore, as we increase the concurrency downward spikes start appearing. These become more apparent when we move to a concurrency of 64 threads. This is expected, because the contention is supposed to increase as we increase the concurrency of operations that contend on the dict_sys mutex.

Now let’s take a look at how this impacts the bulk load performance.

Bulk Loads

The bulk loads to the tables with primary keys were performed as follows:

sysbench --test=/root/sysbench/sysbench/tests/db/parallel_prepare.lua --oltp-tables-count=64 --oltp-table-size=1000000 --mysql-table-engine=innodb --mysql-user=root --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test --num-threads=16 run

While the bulk loads to the tables without primary keys were performed as follows:

sysbench --test=/root/sysbench/sysbench/tests/db/parallel_prepare.lua --oltp-tables-count=64 --oltp-table-size=1000000 --oltp-secondary --mysql-table-engine=innodb --mysql-user=root --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test --num-threads=16 run

Note that the benchmarks were again run with three variations in the number of concurrent threads used by sysbench: 16, 32 and 64.
Below is what the picture is portrayed by the graph.

Parallel Bulk Loading of Tables

Here again, you can see how the bulk load time increases as we increase the number of concurrent threads. This against points to the increase in contention on the dict_sys mutex. With 16 threads the bulk load time for tables without primary keys is 107% more than the bulk load time for the tables with primary keys. This increases to 116% with 32 threads and finally 124% with 64 threads.

Conclusion

Tables without primary keys cause a wide range of contention because they rely on acquiring dict_sys mutex to generate row-id values. This mutex is used at critical places within InnoDB. Hence the affect of large amount of INSERTs to tables without primary keys is not only isolated to that table alone but can be seen very widely. There are a number of times I have seen tables without primary keys being used in many different scenarios that include simple INSERTs to these tables as well as multi-row INSERTs as a result of, for example, INSERT … SELECT into a table that is being temporarily created. The advice is always to have primary keys present in your tables. Hopefully I have been able to highlight the true impact non-presence of primary keys can have.

The post InnoDB scalability issues due to tables without primary keys appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com