Aug 06, 2018

Basic Understanding of Bloat and VACUUM in PostgreSQL

Implementation of MVCC (Multi-Version Concurrency Control) in PostgreSQL is different and special when compared with other RDBMSs. MVCC in PostgreSQL controls which tuples are visible to transactions via versioning.

What is versioning in PostgreSQL?

Let’s consider the case of an Oracle or a MySQL database. What happens when you perform a DELETE or an UPDATE of a row? You see an UNDO record maintained in a global UNDO segment. This UNDO segment contains the past image of the row, to help the database achieve consistency (the “C” in A.C.I.D). For example, if there is an old transaction that depends on the row that got deleted, the row may still be visible to it because the past image is still maintained in the UNDO. If you are an Oracle DBA reading this blog post, you may quickly recall the error ORA-01555: snapshot too old. What this error means is that you may have a small undo_retention setting, or an UNDO segment that is not large enough to retain all the past images (versions) needed by the existing or old transactions.

You may not have to worry about that with PostgreSQL.

Then how does PostgreSQL manage UNDO?

In simple terms, PostgreSQL maintains both the past image and the latest image of a row in its own table. In other words, UNDO is maintained within each table, and this is done through versioning. You may now guess that every row of a PostgreSQL table carries version information, and that is absolutely correct. In order to understand how these versions are maintained within each table, you should understand the hidden columns of a table (especially xmin) in PostgreSQL.

Understanding the Hidden Columns of a Table

When you describe a table, you would only see the columns you have added, like you see in the following log.

percona=# \d scott.employee
                                          Table "scott.employee"
  Column  |          Type          | Collation | Nullable |                    Default
----------+------------------------+-----------+----------+------------------------------------------------
 emp_id   | integer                |           | not null | nextval('scott.employee_emp_id_seq'::regclass)
 emp_name | character varying(100) |           |          |
 dept_id  | integer                |           |          |

However, if you look at all the columns of the table in pg_attribute, you should see several hidden columns as you see in the following log.

percona=# SELECT attname, format_type (atttypid, atttypmod)
FROM pg_attribute
WHERE attrelid::regclass::text='scott.employee'
ORDER BY attnum;
 attname  |      format_type
----------+------------------------
 tableoid | oid
 cmax     | cid
 xmax     | xid
 cmin     | cid
 xmin     | xid
 ctid     | tid
 emp_id   | integer
 emp_name | character varying(100)
 dept_id  | integer
(9 rows)

Let’s understand a few of these hidden columns in detail.

tableoid : Contains the OID of the table that contains this row. Used by queries that select from inheritance hierarchies.
More details on table inheritance can be found here : https://www.postgresql.org/docs/10/static/ddl-inherit.html
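
If you want to look at any of these hidden columns, you can simply select them by name. For example, the following minimal sketch (against the scott.employee table used in this post) shows each row's tableoid and casts it to regclass so that the table name is displayed instead of a bare OID:

SELECT tableoid, tableoid::regclass AS table_name, emp_id, emp_name
FROM scott.employee
LIMIT 3;
-- tableoid::regclass resolves to the table name (schema-qualified or not,
-- depending on search_path).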

xmin : The transaction ID (xid) of the inserting transaction for this row version. Upon update, a new row version is inserted. Let’s look at the following log to understand xmin better.

percona=# select txid_current();
 txid_current
--------------
          646
(1 row)
percona=# INSERT into scott.employee VALUES (9,'avi',9);
INSERT 0 1
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 9;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  647 |    0 |    0 |    0 |      9 | avi      |       9
(1 row)

As you can see in the above log, the transaction ID was 646 for the command select txid_current(). Thus, the immediately following INSERT statement got the transaction ID 647, and the record was assigned an xmin of 647. In other words, transactions that were already running with a txid lower than 647 cannot see the row inserted by transaction 647.

With the above example, you should now understand that every tuple has an xmin that is assigned the txid that inserted it.

Note: the behavior may change depending on the isolation level you choose; this will be discussed in another blog post.
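
To make the visibility rule concrete, here is a minimal sketch using two sessions (the transaction IDs shown in the comments are hypothetical). The older session uses REPEATABLE READ so that its snapshot is taken before the insert; under the default READ COMMITTED level, it would see the new row as soon as the inserting transaction commits:

-- Terminal A : start an older transaction and take a snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();                      -- say this returns 646
SELECT count(*) FROM scott.employee;        -- the snapshot is taken here

-- Terminal B : insert a row; it is stamped with xmin = 647
INSERT INTO scott.employee VALUES (11,'avi',11);

-- Terminal A : the row inserted by transaction 647 is not visible to this snapshot
SELECT * FROM scott.employee WHERE emp_id = 11;   -- returns 0 rows
COMMIT;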

xmax : This value is 0 if the row version has not been deleted. Before the DELETE is committed, the xmax of the row version changes to the ID of the transaction that issued the DELETE. Let’s observe the following log to understand that better.

On Terminal A : We open a transaction and delete a row without committing it.

percona=# BEGIN;
BEGIN
percona=# select txid_current();
 txid_current
--------------
          655
(1 row)
percona=# DELETE from scott.employee where emp_id = 10;
DELETE 1

On Terminal B : Observe the xmax values before and after the delete (that has not been committed).

Before the Delete
------------------
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  649 |    0 |    0 |    0 |     10 | avi      |      10
After the Delete
------------------
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  649 |  655 |    0 |    0 |     10 | avi      |      10
(1 row)

As you can see in the above logs, the xmax value changed to the transaction ID that issued the DELETE. If you issue a ROLLBACK, or if the transaction gets aborted, xmax remains at the transaction ID that tried to DELETE the row (655 in this case).
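
You can verify this yourself with a quick sketch, continuing the two-terminal example above (so transaction 655 is the one that attempted the DELETE):

-- Terminal A : abandon the delete
ROLLBACK;

-- Terminal B : the row is visible again, but its xmax still shows 655,
-- because transaction 655 is now known to have aborted
SELECT xmin, xmax, emp_id FROM scott.employee WHERE emp_id = 10;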

Now that we understand the hidden columns xmin and xmax, let’s observe what happens after a DELETE or an UPDATE in PostgreSQL. As the hidden columns suggest, multiple versions of a row are maintained within each table. Let’s look at the following example to understand this better.

We’ll insert 10 records into the table scott.employee.

percona=# INSERT into scott.employee VALUES (generate_series(1,10),'avi',1);
INSERT 0 10

Now, let’s DELETE 5 records from the table.

percona=# DELETE from scott.employee where emp_id > 5;
DELETE 5
percona=# select count(*) from scott.employee;
 count
-------
     5
(1 row)

Now, when you check the count after the DELETE, you do not see the records that have been deleted. To see the row versions that still exist in the table but are no longer visible, we have an extension called pageinspect. The pageinspect module provides functions that allow you to inspect the contents of database pages at a low level, which is useful for debugging purposes. Let’s create this extension to see the older row versions that have been deleted.

percona=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |              tuple_data_split
--------+--------+---------------------------------------------
    668 |      0 | {"\\x01000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x02000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x03000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x04000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x05000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x06000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x07000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x08000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x09000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x0a000000","\\x09617669","\\x01000000"}
(10 rows)

We can still see 10 records in the table even after deleting 5 of them. Also, you can observe here that t_xmax is set to the transaction ID that deleted them. These deleted records are retained in the same table to serve any older transactions that are still accessing them.

We’ll take a look at what an UPDATE does in the following log.

percona=# DROP TABLE scott.employee ;
DROP TABLE
percona=# CREATE TABLE scott.employee (emp_id INT, emp_name VARCHAR(100), dept_id INT);
CREATE TABLE
percona=# INSERT into scott.employee VALUES (generate_series(1,10),'avi',1);
INSERT 0 10
percona=# UPDATE scott.employee SET emp_name = 'avii';
UPDATE 10
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |               tuple_data_split
--------+--------+-----------------------------------------------
    672 |    673 | {"\\x01000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x02000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x03000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x04000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x05000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x06000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x07000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x08000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x09000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x0a000000","\\x09617669","\\x01000000"}
    673 |      0 | {"\\x01000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x02000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x03000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x04000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x05000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x06000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x07000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x08000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x09000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x0a000000","\\x0b61766969","\\x01000000"}
(20 rows)

An UPDATE in PostgreSQL performs an insert and a delete. Hence, all the records being updated have their old versions marked as deleted and new versions inserted with the new value. The deleted row versions have a non-zero t_xmax value.

Records with a non-zero t_xmax value may still be required by older transactions to ensure consistency, depending on the isolation levels in use.

We have discussed xmin and xmax. What about the hidden columns cmin and cmax?

cmax : The command identifier within the deleting transaction, or zero (as per the documentation). However, cmin and cmax are always the same, as per the PostgreSQL source code.

cmin : The command identifier within the inserting transaction. In the following log, you can see the cmin of the insert statements starting at 0 and incrementing by one for each statement.

See the following log to understand how the cmin and cmax values change through inserts and deletes in a transaction.

On Terminal A
---------------
percona=# BEGIN;
BEGIN
percona=# INSERT into scott.employee VALUES (1,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (2,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (3,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (4,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (5,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (6,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (7,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (8,'avi',2);
INSERT 0 1
percona=# COMMIT;
COMMIT
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  644 |    0 |    0 |    0 |      1 | avi      |       2
  644 |    0 |    1 |    1 |      2 | avi      |       2
  644 |    0 |    2 |    2 |      3 | avi      |       2
  644 |    0 |    3 |    3 |      4 | avi      |       2
  644 |    0 |    4 |    4 |      5 | avi      |       2
  644 |    0 |    5 |    5 |      6 | avi      |       2
  644 |    0 |    6 |    6 |      7 | avi      |       2
  644 |    0 |    7 |    7 |      8 | avi      |       2
(8 rows)

If you observe the above output log, you see cmin and cmax values incrementing for each insert.

Now let’s delete 3 records from Terminal A and observe how the values appear in Terminal B before COMMIT.

On Terminal A
---------------
percona=# BEGIN;
BEGIN
percona=# DELETE from scott.employee where emp_id = 4;
DELETE 1
percona=# DELETE from scott.employee where emp_id = 5;
DELETE 1
percona=# DELETE from scott.employee where emp_id = 6;
DELETE 1
On Terminal B, before issuing COMMIT on Terminal A
----------------------------------------------------
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  644 |    0 |    0 |    0 |      1 | avi      |       2
  644 |    0 |    1 |    1 |      2 | avi      |       2
  644 |    0 |    2 |    2 |      3 | avi      |       2
  644 |  645 |    0 |    0 |      4 | avi      |       2
  644 |  645 |    1 |    1 |      5 | avi      |       2
  644 |  645 |    2 |    2 |      6 | avi      |       2
  644 |    0 |    6 |    6 |      7 | avi      |       2
  644 |    0 |    7 |    7 |      8 | avi      |       2
(8 rows)

Now, in the above log, you can see that the cmin and cmax values start again from 0 and increment for the records being deleted. Their values were different before the delete, as we saw earlier. Even if you ROLLBACK, the values remain the same.

After understanding the hidden columns and how PostgreSQL maintains UNDO as multiple versions of rows, the next question would be—what would clean up this UNDO from a table? Doesn’t this increase the size of a table continuously? In order to understand that better, we need to know about VACUUM in PostgreSQL.

VACUUM in PostgreSQL

As seen in the above examples, every record that has been deleted but is still taking up space is called a dead tuple. Once the already running transactions no longer depend on those dead tuples, they are no longer needed. PostgreSQL therefore runs VACUUM on such tables. VACUUM reclaims the storage occupied by these dead tuples; the space occupied by them is what we refer to as bloat. VACUUM scans the pages for dead tuples and records the freed space in the free space map (FSM). Each relation, apart from hash indexes, has an FSM stored in a separate file called <relation_oid>_fsm.

Here, relation_oid is the oid of the relation that is visible in pg_class.

percona=# select oid from pg_class where relname = 'employee';
  oid
-------
 24613
(1 row)

Upon VACUUM, this space is not reclaimed to the filesystem, but it can be re-used by future inserts on this table. VACUUM stores the free space available on each heap (or index) page in the FSM file.
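
If you want to see what VACUUM has recorded in the FSM, the pg_freespacemap contrib extension exposes it through the pg_freespace() function. A minimal sketch (block numbers and free-space figures will vary):

CREATE EXTENSION pg_freespacemap;

-- Free space recorded for every block of the table, in bytes
SELECT * FROM pg_freespace('scott.employee');

-- Or just the blocks that have reusable space
SELECT blkno, avail FROM pg_freespace('scott.employee') WHERE avail > 0;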

Running a VACUUM is a non-blocking operation: it never takes an exclusive lock on the table. This means VACUUM can run on a busy transactional table in production while several transactions are writing to it.

As we discussed earlier, an UPDATE of 10 records has generated 10 dead tuples. Let us see the following log to understand what happens to those dead tuples after a VACUUM.

percona=# VACUUM scott.employee ;
VACUUM
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |               tuple_data_split
--------+--------+-----------------------------------------------
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
    673 |      0 | {"\\x01000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x02000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x03000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x04000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x05000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x06000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x07000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x08000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x09000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x0a000000","\\x0b61766969","\\x01000000"}
(20 rows)

In the above log, you can see that the dead tuples have been removed and the space is available for re-use. However, this space is not reclaimed to the filesystem after VACUUM; only future inserts on this table can use it.
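
You do not need pageinspect just to check whether a table has accumulated dead tuples: the statistics collector tracks them. A minimal sketch using pg_stat_user_tables (the counters are approximate and updated asynchronously):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'scott' AND relname = 'employee';
-- n_dead_tup should drop back towards 0 once VACUUM has processed the table.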

VACUUM does an additional task. Rows that were inserted and successfully committed sufficiently far in the past are marked as frozen, which indicates that they are visible to all current and future transactions. We will discuss this in detail in our future blog post, “Transaction ID Wraparound in PostgreSQL”.

VACUUM does not usually reclaim space to the filesystem unless the dead tuples are beyond the high water mark.

Let’s consider the following example to see when a VACUUM could release the space to filesystem.

Create a table and insert some sample records. Since the rows are inserted sequentially, they are physically laid out on disk in primary key order.

percona=# CREATE TABLE scott.employee (emp_id int PRIMARY KEY, name varchar(20), dept_id int);
CREATE TABLE
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), 'avi', 1);
INSERT 0 1000

Now, run ANALYZE on the table to update its statistics and see how many pages are allocated to the table after the above insert.

percona=# ANALYZE scott.employee ;
ANALYZE
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        6 |       49152 |   49152
(1 row)

Let’s now see how VACUUM behaves when you delete the rows with emp_id > 500

percona=# DELETE from scott.employee where emp_id > 500;
DELETE 500
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        3 |       24576 |   24576
(1 row)

In the above log, you can see that VACUUM has reclaimed half the space to the filesystem. Earlier, the table occupied 6 pages (of 8KB each, or whatever block_size is set to). After VACUUM, 3 of those pages were released to the filesystem.

Now, let’s repeat the same exercise by deleting the rows with emp_id < 500

percona=# DELETE from scott.employee ;
DELETE 500
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), 'avi', 1);
INSERT 0 1000
percona=# DELETE from scott.employee where emp_id < 500;
DELETE 499
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        6 |       49152 |   49152
(1 row)

In the above example, you can see that the number of pages remains the same after deleting half the records from the table. This means VACUUM has not released the space to the filesystem this time.

As explained earlier, VACUUM can only truncate pages at the end of the table: if the pages after the high water mark (the last page containing live tuples) hold no live tuples, they can be released to the filesystem. In the first case, there were no live tuples after the 3rd page, so the 4th, 5th and 6th pages were truncated and given back to the filesystem.

However, if you need to reclaim the space to the filesystem in the scenario where we deleted all the records with emp_id < 500, you can run VACUUM FULL. VACUUM FULL rebuilds the entire table and reclaims the space to disk.

percona=# VACUUM FULL scott.employee ;
VACUUM
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        3 |       24576 |   24576
(1 row)

Please note that VACUUM FULL is not an ONLINE operation. It is a blocking operation: you cannot read from or write to the table while VACUUM FULL is in progress. We will discuss ways to rebuild a table online without blocking in a future blog post.


Jan 14, 2015

MySQL performance implications of InnoDB isolation modes

Over the past few months I’ve written a couple of posts about the dangerous debt of InnoDB transactional history and about the fact that MVCC can be the cause of severe MySQL performance issues. In this post I will cover a related topic: InnoDB transaction isolation modes, their relationship with MVCC (multi-version concurrency control), and how they impact MySQL performance.

The MySQL Manual provides a decent description of transaction isolation modes supported by MySQL – I will not repeat it here but rather focus on performance implications.

SERIALIZABLE – This is the strongest isolation mode, to the point that it essentially defeats multi-versioning: all SELECTs become locking reads, which causes significant overhead both in terms of lock management (setting locks is expensive) and in the amount of concurrency you can get. This mode is only used in very special circumstances among MySQL applications.

REPEATABLE READ – This is the default isolation level and generally it is quite nice and convenient for the application. It sees all the data as of the time of its first read (assuming standard non-locking reads). This, however, comes at a high cost: InnoDB needs to maintain transaction history back to the point of the transaction start, which can be very expensive. The worst-case scenario is applications with a high rate of updates and hot rows – you really do not want InnoDB to have to deal with rows which have hundreds of thousands of versions.
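
Here is a minimal sketch of that snapshot behavior, using the sysbench schema that appears later in this post (exact values are illustrative). The open REPEATABLE READ transaction keeps seeing its original snapshot, so InnoDB must retain every row version needed to reconstruct it for as long as the transaction stays open:

-- Session 1: open a transaction and establish a read view
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT k FROM sbtest1 WHERE id = 1;   -- the snapshot is established by this first read

-- Session 2: keep modifying the same row (autocommit)
UPDATE sbtest1 SET k = k + 1 WHERE id = 1;

-- Session 1: still sees the value as of its first read; every version created by
-- Session 2 since then has to be kept in the undo space until this transaction ends
SELECT k FROM sbtest1 WHERE id = 1;   -- same value as before
COMMIT;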

In terms of performance impact, both reads and writes can be affected. For SELECT queries, traversing multiple row versions is very expensive, but so it is for updates, especially as version management seems to cause severe contention issues in MySQL 5.6.

Here is an example: I ran sysbench on a completely in-memory data set, then started a transaction and ran a full table scan query a couple of times while keeping the transaction open:

sysbench  --num-threads=64 --report-interval=10 --max-time=0 --max-requests=0 --rand-type=pareto --oltp-table-size=80000000 --mysql-user=root --mysql-password= --mysql-db=sbinnodb  --test=/usr/share/doc/sysbench/tests/db/update_index.lua run

As you can see, the write throughput drops drastically and stays low the whole time the transaction is open, not only while the query is running. This is perhaps the worst-case scenario I could find, which happens when you have a select outside of a transaction followed by a long transaction in REPEATABLE READ isolation mode. Though you can see regression in other cases, too.

Here is the set of queries I used if someone wants to repeat the test:

select avg(length(c)) from sbtest1;
begin;
select avg(length(c)) from sbtest1;
select sleep(300);
commit;

Not only is REPEATABLE READ the default isolation level, it is also used for logical backups with InnoDB – think mydumper or mysqldump --single-transaction.
These results show that such backup methods not only can't be used with large data sets due to the long recovery time, but also can't be used in some high-write environments due to their performance impact.

READ COMMITTED mode is similar to REPEATABLE READ, with the essential difference being that versions are not kept back to the start of the first read in the transaction, but only back to the start of the current statement. As such, using this mode allows InnoDB to maintain a lot fewer versions, especially if you do not have very long running statements. If you have some long running selects – such as reporting queries – the performance impact can still be severe.

In general I think a good practice is to use the READ COMMITTED isolation mode as the default and change to REPEATABLE READ for those applications or transactions which require it.
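
For reference, here is a sketch of how you might check and change the isolation level (the session/global variable is named tx_isolation in the MySQL versions discussed here; it was later renamed transaction_isolation):

-- Check the current levels
SELECT @@global.tx_isolation, @@session.tx_isolation;

-- Change it for the current session only
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Or make it the server-wide default for new connections
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- (equivalently, set transaction-isolation = READ-COMMITTED in my.cnf)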

READ UNCOMMITTED – I think this is the least understood isolation mode (not a surprise, as it only has two lines of documentation about it), and those lines only describe it from the logical point of view. If you’re using this isolation mode, you will see all the changes done in the database as they happen, even those from transactions that have not yet been committed. One nice use case for this isolation mode is that you can “watch” large-scale UPDATE statements as they happen, with dirty reads showing which rows have already been changed and which have not.

Such a statement shows changes that have not been committed yet and might never be committed if the transaction doing them runs into an error, so this mode should be used with extreme care. There are a number of cases, though, where we do not need 100% accurate data, and in those cases this mode becomes very handy.
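
As a sketch of that “watching” use case (the bulk UPDATE and the marker value here are hypothetical, not from the original test): one session runs a large update inside a transaction, and another session uses dirty reads to follow its progress without waiting for the commit.

-- Session 1: a long-running bulk update, not yet committed
START TRANSACTION;
UPDATE sbtest1 SET pad = 'migrated' WHERE id <= 10000000;

-- Session 2: follow the progress with dirty reads
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM sbtest1 WHERE pad = 'migrated';   -- grows as Session 1 proceeds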

So how does READ UNCOMMITTED behave from a performance point of view? In theory, InnoDB could purge row versions even if they were created after the start of the statement in READ UNCOMMITTED mode. In practice, due to a bug or some intricate implementation detail, it does not do that – row versions are still kept back to the start of the statement. So if you run a very long SELECT in READ UNCOMMITTED mode, you will get a large number of row versions created, just as if you had used READ COMMITTED. No win here.

There is an important win on the SELECT side, though: READ UNCOMMITTED isolation mode means InnoDB never needs to go and examine older row versions – the latest row version is always the one used – which can bring a dramatic performance improvement, especially if the undo space has spilled over to disk, where finding old row versions can cause a lot of IO.

Perhaps the best illustration I found is the query select avg(k) from sbtest1; run in parallel with the same update-heavy workload described above. In READ COMMITTED isolation mode it never completes – I assume because new index entries are inserted faster than they can be scanned – while in READ UNCOMMITTED isolation mode it completes in a minute or so.

Final Thoughts: Using InnoDB isolation modes correctly can help your application get the best possible performance. Your mileage may vary, and in some cases you will see no difference; in others it will be absolutely dramatic. There also seems to be a lot of work to be done in relation to InnoDB performance with long version history. I hope it will be tackled in a future MySQL version.


Dec 17, 2014

InnoDB’s multi-versioning handling can be Achilles’ heel

I believe the InnoDB storage engine architecture is great for a lot of online workloads; however, there are no silver bullets in technology and all design choices have their trade-offs. In this blog post I’m going to talk about one important InnoDB limitation that you should consider.

InnoDB is a multiversion concurrency control (MVCC) storage engine, which means many versions of a single row can exist at the same time. In fact there can be a huge number of such row versions. Depending on the isolation mode you have chosen, InnoDB might have to keep all row versions going back to the earliest active read view, but at the very least it will have to keep all versions going back to the start of the SELECT query which is currently running.

In most cases this is not a big deal – if you have many short transactions happening, you will have only a few row versions to deal with. If you just use the system for reporting queries but do not modify data aggressively at the same time, you also will not have many row versions. However, if you mix heavy updates with slow reporting queries running at the same time, you can get into a lot of trouble.

Consider, for example, an application with a hot row (something like an actively updated counter) that gets 1000 updates per second, together with some heavy batch job that takes 1000 seconds to run. In such a case we will have 1M row versions to deal with.

Let’s now talk about how those old row versions are stored in InnoDB: they are stored in the undo space as essentially a linked list, where each row version points to the previous row version, together with transaction visibility information that helps decide which version will be visible to a given query. Such a design favors short new queries, which will typically need to see one of the newer row versions and so do not have to go far down this linked list. This might not be the case for reporting queries that need to read a rather old row version corresponding to the time the query was started, or for logical backups that use consistent reads (think mysqldump or mydumper), which often need to access such very old row versions.
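
One way to see how long this chain of old row versions has grown is the “History list length” figure in SHOW ENGINE INNODB STATUS; on MySQL 5.6 and later the same counter is also exposed through INNODB_METRICS (it is enabled by default). A minimal sketch:

-- Length of the history list of undo logs that purge has not yet processed
SELECT NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';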

So going through the linked list of versions is expensive, but how expensive can it get? In this case a lot depends on whether the UNDO space fits in memory, so that the list can be traversed efficiently – or whether it does not, in which case you might be looking at massive disk IO. Keep in mind that undo space is not clustered by PRIMARY key, as normal data in InnoDB tables is, so if you’re updating multiple rows at the same time (the typical case) you will be looking at a row-version chain spread over many pages, often with as little as one row version per page, requiring either massive IO or a large number of UNDO space pages to be present in the InnoDB buffer pool.

Where it can get even worse is an index scan. This is because indexes in InnoDB are structured to include entries for all row versions corresponding to a key value, current and past. This means, for example, that the index entries for KEY=5 will contain pointers to all rows that either have the value 5 now or had the value 5 at some point in the past and have not been purged yet. Where it can really bite is the following: InnoDB needs to know which of the entries stored for the key are visible to the current transaction – and that might mean going through the long version chains for each of the keys.

This is all theory, so let’s see how we can simulate such workloads and find out how bad things really can get in practice.

I have created a 1 billion row “sysbench” table, which takes some 270GB of space, and I will use a small buffer pool – 6GB. I will run sysbench with 64 threads and a Pareto distribution (hot rows), while running a full table scan query concurrently: select avg(k) from sbtest1. Here is the exact sysbench run, done after the prepare step:

sysbench --num-threads=64 --report-interval=10 --max-time=0 --max-requests=0 --rand-type=pareto --oltp-table-size=1000000000 --mysql-user=root --mysql-password=password  --test=/usr/share/doc/sysbench/tests/db/oltp.lua run

Here is the EXPLAIN for the “reporting” query, which you would expect to be a rather efficient index scan query. With just 4 bytes per value, 1 billion values would be only about 4GB (more in reality, because of InnoDB overhead) – not a big deal for modern systems:

mysql> explain select avg(k) from sbtest1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: sbtest1
         type: index
possible_keys: NULL
          key: k_1
      key_len: 4
          ref: NULL
         rows: 953860873
        Extra: Using index
1 row in set (0.00 sec)

Two days have passed and the “reporting” query is still running… furthermore, the foreground workload has started to look absolutely bizarre:

[207850s] threads: 64, tps: 0.20, reads: 7.40, writes: 0.80, response time: 222481.28ms (95%), errors: 0.70, reconnects:  0.00
[207860s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[207870s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[207880s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[207890s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[207900s] threads: 64, tps: 2.70, reads: 47.60, writes: 11.60, response time: 268815.49ms (95%), errors: 0.00, reconnects:  0.00
[207910s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[207920s] threads: 64, tps: 2.30, reads: 31.60, writes: 9.50, response time: 294954.28ms (95%), errors: 0.00, reconnects:  0.00
[207930s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[207940s] threads: 64, tps: 2.90, reads: 42.00, writes: 12.20, response time: 309332.04ms (95%), errors: 0.00, reconnects:  0.00
[207950s] threads: 64, tps: 0.20, reads: 4.60, writes: 1.00, response time: 318922.41ms (95%), errors: 0.40, reconnects:  0.00
[207960s] threads: 64, tps: 0.20, reads: 1.90, writes: 0.50, response time: 335170.09ms (95%), errors: 0.00, reconnects:  0.00
[207970s] threads: 64, tps: 0.60, reads: 13.20, writes: 2.60, response time: 292842.88ms (95%), errors: 0.60, reconnects:  0.00
[207980s] threads: 64, tps: 2.60, reads: 37.60, writes: 10.20, response time: 351613.43ms (95%), errors: 0.00, reconnects:  0.00
[207990s] threads: 64, tps: 5.60, reads: 78.70, writes: 22.10, response time: 186407.21ms (95%), errors: 0.00, reconnects:  0.00
[208000s] threads: 64, tps: 8.10, reads: 120.20, writes: 32.60, response time: 99179.05ms (95%), errors: 0.00, reconnects:  0.00
[208010s] threads: 64, tps: 19.50, reads: 280.50, writes: 78.90, response time: 27559.69ms (95%), errors: 0.00, reconnects:  0.00
[208020s] threads: 64, tps: 50.70, reads: 691.28, writes: 200.70, response time: 5214.43ms (95%), errors: 0.00, reconnects:  0.00
[208030s] threads: 64, tps: 77.40, reads: 1099.72, writes: 311.31, response time: 2600.66ms (95%), errors: 0.00, reconnects:  0.00
[208040s] threads: 64, tps: 328.20, reads: 4595.40, writes: 1313.40, response time: 911.36ms (95%), errors: 0.00, reconnects:  0.00
[208050s] threads: 64, tps: 538.20, reads: 7531.90, writes: 2152.10, response time: 484.46ms (95%), errors: 0.00, reconnects:  0.00
[208060s] threads: 64, tps: 350.70, reads: 4913.45, writes: 1404.09, response time: 619.42ms (95%), errors: 0.00, reconnects:  0.00
[208070s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208080s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208090s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208100s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208110s] threads: 64, tps: 1.60, reads: 24.20, writes: 6.80, response time: 42385.40ms (95%), errors: 0.10, reconnects:  0.00
[208120s] threads: 64, tps: 0.80, reads: 28.20, writes: 3.40, response time: 51381.54ms (95%), errors: 2.80, reconnects:  0.00
[208130s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208140s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208150s] threads: 64, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[208160s] threads: 64, tps: 0.60, reads: 14.20, writes: 2.40, response time: 93248.04ms (95%), errors: 0.80, reconnects:  0.00

As you can see, we have long stretches of time when no queries complete at all, followed by some spikes of higher performance. This is how it looks on the graph:

Corresponding CPU usage:

This shows that we are not only observing something we would expect given InnoDB’s design, but also that there seems to be some severe starvation happening in this case, which we can confirm:

Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/ubuntu-trusty-64bit/percona-server-5.6-5.6.21-70.0/storage/innobase/row/row0ins.cc line 2357
--Thread 139790809552640 has waited at btr0cur.cc line 3852 for 194.00 seconds the semaphore:
S-lock on RW-latch at 0x7f24d9c01bc0 '&block->lock'
a writer (thread id 139790814770944) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0sel.cc line 4125
Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/ubuntu-trusty-64bit/percona-server-5.6-5.6.21-70.0/storage/innobase/row/row0ins.cc line 2357
--Thread 139790804735744 has waited at row0sel.cc line 3506 for 194.00 seconds the semaphore:
S-lock on RW-latch at 0x7f24d9c01bc0 '&block->lock'
a writer (thread id 139790814770944) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0sel.cc line 4125
Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/ubuntu-trusty-64bit/percona-server-5.6-5.6.21-70.0/storage/innobase/row/row0ins.cc line 2357
--Thread 139790810756864 has waited at row0sel.cc line 4125 for 194.00 seconds the semaphore:
S-lock on RW-latch at 0x7f24d9c01bc0 '&block->lock'
a writer (thread id 139790814770944) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0sel.cc line 4125
Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/ubuntu-trusty-64bit/percona-server-5.6-5.6.21-70.0/storage/innobase/row/row0ins.cc line 2357
--Thread 139790811158272 has waited at btr0cur.cc line 3852 for 194.00 seconds the semaphore:
S-lock on RW-latch at 0x7f24d9c01bc0 '&block->lock'
a writer (thread id 139790814770944) has reserved it in mode  wait exclusive

Waiting for the given buffer pool block to become available for more than 3 minutes is a big issue – this lock should never be held for more than a few microseconds.

SHOW PROCESSLIST confirms that even the most basic selects by primary key can get stalled for a long time:

|  5499 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5500 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5501 | root          | localhost | sbtest             | Query   |    185 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5502 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5503 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 1 AND 1+99 ORDER BY c                                |         0 |             0 |
|  5504 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5505 | root          | localhost | sbtest             | Query   |     14 | updating     | UPDATE sbtest1 SET k=k+1 WHERE id=1                                                                  |         0 |             0 |
|  5506 | root          | localhost | sbtest             | Query   |    236 | updating     | DELETE FROM sbtest1 WHERE id=1                                                                       |         0 |             0 |
|  5507 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5508 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id BETWEEN 1 AND 1+99                                                    |         0 |             0 |
|  5509 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=1                                                                     |         0 |             0 |
|  5510 | root          | localhost | sbtest             | Query   |     14 | updating     | UPDATE sbtest1 SET c='80873149502-45132831358-41942500598-17481512835-07042367094-39557981480-593123 |         0 |             0 |
|  5511 | root          | localhost | sbtest             | Query   |    236 | updating     | UPDATE sbtest1 SET k=k+1 WHERE id=18                                                                 |         0 |             1 |
|  5512 | root          | localhost | sbtest             | Query   |     14 | statistics   | SELECT c FROM sbtest1 WHERE id=7                                                                     |         0 |             0 |
|  6009 | root          | localhost | sbtest             | Query   | 195527 | Sending data | select avg(k) from sbtest1                                                                           |         0 |             0 |

How do I know it is an UNDO space related issue in this case? Because undo log pages end up taking the majority of the buffer pool:

mysql> select page_type,count(*) from information_schema.INNODB_BUFFER_PAGE group by page_type;
+-------------------+----------+
| page_type         | count(*) |
+-------------------+----------+
| EXTENT_DESCRIPTOR |        1 |
| FILE_SPACE_HEADER |        1 |
| IBUF_BITMAP       |      559 |
| IBUF_INDEX        |      855 |
| INDEX             |     2186 |
| INODE             |        1 |
| SYSTEM            |      128 |
| UNDO_LOG          |   382969 |
| UNKNOWN           |     6508 |
+-------------------+----------+
9 rows in set (1.04 sec)

And it does so in a very crazy way: when there is almost no work being done, the UNDO_LOG portion of the buffer pool grows very rapidly, while when we are getting some work done, the INDEX type pages take a lot more space. To me this suggests that as the index scan progresses, it touches some hot rows and some not-so-hot ones, which contain fewer row versions and so do not put much pressure on the undo space.

Take Away: You might argue that this workload and situation is rather artificial and rather narrow. It may well be. My main point here is this: if you’re looking at just part of your workload, such as your main short application queries, and not taking reporting or backups into account “because their performance is not important,” you might be in for a big surprise. Those background activities might take much longer than you would expect, and in addition, they might have a much more severe impact on your main application workload, as in the case above.

P.S.: I’ve done more experiments to validate how bad the problem really is, and I can repeat it rather easily even without putting the system into overdrive. Even if I run sysbench injecting just 25% of the transactions the system is capable of handling at peak, the “select avg(k) from sbtest1” query on the 1 billion row table never completes, as it looks like new entries are inserted into the index faster than InnoDB can examine which of them are visible.

