Imagine you’re the CEO of a thriving company and you’ve been horrified by the news of the Sony hack, the Target breach and the litany of security issues that have plagued big companies in recent years. You swear you’re going to do whatever’s necessary to make sure it won’t happen to your company. But do you realize what that really means? Read More
If you’ve been studying complex systems you know what minor changes might cause consequences of much greater proportions, sometimes causing some effects that are not easily explained at first. I recently ran across a great illustration of such behavior while doing MySQL benchmarks which I thought would be interesting to share.
I’m using a very simple benchmark – Sysbench 0.5 on Percona Server 5.6.21-70.1 just running update queries:
sysbench --num-threads=64 --report-interval=10 --max-time=0 --max-requests=0 --rand-type=pareto --oltp-table-size=1000000000 --mysql-user=root --mysql-password= --mysql-db=sbinnodb --test=/usr/share/doc/sysbench/tests/db/update_index.lua run
Some people frown upon such benchmarks due to their triviality and being irrelevant to workloads. I like them because they often allow you to study already complex system behavior in a much more controlled environment than “real” workloads – and so you can understand internal dynamics better.
For this test I set innodb_purge_threads=8, innodb_max_purge_lag=1000000, innodb_max_purge_lag_delay=5000 as otherwise chances are that the history will just be growing forever not allowing the system to reach the steady state. I wrote about InnoDB history problems in detail recently in a post titled “InnoDB transaction history often hides dangerous ‘debt’.
The only change I was looking for is changing innodb_flush_log_at_trx_commit from 1 to 2. I then went back to 1 a few hours later, after the system had reached the steady state. What I would normally expect in such a case is a performance increase because there is less work to do on commit and that is pretty much it. Looking at the graphs however I can see much more changes in the system. Let me go through them and explain the ones which I can:
The throughput graph is probably the most interesting here. We can see that indeed the throughput has substantially improved when set innodb_flush_log_at_trx_commit=2 and then went back to the old one after it was set back to 1. We see however in 2 some interesting effects here – a spike of a few minutes and later a dip below the original performance level for a few hours, until finally recovery back to the baseline. I’ll explain this with different graphs.
CPU usage is interesting, too. We can see there is user CPU increase correlating to the throughput increase but also a drastic IOWait increase. Why would it increase when we’re actually making workload less IO bound by not flushing log as much? The reason is because there are more threads now that can perform IO concurrently, hence increasing what is reporting IO wait and deceasing idle time.
InnoDB writing much more data is expected and it very well matches throughput. Not much of anything unexpected.
As we see setting innodb_flush_log_at_trx_commit=2 causes the amount of dirty pages in the buffer pool to go down significantly. It recovers when it is put back. The reason I suspect in this case is the undo space – as we will see in later graph history length spikes quickly – obviously requiring some space in the buffer pool.
This is perhaps one of the most interesting graphs here. You can see for some reason when log flush is enabled that purge threads are able to keep up. But if you disable it they cannot, which makes history grow very quickly until the max allowed amount. This is by the way the time we see the spike in the throughput as there is no throttling being done to keep history in check. This also explains why after I set innodb_flush_log_at_trx_commit back to 1 that performance first drops well below original levels – InnoDB needs to go through and purge the history, which takes resources. Another interesting thing you can notice is how uneven history cleanup is. Specifically you can note some 75% percent of the history has been cleaned up almost instantly while initial 25% took a lot of time to work out. I do not have an exact explanation here yet I believe it has to do with the fact initially history purging requires a significant amount of IO while at a certain point in time a purge happens completely in the buffer pool and hence being much faster.
Why would purge threads have a hard time keeping up when log flush is disabled? I suspect there are 2 reasons. First, the system can provide much more throughput in such configuration (not keeping purging into account) while purge subsystem is limited and can’t keep up with such a high load. The other reason is – eliminating serialization on the log flush exposes other parts of InnoDB to contention which further suppresses purge system progress.
This is another relevant graph to show the throttling done because purging can’t keep up. As you see we got about 2.5ms delay injected for updates to give purging some time to keep up. It is good to watch it to understand if the max delay you specified is enough to prevent purge lag while still avoid having the system grind to a halt. Had I set innodb_max_purge_lag_delay to 1000 it would not be enough to prevent history length growing to infinity, having I set it to 100000 the system could grind to a halt in case there is some long-running transaction causing history length to grow to about 1 million.
Lets look at Physical IO – it is also quite interesting. As you can see in the first and last parts of the graph there are more pages flushed than pages read, while when we disable log flushing we get much more reads than writes which only stops after history purging catch up is complete. I find this very interesting – this means that working with even modest history – 1M in this case causes a lot of IO – I assume because due to access pattern to the undo space it is constantly being washed out from the buffer pool. This graph is a very good illustration why large history might be a death spiral, as once it is there the system needs to do much more work.
This IO utilization graph might be a bit hard to understand but what it generally shows is that during the time the hard drive has been 100% utilized, there was still at least one outstanding IO to serve. However the concurrency was different with more average IO requests outstanding when we had log flushing disabled. This makes sense. We can also see it was highest when we enabled log flushing back but there was history to go through and catch up.
Changes to contention are very curious – we can see there is a lot more spin rounds and specifically shared lock spin rounds when we have log flush disabled. This is what I believe comes from less serialization on the log flushing causes a lot more internal contention.
It might not be very easy to see but this graph shows that innodb_row_lock_times spikes 1000x when log flushing is disabled. Why would that be? I suspect the reason is injected delay for purging. I also suspect the design of this throttling so it only is done when an actual change is attempted to be made, this is where update statement already holds some locks, which causes more row lock contention for other statements.
InnoDB has something called Server Activity count. As you can see from this graph it is not really a good indicator – for some reason it shows a lower value when the server was doing much more work. I do not understand other than this variable is incremented enough to explain it but find it curious.
LRU flushing also is drastically different between enabled and disabled log flushing. We get a lot more calls to search free pages and a lot more scans. I’m not sure why exactly this is happening but I think this has something to do with undo space caching which is constantly washed out from buffer pool which increased IO seems to indicate.
Finally lets look at perhaps the most counter-intuitive result – the amount of pending pwrites to the log file – we allowed to cache more log files in the OS cache but the amount of waits on writes to the log files actually increased? I’m not very sure why this is happening – perhaps there is a lot of IO pressure on OS so it does not delay writes to the log files for long time?
This is the same graph showing log fsyncs and log pwrites together. We can see when we set innodb_flush_log_at_trx_commit=2 that we see a lot more waits on pwrites, while less waits on fsync (which now should only be happening once per second or so). In general these two graphs tell me in some cases assuming log writes with innodb_flush_log_at_trx_commit=2 will be “free” because the operating system will instantly acknowledge them and put in the cache is not always correct.
Summary: The main point I’m looking to illustrate here is what changes you’re doing to the MySQL configuration (same can be said about your application OS configuration, hardware, etc.) can cause much larger changes to system behavior than you would guess. If you have the time it is good to validate whenever there are any other changes to system behavior beyond what you would think. If not at least keep an open mind and assume what any change, even a small one, can cause dramatic system behavior changes.
If you want to explore similar graphs check out Percona Cloud Tools – it’s easy to get going and allows you to build graphs like these absolutely free and perform a lot of in depth analyses. Now in free beta, Percona Cloud Tools is a hosted service providing access to query performance insights for all MySQL uses.
The post How small changes impact complex systems – MySQL example appeared first on MySQL Performance Blog.
The sluggish pace of technological change inside large companies is forever surprising and endlessly appalling to me and in nowhere is this more apparent than in big business’ approach to mobile app development. Read More
State Snapshot Transfer (SST) is used in Percona XtraDB Cluster (PXC) when a new node joins the cluster or to resync a failed node if Incremental State Transfer (IST) is no longer available. SST is triggered automatically but there is no magic: If it is not configured properly, it will not work and new nodes will never be able to join the cluster. Let’s have a look at a few classic issues.
Port for SST is not open
The donor and the joiner communicate on port 4444, and if the port is closed on one side, SST will always fail.
You will see in the error log of the donor that SST is started:
[...] 141223 16:08:48 [Note] WSREP: Node 2 (node1) requested state transfer from '*any*'. Selected 0 (node3)(SYNCED) as donor. 141223 16:08:48 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 6) 141223 16:08:48 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 141223 16:08:48 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.234.101:4444/xtrabackup_sst' --auth 'sstuser:s3cret' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '04c085a1-89ca-11e4-b1b6-6b692803109b:6'' [...]
But then nothing happens, and some time later you will see a bunch of errors:
[...] 2014/12/23 16:09:52 socat E connect(3, AF=2 192.168.234.101:4444, 16): Connection timed out WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 0 1 (20141223 16:09:52.057) WSREP_SST: [ERROR] Cleanup after exit with status:32 (20141223 16:09:52.064) WSREP_SST: [INFO] Cleaning up temporary directories (20141223 16:09:52.068) 141223 16:09:52 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.234.101:4444/xtrabackup_sst' --auth 'sstuser:s3cret' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '04c085a1-89ca-11e4-b1b6-6b692803109b:6' [...]
On the joiner side, you will see a similar sequence: SST is started, then hangs and is finally aborted:
[...] 141223 16:08:48 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 6) 141223 16:08:48 [Note] WSREP: Requesting state transfer: success, donor: 0 141223 16:08:49 [Note] WSREP: (f9560d0d, 'tcp://0.0.0.0:4567') turning message relay requesting off 141223 16:09:52 [Warning] WSREP: 0 (node3): State transfer to 2 (node1) failed: -32 (Broken pipe) 141223 16:09:52 [ERROR] WSREP: gcs/src/gcs_group.cpp:long int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():717: Will never receive state. Need to abort.
The solution is of course to make sure that the ports are open on both sides.
SST is not correctly configured
Sometimes you will see an error like this on the donor:
141223 21:03:15 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.234.102:4444/xtrabackup_sst' --auth 'sstuser:s3cretzzz' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'e63f38f2-8ae6-11e4-a383-46557c71f368:0'' [...] WSREP_SST: [ERROR] innobackupex finished with error: 1. Check /var/lib/mysql//innobackup.backup.log (20141223 21:03:26.973)
And if you look at
41223 21:03:26 innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock' as 'sstuser' (using password: YES). innobackupex: got a fatal error with the following stacktrace: at /usr//bin/innobackupex line 2995 main::mysql_connect('abort_on_error', 1) called at /usr//bin/innobackupex line 1530 innobackupex: Error: Failed to connect to MySQL server: DBI connect(';mysql_read_default_file=/etc/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock','sstuser',...) failed: Access denied for user 'sstuser'@'localhost' (using password: YES) at /usr//bin/innobackupex line 2979
The default SST method is
xtrabackup-v2 and for it to work, you need to specify a username/password in the my.cnf file:
And you also need to create the corresponding MySQL user:
mysql> GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost' IDENTIFIED BY 's3cret';
So you should check that the user has been correctly created in MySQL and that
wsrep_sst_auth is correctly set.
Galera versions do not match
Here is another set of errors you may see in the error log of the donor:
141223 21:14:27 [Warning] WSREP: unserialize error invalid flags 2: 71 (Protocol error) at gcomm/src/gcomm/datagram.hpp:unserialize():101 141223 21:14:30 [Warning] WSREP: unserialize error invalid flags 2: 71 (Protocol error) at gcomm/src/gcomm/datagram.hpp:unserialize():101 141223 21:14:33 [Warning] WSREP: unserialize error invalid flags 2: 71 (Protocol error) at gcomm/src/gcomm/datagram.hpp:unserialize():101
Here the issue is that you try to connect a node using Galera 2.x and a node running Galera 3.x. This can happen if you try to use a PXC 5.5 node and a PXC 5.6 node.
The right solution is probably to understand why you ended up with such inconsistent versions and make sure all nodes are using the same Percona XtraDB Cluster version and Galera version.
But if you know what you are doing, you can also instruct the node using Galera 3.x that it will communicate with Galera 2.x nodes by specifying in the my.cnf file:
SST errors can have multiple reasons for occurring, and the best way to diagnose the issue is to have a look at the error log of the donor and the joiner. Galera is in general quite verbose so you can follow the progress of SST on both nodes and see where it fails. Then it is mostly about being able to interpret the error messages.
The post Diagnosing SST errors with Percona XtraDB Cluster for MySQL appeared first on MySQL Performance Blog.
Some time ago, Peter Zaitsev posted a blog titled “How well does your table fits in innodb buffer pool?” He used some special INFORMATION_SCHEMA tables developed for Percona Server 5.1 to report how much of each InnoDB table and index resides in your buffer pool.
As Peter pointed out, you can use this view into the buffer pool to watch a buffer pool warm up with pages as you run queries. You can also use it for capacity planning. If you expect some tables need to be fully loaded in the buffer pool to be used efficiently, but the buffer pool isn’t large enough to hold them, then it’s time to increase the size of the buffer pool.
The problem, however, was that system tables change from version to version. Specifically, the INNODB_BUFFER_POOL_PAGES_INDEX table no longer exists in Percona Server 5.6, and the INNODB_INDEX_STATS table changed some column names in Percona Server 5.5.8, and the table no longer exists in Percona Server 5.6. So many of the comments on Peter’s blog rightly pointed out that the example query didn’t work on subsequent versions of Percona Server. And MySQL Community Edition at the time didn’t have the feature at all. They asked for an update to the blog post.
So here’s an updated, simplified query to report the content of your buffer pool, tested on the most recent versions.
Percona Server 5.1 and 5.5:
USE information_schema; SET @page_size = @@innodb_page_size; SET @bp_pages = @@innodb_buffer_pool_size/@page_size; SELECT P.TABLE_NAME, P.PAGE_TYPE, CASE WHEN P.INDEX_NAME IS NULL THEN NULL WHEN P.TABLE_NAME LIKE '`SYS_%' THEN P.INDEX_NAME WHEN P.INDEX_NAME <> 'PRIMARY' THEN 'SECONDARY' ELSE 'PRIMARY' END AS INDEX_TYPE, COUNT(DISTINCT P.PAGE_NUMBER) AS PAGES, ROUND(100*COUNT(DISTINCT P.PAGE_NUMBER)/@bp_pages,2) AS PCT_OF_BUFFER_POOL, CASE WHEN P.TABLE_NAME IS NULL THEN NULL WHEN P.TABLE_NAME LIKE 'SYS_%' THEN NULL ELSE ROUND(100*COUNT(DISTINCT P.PAGE_NUMBER)/CASE P.INDEX_NAME WHEN 'PRIMARY' THEN TS.DATA_LENGTH/@page_size ELSE TS.INDEX_LENGTH/@page_size END, 2) END AS PCT_OF_INDEX FROM INNODB_BUFFER_PAGE AS P JOIN INNODB_SYS_TABLES AS T ON P.SPACE = T.SPACE JOIN TABLES AS TS ON (T.SCHEMA, T.NAME) = (TS.TABLE_SCHEMA, TS.TABLE_NAME) WHERE TS.TABLE_SCHEMA <> 'mysql' GROUP BY TABLE_NAME, PAGE_TYPE, INDEX_TYPE;
MySQL 5.6 and 5.7 (this also works on Percona Server 5.6):
USE information_schema; SET @page_size = @@innodb_page_size; SET @bp_pages = @@innodb_buffer_pool_size/@page_size; SELECT P.TABLE_NAME, P.PAGE_TYPE, CASE WHEN P.INDEX_NAME IS NULL THEN NULL WHEN P.TABLE_NAME LIKE '`SYS_%' THEN P.INDEX_NAME WHEN P.INDEX_NAME <> 'PRIMARY' THEN 'SECONDARY' ELSE 'PRIMARY' END AS INDEX_TYPE, COUNT(DISTINCT P.PAGE_NUMBER) AS PAGES, ROUND(100*COUNT(DISTINCT P.PAGE_NUMBER)/@bp_pages,2) AS PCT_OF_BUFFER_POOL, CASE WHEN P.TABLE_NAME IS NULL THEN NULL WHEN P.TABLE_NAME LIKE 'SYS_%' THEN NULL ELSE ROUND(100*COUNT(DISTINCT P.PAGE_NUMBER)/CASE P.INDEX_NAME WHEN 'PRIMARY' THEN TS.DATA_LENGTH/@page_size ELSE TS.INDEX_LENGTH/@page_size END, 2) END AS PCT_OF_INDEX FROM INNODB_BUFFER_PAGE AS P JOIN INNODB_SYS_TABLES AS T ON P.SPACE = T.SPACE JOIN TABLES AS TS ON T.NAME = CONCAT(TS.TABLE_SCHEMA, '/', TS.TABLE_NAME) WHERE TS.TABLE_SCHEMA <> 'mysql' GROUP BY TABLE_NAME, PAGE_TYPE, INDEX_TYPE;
In both cases, the output looks something like the following (if I have read from a single table called test.foo):
+--------------+-------------------+------------+-------+--------------------+--------------+ | TABLE_NAME | PAGE_TYPE | INDEX_TYPE | PAGES | PCT_OF_BUFFER_POOL | PCT_OF_INDEX | +--------------+-------------------+------------+-------+--------------------+--------------+ | NULL | FILE_SPACE_HEADER | NULL | 1 | 0.00 | NULL | | NULL | IBUF_BITMAP | NULL | 1 | 0.00 | NULL | | NULL | INODE | NULL | 1 | 0.00 | NULL | | `test`.`foo` | INDEX | PRIMARY | 2176 | 3.32 | 98.37 | | `test`.`foo` | INDEX | SECONDARY | 2893 | 4.41 | 88.47 | +--------------+-------------------+------------+-------+--------------------+--------------+
Unfortunately, the INFORMATION_SCHEMA tables report total size of secondary indexes for a table, but not the size of each index individually. Therefore this query shows the percent of index only for the primary index (which is also the clustered index, i.e. the table itself), and then all other secondary indexes grouped together.
The PERFORMANCE_SCHEMA also includes some information about the contents of the buffer pool. The MySQL SYS Schema makes it easy to query this. But this view doesn’t calculate the percentage of each table in the buffer pool, nor the percentage of the buffer pool occupied by each table.
mysql> SELECT * FROM sys.innodb_buffer_stats_by_table; +---------------+----------------------+------------+------------+-------+--------------+-----------+-------------+ | object_schema | object_name | allocated | data | pages | pages_hashed | pages_old | rows_cached | +---------------+----------------------+------------+------------+-------+--------------+-----------+-------------+ | test | foo | 149.64 MiB | 106.19 MiB | 9577 | 9577 | 9577 | 1050490 | | InnoDB System | SYS_TABLES | 160.00 KiB | 91.24 KiB | 10 | 10 | 10 | 594 | | InnoDB System | SYS_INDEXES | 128.00 KiB | 93.59 KiB | 8 | 8 | 8 | 1345 | | InnoDB System | SYS_COLUMNS | 80.00 KiB | 47.13 KiB | 5 | 5 | 5 | 761 | | InnoDB System | SYS_DATAFILES | 48.00 KiB | 16.40 KiB | 3 | 3 | 3 | 246 | | InnoDB System | SYS_FIELDS | 48.00 KiB | 16.02 KiB | 3 | 3 | 3 | 377 | | InnoDB System | SYS_FOREIGN | 48.00 KiB | 0 bytes | 3 | 3 | 3 | 0 | | InnoDB System | SYS_TABLESPACES | 48.00 KiB | 15.83 KiB | 3 | 3 | 3 | 242 | | InnoDB System | SYS_FOREIGN_COLS | 16.00 KiB | 0 bytes | 1 | 1 | 1 | 0 | . . .
The post How well does your table fit in the InnoDB buffer pool in MySQL 5.6+? appeared first on MySQL Performance Blog.
You probably already know ‘Big Data’ is top of mind heading into 2015. How could you not? You are hearing about it constantly from vendors and journalists alike (guilty as charged). And you know what that hype says, right? Big data is going to provide all the answers, make your companies run more efficiently and help you make brilliant, data-driven decisions that give your… Read More
At Emergence Capital, we have had the opportunity to invest in visionaries such as Marc Benioff, Aaron Levie and David Sacks who have built major enterprise cloud applications that have formed the basis of the next generation of business software around the world. As we look out to 2015, we are excited to invest in what’s next for the enterprise. Read More
It has been a while since I have looked at InnoDB crash recovery. A lot has change in the last few years – we have serious crash recovery performance improvements in MySQL 5.5 and MySQL 5.6, we have solid state drives raising as typical high performance IO subsystem and we also have the ability to set much larger log files and often have a much larger InnoDB Buffer Pool to work with.
First let me revisit the challenge with have with InnoDB configuration. For write-intensive workloads it is extremely important to size innodb_log_file_size for good performance, however the longer log file size you have the longer you might have to wait for InnoDB to complete crash recovery, which impacts your recovery strategy.
How much can innodb_log_file_size impact performance? Massively! Doing intensive writes to a database that well fits in memory, I’d say there’s a 10x difference between having combined size of log files of 32GB vs 64MB.
Before we look at some performance numbers let’s talk about what InnoDB Crash Recovery time depends on:
Combined Innodb Log File Size – innodb_log_file_size*innodb_log_files_in_group is what really matters. It does not really matter which of those two you change. I prefer to keep innodb_log_files_in_group as default and only work with innodb_log_file_size. The larger size you have allocated the longer recovery will take.
innodb_checkpoint_age – Combined size of InnoDB log files defines how many changes not reflected in the tablespace we may have where innodb_checkpoint_age shows how much changes we actually have at the current moment, being an actual driving factor of recovery time. If you have very large log files allocated but for your workload innodb_checkpoint_age stays low chances are recovery will be quick. Be careful however – intensive writes can cause innodb_checkpoint_age to go much higher than the average for your workload causing recovery time from crashes at that time to be much longer.
Innodb Buffer Pool Size – This is another very important factor. During recovery, InnoDB has to redo changes to the unflushed/dirty pages from buffer pool, which is obviously limited by buffer pool size. This also means innodb_max_dirty_pages_pct can be used to impact recovery speed. This is the number of dirty pages being the true driving factor. With small buffer pool, a limited number of dirty pages based on the workload you might not have innodb_checkpoint_age to go high even if you have allowed for large log space.
Data Structure matters a lot for recovery speed. Generally shorter rows being updated will mean longer recovery time for the same log file size. This should make sense as shorter row changes means there is less log space produced for the same amount of page changes. If you do a lot of blob writes InnoDB crash recovery can be short even with relatively large log files.
Access Pattern is another key factor – the more “random” access is the more distinct pages you will have touched during the same innodb_checkpoint_age the longer recovery can take.
Hardware – Better hardware means recovery goes faster, as much is obvious. More specifically you will be looking for storage performance at low concurrency (both reads and writes are important) as well as fast CPU cores – crash recovery at this point is not able to use multiple cores effectively.
Let’s now look at the test….
I am running Sysbench on an 11GB table, designed to fit in the 12GB buffer pool. Here is the exact command:
sysbench --tx-rate=4000 --num-threads=64 --report-interval=10 --max-time=0 --max-requests=0 --rand-type=uniform --oltp-table-size=40000000 --mysql-user=root --mysql-password=password --test=/usr/share/doc/sysbench/tests/db/update_index.lua run
The box is rather low end i3-4010U (4 threads) CPU with a Samsung EVO 840GB SSD, so numbers are expected to be higher on real server hardware.
In my first test I’m injecting 4000 updates/sec which is about half of what the box can do at sustained load. I do this to illustrate more common load scenario as we rarely run systems at their saturation point in real world. The uniform distribution should mean worse case scenarios for in-memory workloads through I think recovery speed would be slower if u use random writes to the database much larger than the amount of memory.
At this workload I’m getting innodb_checkpoint_age of 15GB even though total log file size is 32GB. Crashing the system makes for about 40 minutes recovery time so the log was processed at the 6.25MB/sec
Here are some interesting graphs:
As you can see recovery is essentially close to single core. It also can be CPU bound at times (and will be more so with faster storage) – at certain times of recovery when logs are being scanned it can be completely CPU bound (see how IO wait essentially goes to zero at some times)
Over time as recovery progresses more and more blocks become cached, so they do not have to be read from the disk for log records to be applied, meaning the workload becomes more and more write bound.
This is an unweighted IO utilization graph where 1000 corresponds to 100% of time where at least one IO request was outstanding. As you can see from this and the previous drive, InnoDB does not keep the IO well saturated all the time during crash recovery.
Additionally to the first sysbench crash test I did two more – one running a system completely saturated with updates. This made innodb_checkpoint_age to go as high as 23.2GB and crash recovery took 1h 20 minutes, showing some 4.8MB/sec The thing to consider in this case is that MySQL was not able to keep up with purging the history so it was growing quickly meaning crash recovery had to cover a lot of undo space modifications.
Finally I also did a run with a more skewed pareto distribution which resulted in 9.8G innodb_checkpoint_age 33min crash recovery time and 4.94MB/sec of log processing speed.
As I explained above there are a lot of moving parts so your numbers are likely to be quite different, yet I hope this can provide some reasonable baseline you can use for calculation.
Note also waiting for the server to recover from the crash is only one way to deal with recovery. Even if you size log files to be very small you will likely need to deal with Operating System boot and when warmup which will take a few minutes. It is often much better to use a different primary method of crash recovery, such as failover to the MySQL Replication Slave or using Percona XtraDB Cluster. If you use these methods you can often use quite a high combined InnoDB log file size to optimize for performance.
Final Thoughts: Even though InnoDB Crash Recovery has improved in MySQL 5.5 and MySQL 5.6 there is still room to improve it even more. As we see from the resource usage graphs during recovery there is an opportunity to both use multiple CPU cores more effectively as well as drive IO subsystem with higher concurrency and in more sustained fashion.
This is a long overdue blog post from London’s 44con Cyber Security conference back in September. A lot of old memories were brought to the front as it were; the one I’m going to cover in this blog post is: file carving.
So what is file carving? despite the terminology it’s not going to be a full roast dinner; unless you have an appetite for data which as you’re here I’m assuming you have.
The TL;DR of “what is file carving” is taking a target blob of data (often a multi GB / TB file) and reducing it in to targeted pieces of data, this could be for instance grabbing all the jpeg images in a packet capture / mysqldump; or pulling that single table/schema out of a huge mysqldump with –all-databases (if you’re not using mydumper you really should it avoids issues like this!) aka “Sorting the wheat from the chaff”.
Let’s take for example at the time of writing this post I am looking to extract a single schema out of one such mysqldump –all-database file of around 2GB (2GB of course isn’t large however it’s large enough to give a practical example; the methods for larger files are of course the same). So where to start?
You’ll need the following tools installed:
- xxd (you can substitute xxd for od, hexer or any other hex editing / viewing tool you are comfortable with, just make sure it can handle very large files)
Let’s carve out the mysql schema
dbusby@kali:~$ xxd yourdumpfile.sql | grep '
mysql' -B5 | grep 'ASE' -A2 -B2
00003c0: 6e74 2044 6174 6162 6173 653a 2060 6d79 nt Database:
00003d0: 7371 6c60 0a2d 2d0a 0a43 5245 4154 4520 sql
00003e0: 4441 5441 4241 5345 202f 2a21 3332 3331 DATABASE /*!3231
00003f0: 3220 4946 204e 4f54 2045 5849 5354 532a 2 IF NOT EXISTS*
0000400: 2f20 606d 7973 716c 6020 2f2a 2134 3031 /
Wonderful so we have some hex representation of the sql dumpfile why on earth do we want the hex? we need to define our offsets. In short our offsets are the position of the start and end of the chunk we intend to carve from the file.
From the above our start offset is 00003d9 at the start of CREATE DATABASE; for those unfamiliar with hexdump outputs I recommend looking at the tool hexer a vi like tool and pressing v to enter visual selection mode select a few characters and you’ll not something as follows “visual selection: 0x000003d9 – …”.
You can of course work out the range visually from the above, 00003d0 is the start of the line, each alphanumeric pair is a single byte the byte offset notation is hexedecimal 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f.
00003d0: 7371 6c60 0a2d 2d0a 0a43 5245 4154 4520 sql
00003d0 == s, 00003d1 == q, 00003d2 == l And so on, we can easily verify this using xxd
dbusby@kali:~$ xxd -s 0x3d9 yourdumpfile.sql | head -n3mysql
00003d9: 4352 4541 5445 2044 4154 4142 4153 4520 CREATE DATABASE
00003e9: 2f2a 2133 3233 3132 2049 4620 4e4f 5420 /*!32312 IF NOT
00003f9: 4558 4953 5453 2a2f 2060 6d79 7371 6c60 EXISTS*/
right so now we need the end offset, as above we establish a search pattern as the schema data we're carving is in the midst of a larger file we can look for the start of the dump for the next schema.
dbusby@kali:~$ xxd -s 0x3d9 yourdumpfile.sql | grep '--' -A5 | grep C -A2 -B2 | less
0083b19: 2043 7572 7265 6e74 2044 6174 6162 6173 Current Databas
0083b29: 653a 2060 7065 7263 6f6e 6160 0a2d 2d0a e:
I’ve piped into less here as there were many matches to the grep patterns.
From the above we can see a potential offset of 0x83b19 however we want to “backtrack” a few bytes to before the — comment start.
dbusby@kali:~$ xxd -s 0x83b14 yourdumpfile.sql | head -n1
0083b14: 2d2d 0a2d 2d20 4375 7272 656e 7420 4461 --.-- Current Da
Excellent we have our offsets starting at 0x3d9 ending at 0x83b14 we need to now convert base16 (hexidecimal) into base10 fortunatly we can do this usinc the bc utility very easily however we will need to fully expand and make upper case our offsets.
dbusby@kali:~$ echo 'ibase=16;00003D9' | bc
dbusby@kali:~$ echo 'ibase=16;0083B14' | bc
dbusby@kali:~$ echo '539412-985' | bc
dbusby@kali:~$ dd if=yourdumpfile.sql of=mysql.sql skip=985 bs=1 count=538427
538427+0 records in
538427+0 records out
538427 bytes (538 kB) copied, 1.08998 s, 494 kB/s
Let’s discuss this a little; what we have done here is convert our start offset to a base10 count of bytes to offset by when using dd (skip=985) we then convert the end offset to its base10 byte position, and by removing the startoffset base10 value this gives us the size of the chunk we are carving.
We now put this into a dd command line, and voila! we have a mysql.sql file which contains only the mysqldump data.
I hope this post helps somewhat to demystify file carving; the above techniques can be applied to any for of file carving need and is not limited only to mysql files.
Oracle added another piece to its growing cloud portfolio today, buying digital-marketing, Data as a Service broker Datalogix. The move not only gives Oracle another cloud tool, it’s one that’s aimed directly at marketers, an area where Oracle is fighting hard with rival Salesforce.com and others for dominance. The announcement did not include terms. Datalogix gives Oracle… Read More