Feb 22, 2019

PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019

fsync postgresql upgrade

In case you didn't already see this news, PostgreSQL got its first minor version release of 2019 on February 14, covering all supported PostgreSQL major versions. As we noted in a previous blog post, PostgreSQL 9.3 has gone EOL and no longer receives updates.

What’s new in this release?

One fix common to all supported PostgreSQL versions is to panic instead of retrying after an fsync() failure. This fsync behaviour has been under discussion for a year or two now, so let's take a look at its implications.

A fix to the Linux fsync issue for PostgreSQL Buffered IO in all supported versions

PostgreSQL performs two types of IO: Direct IO, though almost never, and the much more commonly used Buffered IO.

PostgreSQL uses O_DIRECT only when writing to WALs (Write-Ahead Logs, aka transaction logs), and only when wal_sync_method is set to open_datasync or open_sync with no archiving or streaming enabled. The default wal_sync_method may be fdatasync, which does not use O_DIRECT. This means that, almost all the time on your production database server, you'll see PostgreSQL using O_SYNC / O_DSYNC while writing to WALs, whereas writing modified/dirty buffers from shared buffers to datafiles always goes through Buffered IO. Let's understand this further.

Upon checkpoint, dirty buffers in shared buffers are written to the page cache managed by the kernel. Through an fsync(), these modified blocks are flushed to disk. If an fsync() call is successful, all dirty pages from the corresponding file are guaranteed to be persisted on disk. However, PostgreSQL itself cannot guarantee that a modified/dirty page actually reaches disk, because writes from the page cache to storage are managed entirely by the kernel, not by PostgreSQL.

This could still be fine if the next fsync retried flushing the dirty pages. In reality, though, the data is discarded from the page cache when fsync returns an error, and the next fsync will happily succeed while ignoring the previous failure, because it only covers the next set of dirty buffers that need to be written to disk, not the ones that failed earlier.

To understand it better, consider an example of Linux trying to write dirty pages from the page cache to a USB stick that was removed during an fsync. Neither ext4 nor btrfs nor xfs retries the failed writes. A silently failing fsync may result in data loss, block corruption, tables or indexes out of sync, foreign key or other data integrity issues... and deleted records may reappear.
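
To make the failure mode concrete, here is a minimal Python sketch of the (unsafe) write-and-retry pattern described above. The file name is made up, and on a healthy disk the fsync() simply succeeds; the point is in the comments.

import os

# Illustrative only: "datafile" is a made-up name; on a healthy disk
# the first fsync() will simply succeed.
fd = os.open("datafile", os.O_RDWR | os.O_CREAT, 0o644)
os.write(fd, b"dirty page contents")   # lands in the kernel page cache
try:
    os.fsync(fd)                       # kernel writes the dirty pages back
except OSError as e:
    # Pre-fix behaviour: log the error and retry later.
    # Problem: on ext4/xfs/btrfs the failed pages have already been marked
    # clean and dropped, so they will NOT be part of any later writeback.
    print("fsync failed: " + str(e))
    os.fsync(fd)                       # "succeeds" -- but only covers new dirty pages
    # The fixed PostgreSQL releases PANIC here instead, forcing crash
    # recovery from the WAL rather than trusting this retry.
os.close(fd)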

Until a while ago, when we used local storage or storage behind RAID controllers with a write cache, this might not have been a big problem. The issue goes back to the fact that PostgreSQL was designed around Buffered IO rather than Direct IO. Should this now be considered an issue with PostgreSQL and the way it's designed? Well, not exactly.

All this started with the error handling during writeback in Linux. Writeback asynchronously writes dirty pages from the page cache to the filesystem. In filesystems such as ext4, upon a writeback error the page is marked clean and up to date, and user space is unaware of the problem.

fsync errors are now detected

Starting with kernel 4.13, such errors can be reliably detected during fsync: every open file descriptor includes a pointer to the file's address_space structure, and a new 32-bit value (errseq_t) has been added there that is visible to all processes accessing the file. With the new minor versions of all supported PostgreSQL releases, a PANIC is triggered upon such an error. This crashes the database and initiates recovery from the last CHECKPOINT. A patch expected in PostgreSQL 12 works with newer kernel versions and modifies the way PostgreSQL handles file descriptors. A long-term solution may be Direct IO, but you might see a different approach in PG 12.

A good amount of work on reporting writeback errors was done by Jeff Layton and Matthew Wilcox. What this patch means is that a writeback error gets reported during fsync and can be seen by another process that opens the same file. A new 32-bit value that stores an error code and a sequence number was added as a new typedef, errseq_t, inside the address_space structure. So these errors are now tracked in the address_space. But if the struct inode is evicted due to memory pressure, this tracking has no value.

Can I enable or disable the PANIC on fsync failure in the newer PostgreSQL releases?

Yes. You can leave the parameter data_sync_retry at false (the default), in which case a PANIC-level error is raised and the database recovers from WAL through a crash restart. You must be sure to have a proper high-availability mechanism in place so that the impact on your application is minimal; for example, you could let your application fail over to a slave.

You can also set data_sync_retry to true if you are sure about how your OS behaves during writeback failures. With this setting, PostgreSQL will just report an error and continue to run.
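
If you want to check how a server is configured, a quick query does it. Below is a minimal sketch assuming the psycopg2 driver and a locally reachable instance running one of these new minor versions; the connection parameters are placeholders. The parameter itself is set in postgresql.conf and only takes effect at server start.

import psycopg2

# Placeholder connection string -- adjust for your environment.
conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn.cursor() as cur:
    cur.execute("SHOW data_sync_retry;")
    value = cur.fetchone()[0]
    # 'off' (default): PANIC and crash-recover on fsync failure.
    # 'on'           : report an ERROR and keep running (only if you trust
    #                  your OS/filesystem writeback error handling).
    print("data_sync_retry = " + value)
conn.close()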

Other notable fixes common to these minor releases

  1. A lot of features and fixes related to PARTITIONING have been applied in this minor release. (PostgreSQL 10 and 11 only).
  2. Autovacuum has been made more aggressive about removing leftover temporary tables.
  3. Deadlock when acquiring multiple buffer locks.
  4. Crashes in logical replication.
  5. Incorrect planning of queries in which a lateral reference must be evaluated at a foreign table scan.
  6. Fixed some issues reported with ANALYZE and TRUNCATE operations.
  7. Fix to contrib/hstore to calculate correct hash values for empty hstore values that were created in version 8.4 or before.
  8. A fix to pg_dump’s handling of materialized views with indirect dependencies on primary keys.

We always recommend that you keep your PostgreSQL databases updated to the latest minor versions. Applying a minor release needs only a restart after installing the updated binaries.

Here is the sequence of steps you should follow to upgrade to the latest minor versions, after thorough testing:

  1. Shut down the PostgreSQL database server
  2. Install the updated binaries
  3. Restart your PostgreSQL database server

Most of the time, you can choose to update the minor versions in a rolling fashion in a master-slave (replication) setup, because it avoids downtime for both reads and writes simultaneously. For a rolling-style update, you perform the update on one server after another, rather than all at once. However, the best method, and the one we'd almost always recommend, is to shut down, update, and restart all instances at once.

If you are currently running your databases on PostgreSQL 9.3.x or earlier, we recommend that you prepare a plan to upgrade your PostgreSQL databases to a supported version ASAP. Please subscribe to our blog posts so that you can hear about the various options for upgrading your PostgreSQL databases to a supported major version.



Jul 18, 2018

Why Consumer SSD Reviews are Useless for Database Performance Use Case


If you’re reading consumer SSD reviews and using them to estimate SSD performance under database workloads, you’d better stop. Databases are not your typical consumer applications and they do not use IO in the same way.

Let's look, for example, at this excellent AnandTech review of the Samsung 960 Pro, a consumer NVMe device that I happen to have in my test lab.

Anandtech Table reviewing consumer SSD performance

The summary table is actually great, showing the performance both at Queue Depth 1 (single threaded) as well as Queue Depth 32 – a pretty heavy concurrent load.

Even at QD1 we see 50K (4KB) writes per second, which should be enough for pretty serious database workloads.

In reality, though, you might be in for some disappointing surprises. While “normal” buffered IO is indeed quite fast, this drive really hates fsync() calls, with a single thread fsync() latency of 3.5ms or roughly 300 fsync/sec. That’s just slightly more than your old school spinning drive.

Why is fsync() performance critical for databases?

To achieve durability (the letter "D" of ACID), databases tend to rely on a write-ahead log (WAL), which is written sequentially. The WAL must be synced to disk on every transaction commit using fsync() or similar measures, such as opening the file with the O_SYNC flag. These tend to have similar performance implications.

Other database operations use fsync() too, but writing WAL is where it usually hurts the most.    
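
To get a feel for this on your own hardware, here is a rough single-threaded Python emulation of the commit pattern: one small sequential append to a log file followed by an fsync() per "commit". The file name and iteration count are arbitrary; the measured rate is roughly the ceiling on fully durable commits per connection for that device.

import os, time

COMMITS = 500
fd = os.open("wal_test", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

start = time.time()
for i in range(COMMITS):
    os.write(fd, b"x" * 512)   # small sequential WAL-style record
    os.fsync(fd)               # durability point of a "commit"
elapsed = time.time() - start
os.close(fd)

print("%.0f fsync'd commits/s, %.2f ms per commit"
      % (COMMITS / elapsed, 1000.0 * elapsed / COMMITS))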

In a fully durable configuration, MySQL tends to be impacted even more by poor fsync() performance, as it may need to perform as many as three fsync operations per transaction commit. On this drive, at roughly 3.5ms per fsync, that puts single-connection commit latency at around 10ms. Group commit reduces the impact on throughput, but transaction latency will still be severely impacted.

Want more bad news? If the fsync() performance is phenomenal on your consumer SSD it indeed might be too good to be true. Over the years, some consumer SSDs “faked” fsync and accepted possible data loss in the case of power failure. This might not be a big deal if you only use them for testing but it is a showstopper for any real use.

Want to know more about your drive’s fsync() performance?  You can use these sysbench commands:

sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 prepare
sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 run | grep "avg:"

You can also use ioping, as described in this blog post.

I wish that manufacturers' tech specifications described fsync latency, along with a clear statement as to whether the drive guarantees no loss of data on power failure. Likewise, I wish folks doing storage reviews would include these in their research.

Interested in fsync() performance for a variety of devices? Yves Trudeau wrote an excellent blog post about fsync() performance on various storage devices a few months ago.

Other technical resources

Principal Support Escalation Specialist Sveta Smirnova presents Troubleshooting MySQL Concurrency Issues with Load Testing Tools. 

You can download a three part series of eBooks by Principal Consultant Alexander Rubin and me on MySQL Performance.


Feb 08, 2018

Fsync Performance on Storage Devices

fsync

While preparing a post on the design of ZFS-based servers for use with MySQL, I stumbled on the topic of fsync call performance. The fsync call is very expensive, but it is essential to databases as it allows for durability (the "D" of the ACID acronym).

Let’s first review the type of disk IO operations executed by InnoDB in MySQL. I’ll assume the default InnoDB variable values.

The first and most obvious type of IO is page reads and writes from the tablespaces. The pages are most often read one at a time, as 16KB random read operations. Writes to the tablespaces are also typically 16KB random operations, but they are done in batches. After every batch, fsync is called on the tablespace file handle.

To avoid partially written pages in the tablespaces (a source of data corruption), InnoDB performs a doublewrite. During a doublewrite operation, a batch of dirty pages, from 1 to about 100 pages, is first written sequentially to the doublewrite buffer and fsynced. The doublewrite buffer is a fixed area of the ibdata1 file, or a specific file with the latest Percona Server for MySQL 5.7. Only then do the writes to the tablespaces of the previous paragraph occur.

That leaves us with the writes to the InnoDB log files. During those writes, the transaction information — a kind of binary diff of the affected pages — is written to the log files and then the log file is fsynced. The duration of the fsync call can be a major contributor to the COMMIT latency.

Because the fsync call takes time, it greatly affects the performance of MySQL; that's why there are so many status variables related to fsyncs. To overcome the inherent limitations of storage devices, group commit allows multiple simultaneous transactions to fsync the log file once for all the transactions waiting for the fsync. There is no need for a transaction to call fsync for a write operation that another transaction has already forced to disk. However, a series of write transactions sent over a single database connection cannot benefit from group commit.

Fsync Results

In order to evaluate the fsync performance, I used the following Python script:

#!/usr/bin/python
import os, mmap

# Open a test file with O_DIRECT (bypasses the OS page cache)
fd = os.open("testfile", os.O_RDWR | os.O_CREAT | os.O_DIRECT)
# Anonymous mmap gives a page-aligned 512-byte buffer, as required by O_DIRECT
m = mmap.mmap(-1, 512)
for i in range(1000):
    os.lseek(fd, 0, os.SEEK_SET)   # rewrite the same 512-byte block every time
    m[1] = "1"
    os.write(fd, m)
    os.fsync(fd)                   # force the data (and metadata) to stable storage
# Close the opened file
os.close(fd)

The script opens a file with the O_DIRECT flag, writes and fsyncs it 1000 times, and closes the file. I added O_DIRECT after an internal discussion with my colleagues, but it doesn't change the results and it doesn't remove the need for calling fsync. We'll discuss the impact of O_DIRECT in more detail after we review the results. The script is invoked with the time command as shown below:

root@lab:/tmp/testfsync# time python /root/fsync.py
real 0m18.320s
user 0m0.060s
sys 0m0.096s

In the above example, using a 7.2k RPM drive, the fsync rate is about 56/s for a latency of 18ms. A 7.2k RPM drive performs 120 rotations per second, so on average the fsyncs require a bit more than two rotations to complete. The filesystem appears to make very little difference: ext4 and XFS show similar results. That means if MySQL uses such a storage device for the InnoDB log files, the latency of each transaction is at least 18ms. If the application workload requires 200 write transactions per second, they'll need to be executed using at least four database connections.
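
For reference, here is the arithmetic behind those figures as a small Python snippet (the elapsed time is taken from the time output above; the snippet itself is purely illustrative):

import math

elapsed = 18.32          # seconds reported by `time` for 1000 fsynced writes
calls = 1000

rate = calls / elapsed                               # ~55 fsync/s
latency_ms = 1000.0 * elapsed / calls                # ~18 ms per fsync
rotations = (latency_ms / 1000.0) * (7200 / 60.0)    # ~2.2 platter rotations per fsync
connections = math.ceil(200 * latency_ms / 1000.0)   # ~4 connections for 200 writes/s

print("fsync rate        : %.1f /s" % rate)
print("fsync latency     : %.1f ms" % latency_ms)
print("platter rotations : %.1f per fsync" % rotations)
print("connections needed: %d for 200 writes/s" % connections)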

So, let’s begin with rotational devices. These are becoming a bit less common now with databases, especially without a raid controller. I could only find a few.

Drive RPM Rate Latency Notes
WDC WD2500BJKT 5400 22/s 45 ms Laptop SATA from 2009
ST2000LM003 5400 15/s 66 ms USB-3 portable drive
ST3750528AS 7200 40/s 25 ms Desktop grade SATA
WD2502ABYS-18B7A0 7200 56/s 18 ms Desktop grade SATA
HUA723020ALA641 7200 50/s 20 ms Enterprise grade SATA, md mirror
Dell SAS unknown 7200 58/s 17 ms Behind Perc ctrl but no write cache
HDWE150 7200 43/s 23 ms Recent Desktop grade SATA, 5TB

 

I unfortunately didn't have access to any 10k or 15k RPM drives that were not behind a RAID controller with a write cache. If you have access to such drives, run the above script a few times and send me your results; that would help create a more complete picture! So, we can see a correlation between rotational speed and fsync rate, which makes sense: the faster a disk turns, the faster it can fsync. The fsync call saves the data and then updates the metadata, so the heads need to move; that's probably the main explanation for the remaining disparity. On a positive note, all drives appear to fully comply with the SATA flush command even though they all have their write cache enabled. Disabling the drives' write caches made no difference.

With the above numbers, the possible transaction rates in fully ACID mode are pretty depressing. But those drives were rotating ones; what about SSDs? SSDs are memory devices and are much faster for random IO operations. They are extremely fast for reads and good for writes but, as you will see below, not that great for fsyncs.

Drive rate latency notes
SAMSUNG MZ7LN512 160/s 6.3ms Consumer grade SATA
Crucial_CT480M500SSD1 108/s 9.3ms Consumer grade SATA
Intel 520 2031/s 0.49ms Consumer grade SATA
SAMSUNG MZVPV512HDGL 104/s 9.6ms Consumer grade NVMe
Samsung SSD 960 PRO 267/s 3.8ms High-end consumer grade NVMe
Intel PC-3100 1274/s 0.79ms Low-end consumer grade NVMe (cheat?)
Intel 750 2038/s 0.49ms High-end consumer grade NVMe
Intel PC-3700 7380/s 0.14ms High-end enterprise-grade NVMe

 

Again, this is a small sample of the devices I have access to. All SSD/Flash devices have write caches, but only the high-end devices have capacitors to flush their write cache to the flash upon a loss of power. The PC-3100 device is actually in my home server, and it is obviously cheating. If you look at the card specs on the Intel website, it doesn't have the "Enhanced Power Loss Data Protection" and "End-to-End Data Protection" features. The much more expensive PC-3700 does. I use the PC-3100 as a ZFS L2ARC device, so I am good. In general, the performance of a flash device varies a bit more than that of rotational devices, since factors like the number of recent writes and the filling factor come into play.

Even when using a high-end NVMe device like the PC-3700, you can't reach 10k fully ACID transactions per second at low thread concurrency. How do you reach the higher levels? The answer here is the good old RAID controller with a protected write cache: basically DRAM protected from power loss by a battery. SAN controllers have similar caches. The writes to the InnoDB log files are sequential writes interleaved with fsyncs. The RAID controller concatenates the sequential writes, eventually writing one big chunk to disk and... ignoring the fsyncs. Here's the result from the only device I had access to:

Drive rate latency notes
Dell Perc with BBU 23000/s 0.04ms Array of 7.2k rpm drives

 

That's extremely fast but, of course, it is memory. I modified the script to loop 10k times instead of 1k. In theory, something like a single slave thread doing simple transactions could reach a rate of 20k/s or more while being fully ACID.

Discussion

We must always consider the results from the previous section in the context of a given application. For example, a server using an Intel PC-3700 NVMe card can do more than 7000 fully durable, fully ACID transactions per second, provided those transactions are issued by a sufficient number of threads. Adding threads will not allow the rate to scale infinitely; at some point, other bottlenecks like mutex contention or page flushing will dominate.

We often say that Galera-based cluster solutions like Percona XtraDB Cluster (PXC) add latency to transactions, since they involve communication over the network. With the Galera protocol, a commit operation returns only when all the nodes have received the data. Thus, it is a good practice to relax the local durability and set innodb_flush_log_at_trx_commit to 0 or 2. On a local network, the ping time is always below 1ms and often below 0.1ms. As a result, the transaction latency can end up smaller than with a slow local fsync on every commit.

About fdatasync

The fsync system call is not the only call that persists data to disk; there is also fdatasync. fdatasync persists the data to disk but does not update metadata information like the file size and last update time. In other words, it performs one write operation instead of two. In the Python script, I replace os.fsync with os.fdatasync.
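
Here is a minimal sketch of that variant, keeping the Python 2 style of the original script above; only the sync call changes:

#!/usr/bin/python
import os, mmap

fd = os.open("testfile", os.O_RDWR | os.O_CREAT | os.O_DIRECT)
m = mmap.mmap(-1, 512)                 # page-aligned buffer for O_DIRECT
for i in range(1000):
    os.lseek(fd, 0, os.SEEK_SET)
    m[1] = "1"
    os.write(fd, m)
    os.fdatasync(fd)                   # data only; metadata flushed lazily
os.close(fd)

With that change, here are the results for a subset of devices: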

Drive rpm rate latency notes
ST2000LM003 5400 72/s 13 ms USB-3 portable drive
WD2502ABYS-18B7A0 7200 118/s 8.5 ms Desktop grade SATA
SAMSUNG MZ7LN512 N/A 333/s 3.0ms Consumer grade SATA
Crucial_CT480M500SSD1 N/A 213/s 4.7ms Consumer grade SATA
Samsung SSD 960 PRO N/A 714/s 1.4ms High-end consumer grade NVMe

 

In all cases, the resulting rates have more than doubled. The fdatasync call has a troubled history, as there were issues with it many years ago. Because of those issues, InnoDB never uses fdatasync, only fsync. You can find the following comment in the InnoDB source file os/os0file.cc:

/* We let O_SYNC only affect log files; note that we map O_DSYNC to
O_SYNC because the datasync options seemed to corrupt files in 2001
in both Linux and Solaris */

2001 is a long time ago. Given the above results, maybe we should reconsider the use of fdatasync. From the Linux man page for fdatasync, you find:

fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled. For example,
changes to st_atime or st_mtime (respectively, time of last
access and time of last modification; see stat(2)) do not require
flushing because they are not necessary for a subsequent data
read to be handled correctly. On the other hand, a change to
the file size (st_size, as made by say ftruncate(2)), would
require a metadata flush.

So, even with fdatasync, operations like extending an InnoDB tablespace will update the metadata correctly. This appears to be an interesting low-hanging fruit in terms of MySQL performance. In fact, WebScaleSQL already has fdatasync available.

O_DIRECT

Why do we need an fsync or fdatasync with O_DIRECT? With O_DIRECT, the OS is not buffering anything along the way, so the data should be persisted, right? Actually, the OS is not buffering, but the device very likely is. Here are a few results to highlight the point, using a 7.2k RPM SATA drive:

Test rate latency
O_DIRECT, drive Write cache enabled 4651/s 0.22ms
O_DIRECT, drive Write cache disabled 101/s 9.9ms
ASYNC + fdatasync, Write cache enabled 119/s 8.4ms
ASYNC + fdatasync, Write cache disabled 117/s 8.5ms

 

The drive write cache was enabled/disabled using the hdparm command. Clearly, there's no way the drive can persist 4651 writes per second: O_DIRECT doesn't send the SATA flush command to the disk, so we are only writing to the drive's write cache. If the drive write cache is disabled, the rate falls to a more reasonable value of 101/s. What is interesting, and I don't really understand why, is that opening the file in async mode and performing fdatasync is significantly faster. As expected, the presence of the drive write cache has no impact on ASYNC + fdatasync; when the fdatasync call occurs, the data is still in the OS file cache.

If you want to use only O_DIRECT, you should make sure all the storage write caches are crash safe. That's why MySQL adds an fsync call after a write to a file opened with O_DIRECT.
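
If you want to reproduce the O_DIRECT-only behaviour yourself, one way (not necessarily how the rows above were measured) is to drop the sync call from the earlier script and run it under time as before. A sketch, again in the original's Python 2 style:

#!/usr/bin/python
import os, mmap

fd = os.open("testfile", os.O_RDWR | os.O_CREAT | os.O_DIRECT)
m = mmap.mmap(-1, 512)              # page-aligned buffer for O_DIRECT
for i in range(1000):
    os.lseek(fd, 0, os.SEEK_SET)
    m[1] = "1"
    os.write(fd, m)                 # bypasses the page cache, lands in the drive's cache
    # no fsync/fdatasync: the SATA flush command is never sent to the drive
os.close(fd)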

ZFS

These days, I find it difficult to write a blog post without talking about ZFS. How does ZFS handle fsyncs and fdatasyncs? ZFS, like a database, performs write-ahead logging in the ZIL (ZFS Intent Log). That means calls like fsync and fdatasync return when the data has been persisted to the ZIL, and not to the actual filesystem. The real write operation is done a few seconds later by a background thread. That means the added write for the metadata does not impact performance right away. My home server uses ZFS over a pair of 7.2k RPM drives and doesn't have a SLOG device. The ZIL is thus stored on the 7.2k RPM drives. The results are the following:

Drive rpm rate latency
ZFS fsync 7200 104/s 9.6 ms
ZFS fdatasync 7200 107/s 9.3 ms

 

Remember that with ZFS, you need to disable the O_DIRECT mode. The fdatasync rate appears to be slightly faster, but the difference is not really significant. With ZFS, the fsync/fdatasync performance relates to where the ZIL is stored. If there is no SLOG device, the ZIL is stored with the data, and thus the persistence performance of the devices used for the data matters. If there is a SLOG device, the persistence performance is governed by the specs of the device(s) on which the SLOG is located. That's a very important aspect we have to consider when designing a MySQL server that will use ZFS. The design of such a server will be discussed in more detail in a future post.
