Feb
15
2023
--

How Bloom Filters Work in MyRocks

Bloom Filters Work in MyRocks

Bloom filters are an essential component of an LSM-based database engine like MyRocks. This post will illustrate through a simple example how bloom filters work in MyRocks.

Why?

With MyRocks/RocksDB, data is stored in a set of large SST files.  When MyRocks needs to find the value associated with a given key, it uses a bloom filter to guess if the key could potentially be in an SST file.

How?

A bloom filter is a space-efficient way of storing information about a list of keys. At its base, there is a bitmap and a hash function.  The hash value of the keys stored in an SST is computed, and the results are used to set some bits to “1” in the bitmap.  When you want to know if a key is present or not in the list, you run it through the hash function and check if the corresponding bits in the bitmap are “1” or “0”. If one of the bits is “0”, you know for sure the key is not in the list. If all the bits are “1”, the value may be present.  The probability of a false positive depends only on a few factors:

  • Size of the bitmap
  • Number of keys in the list
  • Number of bits set to “1” for each key value

An example

Let’s assume we are creating a list of names and want a bloom filter on that list. We’ll use an 8 bytes bloom filter (64 bits), and we’ll set two bits according to this function:

mysql> set @md5value=md5('Yves');select mod(conv(mid(@md5value,1,8),16,10),32) b1, mod(conv(mid(@md5value,9,8),16,10),32)+32 b2;
Query OK, 0 rows affected (0.00 sec)
+------+------+
| b1   | b2   |
+------+------+
|   15 |   39 |
+------+------+
1 row in set (0.00 sec)

So the value ‘Yves’ sets the bits 15 and 39 to ‘1’.

Now, if I add 10 names (taken from the Percona slack channel “architecture_projects”) we have:

  • Alexey: 29 53
  • Alick: 8 49
  • Bennie: 31 63
  • Brenda: 20 54
  • Brett: 12 61
  • Carlos: 3 46
  • Corrado: 27 34
  • David: 15 57
  • Deb: 13 38
  • Eliana: 11 39

The bitmap now looks like:

0 0 1 0 0 0 0 0
1 0 0 1 1 1 0 1
0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 1
0 0 1 0 0 0 1 1
0 0 0 0 0 0 1 0
0 1 0 0 0 1 1 0
0 1 0 0 0 1 0 1

Now, if I want to use the bitmap to “guess” if “Bennie” is in the list, that will work just fine, bits 31 and 63 are “1”. But… “Yves” is not in the list even though bits 15 (from David) and 39 (from Eliana) are set to “1”. A query for “Yves” would be a false positive.

On the other hand, let’s consider “Francisco” for which the bits are 27 and 60.  Bit 27 is set to “1” by the “Corrado” value, but bit 60 is “0”.  We thus know for sure that “Francisco” is not in the list.

Tuning

In terms of tuning, two parameters can be tuned, the size of the bitmap and the number of bits set by every value. The bitmap size must be large enough so that the proportion of bits set to “1” is not overwhelming. In the above example, if you add all the first names of the people at Percona you’ll end up with only “1”.  That would be obviously useless.  To avoid this issue, we could instead use a 1 GB bitmap. The number of false positives would be low, but 1GB is a lot of memory. In most cases a bitmap of a few hundred bytes is sufficient.

With MyRocks, you can control the size of the bloom filter bitmap with the column family parameter memtable_prefix_bloom_size_ratio. For good performance, the filter blocks are cached in the RocksDB block cache and normally stay there since they are accessed frequently.

The filter_policy variable controls the number of bits per key in the column family settings. By default, 10 bits are used per key which yields less than 1% of false positives.

Conclusion

I hope that this post has provided you with a basic understanding of bloom filters and how they are implemented in MyRocks. LSM storage engines like MyRocks are very different from the more common B-Tree-based storage engines like InnoDB. Bloom filter is an important feature improving performance of LSM storage engines. A good understanding of how bloom filters work can only lead to improved database performance.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Download Percona Distribution for MySQL Today

Dec
23
2022
--

A MyRocks Use Case

A MyRocks Use Case

A MyRocks Use CaseI wrote this post on MyRocks because I believe it is the most interesting new MySQL storage engine to have appeared over the last few years. Although MyRocks is very efficient for writes, I chose a more generic workload that will provide a different MyRocks use case.

The use case is the TPC-C benchmark but executed not on a high-end server but on a lower-spec virtual machine that is I/O limited like for example, with AWS EBS volumes. I decided to use a virtual machine with two CPU cores, four GB of memory, and storage limited to a maximum of 1000 IOPs of 16KB. The storage device has performance characteristics pretty similar to an AWS gp2 EBS volume of about 330 GB in size. I emulated these limits using the KVM iotune settings in my lab.

<iotune>
     <total_iops_sec>1000</total_iops_sec>
     <total_bytes_sec>16384000</total_bytes_sec>
</iotune>

MyRocks and RocksDB

If you wonder what is the difference between MyRocks and RocksDB, consider MyRocks as the piece of code, or the glue, that allows MySQL to store data in RocksDB. RocksDB is a very efficient key-value store based on LSM trees. MyRocks stores table rows in RocksDB using an index id value concatenated with the primary key as the key and then the internal MySQL binary row representation as the value. MyRocks handles indexes in a similar fashion. There are obviously tons of details but that is the main principle behind MyRocks. Inside MyRocks, there is an embedded instance of RocksDB running.

 

Dataset

The TPC-C dataset I used was with a scale of 200. As seen in the figure below, the sizes of the dataset are very different using InnoDB vs MyRocks.  While with InnoDB the size is 20GB, it is only 4.3GB with MyRocks. This is a tribute to the efficient compression capabilities of MyRocks.

InnoDB and MyRocks dataset sizes

InnoDB and MyRocks dataset sizes

A keen observer will quickly realize the compressed dataset size with MyRocks is roughly the same as the amount of memory of the virtual machine. This is not an accident, it is on purpose. I want to illustrate, maybe using an obvious use case, that you can’t use general rules like “InnoDB is faster for reads” or “MyRocks is only good for writes”. A careful answer would be: “it depends…”

 

TPC-C on MyRocks

In order to be able to run the sysbench TPC-C script, you need to use a binary collation and the read-committed isolation level. You must also avoid foreign key constraints. A typical sysbench invocation would look like this:

./tpcc.lua --mysql-host=10.0.4.112 --mysql-user=sysbench --mysql-password=sysbench --mysql-ssl=off \
   --mysql-db=sysbench --threads=4 --scale=200 --use_fk=0 --mysql_storage_engine=rocksdb \
   --mysql_table_options="COLLATE latin1_bin" --trx_level=RC --report-interval=10 --time=3600 run

I used a rocksdb_block_cache_size of 512MB. I wanted most of the memory to be available for the file cache, where the compressed SST files will be cached. The block cache just needs to be large enough to keep the index and filter blocks in memory. In terms of compression, the relevant settings in the column family options are:

compression_per_level=kLZ4Compression;bottommost_compression=kZSTD;compression_opts=-14:1:0

MyRocks uses ZStd:1 compression for the bottom level and LZ4 for the upper levels. The bottom-level compression is really critical as it contains most of the data.

Being an LSM-type storage engine, RocksDB must frequently perform level compactions. Level compactions consume IOPs and in environments where IOPs are scarce, those impact performance. Fortunately, RocksDB has the variable rocksdb_rate_limiter_bytes_per_sec to limit the impacts of compaction. The IO bandwidth used by the background compaction threads is limited by this parameter. The following figure illustrates the impacts.

myrocks

As the filesystem cache and the block cache warms up, the TPC-C transaction rates rise from 50 to around 175/s. After roughly 500s, the need for compaction arises and the performance drops. With no rate limit (0), the background threads consume too much IOPs and the compaction adversely affects the workload. With lower values of rocksdb_rate_limiter_bytes_per_sec, the impacts are reduced and the compactions are spread over longer periods of time.

For this environment, a rate limit of 4 MB/s achieves the lowest performance drops. Once warmed, the performance level never felt under 100 Trx/s. If you set rocksdb_rate_limiter_bytes_per_sec too low, like at 1MB/s, compaction cannot keep up and processing has to stall for some time. You should allocate enough bandwidth for compaction to avoid these stalls.

Long term stability

Over time, as data accumulates in the RocksDB LSM tree, performance can degrade. Using the 2 MB/s rate limiter, I pushed the runtime to 10 hours and observed very little degradation as shown in the following figure.

MyRocks performance stability

There are of course many compaction events but the performance baseline remains stable.

 

MyRocks Vs InnoDB

Now, how does this workload perform on InnoDB? InnoDB is more IO bound than MyRocks, essentially the 20GB dataset is large for the 3GB buffer pool.

MyRocks Vs InnoDB

The compaction event diminishes MyRocks performance but even then, the transaction rate stays well above the InnoDB one. Over the course of one hour, InnoDB executed 125k transactions while MyRocks achieved in excess of 575k transactions. Even if InnoDB uses compression (CMP8k), the performance level is still much lower.

Conclusion

I hope this post has raised your interest in the MyRocks storage engine. If you are paying too much for cloud-based storage and IOPs, make sure you evaluate MyRocks as it has super compression capabilities and is IO efficient.

Note: all the raw results, scripts, and configuration files used for this post can be found on Github.

Nov
03
2022
--

Raspberry on the Rocks: Build Percona Server for MySQL With MyRocks on Your Raspberry Pi

Percona Server for MySQL With MyRocks on Your Raspberry Pi

Percona Server for MySQL With MyRocks on Your Raspberry PiRaspberry PI is a small single-board computer (SBCs) developed by the Raspberry Pi Foundation in association with Broadcom. This tiny computer is extremely popular and widely used in many areas. Thanks to its size, low cost, and low energy requirements, it can be used to collect data in remote locations or from sensor devices. We often need to be able to store large amounts of data efficiently on these devices.

MyRocks is a MySQL engine that uses RocksDB to store data. It is space efficient and able to handle writes quite efficiently.

First things first

Building and installing Percona Server for MySQL with MyRocks engine enabled is easy but requires some time. It is essential to make sure that you have all the ingredients and meet all the requirements:

  • Raspberry PI 3, 4, 400, or superior.
  • SD Card with Raspberry Pi OS 11 Bullseye 64 bits
  • Internet connectivity
  • 2 ounces gin
  • 1/2 ounce lemon juice, freshly squeezed
  • 1/2 ounce raspberry syrup
  • One egg white
  • Three raspberries

Preparation tasks

You need a Raspberry PI with a 64bits processor. We assume you already have installed the most recent version of Raspios. As of this writing, the version is 11 Bullseye.

Install screen package

Screen will allow you to reconnect if the connection is lost during the installation procedure. We install it, and we open a new session.

sudo -i
apt-get install screen
screen -LS install_mysql

Increase available memory

To build and install MySQL, you need a significant amount of memory. We recommend adding at least 4Gb of swap.

dd if=/dev/zero of=/swapfile4GB bs=1M count=4096
mkswap /swapfile4GB
chmod 0600 /swapfile4GB
swapon /swapfile4GB

Install additional packages

We need git to clone the Percona repository that contains MySQL.

apt-get install git

And some packages and libraries are needed to build and run MySQL.

apt-get install libjemalloc-dev libjemalloc2
apt-get install libzstd-dev zstd libevent-dev libicu-dev lz4 \
liblz4-dev cmake libprotobuf-dev protobuf-compiler \
libfido2-dev libssl-dev libldap-dev libsasl2-dev \
libsasl2-modules-gssapi-mit libkrb5-dev \
libncurses-dev libaio-dev libreadline-dev \
libprotoc-dev bison libcurl4-openssl-dev

Once we have all those packages installed, we can download the MySQL code.

Build Percona Server for MySQL

Configure the environment and download the source code

First, we create a working directory and move the working directory there.

mkdir ws
cd ws

Then we clone the source code. This process will take some time.

git clone https://github.com/percona/percona-server.git

The Clover Club Cocktail

The Clover Club was a men’s club established in 1882 in Philadelphia. The club met in the Bellevue-Stratford Hotel. It was pretty popular, having writers, lawyers, and businessmen among its members. The club gatherings were not used only to discuss issues of the day, but also included drinks. One of the most liked ones was a cocktail made of gin, lemon juice, raspberry syrup, and egg white.

The Clover Club Cocktail enjoyed some popularity for some time but almost disappeared for most of the 20th century. The cocktail recovered some of its popularity during the first decade of the 21st century thanks to modern cocktail books. More on this later…

Configure the code

The first step is changing to the branch you want to build. In our example, we are using the most recently released branch. But you could use a development branch if you want; just make sure that you understand the implications of using a version that hasn’t been released yet.

cd percona-server
git checkout -b current Percona-Server-8.0.29-21
git submodule init
git submodule update

This will download some additional code.

Fix configuration

If you are building a version previous to 8.0.30, then you need to edit the MyRocks cmake file to allow MyRocks to be built on a different architecture than i86_64. The file is located in percona-server/storage/rocksdb and the name is CMakeLists.txt

What needs to be done is adding arm64 to the supported CPU architectures and checking that specific intel i86_64-specific extensions are not used. If you prefer, you can just replace the file with the version from my GitHub repository.

https://raw.githubusercontent.com/peppla/blog_posts/main/RaspberryOnTheRocks/storage/rocksdb/CMakeLists.txt

Configure

We create a directory only for the build process:

cd ..
mkdir percona-build
cd percona-build

And we launch the configuration process. This process takes some minutes to complete.

cmake ../percona-server \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_CONFIG=mysql_release \
-DMYSQL_MAINTAINER_MODE=OFF \
-DDOWNLOAD_BOOST=ON \
-DWITH_BOOST=../deps \
-DWITH_SYSTEM_LIBS=ON \
-DWITHOUT_TOKUDB=ON \
-DWITH_ROCKSDB=ON

The configuration should finish without errors. If you get any error regarding the version of zlib, you can add a flag to the configuration to use the bundled version:

cmake ../percona-server \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_CONFIG=mysql_release \
-DMYSQL_MAINTAINER_MODE=OFF \
-DDOWNLOAD_BOOST=ON \
-DWITH_BOOST=../deps \
-DWITH_SYSTEM_LIBS=ON \
-DWITHOUT_TOKUDB=ON \
-DWITH_ROCKSDB=ON \
-DWITH_ZLIB=bundled

Build

Once we complete the configuration, the build process is quite straightforward. Just run the following command:

cmake --build . -- -j 4

The building process will take some time, depending on the speed of your processor and how fast your storage is. In any case, if executed successfully, it should take a few hours.

In the meantime…

To prepare the raspberry syrup:

  • Add one cup of sugar and 1/2 cup of water to a pot.
  • Stir over medium heat until the sugar has dissolved.
  • Reduce heat to low, add 1/2 cup fresh raspberries, and macerate into a pulp.
  • Fine-strain to remove the berry’s tiny seeds.
  • Optionally, add 1/2 oz vodka as a preservative.
  • Refrigerate in a jar until ready for use.

Once the raspberry syrup is ready, we can proceed with the Clover Club Cocktail: Add the gin, lemon juice, raspberry syrup, and egg white into a shaker with ice and shake vigorously until well-chilled. Strain into a chilled cocktail glass. Garnish with three speared raspberries.

Testing

Once the build process has finished, you can test the results using the MySQL test suite:

./mysql-test/mtr rocksdb.1st

You should get results similar to these:

==============================================================================
TEST NAME RESULT TIME (ms) COMMENT
------------------------------------------------------------------------------
worker[1] mysql-test-run: WARNING: running this script as _root_ will cause some tests to be skipped
[ 25%] rocksdb.1st 'write_prepared' [ pass ] 138
[ 50%] rocksdb.1st 'write_unprepared' [ pass ] 519
[ 75%] rocksdb.1st 'write_committed' [ pass ] 132
[100%] shutdown_report [ pass ]
------------------------------------------------------------------------------
The servers were restarted 2 times
The servers were reinitialized 0 times
Spent 0.789 of 116 seconds executing testcases

Completed: All 4 tests were successful.

Install and distribute

You can install the binaries you recently built by executing the following command:

make install

Or create a compressed tar file that you can use to install the server manually:

make package

Once the binaries are installed, follow the standard procedure to initialize and configure the server.

Warning

According to the Surgeon General, women should not drink alcoholic beverages during pregnancy because of the risk of birth defects. Consumption of alcoholic beverages impairs your ability to drive a car or operate machinery, including databases, and may cause health problems. Consuming raw and lightly cooked eggs poses a risk of foodborne illness.

Apr
05
2021
--

Which Version of MySQL Should I Use for MyRocks?

Version of MySQL Should I Use for MyRocks

Version of MySQL Should I Use for MyRocksAs database footprints continue to explode, many companies are looking for ways to deal with such rapid growth.  One approach is to refactor traditional relational databases to fit into a NoSQL engine, where horizontal scalability is easier.  However, in many cases, this is in no way a trivial undertaking.

Another approach that has been gaining interest is the use of MyRocks as an alternative storage engine to the traditional InnoDB.  While not for everyone, in certain use cases it could be a potential solution.  As with so many things open source, the next standard questions are: which version should I use?  Any differences with the engine if I use MyRocks with MySQL 5.7 vs 8.0?

In this post, I wanted to touch on this and give some high-level thoughts on MyRocks when it comes to the version of MySQL.

Engine Overview

At a high level, RocksDB is an embedded key-value database with data stored in a log-structured merge tree (LSM).  MyRocks is an abstraction layer that allows RocksDB to serve as a MySQL storage engine.  With RocksDB as the underlying storage layer, there are numerous advantages including faster replication, better compression, and faster data loading.

In contrast, InnoDB is B-Tree based structure.  MySQL was designed using a plugin architecture that separates the storage engine logic from the main server functionality.  This allows users to choose which storage engine they want based on their use case.  Historically, this was the MyISAM storage engine.  In recent years, InnoDB replaced MyISAM as the defacto standard and later the default engine.

Which Engine is Better?

Choosing which engine to use is very use case-specific.  Even at Facebook, where MyRocks was first developed, it was not meant to be a universal replacement for InnoDB.  Features like gap locks remain absent from MyRocks.  If a workload is dependent on gap locking, it will be a non-starter with the MyRocks engine.

Also, data access patterns should dictate the engine choice.  If the workload is write-intensive with limited range scans, it may be a good fit for MyRocks.  If you have a traditional OLTP workload with several reporting access patterns, InnoDB would remain a better option.  As mentioned in MyRocks Engine: Things to Know Before You Start:

“MyRocks is not an enhanced InnoDB, nor a one-size-fits-all replacement for InnoDB. It has its own pros/cons just like InnoDB. You need to decide which engine to use based on your application’s data access patterns.”

Which Version of MySQL Should I Use?

Assuming you have done your research and found MyRocks would be a good fit, the next choice is which version to use.  Facebook runs MyRocks on a heavily patched, internal version of MySQL 5.6Percona Server for MySQL includes MyRocks in both 5.7 and 8.0.  So what are the differences?

At a high level, the MyRocks/RocksDB code is essentially the same in all three versions.  Percona Server for MySQL uses the upstream version of RocksDB and only changes the linked version when Facebook MySQL updates the version.  This is due to the fact that it inherits the integration testing between RocksDB and MyRocks from Facebook.

As such, the biggest differences are based solely on the server version and not the server version and MyRocks combination.  If your infrastructure is already running with MySQL 5.7 in production and not ready to finalize a move to 8.0, there would be no discernible difference running the MyRocks engine with your current version.  MyRocks running on Percona Server for MySQL 5.7 is stable and shouldn’t be discarded as an option.  You should still plan to upgrade to 8.0, but there shouldn’t be any unique challenges in the process compared to using InnoDB.  It would just require the standard upgrade process and significant testing.

Moving forward (as 5.7 approaches EOL in 2023), you can expect to see fewer enhancements with MyRocks on Percona Server for MySQL 5.7.  Most new active development will be done against 8.0 while 5.7 will remain primarily in extended support, which includes critical bug fixes and security fixes only.

Conclusion

With all the version combinations and options, it can be overwhelming to pick the best option for your organization.  For a new deployment, I would recommend starting with MyRocks on MySQL 8.0.  With an existing MySQL 5.7 deployment (while you should be actively working towards an upgrade to 8.0), MyRocks is a viable and stable option as well.  The biggest driver for MyRocks should be space concerns combined with the workload.  As nice as it would be to say MyRocks is the silver bullet that works for everything, that just isn’t the case.  You can always reach out to the Professional Services team at Percona and let us help you determine if MyRocks would be a good fit for your team!

Apr
01
2021
--

Working to Validate MyRocks in the Enterprise with Dropbox

MyRocks in the Enterprise with Dropbox

MyRocks in the Enterprise with DropboxPercona Technical Account Managers get the privilege of working with some of our largest enterprise clients day in and day out.  As such, we get to really focus on how to best leverage our technology to generate measurable benefits for our users.  While it is fun to “nerd out” and always strive to use the latest and greatest, we need to stay focused on demonstrating business value and a genuine need.  Over the past few months, I’ve been working with one of my larger clients, Dropbox, along with our professional services team to validate the use of Percona Server for MySQL with the MyRocks storage engine over a large portion of their MySQL infrastructure.

Please note – this is not meant to be a deep dive into the technical details around MyRocks or the implementation.  Rather, it is meant to show how we determined the need, potential solution, and the planning that has started us down this path.  Look for a more detailed case study in the coming months as we hope to push this solution forward!

The Problem

In a single word, space.  When a client reaches a certain scale, space becomes a real concern.  With 10-20 servers, having an extra index or choosing the wrong data type can be noticeable.  However, with 10,000 servers, there is much less margin for error.  The challenge with this client is that the schema has already been sharded and optimized. Even with that, the aggregate dataset is still on the petabyte scale.  Think about what that means:

  • Large storage footprint per server (2+TB)
  • Replication multiplies that footprint for each “cluster”
  • Multiple backups per cluster result in a huge (PB+) backup footprint

There are some additional challenges created at this scale, too.  To mitigate the risk of filling a disk, there is a soft cap of 75 percent full on each server.  When a cluster hits this level, the shard is split to give each half some additional runway.  While this is great operationally, splitting shards results in doubling the cost for the cluster.  Again, on a scale of 1000s of servers, a “split” means an increase in cost in the hundreds of thousands of dollars.  This is not trivial and puts pressure on the team to delay splits as long as possible while maintaining high availability, durability, and performance.

MyRocks Use Case

After much discussion and understanding that space (and, in turn, cost) is a major business driver, we decided to investigate a change to MyRocks as the storage engine.  While standard InnoDB compression provided a small bandaid, the thinking was that MyRocks would give substantial savings.  At a very high level, MyRocks is an LSM tree-based storage engine built on top of RocksDB.  This type of storage has numerous advantages over traditional B+Tree storage engines (like InnoDB), including a smaller disk footprint and reduced write amplification.  This can translate to business value in several ways, including:

  • Decreased cloud/HDFS storage cost for backups
  • Reduced Write Amplification results in a longer disk lifetime, reducing fleet turnover
  • Decreased instance space requires fewer splits and increases runway

After several rounds of discussion around the benefits and any blockers, we decided to begin testing the feasibility of converting from InnoDB to MyRocks.

Initial Testing

While the use case seemed like a solid fit, extensive testing is always needed.  So far, the initial testing looks promising.  We’ve seen a large reduction in space combined with an increase in performance.  One example cluster saw nearly a 75 percent reduction in space, to the point that we were able to test running two replica instances on a single server in parallel.

What makes this interesting is that the test cluster in question has replicas that periodically show lag and generally are close to the replication limit.  In contrast, the two MyRocks replicas showed no lag during the same test period despite running two instances on one physical server.

While this isn’t something that would be done in production, it was impressive to see double the workload operating so efficiently on the same hardware.  I/O and CPU utilization were both noticeably lower than the single replica running InnoDB.  This shows the potential of increased server longevity and less frequent splits that we were hoping to see.  If these numbers and this performance were to hold into production, we could see savings on the order of millions of dollars across the entire fleet.

Note – this early testing has been limited to replicas only so we can’t yet validate this performance at the production level of concurrency of a primary server.  These results are only with four replica threads, so contention has been minor.

Looking Forward

So far, our limited testing has shown some real promise.  There are still some blockers that we need to overcome, but early results are encouraging. Currently, the biggest technical challenges in this project include:

  • MyRocks currently lacks pt-table-checksum support (for replica consistency)
  • Modifying and validating existing backup/restore/clone automation

Once we are able to fully validate this solution and solve the existing challenges (both fixes have been identified and are on the project roadmap) in the coming months, look for a more detailed case study.  While not every organization needs to make such a drastic change, this use case is one that is becoming more common at the enterprise level as data sets continue to explode.

Want to learn more? Check out the Percona MyRocks Introduction page on our website!

Feb
07
2019
--

Column Families in MyRocks

myrocks column families

myrocks column familiesIn my webinar How To Rock with MyRocks I briefly mentioned the column families feature in MyRocks, that allows a fine tuning for indexes and primary keys.

Let’s review it in more detail.

To recap, MyRocks is based on the RocksDB library, which stores all data in [key => value] pairs, so when it translates to MySQL, all Primary Keys (data) and secondary keys (indexes) are stored in [ key => value ] pairs, which by default are assigned to “default” Column Family.

Each column family has individual set of

  • SST files, and their parameters
  • Memtable and its parameters
  • Bloom filters, and their parameters
  • Compression settings

There is a N:1 relation between tables and indexes to column family, so schematically it looks like this:

column families myrocks

How do you assign tables and indexes to a column family?

It is defined in the COMMENT section for a key or primary key:

CREATE TABLE tab1(
a INT,
b INT,
PRIMARY KEY (a) COMMENT ‘cfname=cf1’,
KEY key_b (b) COMMENT ‘cfname=cf2’)
)

Now, if you want to define individual parameters for column families, you should use

rocksdb_override_cf_options 

For example:

rocksdb_override_cf_options=’cf1={compression=kNoCompression}; cf2={compression=kLZ4Compression,bottommost_compression==kZSTD}’

Be careful of defining too many column families: as I mentioned, each column family will use an individual memtable, which takes 64MB of memory by default.

There is also an individual set of SST tables. You can see how they perform with

SHOW ENGINE ROCKSDB STATUS

 :

  Type: CF_COMPACTION
  Name: cf1
Status: 
** Compaction Stats [cf1] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0     0.00 KB   0.0      0.0     0.0      0.0       1.5      1.5       0.0   1.0      0.0     99.1        15       37    0.410       0      0
  L5      3/0   197.05 MB   0.8      0.4     0.4      0.0       0.4      0.4       0.0   1.0     75.6     75.6         6        1    5.923   8862K      0
  L6      7/0   341.24 MB   0.0      1.7     1.3      0.5       0.8      0.3       0.0   0.6     42.8     19.5        42        7    5.933     61M      0
 Sum     10/0   538.29 MB   0.0      2.2     1.7      0.5       2.7      2.2       0.0   1.8     35.5     44.1        63        45    1.392     70M     0
 Int      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0        0    0.000       0      0
  Type: CF_COMPACTION
  Name: cf2
Status: 
** Compaction Stats [cf2] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.3      0.3       0.0   1.0      0.0     13.5        22       22    1.023       0      0
  L6      4/0   178.61 MB   0.0      0.6     0.3      0.3       0.4      0.2       0.0   1.5      9.3      7.3        61        5   12.243     72M      0
 Sum      4/0   178.61 MB   0.0      0.6     0.3      0.3       0.7      0.5       0.0   2.5      6.8      9.0        84       27    3.100     72M      0
 Int      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0        0    0.000       0      0

To check the current column families and their settings you can use:

SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_CF_OPTIONS:
| cf1        | COMPARATOR                                                      | RocksDB_SE_v3.10      
|
| cf1        | MERGE_OPERATOR                                                  | NULL                                                                                                      |
| cf1        | COMPACTION_FILTER                                               | NULL                                                                                                                                                                     
| cf1        | COMPACTION_FILTER_FACTORY                                       | Rdb_compact_filter_factory                                                                                                                                               
| cf1        | WRITE_BUFFER_SIZE                                               | 67108864                                                                                                       
| cf1        | MAX_WRITE_BUFFER_NUMBER                                         | 2                                                                                                                                                                        
| cf1        | MIN_WRITE_BUFFER_NUMBER_TO_MERGE                                | 1                                                                                                                                                                        
| cf1        | NUM_LEVELS                                                      | 7                                                                                                                                                                        
| cf1        | LEVEL0_FILE_NUM_COMPACTION_TRIGGER                              | 4                                                                                                                                                                        
| cf1        | LEVEL0_SLOWDOWN_WRITES_TRIGGER                                  | 20                                                                                                                                                                       
| cf1        | LEVEL0_STOP_WRITES_TRIGGER                                      | 36                                                                                                                                                                       
| cf1        | MAX_MEM_COMPACTION_LEVEL                                        | 0                                                                                                                                                                        
| cf1        | TARGET_FILE_SIZE_BASE                                           | 67108864                                                                                                                                                                 
| cf1        | TARGET_FILE_SIZE_MULTIPLIER                                     | 1                                                                                                                                                                        
| cf1        | MAX_BYTES_FOR_LEVEL_BASE                                        | 268435456                                                                                                                                                                
| cf1        | LEVEL_COMPACTION_DYNAMIC_LEVEL_BYTES                            | ON                                                                                                                                                                       
| cf1        | MAX_BYTES_FOR_LEVEL_MULTIPLIER                                  | 10.000000                                                                                                                                                                
| cf1        | SOFT_RATE_LIMIT                                                 | 0.000000                                                                                                                                                                 
| cf1        | HARD_RATE_LIMIT                                                 | 0.000000                                                                                                                                                                 
| cf1        | RATE_LIMIT_DELAY_MAX_MILLISECONDS                               | 100                                                                                                                                                                      
| cf1        | ARENA_BLOCK_SIZE                                                | 0                                                                                                                                                                        
| cf1        | DISABLE_AUTO_COMPACTIONS                                        | OFF                                                                                                                                                                      
| cf1        | PURGE_REDUNDANT_KVS_WHILE_FLUSH                                 | ON                                                                                                                                                                       
| cf1        | MAX_SEQUENTIAL_SKIP_IN_ITERATIONS                               | 8                                                                                                                                                                        
| cf1        | MEMTABLE_FACTORY                                                | SkipListFactory                                                                                                                                                          
| cf1        | INPLACE_UPDATE_SUPPORT                                          | OFF                                                                                                                                                                      
| cf1        | INPLACE_UPDATE_NUM_LOCKS                                        | ON                                                                                                                                                                       
| cf1        | MEMTABLE_PREFIX_BLOOM_BITS_RATIO                                | 0.000000                                                                                                                                                                 
| cf1        | MEMTABLE_PREFIX_BLOOM_HUGE_PAGE_TLB_SIZE                        | 0                                                                                                                                                                        
| cf1        | BLOOM_LOCALITY                                                  | 0                                                                                                                                                                        
| cf1        | MAX_SUCCESSIVE_MERGES                                           | 0                                                                                                                                                                        
| cf1        | OPTIMIZE_FILTERS_FOR_HITS                                       | ON                                                                                                                                                                       
| cf1        | MAX_BYTES_FOR_LEVEL_MULTIPLIER_ADDITIONAL                       | 1:1:1:1:1:1:1                                                                                                                                                            
| cf1        | COMPRESSION_TYPE                                                | kNoCompression                                                                                                                                                           
| cf1        | COMPRESSION_PER_LEVEL                                           | NUL                                                                                                                                                                      
| cf1        | COMPRESSION_OPTS                                                | -14:-1:0                                                                                                                                                                 
| cf1        | BOTTOMMOST_COMPRESSION                                          | kLZ4Compression                                                                                                                                                          
| cf1        | PREFIX_EXTRACTOR                                                | NULL                                                                                                                                                                     
| cf1        | COMPACTION_STYLE                                                | kCompactionStyleLevel                                                                                                                                                    
| cf1        | COMPACTION_OPTIONS_UNIVERSAL                                    | {SIZE_RATIO=1; MIN_MERGE_WIDTH=2; MAX_MERGE_WIDTH=4294967295; MAX_SIZE_AMPLIFICATION_PERCENT=200; COMPRESSION_SIZE_PERCENT=-1; STOP_STYLE=kCompactionStopStyleTotalSize} |
| cf1        | COMPACTION_OPTION_FIFO::MAX_TABLE_FILES_SIZE                    | 1073741824                                                                                                                                                               
| cf1        | TABLE_FACTORY::FLUSH_BLOCK_POLICY_FACTORY                       | FlushBlockBySizePolicyFactory(0x4715df0)                                                                                                                                 
| cf1        | TABLE_FACTORY::CACHE_INDEX_AND_FILTER_BLOCKS                    | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::CACHE_INDEX_AND_FILTER_BLOCKS_WITH_HIGH_PRIORITY | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::PIN_L0_FILTER_AND_INDEX_BLOCKS_IN_CACHE          | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::INDEX_TYPE                                       | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::HASH_INDEX_ALLOW_COLLISION                       | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::CHECKSUM                                         | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::NO_BLOCK_CACHE                                   | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::BLOCK_CACHE                                      | 0x470c880                                                                                                                                                                
| cf1        | TABLE_FACTORY::BLOCK_CACHE_NAME                                 | LRUCache                                                                                                                                                                 
| cf1        | TABLE_FACTORY::BLOCK_CACHE_OPTIONS                              |                                                                                                                                                                          
| cf1        | TABLE_FACTORY::CAPACITY                                         | 536870912                                                                                                                                                                
| cf1        | TABLE_FACTORY::NUM_SHARD_BITS                                   | 6                                                                                                                                                                        
| cf1        | TABLE_FACTORY::STRICT_CAPACITY_LIMIT                            | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::HIGH_PRI_POOL_RATIO                              | 0.000                                                                                                                                                                    
| cf1        | TABLE_FACTORY::BLOCK_CACHE_COMPRESSED                           | (nil)                                                                                                                                                                    
| cf1        | TABLE_FACTORY::PERSISTENT_CACHE                                 | (nil)                                                                                                                                                                    
| cf1        | TABLE_FACTORY::BLOCK_SIZE                                       | 16384                                                                                                                                                                    
| cf1        | TABLE_FACTORY::BLOCK_SIZE_DEVIATION                             | 10                                                                                                                                                                       
| cf1        | TABLE_FACTORY::BLOCK_RESTART_INTERVAL                           | 16                                                                                                                                                                       
| cf1        | TABLE_FACTORY::INDEX_BLOCK_RESTART_INTERVAL                     | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::METADATA_BLOCK_SIZE                              | 4096                                                                                                                                                                     
| cf1        | TABLE_FACTORY::PARTITION_FILTERS                                | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::USE_DELTA_ENCODING                               | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::FILTER_POLICY                                    | rocksdb.BuiltinBloomFilter                                                                                                                                               
| cf1        | TABLE_FACTORY::WHOLE_KEY_FILTERING                              | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::VERIFY_COMPRESSION                               | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::READ_AMP_BYTES_PER_BIT                           | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::FORMAT_VERSION                                   | 2                                                                                                                                                                        
| cf1        | TABLE_FACTORY::ENABLE_INDEX_COMPRESSION                         | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::BLOCK_ALIGN                                      | 0                   
    

As a reminder MyRocks is available in Percona Server 5.7 and Percona Server 8.0, you can try it and share your experience!


Photo by Debby Hudson on Unsplash

Jan
14
2019
--

Upcoming Webinar Thurs 1/17: How to Rock with MyRocks

How to Rock with MyRocks

How to Rock with MyRocksPlease join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will check different workloads and when MyRocks is most suitable for you. Also, as for any new engine, it’s important to set it up and tune it properly. So, we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.

Sep
26
2018
--

Scaling IO-Bound Workloads for MySQL in the Cloud – part 2

Rplot07-innodb-iops

This post is a followup to my previous article https://www.percona.com/blog/2018/08/29/scaling-io-bound-workloads-mysql-cloud/

In this instance, I want to show the data in different dimensions, primarily to answer questions around how throughput scales with increasing IOPS.

A recap: for the test I use Amazon instances and Amazon gp2 and io1 volumes. In addition to the original post, I also tested two gpl2 volumes combined in software RAID0. I did this for the following reason: Amazon cap the single gp2 volume throughput to 160MB/sec, and as we will see from the charts, this limits InnoDB performance.

Also, a reminder from the previous post: we can increase gp2 IOPS by increasing volume size (to the top limit 10000 IOPS), and for io1 we can increase IOPS by paying per additional IOPS.

Scaling with InnoDB

So for the first result, let’s see how InnoDB scales with increasing IOPS.

There are a few interesting observations here: InnoDB scales linearly with additional IOPS, but it faces a throughput limit that Amazon applies to volumes.

So besides considering IOPS, we should take into account the maximal throughout of volumes.

In the second chart we compare InnoDB performance vs the cost of volumes:

It’s interesting to see here the slope for gp2 volumes is steeper than for io1 volumes. This means we can get a bigger increase in InnoDB performance per dollar using gp2 volumes, but only until we reach the IOPS and throughput limits that are applied to gp2 volumes.

Scaling with MyRocks

And here’s the similar chart but for MyRocks:

Here we can also see that MyRocks scales linearly, showing identical results on gp2 and io1 volumes. This means that running on gp2 will be cheaper. Also, there is no plateau in throughput, as we saw for InnoDB, which means that MyRocks uses less IO throughput.

And the chart for the cost of running MyRocks:

This charts also shows that it is cheaper to run on gp2 volume but only while it provides enough IOPS. I assume that using two gp2 volumes would allow me to double the throughput. (I did not run the test for MyRocks using two volumes)

Conclusions

  • Both MyRocks and InnoDB can scale (linearly) with additional IOPS on gp2 and io1 Amazon volumes.
  • Take into account that IOPS is not the only factor to consider. There is also throughput limit, which affects InnoDB results, so for further scaling you might need to use multiple volumes.

The post Scaling IO-Bound Workloads for MySQL in the Cloud – part 2 appeared first on Percona Database Performance Blog.

Sep
20
2018
--

MyRocks Disk Full Edge Case

problem in MyRocks

MyRocks disk full bugRocksDB engine—and it’s MySQL implementation MyRocks—is a very good alternative engine for MySQL. It has proven to be very efficient and stable for many workloads, including those of large scale. However, it is still a relative newborn in the MySQL ecosystem, and has only a small fraction of the adoption rate of InnoDB. That means it is not so well tested at all possible edge cases, and may have many unreported bugs. One known bug is discussed here: if you are a MyRocks user, it’s important that you are aware of the possibility of lost data in the specific circumstances described below.

In writing this article, I want to broadcast a wider warning, as the problem I found is pretty serious and could lead to a very unpleasant situation.

The problem is related to not a very edge case after all – disk full. The result could be extremely bad, though. After printing some errors, the RocksDB engine continues to message clients that consecutive writes are accepted, while they are, in fact, lost! In addition, depending on the workload and at which moment the disk ran out of space, in a worst case scenario we may lose access to tables that use the RocksDB engine completely.

Let me show how bad the situation could be, using a basic sandbox and sysbench test example. I have a sandbox with MyRocks installed:

mysql [localhost] {msandbox} ((none)) > show engines\G
*************************** 1. row ***************************
Engine: ROCKSDB
Support: YES
Comment: RocksDB storage engine
Transactions: YES
XA: YES
Savepoints: YES
mysql [localhost] {msandbox} ((none)) > select @@version,@@version_comment;
+-----------+----------------------------------------------------+
| @@version | @@version_comment |
+-----------+----------------------------------------------------+
| 5.7.22-22 | Percona Server (GPL), Release 22, Revision f62d93c |
+-----------+----------------------------------------------------+
1 row in set (0.00 sec)

Once my test system had not so much free disk space remaining, I ran this simple sysbench prepare command:

$ sysbench /usr/share/sysbench/oltp_insert.lua --mysql_storage_engine=rocksdb --table-size=1000000 --tables=4 --mysql-db=db2 --mysql-user=root --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox5723.sock --threads=4 --time=200 --report-interval=1 --events=0 --db-driver=mysql prepare
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
Initializing worker threads...
Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest1'...
Creating table 'sbtest4'...
Inserting 1000000 records into 'sbtest2'
Inserting 1000000 records into 'sbtest1'
Inserting 1000000 records into 'sbtest3'
Inserting 1000000 records into 'sbtest4'
Creating a secondary index on 'sbtest4'...
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest4_k_4_4_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_4 ON sbtest4(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest4_k_4_4_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest3'...
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest2_k_2_5_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_2 ON sbtest2(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest2_k_2_5_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest3_k_3_6_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_3 ON sbtest3(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest3_k_3_6_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest1_k_1_7_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_1 ON sbtest1(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest1_k_1_7_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device

This resulted in the following errors being printed to the error log:

2018-08-10T11:01:35.379611Z 5 [Note] RocksDB: Begin index creation (0,260)
2018-08-10T11:01:40.105596Z 3 [Note] RocksDB: Begin index creation (0,261)
2018-08-10T11:01:40.838902Z 4 [Note] RocksDB: Begin index creation (0,262)
2018-08-10T11:01:40.924093Z 6 [Note] RocksDB: Begin index creation (0,263)
2018-08-10T11:03:14.868958Z 0 [ERROR] RocksDB: Error detected in background, Status Code: 5, Status: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
2018-08-10T11:03:14.868986Z 0 [ERROR] RocksDB: BackgroundErrorReason: 0
2018-08-10T11:03:14.869003Z 0 [ERROR] LibRocksDB:[/mnt/workspace/percona-server-5.7-binaries-release-rocks/label_exp/min-stretch-x64/percona-server-5.7.22-22/storage/rocksdb/rocksdb/db/db_impl_compaction_flush.cc:1293] Waiting after background flush error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on deviceAccumulated background error counts: 1
2018-08-10T11:03:15.430064Z 0 [ERROR] LibRocksDB:[/mnt/workspace/percona-server-5.7-binaries-release-rocks/label_exp/min-stretch-x64/percona-server-5.7.22-22/storage/rocksdb/rocksdb/db/db_impl_compaction_flush.cc:1373] Waiting after background compaction error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000102.sst: No space left on device, Accumulated background error counts: 2
2018-08-10T11:03:18.538866Z 9 [Note] RocksDB: Begin index creation (0,268)
2018-08-10T11:03:19.897797Z 9 [ERROR] Error finishing bulk load.
2018-08-10T11:03:23.306957Z 10 [Note] RocksDB: Begin index creation (0,269)
2018-08-10T11:03:23.509448Z 8 [Note] RocksDB: Begin index creation (0,270)
2018-08-10T11:03:23.539705Z 7 [Note] RocksDB: Begin index creation (0,271)
2018-08-10T11:03:24.536092Z 10 [ERROR] Error finishing bulk load.
2018-08-10T11:03:24.659233Z 7 [ERROR] Error finishing bulk load.
2018-08-10T11:03:24.736380Z 8 [ERROR] Error finishing bulk load.

OK, so I resized my data partition online (no MySQL service restart) and gave it more free disk space. And here is what happened next. I tried to see if MySQL/MyRocks works now:

mysql [localhost] {msandbox} (db1) > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.33 sec)
mysql [localhost] {msandbox} (db1) > insert into sbtest1 set k=1;
Query OK, 1 row affected (0.01 sec)
mysql [localhost] {msandbox} (db1) > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.32 sec)
mysql [localhost] {msandbox} (db1) > delete from db1.sbtest2 where id<100;
Query OK, 99 rows affected (0.01 sec)
mysql [localhost] {msandbox} (db1) > select count(*) from db1.sbtest2;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.32 sec)

That’s the worst part – there’s no error returned to the client, and yet my new writes are basically gone!

The good thing is we can still read from the tables, so mysqldump or mydumper is the way to go now, BEFORE you attempt to restart the service:

mysql [localhost] {msandbox} (db1) > select id from db1.sbtest2 where id<10;
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+----+
9 rows in set (0.00 sec)

When I later try to restart the service, it crashes on shutdown. In some cases, the database is able to recover and offer access to the old RocksDB data (though the new data, written above, are still lost).

In other cases, though, your RocksDB won’t load after restart:

mysql [localhost] {msandbox} (db1) > show create table sbtest1\G
ERROR 1286 (42000): Unknown storage engine 'ROCKSDB'

and using this option:

rocksdb_wal_recovery_mode = 3

…helps to recover the engine to usable state.

Ultimately, the worst case scenario is that you end up with the MyRocks engine dead and unable to recover, no matter what options you try. We can see this in the error log when I try to start the service:

2018-08-10T09:33:18.756868Z 0 [Note] RocksDB: 2 column families found
2018-08-10T09:33:18.756993Z 0 [Note] RocksDB: Column Families at start:
2018-08-10T09:33:18.757016Z 0 [Note] cf=default
2018-08-10T09:33:18.757022Z 0 [Note] write_buffer_size=67108864
2018-08-10T09:33:18.757027Z 0 [Note] target_file_size_base=67108864
2018-08-10T09:33:18.757043Z 0 [Note] cf=__system__
2018-08-10T09:33:18.757048Z 0 [Note] write_buffer_size=67108864
2018-08-10T09:33:18.757052Z 0 [Note] target_file_size_base=67108864
2018-08-10T09:33:18.811671Z 0 [ERROR] RocksDB: Could not get index information for Index Number (0,282), table db3.sbtest4
2018-08-10T09:33:18.811741Z 0 [ERROR] RocksDB: Failed to initialize DDL manager.
2018-08-10T09:33:18.811795Z 0 [ERROR] Plugin 'ROCKSDB' init function returned error.
2018-08-10T09:33:18.811812Z 0 [ERROR] Plugin 'ROCKSDB' registration as a STORAGE ENGINE failed.

Unfortunately, we were not able to find any workaround for this ultimate worst case situation, and so recovering data—including all other RocksDB tables not touched during the incident—may be very hard.

The problem, which is obviously a bug, is reported here: https://jira.percona.com/browse/PS-4706.

Fortunately, the engineers at Facebook—the original RocksDB founder—are aware of the issue and have already released a simple patch. This allows the engine simply to crash instead of allowing lost writes. While this is not a perfect solution, it at least protects you from losing data. Related commit details can be found here: https://github.com/facebook/mysql-5.6/commit/4a7d3fb8f96c96c2be10d61a7796ccac7610a5d6.

At Percona, we accepted that fix, and the latest Percona Server for MySQL 5.7.23 version, already has it incorporated: https://github.com/percona/percona-server/commit/d0d5635573ebc4f157e36a989d14f7ee6f87c410

So, with the upgraded version, we instead will see this kind of error log event:

2018-09-19T08:12:14.496412Z 0 [ERROR] RocksDB: Error detected in background, Status Code: 5, Status: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000104.sst: No space left on device
2018-09-19T08:12:14.496435Z 0 [ERROR] RocksDB: BackgroundErrorReason: 1
2018-09-19T08:12:14.496451Z 0 [ERROR] LibRocksDB:[/mnt/workspace/percona-server-5.7-binaries-release-rocks-new/label_exp/min-stretch-x64/test/percona-server-5.7.23-23/storage/rocksdb/rocksdb/db/db_impl_compactio
n_flush.cc:1472] Waiting after background compaction error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000104.sst: No space left on device, Accumulated background error counts: 1
2018-09-19T08:12:14.511910Z 12 [ERROR] RocksDB: failed to write to WAL, Status Code: 5, Status: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000104.sst: No space left on device
2018-09-19T08:12:14.511935Z 12 [ERROR] MyRocks: aborting on WAL write error.
08:12:14 UTC - mysqld got signal 6 ;

Therefore, all MyRocks users are advised to upgrade ASAP and if that’s not possible, you should at least double check the disk space monitoring and alerting.

The post MyRocks Disk Full Edge Case appeared first on Percona Database Performance Blog.

Aug
29
2018
--

Scaling IO-Bound Workloads for MySQL in the Cloud

InnoDB / MyRocks throughput on IO1

Is increasing GP2 volumes size or increasing IOPS for IO1 volumes a valid method for scaling IO-Bound workloads? In this post I’ll focus on one question: how much can we improve performance if we use faster cloud volumes? This post is a continuance of previous cloud research posts:

To recap, in Amazon EC2 we can use gp2 and io1 volumes. gp2 performance can be scaled with size, i.e for gp2 volume size of 500GB we get 1500 iops; size 1000GB – 3000 iops; and for 3334GB – 10000 iops (maximal possible value). For io1 volumes we can “buy” throughput up to 30000 iops.

So I wanted to check how both InnoDB and RocksDB storage engines perform on these volumes with different throughput.

Benchmark Scenario

I will use the same datasize that I used in Saving With MyRocks in The Cloud, that is sysbench-tpcc, 50 tables, 100W each, about 500GB datasize in InnoDB and 100GB in RocksDB (compressed with LZ4).

Volumes settings: gp2 volumes from 500GB (1000GB for InnoDB) to 3400GB with 100GB increments (so each increment increases throughput by 300 iops); io1 volumes: 1TB in size, iops from 1000 to 30000 with 1000 increments.

Let’s take look at the results. I will use a slightly different format than usual, but hopefully it represents the results better. You will see density throughout the plots—a higher and narrower chart represents less variance in the throughput. The plot represents the distribution of the throughput.

Results on GP2 volumes:

InnoDB/MyRocks throughput on gp2

It’s quite interesting to see how the result scales with better IO throughput. InnoDB does not improve its throughput after gp2 size 2600GB, while MyRocks continues to scale linearly. The problem with MyRocks is that there is a lot of variance in throughput (I will show a one second resolution chart).

Results on IO1 volumes

InnoDB / MyRocks throughput on IO1

Here MyRocks again shows an impressive growth as as we add more IO capacity, but also shows a lot of variance on high capacity volumes.

Let’s compare how engines perform with one second resolution. GP2 volume, 3400GB:

InnoDB/MyRocks throughput on gp2 3400GB

IO1 volume, 30000 iops:

InnoDB/MyRocks throughput on IO1 30000 IOPS

So for MyRocks there seems to be periodical background activity, which does not allow it to achieve a stable throughput.

Raw results, if you’d like to review them, can be found here: https://github.com/Percona-Lab-results/201808-rocksdb-cloudio

Conclusions

If you are looking to improve throughput in IO-bound workloads, either increasing GP2 volumes size or increasing IOPS for IO1 volumes is a valid method, especially for the MyRocks engine.

The post Scaling IO-Bound Workloads for MySQL in the Cloud appeared first on Percona Database Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com