Apr
05
2021
--

Which Version of MySQL Should I Use for MyRocks?

Version of MySQL Should I Use for MyRocks

Version of MySQL Should I Use for MyRocksAs database footprints continue to explode, many companies are looking for ways to deal with such rapid growth.  One approach is to refactor traditional relational databases to fit into a NoSQL engine, where horizontal scalability is easier.  However, in many cases, this is in no way a trivial undertaking.

Another approach that has been gaining interest is the use of MyRocks as an alternative storage engine to the traditional InnoDB.  While not for everyone, in certain use cases it could be a potential solution.  As with so many things open source, the next standard questions are: which version should I use?  Any differences with the engine if I use MyRocks with MySQL 5.7 vs 8.0?

In this post, I wanted to touch on this and give some high-level thoughts on MyRocks when it comes to the version of MySQL.

Engine Overview

At a high level, RocksDB is an embedded key-value database with data stored in a log-structured merge tree (LSM).  MyRocks is an abstraction layer that allows RocksDB to serve as a MySQL storage engine.  With RocksDB as the underlying storage layer, there are numerous advantages including faster replication, better compression, and faster data loading.

In contrast, InnoDB is B-Tree based structure.  MySQL was designed using a plugin architecture that separates the storage engine logic from the main server functionality.  This allows users to choose which storage engine they want based on their use case.  Historically, this was the MyISAM storage engine.  In recent years, InnoDB replaced MyISAM as the defacto standard and later the default engine.

Which Engine is Better?

Choosing which engine to use is very use case-specific.  Even at Facebook, where MyRocks was first developed, it was not meant to be a universal replacement for InnoDB.  Features like gap locks remain absent from MyRocks.  If a workload is dependent on gap locking, it will be a non-starter with the MyRocks engine.

Also, data access patterns should dictate the engine choice.  If the workload is write-intensive with limited range scans, it may be a good fit for MyRocks.  If you have a traditional OLTP workload with several reporting access patterns, InnoDB would remain a better option.  As mentioned in MyRocks Engine: Things to Know Before You Start:

“MyRocks is not an enhanced InnoDB, nor a one-size-fits-all replacement for InnoDB. It has its own pros/cons just like InnoDB. You need to decide which engine to use based on your application’s data access patterns.”

Which Version of MySQL Should I Use?

Assuming you have done your research and found MyRocks would be a good fit, the next choice is which version to use.  Facebook runs MyRocks on a heavily patched, internal version of MySQL 5.6Percona Server for MySQL includes MyRocks in both 5.7 and 8.0.  So what are the differences?

At a high level, the MyRocks/RocksDB code is essentially the same in all three versions.  Percona Server for MySQL uses the upstream version of RocksDB and only changes the linked version when Facebook MySQL updates the version.  This is due to the fact that it inherits the integration testing between RocksDB and MyRocks from Facebook.

As such, the biggest differences are based solely on the server version and not the server version and MyRocks combination.  If your infrastructure is already running with MySQL 5.7 in production and not ready to finalize a move to 8.0, there would be no discernible difference running the MyRocks engine with your current version.  MyRocks running on Percona Server for MySQL 5.7 is stable and shouldn’t be discarded as an option.  You should still plan to upgrade to 8.0, but there shouldn’t be any unique challenges in the process compared to using InnoDB.  It would just require the standard upgrade process and significant testing.

Moving forward (as 5.7 approaches EOL in 2023), you can expect to see fewer enhancements with MyRocks on Percona Server for MySQL 5.7.  Most new active development will be done against 8.0 while 5.7 will remain primarily in extended support, which includes critical bug fixes and security fixes only.

Conclusion

With all the version combinations and options, it can be overwhelming to pick the best option for your organization.  For a new deployment, I would recommend starting with MyRocks on MySQL 8.0.  With an existing MySQL 5.7 deployment (while you should be actively working towards an upgrade to 8.0), MyRocks is a viable and stable option as well.  The biggest driver for MyRocks should be space concerns combined with the workload.  As nice as it would be to say MyRocks is the silver bullet that works for everything, that just isn’t the case.  You can always reach out to the Professional Services team at Percona and let us help you determine if MyRocks would be a good fit for your team!

Apr
01
2021
--

Working to Validate MyRocks in the Enterprise with Dropbox

MyRocks in the Enterprise with Dropbox

MyRocks in the Enterprise with DropboxPercona Technical Account Managers get the privilege of working with some of our largest enterprise clients day in and day out.  As such, we get to really focus on how to best leverage our technology to generate measurable benefits for our users.  While it is fun to “nerd out” and always strive to use the latest and greatest, we need to stay focused on demonstrating business value and a genuine need.  Over the past few months, I’ve been working with one of my larger clients, Dropbox, along with our professional services team to validate the use of Percona Server for MySQL with the MyRocks storage engine over a large portion of their MySQL infrastructure.

Please note – this is not meant to be a deep dive into the technical details around MyRocks or the implementation.  Rather, it is meant to show how we determined the need, potential solution, and the planning that has started us down this path.  Look for a more detailed case study in the coming months as we hope to push this solution forward!

The Problem

In a single word, space.  When a client reaches a certain scale, space becomes a real concern.  With 10-20 servers, having an extra index or choosing the wrong data type can be noticeable.  However, with 10,000 servers, there is much less margin for error.  The challenge with this client is that the schema has already been sharded and optimized. Even with that, the aggregate dataset is still on the petabyte scale.  Think about what that means:

  • Large storage footprint per server (2+TB)
  • Replication multiplies that footprint for each “cluster”
  • Multiple backups per cluster result in a huge (PB+) backup footprint

There are some additional challenges created at this scale, too.  To mitigate the risk of filling a disk, there is a soft cap of 75 percent full on each server.  When a cluster hits this level, the shard is split to give each half some additional runway.  While this is great operationally, splitting shards results in doubling the cost for the cluster.  Again, on a scale of 1000s of servers, a “split” means an increase in cost in the hundreds of thousands of dollars.  This is not trivial and puts pressure on the team to delay splits as long as possible while maintaining high availability, durability, and performance.

MyRocks Use Case

After much discussion and understanding that space (and, in turn, cost) is a major business driver, we decided to investigate a change to MyRocks as the storage engine.  While standard InnoDB compression provided a small bandaid, the thinking was that MyRocks would give substantial savings.  At a very high level, MyRocks is an LSM tree-based storage engine built on top of RocksDB.  This type of storage has numerous advantages over traditional B+Tree storage engines (like InnoDB), including a smaller disk footprint and reduced write amplification.  This can translate to business value in several ways, including:

  • Decreased cloud/HDFS storage cost for backups
  • Reduced Write Amplification results in a longer disk lifetime, reducing fleet turnover
  • Decreased instance space requires fewer splits and increases runway

After several rounds of discussion around the benefits and any blockers, we decided to begin testing the feasibility of converting from InnoDB to MyRocks.

Initial Testing

While the use case seemed like a solid fit, extensive testing is always needed.  So far, the initial testing looks promising.  We’ve seen a large reduction in space combined with an increase in performance.  One example cluster saw nearly a 75 percent reduction in space, to the point that we were able to test running two replica instances on a single server in parallel.

What makes this interesting is that the test cluster in question has replicas that periodically show lag and generally are close to the replication limit.  In contrast, the two MyRocks replicas showed no lag during the same test period despite running two instances on one physical server.

While this isn’t something that would be done in production, it was impressive to see double the workload operating so efficiently on the same hardware.  I/O and CPU utilization were both noticeably lower than the single replica running InnoDB.  This shows the potential of increased server longevity and less frequent splits that we were hoping to see.  If these numbers and this performance were to hold into production, we could see savings on the order of millions of dollars across the entire fleet.

Note – this early testing has been limited to replicas only so we can’t yet validate this performance at the production level of concurrency of a primary server.  These results are only with four replica threads, so contention has been minor.

Looking Forward

So far, our limited testing has shown some real promise.  There are still some blockers that we need to overcome, but early results are encouraging. Currently, the biggest technical challenges in this project include:

  • MyRocks currently lacks pt-table-checksum support (for replica consistency)
  • Modifying and validating existing backup/restore/clone automation

Once we are able to fully validate this solution and solve the existing challenges (both fixes have been identified and are on the project roadmap) in the coming months, look for a more detailed case study.  While not every organization needs to make such a drastic change, this use case is one that is becoming more common at the enterprise level as data sets continue to explode.

Want to learn more? Check out the Percona MyRocks Introduction page on our website!

Feb
07
2019
--

Column Families in MyRocks

myrocks column families

myrocks column familiesIn my webinar How To Rock with MyRocks I briefly mentioned the column families feature in MyRocks, that allows a fine tuning for indexes and primary keys.

Let’s review it in more detail.

To recap, MyRocks is based on the RocksDB library, which stores all data in [key => value] pairs, so when it translates to MySQL, all Primary Keys (data) and secondary keys (indexes) are stored in [ key => value ] pairs, which by default are assigned to “default” Column Family.

Each column family has individual set of

  • SST files, and their parameters
  • Memtable and its parameters
  • Bloom filters, and their parameters
  • Compression settings

There is a N:1 relation between tables and indexes to column family, so schematically it looks like this:

column families myrocks

How do you assign tables and indexes to a column family?

It is defined in the COMMENT section for a key or primary key:

CREATE TABLE tab1(
a INT,
b INT,
PRIMARY KEY (a) COMMENT ‘cfname=cf1’,
KEY key_b (b) COMMENT ‘cfname=cf2’)
)

Now, if you want to define individual parameters for column families, you should use

rocksdb_override_cf_options 

For example:

rocksdb_override_cf_options=’cf1={compression=kNoCompression}; cf2={compression=kLZ4Compression,bottommost_compression==kZSTD}’

Be careful of defining too many column families: as I mentioned, each column family will use an individual memtable, which takes 64MB of memory by default.

There is also an individual set of SST tables. You can see how they perform with

SHOW ENGINE ROCKSDB STATUS

 :

  Type: CF_COMPACTION
  Name: cf1
Status: 
** Compaction Stats [cf1] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0     0.00 KB   0.0      0.0     0.0      0.0       1.5      1.5       0.0   1.0      0.0     99.1        15       37    0.410       0      0
  L5      3/0   197.05 MB   0.8      0.4     0.4      0.0       0.4      0.4       0.0   1.0     75.6     75.6         6        1    5.923   8862K      0
  L6      7/0   341.24 MB   0.0      1.7     1.3      0.5       0.8      0.3       0.0   0.6     42.8     19.5        42        7    5.933     61M      0
 Sum     10/0   538.29 MB   0.0      2.2     1.7      0.5       2.7      2.2       0.0   1.8     35.5     44.1        63        45    1.392     70M     0
 Int      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0        0    0.000       0      0
  Type: CF_COMPACTION
  Name: cf2
Status: 
** Compaction Stats [cf2] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.3      0.3       0.0   1.0      0.0     13.5        22       22    1.023       0      0
  L6      4/0   178.61 MB   0.0      0.6     0.3      0.3       0.4      0.2       0.0   1.5      9.3      7.3        61        5   12.243     72M      0
 Sum      4/0   178.61 MB   0.0      0.6     0.3      0.3       0.7      0.5       0.0   2.5      6.8      9.0        84       27    3.100     72M      0
 Int      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0        0    0.000       0      0

To check the current column families and their settings you can use:

SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_CF_OPTIONS:
| cf1        | COMPARATOR                                                      | RocksDB_SE_v3.10      
|
| cf1        | MERGE_OPERATOR                                                  | NULL                                                                                                      |
| cf1        | COMPACTION_FILTER                                               | NULL                                                                                                                                                                     
| cf1        | COMPACTION_FILTER_FACTORY                                       | Rdb_compact_filter_factory                                                                                                                                               
| cf1        | WRITE_BUFFER_SIZE                                               | 67108864                                                                                                       
| cf1        | MAX_WRITE_BUFFER_NUMBER                                         | 2                                                                                                                                                                        
| cf1        | MIN_WRITE_BUFFER_NUMBER_TO_MERGE                                | 1                                                                                                                                                                        
| cf1        | NUM_LEVELS                                                      | 7                                                                                                                                                                        
| cf1        | LEVEL0_FILE_NUM_COMPACTION_TRIGGER                              | 4                                                                                                                                                                        
| cf1        | LEVEL0_SLOWDOWN_WRITES_TRIGGER                                  | 20                                                                                                                                                                       
| cf1        | LEVEL0_STOP_WRITES_TRIGGER                                      | 36                                                                                                                                                                       
| cf1        | MAX_MEM_COMPACTION_LEVEL                                        | 0                                                                                                                                                                        
| cf1        | TARGET_FILE_SIZE_BASE                                           | 67108864                                                                                                                                                                 
| cf1        | TARGET_FILE_SIZE_MULTIPLIER                                     | 1                                                                                                                                                                        
| cf1        | MAX_BYTES_FOR_LEVEL_BASE                                        | 268435456                                                                                                                                                                
| cf1        | LEVEL_COMPACTION_DYNAMIC_LEVEL_BYTES                            | ON                                                                                                                                                                       
| cf1        | MAX_BYTES_FOR_LEVEL_MULTIPLIER                                  | 10.000000                                                                                                                                                                
| cf1        | SOFT_RATE_LIMIT                                                 | 0.000000                                                                                                                                                                 
| cf1        | HARD_RATE_LIMIT                                                 | 0.000000                                                                                                                                                                 
| cf1        | RATE_LIMIT_DELAY_MAX_MILLISECONDS                               | 100                                                                                                                                                                      
| cf1        | ARENA_BLOCK_SIZE                                                | 0                                                                                                                                                                        
| cf1        | DISABLE_AUTO_COMPACTIONS                                        | OFF                                                                                                                                                                      
| cf1        | PURGE_REDUNDANT_KVS_WHILE_FLUSH                                 | ON                                                                                                                                                                       
| cf1        | MAX_SEQUENTIAL_SKIP_IN_ITERATIONS                               | 8                                                                                                                                                                        
| cf1        | MEMTABLE_FACTORY                                                | SkipListFactory                                                                                                                                                          
| cf1        | INPLACE_UPDATE_SUPPORT                                          | OFF                                                                                                                                                                      
| cf1        | INPLACE_UPDATE_NUM_LOCKS                                        | ON                                                                                                                                                                       
| cf1        | MEMTABLE_PREFIX_BLOOM_BITS_RATIO                                | 0.000000                                                                                                                                                                 
| cf1        | MEMTABLE_PREFIX_BLOOM_HUGE_PAGE_TLB_SIZE                        | 0                                                                                                                                                                        
| cf1        | BLOOM_LOCALITY                                                  | 0                                                                                                                                                                        
| cf1        | MAX_SUCCESSIVE_MERGES                                           | 0                                                                                                                                                                        
| cf1        | OPTIMIZE_FILTERS_FOR_HITS                                       | ON                                                                                                                                                                       
| cf1        | MAX_BYTES_FOR_LEVEL_MULTIPLIER_ADDITIONAL                       | 1:1:1:1:1:1:1                                                                                                                                                            
| cf1        | COMPRESSION_TYPE                                                | kNoCompression                                                                                                                                                           
| cf1        | COMPRESSION_PER_LEVEL                                           | NUL                                                                                                                                                                      
| cf1        | COMPRESSION_OPTS                                                | -14:-1:0                                                                                                                                                                 
| cf1        | BOTTOMMOST_COMPRESSION                                          | kLZ4Compression                                                                                                                                                          
| cf1        | PREFIX_EXTRACTOR                                                | NULL                                                                                                                                                                     
| cf1        | COMPACTION_STYLE                                                | kCompactionStyleLevel                                                                                                                                                    
| cf1        | COMPACTION_OPTIONS_UNIVERSAL                                    | {SIZE_RATIO=1; MIN_MERGE_WIDTH=2; MAX_MERGE_WIDTH=4294967295; MAX_SIZE_AMPLIFICATION_PERCENT=200; COMPRESSION_SIZE_PERCENT=-1; STOP_STYLE=kCompactionStopStyleTotalSize} |
| cf1        | COMPACTION_OPTION_FIFO::MAX_TABLE_FILES_SIZE                    | 1073741824                                                                                                                                                               
| cf1        | TABLE_FACTORY::FLUSH_BLOCK_POLICY_FACTORY                       | FlushBlockBySizePolicyFactory(0x4715df0)                                                                                                                                 
| cf1        | TABLE_FACTORY::CACHE_INDEX_AND_FILTER_BLOCKS                    | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::CACHE_INDEX_AND_FILTER_BLOCKS_WITH_HIGH_PRIORITY | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::PIN_L0_FILTER_AND_INDEX_BLOCKS_IN_CACHE          | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::INDEX_TYPE                                       | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::HASH_INDEX_ALLOW_COLLISION                       | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::CHECKSUM                                         | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::NO_BLOCK_CACHE                                   | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::BLOCK_CACHE                                      | 0x470c880                                                                                                                                                                
| cf1        | TABLE_FACTORY::BLOCK_CACHE_NAME                                 | LRUCache                                                                                                                                                                 
| cf1        | TABLE_FACTORY::BLOCK_CACHE_OPTIONS                              |                                                                                                                                                                          
| cf1        | TABLE_FACTORY::CAPACITY                                         | 536870912                                                                                                                                                                
| cf1        | TABLE_FACTORY::NUM_SHARD_BITS                                   | 6                                                                                                                                                                        
| cf1        | TABLE_FACTORY::STRICT_CAPACITY_LIMIT                            | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::HIGH_PRI_POOL_RATIO                              | 0.000                                                                                                                                                                    
| cf1        | TABLE_FACTORY::BLOCK_CACHE_COMPRESSED                           | (nil)                                                                                                                                                                    
| cf1        | TABLE_FACTORY::PERSISTENT_CACHE                                 | (nil)                                                                                                                                                                    
| cf1        | TABLE_FACTORY::BLOCK_SIZE                                       | 16384                                                                                                                                                                    
| cf1        | TABLE_FACTORY::BLOCK_SIZE_DEVIATION                             | 10                                                                                                                                                                       
| cf1        | TABLE_FACTORY::BLOCK_RESTART_INTERVAL                           | 16                                                                                                                                                                       
| cf1        | TABLE_FACTORY::INDEX_BLOCK_RESTART_INTERVAL                     | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::METADATA_BLOCK_SIZE                              | 4096                                                                                                                                                                     
| cf1        | TABLE_FACTORY::PARTITION_FILTERS                                | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::USE_DELTA_ENCODING                               | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::FILTER_POLICY                                    | rocksdb.BuiltinBloomFilter                                                                                                                                               
| cf1        | TABLE_FACTORY::WHOLE_KEY_FILTERING                              | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::VERIFY_COMPRESSION                               | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::READ_AMP_BYTES_PER_BIT                           | 0                                                                                                                                                                        
| cf1        | TABLE_FACTORY::FORMAT_VERSION                                   | 2                                                                                                                                                                        
| cf1        | TABLE_FACTORY::ENABLE_INDEX_COMPRESSION                         | 1                                                                                                                                                                        
| cf1        | TABLE_FACTORY::BLOCK_ALIGN                                      | 0                   
    

As a reminder MyRocks is available in Percona Server 5.7 and Percona Server 8.0, you can try it and share your experience!


Photo by Debby Hudson on Unsplash

Jan
14
2019
--

Upcoming Webinar Thurs 1/17: How to Rock with MyRocks

How to Rock with MyRocks

How to Rock with MyRocksPlease join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will check different workloads and when MyRocks is most suitable for you. Also, as for any new engine, it’s important to set it up and tune it properly. So, we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.

Sep
26
2018
--

Scaling IO-Bound Workloads for MySQL in the Cloud – part 2

Rplot07-innodb-iops

This post is a followup to my previous article https://www.percona.com/blog/2018/08/29/scaling-io-bound-workloads-mysql-cloud/

In this instance, I want to show the data in different dimensions, primarily to answer questions around how throughput scales with increasing IOPS.

A recap: for the test I use Amazon instances and Amazon gp2 and io1 volumes. In addition to the original post, I also tested two gpl2 volumes combined in software RAID0. I did this for the following reason: Amazon cap the single gp2 volume throughput to 160MB/sec, and as we will see from the charts, this limits InnoDB performance.

Also, a reminder from the previous post: we can increase gp2 IOPS by increasing volume size (to the top limit 10000 IOPS), and for io1 we can increase IOPS by paying per additional IOPS.

Scaling with InnoDB

So for the first result, let’s see how InnoDB scales with increasing IOPS.

There are a few interesting observations here: InnoDB scales linearly with additional IOPS, but it faces a throughput limit that Amazon applies to volumes.

So besides considering IOPS, we should take into account the maximal throughout of volumes.

In the second chart we compare InnoDB performance vs the cost of volumes:

It’s interesting to see here the slope for gp2 volumes is steeper than for io1 volumes. This means we can get a bigger increase in InnoDB performance per dollar using gp2 volumes, but only until we reach the IOPS and throughput limits that are applied to gp2 volumes.

Scaling with MyRocks

And here’s the similar chart but for MyRocks:

Here we can also see that MyRocks scales linearly, showing identical results on gp2 and io1 volumes. This means that running on gp2 will be cheaper. Also, there is no plateau in throughput, as we saw for InnoDB, which means that MyRocks uses less IO throughput.

And the chart for the cost of running MyRocks:

This charts also shows that it is cheaper to run on gp2 volume but only while it provides enough IOPS. I assume that using two gp2 volumes would allow me to double the throughput. (I did not run the test for MyRocks using two volumes)

Conclusions

  • Both MyRocks and InnoDB can scale (linearly) with additional IOPS on gp2 and io1 Amazon volumes.
  • Take into account that IOPS is not the only factor to consider. There is also throughput limit, which affects InnoDB results, so for further scaling you might need to use multiple volumes.

The post Scaling IO-Bound Workloads for MySQL in the Cloud – part 2 appeared first on Percona Database Performance Blog.

Sep
20
2018
--

MyRocks Disk Full Edge Case

problem in MyRocks

MyRocks disk full bugRocksDB engine—and it’s MySQL implementation MyRocks—is a very good alternative engine for MySQL. It has proven to be very efficient and stable for many workloads, including those of large scale. However, it is still a relative newborn in the MySQL ecosystem, and has only a small fraction of the adoption rate of InnoDB. That means it is not so well tested at all possible edge cases, and may have many unreported bugs. One known bug is discussed here: if you are a MyRocks user, it’s important that you are aware of the possibility of lost data in the specific circumstances described below.

In writing this article, I want to broadcast a wider warning, as the problem I found is pretty serious and could lead to a very unpleasant situation.

The problem is related to not a very edge case after all – disk full. The result could be extremely bad, though. After printing some errors, the RocksDB engine continues to message clients that consecutive writes are accepted, while they are, in fact, lost! In addition, depending on the workload and at which moment the disk ran out of space, in a worst case scenario we may lose access to tables that use the RocksDB engine completely.

Let me show how bad the situation could be, using a basic sandbox and sysbench test example. I have a sandbox with MyRocks installed:

mysql [localhost] {msandbox} ((none)) > show engines\G
*************************** 1. row ***************************
Engine: ROCKSDB
Support: YES
Comment: RocksDB storage engine
Transactions: YES
XA: YES
Savepoints: YES
mysql [localhost] {msandbox} ((none)) > select @@version,@@version_comment;
+-----------+----------------------------------------------------+
| @@version | @@version_comment |
+-----------+----------------------------------------------------+
| 5.7.22-22 | Percona Server (GPL), Release 22, Revision f62d93c |
+-----------+----------------------------------------------------+
1 row in set (0.00 sec)

Once my test system had not so much free disk space remaining, I ran this simple sysbench prepare command:

$ sysbench /usr/share/sysbench/oltp_insert.lua --mysql_storage_engine=rocksdb --table-size=1000000 --tables=4 --mysql-db=db2 --mysql-user=root --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox5723.sock --threads=4 --time=200 --report-interval=1 --events=0 --db-driver=mysql prepare
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
Initializing worker threads...
Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest1'...
Creating table 'sbtest4'...
Inserting 1000000 records into 'sbtest2'
Inserting 1000000 records into 'sbtest1'
Inserting 1000000 records into 'sbtest3'
Inserting 1000000 records into 'sbtest4'
Creating a secondary index on 'sbtest4'...
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest4_k_4_4_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_4 ON sbtest4(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest4_k_4_4_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest3'...
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest2_k_2_5_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_2 ON sbtest2(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest2_k_2_5_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest3_k_3_6_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_3 ON sbtest3(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest3_k_3_6_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
FATAL: mysql_drv_query() returned error 1105 ([./.rocksdb/db2.sbtest1_k_1_7_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device) for query 'CREATE INDEX k_1 ON sbtest1(k)'
FATAL: `sysbench.cmdline.call_command' function failed: /usr/share/sysbench/oltp_common.lua:238: SQL error, errno = 1105, state = 'HY000': [./.rocksdb/db2.sbtest1_k_1_7_0.bulk_load.tmp] bulk load error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device

This resulted in the following errors being printed to the error log:

2018-08-10T11:01:35.379611Z 5 [Note] RocksDB: Begin index creation (0,260)
2018-08-10T11:01:40.105596Z 3 [Note] RocksDB: Begin index creation (0,261)
2018-08-10T11:01:40.838902Z 4 [Note] RocksDB: Begin index creation (0,262)
2018-08-10T11:01:40.924093Z 6 [Note] RocksDB: Begin index creation (0,263)
2018-08-10T11:03:14.868958Z 0 [ERROR] RocksDB: Error detected in background, Status Code: 5, Status: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on device
2018-08-10T11:03:14.868986Z 0 [ERROR] RocksDB: BackgroundErrorReason: 0
2018-08-10T11:03:14.869003Z 0 [ERROR] LibRocksDB:[/mnt/workspace/percona-server-5.7-binaries-release-rocks/label_exp/min-stretch-x64/percona-server-5.7.22-22/storage/rocksdb/rocksdb/db/db_impl_compaction_flush.cc:1293] Waiting after background flush error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000101.sst: No space left on deviceAccumulated background error counts: 1
2018-08-10T11:03:15.430064Z 0 [ERROR] LibRocksDB:[/mnt/workspace/percona-server-5.7-binaries-release-rocks/label_exp/min-stretch-x64/percona-server-5.7.22-22/storage/rocksdb/rocksdb/db/db_impl_compaction_flush.cc:1373] Waiting after background compaction error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000102.sst: No space left on device, Accumulated background error counts: 2
2018-08-10T11:03:18.538866Z 9 [Note] RocksDB: Begin index creation (0,268)
2018-08-10T11:03:19.897797Z 9 [ERROR] Error finishing bulk load.
2018-08-10T11:03:23.306957Z 10 [Note] RocksDB: Begin index creation (0,269)
2018-08-10T11:03:23.509448Z 8 [Note] RocksDB: Begin index creation (0,270)
2018-08-10T11:03:23.539705Z 7 [Note] RocksDB: Begin index creation (0,271)
2018-08-10T11:03:24.536092Z 10 [ERROR] Error finishing bulk load.
2018-08-10T11:03:24.659233Z 7 [ERROR] Error finishing bulk load.
2018-08-10T11:03:24.736380Z 8 [ERROR] Error finishing bulk load.

OK, so I resized my data partition online (no MySQL service restart) and gave it more free disk space. And here is what happened next. I tried to see if MySQL/MyRocks works now:

mysql [localhost] {msandbox} (db1) > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.33 sec)
mysql [localhost] {msandbox} (db1) > insert into sbtest1 set k=1;
Query OK, 1 row affected (0.01 sec)
mysql [localhost] {msandbox} (db1) > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.32 sec)
mysql [localhost] {msandbox} (db1) > delete from db1.sbtest2 where id<100;
Query OK, 99 rows affected (0.01 sec)
mysql [localhost] {msandbox} (db1) > select count(*) from db1.sbtest2;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.32 sec)

That’s the worst part – there’s no error returned to the client, and yet my new writes are basically gone!

The good thing is we can still read from the tables, so mysqldump or mydumper is the way to go now, BEFORE you attempt to restart the service:

mysql [localhost] {msandbox} (db1) > select id from db1.sbtest2 where id<10;
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+----+
9 rows in set (0.00 sec)

When I later try to restart the service, it crashes on shutdown. In some cases, the database is able to recover and offer access to the old RocksDB data (though the new data, written above, are still lost).

In other cases, though, your RocksDB won’t load after restart:

mysql [localhost] {msandbox} (db1) > show create table sbtest1\G
ERROR 1286 (42000): Unknown storage engine 'ROCKSDB'

and using this option:

rocksdb_wal_recovery_mode = 3

…helps to recover the engine to usable state.

Ultimately, the worst case scenario is that you end up with the MyRocks engine dead and unable to recover, no matter what options you try. We can see this in the error log when I try to start the service:

2018-08-10T09:33:18.756868Z 0 [Note] RocksDB: 2 column families found
2018-08-10T09:33:18.756993Z 0 [Note] RocksDB: Column Families at start:
2018-08-10T09:33:18.757016Z 0 [Note] cf=default
2018-08-10T09:33:18.757022Z 0 [Note] write_buffer_size=67108864
2018-08-10T09:33:18.757027Z 0 [Note] target_file_size_base=67108864
2018-08-10T09:33:18.757043Z 0 [Note] cf=__system__
2018-08-10T09:33:18.757048Z 0 [Note] write_buffer_size=67108864
2018-08-10T09:33:18.757052Z 0 [Note] target_file_size_base=67108864
2018-08-10T09:33:18.811671Z 0 [ERROR] RocksDB: Could not get index information for Index Number (0,282), table db3.sbtest4
2018-08-10T09:33:18.811741Z 0 [ERROR] RocksDB: Failed to initialize DDL manager.
2018-08-10T09:33:18.811795Z 0 [ERROR] Plugin 'ROCKSDB' init function returned error.
2018-08-10T09:33:18.811812Z 0 [ERROR] Plugin 'ROCKSDB' registration as a STORAGE ENGINE failed.

Unfortunately, we were not able to find any workaround for this ultimate worst case situation, and so recovering data—including all other RocksDB tables not touched during the incident—may be very hard.

The problem, which is obviously a bug, is reported here: https://jira.percona.com/browse/PS-4706.

Fortunately, the engineers at Facebook—the original RocksDB founder—are aware of the issue and have already released a simple patch. This allows the engine simply to crash instead of allowing lost writes. While this is not a perfect solution, it at least protects you from losing data. Related commit details can be found here: https://github.com/facebook/mysql-5.6/commit/4a7d3fb8f96c96c2be10d61a7796ccac7610a5d6.

At Percona, we accepted that fix, and the latest Percona Server for MySQL 5.7.23 version, already has it incorporated: https://github.com/percona/percona-server/commit/d0d5635573ebc4f157e36a989d14f7ee6f87c410

So, with the upgraded version, we instead will see this kind of error log event:

2018-09-19T08:12:14.496412Z 0 [ERROR] RocksDB: Error detected in background, Status Code: 5, Status: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000104.sst: No space left on device
2018-09-19T08:12:14.496435Z 0 [ERROR] RocksDB: BackgroundErrorReason: 1
2018-09-19T08:12:14.496451Z 0 [ERROR] LibRocksDB:[/mnt/workspace/percona-server-5.7-binaries-release-rocks-new/label_exp/min-stretch-x64/test/percona-server-5.7.23-23/storage/rocksdb/rocksdb/db/db_impl_compactio
n_flush.cc:1472] Waiting after background compaction error: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000104.sst: No space left on device, Accumulated background error counts: 1
2018-09-19T08:12:14.511910Z 12 [ERROR] RocksDB: failed to write to WAL, Status Code: 5, Status: IO error: No space left on deviceWhile appending to file: ./.rocksdb/000104.sst: No space left on device
2018-09-19T08:12:14.511935Z 12 [ERROR] MyRocks: aborting on WAL write error.
08:12:14 UTC - mysqld got signal 6 ;

Therefore, all MyRocks users are advised to upgrade ASAP and if that’s not possible, you should at least double check the disk space monitoring and alerting.

The post MyRocks Disk Full Edge Case appeared first on Percona Database Performance Blog.

Aug
29
2018
--

Scaling IO-Bound Workloads for MySQL in the Cloud

InnoDB / MyRocks throughput on IO1

Is increasing GP2 volumes size or increasing IOPS for IO1 volumes a valid method for scaling IO-Bound workloads? In this post I’ll focus on one question: how much can we improve performance if we use faster cloud volumes? This post is a continuance of previous cloud research posts:

To recap, in Amazon EC2 we can use gp2 and io1 volumes. gp2 performance can be scaled with size, i.e for gp2 volume size of 500GB we get 1500 iops; size 1000GB – 3000 iops; and for 3334GB – 10000 iops (maximal possible value). For io1 volumes we can “buy” throughput up to 30000 iops.

So I wanted to check how both InnoDB and RocksDB storage engines perform on these volumes with different throughput.

Benchmark Scenario

I will use the same datasize that I used in Saving With MyRocks in The Cloud, that is sysbench-tpcc, 50 tables, 100W each, about 500GB datasize in InnoDB and 100GB in RocksDB (compressed with LZ4).

Volumes settings: gp2 volumes from 500GB (1000GB for InnoDB) to 3400GB with 100GB increments (so each increment increases throughput by 300 iops); io1 volumes: 1TB in size, iops from 1000 to 30000 with 1000 increments.

Let’s take look at the results. I will use a slightly different format than usual, but hopefully it represents the results better. You will see density throughout the plots—a higher and narrower chart represents less variance in the throughput. The plot represents the distribution of the throughput.

Results on GP2 volumes:

InnoDB/MyRocks throughput on gp2

It’s quite interesting to see how the result scales with better IO throughput. InnoDB does not improve its throughput after gp2 size 2600GB, while MyRocks continues to scale linearly. The problem with MyRocks is that there is a lot of variance in throughput (I will show a one second resolution chart).

Results on IO1 volumes

InnoDB / MyRocks throughput on IO1

Here MyRocks again shows an impressive growth as as we add more IO capacity, but also shows a lot of variance on high capacity volumes.

Let’s compare how engines perform with one second resolution. GP2 volume, 3400GB:

InnoDB/MyRocks throughput on gp2 3400GB

IO1 volume, 30000 iops:

InnoDB/MyRocks throughput on IO1 30000 IOPS

So for MyRocks there seems to be periodical background activity, which does not allow it to achieve a stable throughput.

Raw results, if you’d like to review them, can be found here: https://github.com/Percona-Lab-results/201808-rocksdb-cloudio

Conclusions

If you are looking to improve throughput in IO-bound workloads, either increasing GP2 volumes size or increasing IOPS for IO1 volumes is a valid method, especially for the MyRocks engine.

The post Scaling IO-Bound Workloads for MySQL in the Cloud appeared first on Percona Database Performance Blog.

Aug
03
2018
--

This Week in Data with Colin Charles 47: MySQL 8.0.12 and It’s Time To Submit!

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Don’t wait, submit a talk for Percona Live Europe 2018 to be held in Frankfurt 5-7 November 2018. The call for proposals is ending soon, there is a committee being created, and it is a great conference to speak at, with a new city to boot!

Releases

  • A big release, MySQL 8.0.12, with INSTANT ADD COLUMN support, BLOB optimisations, changes around replication, the query rewrite plugin and lots more. Naturally this also means the connectors get bumped up to the 8.0.12, including a nice new MySQL Shell.
  • A maintenance release, with security fixes, MySQL 5.5.61 as well as MariaDB 5.5.61.
  • repmgr v4.1 helps monitor PostgreSQL replication, and can handle switch overs and failovers.

Link List

  • Saving With MyRocks in The Cloud – a great MyRocks use case, as in the cloud, resources are major considerations and you can save on I/O with MyRocks. As long as your workload is I/O bound, you’re bound to benefit.
  • Hasura GraphQL Engine allows you to get an instant GraphQL API on any PostgreSQL based application. This is in addition to Graphile. For MySQL users, there is Prisma.

Industry Updates

  • Jeremy Cole (Linkedin) ended his sabbatical to start work at Shopify. He was previously hacking on MySQL and MariaDB Server at Google, and had stints at Twitter, Yahoo!, his co-owned firm Proven Scaling, as well as MySQL AB.
  • Dremio raises $30 million from the likes of Cisco and more for their Series B. They are a “data-as-a-service” company, having raised a total of $45m in two rounds (Crunchbase).

Upcoming Appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

 

The post This Week in Data with Colin Charles 47: MySQL 8.0.12 and It’s Time To Submit! appeared first on Percona Database Performance Blog.

Aug
01
2018
--

Saving With MyRocks in The Cloud

The main focus of a previous blog post was the performance of MyRocks when using fast SSD devices. However, I figured that MyRocks would be beneficial for use in cloud workloads, where storage is either slow or expensive.

In that earlier post, we demonstrated the benefits of MyRocks, especially for heavy IO workloads. Meanwhile, Mark wrote in his blog that the CPU overhead in MyRocks might be significant for CPU-bound workloads, but this should not be the issue for IO-bound workloads.

In the cloud the cost of resources is a major consideration. Let’s review the annual cost for the processing and storage resources.

 Resource cost/year, $   IO cost $/year   Total $/year 
c5.9xlarge  7881    7881
1TB io1 5000 IOPS  1500  3900    5400
1TB io1 10000 IOPS  1500  7800    9300
1TB io1 15000 IOPS  1500  11700  13200
1TB io1 20000 IOPS  1500  15600  17100
1TB io1 30000 IOPS  1500  23400  24900
3.4TB GP2 (10000 IOPS)  4800    4800

 

The scenario

The server version is Percona Server 5.7.22

For instances, I used c5.9xlarge instances. The reason for c5 was that it provides high performance Nitro virtualization: Brendan Gregg describes this in his blog post. The rationale for 9xlarge instances was to be able to utilize io1 volumes with a 30000 IOPS throughput – smaller instances will cap io1 throughput at a lower level.

I also used huge gp2 volumes: 3400GB, as this volume provides guaranteed 10000 IOPS even if we do not use io1 volumes. This is a cheaper alternative to io1 volumes to achieve 10000 IOPS.

For the workload I used sysbench-tpcc 5000W (50 tables * 100W), which for InnoDB gave about 471GB in storage used space.

For the cache I used 27GB and 54G buffer size, so the workload is IO-heavy.

I wanted to compare how InnoDB and RocksDB performed under this scenario.

If you are curious I prepared my terraform+ansible deployment files here: https://github.com/vadimtk/terraform-ansible-percona

Before jumping to the results, I should note that for MyRocks I used LZ4 compression for all levels, which in its final size is 91GB. That is five times less than InnoDB size. This alone provides operational benefits—for example to copy InnoDB files (471GB) from a backup volume takes longer than 1 hour, while it is much faster (five times) for MyRocks.

The benchmark results

So let’s review the results.

InnoDB versus MyRocks throughput in the cloud

Or presenting average throughput in a tabular form:

cachesize IOPS engine avg TPS
27 5000 innodb 132.66
27 5000 rocksdb 481.03
27 10000 innodb 285.93
27 10000 rocksdb 1224.14
27 10000gp2 innodb 227.19
27 10000gp2 rocksdb 1268.89
27 15000 innodb 436.04
27 15000 rocksdb 1839.66
27 20000 innodb 584.69
27 20000 rocksdb 2336.94
27 30000 innodb 753.86
27 30000 rocksdb 2508.97
54 5000 innodb 197.51
54 5000 rocksdb 667.63
54 10000 innodb 433.99
54 10000 rocksdb 1600.01
54 10000gp2 innodb 326.12
54 10000gp2 rocksdb 1559.98
54 15000 innodb 661.34
54 15000 rocksdb 2176.83
54 20000 innodb 888.74
54 20000 rocksdb 2506.22
54 30000 innodb 1097.31
54 30000 rocksdb 2690.91

 

We can see that MyRocks outperformed InnoDB in every single combination, but it is also important to note the following:

MyRocks on io1 5000 IOPS showed the performance that InnoDB showed in io1 15000 IOPS.

That means that InnoDB requires three times more in storage throughput. If we take a look at the storage cost, it corresponds to three times more expensive storage. Given that MyRocks requires less storage, it is possible to save even more on storage capacity.

On the most economical storage (3400GB gp2, which will provide 10000 IOPS) MyRocks showed 4.7 times better throughput.

For the 30000 IOPS storage, MyRocks was still better by 2.45 times.

However it is worth noting that MyRocks showed a greater variance in throughput during the runs. Let’s review the charts with 1 sec resolution for GP2 and io1 30000 IOPS storage:Throughput 1 sec resolution for GP2 and io1 30000 IOPS storage MyROCKS versus InnoDB

Such variance might be problematic for workloads that require stable throughput and where periodical slowdowns are unacceptable.

Conclusion

MyRocks is suitable and beneficial not only for fast SSD, but also for cloud deployments. By requiring less IOPS, MyRocks can provide better performance and save on the storage costs.

However, before evaluating MyRocks, make sure that your workload is IO-bound i.e. the working set is much bigger than available memory. For CPU-intensive workloads (where the working set fits into memory), MyRocks will be less beneficial or even perform worse than InnoDB (as described in the blog post A Look at MyRocks Performance)

 

 

 

The post Saving With MyRocks in The Cloud appeared first on Percona Database Performance Blog.

Jul
27
2018
--

This Week in Data with Colin Charles 46: OSCON Recap, Google Site Reliability Workbook

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

OSCON happened last week and was incredible. It is true there was less of a database focus, and a lot more topics covered. In fact, you’d have found it hard to find database content. There was plenty of interesting content around AI/ML, cloud, SRE, blockchain and more. As a speaker, the 40-minute sessions that included a Q and A session was quite compact (I felt it was a little too short, and many speakers sped up towards the end). I guess it will make for more blog content.

The conference’s open source ethos is still extremely strong, and the keynotes exemplified that. It was not just the 20th anniversary of OSCON, but also the 20th anniversary of the Open Source Initiative (OSI). Percona is a sponsor (and I am an individual member). From a sponsor standpoint, Home Depot had a huge booth, but so did the NSA – who, along with giving out stickers, were actively recruiting. Readers might recall mention of NSA’s involvement with LemonGraph and the other data open source data projects from column 43.

Google just released The Site Reliability Workbook, which looks like the full PDF (“launch day edition”) of their new book. It includes practical ways to implement SRE. You can pre-order the book, and the official release date is August 4, 2018. This should be the best companion to Site Reliability Engineering: How Google Runs Production Systems, which I highly recommend reading first before getting to the workbook. After a quick perusal of the new release, I can say I like it — the case studies from Evernote and Home Depot, are all very interesting from a database standpoint (MySQL, Cloud SQL). Plenty of information is relevant if you’re a Prometheus user as well. I say skim the PDF, and devour the book!

Releases

Link List

Industry Updates

  • Elastic is on a spree of grabbing folk – Massimo Brignoli (ex-MongoDB, SkySQL, Oracle, and MySQL) joins as a Principal Solutions Architect, and Gerardo Narvaja joins as Sr. Solutions Architect for Pacific Northwest. He departs MariaDB Corporation, and has previously been at Tokutek, Pythian and MySQL.
  • Morgan Tocker (LinkedIn) has joined PingCAP as Senior Product & Community Manager. Previously he was both product manager and community manager at Oracle for MySQL, had a stint at Percona and also was at the original MySQL AB.
  • Baron Schwartz is now Chief Technology Officer of VividCortex, and Amena Ali has become the new CEO.

Upcoming Appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

The post This Week in Data with Colin Charles 46: OSCON Recap, Google Site Reliability Workbook appeared first on Percona Database Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com