Apr
17
2017
--

Experimental Build of MyRocks with Percona Server for MySQL

We have been working on bringing out a build of MyRocks with Percona Server for MySQL.

MyRocks is a RocksDB-based storage engine. You can find more information about MyRocks here.

While there is still a lot of work to do, I want to share an experimental build of Percona Server for MySQL with MyRocks, which you can use to evaluate and test this engine.

(WARNING: in NO WAY is this build supposed to be for production usage! Consider this ALPHA quality.)

The tar.gz binaries are available from our TESTING area. To start Percona Server for MySQL with the MyRocks engine, use the following line in my.cnf:

plugin-load=rocksdb=ha_rocksdb.so;rocksdb_cfstats=ha_rocksdb.so;rocksdb_dbstats=ha_rocksdb.so;rocksdb_perf_context=ha_rocksdb.so;rocksdb_perf_context_global=ha_rocksdb.so;rocksdb_cf_options=ha_rocksdb.so;rocksdb_compaction_stats=ha_rocksdb.so;rocksdb_global_info=ha_rocksdb.so;rocksdb_ddl=ha_rocksdb.so;rocksdb_index_file_map=ha_rocksdb.so;rocksdb_locks=ha_rocksdb.so;rocksdb_trx=ha_rocksdb.so
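
After the server starts, a quick (optional) sanity check is to confirm that the engine and its information schema plugins registered:

mysql> SHOW ENGINES;   -- ROCKSDB should appear in the list
mysql> SHOW PLUGINS;   -- the rocksdb_* information schema plugins should be listed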

Later we will provide experimental RPM and DEB packages from our testing repositories. Please let us know how MyRocks is working for you!

Apr
03
2017
--

New MariaDB Dashboard in Percona Monitoring and Management Metrics Monitor


In honor of the upcoming MariaDB M17 conference in New York City on April 11-12, we have enhanced Percona Monitoring and Management (PMM) Metrics Monitor with a new MariaDB Dashboard and multiple new graphs!

The Percona Monitoring and Management MariaDB Dashboard builds on the efforts of the MariaDB development team to instrument the Aria Storage Engine Status Variables related to Aria Pagecache and Aria Transaction Log activity, the tracking of Index Condition Pushdown (ICP), InnoDB Online DDL when using ALTER TABLE ... ALGORITHM=INPLACE, InnoDB Deadlocks Detected, and finally InnoDB Defragmentation. This new dashboard is available in Percona Monitoring and Management release 1.1.2. Download it now using our docker, VirtualBox or Amazon AMI installation options!

Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL®, MariaDB® and MongoDB® performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL, MariaDB® and MongoDB servers to ensure that your data works as efficiently as possible.

Aria Pagecache Reads/Writes

MariaDB 5.1 introduced the Aria Storage Engine, which is MariaDB’s MyISAM replacement. Originally known as the Maria storage engine, it was renamed in late 2010 to avoid confusion with the overall MariaDB project name. The Aria Pagecache Status Variables graph plots the count of disk block reads and writes, which occur when the data isn’t already in the Aria Pagecache. We also plot the reads and writes from the Aria Pagecache, which count the reads/writes that did not incur a disk lookup (as the data was previously fetched and available from the Aria Pagecache):

MariaDB - Aria Pagecache Reads/Writes

Aria Pagecache Blocks

Aria reads and writes to the pagecache in order to cache data in RAM and avoid or delay activity related to disk. Overall, this translates into faster database query response times:

  • Aria_pagecache_blocks_not_flushed: The number of dirty blocks in the Aria pagecache.
  • Aria_pagecache_blocks_unused: Free blocks in the Aria pagecache.
  • Aria_pagecache_blocks_used: Blocks used in the Aria pagecache.

Aria Pagecache Total Blocks is calculated using Aria System Variables and the following formula: aria_pagecache_buffer_size / aria_block_size

MariaDB - Aria Pagecache Blocks
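
If you want to check the same numbers directly from the MariaDB client, a quick sketch using the same system and status variables:

mysql> SELECT @@aria_pagecache_buffer_size / @@aria_block_size AS aria_pagecache_total_blocks;
mysql> SHOW GLOBAL STATUS LIKE 'Aria_pagecache_blocks%';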

Aria Transaction Log Syncs

As Aria strives to be a fully ACID- and MVCC-compliant storage engine, an important factor is support for transactions. A transaction is the unit of work in a database that defines how to implement the four properties of Atomicity, Consistency, Isolation, and Durability (ACID). This graph tracks the rate at which Aria fsyncs the Aria Transaction Log to disk. You can think of this as the “write penalty” for running a transactional storage engine:

MariaDB - Aria Transaction Log Syncs

InnoDB Online DDL

MySQL 5.6 introduced the concept of an in-place DDL operation via ALTER TABLE ... ALGORITHM=INPLACE, which in some cases avoids performing a table copy and thus doesn’t block INSERT/UPDATE/DELETE. MariaDB implemented three measures to track ongoing InnoDB Online DDL operations, which we plot via the following three status variables (a quick way to poll them is sketched below):

  • Innodb_onlineddl_pct_progress: Shows the progress of the in-place alter table. It might not be accurate, as in-place alter is highly dependent on the disk and buffer pool status
  • Innodb_onlineddl_rowlog_pct_used: Shows row log buffer usage in 5-digit integers (10000 means 100.00%)
  • Innodb_onlineddl_rowlog_rows: Number of rows stored in the row log buffer

MariaDB - InnoDB Online DDL

For more information, please see the MariaDB blog post Monitoring progress and temporal memory usage of Online DDL in InnoDB.
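
To watch these counters while an in-place ALTER TABLE runs, you can poll them from a second session; a minimal sketch (table and column names are made up for illustration):

-- session 1: a long-running in-place alter on a hypothetical table
ALTER TABLE big_table ADD INDEX idx_col1 (col1), ALGORITHM=INPLACE;

-- session 2: poll progress while the ALTER runs
SHOW GLOBAL STATUS LIKE 'Innodb_onlineddl%';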

InnoDB Defragmentation

MariaDB merged the Facebook/Kakao defragmentation patch for defragmenting InnoDB tablespaces into their 10.1 release. Your MariaDB instance needs to have been started with innodb_defragment=1, and your tables need to use innodb_file_per_table=1 for this to work. We plot the following three status variables (a small usage sketch follows below):

  • Innodb_defragment_compression_failures: Number of defragment re-compression failures
  • Innodb_defragment_failures: Number of defragment failures
  • Innodb_defragment_count: Number of defragment operations

MariaDB - InnoDB Defragmentation
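
Assuming the server was started with the options above, defragmentation is triggered with OPTIMIZE TABLE, and the counters can be read back afterwards (the table name is hypothetical):

-- with innodb_defragment=1, OPTIMIZE TABLE defragments in place instead of rebuilding the table
OPTIMIZE TABLE mydb.mytable;
SHOW GLOBAL STATUS LIKE 'Innodb_defragment%';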

Index Condition Pushdown

Oracle introduced this in MySQL 5.6. From the manual:

Index Condition Pushdown (ICP) is an optimization for the case where MySQL retrieves rows from a table using an index. Without ICP, the storage engine traverses the index to locate rows in the base table and returns them to the MySQL server which evaluates the WHERE condition for the rows. With ICP enabled, and if parts of the WHERE condition can be evaluated by using only columns from the index, the MySQL server pushes this part of the WHERE condition down to the storage engine. The storage engine then evaluates the pushed index condition by using the index entry and only if this is satisfied is the row read from the table. ICP can reduce the number of times the storage engine must access the base table and the number of times the MySQL server must access the storage engine.

Essentially, the closer that ICP Attempts are to ICP Matches, the better!

MariaDB - Index Condition Pushdown (ICP)
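
In MariaDB, the counters behind this graph are the Handler_icp_attempts and Handler_icp_match status variables (names as I recall them; verify with SHOW STATUS on your build). A quick way to eyeball them, and the match ratio the closing question of this post asks about:

SHOW GLOBAL STATUS LIKE 'Handler_icp%';

SELECT m.VARIABLE_VALUE / a.VARIABLE_VALUE AS icp_match_ratio
  FROM information_schema.GLOBAL_STATUS m, information_schema.GLOBAL_STATUS a
 WHERE m.VARIABLE_NAME = 'Handler_icp_match'
   AND a.VARIABLE_NAME = 'Handler_icp_attempts';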

InnoDB Deadlocks Detected (MariaDB 10.1 Only)

Ever since MySQL implemented a transactional storage engine there have been deadlocks. Deadlocks are conditions where different transactions are unable to proceed because each holds a lock that the other needs. In MariaDB 10.1, there is a status variable that counts the occurrences of deadlocks since server startup. Previously, you had to instrument your application to get an accurate count of deadlocks, because otherwise you could miss occurrences if your polling interval wasn’t configured frequently enough (even using pt-deadlock-logger). Unfortunately, this status variable doesn’t appear to be present in the MariaDB 10.2.4 build I tested:

MariaDB - InnoDB Deadlocks Detected
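
On MariaDB 10.1 (and Percona Server/XtraDB), the counter can be polled directly, assuming the status variable is named Innodb_deadlocks on your build:

SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks';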

Again, please download Percona Monitoring and Management 1.1.2 to take advantage of the new MariaDB Dashboard and new graphs!  For installation instructions, see the Deployment Guide.

You can see the MariaDB Dashboard and new graphs in action at the PMM Demo site. If you feel the graphs need any tweaking or if I’ve missed anything, leave a note on the blog. You can also write me directly (I look forward to your comments): michael.coburn@percona.com.

To start: on the ICP graph, should we have a line that defines the percentage of successful ICP matches vs. attempts?

Mar
08
2017
--

Migrating MongoDB Away from MMAPv1

This is another post in the series of blogs on the Percona Server for MongoDB 3.4 bundle release. In this blog post, we’ll discuss moving away from the MMAPv1 storage engine.

Introduction

With the MongoDB v3.0 release in February of 2015, the long-awaited ability to choose storage engines became a reality. As of version 3.0, you could choose between two engines in MongoDB Community Server and, if you use Percona Server for MongoDB, you could choose from four. Here’s a table for easy consumption:

Storage Engine   Percona Server for MongoDB   MongoDB Community Server   MongoDB Enterprise Server (licensed)
MMAPv1           Yes                          Yes                        Yes
WiredTiger       Yes                          Yes                        Yes
MongoRocks       Yes                          No                         No
In-memory        Yes                          No                         Yes
Encrypted        No                           No                         Yes

Why change engines?

With increased possibilities comes increased difficulty in the decision-making process (a concept that gets reinforced every time I take my mother out to a restaurant with a large menu – ordering is never quick). In all seriousness, the introduction of the storage engine API to MongoDB is possibly the single greatest feature MongoDB, Inc. has released to date.

One of the biggest gripes from the pre-v3.0 days was MongoDB’s lack of scale. This was mostly due to the MMAPv1 storage engine, which suffered from a very primitive locking scheme. If you would like an illustration of the problem, think of the world’s biggest supermarket with one checkout line – you might be able to fit in lots of shoppers, but they’re not going to accomplish their goal quickly. So, the ability to increase performance and concurrency with a simple switch is huge! Additionally, modern storage engines support compression, which should reduce your space utilization when switching by at least 50%.

All the way up to MongoDB v3.2, the default storage engine was MMAPv1. If you didn’t make a conscious decision about what storage engine to choose when you started using MongoDB, there is a good chance that MMAPv1 is what you’re on. If you’d like to find out for sure what engine you’re using, simply run the command below. The output will be the name of the storage engine. As you can see, I was running the MMAPv1 storage engine on this machine. Now that we understand where we’re at, let’s get into where we can be in the future.

db.serverStatus().storageEngine.name
mmapv1

Public Service Announcement

Before we get into what storage engine(s) to evaluate, we need to talk about testing. In my experience, a majority of the MySQL and MongoDB community members are rolling out changes to production without planning or testing. If you’re in the same boat, you’re in very good company (or at least in a great deal of company). However, you should stop this practice. It’s basic “sample size” in statistics – when engaged in risk-laden behavior, the optimal time to stop increasing the sample size is prior to the probability of failure reaching “1”. In other words, start your testing and planning process today!

At Percona, we recommend that you thoroughly test any database changes in a testing or development environment before you decide to roll them into production. Additionally, prior to rolling the changes into production (with a well thought out plan, of course), you’ll need to have a roll-back plan in case of unintended consequences. Luckily, with MongoDB’s built-in replication and election protocols, both are fairly easy. The key here is to plan. This is doubly true if you are undertaking a major version upgrade, or are jumping over major versions. With major version upgrades (or version jumps) comes the increased likelihood of a change in database behavior as it relates to your application’s response time (or even stability).

What should I think about?

In the table above, we listed the pre-packaged storage engine options that are available to us and other distributions. We also took a look at why you should consider moving off of MMAPv1 in the preceding section. To be clear, in my opinion a vast majority of MongoDB users that are on MMAPv1 can benefit from a switch. Which engine to switch to is the pressing question. Your first decision should be to evaluate whether or not your workload fits into the sweet spot for MMAPv1 by reading the section below. If that section doesn’t describe your application, then the additional sections should help you narrow down your choices.

Now, let’s take a look at what workloads match up with what storage engines.

MMAPv1

Believe it or not, there are some use cases where MMAPv1 is likely to give you performance as good as (or better than) any other engine. If you’re not worried about the size of your database on disk, then you may not want to bother changing engines. Users that are likely to see no benefit from changing have read-heavy (or 100% read) applications. Also, certain update-heavy use cases, where you’re updating small amounts of data or performing $set operations, are likely to be faster on MMAPv1.

WiredTiger

WiredTiger is the new default storage engine for MongoDB. It is a good option for general workloads that are currently running on MMAPv1. WiredTiger will give you good performance for most workloads and will reduce your storage utilization with compression that’s enabled by default. If you have a write-heavy workload, or are approaching high I/O utilization (>55%) with plans for it to rise, then you might benefit from a migration to WiredTiger.

MongoRocks (RocksDB from Facebook)

This is Facebook’s baby, which was forged in the fires of the former Parse business unit. MongoRocks, which uses LSM indexing, is advertised as “designed to work with fast storage.” Don’t let this claim fool you. For workloads that are heavy on writes, highly concurrent or approaching disk bound, MongoRocks could give you great benefits. In terms of compression, MongoRocks has the ability to efficiently handle deeper compression algorithms, which should further decrease your storage requirements.

In-Memory

The in-memory engine, whether we’re speaking about the MongoDB or Percona implementation, should be used for workloads where extreme low latency is the most important requirement. The types of applications that I’m talking about are usually low-latency, “real-time” apps — like decision making or user session tracking. The in-memory engine is not persistent, so it operates strictly out of the cache allocated to MongoDB. Consequently, the data may (and likely will) be lost if the server crashes.

Encrypted

This is for applications in highly secure environments where on-disk encryption is necessary for compliance. This engine will protect the MongoDB data in the case that a disk or server is stolen. On the flip side, this engine will not protect you from a hacker that has access to the server (MongoDB shell), or can intercept your application traffic. Another way to achieve the same level of encryption for compliance is using volume level encryption like LUKS. An additional benefit of volume level encryption, since it works outside the database, is re-use on all compliant servers (not just MongoDB).

Getting to your new engine

Switching to the new engine is actually pretty easy, especially if you’re running a replica set. One important caveat is that unlike MySQL, the storage engine can only be defined per mongod process (not per database or collection). This means that it’s an all or nothing operation on a single MongoDB process. You’ll need to reload your data on that server. This is necessary because the data files from one engine are not compatible with another engine. Thus reloading the data to transform from one engine format to another is necessary. Here are the high-level steps (assuming you’re running a replica set):

  1. Make sure you’re not in your production environment
  2. Backup your data (it can’t hurt)
  3. Remove a replica set member
  4. Rename (or delete) the old data directory. The member will re-sync with the replica set
    • Make sure you have enough disk space if you’re going to keep a copy of the old data directory
  5. Update the mongo.conf file to use a new storage engine. Here’s an example for RocksDB from our documentation:
    storage:
     engine: rocksdb
     rocksdb:
       cacheSizeGB: 4
       compression: snappy
  6. Start the MongoDB process again
  7. Join the member to the replica set (initial sync will happen)
  8. When the updated member is all caught up, pick another member and repeat the process.
  9. Continue until the primary is the only server left. At this point, you should step down the primary, but hold off switching storage engines until you are certain that the new storage engine meets your needs.
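
For steps 3 and 7, the standard mongo shell replica set helpers do the job; a minimal sketch (host names and ports are examples only):

// on the primary: remove the member you are about to rebuild
rs.remove("mongo2.example.net:27017")
// ...rename the data directory, switch the engine in mongod.conf, restart mongod...
// then add the member back; it will perform an initial sync in the new engine's on-disk format
rs.add("mongo2.example.net:27017")
// watch initial sync / replication state
rs.status()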

The Wrap Up

At this point I’ve explained how you can understand your options, where you can gain additional performance and what engines to evaluate. Please don’t forget to test your application with the new setup before launching into production. Please drop a comment below if you found this helpful or, on the other hand, if there’s something that would make it more useful to you. Chances are, if you’d find something helpful, the rest of the community will as well.

Aug
01
2016
--

Introduction into storage engine troubleshooting: Q & A

In this blog, I will provide answers to the Q & A for the “Introduction into storage engine troubleshooting” webinar.

First, I want to thank everybody for attending the July 14 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: At which isolation level do pt-online-schema-change and pt-archive copy data from a table?

A: Both tools do not change the server’s default transaction isolation level. Use either REPEATABLE READ or set it in my.cnf.

Q: Can I create an index to optimize a query which has group by A and order by B, both from different tables and A column is from the first table in the two table join?

A: Do you mean a query like SELECT ... FROM a, b GROUP BY a.A ORDER BY b.B? Yes, this is possible:

mysql> explain select A, B, count(*) from a join b on(a.A=b.id) WHERE b.B < 4 GROUP BY a.A, b.B ORDER BY b.B ASC;
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
|  1 | SIMPLE      | b     | range | PRIMARY,B     | B    | 5       | NULL      |   15 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | ref   | A             | A    | 5       | test.b.id |    1 | Using index                                               |
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
2 rows in set (0.00 sec)

Q: Where can I find recommendations on what kind of engine to use for different application types or use cases?

A: Storage engines are always being actively developed, therefore I suggest that you don’t search for generic recommendations. These can be outdated just a few weeks after they are written. Study engines instead. For example, just a few years ago MyISAM was the only engine (among those officially supported) that could work with FULLTEXT indexes and SPATIAL columns. Now InnoDB supports both: FULLTEXT indexes since version 5.6 and GIS features in 5.7. Today I can recommend InnoDB as a general-purpose engine for all installations, and TokuDB for write-heavy workloads when you cannot use high-speed disks.
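
As an aside, the InnoDB FULLTEXT support mentioned above works out of the box on MySQL 5.6 and later; a quick illustration (table and data are made up):

CREATE TABLE articles (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(200),
  body TEXT,
  FULLTEXT KEY ft_title_body (title, body)
) ENGINE=InnoDB;

INSERT INTO articles (title, body) VALUES ('Engines', 'Comparing InnoDB and TokuDB storage engines');

SELECT id, title
  FROM articles
 WHERE MATCH(title, body) AGAINST ('storage engines' IN NATURAL LANGUAGE MODE);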

Alternative storage engines can help to realize specific business needs. For example, CONNECT brings data to your server from many sources, SphinxSE talks to the Sphinx daemon, etc.

Other alternative storage engines increase the speed of certain workloads. Memory, for example, can be a good fit for temporary tables.

Q: Can you please explain how we find the full text of the query when we query the view ‘statements_with_full_table_scans’?

A: Do you mean the view in the sys schema? Sys schema views take information from the summary_* tables in Performance Schema and digest it, therefore they do not contain full queries (only digests). The full text of the query can be found in the events_statements_* tables in Performance Schema. Note that even the events_statements_history_long table can be rewritten very quickly, and you may want to save data from it periodically.
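
As an illustration, the full statement text for a given digest can be pulled out of Performance Schema like this (the digest value is a placeholder):

SELECT THREAD_ID, EVENT_ID, SQL_TEXT
  FROM performance_schema.events_statements_history_long
 WHERE DIGEST = 'your-digest-here';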

Q: Hi, is TokuDB for the new document protocol?

A: As Alex Rubin showed in his detailed blog post, the new document protocol just converts NoSQL queries into SQL, and is thus not limited to any storage engine. To use documents and collections, a storage engine must support generated columns (which TokuDB currently does not). So support of the X Protocol for TokuDB is limited to relational table access.

Q: Please comment on “read committed” versus “repeatable read.”
Q: Repeatable read holds the cursor on the result set for the client versus read committed where the cursor is updated after a transaction.

A: READ COMMITTED and REPEATABLE READ are transaction isolation levels, whose details are explained here.
I would not correlate locks set on table rows in different transaction isolation modes with the result set. A transaction with isolation level REPEATABLE READ instead creates a snapshot of rows that are accessed by the transaction. Let’s consider a table:

mysql> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.56 sec)
mysql> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.03 sec)
Records: 9  Duplicates: 0  Warnings: 0

Then start the transaction and select a few rows from this table:

mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+------+
4 rows in set (0.04 sec)

Now let’s update another set of rows in another transaction:

mysql2> update ti set f1 = id*2 where id > 5;
Query OK, 4 rows affected (0.06 sec)
Rows matched: 4  Changed: 4  Warnings: 0
mysql2> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
+----+------+
9 rows in set (0.00 sec)

You see that the first four rows – which we accessed in the first transaction – were not modified, and the last four were modified. If InnoDB only saved the cursor (as someone answered above), we would expect to see the same result if we ran the SELECT * ... query in our old transaction, but it actually shows the whole table content as it was before the modification:

mysql1> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |    6 |
|  7 |    7 |
|  8 |    8 |
|  9 |    9 |
+----+------+
9 rows in set (0.00 sec)

So “snapshot” is a better word than “cursor” for the result set. In the case of READ COMMITTED, the first transaction would see the modified rows:

mysql1> drop table ti;
Query OK, 0 rows affected (0.11 sec)
mysql1> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.38 sec)
mysql1> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.04 sec)
Records: 9  Duplicates: 0  Warnings: 0
mysql1> set transaction isolation level read committed;
Query OK, 0 rows affected (0.00 sec)
mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+------+
4 rows in set (0.00 sec)

Let’s update all rows in the table this time:

mysql2> update ti set f1 = id*2;
Query OK, 9 rows affected (0.04 sec)
Rows matched: 9  Changed: 9  Warnings: 0

Now the first transaction sees not only the modified rows with id >= 5 (not in the initial result set), but also the modified rows with id < 5 (which existed in the initial result set):

mysql1> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
+----+------+
9 rows in set (0.00 sec)

Jan
27
2016
--

MongoDB revs you up: What storage engine is right for you? (Part 4)

Differentiating Between MongoDB Storage Engines: PerconaFT

In this series of posts, we discussed what a storage engine is, and how you can determine the characteristics of one versus the other:

“A database storage engine is the underlying software that a DBMS uses to create, read, update and delete data from a database. The storage engine should be thought of as a “bolt on” to the database (server daemon), which controls the database’s interaction with memory and storage subsystems.”

Generally speaking, it’s important to understand what type of work environment the database is going to interact with, and to select a storage engine that is tailored to that environment.

The first post looked at MMAPv1, the original default engine for MongoDB (through release 3.0). The second post examined WiredTiger, the new default MongoDB engine. The third post reviewed RocksDB, an engine developed for the Facebook environment.

This post will cover PerconaFT. PerconaFT was developed out of Percona’s acquisition of Tokutek, from their TokuDB product.

PerconaFT

Find it in: Percona Builds

PerconaFT is the newest version of the Fractal Tree storage engine that was designed and implemented by Tokutek, which was acquired by Percona in April of 2015. Designed at MIT, SUNY Stony Brook and Rutgers, the Fractal Tree is a data structure that aimed to remove disk bottlenecks from databases that were using the B-tree with datasets that were several times larger than cache.

PerconaFT is arguably the most “mature” storage engine for MongoDB, with support for document level concurrency and compression. The Fractal Tree was first commercially implemented in June of 2013 in TokuMX, a fork of MongoDB, with an advanced feature set.

As described previously, the Fractal Tree (which is available for MongoDB in the PerconaFT storage engine) is a write-optimized data structure utilizing many log-like “queues” called message buffers, but has an arrangement like that of a read-optimized data structure. With the combination of these properties, PerconaFT can provide high performance for applications with high insert rates, while providing very efficient lookups for update/query-based applications. This will theoretically provide very predictable and consistent performance as the database grows. Furthermore, PerconaFT typically provides, comparatively, the deepest compression rates of any of the engines we’ve discussed in this series.

An ideal fit for the PerconaFT storage engine is a system with varied workloads, where predictable vertical scaling is required in addition to the horizontal scaling provided by MongoDB. Furthermore, the ability of PerconaFT to maintain performance while compressing – along with support for multiple compression algorithms (snappy, quicklz, zlib and lzma) – makes it one of the best options for users looking to optimize their data footprint.
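
If you want to try it, selecting the engine in Percona Server for MongoDB follows the same pattern as the other pluggable engines; a minimal mongod.conf sketch (engine name as used in Percona Server for MongoDB 3.0/3.2 builds – check the documentation for your version):

storage:
  engine: PerconaFT
  dbPath: /var/lib/mongodb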

Conclusion

Most people don’t know that they have a choice when it comes to storage engines, and that the choice should be based on what the database workload will look like. Percona’s Vadim Tkachenko performed an excellent benchmark test comparing the performance of PerconaFT and WiredTiger to help specifically differentiate between these engines.

Part 1: Intro and the MMAPv1 storage engine.

Part 2: WiredTiger storage engine.

Part 3: RocksDB storage engine.

May
23
2014
--

How to improve InnoDB performance by 55% for write-bound loads

During April’s Percona Live MySQL Conference and Expo 2014, I attended a talk on MySQL 5.7 performance and scalability given by Dimitri Kravtchuk, the Oracle MySQL benchmark specialist. He mentioned at some point that the InnoDB double write buffer was a real performance killer. For those who don’t know what the InnoDB double write buffer is, it is a disk buffer where pages are written before being written to the actual data file. Upon restart, pages in the double write buffer are rewritten to their data files if complete. This is to avoid data file corruption with half-written pages. I knew it had an impact on performance (on ZFS, since it is transactional, I always disable it), but I never realized how important the performance impact could be. Back from PLMCE, a friend had dropped off a Dell R320 server at my home, asking me to set up the OS and test it. What better way to test a new server than to run benchmarks on it, so here we go!

ZFS is not the only transactional filesystem; ext4, with the option “data=journal”, can also be transactional. So, the question is: is it better to have the InnoDB double write buffer enabled, or to use the ext4 transaction log? Also, if the latter is better, how does it compare with xfs, the filesystem I used to propose but which does not support transactions?

Methodology

The goal is to stress the double write buffer, so the load has to be write intensive. The server has a simple mirror of two 7.2k rpm drives. There is no controller write cache and the drives’ write caches are disabled. I decided to use the Percona tpcc-mysql benchmark tool with 200 warehouses; the total dataset size was around 18G, fitting entirely within the InnoDB buffer pool (the server has 24GB). Here’s the relevant part of my.cnf:

innodb_read_io_threads=4
innodb_write_io_threads=8  #To stress the double write buffer
innodb_buffer_pool_size=20G
innodb_buffer_pool_load_at_startup=ON
innodb_log_file_size = 32M #Small log files, more page flush
innodb_log_files_in_group=2
innodb_file_per_table=1
innodb_log_buffer_size=8M
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
skip-innodb_doublewrite  #commented or not depending on test

So, I generated the dataset for 200 warehouses, added the keys but not the foreign key constraints, loaded all that into the buffer pool with a few queries and dumped the buffer pool. Then, with MySQL stopped, I did a file level backup to a different partition. I used the MySQL 5.6.16 version that comes with Ubuntu 14.04; at the time Percona Server was not available for 14.04. Each benchmark followed this procedure:

  1. Stop mysql
  2. umount /var/lib/mysql
  3. comment or uncomment skip-innodb_doublewrite in my.cnf
  4. mount /var/lib/mysql with specific options
  5. copy the reference backup to /var/lib/mysql
  6. Start mysql and wait for the buffer pool load to complete
  7. start tpcc from another server
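
For reference, step 4 probably looked something like the commands below for the ext4 data=journal and xfs runs; this is a sketch based on the options in the results table, not the exact commands used (the device path is only an example):

mount -o rw,data=journal /dev/sdb1 /var/lib/mysql    # ext4trx_nodw run
mount -o rw,noatime /dev/sdb1 /var/lib/mysql         # xfs_dw run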

The tpcc_start command I used is the following:

./tpcc_start -h10.2.2.247 -P3306 -dtpcc -utpcc -ptpcc -w200 -c32 -r300 -l3600 -i60

I used 32 connections, let the tool run for 300s of warm up, enough to reach a steady level of dirty pages, and then, I let the benchmark run for one hour, reporting results every minute.

Results

Test                 Double write buffer   File system options    Average NOTPM over 1h
ext4_dw              Yes                   rw                     690
ext4_dionolock_dw    Yes                   rw,dioread_nolock      668
ext4_nodw            No                    rw                     1107
ext4trx_nodw         No                    rw,data=journal        1066
xfs_dw               Yes                   xfs rw,noatime         754

 

So, from the above table, the first test I did was the common ext4 with the InnoDB double write buffer enabled, and it yielded 690 new order transactions per minute (NOTPM). Reading the ext4 documentation, I also wanted to try the “dioread_nolock” setting, which is supposed to reduce mutex contention, and this time I got slightly less, 668 NOTPM. The difference is within the measurement error and isn’t significant. Removing the InnoDB double write buffer, although unsafe, boosted the throughput to 1107 NOTPM, a 60% increase! Wow, indeed the double write buffer has a huge impact. But what is the impact of asking the file system to replace the InnoDB double write buffer? Surprisingly, the performance level is only slightly lower at 1066 NOTPM, and vmstat did report twice the amount of writes. I needed to redo the tests a few times to convince myself. Getting a 55% increase in performance with the same hardware is not common, except when some trivial configuration errors are made. Finally, I used to propose xfs with the InnoDB double write buffer enabled to customers; that result is about 10% higher than ext4 with the InnoDB double write buffer, close to what I was expecting. The graphic below presents the numbers in a more visual form.

TPCC NOTPM for various configurations

In terms of performance stability, you’ll find below a graphic of the per-minute NOTPM output for three of the tests: ext4 non-transactional with the double write buffer, ext4 transactional without the double write buffer, and xfs with the double write buffer. The dispersion is qualitatively similar for all three. The values presented above are just the averages of those data sets.

TPCC NOTPM evolution over time

Safety

Innodb data corruption is not fun and removing the innodb double write buffer is a bit scary. In order to be sure it is safe, I executed the following procedure ten times:

  1. Start mysql and wait for recovery and for the buffer pool load to complete
  2. Check the error log for no corruption
  3. start tpcc from another server
  4. After about 10 minutes, physically unplug the server
  5. Plug back and restart the server

I observed no corruption. I was still a bit preoccupied: what if the test is wrong? I removed the “data=journal” mount option and did a new run. I got corruption the first time. So, given the procedure I followed and the number of crash tests, I think it is reasonable to assume it is safe to replace the InnoDB double write buffer with the ext4 transactional journal.

I also looked at the kernel ext4 sources and changelog. Up to recently, before kernel 3.2, O_DIRECT wasn’t supported with data=journal, and MySQL would have issued a warning in the error log. Now, with recent kernels and data=journal, O_DIRECT is mapped to O_DSYNC (O_DIRECT itself is faked), which is exactly what is needed. Indeed, I tried “innodb_flush_method = O_DSYNC” and found the same results. With older kernels I strongly advise using the “innodb_flush_method = O_DSYNC” setting to make sure files are opened in a way that will cause them to be transactional for ext4. As always, test thoroughly; I only tested on Ubuntu 14.04.

Impacts on MyISAM

Since we are no longer really using O_DIRECT, even if it is set in my.cnf, the OS file cache will be used for InnoDB data. If the database is only using InnoDB, that’s not a big deal. But if MyISAM is significantly used, that may cause performance issues, since MyISAM relies on the OS file cache, so be warned.

Fast SSDs

If you have an SSD setup that doesn’t offer a transactional file system like the FusionIO directFS, a very interesting setup would be to mix spinning drives and SSDs. For example, let’s suppose we have a mirror of spinning drives handled by a raid controller with a write cache (and a BBU) and SSD storage on a PCIe card. To reduce the write load to the SSD, we could send the file system journal to the spinning drives using the “journal_path=path” or “journal_dev=devnum” options of ext4. The raid controller write cache would do an awesome job at merging the write operations for the file system journal, and the amount of write operations going to the SSD would be cut by half. I don’t have access to such a setup but it seems very promising performance wise.
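
As a rough sketch of how the external journal could be wired up with standard ext4 tooling (device names are examples, and I haven’t benchmarked this exact layout):

# create a dedicated external journal on the spinning-drive array
mke2fs -O journal_dev /dev/sdb1
# build the data filesystem on the SSD, pointing its journal at the spinning drives
mkfs.ext4 -J device=/dev/sdb1 /dev/fioa1
mount -o data=journal /dev/fioa1 /var/lib/mysql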

Conclusion

Like ZFS, ext4 can be transactional, and replacing the InnoDB double write buffer with the file system transaction journal yields a 55% increase in performance for write-intensive workloads. Performance gains are also expected for SSD and mixed spinning/SSD configurations.


Dec
16
2011
--

Setting up XFS on Hardware RAID — the simple edition

There are about a gazillion FAQs and HOWTOs out there that talk about XFS configuration, RAID IO alignment, and mount point options.  I wanted to try to put some of that information together in a condensed and simplified format that will work for the majority of use cases.  This is not meant to cover every single tuning option, but rather to cover the important bases in a simple and easy to understand way.

Let’s say you have a server with standard hardware RAID setup running conventional HDDs.

RAID setup

For the sake of simplicity, you create one single RAID logical volume that covers all your available drives.  This is the easiest setup to configure and maintain and is the best choice for operability in the majority of normal configurations.  Are there ways to squeeze more performance out of a server by dividing the logical volumes?  Perhaps, but it requires a lot of fiddling and custom tuning to accomplish.

There are plenty of other posts out there that discuss RAID minutia.  Make sure you cover the following:

  • RAID type (usually 5 or 1+0)
  • RAID stripe size
  • BBU enabled with Write-back cache only
  • No read cache or read-ahead
  • No drive write cache enabled

Partitioning

You want to run only MySQL on this box, and you want to ensure your MySQL datadir is separated from the OS in case you ever want to upgrade the OS, but otherwise keep it simple.  My suggestion?  Plan on allocating partitions roughly as follows, based on your available drive space and keeping in mind future growth.

  • 8-16G for Swap
  • 10-20G for the OS (/)
  • Possibly 10G+ for /tmp  (note you could also point mysql’s tmpdir elsewhere)
  • Everything else for MySQL (/mnt/data or similar):  (sym-link /var/lib/mysql into here when you setup mysql)

Are there alternatives?  Yes.  Can you have separate partitions for Innodb log volumes, etc.?  Sure.  Is it worth doing much more than this most of the time?  I’d argue not until you’re sure you are I/O bound and need to squeeze every last ounce of performance from the box.  Fiddling with how to allocate drives and drive space from partition to partition is a lot of operational work which should be spent only when needed.

Aligning the Partitions

Once you have the partitions, it could look something like this:
#fdisk -ul

Disk /dev/sda: 438.5 GB, 438489317376 bytes
255 heads, 63 sectors/track, 53309 cylinders, total 856424448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00051fe9

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     7813119     3905536   82  Linux swap / Solaris
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     7813120    27344895     9765888   83  Linux
/dev/sda3        27344896   856422399   414538752   83  Linux

Several months ago my colleague Aurimas posted two excellent blogs on both the theory of Aligning IO on hardware RAID and some good benchmarks to emphasize the point; go read those if you need the theory here.  Is it common on modern Linux systems for this to be off?  Maybe not, but here’s how you check.

We want to use mysql on /dev/sda3, but how can we ensure that it is aligned with the RAID stripes?  It takes a small amount of math:
  • Start with your RAID stripe size.  Let’s use 64k which is a common default.  In this case 64K = 2^16 = 65536 bytes.
  • Get your sector size from fdisk.  In this case 512 bytes.
  • Calculate how many sectors fit in a RAID stripe.   65536 / 512 = 128 sectors per stripe.
  • Get start boundary of our mysql partition from fdisk: 27344896.
  • See if the Start boundary for our mysql partition falls on a stripe boundary by dividing the start sector of the partition by the sectors per stripe:  27344896 / 128 = 213632.  This is a whole number, so we are good.  If it had a remainder, then our partition would not start on a RAID stripe boundary.
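
If you want to script this check, here is a small sketch using the numbers from this example (the device, partition and 64k stripe size are assumptions you should adjust):

START=$(fdisk -lu /dev/sda | awk '/^\/dev\/sda3/ {print $2}')
echo $(( START % 128 ))   # prints 0 when the partition starts on a RAID stripe boundary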

Create the Filesystem

XFS requires a little massaging (or a lot).  For a standard server, it’s fairly simple.  We need to know two things:

  • RAID stripe size
  • Number of unique, utilized disks in the RAID.  This turns out to be the same as the size formulas I gave above:
    • RAID 1+0:  is a set of mirrored drives, so the number here is num drives / 2.
    • RAID 5: is striped drives plus one full drive of parity, so the number here is num drives – 1.

In our case, it is a RAID 1+0 64k stripe with 8 drives.  Since those drives each have a mirror, there are really 4 sets of unique drives that are striped over the top.  Using these numbers, we set the ‘su’ and ‘sw’ options in mkfs.xfs with those two values respectively:
# mkfs.xfs -d su=64k,sw=4 /dev/sda3
meta-data=/dev/sda3              isize=256    agcount=4, agsize=25908656 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=103634624, imaxpct=25
         =                       sunit=16     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=50608, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The XFS FAQ is a good place to check out for more details.

Mount the filesystem

Again, there are many options to use here, but let’s use some simple ones:

/var/lib/mysql           xfs     nobarrier,noatime,nodiratime

Setting the IO scheduler

This is a commonly missed step related to getting the IO setup properly.  The best choices here are between ‘deadline’ and ‘noop’.   Deadline is an active scheduler, and noop simply means IO will be handled without rescheduling.  Which is best is workload dependent, but in the simple case you would be well-served by either.  Two steps here:

echo noop > /sys/block/sda/queue/scheduler   # update the scheduler in realtime

And to make it permanent, add ‘elevator=<your choice>’ in your grub.conf at the end of the kernel line:

kernel /boot/vmlinuz-2.6.18-53.el5 ro root=LABEL=/ noapic acpi=off rhgb quiet notsc elevator=noop

 

This is a complicated topic, and I’ve tried to temper the complexity with what will provide the most benefit.  What has made most improvement for you that could be added without much complexity?

May
06
2011
--

Shard-Query turbo charges Infobright community edition (ICE)

Shard-Query is an open source tool kit which helps improve the performance of queries against a MySQL database by distributing the work over multiple machines and/or multiple cores. This is similar to the divide and conquer approach that Hive takes in combination with Hadoop. Shard-Query applies a clever approach to parallelism which allows it to significantly improve the performance of queries by spreading the work over all available compute resources. In this test, Shard-Query averages a nearly 6x (max over 10x) improvement over the baseline, as shown in the following graph:

One significant advantage of Shard-Query over Hive is that it works with existing MySQL data sets and queries. Another advantage is that it works with all MySQL storage engines.

This set of benchmarks evaluates how well Infobright community edition (ICE) performs in combination with Shard-Query.

Data set

It was important to choose a data set that was large enough to create queries that would run for a decent amount of time, but not so large that it was difficult to work with. The ontime flight performance statistics data, available online from the United States Bureau of Transportation Statistics (BTS), made a good candidate for testing, as it had been tested before:

  • Another MPB post
  • Lucid DB testing

The raw data is a completely denormalized schema (single table). In order to demonstrate the power of Shard-Query it is important to test complex queries involving joins and aggregation. A star schema is the most common OLAP/DW data model, since it typically represents a data mart. See also: “Data mart or data warehouse?”. As it is the most common data model, it is desirable to benchmark using a star schema, even though it involves work to transform the data.

Star schema

Transforming the data was straightforward. I should note that I did this preprocessing with the MyISAM storage engine, then I dumped the data to tab delimited flat files using mysqldump. I started by loading the raw data from the BTS into a single database table called ontime_stage.

Then, the airport information was extracted:

create table dim_airport(
airport_id int auto_increment primary key,
unique key(airport_code)
)
as
select
  Origin as `airport_code`,
  OriginCityName as `CityName`,
  OriginState as `State`,
  OriginStateFips as `StateFips`,
  OriginStateName as `StateName` ,
  OriginWac as `Wac`
FROM ontime_stage
UNION
select
  Dest as `airport_code`,
  DestCityName as `CityName`,
  DestState as `State`,
  DestStateFips as `StateFips`,
  DestStateName as `StateName` ,
  DestWac as `Wac`
FROM ontime_stage;

After extracting flight/airline and date information in a similar fashion, a final table `ontime_fact` is created by joining the newly constructed dimension tables to the staging table, omitting the dimension columns from the projection and instead replacing them with the dimension keys:

select dim_date.date_id,
       origin.airport_id as origin_airport_id,
       dest.airport_id as dest_airport_id,
       dim_flight.flight_id,
       ontime_stage.TailNum, ...
from ontime_stage
join dim_date using(FlightDate)
join dim_airport origin on ontime_stage.Origin = origin.airport_code
join dim_airport dest on ontime_stage.Dest = dest.airport_code
join dim_flight using (UniqueCarrier,AirlineID,Carrier,FlightNum);

The data set contains ontime flight information for 22 years, which can be confirmed by examining the contents of the date dimension:

mysql> select count(*),
min(FlightDate),
max(FlightDate)
from dim_date\G
*************************** 1. row ***************************
       count(*): 8401
min(FlightDate): 1988-01-01
max(FlightDate): 2010-12-31
1 row in set (0.00 sec)

The airport dimension is a puppet dimension. It is called a puppet because it serves as both origin and destination dimensions, being referenced by origin_airport_id and destination_airport_id in the fact table, respectively. There are nearly 400 major airports included in the data set.

mysql> select count(*) from dim_airport;
+----------+
| count(*) |
+----------+
|      396 |
+----------+
1 row in set (0.00 sec)

The final dimension is the flight dimension, which contains the flight numbers and air carrier hierarchies. Only the largest air carriers must register and report ontime information with the FAA, so there are only 29 air carriers in the table:

mysql> select count(*),
count(distinct UniqueCarrier)
from dim_flight\G
*************************** 1. row ***************************
                     count(*): 58625
count(distinct UniqueCarrier): 29
1 row in set (0.02 sec)

In total, the fact table contains over 135 million flights:

mysql> select count(*) from ontime_one.ontime_fact;
+-----------+
| count(*)  |
+-----------+
| 135125787 |
+-----------+
1 row in set (0.00 sec)

This should be made fully clear by the following schema diagram:

Star schema (ontime_fact, dim_date, dim_flight, dim_airport)

Diagram of the ontime dimensional schema

Test environment

For this benchmark, a test environment consisting of a single commodity database server with 6 cores (+6ht) and 24GB of memory was selected. The selected operating system was Fedora 14. Oracle VirtualBox OSE was used to create six virtual machines, each running Fedora 14. Each of the virtual machines was granted 4GB of memory. A SATA 7200rpm RAID10 battery backed RAID array was used as the underlying storage for the virtual machines.

Baseline:
The MySQL command line client was used to execute a script file containing the 11 queries. This same SQL script was used in the Shard-Query tests. For the baseline, the results and response times were captured with the \T command. The queries were executed on a single database schema containing all of the data. Before loading, there was approximately 23GB of data. After loading, ICE compressed this data to about 2GB. The test virtual machine was assigned 12 cores in this test.

Scale-up:
Shard-Query was given the following configuration file, which lists only one server. A single schema (ontime_one) contained all of the data. The test virtual machine was assigned 12 cores for this test. The same VM was used as in the previous baseline test, and it was rebooted between tests. A SQL script was fed to the run_query.php script and the output was captured with the ‘tee’ command.

$ cat one.ini
[default]
user=mysql
db=ontime_one
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.102

Scale-out:
In addition to adding parallelism via scale-up, Shard-Query can improve performance of almost all queries by spreading them over more than one physical server. This is called “scaling out” and it allows Shard-Query to vastly improve the performance of queries which have to examine a large amount of data. Shard-Query includes a loader (loader.php) which can be used to either split a data into multiple files (for each shard, for later loading) or it can load files directly, in parallel, to multiple hosts.

Shard-Query will execute queries in parallel over all of these machines. With enough machines, massive parallelism is possible and very large data sets may be processed. As each machine examines only a small subset of the data, performance can be improved significantly:

$ cat shards.ini
[default]
user=mysql
db=ontime
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.101
db=ontime

[shard2]
host=192.168.56.102
db=ontime

[shard3]
host=192.168.56.103
db=ontime

[shard4]
host=192.168.56.105
db=ontime

[shard5]
host=192.168.56.106
db=ontime

[shard6]
host=192.168.56.104
db=ontime

In this configuration, each shard has about 335MB-350MB of data (23GB of raw data, compressed to about 2GB total, then spread over six servers). Due to ICE limitations, the data was split before loading. The splitting/loading process will be described in another post.

Complex queries

Shard-Query was tested with the simple single table version of this data set in a previous blog post. Previous testing was limited to a subset of Vadim’s test queries (see that post). As this new test schema is normalized, Vadim’s test queries must be modified to reflect the more complex schema structure. For this benchmark each of the original queries has been rewritten to conform to the new schema, and additionally two new test queries have been added. To model real world complexity, each of the test queries feature at least one join, and many of the filter conditions are placed on attributes in the dimension tables. It will be very interesting to test these queries on various engines and databases.

Following is a list of the queries, followed by a response time table recording the actual response times for each query. The queries should be self-explanatory.

Performance, at a glance

You will notice that Shard-Query is faster in nearly every case. The performance of the queries is improved significantly by scaling out, even more so than by scaling up, because additional parallelism is added beyond what can be exploited by scale-up. Scale-up can improve query performance when there are enough resources to support the added parallelism, and when one of the following is used in the query: BETWEEN or IN clauses, subqueries in the FROM clause, UNION or UNION ALL clauses. If none of those features are used, then parallelism can’t be added. Q9 is an example of such a query.
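
Conceptually, a range predicate gives Shard-Query independent pieces that can run concurrently and then be merged; a simplified illustration using Q1’s predicate (this only shows the idea, not the tool’s actual rewrite):

-- the original range...
--   WHERE Year BETWEEN 2000 AND 2008
-- ...can be evaluated as independent sub-ranges, each of which can run on a different
-- core or shard, with the partial GROUP BY results merged afterwards:
SELECT DayOfWeek, count(*) AS c FROM ontime_fact JOIN dim_date USING (date_id)
 WHERE Year BETWEEN 2000 AND 2002 GROUP BY DayOfWeek;
SELECT DayOfWeek, count(*) AS c FROM ontime_fact JOIN dim_date USING (date_id)
 WHERE Year BETWEEN 2003 AND 2005 GROUP BY DayOfWeek;
SELECT DayOfWeek, count(*) AS c FROM ontime_fact JOIN dim_date USING (date_id)
 WHERE Year BETWEEN 2006 AND 2008 GROUP BY DayOfWeek;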

Query details:

-- Q1
 SELECT DayOfWeek, count(*) AS c
   from ontime_fact
   JOIN dim_date using (date_id)
  WHERE Year
BETWEEN 2000 AND 2008
  GROUP BY DayOfWeek
  ORDER BY c DESC;

-- Q2
SELECT DayOfWeek, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
 WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008
 GROUP BY DayOfWeek
 ORDER BY c DESC;

-- Q3
SELECT CityName as Origin, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
  JOIN dim_airport origin
    ON origin_airport_id = origin.airport_id
 WHERE DepDelay>10
   AND Year BETWEEN 2000 AND 2008
 GROUP BY 1
 ORDER BY c
 LIMIT 10;


The next queries show how performance improves when Shard-Query adds parallelism for subqueries in the FROM clause. There are benefits with both “scale-up” and “scale-out”, but once again, the “scale-out” results are the most striking.

-- Q4
SELECT Carrier, count(*) as c
  from ontime_fact
  JOIN dim_date using (date_id)
  join dim_flight using(flight_id)
 WHERE DepDelay>10
   AND Year=2007
 GROUP BY Carrier
 ORDER BY c DESC;

-- Q5
SELECT t.Carrier, c, c2, c*1000/c2 as c3
FROM
     (SELECT Carrier, count(*) AS c
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE DepDelay>10
         AND Year=2007
       GROUP BY Carrier) t
JOIN (SELECT Carrier, count(*) AS c2
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE Year=2007
       GROUP BY Carrier) t2
  ON (t.Carrier=t2.Carrier)
ORDER BY c3 DESC;

-- Q6
SELECT t.Year, c1 / c2 as ratio
FROM
     (select Year, count(*)*1000 as c1
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t
JOIN (select Year, count(*) as c2
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t2
  ON (t.Year=t2.Year);

-- Q7
SELECT t.Year, c1 / c2 as ratio
  FROM (select Year, count(Year)*1000 as c1
          from ontime_fact
          join dim_date using (date_id)
         WHERE DepDelay>10
         GROUP BY Year) t
  JOIN (select Year, count(*) as c2
          from ontime_fact
          join dim_date using (date_id)
         GROUP BY Year) t2
    ON (t.Year=t2.Year);


The performance of the following queries depends on the size of the date range:

-- Q8.0
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2001
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.1
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2005
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.2
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.3
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1990 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.4
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1980 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;


Finally, Shard-Query performance continues to improve when grouping and filtering are used. Again, notice Q9. It doesn’t use any features which Shard-Query can exploit to add parallelism. Thus, in the scale-up configuration it does not perform any better than the baseline, and actually performed just a little worse. Since scale-out splits the data between servers, it performs about 6x better, as the degree of parallelism is controlled by the number of shards.

-- Q9
select Year ,count(Year) as c1
  from ontime_fact
  JOIN dim_date using (date_id)
 group by Year;

-- Q10
SELECT Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2009 and 2011
 GROUP BY Carrier,dest.CityName
 ORDER BY 3 DESC;

-- Q11
SELECT Year, Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2000 and 2003
   AND Carrier = 'AA'
 GROUP BY Year, Carrier,dest.CityName
 ORDER BY 4 DESC;


Conclusion

The divide and conquer approach is very useful when working with large quantities of data. Shard-Query can be used with existing data sets easily, improving the performance of queries significantly if they use common query features like BETWEEN or IN. It is also possible to spread your data over multiple machines, scaling out to improve query response times significantly.
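To make the divide-and-conquer idea concrete, here is a rough sketch of how a query with a BETWEEN predicate can be decomposed into independent per-year sub-queries whose partial results are merged afterwards. This is only a conceptual illustration of the technique, not the exact rewrite Shard-Query generates:

-- Conceptual illustration only: a year range split into one sub-query per
-- year. Each sub-query can run in parallel (on one shard or many), and the
-- partial counts are merged with a final aggregation.
SELECT Year, SUM(c) AS c
  FROM (
        SELECT Year, COUNT(*) AS c
          FROM ontime_fact JOIN dim_date USING (date_id)
         WHERE DepDelay > 10 AND Year = 2009
         GROUP BY Year
        UNION ALL
        SELECT Year, COUNT(*) AS c
          FROM ontime_fact JOIN dim_date USING (date_id)
         WHERE DepDelay > 10 AND Year = 2010
         GROUP BY Year
        UNION ALL
        SELECT Year, COUNT(*) AS c
          FROM ontime_fact JOIN dim_date USING (date_id)
         WHERE DepDelay > 10 AND Year = 2011
         GROUP BY Year
       ) t
 GROUP BY Year;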

These queries are a great test of Shard-Query features. Shard-Query is currently approaching RC status. If you decide to test it and encounter issues, please file a bug on the bug tracker. You can get Shard-Query (currently in development release form as a checkout from SVN) here: Shard-Query Google code project

Full disclosure

Justin Swanhart, the author of this article, is also the creator and maintainer of Shard-Query. The author has previously worked in cooperation with Infobright, including on benchmarking. These particular tests were performed independently of Infobright, without their knowledge or approval. Infobright was, however, given the chance to review this document before publication, as a courtesy. All findings are represented truthfully, transparently, and without any intended bias.

Apr
07
2011
--

Innodb row size limitation

I recently worked on a customer case where, at seemingly random times, inserts would fail with InnoDB error 139. This is a rather simple problem, but due to its nature, it may only affect you after you already have a system running in production for a while.

Suppose you have the following table structure:

CREATE TABLE example (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  fname TEXT NOT NULL,
  fcomment TEXT,
  ftitle TEXT NOT NULL,
  fsubtitle TEXT NOT NULL,
  fcontent TEXT NOT NULL,
  fheader TEXT,
  ffooter TEXT,
  fdisclaimer TEXT,
  fcopyright TEXT,
  fstylesheet TEXT,
  fterms TEXT,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

Now you insert some test data into it:
mysql> INSERT INTO example
-> VALUES (
->   NULL,
->   'First example',
->   'First comment',
->   'First title',
->   'First subtitle',
->   'First content',
->   'First header',
->   'First footer',
->   'First disclaimer',
->   'First copyright',
->   'First stylesheet',
->   'First terms'
-> );
Query OK, 1 row affected (0.03 sec)

All goes ok.

Now you begin using this structure in production, and occasionally get this error:
ERROR 1030 (HY000) at line 1: Got error 139 from storage engine

What’s going on?

Innodb gives you this error when it can’t store all of the variable-length columns for a given row on a single database page.

InnoDB has a limit on the maximum row size, which is slightly less than half a database page; the actual math is roughly (16KB - page header - page trailer) / 2. For the default page size of 16KB, this sets a limit of about 8000 bytes on the row size (8126 as reported by my test installation). This limit is a consequence of InnoDB storing two rows on a page. If you’re using compression, however, you can store a single row on a page.
If your row has variable-length columns and the complete row size exceeds this limit, InnoDB chooses variable-length columns for off-page storage.

In these cases, the first 768 bytes of each variable-length column are stored locally, and the rest is stored outside of the page (this behavior is version specific; see http://www.mysqlperformanceblog.com/2010/02/09/blob-storage-in-innodb/ for further reference).

It’s worth mentioning that this limit applies to the byte size of values, not the character size. This means that if you insert a 500-character string consisting entirely of multi-byte characters into a VARCHAR(500) column, that value will also be chosen for off-page storage. It is therefore possible to hit this error after converting from a native-language character set to utf8, as the conversion can increase the byte size dramatically.

If you have more than 10 variable-length columns, and each exceeds 768 bytes, then you’ll have at least 8448 bytes destined for local storage, not counting other fixed-length columns. This exceeds the limit, and therefore you get the error.

You won’t always get this error, because the limit is not evaluated at table definition time, but at row insertion time. You may well have a table with 50 variable-length columns, but if their maximum length is, say, 50 bytes, you still have plenty of room for local storage.
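To make this concrete, an insert like the following, which fills every one of the 11 TEXT columns of the table above with more than 768 bytes, should trigger the error on the original (Antelope) InnoDB format, while the short test row shown earlier did not:

-- Every variable-length column exceeds 768 bytes, so at least
-- 11 x 768 = 8448 bytes would have to stay on the page: too much.
INSERT INTO example VALUES (
  NULL,
  REPEAT('a', 1000), REPEAT('b', 1000), REPEAT('c', 1000),
  REPEAT('d', 1000), REPEAT('e', 1000), REPEAT('f', 1000),
  REPEAT('g', 1000), REPEAT('h', 1000), REPEAT('i', 1000),
  REPEAT('j', 1000), REPEAT('k', 1000)
);
-- ERROR 1030 (HY000): Got error 139 from storage engine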

The first thing to highlight here is the need for proper testing. If you just insert dummy data into your database, you may fail to catch important data-dependent bugs like this one.

Now, how can you work around this problem?

Here are a few approaches you can take to solve this:

Upgrade to Barracuda format

This is probably the simplest approach, and since the Barracuda file format stores long variable-length values entirely off-page, keeping only a 20-byte pointer in the row, this problem just goes away.
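As a rough sketch (assuming MySQL 5.1 with the InnoDB plugin or MySQL 5.5, where the Barracuda file format and the DYNAMIC row format are available; both settings can also go in my.cnf), the switch might look like this:

-- Enable the Barracuda file format; it requires file-per-table tablespaces.
SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_file_per_table = 1;

-- Rebuild the table with a row format that stores long variable-length
-- values fully off-page, leaving only a 20-byte pointer in the row.
ALTER TABLE example ROW_FORMAT=DYNAMIC;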

Limit the size of variable length columns

This is a quite obvious approach, and if your data allows it, it’s also one of the simplest to implement.
Just because you need a variable-length field, it doesn’t mean you need to be able to store data up to that field’s maximum allowed length. So, if your problem domain allows you to limit the size of these fields without affecting your use cases, this is the way to go.
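For example, if the business rules allow it, some of the unbounded TEXT columns of the example table could be capped (the specific lengths below are arbitrary, chosen only for illustration):

-- Cap the columns that never need to hold more than a few hundred characters.
ALTER TABLE example
  MODIFY fname      VARCHAR(255) NOT NULL,
  MODIFY ftitle     VARCHAR(255) NOT NULL,
  MODIFY fsubtitle  VARCHAR(255) NOT NULL,
  MODIFY fcopyright VARCHAR(255);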

Use the COMPRESS/UNCOMPRESS functions

This is also relatively simple to implement, though it depends on where in your code you use these fields. If you have a modular data access API, you shouldn’t have to change code in a lot of places. How much you gain with this approach depends a lot on the type of data you’re inserting (hint: the more random your data is, the worse compression will work). Something important to note is that the result of the COMPRESS function is always binary, so, for example, you should use a BLOB instead of a TEXT column for storage. You can find more information about these functions here.
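A minimal sketch of the idea, using a hypothetical variant of the example table where the large column is stored compressed (note the BLOB type, since COMPRESS() returns binary data):

-- fcontent holds COMPRESS()ed data, so it is declared as BLOB, not TEXT.
CREATE TABLE example_compressed (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
  fname    VARCHAR(255) NOT NULL,
  fcontent BLOB,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO example_compressed (fname, fcontent)
VALUES ('First example', COMPRESS('First content'));

-- Decompress on the way out.
SELECT fname, UNCOMPRESS(fcontent) AS fcontent
  FROM example_compressed;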

Split the table in a way that you don’t have more than 10 variable length columns per table

This approach has more impact on your existing code, as it requires you to rewrite, as joins, the queries that access all the columns of the original table. It also requires you to change the way you insert and modify data, as more tables are involved; for example, you will have to run any multi-table update in a transaction. It will also impact both read and modification speed, precisely because of the added join complexity. However, if you have too many variable-length columns and you can’t limit their size, this may be the only possible solution.
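A hypothetical split of the example table might keep the short, frequently used columns in one table and move the rest into a second table joined by the primary key; the exact column split below is only illustrative:

-- Core columns, read on almost every request.
CREATE TABLE example_main (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  fname     TEXT NOT NULL,
  ftitle    TEXT NOT NULL,
  fsubtitle TEXT NOT NULL,
  fcontent  TEXT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Rarely read columns, fetched only when needed.
CREATE TABLE example_extra (
  id          INT UNSIGNED NOT NULL,
  fheader     TEXT,
  ffooter     TEXT,
  fdisclaimer TEXT,
  fcopyright  TEXT,
  fstylesheet TEXT,
  fterms      TEXT,
  PRIMARY KEY (id),
  FOREIGN KEY (id) REFERENCES example_main (id)
) ENGINE=InnoDB;

-- Queries that need everything now join the two tables.
SELECT m.*, e.*
  FROM example_main m
  JOIN example_extra e USING (id);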

Combine all your variable length fields into a single BLOB and do the splitting at the application level.

This will always work (unless you have so many fixed-length columns that you still exceed 8000 bytes, but in that case I think you have a bigger problem than InnoDB’s internal limitations), though you still need to change the way your application uses this data. You can limit the changes to your application if you choose XML as the format and use MySQL’s built-in functions for handling this data. This approach, however, does have other negative impacts, as it prevents you from using SQL operators and functions directly on the individual fields you want to store. It also forces you to read and write more data at once, while one of the interesting things about overflow blob storage is that if the columns aren’t requested as part of a row read, they don’t have to be touched at all.
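A sketch of the XML variant (ExtractValue() is a standard MySQL 5.1 function; the document layout and table name are, of course, up to the application and only illustrative here):

-- All the variable-length fields packed into one XML document.
CREATE TABLE example_xml (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  doc BLOB NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO example_xml (doc)
VALUES ('<row><fname>First example</fname><ftitle>First title</ftitle></row>');

-- Individual fields can still be read with the XML functions,
-- at the cost of parsing the whole document on every access.
SELECT ExtractValue(doc, '/row/ftitle') AS ftitle
  FROM example_xml;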

Of course, these approaches can be combined. For example, using the single BLOB field with COMPRESS/UNCOMPRESS can yield great results.

Finally, let me mention that the handling of overflow pages involves other things one should consider. Overflow pages are not shared, so a 769-byte blob will get a 16KB overflow page to itself. Also, overflow pages are allocated one page at a time until you reach 32 pages; after that, they are allocated an extent at a time (64 pages).

Nov
20
2009
--

Paul McCullagh answers your questions about PBXT

Following on from our earlier announcement, Paul McCullagh has responded with the answers to your questions – as well as a few I gathered from other Percona folks, and attendees of OpenSQL Camp. Thank you Paul!

What’s the “ideal” use case for the PBXT engine, and how does it compare in performance?  When would I use PBXT instead of a storage engine like MyISAM, InnoDB or XtraDB?

Unfortunately it is not possible to point to a specific category of applications and say, “PBXT will be better here, so try it”.  PBXT is a general-purpose transactional storage engine, designed to perform well on a broad range of tasks, much like InnoDB.  However, PBXT’s log-based architecture makes its performance characteristics different from both MyISAM and InnoDB/XtraDB. Tests show that PBXT’s performance is similar to InnoDB’s but, depending on your database design and the application, it can be faster.

PBXT is a community project and, of course, we depend on users trying it out. In the long run, this will determine to what extent we are able to continue to develop and improve the engine.  So, despite this rather vague answer, we are hoping that more people try it out, and work with us to improve the engine as necessary.  My thanks to all who are already doing this! :)

I think I remember reports that PBXT (at an early stage) outperformed InnoDB on INSERTs and UPDATEs (but not SELECTs). That would make PBXT very interesting for non-SELECT-intensive applications (finance, production management, etc.) in my opinion.  Is this the case, and do you have any recent benchmarks available?

This is no longer necessarily the case. For example, a test (http://mysqlha.blogspot.com/2009/03/pbxt-is-fast-no-kidding.html) by Mark Callaghan shows that PBXT can actually outperform InnoDB on SELECTs under some circumstances.

The implementation of full-durability has changed the performance characteristics of PBXT from “MyISAM-like” to more InnoDB-like. Originally PBXT was conceived as an engine that would be somewhere between MyISAM and InnoDB in both performance and features. The early version of PBXT was not fully durable (equivalent to innodb_flush_log_at_trx_commit=2).

A major change was completed at the beginning of last year with the implementation of full-durability. In doing this it was important to keep the log-based architecture which was the reason for the high write performance of earlier versions.

Traditional transactional implementations suffer from the problem that a backlog of asynchronous writes accumulates until it swamps the engine. There has been a lot of work on both InnoDB and XtraDB to solve this problem. The key words here are fuzzy and adaptive checkpointing (the former originally implemented by Heikki for InnoDB, and the latter an excellent addition to XtraDB).

Both methods improve the management of the asynchronous writes. The idea behind the log-based solution, on the other hand, is to avoid accumulating a backlog of asynchronous writes in the first place by writing synchronously.

Although write performance is comparable with InnoDB, I am not entirely convinced that PBXT’s implementation of the log-based I/O is optimal at this stage. This is ongoing work for PBXT 1.5.

Morgan notes: As well as Adaptive Checkpointing, Oracle has also been working on Adaptive Flushing for the InnoDB Plugin. The engine being ‘swamped’ problem that Paul is referring to is best described visually – see this post for more info.

What were the hard decisions or trade-offs that you had to make when designing PBXT?

If you read the white paper from 2006 (http://primebase.org/download/pbxt_white_paper.pdf) you will notice that the original design was uncompromisingly MVCC-based.  Some of this has been changed to make PBXT more InnoDB-like, but other principles have remained.

Pure MVCC does not do any locking. Read locks are not required because each transaction effectively gets its own snapshot of the database. And write locks are not acquired when updating. Instead, the application can hit an “optimistic lock error” if a record is updated by another user.

Now PBXT does acquire locks, for two reasons: to support SELECT FOR UPDATE, and to avoid optimistic locking errors. This makes PBXT’s behavior identical to InnoDB’s in REPEATABLE READ mode.

On the other hand, there are currently no plans to implement InnoDB style “gap locking”. Gap locking effectively involves locking rows that do not exist. This, in turn, means that PBXT transactions are not SERIALIZABLE.  A result of this is that statement-based replication is not supported by the engine.

Another hard decision was not to implement clustered indexes, which I discuss in more detail later.

How does online backup work in PBXT, and is incremental backup possible?

A recent version of PBXT (1.0.09) supports the MySQL Backup API which was originally implemented in MySQL 6.0. This feature is now scheduled for an upcoming version of MySQL 5.4.

The Backup API makes it possible to pull a consistent snapshot of an entire database even when tables use different engine types. The API does not yet support incremental backup, but this is planned.

Internally this feature is implemented by PBXT using an MVCC-based consistent snapshot.

Does PBXT have a maintenance thread like InnoDB’s main thread?

PBXT has several system threads that are responsible for various maintenance tasks. The most important of these are the “Writer”, the “Sweeper” and the “Checkpointer”:

  • The Writer transfers data from the transaction log to the database table files. This task runs constantly and is the main source of asynchronous I/O in the engine.
  • The Sweeper is a thread unique to the PBXT architecture. Its job is to clean up after a transaction. Basically, it recognizes which record versions are no longer needed and deletes them from the database.
  • The Checkpointer flushes the data written by the Writer to disk. It is also responsible for writing consistent snapshots of the index files to disk. The frequency of checkpoints, as with other engines like InnoDB, determines the recovery time.

Does PBXT support clustered indexes?

No, currently it does not. This is one of the original design decisions (as raised by a previous question).  Two things contributed to this decision:

  • Supporting clustered indexes would have made the implementation of PBXT more complicated than I would have liked. My goal was to make PBXT as simple as possible, so that it is easy to maintain the code and add features.
  • I believe clustered indexes are becoming less relevant with the rise of Solid State technology. As random read access times decrease clustering of data will become less and less important.

What is the page size in PBXT, and can it be tuned?

PBXT uses 16K pages for the index data and (approximately) 32K pages for the table data. Both sizes can be set using compile-time switches. However, if the index page size is changed, then the indices need to be rebuilt, which can be done with REPAIR TABLE.  The table data page size does not require a rebuild, because a page of records in the table is just a group of records (not an actual fixed-length page).

Are there any differences in the PBXT implementation of MVCC that might surprise experienced InnoDB DBAs?  Also – In MVCC does it keep the versions in indexes, and can PBXT use MVCC for index scans?

If you are using InnoDB in REPEATABLE READ mode, then there is essentially no difference in the isolation paradigm between the two engines.

REPEATABLE READ is often preferred over SERIALIZABLE mode because it allows a greater degree of concurrency while still providing the necessary transaction isolation. So I do not consider the lack of serializability a serious deficit in the engine. And, fortunately, MySQL 5.1 supports row-based replication, which makes it possible to do replication while using REPEATABLE READ.
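As a small illustration (standard MySQL 5.1 replication settings, not PBXT-specific), switching the binary log to row-based events looks like this:

-- In my.cnf on the master: binlog-format = ROW
-- Or at runtime, affecting new sessions:
SET GLOBAL binlog_format = 'ROW';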

PBXT does use MVCC to do index scans. Basically this means that all types of SELECTs can be done without locking.

Morgan notes: Indexes not using MVCC is one of the main differences in the Falcon storage engine.

PBXT supports row-level locking and foreign keys. Does this create any additional locking overhead that we should be aware of?

Firstly, PBXT does not acquire read locks. A normal SELECT does not lock at all. In addition, an UPDATE or DELETE only acquires a temporary row-lock. This lock is released when the row is updated or deleted because the MVCC system can detect that a row has been changed (and is therefore write locked).

This means that PBXT does not normally need to maintain long lists of row-level locks. This is also the case when a foreign key leads to cascading operations which can affect thousands of rows.

The only case you need to be aware of is SELECT FOR UPDATE. This operation acquires and holds a row-level lock for each row returned by the SELECT. These locks are all stored in RAM. The format is quite compact (especially when row IDs are consecutive) but this can become an issue if millions of rows are selected in this manner.
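For example (the table name below is hypothetical; the point is only that every row the SELECT returns gets a row-level lock held in memory until the transaction ends):

START TRANSACTION;

-- Each matching row is locked, and the lock is kept in RAM until COMMIT,
-- so very large result sets should be avoided with FOR UPDATE on PBXT.
SELECT id, balance
  FROM accounts
 WHERE id BETWEEN 1 AND 100
   FOR UPDATE;

-- ... modify the selected rows ...

COMMIT;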

I should also mention that the consequence of this is that SELECT … LOCK IN SHARE MODE is currently not supported.

When I evaluate a storage engine my key acceptance criteria are things like backup, concurrency, ACID compliance and crash recovery. As a storage engine developer, what other criteria do you think I should be adding?

Yes, I think you have mentioned the most important criteria. What I can add to this list are 3 things that make developing a storage engine extremely demanding: performance, stability and data integrity.

Of course, as a DBA or database user these aspects are so basic that they are taken for granted.

But engine developers need to keep performance, stability and data integrity in mind constantly. The problem is, they compete with each other: increasing performance often causes instabilities that then have to be fixed. How to optimize the program without compromising data integrity is a constant question.

Relative to maintaining performance, stability and data integrity, adding features to an engine is easy. So I would say that these are the criteria that concern a developer the most.

MySQL supports a “pluggable storage engine API”, but it seems that not all the storage engine vendors are able to keep all their code at that layer (Infobright had to make major changes to MySQL itself).  What war stories can you report on in plugging into MySQL?

Unfortunately the “war” continues. I have already received several e-mails reporting that PBXT does not compile with the recently released MySQL 5.1.41!

Any dot release can lead to this problem, and I think PBXT is fairly moderate with its integration into MySQL.

My main advantage: I have been able to avoid modifying any part of MySQL to make the engine work. This means that PBXT runs with the standard MySQL/MariaDB distribution.

But this has required quite a bit of creative work, in other words, hacks.

One of the main problems has been running into global locks when calling back into MySQL to do things like open a table, create a session structure (THD) or create a .frm file.

One extreme example of this is PBXT recovery. When MySQL calls the engine “init” method on startup it is holding the global LOCK_plugin lock. In init, the engine needs to do recovery. In PBXT’s case this means opening tables (reading a .frm file), which requires creating a THD. The code to create a THD in turn tries to acquire LOCK_plugin!

Unfortunately a thread hangs if it tries to acquire the same mutex twice, so this just does not work!

We went through quite a few iterations (MySQL code was also changing during the development of 5.1) before we came up with the current solution: create a background thread to do recovery asynchronously. So the thread can wait for the LOCK_plugin to be unlocked before it continues.

The effect is that the init function returns quickly, but the first queries that access PBXT tables may hang while waiting for recovery to complete.

PBXT seems to have very few configuration parameters.  Was this an intentional design decision, and do you see it creating opportunities for you in organizations with less internal IT-expertise?

No, this is not by design.

While I try to only add tuning parameters that are absolutely necessary, PBXT is not specifically designed to be self-tuning, because I believe that is a very hard problem to solve in general.

Tuning parameters are often added to an engine in response to performance problems in particular configurations. This is not necessarily a bad thing, because it provides DBAs with the tools they need.

My goal for PBXT in this regard is twofold:

  1. I hope that most installations can get away with setting a minimum of tuning parameters, in particular, just the cache values.
  2. On the other hand, I aim to provide expert tuning parameters for installations that need to extract maximum performance from the hardware. This work is ongoing.

Morgan notes: There are more tuning parameters in InnoDB/XtraDB now than there were three years ago. This is probably something that emerges over time as we come to understand more about an engine.


Entry posted by Morgan Tocker
