Nov
02
2017
--

MySQL vs. MariaDB: Reality Check

MySQL vs. MariaDB

MySQL vs. MariaDBIn this blog, we’ll provide a comparison between MySQL vs. MariaDB (including Percona Server for MySQL).

Introduction

The goal of this blog post is to evaluate, at a higher level, MySQL, MariaDB and Percona Server for MySQL side-by-side to better inform the decision making process. It is largely an unofficial response to published comments from the MariaDB Corporation.

It is worth noting that Percona Server for MySQL is a drop-in compatible branch of MySQL, where Percona contributes as much as possible upstream. MariaDB Server, on the other hand, is a fork of MySQL 5.5. They cherry-picked MySQL features, and don’t guarantee drop-in compatibility any longer.

MySQL Percona Server for MySQL* MariaDB Server
Protocols MySQL protocol over port 3306, X Protocol over port 33060 MySQL protocol over port 3306, X Protocol over port 33060 MySQL protocol, MariaDB Server extensions
Community –
Source Code
Open Source Open Source Open Source
Community – Development Open Source, contributions via signing the Oracle Contributor Agreement (OCA) Open Source Open Source, contributions via the new BSD license or signing the MariaDB Contributor Agreement (MCA)
Community – Collaboration Mailing list, forums, bugs system Mailing list, forums, bugs system (Jira, Launchpad) Mailing list, bugs system (Jira), IRC channel
Core –
Replication
MySQL replication with GTID MySQL replication with GTID MariaDB Server replication, with own GTID, compatible only if MariaDB Server is a slave to MySQL, not vice versa
Core –
Routing
MySQL Router (GPLv2) ProxySQL (GPLv3) MariaDB MaxScale (Business Source License)
Core –
Partitioning
Standard Standard Standard, with extra engines like SPIDER/CONNECT that offer varying levels of support
Tool –
Editing
MySQL Workbench for Microsoft Windows, macOS, and Linux MySQL Workbench for Microsoft Windows, macOS, and Linux Webyog’s SQLYog for Microsoft Windows (MySQL Workbench notes an incompatible server)
Tool –
Monitoring
MySQL Enterprise Monitor Percona Monitoring & Management (PMM) (100% open source) Webyog’s Monyog
Scalability –
Client Connections
MySQL Enterprise Threadpool Open Source Threadpool with support for priority tickets Open Source Threadpool
Scalability –
Clustering
MySQL Group Replication MySQL Group Replication, Percona XtraDB Cluster (based on a further engineered Galera Cluster) MariaDB Enterprise Cluster (based on Galera Cluster)
Security –
Encryption
Tablespace data-at-rest encryption. Amazon KMS, Oracle Vault Enterprise Edition Tablespace data-at-rest encryption with Keyring Vault plugin Tablespace and table data-at-rest encryption. Amazon KMS, binlog/redo/tmp file with Aria tablespace encryption
Security –
Data Masking
ProxySQL data masking ProxySQL data masking MariaDB MaxScale data masking
Security –
Firewall
MySQL Enterprise Firewall ProxySQL Firewall MariaDB MaxScale Firewall
Security –
Auditing
MySQL Enterprise Audit Plugin Percona Audit Plugin (OSS) MariaDB Audit Plugin (OSS)
Analytics No ClickHouse MariaDB ColumnStore
SQL –
Common Table Expressions
In-development for MySQL 8.0 (now a release candidate) In-development for MySQL 8.0 (now a release candidate) Present in MariaDB Server 10.2
SQL –
Window Functions
In-development for MySQL 8.0 (now a release candidate) In-development for MySQL 8.0 (now a release candidate) Present in MariaDB Server 10.2
Temporal –
Log-based rollback
No No In development for MariaDB Server 10.3
Temporal – system versioned tables No No In development for MariaDB Server 10.3
JSON JSON Data type, 21 functions JSON Data type, 21 functions No JSON Data Type, 26 functions
Official
client connectors
C (libmysqlclient), Java, ODBC, .NET, Node.js, Python, C++, mysqlnd for PHP C (libmysqlclient), Java, ODBC, .NET, Node.js, Python, C++, mysqlnd for PHP C (libmariadbclient), Java, ODBC
Usability – CJK Language support Gb18030, ngram & MeCab for InnoDB full-text search Gb18030, ngram & MeCab for InnoDB full-text search No
Monitoring – PERFORMANCE
_SCHEMA
Thorough instrumentation in 5.7, sys schema included Thorough instrumentation in 5.7, sys schema included Instrumentation from MySQL 5.6, sys schema not included
Security – Password authentication sha256_password (with caching_sha2_password in 8.0) sha256_password (with caching_sha2_password in 8.0) ed25519 (incompatible with sha256_password)
Security –
Secure out of the box
validate_password on by default, to choose a strong password at the start validate_password on by default, to choose a strong password at the start No
Usability – Syntax differences EXPLAIN FOR CONNECTION <thread_id> EXPLAIN FOR CONNECTION <thread_id> SHOW EXPLAIN FOR <thread_id>
Optimiser –
Optimiser Tracing
Yes Yes No
Optimiser –
Optimiser Hints
Yes Yes No
DBA –
Super readonly mode
Yes Yes No
Security – Password expiry Yes Yes No
Security – Password last changed? Password lifetime? Yes Yes No
Security – VALIDATE_PASSWORD
_STRENGTH()
Yes Yes No
Security – ACCOUNT LOCK/UNLOCK Yes Yes No
Usability – Query Rewriting Yes Yes No
GIS – GeoJSON &
GeoHash functionality
Yes Yes Incomplete
Security – mysql_ssl_rsa_setup Yes Yes No (setup SSL connections manually)
MySQL Utilities Yes Yes No
Backup locks No (in development for 8.0) Yes No
Usability – InnoDB memcached interface Yes Yes No

*Note. Third-party software (such as ProxySQL and ClickHouse) used in conjunction with Percona Server for MySQL is not necessarily covered by Percona Support services.

To get a higher level view of what Percona Server for MySQL offers compared to MySQL, please visit: Percona Server Feature Comparison. Read this for a higher level view of compatibility between MariaDB Server and MySQL written by MariaDB Corporation.

Open Community

MariaDB Server undoubtedly has an open community, with governance mixed between MariaDB Foundation and MariaDB Corporation. There are open developer meetings on average about twice per year, two mailing lists (one for developers and users), an IRC channel and an open JIRA ticket system that logs bugs and feature requests.

Percona Server for MySQL also has an open community. Developer meetings are not open to general contributors, but there is a mailing list, an IRC channel and two systems – Launchpad and JIRA – for logging bugs and feature requests.

MySQL also has an open community where developer meetings are also not open to general contributors. There are many mailing lists, there are a few IRC channels and there is the MySQL bugs system. The worklogs are where the design for future releases happens, and these are opened up when their features are fully developed and  source-code-pushed.

From a source code standpoint, MySQL makes pushes to Github when a release is made; whereas open source development happens for Percona Server for MySQL and MariaDB Server on Github.

Feature development on MySQL continues in leaps and bounds, and Oracle has been an excellent steward of MySQL. Please refer to The Complete List of Features in 5.7, as well as The Unofficial MySQL 8 Optimiser Guide.

Linux distributions have chosen MariaDB Server 5.5, and some have chosen MariaDB Server 10.0/10.1 when there was more backward compatibility to MySQL 5.5/5.6. It is the “default” MySQL in many Linux distributions (such as Red Hat Enterprise Linux, SUSE and Debian). However, Ubuntu still believes that when you ask for MySQL you should get it (and that is what Ubuntu ships).

One of the main reasons Debian switched was due to the way Oracle publishes updates for security issues. They are released as a whole quarterly as Critical Patch Updates, without much detail about individual fixes. This is a policy that is unlikely to change, but has had no adverse effects on distribution.

All projects actively embrace contributions from the open community. MariaDB Server does include contributions like the MyRocks engine developed at Facebook, but so does Percona Server for MySQL. Oracle accepts contributions from a long list of contributors, including Percona. Please see Licensing information for MySQL 5.7 as an example.

A Shared Core Engine

MariaDB Server has differed from MySQL since MySQL 5.5. This is one reason why you don’t get version numbers that follow the MySQL scheme. It is also worth noting that features are cherry-picked at merge time, because the source code has diverged so much since then.

As the table below shows, it took Percona Server for MySQL over four months to get a stable 5.5 release based on MySQL 5.5, while it took MariaDB Server one year and four months to get a stable 5.5 release based on MySQL 5.5. Percona Server for MySQL 5.6 and 5.7 are based on their respective MySQL versions.

MySQL Percona Server for MySQL MariaDB Server
3 December 2010 5.5.8 GA
28 April 2011 5.5.11-20.2 GA
11 April 2012 5.5.23 GA
5 February 2013 5.6.10 GA
7 October 2013 5.6.13-61.0 GA
31 March 2014 10.0.10 GA
17 October 2015 10.1.8 GA
21 October 2015 5.7.9 GA
23 February 2016 5.7.10-3 GA
23 May 2017 10.2.6 GA

 

MySQL is currently at 8.0.3 Release Candidate, while MariaDB Server is at 10.3.2 Alpha as of this writing.

MariaDB Server is by no means a drop-in replacement for MySQL. The risk of moving to MariaDB Server if you aren’t using newer MySQL features may be minimal, but the risk of moving out of MariaDB Server to MySQL is very prevalent. Linux distributions like Debian already warn you of this.

MySQL vs. MariaDB

The differences are beyond just default configuration options. Some features, like time-delayed replication that were present in MySQL since 2013, only make an appearance in MariaDB Server in 2017! (Refer to the MariaDB Server 10.2 Overview for more.) However, it is also worth noting some features such as multi-source replication appeared in MariaDB Server 10.0 first, and only then came to MySQL 5.7.

Extensibility

MySQL and MariaDB Server have a storage engine interface, and this is how you access all engines, including the favored InnoDB/Percona XtraDB. It is worth noting that Percona XtraDB was the default InnoDB replacement in MariaDB Server 5.1, 5.2, 5.3, 5.5, 10.0 and 10.1. But in MariaDB Server 10.2, the InnoDB of choice is upstream MySQL.

Stock MySQL has provided several storage engines beyond just InnoDB (the default) and MyISAM. You can find out more information about 5.7 Supported Engines.

Percona Server for MySQL includes a modified MEMORY storage engine, ships Percona XtraDB as the default InnoDB and also ships TokuDB and MyRocks (currently experimental). MyRocks is based on the RocksDB engine, and both are developed extensively at Facebook.

MariaDB Server includes many storage engines, beyond the default InnoDB. MyISAM is modified with segmented key caches, the default temporary table storage engine is Aria (which is a crash-safe MyISAM), the FederatedX engine is a modified FEDERATED engine, and there are more: CONNECT, Mroonga, OQGRAPH, Sequence, SphinxSE, SPIDER, TokuDB and of course MyRocks.

Storage engines have specific use cases, and have different levels of feature completeness. You should thoroughly evaluate a storage engine before choosing it. We believe that over 90% of installations are fine with just InnoDB or Percona XtraDB. Percona TokuDB is another engine that users who need compression could use. We naturally expect more usage in the MyRocks sphere going forward.

Analytics

MariaDB ColumnStore is the MariaDB solution to analytics and using a column-based store. It is a separate download and product, and not a traditional storage engine (yet). It is based on the now defunct InfiniDB product.

At Percona, we are quite excited by ClickHouse. We also have plenty of content around it. There is no MySQL story around this.

High Availability

High Availability is an exciting topic in the MySQL world, considering the server itself has been around for over 22 years. There are so many solutions out there, and some have had evolution as well.

MySQL provides MySQL Cluster (NDBCLUSTER) (there is no equivalent in the MariaDB world). MySQL also provides group replication (similar to Galera Cluster). Combined with the proxy MySQL Router, and the mysqlsh for administration (part of the X Protocol/X Dev API), you can also get MySQL InnoDB Cluster.

We benefit from the above at Percona, but also put lots of engineering work to make Percona XtraDB Cluster.

MariaDB Server only provides Galera Cluster.

Security

While we don’t want to compare the proprietary MySQL Enterprise Firewall, MariaDB’s recommendation is the proprietary, non-open source MariaDB MaxScale (it uses a Business Source License). We highly recommend the alternative, ProxySQL.

When it comes to encryption, MariaDB Server implements Google patches to provide complete data at rest encryption. This supports InnoDB, XtraDB and Aria temporary tables. The log files can also be encrypted (not present in MySQL, which only allows tablespace encryption and not log file encryption).

When it comes to attack prevention, ProxySQL should offer everything you need.

MySQL Enterprise provides auditing, while MariaDB Server provides an audit plugin as well as an extension to the audit interface for user filtering. Percona Server for MySQL has an audit plugin that sticks to the MySQL API, yet provides user filtering and controls the ability to audit (since auditing is expensive). Streaming to syslog is supported by the audit plugins from Percona and MariaDB.

Supporting Ecosystem and Tools

Upgrading from MySQL to MariaDB Server should be a relatively simple process (as stated above). If you want to upgrade away from MariaDB Server to MySQL, you may face hassles. For tools, see the following table:

Purpose MySQL Percona Server for MySQL MariaDB Server
Monitoring MySQL Enterprise Monitor Percona Monitoring & Management (PMM) (100% open source) Webyog Monyog
Backup MySQL Enterprise Backup Percona XtraBackup MariaDB Backup (fork of Percona XtraBackup)
SQL Management MySQL Workbench MySQL Workbench Webyog SQLyog
Load Balancing & Routing MySQL Router ProxySQL MariaDB MaxScale
Database Firewall MySQL Enterprise Firewall ProxySQL MariaDB MaxScale

 

Enterprise Database Compatibility

MariaDB Server today has window functions and common table expressions (CTEs). These appeared in MariaDB Server 10.2. MySQL 8 is presently in release candidate status and also has similar functionality.

Looking ahead, MariaDB Server 10.3 also includes an Oracle SQL_MODE and a partial PL/SQL parser. This is to aid migration from Oracle to MariaDB Server.

MariaDB Server 10.2 also has “flashback”, developed at Alibaba, to help with log-based rollback using the binary log.

Conclusion

Percona sees healthy competition in the MySQL ecosystem. We support all databases in the ecosystem: MySQL, MariaDB Server and Percona Server for MySQL. Our focus is to provide alternatives to proprietary parts of open source software. Percona has a strong operations focus on compatibility, application scalability, high availability security and observability. We also support many additional tools within the ecosystem, and love integrating and contributing to open source code.

For example, Percona Monitoring and Management (PMM) includes many open source tools like Prometheus, Consul, Grafana, Orchestrator and more. We have made the de facto open source hot backup solution for MySQL, MariaDB Server and Percona Server for MySQL (called Percona XtraBackup). We continue to maintain and extend useful tools for database engineers and administrators in Percona Toolkit. We make Percona XtraDB Cluster safe for deployment out of the box. We have invested in a write-optimized storage engine, TokuDB, and now continue to work with making MyRocks better.

We look forward to supporting your deployments of MySQL or MariaDB Server, whichever option is right for you! If you need assistance on migrations between servers, or further information, don’t hesitate to contact your friendly Percona sales associate.

Oct
19
2017
--

Percona Blog Poll: How Do You Currently Host Applications and Databases?

Host applications and databases

Host applications and databasesPercona latest blog poll asks how you currently host applications and databases. Select an option below, or leave a comment to clarify your deployment!

With the increased need for environments that respond more quickly to changing business demands, many enterprises are moving to the cloud and hosted deployments for applications and software in order to offload development and maintenance overhead to a third party. The database is no exception. Businesses are turning to using database as a service (DBaaS) to handle their data needs.

DBaaS provides some obvious benefits:

  • Offload physical infrastructure to another vendor. It is the responsibility of whoever is providing the DBaaS service to maintain the physical environment – including hardware, software and best practices.
  • Scalability. You can add or subtract capacity as needed by just contacting your vendor. Have a big event on the horizon? Order more servers!
  • Expense. Since you no longer have shell out for operational costs or infrastructure upgrades (all handled by the vendor now), you can reduce capital and operation expenses – or at least reasonably plan on what they are going to be.

There are some potential disadvantages to a DBaaS as well:

  • Network performance issues. If your database is located off-premises, then it can be subject to network issues (or outages) that are beyond your control. These can translate into performance problems that impact the customer experience.
  • Loss of visibility. It’s harder (though not impossible) to always know what is happening with your data. Decisions around provisioning, storage and architecture are now in the hands of a third party.
  • Security and compliance. You are no longer totally in control of how secure or compliant your data is when using a DBaaS. This can be crucial if your business requires certain standards to operate in your market (healthcare, for example).

How are you hosting your database? On-premises? In the cloud? Which cloud? Is it co-located? Please answer using the poll below. Choose up to three answers. If you don’t see your solutions, use the comments to explain.

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

Thanks in advance for your responses – they will help the open source community determine how databases are being hosted.

Oct
18
2017
--

How to Choose the MySQL innodb_log_file_size

innodb_log_file_size

In this blog post, I’ll provide some guidance on how to choose the MySQL innodb_log_file_size.

Like many database management systems, MySQL uses logs to achieve data durability (when using the default InnoDB storage engine). This ensures that when a transaction is committed, data is not lost in the event of crash or power loss.

MySQL’s InnoDB storage engine uses a fixed size (circular) Redo log space. The size is controlled by innodb_log_file_size and innodb_log_files_in_group (default 2). You multiply those values and get the Redo log space that available to use. While technically it shouldn’t matter whether you change either the innodb_log_file_size or innodb_log_files_in_group variable to control the Redo space size, most people just work with the innodb_log_file_size and leave innodb_log_files_in_group alone.

Configuring InnoDB’s Redo space size is one of the most important configuration options for write-intensive workloads. However, it comes with trade-offs. The more Redo space you have configured, the better InnoDB can optimize write IO. However, increasing the Redo space also means longer recovery times when the system loses power or crashes for other reasons.  

It is not easy or straightforward to predict how much time a system crash recovery takes for a specific innodb_log_file_size value – it depends on the hardware, MySQL version and workload. It can vary widely (10 times difference or more, depending on the circumstances). However, around five minutes per 1GB of innodb_log_file_size is a decent ballpark number. If this is really important for your environment, I would recommend testing it by a simulating system crash under full load (after the database has completely warmed up).   

While recovery time can be a guideline for the limit of the InnoDB Log File size, there are a couple of other ways you can look at this number – especially if you have Percona Monitoring and Management installed.

Check Percona Monitoring and Management’s “MySQL InnoDB Metrics” Dashboard. If you see a graph like this:

innodb_log_file_size

where Uncheckpointed Bytes is pushing very close to the Max Checkpoint Age, you can almost be sure your current innodb_log_file_size is limiting your system’s performance. Increasing it can provide substantial performance improvements.

If you see something like this instead:

innodb_log_file_size 2

where the number of Uncheckpointed Bytes is well below the Max Checkpoint Age, then increasing the log file size won’t give you a significant improvement.

Note: many MySQL settings are interconnected. While a specific log file size might be good enough for smaller innodb_buffer_pool_size, larger InnoDB Buffer Pool values might warrant larger log files for optimal performance.

Another thing to keep in mind: the recovery time we spoke about early really depends on the Uncheckpointed Bytes rather than total log file size. If you do not see recovery time increasing with a larger innodb_log_file_size, check out InnoDB Checkpoint Age graph – it might be you just can’t fully utilize large log files with your workload and configuration.

Another way to look at the log file size is in context of log space usage:

innodb_log_file_size 3

This graph shows the amount of Data Written to the InnoDB log files per hour, as well as the total size of the InnoDB log files. In the graph above, we have 2GB of log space and some 12GB written to the Log files per hour. This means we cycle through logs every ten minutes.

InnoDB has to flush every dirty page in the buffer pool at least once per log file cycle time.

InnoDB gets better performance when it does that less frequently, and there is less wear and tear on SSD devices. I like to see this number at no less than 15 minutes. One hour is even better.  

Summary

Getting the innodb_log_file_file size is important to achieve the balance between reasonably fast crash recovery time and good system performance. Remember, your recovery time objective it is not as trivial as you might imagine. I hope the techniques described in this post help you to find the optimal value for your situation!

Oct
18
2017
--

Webinar Thursday, October 19, 2017: What You Need to Get the Most Out of Indexes – Part 2

Indexes

IndexesJoin Percona’s Senior Architect, Matthew Boehm, as he presents What You Need to Get the Most Out of Indexes – Part 2 webinar on Thursday, October 19, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

Proper indexing is key to database performance. Finely tune your query writing and database performance with tips from the experts. MySQL offers a few different types of indexes and uses them in a variety of ways.

In this session you’ll learn:

  • How to use composite indexes
  • Other index usages besides lookup
  • How to find unoptimized queries
  • What is there beyond EXPLAIN?

Register for the webinar.

IndexesMatthew Boehm, Architect

Matthew joined Percona in the fall of 2012 as a MySQL consultant. His areas of knowledge include the traditional Linux/Apache/MySQL/PHP stack, memcached, MySQL Cluster, massive sharding topologies, PHP development and a bit of MySQL-C-API development. Previously, Matthew DBAed for the fifth largest MySQL installation at eBay/PayPal. He also hails from managed hosting environments. During his off-hours, Matthew is a nationally ranked, competitive West Coast Swing dancer and travels to competitions around the US. He enjoys working out, camping, biking and playing MMOs with his son.

Oct
17
2017
--

Webinar Wednesday, October 18, 2017: How to Scale with MongoDB

Scale with MongoDB

Scale with MongoDBJoin Percona’s Senior Technical Services Engineer Adamo Tonete as he presents How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

In this webinar, we will talk about how to scale with MongoDB, up to thousands of writes and reads per second. What are the common issues when you scale with MongoDB? Is it better to shard or to add further secondaries?

We will walk through many common scaling situations, and through the steps needed to deploy a sharded cluster: from a single instance to a sharded environment. We will also talk about common mistakes/pitfalls a company can make when scaling its database – and how to avoid such situations.

Register for the webinar.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL database administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24×7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three has moved to NoSQL technologies without giving up relational databases. He likes to play video games and study everything that is related to engines. Adamo lives with his wife in São Paulo, Brazil.

Oct
12
2017
--

A Mystery with open_files_limit

open_files_limit

open_files_limitIn this blog, we’ll look at a mystery around setting the

open_files_limit

 variable in MySQL and Percona Server for MySQL.

MySQL Server needs file descriptors to run. It uses them to open new connections, store tables in the cache, create temporary tables to resolve complicated queries and access persistent ones. If

mysqld

 is not able to open new files when needed, it can stop functioning correctly. A common symptom of this issue is error 24:

"Too many open files"

.

The number of file descriptors

mysqld

 can open simultaneously is defined by the configuration

open_files_limit

 option. You would expect it to work like any other MySQL Server option: set in the configuration file, restart

mysqld

and use more or fewer descriptors. All other configuration variables work this way. But

open_files_limit

also depends on the operating system (OS) limits. This makes setting the variable more complicated.

mysqld

As a user, when you start any application it cannot have limits set to be greater than the limits defined by the operating system for the user in question. Therefore, you would intuitively expect

mysqld

to set

open_files_limit

  to any value that is less than the OS limit. This is not the case, however. No matter what value you set for the

open_files_limit

 variable, the OS limit is used unless it is set to “infinity”.

sveta@Thinkie:~$ ulimit -n
32000
sveta@Thinkie:$ cat /etc/my.cnf
[mysqld]
open-files-limit=16000
...
sveta@Thinkie:$ ./bin/mysqld &
sveta@Thinkie:$ mysql -uroot
Welcome to the MySQL monitor. Commands end with ; or g.
Your MySQL connection id is 3
Server version: 5.7.19-17-debug-log Source distribution
Copyright (c) 2009-2017 Percona LLC and/or its affiliates
Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> select @@open_files_limit;
+--------------------+
| @@open_files_limit |
+--------------------+
|              32000 |
+--------------------+
1 row in set (0.00 sec)

The reason for this can be found in the code contained in the 

mysys/my_file.c

 file:

static uint set_max_open_files(uint max_file_limit)
{
  struct rlimit rlimit;
  uint old_cur;
  DBUG_ENTER("set_max_open_files");
  DBUG_PRINT("enter",("files: %u", max_file_limit));
  if (!getrlimit(RLIMIT_NOFILE,&rlimit))
  {
    old_cur= (uint) rlimit.rlim_cur;
    DBUG_PRINT("info", ("rlim_cur: %u  rlim_max: %u",
            (uint) rlimit.rlim_cur,
            (uint) rlimit.rlim_max));
    if (rlimit.rlim_cur == RLIM_INFINITY)
      rlimit.rlim_cur = max_file_limit;
    if (rlimit.rlim_cur >= max_file_limit)
      DBUG_RETURN(rlimit.rlim_cur);     /* purecov: inspected */
    rlimit.rlim_cur= rlimit.rlim_max= max_file_limit;
    if (setrlimit(RLIMIT_NOFILE, &rlimit))
      max_file_limit= old_cur;          /* Use original value */
    else
    { 
      rlimit.rlim_cur= 0;           /* Safety if next call fails */
      (void) getrlimit(RLIMIT_NOFILE,&rlimit);
      DBUG_PRINT("info", ("rlim_cur: %u", (uint) rlimit.rlim_cur));
      if (rlimit.rlim_cur)          /* If call didn't fail */
    max_file_limit= (uint) rlimit.rlim_cur;
    } 
  }
  DBUG_PRINT("exit",("max_file_limit: %u", max_file_limit));
  DBUG_RETURN(max_file_limit);
}

Particularly these lines:

if (rlimit.rlim_cur >= max_file_limit)
  DBUG_RETURN(rlimit.rlim_cur);    /* purecov: inspected */

This code tells

mysqld

to take the maximum value of what is specified in either the variable

open_files_limit

, or the soft system user limit.

I reported this behavior as documentation bug #87681.

mysqld_safe

mysqld_safe

has its own

open_files_limit

 option. This option allows you to overwrite the system soft limit any way you want. However, on:

  • Red Hat Enterprise Linux 7
  • Oracle Linux 7
  • CentOS 7
  • SUSE Linux Enterprise Server 12
  • Fedora 25 and 26
  • Debian 8 or higher
  • Ubuntu 16.04 LTS or higher

This option as specified under the 

[mysqld_safe]

header in the configuration file is not used when you start

mysqld

as a service. To explain the reason for this behavior, we need to step back into history.

init.d

For a long time, many Linux Operating Systems used init.d to start certain commands together with the OS. The Init daemon executes scripts (usually located in the directory

/etc/init.d

) at system startup, depending on the runlevel.

The different implementations of

init.d

vary, but they have known drawbacks. For example,

init.d

starts everything sequentially. This means a new process has to wait if another has already started. This makes the startup process on multi-core machine slow. Another drawback related to our topic is that daemons started by

init.d

 inherit OS limits from the root user. If a program needs to be run by another user, the switch needs to happen in the startup script itself. But the order of option files that such users read can be different, depending if they are logged in via the

ssh

su

or

sudo

commands.

MySQL Server

MySQL Server’s startup sequence for the service is as follow:

  1. <Perform another job>
  2. Start
    mysqld_safe

    as

    mysql

    user:

    su - mysql -s /bin/bash -c "mysqld_safe > /dev/null &"

This behavior has existed at least since version 5.5.

Percona Server for MySQL

Before version 5.7, Percona Server for MySQL had a different startup sequence:

  1. <Perform another job>
  2. Start
    mysqld_safe

    as root and pass option

    --user=mysql

    to it:

    "${PERCONA_PREFIX}"/bin/mysqld_safe > /dev/null 2>&1 &

With this sequence, you only need to set a hard limit for a mysql user in the file 

/etc/security/limits.conf

, and

mysqld_safe

 will do the rest.

In version 5.7, Percona Server for MySQL backported the startup sequence from MySQL Server. Since then, setting a hard limit on the number of open files for mysql users in 

/etc/security/limits.conf

 is not enough. You also need to have a row

session required pam_limits.so

in the file

/etc/pam.d/common-session

. This is needed because the startup sequence for

mysql

users changed due to the design of  

init.d

.

SystemD

Linux developers performed several trials to find a better startup solution than

init.d

. Speaking for MySQL and Percona Server for MySQL startup, the most important innovation is SystemD. SystemD is becoming more and more popular. Therefore MySQL and Percona Server for MySQL do not use

init.d

 on Red Hat Enterprise Linux 7, Oracle Linux 7, CentOS 7, SUSE Linux Enterprise Server 12, Fedora 25 and 26, Debian 8 or higher and Ubuntu 16.04 LTS or higher. Instead, they use SystemD.

What does this mean for MySQL users?

Scripts started by SystemD start as required by the system user from the start. Therefore they do not inherit limits from the root user and use their own limits specified in 

/etc/security/limits.conf

. If you need to have your

mysqld

process limits differ from the defaults for user

mysql

, you need to set the option

LimitNOFILE

under the 

[Service]

section in the service configuration file. Again, you cannot then lower this limit using

open_files_limit

option, unless you set it to

Infinity

.

Both packages

To make things more complex, Percona Server for MySQL packages for Ubuntu contain both the 

mysql.server

script (used by

init.d

) and the service description for SystemD. In fact, SystemD is used after install — but you might be confused when looking at only the package files.

Conclusion

You should set the 

open_files_limit

variable together with the operating system limits. You should study how

init.d

 or SystemD works if you see values that you don’t expect.

How to change

open_files_limit

variable?

Operating System Startup daemon Where to put configuration
Red Hat Enterprise Linux 7, Oracle Linux 7, CentOS 7
SUSE Linux Enterprise Server 12
Fedora 25 and 26
Debian 8+
Ubuntu 16.04 LTS+
SystemD
/etc/security/limits.conf

and

/etc/pam.d/common-session

Service configuration:

sudo systemctl edit mysql

[mysqld]

section of the configuration file

Others init.d
/etc/security/limits.conf

and

/etc/pam.d/common-session

[mysqld_safe]

section of the configuration file

[mysqld]

section of the configuration file

 

Which values of

open_files_limit

variable make sense?

Soft User Limit
open_files_limit

range

Infinity Any
Positive Greater/equal than soft user limit and smaller than hard user limit

 

Oct
11
2017
--

Webinar Thursday, October 12, 2017: MongoDB Readiness from an SRE and Ops Viewpoint

MongoDB Readiness

MongoDB ReadinessJoin Percona’s MongoDB Practice Manager David Murphy on Thursday, October 12, 2017, at 10:00 am PDT / 1:00 pm EDT (UTC-7) as he discusses MongoDB Readiness from an SRE and Ops Viewpoint.

Operations teams (SRE, PE, DevOps, etc.) are being asked to take a more active role in database provisioning and scaling. Much of the MongoDB material available online is from one, two, three or even five years ago (or more). Finding useful content online that is helpful in breaking through the current state of MongoDB maturity and stability can be challenging with all this outdated material exists – especially when MongoDB is massively different than it was even in the 2.X series.

This webinar will cut through the noise and provide the 2017 state of MongoDB. You can expect to leave knowing more about how it behaves, when to use it and how it handles things like high availability and backup and recovery.

We will also review both the good and bad history of MongoDB, and talk about why you need to know how something works today (not how it worked in 2010) in this fast-paced environment. You will leave knowing MongoDB’s current maturity, a high-level view of how it works today and what your risk/benefit charts should look like when considering using it.

Key ops areas covered:

  • MongoDB architecture
  • High availability
  • Ansible and MongoDB
  • Cloud provisioning
  • Effective monitoring solutions
  • How to make sure you have consistency
  • Top five ops challenges and their solutions
  • How to think about multiple regions with MongoDB

Register for the webinar here.

MongoDB BackupsDavid Murphy, MongoDB Practice Manager

David is the Practice Manager for MongoDB @ Percona. He joined Percona in Oct 2015, before that he has been deep in both the MySQL and MongoDB database communities for some time. Other passions include DevOps, tool building and security.

Oct
09
2017
--

MySQL and MariaDB Default Configuration Differences

MySQL and MariaDB Default Configuration

MySQL and MariaDB Default ConfigurationIn this blog post, I’ll discuss some of the MySQL and MariaDB default configuration differences, focusing on MySQL 5.7 and MariaDB 10.2.

MariaDB Server is a general purpose open source database, created by the founders of MySQL. MariaDB Server (referred to as MariaDB for brevity) has similar roots as Percona Server for MySQL, but is quickly diverging from MySQL compatibility and growing on its own. MariaDB has become the default installation for several operating systems (such as Red Hat Enterprise Linux/CentOS/Fedora). Changes in the default variables can make a large difference in the out-of-box performance of the database, so knowing what is different is important.

As MariaDB grows on its own and doesn’t remain 100% compatible with MySQL, the defaults configuration settings might not mean everything or behave the way they used to. It might use different variable names, or implement the same variables in new ways. You also need to take into account that MariaDB uses it’s own Aria storage engine that has many configuration options that do not exist in MySQL.

Note: In this blog, I am looking at variables common to both MySQL or MariaDB, but have different defaults, not variables that are specific to either MySQL or MariaDB (except for the different switches inside the optimizer_switch).

Binary Logs

Variable MariaDB Default MySQL Default
sync_binlog 0 1
binlog_format Mixed Row

 

MySQL has taken a more conservative stance when it comes to the binary log. In the newest versions of MySQL 5.7, they have updated two variables to help ensure all committed data remains intact and identical. Binlog_format was updated to row in MySQL in order to prevent non-deterministic statements from having different results on the slave. Row-based replication also helps when performing a lot of smaller updates. MariaDB defaults to the Mixed format. Mixed uses statement-based format unless certain criteria are met. It hat case, it uses the row format. You can see the detailed criteria for when the row format is used here: https://mariadb.com/kb/en/the-mariadb-library/binary-log-formats/.

The other difference that can cause a significant impact on performance is related to sync_binlog. Sync_binlog controls the number of commit groups to collect before synchronizing the binary log to disk. MySQL has changed this to 1, which means that every transaction is flushed to disk before it is committed. This guarantees that there can never be a committed transaction that is not recorded (even during a system failure). This can create a big impact to performance, as shown by a Roel Van de Paar in his blog: https://www.percona.com/blog/2016/06/03/binary-logs-make-mysql-5-7-slower-than-5-6/

MariaDB utilizes a value of 0 for sync_binlog, which allows the operating system to determine when the binlog needs to be flushed. This provides better performance, but adds the risk that if MariaDB crashes (or power is lost) that some data may be lost.

MyISAM

Variable MariaDB Default MySQL Default
myisam_recover_options BACKUP,QUICK OFF
key_buffer_size 134217728 8388608

 

InnoDB replaced MyISAM as the default storage engine for some time now, but it is still used for many system tables. MySQL has tuned down the MyISAM settings, since it is not heavily used.

When mysqld opens a table, it checks whether the table is marked as crashed, or was not closed properly, and runs a check on it based on the myisam_recover_options settings. MySQL disables this by default, preventing recovery. MariaDB has enabled the BACKUP and QUICK recovery options. BACKUP causes a table_name-datetime.bak file to be created whenever a data file is changed during recovery. QUICK causes mysqld to not check the rows in a table if there are no delete blocks, ensuring recovery can occur faster.

MariaDB 10.2 increased the key_buffer_size. This allows for more index blocks to be stored in memory. All threads use this buffer, so a small buffer can cause information to get moved in and out of it more quickly. MariaDB 10.2 uses a buffer 16 times the size of MySQL 5.7: 134217728 in MariaDB 10.2 vsx 8388608 in MySQL 5.7.

Innodb

Variable MariaDB Default MySQL Default
innodb_max_undo_log_size 10485760(10 MiB) 1073741824(1024 MiB)

 

InnoDB variables have remained primarily unchanged between MariaDB 10.2 and MySQL 5.7. MariaDB has reduced the innodb_max_undo_log_size starting in 10.2.6. This was reduced from MySQL’s default of 1073741824(1024 MiB) to 10485760(10 MiB). These sizes reflect the maximum size an undo tablespace can become before it is marked for truncation. The tablespace doesn’t get truncated unless innodb_undo_log_truncate is enabled, and it is disabled in MySQL 5.7 and MariaDB 10.2 by default.

Logging

Variable MariaDB Default MySQL Default
log_error /var/log/mysqld.log
log_slow_admin_statements ON OFF
log_slow_slave_statements ON OFF
lc_messages_dir /usr/share/mysql

 

Logs are extremely important for troubleshooting any issues so the different choices in logging for MySQL 5.7 and MariaDB 10.2 are very interesting.

The log_error variable allows you to control where errors get logged. MariaDB 10.2 leaves this variable blank, writing all errors to stderr. MySQL 5.7 uses an explicitly created file at: /var/log/mysqld.log.

MariaDB 10.2 has also enabled additional slow statement logging. Log_slow_admin_statements create a record for any administrative statements that are not typically written to the binlog. Log_slow_slave_statements log the replicated statements sent from the master, if they are slow to complete. MySQL 5.7 does not enable logging of these statements by default.

Lc_messages_dir is the directory that contains the error message files for various languages. The variable defaults might be a little misleading in MariaDB 10.2. Lc_messages_dir is left empty by default, although it still uses the same path as MySQL 5.7. The files are located in /usr/share/mysql by default for both databases.

Performance Schema

Variable MariaDB Default MySQL Default
performance_schema OFF ON
performance_schema_setup_actors_size 100 -1 (auto adjusted)
performance_schema_setup_objects_size 100 -1 (auto adjusted)

 

The performance schema is an instrumentation tool that is designed to help troubleshoot various performance concerns. MySQL 5.7 enables the performance schema, and many of its instruments, by default. MySQL even goes so far as to detect the appropriate value for many Performance Schema variables instead of setting a static default. The Performance Schema does come with some overhead, and there are many blogs regarding how much this can impact performance. I think Sveta Smirnova said it best in her blog  Performance Schema Benchmarks OLTP RW: “…test on your system! No generic benchmark can exactly repeat a workload on your site.

MariaDB has disabled the Performance Schema by default, as well as adjusted a couple of the dynamic variables. Note that if you wish to disable or enable the Performance Schema, it requires a restart of the server since these variables are not dynamic. Performance_schema_setup_actors_size and performance_schema_setup_objects_size have both been set to a static 100, instead of the dynamic -1 used in MySQL 5.7. These both limit the number of rows that can be stored in relative tables. This creates a hard limit to the size these tables can grow to, helping to reduce their data footprint.

SSL/TLS

Variable MariaDB Default MySQL Default
ssl_ca ca.pem
ssl_cert server-cert.pem
ssl_key server-key.pem

 

Secure Sockets Layer (SSL) and Transport Layer Security (TLS) are cryptographic protocols that allow for secure communication. SSL is actually the predecessor of TLS, although both are often referred to as SSL. MySQL 5.7 and MariaDB 10.2 support both yaSSL and OpenSSL. The default configurations for SSL/TLS differ only slightly between MySQL 5.7 and MariaDB 10.2. MySQL 5.7 sets a specific file name for ssl_ca, ssl_cert, and ssl_key. These files are created in the base directory, identified by the variable basedir. Each of these variables is left blank in MariaDB 10.2, so you need to set them before using secure connections. These variables are not dynamic, so be sure to set the values before starting your database.

Query Optimizer

MariaDB 10.2 MySQL 5.7 Optimization Meaning Switch
N/A OFF Batched Key Access Controls use of BKA join algorithm batched_key_access
N/A ON Block Nested-Loop Controls use of BNL join algorithm block_nested_loop
N/A ON Condition Filtering Controls use of condition filtering condition_fanout_filter
Deprecated ON Engine Condition Pushdown Controls engine condition pushdown engine_condition_pushdown
ON N/A Engine Condition Pushdown Controls ability to push conditions down into non-mergeable views and derived tables condition_pushdown_for_derived
ON N/A Exists Subquery Allows conversion of in statements to exists statements exists_to_in
ON N/A Exists Subquery Allows conversion of exists statements to in statements in_to_exists
N/A ON Index Extensions Controls use of index extensions use_index_extensions
OFF N/A Index Merge Allows index_merge for non-equality conditions index_merge_sort_intersection
ON N/A Join Algorithms Perform index lookups for a batch of records from the join buffer join_cache_bka
ON N/A Join Algorithms Controls use of BNLH and BKAH algorithms join_cache_hashed
ON N/A Join Algorithms Controls use of incremental algorithms join_cache_incremental
ON N/A Join Algorithms Controls use of block-based algorithms for outer joins outer_join_with_cache
ON N/A Join Algorithms Controls block-based algorithms for use with semi-join operations semijoin_with_cache
OFF N/A Join Buffer Creates the join buffer with an estimated size based on the estimated number of rows in the result optimize_join_buffer_size
ON N/A Materialized Temporary Tables Allows index creation on derived temporary tables derived_keys
ON N/A Materialized Temporary Tables Controls use of the rowid-merge strategy partial_match_rowid_merge
ON N/A Materialized Temporary Tables Controls use of the partial_match_table-scan strategy partial_match_table_scan
OFF ON Multi-Range Read Controls use of the multi-range read strategy mrr
OFF ON Multi-Range Read Controls use of cost-based MRR, if mrr=on mrr_cost_based
OFF N/A Multi-Range Read Enables key ordered scans if mrr=on mrr_sort_keys
ON N/A Order By Considers multiple equalities when ordering results ordery_uses_equalities
ON N/A Query Plan Allows the optimizer to use hidden components of InnoDB keys extended_keys
ON N/A Query Plan Controls the removal of irrelevant tables from the execution plan table_elimination
ON N/A Subquery Stores subquery results and correlation parameters for reuse subquery_cache
N/A ON Subquery Materialization Controls us of cost-based materialization ubquery_materialization_cost_based
N/A ON Subquery Materialization &

Semi-join

Controls the semi-join duplicate weedout strategy duplicateweedout

 

The query optimizer has several variances that not only affect query performance but also how you write SQL statements. The query optimizer is substantially different between MariaDB and MySQL, so even with identical configurations you are likely to see varying performance.

The sql_mode puts restrictions on how you can write queries. MySQL 5.7 has several additional restrictions compared to MariaDB 10.2. Only_full_group_by requires that all fields in any select…group by statement are either aggregated or inside the group by clause. The optimizer doesn’t assume anything regarding the grouping, so you must specify it explicitly.

No_zero_date, and no_zero_in_date both affect how the server interprets 0’s in dates. When no_zero_date is enabled, values of ‘0000-00-00’ are permitted but produce a warning. With strict mode enabled, then the value is not permitted and produces an error. No_zero_in_date is similar, except it applies to any section of the date(month, day, or year). With this disabled, dates with 0 parts, such as ‘2017-00-16’ are allowed as is. When enabled, the date is changed to ‘0000-00-00’ without warning. Strict mode prevents the date being inserted, unless ignore is provided as well. “INSERT IGNORE” and “UPDATE IGNORE” inserts the dates as ‘0000-00-00’. 5.7.4 changed this. No_zero_in_date was consolidated with strict mode, and the explicit option is deprecated.

The query_prealloc_size determines the size of the persistent buffer used for statement parsing and execution. If you regularly use complex queries, it can be useful to increase the size of this buffer, as it does not need to allocate additional memory during the query parsing. MySQL 5.7 has set this buffer to 8192, with a block size of 1024. MariaDB increased this value in 10.1.2 up to 24576.

Query_alloc_block_size dictates the size in bytes of any extra blocks allocated during query parsing. If memory fragmentation is a common problem, you might want to look at increasing this value. MySQL 5.7 uses 8192, while MariaDB 10.2 uses 16384 (twice that). Be careful when adjusting the block sizes: going too high consumes more than the needed amount of memory, and too low causes significant fragmentation.

The optimizer_switch variable contains many different switches that impact how the query optimizer plans and performs different queries. MariaDB 10.2 and MySQL 5.7 have many differences in their enabled options, and even the available options. You can see a brief breakdown of each of the options below. Any options with N/A is not supported in that server.

Miscellaneous

Variable MariaDB Default MySQL Default
default_tmp_storage_engine NULL InnoDB
group_concat_max_len 1048576(1M) 1024(1K)
Lock_wait_timeout 86400 (1 DAY) 31536000 (1 YEAR)
Max_allowed_packet (16777216) 16MB 4194304 (4MB)
Max_write_lock_count 4294967295 18446744073709551615
Old_passwords OFF 0
Open_files_limit 0 dependent on OS
pid_file /var/lib/mysql/ /var/run/mysqld/
secure_file_priv Varies by installation
sort_buffer_size 2097152 262144
table_definition_cache 400 autosized
table_open_cache_instances 8 16
thread_cache_size autosized autosized
thread_stack 292KB 192KB/256KB

 

There are many variables that do not fit well into a group. I will go over those here.

When creating temporary tables, if you do not specify a storage engine then a default is used. In MySQL 5.7 this is set to InnoDB, the same as the default_storage_engine. MariaDB 10.2 also uses InnoDB, but it is not explicitly set. MariaDB sets the default_tmp_storage_engine to NULL, which causes it to use the default_storage_engine. This is important to remember if you change your default storage engine, as it would also change the default for temporary tables. An Important note, in MariaDB this is only relevant to tables created with “CREATE TEMPORARY TABLE”. Internal in-memory temporary tables use the memory storage engine, and internal, on-disk temporary tables use the aria engine by default.

The Group_concat function can cause some very large results if left unchecked. You can restrict the maximum size of results from this function with group_concat_max_len. MySQL 5.7 limits this to 1024(1K). MariaDB increased the value in 10.2.4 up to 1048576(1M).

Lock_wait_timeout controls how long a thread waits as it attempts to acquire a metadata lock. Several statements require a metadata lock, including DDL and DML operations, Lock Tables, Flush Tables with Read Lock and Handler statements. MySQL 5.7 defaults to the maximum possible value (one year), while MariaDB 10.2 has toned this down to one day.

Max_allowed_packet sets a limit to the maximum size of a packet, or a generated/intermediate string. This value is intentionally kept small (4MB) on MySQL 5.7 in order to detect the larger, intentionally incorrect packets. MariaDB has increased this value to 16MB. If using any large BLOB fields, you need to adjust this value to the size of the largest BLOB, in multiples of 1024, or you risk running into errors transferring the results.

Max_write_lock_count controls the number of write locks that can be given before some read lock requests being processed. In extremely heavy write loads your reads can pile up while waiting for the writes to complete. Modifying the max_write_lock_count allows you to tune how many writes can occur before some reads are allowed against the table. MySQL 5.7 keeps this value at the maximum (18446744073709551615), while MariaDB 10.2 lowered this to 4294967295. One thing to note is that this is still the maximum value on MariaDB 10.2.

Old_passwords controls the hashing method used by the password function, create user and grant statements. This variable has undergone several changes in MySQL 5.7. As of 5.7.4 the valid options were MySQL 4.1 native hashing, Pre-4.1 (old) hashing, and SHA-256 hashing. Version 5.7.5 removed the “old” Pre-4.1 method, and in 5.7.6 the variable has been deprecated with the intent of removing it entirely. MariaDB 10.2 uses a simple boolean value for this variable instead of the enumerated one in MySQL 5.7, though the intent is the same. Both default the old_passwords to OFF, or 0, and allow you to enable the older method if necessary.

Open_files_limit restricts the number of file descriptors mysqld can reserve. If set to 0 (the default in MariaDB 10.2) then mysqld reserves max_connections * 5 or max_connections + table_open_cache * 2, whichever is larger. It should be noted that mysqld cannot use an amount larger than the hard limit imposed by the operating system. MySQL 5.7 is also restricted by the operating systems hard limit, but is set at runtime to the real value permitted by the system (not a calculated value).

The pid_file allows you to control where you store the process id file. This isn’t a file you typically need, but it is good to know where it is located in case some unusual errors occur. On MariaDB you can find this inside /var/lib/mysql/, while on MySQL 5.7 you will find it inside /var/run/mysqld/. You will also notice a difference in the actual name of the file. MariaDB 10.2 uses the hostname as the name of the pid, while MySQL 5.7 simply uses the process name (mysqld.pid).

Secure_file_priv is a security feature that allows you to restrict the location of files used in data import and export operations. When this variable is empty, which was the default in MySQL before 5.7.6, there is no restriction. If the value is set to NULL, import and export operations are not permitted. The only other valid value is the directory path where files can be imported from or exported to. MariaDB 10.2 defaults to empty. As of MySQL 5.7.6, the default will depend on the install_layout CMAKE option.

INSTALL_LAYOUT DEFAULT VALUE
STANDALONE,WIN NULL(>=MySQL 5.7.16_,empty(<MySQL 5.7.16)
DEB,RPM,SLES,SVR4 /var/lib/mysql-files
Other Mysql-files under the CMAKE_INSTALL_PREFIX value

 

Mysqld uses a sort buffer regardless of storage engine. Every session that must perform a sort allocates a buffer equal to the value of sort_buffer_size. This buffer should at minimum be large enough to contain 15 tuples. In MySQL 5.7, this defaults to 262144, while MariaDB 10.2 uses the larger value 2097152.

The table_definition_cache restricts the number of table definitions that can be cached. If you have a large number of tables, mysqld may have to read the .frm file to get this information. MySQL 5.7 auto detects the appropriate size to use, while MariaDB 10.2 defaults this value to 400. On my small test VM, MySQL 5.7 chose a value of 1400.

The table_open_cache_instances vary in implementation between MySQL and MariaDB. MySQL 5.7 creates multiple instances of the table_open_cache, each holding a portion of the tables. This helps reduce contention, as a session needs to lock only one instance of the cache for DML statements. In MySQL 5.7.7 the default was a single instance, but this was changed in MySQL 5.7.8 (increased to 16). MariaDB has a more dynamic approach to the table_open_cache. Initially there is only a single instance of the cache, and the table_open_cache_instances variable is the maximum number of instances that can be created. If contention is detected on the single cache, another instance is created and an error logged. MariaDB 10.2 suspects that the maximum eight instances it sets by default should support up to 100 CPU cores.

The thread_cache_size controls when a new thread is created. When a client disconnects the thread is stored in the cache, as long as the maximum number of threads do not exist. Although this is not typically noticeable, if your server sees hundreds of connections per second you should increase this value to so that new connections can use the cache. Thread_cache_size is an automatically detected variable in both MySQL 5.7 and MariaDB 10.2, but their methods to calculate the default vary significantly. MySQL uses a formula, with a maximum of 100: 8+ (max_connections / 100). MariaDB 10.2 uses the smaller value out of 256 or the max_connections size.

The thread_stack is the stack size for each thread. If the stack size is too small, it limits the complexity of SQL statements, the recursion depth of stored procedures and other memory-consuming actions. MySQL 5.7 defaults the stack size to 192KB on 32-bit platforms and 256KB on 64-bit systems. MariaDB 10.2 adjusted this value several times. MariaDB 10.2.0 used 290KB, 10.2.1 used 291KB and 10.2.5 used 292KB.

Conclusion

Hopefully, this helps you with the configurations options between MySQL and MariaDB. Use the comments for any questions.

Oct
01
2017
--

One Million Tables in MySQL 8.0

MySQL 8.0

In my previous blog post, I talked about new general tablespaces in MySQL 8.0. Recently MySQL 8.0.3-rc was released, which includes a new data dictionary. My goal is to create one million tables in MySQL and test the performance.

Background questions

Q: Why million tables in MySQL? Is it even realistic? How does this happen?

Usually, millions of tables in MySQL is a result of “a schema per customer” Software as a Service (SaaS) approach. For the purposes of customer data isolation (security) and logical data partitioning (performance), each “customer” has a dedicated schema. You can think of a WordPress hosting service (or any CMS based hosting) where each customer has their own dedicated schema. With 10K customers per MySQL server, we could end up with millions of tables.

Q: Should you design an application with >1 million tables?

Having separate tables is one of the easiest designs for a multi-tenant or SaaS application, and makes it easy to shard and re-distribute your workload between servers. In fact, the table-per-customer or schema-per-customer design has the quickest time-to-market, which is why we see it a lot in consulting. In this post, we are not aiming to cover the merits of should you do this (if your application has high churn or millions of free users, for example, it might not be a good idea). Instead, we will focus on if the new data dictionary provides relief to a historical pain point.

Q: Why is one million tables a problem?

The main issue results from the fact that MySQL needs to open (and eventually close) the table structure file (FRM file). With one million tables, we are talking about at least one million files. Originally MySQL fixed it with table_open_cache and table_definition_cache. However, the maximum value for table_open_cache is 524288. In addition, it is split into 16 partitions by default (to reduce the contention). So it is not ideal. MySQL 8.0 has removed FRM files for InnoDB, and will now allow you to create general tablespaces. I’ve demonstrated how we can create tablespace per customer in MySQL 8.0, which is ideal for “schema-per-customer” approach (we can move/migrate one customer data to a new server by importing/exporting the tablespace).

One million tables in MySQL 5.7

Recently, I’ve created the test with one million tables. The test creates 10K databases, and each database contains 100 tables. To use a standard benchmark I’ve employed sysbench table structure.

mysql> select count(*) from information_schema.schemata where schema_name like 'test_sbtest%';
+----------+
| count(*) |
+----------+
| 10000    |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from information_schema.tables where table_schema like 'test_sbtest%';
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (4.61 sec)

This also creates a huge overhead: with one million tables we have ~two million files. Each .frm file and .ibd file size sums up to 175G:

# du -sh /ssd/mysql_57
175G    /ssd/mysql_57

Now I’ve used sysbench Lua script to insert one row randomly into one table

pathtest = "/usr/share/sysbench/tests/include/oltp_legacy/"
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end
function event()
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   local oltp_tables_count = 100
   local oltp_db_count = 10000
   table_name = "test_sbtest_" .. sb_rand_uniform(1, oltp_db_count) .. ".sbtest".. sb_rand_uniform(1, oltp_tables_count)
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
   rs = db_query("INSERT INTO " .. table_name ..
                       " (id, k, c, pad) VALUES " ..
                       string.format("(%d, %d, '%s', '%s')", i, k_val, c_val,
                                     pad_val))
   end
end

With:

local oltp_tables_count = 100
   local oltp_db_count = 10000

Sysbench will choose one table randomly out of one million. With oltp_tables_count = 1 and oltp_db_count = 100, it will only choose the first table (sbtest1) out of the first 100 databases (randomly).

As expected, MySQL 5.7 has a huge performance degradation when going across one million tables. When running a script that only inserts data into 100 random tables, we can see ~150K transactions per second. When the data is inserted in one million tables (chosen randomly) performance drops to 2K (!) transactions per second:

Insert into 100 random tables:

SQL statistics:
    queries performed:
        read:                            0
        write:                           16879188
        other:                           0
        total:                           16879188
    transactions:                        16879188 (140611.72 per sec.)
    queries:                             16879188 (140611.72 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Insert into one million random tables:

SQL statistics:
    queries performed:
        read:                            0
        write:                           243533
        other:                           0
        total:                           243533
    transactions:                        243533 (2029.21 per sec.)
    queries:                             243533 (2029.21 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

This is expected. Here I’m testing the worse case scenario, where we can’t keep all table open handlers and table definitions in cache (memory) since the table_open_cache and table_definition_cache both have a limit of 524288.

Also, normally we can expect a huge skew between access to the tables. There can be only 20% active customers (80-20 rule), meaning that we can only expect an active access to 2K databases. In addition, there will be old or unused tables so we can expect around 100K or less of active tables.

Hardware and config files

The above results are from this server:

Processors   | 64xGenuine Intel(R) CPU @ 2.00GHz
Memory Total | 251.8G
Disk         | Samsung 950 Pro PCIE SSD (nvme)

Sysbench script:

sysbench $conn --report-interval=1 --num-threads=32 --max-requests=0 --max-time=600 --test=/root/drupal_demo/insert_custom.lua run

My.cnf:

innodb_buffer_pool_size = 100G
innodb_io_capacity=20000
innodb_flush_log_at_trx_commit = 0
innodb_log_file_size = 2G
innodb_flush_method=O_DIRECT_NO_FSYNC
skip-log-bin
open_files_limit=1000000
table_open_cache=524288
table_definition_cache=524288

One million tables in MySQL 8.0 + general tablespaces

In MySQL 8.0 is it easy and logical to create one general tablespace per each schema (it will host all tables in this schema). In MySQL 5.7, general tablespaces are available – but there are still .frm files.

I’ve used the following script to create 100 tables in one schema all in one tablespace:

mysql test -e "CREATE TABLESPACE t ADD DATAFILE 't.ibd' engine=InnoDB;"
for i in {1..10000}
do
           mysql test -e "create table ab$i(i int) tablespace t"
done

The new MySQL 8.0.3-rc also uses the new data dictionary, so all MyISAM tables in the mysql schema are removed and all metadata is stored in additional mysql.ibd file.

Creating one million tables

Creating InnoDB tables fast enough can be a task by itself. Stewart Smith published a blog post a while ago where he focused on optimizing time to create 30K tables in MySQL.

The problem is that after creating an .ibd file, MySQL needs to “fsync” it. However, when creating a table inside the tablespace, there is no fsync. I’ve created a simple script to create tables in parallel, one thread per database:

#/bin/bash
function do_db {
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..100}
do
        for i in {1..100}
        do
                let c=$c+1
                echo $c
                db="test_sbtest_$c"
                do_db &
        done
        wait
done

That script works perfectly in MySQL 8.0.1-dmr and creates one million tables in 25 minutes and 28 seconds (1528 seconds). That is ~654 tables per second. That is significantly faster than ~30 tables per second in the original Stewart’s test and 2x faster than a test where all fsyncs were artificially disabled using libeat-my-data library.

Unfortunately, in MySQL 8.0.3-rc some regression was introduced. In MySQL 8.0.3-rc I can see heavy mutex contention, and the table creation speed dropped from 25 minutes to ~280 minutes. I’ve filed a bug report: performance regression: “create table” speed and scalability in 8.0.3.

Size on disk

With general tablespaces and no .frm files, the size on disk decreased:

# du -h -d1 /ssd/
147G    /ssd/mysql_801
119G    /ssd/mysql_803
175G    /ssd/mysql_57

Please note though that in MySQL 8.0.3-rc, with new native data dictionary, the size on disk increased as it needs to write additional information (Serialized Dictionary Information, SDI) to the tablespace files:

InnoDB: Serialized Dictionary Information (SDI) is now present in all InnoDB tablespace files
except for temporary tablespace and undo tablespace files.
SDI is serialized metadata for schema, table, and tablespace objects.
The presence of SDI data provides metadata redundancy.
...
The inclusion of SDI data in tablespace files increases tablespace file size.
An SDI record requires a single index page, which is 16k in size by default.
However, SDI data is compressed when it is stored to reduce the storage footprint.

The general mysql data dictionary in MySQL 8.0.3 is 6.6Gb:

6.6G /ssd/mysql/mysql.ibd

Benchmarking the insert speed in MySQL 8.0 

I’ve repeated the same test I’ve done for MySQL 5.7 in MySQL 8.0.3-rc (and in 8.0.1-dmr), but using general tablespace. I created 10K databases (=10K tablespace files), each database has100 tables and each database resides in its own tablespace.

There are two new tablespace level caches we can use in MySQL 8.0: tablespace_definition_cache and schema_definition_cache:

tablespace_definition_cache = 15000
schema_definition_cache = 524288

Unfortunately, with one million random table accesses in MySQL 8.0 (both 8.0.1 and 8.0.3), we can still see that it stalls on opening tables (even with no .frm files and general tablespaces):

mysql> select conn_id, current_statement, state, statement_latency, lock_latency from sys.processlist where current_statement is not null and conn_id <> CONNECTION_ID();
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
| conn_id | current_statement                                                 | state          | statement_latency | lock_latency |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
|     199 | INSERT INTO test_sbtest_9749.s ... 8079-53209333270-93105555128') | Opening tables | 4.45 ms           | 0 ps         |
|     198 | INSERT INTO test_sbtest_1863.s ... 9574-29782886623-39251573705') | Opening tables | 9.95 ms           | 5.67 ms      |
|     189 | INSERT INTO test_sbtest_3948.s ... 9365-63502117132-66650064067') | Opening tables | 16.29 ms          | 15.38 ms     |
|     190 | INSERT INTO test_sbtest_6885.s ... 8436-41291265610-60894472357') | Opening tables | 13.78 ms          | 9.52 ms      |
|     191 | INSERT INTO test_sbtest_247.sb ... 7467-89459234028-92064334941') | Opening tables | 8.36 ms           | 3.18 ms      |
|     192 | INSERT INTO test_sbtest_9689.s ... 8058-74586985382-00185651578') | Opening tables | 6.89 ms           | 0 ps         |
|     193 | INSERT INTO test_sbtest_8777.s ... 1900-02582963670-01868315060') | Opening tables | 7.09 ms           | 5.70 ms      |
|     194 | INSERT INTO test_sbtest_9972.s ... 9057-89011320723-95018545652') | Opening tables | 9.44 ms           | 9.35 ms      |
|     195 | INSERT INTO test_sbtest_6977.s ... 7902-29158428721-66447528241') | Opening tables | 7.82 ms           | 789.00 us    |
|     196 | INSERT INTO test_sbtest_129.sb ... 2091-86346366083-87657045906') | Opening tables | 13.01 ms          | 7.30 ms      |
|     197 | INSERT INTO test_sbtest_1418.s ... 6581-90894769279-68213053531') | Opening tables | 16.35 ms          | 10.07 ms     |
|     208 | INSERT INTO test_sbtest_4757.s ... 4592-86183240946-83973365617') | Opening tables | 8.66 ms           | 2.84 ms      |
|     207 | INSERT INTO test_sbtest_2152.s ... 5459-55779113235-07063155183') | Opening tables | 11.08 ms          | 3.89 ms      |
|     212 | INSERT INTO test_sbtest_7623.s ... 0354-58204256630-57234862746') | Opening tables | 8.67 ms           | 2.80 ms      |
|     215 | INSERT INTO test_sbtest_5216.s ... 9161-37142478639-26288001648') | Opening tables | 9.72 ms           | 3.92 ms      |
|     210 | INSERT INTO test_sbtest_8007.s ... 2999-90116450579-85010442132') | Opening tables | 1.33 ms           | 0 ps         |
|     203 | INSERT INTO test_sbtest_7173.s ... 2718-12894934801-25331023143') | Opening tables | 358.09 us         | 0 ps         |
|     209 | INSERT INTO test_sbtest_1118.s ... 8361-98642762543-17027080501') | Opening tables | 3.32 ms           | 0 ps         |
|     219 | INSERT INTO test_sbtest_5039.s ... 1740-21004115002-49204432949') | Opening tables | 8.56 ms           | 8.44 ms      |
|     202 | INSERT INTO test_sbtest_8322.s ... 8686-46403563348-31237202393') | Opening tables | 1.19 ms           | 0 ps         |
|     205 | INSERT INTO test_sbtest_1563.s ... 6753-76124087654-01753008993') | Opening tables | 9.62 ms           | 2.76 ms      |
|     213 | INSERT INTO test_sbtest_5817.s ... 2771-82142650177-00423653942') | Opening tables | 17.21 ms          | 16.47 ms     |
|     216 | INSERT INTO test_sbtest_238.sb ... 5343-25703812276-82353892989') | Opening tables | 7.24 ms           | 7.20 ms      |
|     200 | INSERT INTO test_sbtest_2637.s ... 8022-62207583903-44136028229') | Opening tables | 7.52 ms           | 7.39 ms      |
|     204 | INSERT INTO test_sbtest_9289.s ... 2786-22417080232-11687891881') | Opening tables | 10.75 ms          | 9.01 ms      |
|     201 | INSERT INTO test_sbtest_6573.s ... 0106-91679428362-14852851066') | Opening tables | 8.43 ms           | 7.03 ms      |
|     217 | INSERT INTO test_sbtest_1071.s ... 9465-09453525844-02377557541') | Opening tables | 8.42 ms           | 7.49 ms      |
|     206 | INSERT INTO test_sbtest_9588.s ... 8804-20770286377-79085399594') | Opening tables | 8.02 ms           | 7.50 ms      |
|     211 | INSERT INTO test_sbtest_4657.s ... 4758-53442917995-98424096745') | Opening tables | 16.62 ms          | 9.76 ms      |
|     218 | INSERT INTO test_sbtest_9672.s ... 1537-13189199316-54071282928') | Opening tables | 10.01 ms          | 7.41 ms      |
|     214 | INSERT INTO test_sbtest_1391.s ... 9241-84702335152-38653248940') | Opening tables | 21.34 ms          | 15.54 ms     |
|     220 | INSERT INTO test_sbtest_6542.s ... 7778-65788940102-87075246009') | Opening tables | 2.96 ms           | 0 ps         |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
32 rows in set (0.11 sec)

And the transactions per second drops to ~2K.

Here I’ve expected different behavior. With the .frm files gone and with tablespace_definition_cache set to more than 10K (we have only 10K tablespace files), I’ve expected that MySQL does not have to open and close files. It looks like this is not the case.

I can also see the table opening (since the server started):

mysql> show global status like '%open%';
+----------------------------+-----------+
| Variable_name              | Value     |
+----------------------------+-----------+
| Com_ha_open                | 0         |
| Com_show_open_tables       | 0         |
| Innodb_num_open_files      | 10040     |
| Open_files                 | 0         |
| Open_streams               | 0         |
| Open_table_definitions     | 524288    |
| Open_tables                | 499794    |
| Opened_files               | 22        |
| Opened_table_definitions   | 1220904   |
| Opened_tables              | 2254648   |
| Slave_open_temp_tables     | 0         |
| Table_open_cache_hits      | 256866421 |
| Table_open_cache_misses    | 2254643   |
| Table_open_cache_overflows | 1254766   |
+----------------------------+-----------+

This is easier to see on the graphs from PMM. Insert per second for the two runs (both running 16 threads):

  1. The first run is 10K random databases/tablespaces and one table (sysbench is choosing table#1 from a randomly chosen list of 10K databases). This way there is also no contention on the tablespace file.
  2. The second run is a randomly chosen table from a list of one million tables.

As we can see, the first run is dong 50K -100K inserts/second. Second run is only limited to ~2.5 inserts per second:

MySQL 8.0

“Table open cache misses” grows significantly after the start of the second benchmark run:
MySQL 8.0

As we can see, MySQL performs ~1.1K table definition openings per second and has ~2K table cache misses due to the overflow:

MySQL 8.0

When inserting against only 1K random tables (one specific table in a random database, that way we almost guarantee that one thread will always write to a different tablespace file), the table_open_cache got warmed up quickly. After a couple of seconds, the sysbench test starts showing > 100K tps. The processlist looks much better (compare the statement latency and lock latency to the above as well):

mysql> select conn_id, current_statement, state, statement_latency, lock_latency from sys.processlist where current_statement is not null and conn_id <> CONNECTION_ID();
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
| conn_id | current_statement                                                 | state          | statement_latency | lock_latency |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
|     253 | INSERT INTO test_sbtest_3293.s ... 2282-95400708146-84684851551') | starting       | 22.72 us          | 0 ps         |
|     254 | INSERT INTO test_sbtest_3802.s ... 4030-35983148190-23616685226') | update         | 62.88 us          | 45.00 us     |
|     255 | INSERT INTO test_sbtest_5290.s ... 2361-58374942527-86207214617') | Opening tables | 36.07 us          | 0 ps         |
|     256 | INSERT INTO test_sbtest_5684.s ... 4717-34992549120-04746631452') | Opening tables | 37.61 us          | 37.00 us     |
|     257 | INSERT INTO test_sbtest_5088.s ... 5637-75275906887-76112520982') | starting       | 22.97 us          | 0 ps         |
|     258 | INSERT INTO test_sbtest_1375.s ... 8592-24036624620-65536442287') | query end      | 98.66 us          | 35.00 us     |
|     259 | INSERT INTO test_sbtest_8764.s ... 8566-02569157908-49891861265') | Opening tables | 47.13 us          | 37.00 us     |
|     260 | INSERT INTO test_sbtest_560.sb ... 2605-08226572929-25889530906') | query end      | 155.64 us         | 38.00 us     |
|     261 | INSERT INTO test_sbtest_7776.s ... 0243-86335905542-37976752368') | System lock    | 46.68 us          | 32.00 us     |
|     262 | INSERT INTO test_sbtest_6551.s ... 5496-19983185638-75401382079') | update         | 74.07 us          | 40.00 us     |
|     263 | INSERT INTO test_sbtest_7765.s ... 5428-29707353898-77023627427') | update         | 71.35 us          | 45.00 us     |
|     265 | INSERT INTO test_sbtest_5771.s ... 7065-03531013976-67381721569') | query end      | 138.42 us         | 39.00 us     |
|     266 | INSERT INTO test_sbtest_8603.s ... 7158-66470411444-47085285977') | update         | 64.00 us          | 36.00 us     |
|     267 | INSERT INTO test_sbtest_3983.s ... 5039-55965227945-22430910215') | update         | 21.04 ms          | 39.00 us     |
|     268 | INSERT INTO test_sbtest_8186.s ... 5418-65389322831-81706268892') | query end      | 113.58 us         | 37.00 us     |
|     269 | INSERT INTO test_sbtest_1373.s ... 1399-08304962595-55155170406') | update         | 131.97 us         | 59.00 us     |
|     270 | INSERT INTO test_sbtest_7624.s ... 0589-64243675321-62971916496') | query end      | 120.47 us         | 38.00 us     |
|     271 | INSERT INTO test_sbtest_8201.s ... 6888-31692084119-80855845726') | query end      | 109.97 us         | 37.00 us     |
|     272 | INSERT INTO test_sbtest_7054.s ... 3674-32329064814-59707699237') | update         | 67.99 us          | 35.00 us     |
|     273 | INSERT INTO test_sbtest_3019.s ... 1740-35410584680-96109859552') | update         | 5.21 ms           | 33.00 us     |
|     275 | INSERT INTO test_sbtest_7657.s ... 4985-72017519764-59842283878') | update         | 88.91 us          | 48.00 us     |
|     274 | INSERT INTO test_sbtest_8606.s ... 0580-38496560423-65038119567') | freeing items  | NULL              | 37.00 us     |
|     276 | INSERT INTO test_sbtest_9349.s ... 0295-94997123247-88008705118') | starting       | 25.74 us          | 0 ps         |
|     277 | INSERT INTO test_sbtest_3552.s ... 2080-59650597118-53885660147') | starting       | 32.23 us          | 0 ps         |
|     278 | INSERT INTO test_sbtest_3832.s ... 1580-27778606266-19414961452') | freeing items  | 194.14 us         | 51.00 us     |
|     279 | INSERT INTO test_sbtest_7685.s ... 0234-22016898044-97277319766') | update         | 62.66 us          | 40.00 us     |
|     280 | INSERT INTO test_sbtest_6026.s ... 2629-36599580811-97852201188') | Opening tables | 49.41 us          | 37.00 us     |
|     281 | INSERT INTO test_sbtest_8273.s ... 7957-39977507737-37560332932') | update         | 92.56 us          | 36.00 us     |
|     283 | INSERT INTO test_sbtest_8584.s ... 7604-24831943860-69537745471') | starting       | 31.20 us          | 0 ps         |
|     284 | INSERT INTO test_sbtest_3787.s ... 1644-40368085836-11529677841') | update         | 100.41 us         | 40.00 us     |
+---------+-------------------------------------------------------------------+----------------+-------------------+--------------+
30 rows in set (0.10 sec)

What about the 100K random tables? That should fit into the table_open_cache. At the same time, the default 16 table_open_cache_instances split 500K table_open_cache, so each bucket is only ~30K. To fix that, I’ve set table_open_cache_instances = 4 and was able to get ~50K tps average. However, the contention inside the table_open_cache seems to stall the queries:

MySQL 8.0

There are only a very limited amount of table openings:

MySQL 8.0

 

Conclusion

MySQL 8.0 general tablespaces looks very promising. It is finally possible to create one million tables in MySQL without the need to create two million files. Actually, MySQL 8 can handle many tables very well as long as table cache misses are kept to a minimum.

At the same time, the problem with “Opening tables” (worst case scenario test) still persists in MySQL 8.0.3-rc and limits the throughput. I expected to see that MySQL does not have to open/close the table structure file. I also hope the create table regression bug is fixed in the next MySQL 8.0 version.

I’ve not tested other new features in the new data dictionary in 8.0.3-rc: i.e., atomic DDL (InnoDB now supports atomic DDL, which ensures that DDL operations are either committed in their entirety or rolled back in case of an unplanned server stoppage). That is the topic of the next blog post.

Sep
25
2017
--

Avoid Shared Locks from Subqueries When Possible

Shared Locks

Shared LocksIn this blog post, we’ll look at how to avoid shared locks from subqueries.

I’m pretty sure most of you have seen an UPDATE statement matching rows returned from a SELECT query:

update ibreg set k=1 where id in (select id from ibcmp where id > 90000);

This query, when executed with

autocommit=1

, is normally harmless. However, this can have bad effects when combined with other statements in the same transaction that result in holding the shared locks from the SELECT query. But first, let me explain why the SELECT query would hold locks in the first place.

Due to InnoDB’s ACID properties, to make sure that the outer UPDATE statement has a consistent view of the matching rows from the SELECT query the server has to acquire a shared lock on those rows. No other thread should modify those matching rows to maintain consistency within the transaction. To demonstrate, let’s take two transactions executed in specific order below:

mysql1> begin;
mysql1> update ibreg set k=1 where id in (select id from ibcmp where id > 90000);
mysql2> begin;
mysql2> delete from ibcmp where id > 90000;

By the time the second session executes, it will be in a LOCK WAIT state (as confirmed from INFORMATION_SCHEMA):

mysql1> select * from information_schema.innodb_trx G
*************************** 1. row ***************************
                    trx_id: 3932449
                 trx_state: LOCK WAIT
               trx_started: 2017-09-06 00:20:05
     trx_requested_lock_id: 3932449:13:1354:31
          trx_wait_started: 2017-09-06 00:20:05
                trx_weight: 2
       trx_mysql_thread_id: 9
                 trx_query: delete from test.ibcmp where id > 90000
       trx_operation_state: starting index read
...
mysql1> select * from information_schema.innodb_locks G
*************************** 1. row ***************************
    lock_id: 3932449:13:1354:31
lock_trx_id: 3932449
  lock_mode: X
  lock_type: RECORD
 lock_table: `test`.`ibcmp`
 lock_index: PRIMARY
 lock_space: 13
  lock_page: 1354
   lock_rec: 31
  lock_data: 90001
*************************** 2. row ***************************
    lock_id: 3932174:13:1354:31
lock_trx_id: 3932174
  lock_mode: S
  lock_type: RECORD
 lock_table: `test`.`ibcmp`
 lock_index: PRIMARY
 lock_space: 13
  lock_page: 1354
   lock_rec: 31
  lock_data: 90001

Information_Schema.INNODB_LOCKS confirms that our first transaction has held a shared lock on the rows that matched the SELECT queries from the first transaction. This can be bad for a number of reasons:

  1. As the number of rows that matches the SELECT grows, DEADLOCK and lock wait timeouts can become more frequent
  2. As a consequence of this, ROLLBACKs would also increase (and are expensive operations)
  3. Your users can become unhappy, especially if it is not handled gracefully from the application

If you really need the consistency of the view between the table being read from and the table getting updated, the lock is necessary and unavoidable. Avoiding the deadlocks and lock wait timeouts can be minimized, but not totally avoided.

On the other hand, if you’re not worried about view consistency, there are two ways you can avoid such problems: by using variables or making sure the SELECT becomes a transient read inside the transaction (i.e., by dumping the results into an OUTFILE).

mysql1> begin;
mysql1> select group_concat(id) into @ids from ibcmp where id > 90000;
mysql1> update ibreg set k=1 where id in (@ids);
mysql2> begin;
mysql2> delete from ibcmp where iid > 90000;

The first method is bound by the

group_concat_max_len

 variable. If you think you will only have a few resulting IDs that fit into

group_concat_max_len

, this is a good solution.

mysql1> begin;
mysql1> select id into outfile '/tmp/id.csv' from ibcmp where id > 90000;
mysql1> create temporary table t (id int unsigned not null) engine=innodb;
mysql1> load data infile '/tmp/id.csv' into table t;
mysql1> update ibreg inner join t on ibreg.id = t.id;
mysql2> begin;
mysql2> delete from ibcmp where id > 90000;

The second approach is only meant to overcome the limitation of the GROUP_CONCAT method.

Again, these two approaches only work if you do not care if the result of the SELECT queries changes on the other table between the BEGIN statement and UPDATE within the transaction.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com