Webinar Thurs 6/14: MongoDB Backup and Recovery Field Guide

mongodb backup and recovery field guide

mongodb backup and recovery field guidePlease join Percona’s Sr. Technical Operations Architect, Tim Vaillancourt as he presents MongoDB Backup and Recovery Field Guide on Thursday, June 14, 2018, at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

This talk will cover backup and recovery solutions for MongoDB replica sets and clusters, focusing on online and low-impact solutions for production systems.

Register for the webinar

Tim Vaillancourt

Senior Technical Operations Architect

With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming combined with experience tuning systems from the hard disk all the way up to the end-user, Tim has spent time in nearly every area of the modern IT stack with many lessons learned.

Tim is based in Amsterdam, NL and enjoys traveling, coding and music. Prior to Percona Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (“Battlefield” series, “Mirrors Edge” series, “Star Wars: Battlefront”) launch and operate smoothly while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS.

Prior to moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.

The post Webinar Thurs 6/14: MongoDB Backup and Recovery Field Guide appeared first on Percona Database Performance Blog.


Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

Slave Performance Too Slow

Slave Performance Too SlowRecently I worked on one customer issue that I would describe as “slave performance too slow”. During a quick analysis, I’ve found that the replication slave SQL thread cannot keep up while processing row-based events from the master’s binary log.

For example:

*************************** 1. row ***************************
              Master_Log_File: binlog.0000185
          Read_Master_Log_Pos: 86698585
        Relay_Master_Log_File: binlog.0000185
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
          Exec_Master_Log_Pos: 380
              Relay_Log_Space: 85699128
                  Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c
           Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057
            Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2,

The processlist state for the SQL thread can be one of the following: Reading event from the relay log, or System lock, or potentially some other state. In my case:

| Id | User            | Host            | db   | Command | Time | State                            | Info             |
|  4 | system user     |                 | NULL | Connect |  268 | Reading event from the relay log | NULL             |

What causes that?

Let’s take a look what could potentially cause such behavior and what we need to pay attention to. When the SQL thread applies the change from a row-based event, it has to locate the exact row that was updated. With a primary key, this is trivial as only one row can possibly have the same value for the primary key.

However, if there is no primary key on the table on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete. It repeats the search for each updated row. This search is both very resource usage intensive (CPU usage can be up to 100%) and slow causing the slave to fall behind.

For InnoDB tables, the “hidden” key used for the clustered index for tables without a primary key cannot be used to avoid searching the entire table for the rows to update or delete. We need to keep in mind that the “hidden” key is unique only to each MySQL instance, so the replication master and replication slave generally don’t have the same values for the “hidden” key for the same row.

What can we do to solve that?

The best solution is to ensure that all tables have a primary key. This not only ensures the SQL thread can easily locate rows to update or delete, but it is also considered as a best practice since it ensures all rows are unique.

If there is no way to logically add a natural primary key for the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.

The query below helps you to locate tables without a primary key:

SELECT tables.table_schema, tables.table_name, tables.table_rows
      FROM information_schema.tables
      LEFT JOIN (
        SELECT table_schema, table_name
        FROM information_schema.statistics
        GROUP BY table_schema, table_name, index_name
            CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
          ) = COUNT(*)
      ) puks
      ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
      WHERE puks.table_name IS NULL
        AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
        AND tables.table_type = 'BASE TABLE' AND engine='InnoDB';

Please note that for InnoDB, there must always be a unique NOT NULL key for all tables. It is required for the clustered index. So adding an explicit “dummy” column as suggested above will not add to the overall storage requirements as it will merely replace the hidden key.

It’s not always possible to add a primary key to the table immediately if, for example, there are many relations on the application side/legacy system, lack of resources, unknown application behavior after the change which required testing, etc.

In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.

The search algorithm is set using the slave_rows_search_algorithms option which is available in MySQL 5.6 and later. The default value is to use an index scan if possible, otherwise a table scan.


However, for tables without a primary key using a hash scan, which causes the SQL thread to temporarily cache hashes to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically using:

mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';

Just to note INDEX_SCAN,HASH_SCAN is the default value in MySQL 8.0.

One thing to be aware of when using hash scans that the hashes are only reused within one row-based event. (Each row-based event may have changes to several rows in the same table originating from the same SQL statement).

The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event. The default max event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when:

  1. Several rows fit into one row based event. It may help to increase the value of binlog_row_event_max_size on the replication master, if you perform updates or deletes on large rows (e.g., with blob or text data). You can only set the binlog_row_event_max_size in the MySQL configuration file, and resetting this value requires a restart.
  2. One statement changes several rows.



Even if enabling hash scans improves the performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be the general rule of thumb in the schema design in order avoid and/or minimize many issues like slave performance too slow (as described in this post).

Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.

The post Causes and Workarounds for Slave Performance Too Slow with Row-Based Events appeared first on Percona Database Performance Blog.


MongoDB Rollback in replicaset

MongoDB Rollback

MongoDB RollbackIn this blog post, we’ll look at how MongoDB rollback works during replicaset failovers.

In recent versions, MongoDB has provided lots of features related to replicaset and automatic failover. When it comes to failover, the next question that arises is “How does MongoDB ROLLBACK work during replicaset failover?”

If a PRIMARY member (say node A) stepped down with some data writes that were executed but not replicated to the SECONDARY members yet, then a ROLLBACK occurs on the former PRIMARY A when it rejoins the replicaset. I’ll explain below how the ROLLBACK works!

ROLLBACK Scenario:

ROLLBACK is rare in a replicaset as MongoDB tries to avoid it by replicating the operations from PRIMARY to SECONDARY without delay, under normal conditions. Most of the time ROLLBACK occurs in the event of network partitioning, or if SECONDARY members can’t keep up with the throughput of operations on the former PRIMARY.


We will see the process with a test. I have used Docker for this test with the MongoDB 3.2 Jessie version to setup a replicaset with members mongo1 – A, mongo2 – B, mongo3 – C and set Priority 10 to A. Now A is PRIMARY as expected in the replicaset. We need to write some data into A and create a network partition scenario with B and C at the same time. For that, I inserted 25000 documents into A and made it out of network at the same time.

Terminal 1 (A’s mongo prompt):

my-mongo-set:PRIMARY> for (var i = 1; i <= 25000; i++) {
...    db.testData.insert( { x : i } )
... }
WriteResult({ "nInserted" : 1 })
my-mongo-set:PRIMARY> db.testD2018-03-30T17:34:51.455+0530 I NETWORK  [thread1] trying reconnect to ( failed
2018-03-30T17:34:51.464+0530 I NETWORK  [thread1] reconnect ( ok
my-mongo-set:SECONDARY> rs.slaveOk()
my-mongo-set:SECONDARY> db.testData.count()


Vinodhs-MBP:~ vinodhkrish$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                           NAMES
b27d82ac2439        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days >27017/tcp        mongo3
2b39f9e41973        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days >27017/tcp        mongo2
105b6df757d7        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days >27017/tcp        mongo1
Vinodhs-MBP:~ vinodhkrish$ docker network disconnect my-mongo-cluster mongo1

The member A has now become as SECONDARY, because it couldn’t reach other members in the replicaset. On the other side, B and C members see that A is not reachable and then B is elected as PRIMARY. We could see that some inserts from former A replicated to B before the network split happens.

(B node)

my-mongo-set:PRIMARY> db.testData.count()

Now do some write operations in current PRIMARY – B and then let node A join the network back by joining the container back to the bridge network. You can observe below that the node A’s member states are changing in the mongo prompt. (I just connected to A and pressed ENTER/RETURN button many times to see the member states, or you can see them in the log file):

(A node)

Vinodhs-MacBook-Pro:mongodb-osx-x86_64-3.2.19 vinodhkrish$ ./bin/mongo
MongoDB shell version: 3.2.19
connecting to:


From MongoDB point of view, we will see the replicaset process to understand what happened above. Normally the SECONDARY member syncs the oplog entries from its syncSource (the member from where the data is replicated) by using oplogFetcher. The OplogFetcher first sends a find() command to the syncSource’s oplog, and then follows with a series of getMores on the cursor. When node A rejoins the replicaset, node A’s oplogFetcher first sends find() command to syncSource node B and check it has a greater than or equal predicate on the timestamp of the last oplog entry it has fetched. Usually the find() command should at least return one doc due to the greater than or equal predicate. If not, it means that the syncSource is behind and so it will not replicate from it and look for other syncSource.

In this case, A’s oplogFetcher sees that the first document returned from node B does not match the last entry in its oplog. That means node A’s oplog has diverged from node B’s and it should go into ROLLBACK.

Node A first finds the common point between its oplog and its syncSource B’s oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them. Here, 9997 inserts are missed from B and C nodes, and so these documents will be recovered from A’s oplog.

2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 30 12:03:52:139). source's GTE: (term: 5, timestamp: Mar 30 12:05:37:1) hashes: (3789163619674410187/3226093795606474294)
2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] rollback 0
2018-03-30T12:08:37.160+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] beginning rollback
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] rollback 1
2018-03-30T12:08:37.164+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Mar 30 12:03:52:139
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback their last optime: Mar 30 12:08:17:1c5
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -265 seconds
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback common point is (term: 4, timestamp: Mar 30 12:03:46:d2)
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback 3 fixup
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] rollback 3.5
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] Setting minvalid to (term: 5, timestamp: Mar 30 12:08:17:1c5)
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4 n:1
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.6
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.7
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 5 d:9997 u:0
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 6
2018-03-30T12:08:38.394+0000 I REPL     [rsBackgroundSync] rollback done
2018-03-30T12:08:38.396+0000 I REPL     [rsBackgroundSync] rollback finished


Where would these 9997 recovered documents go? MongoDB writes these ROLLBACK documents under the rollback directory in the dbpath. These recovered collections are named with namespace as the prefix and the date time as the suffix in their names. These are in BSON format, and we need to convert into JSON to analyze them so the plan for the next course of action can be done. In our case, the testData collection’s rollback data are as follows:

root@105b6df757d7:/# cd /data/db
root@105b6df757d7:/data/db# ls -l rollback/
total 324K
-rw-r--r-- 1 mongodb mongodb 323K Mar 30 12:08 admin.testData.2018-03-30T12-08-38.0.bson

root@105b6df757d7:/data/db/rollback# bsondump admin.testData.2018-03-30T12-08-38.0.bson > rollback.json
2018-03-30T12:13:00.033+0000 9997 objects found
root@105b6df757d7:/data/db/rollback# head rollback.json

That’s it? Now check the counts of the testData collection in node A:

my-mongo-set:PRIMARY> db.testData.count()

So the records 9997 which were rollbacked into the rollback directory would also be dropped from the collection. This ensures the data consistency throughout the replicaset.

How to avoid ROLLBACK – writeConcern

The default writeConcern in the replicaSet is w:1., i.e., When a client writes into a replicaSet, then it receives an acknowledgment from the PRIMARY alone and won’t wait for SECONDARY members’ acknowledgment. If you want to avoid the ROLLBACK scenario in your environment, then you have to use the {w:majority} or {w:n}, where 1 > n <=  (no. of members in your replica set). This ensures that the writes are propagated to so many members of the replica set before sending the acknowledgment to the client. This solves the problem of ROLLBACK.

But please be careful that you are not giving higher value to writeConcern, because it also affects the write performance. The acknowledgment needs to be received from the number of members mentioned in the value. The value {w:majority} provides the acknowledgement that write operations have propagated to the majority of voting nodes, including the primary and is suitable for most of the environments.

ROLLBACK – Limitation

The main thing to note here is that mongod will not rollback more than 300MB data. In such cases, we need to manually check the instance to recover the data. You can see the below message in mongod.log in such cases:

[replica set sync] replSet syncThread: 13410 replSet too much data to roll back

Understanding this simple ROLLBACK background helps us to decide what needs to be done with the rollbacked data. It also helps us avoid such scenarios, because data is data and is very important!

The post MongoDB Rollback in replicaset appeared first on Percona Database Performance Blog.


Webinar Thursday May 3, 2018: Running MongoDB in Production (Part 3)

Running MongoDB

Running MongoDBPlease join Percona’s Senior Technical Operations Architect, Tim Vaillancourt as he presents Running MongoDB in Production (Part 3) on Thursday, May 3, 2018, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

Are you a seasoned MySQL DBA that needs to add MongoDB to your skills? Are you used to managing a small environment that runs well, but want to know what you might not know yet?

MongoDB works well, but when it has issues the number one question is “where should I go to solve a problem?”

This webinar on running MongoDB covers:

  • Troubleshooting
    • Log File
    • Slow Query
    • Operations
  • Schema Design
    • Data Types
    • Indexes
    • Workflows
  • Data Integrity
    • Replica Sets
    • Write Concerns
    • Data Recovery
  • Scaling (Read/Writes)

Register for the webinar now.

Missed Part 1 and Part 2 of our Running MongoDB in Production series? You can watch and download the slides of Part 1 here and watch or download the slides of Part 2 here.

Timothy Vaillancourt, Senior Technical Operations Architect

Tim joined Percona in 2016 as Sr. Technical Operations Architect for MongoDB with the goal of making MongoDB operations as smooth as possible. With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming, combined with experience tuning systems from the hard disk all the way up to the end-user, Tim has spent time in nearly every area of the modern IT stack with many lessons learned.

Tim is based in Amsterdam, NL and enjoys traveling, coding and music. Prior to Percona Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (“Battlefield” series, “Mirrors Edge” series, “Star Wars: Battlefront”) launch and operate smoothly while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS. Prior to moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.

The post Webinar Thursday May 3, 2018: Running MongoDB in Production (Part 3) appeared first on Percona Database Performance Blog.


Keep Sensitive Data Secure in a Replication Setup

Keep sensitive data secure

Keep sensitive data secureThis blog post describes how to keep sensitive data secure on slave servers in a MySQL async replication setup.

Almost every web application has a sensitive data: passwords, SNN, credit cards, emails, etc. Splitting the database to secure and “public” parts allows for restricting user and application parts access to sensitive data.

Field encryption

This is based on MySQL encryption functions or on client-side encryption when the authorized user knows a secret, but encrypted data is distributed to all slaves.

  • If possible, use hashes with a big enough salt, and do not store real sensitive data in the database. A good example is passwords. An end-user sends the login and password, application/SQL code calculates the hash with a salt value unique for each end-user and compares the hash with the value stored in the database. Even if the attacker gets the hashes, it’s still hard or even impossible to extract real passwords for all users. Make sure that you are using a good random number generator for the salt, application-side secret, and a good hash function (not MD5).
  • Encryption is not suitable if you are going to provide public access to your database (via slave dumps in sql/csv/xml/json format).
  • Encryption is a complex topic. Check here for a good blog post explaining hashing usage, and try to find a security consultant if you are inventing some “new” method of storing and encrypting data.

Field encryption example

I’m using a single server setup, because the most important part of data separation should be done on the application side. The secure part of the application has a secret passphrase. For example, you can place the code working with authentication, full profile and payments on a separate server and use a dedicated MySQL account.

create database encrypted;
use encrypted;
create table t(c1 int, c2 varchar(255), rnd_pad varbinary(16), primary key(c1));
SET block_encryption_mode = 'aes-256-cbc';
SET @key_str = SHA2('My secret passphrase',512);
SET @init_vector = RANDOM_BYTES(16);
insert into t (c1,c2, rnd_pad) values (1, AES_ENCRYPT('Secret', @key_str, @init_vector), @init_vector);
-- decrypt data
select c1, AES_DECRYPT(c2,@key_str, rnd_pad) from t;


  • GOOD: Master and slave servers have exactly the same data and no problems with replication.
  • GOOD: Even if two different end-users have exactly the same password, the stored values are different due to random bytes in the init vector for AES encryption.
  • GOOD: Both the encryption and random number generation uses an external library (openssl).
  • CONF: It’s important to have binlog_format=ROW to avoid sending the secret to slave servers.
  • CONF: Do not allow end-users to change data without changing the init_vector, especially for small strings without random padding. Each update should cause init_vector re-generation.
  • BAD: Encrypted data is still sent to slave servers. If the encryption algorithm or protocol is broken, it is possible to get access to data from an insecure part of the application.
  • BAD: The described protocol still could be insecure.

Replication filters

There are two types of replication filters: a master-side with binlog-*db and a slave-side with replicate-*.

Both could cause replication breakage. Replication filters were created for STATEMENT-based replication and are problematic with modern binlog_format=ROW + gtid_mode=on setup. You can find several cases related to database-level slave-side filters in this blog post. If you still need slave-side filtering, use per-table replicate-wild-*-table options.


Even if binary logging is disabled for a specific database, the statement still could be stored in the binary log if it’s a DDL statement, or if the binlog_format is STATEMENT or MIXED and default database is not used by the statement. For details, see the reference manual for the binlog-do-db option. In order to avoid replication issues, you should use ROW-based replication and run SET SESSION sql_log_bin=0; before each DDL statement is executed against the ignored database. It’s not a good idea to use binlog-do-db, because you are losing control of what should be replicated.

Why is binary log filtering useful? Changing the sql_log_bin variable is prohibited inside transactions. The sql_log_bin is DANGEROUS, please do not use it instead of binlog-ignore-db in production on the application side. If you need it for database administration, make sure that you are always typing the “session” word before sql_log_bin. This makes problematic consistent updates of multiple entities inside database.

We still should have the ability to hide just one column from the table. But if we are ignoring the database, we should provide a method of reading non-secure data on slaves / by restricted MySQL accounts. This is possible with triggers and views:

create database test;
set session sql_log_bin=0;
create table test.t(c1 int, c2 int, primary key(c1));
alter table test.t add primary key(c1);
set session sql_log_bin=1;
create database test_insecure;
create table test_insecure.t(c1 int, c2 int default NULL, primary key(c1));
use test
delimiter //
create trigger t_aft_ins
after insert
 on test.t FOR EACH ROW
  INSERT test_insecure.t (c1) values (NEW.c1);
END //
create trigger t_aft_upd
after update
 on test.t FOR EACH ROW
  UPDATE test_insecure.t SET c1 = NEW.c1 WHERE c1 = OLD.c1;
END //
create trigger t_aft_del
after delete
 on test.t FOR EACH ROW
  DELETE FROM test_insecure.t WHERE c1 = OLD.c1;
END //
delimiter ;
-- just on slave:
create database test;
create view test.t as select * from test_insecure.t;
-- typical usage
INSERT INTO test.t values(1,1234);
SELECT * from test.t; -- works on both master and slave, c2 field will have NULL value on slave.


  • BAD: The data is not the same on the master and slaves. It potentially breaks replication. It’s not possible to use a slave’s backup to restore the master or promote the slave as a new master.
  • BAD: Triggers could reduce DML statement performance.
  • GOOD: The sensitive data is not sent to slaves at all (and not written to binary log).
  • GOOD: It works with GTID
  • GOOD: It requires no application changes (or almost no application changes).
  • GOOD: binlog-ignore-db allows us to not use the dangerous sql_log_bin variable after initial table creation.

The post Keep Sensitive Data Secure in a Replication Setup appeared first on Percona Database Performance Blog.


Multi-Source Replication Performance with GTID

Multi-Source Replication with GTID

In this blog post, we’ll look at the performance of multi-source replication with GTID.

Multi-Source Replication is a topology I’ve seen discussed recently, so I decided to look into how it performs with the different replication concepts. Multi-source replication use replication channels, which allow a slave to replicate from multiple masters. This is a great way to consolidate data that has been sharded for production or simplify the analytics process by using the same server. Since multiple masters are taking writes, care is needed to not overlook the slave. The traditional replication concept uses the binary log file name, and the position inside that file.

This was the standard until the release of global transaction identifiers (GTID). I have set up a test environment to validate which concept would perform better, and be a better choice for use in this topology.


My test suite is rather simple, consisting of only three virtual machines, two masters and one slave. The slaves’ replication channels are set up using the same concept for each run, and no run had any replication filters. To prevent any replication errors, each master took writes against a different schema and user grants are identical on all three servers. The setup below ran with both replication channels using binary log file and position. Then the tables were dropped and the servers changed to use GTID for the next run.

Prepare the sysbench tables:

sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db1 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare
sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db3 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare

I used a read-only sysbench to warm up the InnoDB buffer pool. Both commands ran on the slave to ensure both schemas were loaded into the buffer pool:

sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db1 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua run
sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db3 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua run

After warming up the buffer pool, the slave should be fully caught up with both masters. To remove IO contention as a possible influencer, I stopped the SQL thread while I generated load on the master. Leaving the IO thread running allowed the slave to write the relay logs during this process, and help ensure that the test only measures the difference in the slave SQL thread.

stop slave sql thread for channel 'db1'; stop slave sql thread for channel 'db3';

Each master had a sysbench run against it for the schema that was designated to it in order to generate the writes:

sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db1 --range_size=100 --table_size=1000000 --tables=5 --threads=1 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_write_only.lua run
sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db3 --range_size=100 --table_size=1000000 --tables=5 --threads=1 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_write_only.lua run

Once the writes completed, I monitored the IO activity on the slave to ensure it was 100% idle and that all relay logs were fully captured. Once everything was fully written, I enabled a capture of the replication lag once per minute for each replication channel, and started the slaves SQL threads:

usr/bin/pt-heartbeat -D db1 -h localhost --master-server-id=101 --check
usr/bin/pt-heartbeat -D db3 -h localhost --master-server-id=103 --check
start slave sql thread for channel 'db1'; start slave sql thread for channel 'db3';

The above chart depicts the cumulative lag seen on the slave by pt-heartbeat since starting the sql_thread. The first item to noticed is that the replication delay was higher overall with the binary log. This could be because the SQL thread stopped for a different amount of time. This may appear to give GTID an advantage in this test, but remember with this test the amount of delay is less important than the processed rate. Focusing on when replication began to display a significant change towards catching up you can see that there are two distinct drops in delay. This is caused by the fact that the slave has two replication threads that individually monitor their delay. One of the replication threads caught up fully and the other was delayed for a bit longer.

In every test run. GTID took slightly longer to fully catch up than the traditional method. There are a couple of reasons to expect GTID’s to be slightly slower. One possibility is the that there are additional writes on the slave, in order to keep track of all the GTID’s that the slave ran. I removed the initial write to the relay log, but we must retain the committed GTID, and this causes additional writes. I used the default settings for MySQL, and as such log_slave_updates was disabled. This causes the replicated GTID to be stored in a table, which is periodically compressed. You can find more details on how log_slave_updates impacts GTID replication here.

So the question still exists, why should we use GTID, especially with multisource replication? I’ve found that the answer lies in the composition of a GTID. From MySQL’s GTID Concepts, a GTID is composed of two parts, the source_id, and the transaction_id. The source_id is a unique identifier targeting the server which originally wrote the transaction. This allows you to identify in the binary log which master took the initial write, and so you can pinpoint problems much easier.

The below excerpt from DB1’s (a master from this test) binary log shows that, before the transaction being written, the “SET @@SESSION.GTID_NEXT” ran. This is the GTID that you can follow through the rest of the topology to identify the same transaction.

“d1ab72e9-0220-11e8-aee7-00155dab6104” is the server_uuid for DB1, and 270035 is the transaction id.

SET @@SESSION.GTID_NEXT= 'd1ab72e9-0220-11e8-aee7-00155dab6104:270035'/*!*/;
# at 212345
#180221 15:37:56 server id 101 end_log_pos 212416 CRC32 0x758a2d77 Query thread_id=15 exec_time=0 error_code=0
SET TIMESTAMP=1519245476/*!*/;
# at 212416
#180221 15:37:56 server id 101 end_log_pos 212472 CRC32 0x4363b430 Table_map: `db1`.`sbtest1` mapped to number 109
# at 212472
#180221 15:37:56 server id 101 end_log_pos 212886 CRC32 0xebc7dd07 Update_rows: table id 109 flags: STMT_END_F
### UPDATE `db1`.`sbtest1`
### @1=654656 /* INT meta=0 nullable=0 is_null=0 */
### @2=575055 /* INT meta=0 nullable=0 is_null=0 */
### @3='20363719684-91714942007-16275727909-59392501704-12548243890-89454336635-33888955251-58527675655-80724884750-84323571901' /* STRING(120) meta=65144 nullable=0 is_null=0 */
### @4='97609582672-87128964037-28290786562-40461379888-28354441688' /* STRING(60) meta=65084 nullable=0 is_null=0 */
### SET
### @1=654656 /* INT meta=0 nullable=0 is_null=0 */
### @2=575055 /* INT meta=0 nullable=0 is_null=0 */
### @3='17385221703-35116499567-51878229032-71273693554-15554057523-51236572310-30075972872-00319230964-15844913650-16027840700' /* STRING(120) meta=65144 nullable=0 is_null=0 */
### @4='97609582672-87128964037-28290786562-40461379888-28354441688' /* STRING(60) meta=65084 nullable=0 is_null=0 */
# at 212886
#180221 15:37:56 server id 101 end_log_pos 212942 CRC32 0xa6261395 Table_map: `db1`.`sbtest3` mapped to number 111
# at 212942
#180221 15:37:56 server id 101 end_log_pos 213166 CRC32 0x2782f0ba Write_rows: table id 111 flags: STMT_END_F
### INSERT INTO `db1`.`sbtest3`
### SET
### @1=817058 /* INT meta=0 nullable=0 is_null=0 */
### @2=390619 /* INT meta=0 nullable=0 is_null=0 */
### @3='01297933619-49903746173-24451604496-63437351643-68022151381-53341425828-64598253099-03878171884-20272994102-36742295812' /* STRING(120) meta=65144 nullable=0 is_null=0 */
### @4='29893726257-50434258879-09435473253-27022021485-07601619471' /* STRING(60) meta=65084 nullable=0 is_null=0 */
# at 213166
#180221 15:37:56 server id 101 end_log_pos 213197 CRC32 0x5814a60c Xid = 2313
# at 213197


Based on the sysbench tests I ran, GTID replication has a slightly lower throughput. It took about two to three minutes longer to process an hour worth of writes on two masters, compared to binary log replication. GTID’s strengths lie more in how it eases the management and troubleshooting of complex replication topologies.

The GTID concept allows a slave to know exactly which server initially wrote the transaction, even in a tiered environment. This means that if you need to promote a slave from the bottom tier, to the middle tier, simply changing the master is all that is needed. The slave can pick up from the last transaction it ran on that server and continue replicating without a problem. Stephane Combaudon explains this in detail in a pair of blogs. You can find part 1 here and part 2 here. Facebook also has a great post about their experience deploying GTID-based replication and the troubles they faced.

The post Multi-Source Replication Performance with GTID appeared first on Percona Database Performance Blog.


Percona XtraDB Cluster, MySQL Asynchronous Replication and log-slave-updates

Asynchronous Replication

Recently, I’ve been looking into issues with the interactions between MySQL asynchronous replication and Galera replication. In this blog post, I’d like to share what I’ve learned.

MySQL asynchronous replication and Galera replication interactions are complicated due to the number of factors involved (Galera replication vs. asynchronous replication, replication filters, and row-based vs. statement-based replication). So as a start, I’ll look at an issue that came up with setting up an asynchronous replication channel between two Percona XtraDB Cluster (PXC) clusters.

Here’s a view of the desired topology:

MySQL asynchronous replication

The Problem

We want to set up an asynchronous replication channel between two PXC clusters. We also set log-slave-updates on the async slave (PXC node 2a in the topology diagram).

This is an interesting configuration and results in unexpected behavior as the replication depends on the node where the operation was performed. Let’s use CREATE TABLE as an example.

  • Run CREATE TABLE on Node 1a.  The table replicates to Node 1b, but not to the nodes in cluster 2.
  • Run CREATE TABLE on Node 1b. The table replicates to all nodes (both cluster 1 and cluster 2).

Some background information

Understanding the problem requires some knowledge of MySQL threads. However, as a simplification, I’ll group the threads into three groups:

  • Main MySQL Client Thread: This thread handles the requests for the client connection (here the client is an external entity).
  • Async Replication Threads: There are multiple threads in use here, some handle I/O, and some apply the updates, but we will view them as a single entity for simplicity.
  • Galera Threads: There are also multiple threads here, similar to the Async Replication Threads. (The name Galera here refers to the underlying replication technology used by PXC.)

For more information on MySQL threads, see

Why is the data not replicating?

In the first case (CREATE TABLE executed on Node1a)

  • The table is replicated from Node1a -> Node 1b via Galera replication.
  • The table is not replicated because the async replication threads are not picking up the changes.

In the second case (CREATE TABLE executed on Node 1b)

  • The table is replicated from Node1b -> Node 1a via Galera replication.
  • The table is replicated from Node1b -> Node 2a via async replication. This differs from the first case because the statement is executed on the Main MySQL client thread.  The async replication threads pick up the changes and send them to Node 2a.
  • The table is replicated from Node 2a -> Node 2b via Galera replication because log-slave-updates has been enabled on Node2a.

That last part is the important bit. We can view the Galera replication threads as another set of asynchronous replication threads. So if data is coming in via async replication, they have to be made visible to Galera by log-slave-updates.  This is true in the other direction also: log-slave-updates must be enabled for Galera to supply data to async replication.

This is very similar to chained replication

The Solution

In this scenario, the answer is to set log-slave-updates on Node 1b (the async master) and on Node 2a (the async slave).

We set log-slave-updates on node 1b to allow the async threads to pickup the changes from the Galera threads.

We set log-slave-updates on node 2a to allow the Galera threads to pickup the changes from the async threads. Starting with PXC 5.7.17, calling START SLAVE on a PXC node will return an error unless log-slave-updates is enabled.

You must enable log-slave-updates on the node for data to be transferred between Galera and asynchronous replication.

Recommendations/Best Practices

If you plan to use MySQL asynchronous replication with Percona XtraDB Cluster (either as async master or slave), we recommend that you enable log-slave-updates on all nodes within the cluster. This to (1) to ensure that any async replication connections to/from the cluster work correctly and (2) to ensure that all the nodes within a cluster share the same configuration and behavior.

Recommended configuration diagram for the clusters:

MySQL asynchronous replication

The post Percona XtraDB Cluster, MySQL Asynchronous Replication and log-slave-updates appeared first on Percona Database Performance Blog.


The Multi-Source GTID Replication Maze

Multi-Source GTID Replication

In this blog post, we’ll look at how to navigate some of the complexities of multi-source GTID replication.

GTID replication is often a real challenge for DBAs, especially if this has to do with multi-source GTID replication. A while back, I came across a really interesting customer environment with shards where multi-master, multi-source, multi-threaded MySQL 5.6 MIXED replication was active. This is a highly complex environment that has both pros and cons, introducing risks as a trade-off for specific customer requirements.

This is the set up of part of this environment:

I started looking into this setup when a statement broke replication between db1 and db10. Replication broke due to a statement executed on a schema that was not present on db10. This also resulted in changes originating from db1 to not being pushed down to db100 as db10, as we stopped the replication thread (for db1 channel).

On the other hand, replication was not stopped on db2 because the schema in question was present on db2. Replication between db2 and db20 was broken as well because the schema was not present in db20.

In order to fix db1->db10 replication, four GTID sets were injected in db10.

Here are some interesting blog posts regarding how to handle/fix GTID replication issues:

After injecting the GTID sets, we started replication again and everything ran fine.


After that, we had to check the db2->db20 replication, which, as I’ve already said, was broken as well. In this case, injecting only the first GTID trx into db20 instead of all of those causing issues on db10 was enough!

You may wonder how this is possible. Right? The answer is that the rest of them were replicated from db10 to db20, although the channel was not the same.

Another strange thing is the fact that although the replication thread for the db2->db20 channel was stopped (broken), checking the slave status on db20 showed that Executed_Gtid_Set was moving for all channels even though Retrieved_Gtid_Set for the broken one was stopped! So what was happening there?

This raised my curiosity, so I decided to do some further investigation and created scenarios regarding other strange things that could happen. An interesting one was about the replication filters. In our case, I thought “What would happen in the following scenario … ?”

Let’s say we write a row from db1 to db123.table789. This row is replicated to db10 (let’s say using channel 1) and to db2 (let’s say using channel2). On channel 1, we filter out the db123.% tables, on channel2 we don’t. db1 writes the row and the entry to the binary log. db2 writes the row after reading the entry from the binary log and subsequently writes the entry to its own binary log and replicates this change to db20. This change is also replicated to db10. So now, on db10 (depending on which channel finds the GTID first) it either gets filtered on channel1 and written to its own bin log at just startcommit with any actual DDL/DML removed, or if it is read first on channel2 (db1->db2 and then db20->db10) then it is NOT filtered out and executed instead. Is this correct? It definitely ISN’T!

Points of interest

You can find answers to the above questions in the points of interest listed below. Although it’s not really clear through the official documentation, this is what happens with GTID replication and multi-source GTID replication:

  • As we know GTID sets are unique across all nodes in a given cluster. In multi-source replication, Executed_Gtid_Set is common for all channels. This means that regardless the originating channel, when a GTID transaction is executed it is recorded in all channels’ Executed_Gtid_Set. Although it’s logical (each database is unique, so if a trx is going to affect a database it shouldn’t be tightened to a single channel regardless of the channel it uses), the documentation doesn’t provide much info around this.
  • When we have multi-source, multi-level replication, there are cases where the GTID sets originating from one master can end up on one slave via different replication paths. It’s not clear if it applies any special algorithm (although it doesn’t seem that there could be one), but the preferred method seems to be FIFO. The fastest wins! This means that GTID sets can travel to the slave via different channels, and it’s related to how fast the upper-level slaves can commit changes. In fact, the path doesn’t really matter as it only executes each GTID trx once.
  • Replication filters are global regardless the channel. This means they apply each filter to all channels. This is normal as we can’t define a replication filter per channel. In order to be able to debug such cases, adding a small replication delay per channel seems a good idea.

Binlog Encryption with Percona Server for MySQL

binlog encryption

In this blog post, we’ll look at how to turn on binlog encryption in Percona Server for MySQL.

Why do I need this?

As you probably know, Percona Server for MySQL’s binlog contains sensitive information. Replication uses the binlog to copy events between servers. They contain all the information from one server so that it can be applied on another. In other words, if somebody has access to a binlog, it means they have access to all the data in the server. Moreover, said person (or, “Hacker”) could create a clone copy of our server by just making a replica of it. In the end, they have access to our binlog. This shows how important protecting a binlog really is – leakage of binlogs not only make a particular table/tablespace or a group of tables accessible to a hacker, but literally the whole server is at risk. The same situation is true with relay log – a relay log is really a copy of binlog on the slave server.

But have no fear – a new feature to the rescue – binary log encryption. Since Percona Server for MySQL version 5.7.20-19 (beta version) it is possible to enable binlog encryption for all the binlogs and relay logs produced by the server.

How do you turn it on?

To start binlog encryption, you need to start the server with –encrypt-binlog=1. This, in turn, requires –master_verify_checksum and –binlog_checksum both to be ON. Also, you need to install one of the keyring plugins.

From now on all the binlogs and relay logs produced by the server get encrypted. However, for the replication to be safe as a whole the connection between servers also has to be encrypted. See https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-encrypted-connections.html for details on how to do this.

Please note that this does not mean that all binlogs in our replication schema get encrypted. Remember you need to turn on encrypt-binlog on slave servers too, even if they do not produce binlog files. Slave servers still produce relay logs when replicating from a master server. Turn on encrypt-binlog on slave servers so that their relay logs also get encrypted.

How does this work in the big picture?

The master encrypts the event before writing it into the binlog. The slave connects to master and ask for events. The master decrypts the events from the binary log and sends them over to slave.

Note that events send between the master and slave servers are not encrypted! This is why the connection between the master and slave needs to use a secure channel, i.e., TLS.

The slave receives events from the master, encrypts them and writes them down into the relay log.

That is why we need to enable encrypt-binlog on a slave. The relay log has to get encrypted too.

Next, the slave decrypts events from relay log and applies them. After applying the event the slave encrypts it and writes it down into its binlog file (given binlog is enabled on the slave).

In summary to make our replication secure, we need:

  • Turn on encrypt-binlog on the master
  • Turn on encrypt-binlog on the slave
  • The connection between master and slave needs to use TLS.

It’s worth noting that servers in replication have no idea if other servers are encrypted or not.

Why do master_verify_checksum and binlog_checksum need to be turned ON?

This is needed for “authenticate encryption”. Simply put, this is how we make sure that what we decrypt has not been changed by a third party. Also, it checks if the key that was used to decrypt the event was the correct one.

Digging deeper with mysqlbinlog

Mysqlbinlog is a standalone application that lets you read binlog files. As I write this blog post, it is not capable of decrypting binary logs – at least not by itself. However, it still can read encrypted binlog files when using a running Percona Server for MySQL. Use option –read-from-remote-server to read binary log produced by a given server.

Let’s see what happens when we try to read an encrypted binlog with mysqlbinlog without read-from-remote-server enabled. You will get something like this:

As you can see it is only possible to read binary log till event type 9f gets read. This event is the Start_encryption_event. After this event the rest of the binlog is encrypted. One thing to note is that Start_encryption_event is never propagated in replication. For instance, the master server is run with –encryt_binlog. This means that the server writes Start_encryption_event to its binary logs. However it is never sent to the slave server (the slave has no idea whether the master is encrypted).

Another option you can use with mysqlbinlog is –force option. It forces mysqlbinlog to read all the events from the binlog, even if they are encrypted. You will see something like this in the output:

As you can see, it is only possible to read two first events – until the Start_encryption_event. However, this time we can see that there are other events that follow, which are encrypted.

Running mysqlbinlog (without –read-from-remote) on encrypted binary logs may only make sense if we want to see if a given binary log is encrypted. For point-in-time recovery, and for other purposes that might require reading encrypted binlog, we would use mysqlbinlog with –read-from-remote option.

For instance, if we want to read binlog master-bin.000001, and Percona Server for MySQL is running on, port 3033, with user:robert, password:hard_password, we would use mysqlbinlog like this:

mysqlbinlog –read-from-remote-server –protocol=tcp –host= –port=3033 –user=robert –password=hard_password master-bing.000001.

When you look at the output of this command, you see something like this:

You can now see the decrypted binlog. One interesting thing to note here is that we do not see our Start_encryption_event (type 9f). This proves my point – Start_encryption_event never leaves the server (we are reading from the server now as we use –read-from-remote-server).

For more information how to use mysqlbinlog for point-in-time recovery see https://dev.mysql.com/doc/refman/5.7/en/point-in-time-recovery.html.

However, for more modern approaches for point-in-time recovery that do not use mysqlbinlog and make use of parallel appliers, see here:

Have fun with binlog encryption!


Archiving MySQL Tables in ClickHouse

Archiving MySQL Tables in ClickHouse

Archiving MySQL Tables in ClickHouseIn this blog post, I will talk about archiving MySQL tables in ClickHouse for storage and analytics.

Why Archive?

Hard drives are cheap nowadays, but storing lots of data in MySQL is not practical and can cause all sorts of performance bottlenecks. To name just a few issues:

  1. The larger the table and index, the slower the performance of all operations (both writes and reads)
  2. Backup and restore for terabytes of data is more challenging, and if we need to have redundancy (replication slave, clustering, etc.) we will have to store all the data N times

The answer is archiving old data. Archiving does not necessarily mean that the data will be permanently removed. Instead, the archived data can be placed into long-term storage (i.e., AWS S3) or loaded into a special purpose database that is optimized for storage (with compression) and reporting. The data is then available.

Actually, there are multiple use cases:

  • Sometimes the data just needs to be stored (i.e., for regulatory purposes) but does not have to be readily available (it’s not “customer facing” data)
  • The data might be useful for debugging or investigation (i.e., application or access logs)
  • In some cases, the data needs to be available for the customer (i.e., historical reports or bank transactions for the last six years)

In all of those cases, we can move the older data away from MySQL and load it into a “big data” solution. Even if the data needs to be available, we can still move it from the main MySQL server to another system. In this blog post, I will look at archiving MySQL tables in ClickHouse for long-term storage and real-time queries.

How To Archive?

Let’s say we have a 650G table that stores the history of all transactions, and we want to start archiving it. How can we approach this?

First, we will need to split this table into “old” and “new”. I assume that the table is not partitioned (partitioned tables are much easier to deal with). For example, if we have data from 2008 (ten years worth) but only need to store data from the last two months in the main MySQL environment, then deleting the old data would be challenging. So instead of deleting 99% of the data from a huge table, we can create a new table and load the newer data into that. Then rename (swap) the tables. The process might look like this:

  1. CREATE TABLE transactions_new LIKE transactions
  2. INSERT INTO transactions_new SELECT * FROM transactions WHERE trx_date > now() – interval 2 month
  3. RENAME TABLE transactions TO transactions_old, transactions_new TO transactions

Second, we need to move the transactions_old into ClickHouse. This is straightforward — we can pipe data from MySQL to ClickHouse directly. To demonstrate I will use the Wikipedia:Statistics project (a real log of all requests to Wikipedia pages).

Create a table in ClickHouse:

    id bigint,
    dt DateTime,
    project String,
    subproject String,
    path String,
    hits UInt64,
    size UInt64
ENGINE = MergeTree
0 rows in set. Elapsed: 0.010 sec.

Please note that I’m using the new ClickHouse custom partitioning. It does not require that you create a separate date column to map the table in MySQL to the same table structure in ClickHouse

Now I can “pipe” data directly from MySQL to ClickHouse:

mysql --quick -h localhost wikistats -NBe
"SELECT concat(id,',"',dt,'","',project,'","',subproject,'","', path,'",',hits,',',size) FROM wikistats" |
clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT CSV"

Thirdwe need to set up a constant archiving process so that the data is removed from MySQL and transferred to ClickHouse. To do that we can use the “pt-archiver” tool (part of Percona Toolkit). In this case, we can first archive to a file and then load that file to ClickHouse. Here is the example:

Remove data from MySQL and load to a file (tsv):

pt-archiver --source h=localhost,D=wikistats,t=wikistats,i=dt --where "dt <= '2018-01-01 0:00:00'"  --file load_to_clickhouse.txt --bulk-delete --limit 100000 --progress=100000
TIME                ELAPSED   COUNT
2018-01-25T18:19:59       0       0
2018-01-25T18:20:08       8  100000
2018-01-25T18:20:17      18  200000
2018-01-25T18:20:26      27  300000
2018-01-25T18:20:36      36  400000
2018-01-25T18:20:45      45  500000
2018-01-25T18:20:54      54  600000
2018-01-25T18:21:03      64  700000
2018-01-25T18:21:13      73  800000
2018-01-25T18:21:23      83  900000
2018-01-25T18:21:32      93 1000000
2018-01-25T18:21:42     102 1100000

Load the file to ClickHouse:

cat load_to_clickhouse.txt | clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT TSV"

The newer version of pt-archiver can use a CSV format as well:

pt-archiver --source h=localhost,D=wikitest,t=wikistats,i=dt --where "dt <= '2018-01-01 0:00:00'"  --file load_to_clickhouse.csv --output-format csv --bulk-delete --limit 10000 --progress=10000

How Much Faster Is It?

Actually, it is much faster in ClickHouse. Even the queries that are based on index scans can be much slower in MySQL compared to ClickHouse.

For example, in MySQL just counting the number of rows for one year can take 34 seconds (index scan):

mysql> select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00';
| count(*)  |
| 103161991 |
1 row in set (34.82 sec)
mysql> explain select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00'G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wikistats
   partitions: NULL
         type: range
possible_keys: dt
          key: dt
      key_len: 6
          ref: NULL
         rows: 227206802
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

In ClickHouse, it only takes 0.062 sec:

:) select count(*) from wikistats where dt between  toDateTime('2017-01-01 00:00:00') and  toDateTime('2017-12-31 00:00:00');
SELECT count(*)
FROM wikistats
WHERE (dt >= toDateTime('2017-01-01 00:00:00')) AND (dt <= toDateTime('2017-12-31 00:00:00'))
? 103161991 ?
1 rows in set. Elapsed: 0.062 sec. Processed 103.16 million rows, 412.65 MB (1.67 billion rows/s., 6.68 GB/s.)

Size on Disk

In my previous blog on comparing ClickHouse to Apache Spark to MariaDB, I also compared disk size. Usually, we can expect a 10x to 5x decrease in disk size in ClickHouse due to compression. Wikipedia:Statistics, for example, contains actual URIs, which can be quite large due to the article name/search phrase. This can be compressed very well. If we use only integers or use MD5 / SHA1 hashes instead of storing actual URIs, we can expect much smaller compression (i.e., 3x). Even with a 3x compression ratio, it is still pretty good as long-term storage.


As the data in MySQL keeps growing, the performance for all the queries will keep decreasing. Typically, queries that originally took milliseconds can now take seconds (or more). That requires a lot of changes (code, MySQL, etc.) to make faster.

The main goal of archiving the data is to increase performance (“make MySQL fast again”), decrease costs and improve ease of maintenance (backup/restore, cloning the replication slave, etc.). Archiving to ClickHouse allows you to preserve old data and make it available for reports.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com