May
03
2018
--

Causes and Workarounds for Slave Performance Too Slow with Row-Based Events


Recently I worked on a customer issue that I would describe as “slave performance too slow”. During a quick analysis, I found that the replication slave SQL thread could not keep up while processing row-based events from the master’s binary log.

For example:

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                          ...
              Master_Log_File: binlog.0000185
          Read_Master_Log_Pos: 86698585
                          ...
        Relay_Master_Log_File: binlog.0000185
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                          ...
          Exec_Master_Log_Pos: 380
              Relay_Log_Space: 85699128
                          ...
                  Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c
                          ...
           Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057
            Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2,
98974e7f-2fbc-18e9-72cd-07003817585c:1-1056
                          ...

The processlist state for the SQL thread can be one of the following: Reading event from the relay log, System lock, or potentially some other state. In my case:

mysql> SHOW PROCESSLIST;
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+
| Id | User            | Host            | db   | Command | Time | State                            | Info             |
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+
...
|  4 | system user     |                 | NULL | Connect |  268 | Reading event from the relay log | NULL             |
...
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

What causes that?

Let’s take a look at what could potentially cause such behavior and what we need to pay attention to. When the SQL thread applies a change from a row-based event, it has to locate the exact row that was updated. With a primary key, this is trivial, as only one row can possibly have the same value for the primary key.

However, if there is no primary key on the table on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete. It repeats the search for each updated row. This search is both very resource intensive (CPU usage can reach 100%) and slow, causing the slave to fall behind.

For InnoDB tables, the “hidden” key used for the clustered index for tables without a primary key cannot be used to avoid searching the entire table for the rows to update or delete. We need to keep in mind that the “hidden” key is unique only to each MySQL instance, so the replication master and replication slave generally don’t have the same values for the “hidden” key for the same row.

What can we do to solve that?

The best solution is to ensure that all tables have a primary key. This not only ensures the SQL thread can easily locate rows to update or delete, but it is also considered a best practice since it ensures all rows are unique.

If there is no way to logically add a natural primary key for the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.
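As a minimal sketch (the table and column names here are hypothetical), adding such a surrogate key could look like this:

-- Adds a surrogate auto-increment primary key; note that this rebuilds the table,
-- so on large tables consider an online schema change tool.
ALTER TABLE orders_history
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;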

The query below helps you to locate tables without a primary key:

SELECT tables.table_schema, tables.table_name, tables.table_rows
      FROM information_schema.tables
      LEFT JOIN (
        SELECT table_schema, table_name
        FROM information_schema.statistics
        GROUP BY table_schema, table_name, index_name
        HAVING
          SUM(
            CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
          ) = COUNT(*)
      ) puks
      ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
      WHERE puks.table_name IS NULL
        AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
        AND tables.table_type = 'BASE TABLE' AND engine='InnoDB';

Please note that for InnoDB, there must always be a unique NOT NULL key for all tables. It is required for the clustered index. So adding an explicit “dummy” column as suggested above will not add to the overall storage requirements as it will merely replace the hidden key.

It’s not always possible to add a primary key to the table immediately: for example, there may be many relations on the application side, it may be a legacy system, resources may be lacking, or the application behavior after the change may be unknown and require testing.

In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.

The search algorithm is set using the slave_rows_search_algorithms option which is available in MySQL 5.6 and later. The default value is to use an index scan if possible, otherwise a table scan.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#option_mysqld_slave-rows-search-algorithms

However, for tables without a primary key, it helps to use a hash scan, which causes the SQL thread to temporarily cache hashes to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically using:

mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';

Just to note INDEX_SCAN,HASH_SCAN is the default value in MySQL 8.0.

One thing to be aware of when using hash scans is that the hashes are only reused within one row-based event. (Each row-based event may contain changes to several rows in the same table originating from the same SQL statement.)

The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event. The default max event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when:

  1. Several rows fit into one row-based event. It may help to increase the value of binlog_row_event_max_size on the replication master if you perform updates or deletes on large rows (e.g., with blob or text data). You can only set binlog_row_event_max_size in the MySQL configuration file, and changing this value requires a restart (see the configuration sketch below).
  2. One statement changes several rows.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-binary-log.html#option_mysqld_binlog-row-event-max-size
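A rough configuration sketch on the replication master (the value shown is only an example, not a recommendation):

# my.cnf on the replication master
[mysqld]
# allow more rows per row-based event so hash scans can reuse the cached hashes
binlog_row_event_max_size = 32768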

Conclusion

Even if enabling hash scans improves the performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be the general rule of thumb in schema design, in order to avoid or minimize issues like the “slave performance too slow” problem described in this post.

Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.


May
01
2018
--

MongoDB Rollback in replicaset


In this blog post, we’ll look at how MongoDB rollback works during replicaset failovers.

In recent versions, MongoDB has provided lots of features related to replicaset and automatic failover. When it comes to failover, the next question that arises is “How does MongoDB ROLLBACK work during replicaset failover?”

If a PRIMARY member (say node A) steps down with some writes that were executed but not yet replicated to the SECONDARY members, then a ROLLBACK occurs on the former PRIMARY A when it rejoins the replicaset. I’ll explain below how the ROLLBACK works!

ROLLBACK Scenario:

ROLLBACK is rare in a replicaset as MongoDB tries to avoid it by replicating the operations from PRIMARY to SECONDARY without delay, under normal conditions. Most of the time ROLLBACK occurs in the event of network partitioning, or if SECONDARY members can’t keep up with the throughput of operations on the former PRIMARY.

ROLLBACK Process:

We will walk through the process with a test. I used Docker for this test with the MongoDB 3.2 Jessie image to set up a replicaset with members mongo1 – A, mongo2 – B, mongo3 – C, and set priority 10 for A. Now A is PRIMARY, as expected in the replicaset. We need to write some data into A and create a network partition with B and C at the same time. For that, I inserted 25000 documents into A and disconnected it from the network at the same time.

Terminal 1 (A’s mongo prompt):

my-mongo-set:PRIMARY> for (var i = 1; i <= 25000; i++) {
...    db.testData.insert( { x : i } )
... }
WriteResult({ "nInserted" : 1 })
my-mongo-set:PRIMARY> db.testD2018-03-30T17:34:51.455+0530 I NETWORK  [thread1] trying reconnect to 127.0.0.1:30001 (127.0.0.1) failed
2018-03-30T17:34:51.464+0530 I NETWORK  [thread1] reconnect 127.0.0.1:30001 (127.0.0.1) ok
                      db.testD
admin.testD
my-mongo-set:SECONDARY> rs.slaveOk()
my-mongo-set:SECONDARY> db.testData.count()
25000

Terminal2:

Vinodhs-MBP:~ vinodhkrish$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                           NAMES
b27d82ac2439        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days           0.0.0.0:30003->27017/tcp        mongo3
2b39f9e41973        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days           0.0.0.0:30002->27017/tcp        mongo2
105b6df757d7        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days           0.0.0.0:30001->27017/tcp        mongo1
Vinodhs-MBP:~ vinodhkrish$ docker network disconnect my-mongo-cluster mongo1

Member A has now become SECONDARY, because it couldn’t reach the other members in the replicaset. On the other side, B and C see that A is not reachable, and B is elected as PRIMARY. We can see that some inserts from the former PRIMARY A were replicated to B before the network split happened.

(B node)

my-mongo-set:PRIMARY> db.testData.count()
15003

Now do some write operations on the current PRIMARY – B – and then let node A rejoin the network by adding the container back to the bridge network. You can observe below that node A’s member state changes in the mongo prompt. (I just connected to A and pressed ENTER/RETURN many times to see the member states; you can also see them in the log file):

(A node)

Vinodhs-MacBook-Pro:mongodb-osx-x86_64-3.2.19 vinodhkrish$ ./bin/mongo 127.0.0.1:30001/admin
MongoDB shell version: 3.2.19
connecting to: 127.0.0.1:30001/admin
my-mongo-set:ROLLBACK> 
my-mongo-set:RECOVERING> 
my-mongo-set:SECONDARY> 
my-mongo-set:SECONDARY> 
my-mongo-set:PRIMARY>

ROLLBACK Internal

From the MongoDB point of view, let’s walk through the replicaset process to understand what happened above. Normally a SECONDARY member syncs the oplog entries from its syncSource (the member from which the data is replicated) by using the oplogFetcher. The oplogFetcher first sends a find() command against the syncSource’s oplog, and then follows with a series of getMores on the cursor. When node A rejoins the replicaset, its oplogFetcher sends the find() command to syncSource node B with a greater-than-or-equal predicate on the timestamp of the last oplog entry it has fetched. The find() command should therefore return at least one document. If it does not, it means that the syncSource is behind, and node A will not replicate from it but will look for another syncSource.

In this case, A’s oplogFetcher sees that the first document returned from node B does not match the last entry in its oplog. That means node A’s oplog has diverged from node B’s and it should go into ROLLBACK.

Node A first finds the common point between its oplog and its syncSource B’s oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them. Here, 9997 inserts were missing from the B and C nodes, so they are rolled back on A, and the affected documents are written out so that they can be recovered later.

2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 30 12:03:52:139). source's GTE: (term: 5, timestamp: Mar 30 12:05:37:1) hashes: (3789163619674410187/3226093795606474294)
2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] rollback 0
2018-03-30T12:08:37.160+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] beginning rollback
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] rollback 1
2018-03-30T12:08:37.164+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Mar 30 12:03:52:139
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback their last optime: Mar 30 12:08:17:1c5
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -265 seconds
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback common point is (term: 4, timestamp: Mar 30 12:03:46:d2)
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback 3 fixup
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] rollback 3.5
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] Setting minvalid to (term: 5, timestamp: Mar 30 12:08:17:1c5)
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4 n:1
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.6
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.7
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 5 d:9997 u:0
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 6
2018-03-30T12:08:38.394+0000 I REPL     [rsBackgroundSync] rollback done
2018-03-30T12:08:38.396+0000 I REPL     [rsBackgroundSync] rollback finished

ROLLBACK data

Where do these 9997 rolled-back documents go? MongoDB writes them under the rollback directory in the dbpath. The rollback files are named with the collection namespace as the prefix and the date and time as the suffix. They are in BSON format, and we need to convert them into JSON to analyze them and plan the next course of action. In our case, the testData collection’s rollback data looks as follows:

root@105b6df757d7:/# cd /data/db
root@105b6df757d7:/data/db# ls -l rollback/
total 324K
-rw-r--r-- 1 mongodb mongodb 323K Mar 30 12:08 admin.testData.2018-03-30T12-08-38.0.bson

root@105b6df757d7:/data/db/rollback# bsondump admin.testData.2018-03-30T12-08-38.0.bson > rollback.json
2018-03-30T12:13:00.033+0000 9997 objects found
root@105b6df757d7:/data/db/rollback# head rollback.json
{"_id":{"$oid":"5abe279f97044083811b5975"},"x":15004.0}
{"_id":{"$oid":"5abe279f97044083811b5976"},"x":15005.0}
{"_id":{"$oid":"5abe279f97044083811b5977"},"x":15006.0}
{"_id":{"$oid":"5abe279f97044083811b5978"},"x":15007.0}
{"_id":{"$oid":"5abe279f97044083811b5979"},"x":15008.0}
{"_id":{"$oid":"5abe279f97044083811b5980"},"x":15009.0}
{"_id":{"$oid":"5abe279f97044083811b5981"},"x":15010.0}

That’s it? Now check the count of the testData collection on node A:

my-mongo-set:PRIMARY> db.testData.count()
15003

So the 9997 records that were rolled back into the rollback directory were also removed from the collection. This ensures data consistency throughout the replicaset.

How to avoid ROLLBACK – writeConcern

The default writeConcern in the replicaset is w:1. That is, when a client writes into a replicaset, it receives an acknowledgment from the PRIMARY alone and doesn’t wait for acknowledgment from the SECONDARY members. If you want to avoid the ROLLBACK scenario in your environment, then you have to use {w:"majority"} or {w:n}, where n is greater than 1 and less than or equal to the number of members in your replica set. This ensures that the writes are propagated to that many members of the replica set before the acknowledgment is sent to the client, which avoids the ROLLBACK problem.

But please be careful not to set writeConcern too high, because it also affects write performance: the acknowledgment needs to be received from the number of members specified in the value. The value {w:"majority"} acknowledges that write operations have propagated to the majority of voting nodes, including the primary, and it is suitable for most environments.
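As a rough illustration in the mongo shell (reusing the testData collection from the test above; the wtimeout value is an arbitrary choice):

// insert acknowledged by a majority of the replica set members, with a 5-second timeout
db.testData.insert(
    { x : 1 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)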

ROLLBACK – Limitation

The main thing to note here is that mongod will not roll back more than 300MB of data. In such cases, we need to manually check the instance and recover the data. You can see the message below in mongod.log in such cases:

[replica set sync] replSet syncThread: 13410 replSet too much data to roll back

Understanding this simple ROLLBACK background helps us decide what needs to be done with the rolled-back data. It also helps us avoid such scenarios, because data is data and is very important!


May
01
2018
--

Webinar Thursday May 3, 2018: Running MongoDB in Production (Part 3)


Please join Percona’s Senior Technical Operations Architect, Tim Vaillancourt, as he presents Running MongoDB in Production (Part 3) on Thursday, May 3, 2018, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

Are you a seasoned MySQL DBA who needs to add MongoDB to your skills? Are you used to managing a small environment that runs well, but want to know what you might not know yet?

MongoDB works well, but when it has issues, the number one question is “where should I go to solve a problem?”

This webinar on running MongoDB covers:

  • Troubleshooting
    • Log File
    • Slow Query
    • Operations
  • Schema Design
    • Data Types
    • Indexes
    • Workflows
  • Data Integrity
    • Replica Sets
    • Write Concerns
    • Data Recovery
  • Scaling (Read/Writes)

Register for the webinar now.

Missed Part 1 and Part 2 of our Running MongoDB in Production series? You can watch and download the slides of Part 1 here and watch or download the slides of Part 2 here.

Timothy Vaillancourt, Senior Technical Operations Architect

Tim joined Percona in 2016 as Sr. Technical Operations Architect for MongoDB with the goal of making MongoDB operations as smooth as possible. With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming, combined with experience tuning systems from the hard disk all the way up to the end-user, Tim has spent time in nearly every area of the modern IT stack with many lessons learned.

Tim is based in Amsterdam, NL and enjoys traveling, coding and music. Prior to Percona Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (“Battlefield” series, “Mirrors Edge” series, “Star Wars: Battlefront”) launch and operate smoothly while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS. Prior to moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.


Apr
30
2018
--

Keep Sensitive Data Secure in a Replication Setup


This blog post describes how to keep sensitive data secure on slave servers in a MySQL async replication setup.

Almost every web application has sensitive data: passwords, SSNs, credit cards, emails, etc. Splitting the database into secure and “public” parts allows you to restrict user and application access to the sensitive data.

Field encryption

This approach is based on MySQL encryption functions or on client-side encryption: an authorized user knows a secret, but the encrypted data is still distributed to all slaves.

  • If possible, use hashes with a big enough salt, and do not store real sensitive data in the database. A good example is passwords: the end-user sends the login and password, the application/SQL code calculates the hash with a salt value unique to each end-user, and compares the hash with the value stored in the database. Even if an attacker gets the hashes, it’s still hard or even impossible to extract the real passwords for all users. Make sure that you are using a good random number generator for the salt, an application-side secret, and a good hash function (not MD5); see the sketch after this list.
  • Encryption is not suitable if you are going to provide public access to your database (via slave dumps in sql/csv/xml/json format).
  • Encryption is a complex topic. Check here for a good blog post explaining hashing usage, and try to find a security consultant if you are inventing some “new” method of storing and encrypting data.
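A minimal sketch of the salted-hash idea using MySQL functions (the variable names are made up; a dedicated password-hashing algorithm on the application side is generally preferable):

-- generate a per-user random salt and hash the password together with it
SET @salt = HEX(RANDOM_BYTES(16));
SET @hash = SHA2(CONCAT(@salt, 'end-user password'), 512);
-- store @salt and @hash; at login, recompute SHA2(CONCAT(stored_salt, supplied_password), 512)
-- and compare it with the stored hash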

Field encryption example

I’m using a single server setup, because the most important part of data separation should be done on the application side. The secure part of the application has a secret passphrase. For example, you can place the code working with authentication, full profile and payments on a separate server and use a dedicated MySQL account.

create database encrypted;
use encrypted;
create table t(c1 int, c2 varchar(255), rnd_pad varbinary(16), primary key(c1));
SET block_encryption_mode = 'aes-256-cbc';
SET @key_str = SHA2('My secret passphrase',512);
SET @init_vector = RANDOM_BYTES(16);
insert into t (c1,c2, rnd_pad) values (1, AES_ENCRYPT('Secret', @key_str, @init_vector), @init_vector);
-- decrypt data
select c1, AES_DECRYPT(c2,@key_str, rnd_pad) from t;

Summary

  • GOOD: Master and slave servers have exactly the same data and no problems with replication.
  • GOOD: Even if two different end-users have exactly the same password, the stored values are different due to random bytes in the init vector for AES encryption.
  • GOOD: Both the encryption and the random number generation use an external library (OpenSSL).
  • CONF: It’s important to have binlog_format=ROW to avoid sending the secret to slave servers.
  • CONF: Do not allow end-users to change data without changing the init_vector, especially for small strings without random padding. Each update should cause init_vector re-generation.
  • BAD: Encrypted data is still sent to slave servers. If the encryption algorithm or protocol is broken, it is possible to get access to data from an insecure part of the application.
  • BAD: The described protocol still could be insecure.

Replication filters

There are two types of replication filters: master-side filters (binlog-*-db) and slave-side filters (replicate-*).

Both could cause replication breakage. Replication filters were created for STATEMENT-based replication and are problematic with a modern binlog_format=ROW + gtid_mode=ON setup. You can find several cases related to database-level slave-side filters in this blog post. If you still need slave-side filtering, use the per-table replicate-wild-*-table options.

Master-side

Even if binary logging is disabled for a specific database, a statement could still be stored in the binary log if it is a DDL statement, or if binlog_format is STATEMENT or MIXED and the statement is executed with a different default database. For details, see the reference manual for the binlog-do-db option. In order to avoid replication issues, you should use ROW-based replication and run SET SESSION sql_log_bin=0; before each DDL statement executed against the ignored database (see the sketch below). It’s not a good idea to use binlog-do-db, because you lose control of what should be replicated.
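For example, a sketch of running a DDL statement against the ignored database without writing it to the binary log (the database and table names are placeholders):

SET SESSION sql_log_bin = 0;
ALTER TABLE secure_db.payment_tokens ADD COLUMN note VARCHAR(255);
SET SESSION sql_log_bin = 1;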

Why is binary log filtering useful? Changing the sql_log_bin variable is prohibited inside transactions, and sql_log_bin is DANGEROUS: please do not use it instead of binlog-ignore-db in production on the application side. If you need it for database administration, make sure that you always type the “session” word before sql_log_bin. Relying on it makes consistent updates of multiple entities inside the database problematic.

We may still want the ability to hide just one column of a table. And if we are ignoring the whole database, we should provide a method of reading the non-secure data on slaves or by restricted MySQL accounts. This is possible with triggers and views:

-- master side: binlog-ignore-db=test is assumed (see the Summary below), so sql_log_bin is only needed for the initial DDL
create database test;
set session sql_log_bin=0;
create table test.t(c1 int, c2 int, primary key(c1));
set session sql_log_bin=1;
create database test_insecure;
create table test_insecure.t(c1 int, c2 int default NULL, primary key(c1));
use test
delimiter //
create trigger t_aft_ins
after insert
 on test.t FOR EACH ROW
BEGIN
  INSERT test_insecure.t (c1) values (NEW.c1);
END //
create trigger t_aft_upd
after update
 on test.t FOR EACH ROW
BEGIN
  UPDATE test_insecure.t SET c1 = NEW.c1 WHERE c1 = OLD.c1;
END //
create trigger t_aft_del
after delete
 on test.t FOR EACH ROW
BEGIN
  DELETE FROM test_insecure.t WHERE c1 = OLD.c1;
END //
delimiter ;
-- just on slave:
create database test;
create view test.t as select * from test_insecure.t;
-- typical usage
INSERT INTO test.t values(1,1234);
SELECT * from test.t; -- works on both master and slave, c2 field will have NULL value on slave.

Summary

  • BAD: The data is not the same on the master and slaves. This potentially breaks replication. It’s not possible to use a slave’s backup to restore the master or to promote the slave to be a new master.
  • BAD: Triggers could reduce DML statement performance.
  • GOOD: The sensitive data is not sent to slaves at all (and not written to binary log).
  • GOOD: It works with GTID.
  • GOOD: It requires no application changes (or almost no application changes).
  • GOOD: binlog-ignore-db allows us to not use the dangerous sql_log_bin variable after initial table creation.


Mar
30
2018
--

Multi-Source Replication Performance with GTID


In this blog post, we’ll look at the performance of multi-source replication with GTID.

Multi-Source Replication is a topology I’ve seen discussed recently, so I decided to look into how it performs with the different replication concepts. Multi-source replication uses replication channels, which allow a slave to replicate from multiple masters. This is a great way to consolidate data that has been sharded for production, or to simplify the analytics process by using the same server. Since multiple masters are taking writes, care is needed not to overlook the slave. The traditional replication concept uses the binary log file name and the position inside that file.

This was the standard until the release of global transaction identifiers (GTID). I have set up a test environment to validate which concept would perform better, and be a better choice for use in this topology.

SETUP

My test suite is rather simple, consisting of only three virtual machines: two masters and one slave. The slave’s replication channels are set up using the same concept for each run, and no run had any replication filters. To prevent replication errors, each master took writes against a different schema, and user grants are identical on all three servers. The setup below ran with both replication channels using binary log file and position; then the tables were dropped and the servers were changed to use GTID for the next run.
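A hedged sketch of how the two replication channels can be configured on the slave (host names and credentials are placeholders; the binary log file and position runs would use MASTER_LOG_FILE/MASTER_LOG_POS instead of MASTER_AUTO_POSITION):

CHANGE MASTER TO MASTER_HOST='master1', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_AUTO_POSITION=1 FOR CHANNEL 'db1';
CHANGE MASTER TO MASTER_HOST='master2', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_AUTO_POSITION=1 FOR CHANNEL 'db3';
START SLAVE FOR CHANNEL 'db1';
START SLAVE FOR CHANNEL 'db3';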

Prepare the sysbench tables:

sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db1 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare
sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db3 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare

I used a read-only sysbench to warm up the InnoDB buffer pool. Both commands ran on the slave to ensure both schemas were loaded into the buffer pool:

sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db1 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua run
sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db3 --range_size=100 --table_size=1000000 --tables=5 --threads=5 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua run

After warming up the buffer pool, the slave should be fully caught up with both masters. To remove IO contention as a possible influencer, I stopped the SQL threads while I generated load on the masters. Leaving the IO threads running allowed the slave to write the relay logs during this process, and helped ensure that the test only measures the difference in the slave SQL thread.

stop slave sql thread for channel 'db1'; stop slave sql thread for channel 'db3';

Each master had a sysbench run against it for the schema that was designated to it in order to generate the writes:

sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db1 --range_size=100 --table_size=1000000 --tables=5 --threads=1 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_write_only.lua run
sysbench --db-driver=mysql --mysql-user= --mysql-password='' --mysql-db=db3 --range_size=100 --table_size=1000000 --tables=5 --threads=1 --events=0 --time=3600 --rand-type=uniform /usr/share/sysbench/oltp_write_only.lua run

Once the writes completed, I monitored the IO activity on the slave to ensure it was 100% idle and that all relay logs were fully captured. Once everything was fully written, I enabled a capture of the replication lag once per minute for each replication channel, and started the slave’s SQL threads:

/usr/bin/pt-heartbeat -D db1 -h localhost --master-server-id=101 --check
/usr/bin/pt-heartbeat -D db3 -h localhost --master-server-id=103 --check
start slave sql thread for channel 'db1'; start slave sql thread for channel 'db3';

The above chart depicts the cumulative lag seen on the slave by pt-heartbeat since starting the SQL threads. The first thing to notice is that the replication delay was higher overall with the binary log. This could be because the SQL thread was stopped for a different amount of time. This may appear to give GTID an advantage in this test, but remember that in this test the amount of delay is less important than the processing rate. Focusing on when replication began to show a significant change towards catching up, you can see that there are two distinct drops in delay. This is caused by the fact that the slave has two replication threads that individually monitor their delay. One of the replication threads caught up fully and the other was delayed for a bit longer.

In every test run, GTID took slightly longer to fully catch up than the traditional method. There are a couple of reasons to expect GTIDs to be slightly slower. One possibility is that there are additional writes on the slave in order to keep track of all the GTIDs that the slave ran. I removed the initial write to the relay log, but we must retain the committed GTID, and this causes additional writes. I used the default settings for MySQL, and as such log_slave_updates was disabled. This causes the replicated GTIDs to be stored in a table, which is periodically compressed. You can find more details on how log_slave_updates impacts GTID replication here.

So the question remains: why should we use GTID, especially with multi-source replication? I’ve found that the answer lies in the composition of a GTID. From MySQL’s GTID Concepts, a GTID is composed of two parts: the source_id and the transaction_id. The source_id is a unique identifier for the server that originally wrote the transaction. This allows you to identify in the binary log which master took the initial write, so you can pinpoint problems much more easily.

The excerpt below from DB1’s binary log (a master in this test) shows that, before the transaction is written, “SET @@SESSION.GTID_NEXT” is run. This is the GTID that you can follow through the rest of the topology to identify the same transaction.

“d1ab72e9-0220-11e8-aee7-00155dab6104” is the server_uuid for DB1, and 270035 is the transaction id.

SET @@SESSION.GTID_NEXT= 'd1ab72e9-0220-11e8-aee7-00155dab6104:270035'/*!*/;
# at 212345
#180221 15:37:56 server id 101 end_log_pos 212416 CRC32 0x758a2d77 Query thread_id=15 exec_time=0 error_code=0
SET TIMESTAMP=1519245476/*!*/;
BEGIN
/*!*/;
# at 212416
#180221 15:37:56 server id 101 end_log_pos 212472 CRC32 0x4363b430 Table_map: `db1`.`sbtest1` mapped to number 109
# at 212472
#180221 15:37:56 server id 101 end_log_pos 212886 CRC32 0xebc7dd07 Update_rows: table id 109 flags: STMT_END_F
### UPDATE `db1`.`sbtest1`
### WHERE
### @1=654656 /* INT meta=0 nullable=0 is_null=0 */
### @2=575055 /* INT meta=0 nullable=0 is_null=0 */
### @3='20363719684-91714942007-16275727909-59392501704-12548243890-89454336635-33888955251-58527675655-80724884750-84323571901' /* STRING(120) meta=65144 nullable=0 is_null=0 */
### @4='97609582672-87128964037-28290786562-40461379888-28354441688' /* STRING(60) meta=65084 nullable=0 is_null=0 */
### SET
### @1=654656 /* INT meta=0 nullable=0 is_null=0 */
### @2=575055 /* INT meta=0 nullable=0 is_null=0 */
### @3='17385221703-35116499567-51878229032-71273693554-15554057523-51236572310-30075972872-00319230964-15844913650-16027840700' /* STRING(120) meta=65144 nullable=0 is_null=0 */
### @4='97609582672-87128964037-28290786562-40461379888-28354441688' /* STRING(60) meta=65084 nullable=0 is_null=0 */
# at 212886
#180221 15:37:56 server id 101 end_log_pos 212942 CRC32 0xa6261395 Table_map: `db1`.`sbtest3` mapped to number 111
# at 212942
#180221 15:37:56 server id 101 end_log_pos 213166 CRC32 0x2782f0ba Write_rows: table id 111 flags: STMT_END_F
### INSERT INTO `db1`.`sbtest3`
### SET
### @1=817058 /* INT meta=0 nullable=0 is_null=0 */
### @2=390619 /* INT meta=0 nullable=0 is_null=0 */
### @3='01297933619-49903746173-24451604496-63437351643-68022151381-53341425828-64598253099-03878171884-20272994102-36742295812' /* STRING(120) meta=65144 nullable=0 is_null=0 */
### @4='29893726257-50434258879-09435473253-27022021485-07601619471' /* STRING(60) meta=65084 nullable=0 is_null=0 */
# at 213166
#180221 15:37:56 server id 101 end_log_pos 213197 CRC32 0x5814a60c Xid = 2313
COMMIT/*!*/;
# at 213197

Conclusion

Based on the sysbench tests I ran, GTID replication has slightly lower throughput. It took about two to three minutes longer to process an hour’s worth of writes on two masters, compared to binary log replication. GTID’s strengths lie more in how it eases the management and troubleshooting of complex replication topologies.

The GTID concept allows a slave to know exactly which server initially wrote the transaction, even in a tiered environment. This means that if you need to promote a slave from the bottom tier, to the middle tier, simply changing the master is all that is needed. The slave can pick up from the last transaction it ran on that server and continue replicating without a problem. Stephane Combaudon explains this in detail in a pair of blogs. You can find part 1 here and part 2 here. Facebook also has a great post about their experience deploying GTID-based replication and the troubles they faced.


Mar
19
2018
--

Percona XtraDB Cluster, MySQL Asynchronous Replication and log-slave-updates


Recently, I’ve been looking into issues with the interactions between MySQL asynchronous replication and Galera replication. In this blog post, I’d like to share what I’ve learned.

MySQL asynchronous replication and Galera replication interactions are complicated due to the number of factors involved (Galera replication vs. asynchronous replication, replication filters, and row-based vs. statement-based replication). So as a start, I’ll look at an issue that came up with setting up an asynchronous replication channel between two Percona XtraDB Cluster (PXC) clusters.

Here’s a view of the desired topology:

MySQL asynchronous replication

The Problem

We want to set up an asynchronous replication channel between two PXC clusters. We also set log-slave-updates on the async slave (PXC node 2a in the topology diagram).

This is an interesting configuration and results in unexpected behavior as the replication depends on the node where the operation was performed. Let’s use CREATE TABLE as an example.

  • Run CREATE TABLE on Node 1a.  The table replicates to Node 1b, but not to the nodes in cluster 2.
  • Run CREATE TABLE on Node 1b. The table replicates to all nodes (both cluster 1 and cluster 2).

Some background information

Understanding the problem requires some knowledge of MySQL threads. However, as a simplification, I’ll group the threads into three groups:

  • Main MySQL Client Thread: This thread handles the requests for the client connection (here the client is an external entity).
  • Async Replication Threads: There are multiple threads in use here, some handle I/O, and some apply the updates, but we will view them as a single entity for simplicity.
  • Galera Threads: There are also multiple threads here, similar to the Async Replication Threads. (The name Galera here refers to the underlying replication technology used by PXC.)

For more information on MySQL threads, see
https://dev.mysql.com/doc/refman/5.7/en/mysql-threads.html

Why is the data not replicating?

In the first case (CREATE TABLE executed on Node1a)

  • The table is replicated from Node1a -> Node 1b via Galera replication.
  • The table is not replicated to cluster 2, because the async replication threads do not pick up changes applied by the Galera threads (log-slave-updates is not enabled on Node 1b in this setup).

In the second case (CREATE TABLE executed on Node 1b)

  • The table is replicated from Node1b -> Node 1a via Galera replication.
  • The table is replicated from Node1b -> Node 2a via async replication. This differs from the first case because the statement is executed on the Main MySQL client thread.  The async replication threads pick up the changes and send them to Node 2a.
  • The table is replicated from Node 2a -> Node 2b via Galera replication because log-slave-updates has been enabled on Node2a.

That last part is the important bit. We can view the Galera replication threads as another set of asynchronous replication threads. So if data is coming in via async replication, it has to be made visible to Galera by log-slave-updates. This is true in the other direction also: log-slave-updates must be enabled for Galera to supply data to async replication.

This is very similar to chained replication
https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html

The Solution

In this scenario, the answer is to set log-slave-updates on Node 1b (the async master) and on Node 2a (the async slave).

We set log-slave-updates on node 1b to allow the async threads to pick up the changes from the Galera threads.

We set log-slave-updates on node 2a to allow the Galera threads to pick up the changes from the async threads. Starting with PXC 5.7.17, calling START SLAVE on a PXC node returns an error unless log-slave-updates is enabled.

You must enable log-slave-updates on the node for data to be transferred between Galera and asynchronous replication.

Recommendations/Best Practices

If you plan to use MySQL asynchronous replication with Percona XtraDB Cluster (either as async master or slave), we recommend that you enable log-slave-updates on all nodes within the cluster. This is (1) to ensure that any async replication connections to/from the cluster work correctly and (2) to ensure that all the nodes within a cluster share the same configuration and behavior.
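A minimal my.cnf sketch for each PXC node (the option values here are only illustrative):

[mysqld]
log_slave_updates = ON
log_bin           = binlog    # log-slave-updates requires the binary log to be enabled
server_id         = 1         # must be unique on every node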

Recommended configuration diagram for the clusters:

MySQL asynchronous replication


Mar
13
2018
--

The Multi-Source GTID Replication Maze


In this blog post, we’ll look at how to navigate some of the complexities of multi-source GTID replication.

GTID replication is often a real challenge for DBAs, especially if this has to do with multi-source GTID replication. A while back, I came across a really interesting customer environment with shards where multi-master, multi-source, multi-threaded MySQL 5.6 MIXED replication was active. This is a highly complex environment that has both pros and cons, introducing risks as a trade-off for specific customer requirements.

This is the setup of part of this environment:

I started looking into this setup when a statement broke replication between db1 and db10. Replication broke due to a statement executed on a schema that was not present on db10. This also resulted in changes originating from db1 not being pushed down to db100 via db10, since we had stopped the replication thread (for the db1 channel).

On the other hand, replication was not stopped on db2 because the schema in question was present on db2. Replication between db2 and db20 was broken as well because the schema was not present in db20.

In order to fix the db1->db10 replication, four GTIDs were injected into db10 (see the sketch below).
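The usual way to do this is to inject an empty transaction for each missing GTID; a hedged sketch (the UUID and sequence number are placeholders):

STOP SLAVE;    -- or STOP SLAVE FOR CHANNEL '...' when named channels are used
SET GTID_NEXT = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1234';
BEGIN;
COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;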

Here are some interesting blog posts regarding how to handle/fix GTID replication issues:

After injecting the GTID sets, we started replication again and everything ran fine.

 

After that, we had to check the db2->db20 replication, which, as I’ve already said, was broken as well. In this case, injecting only the first GTID trx into db20 instead of all of those causing issues on db10 was enough!

You may wonder how this is possible, right? The answer is that the rest of them were replicated from db10 to db20, although the channel was not the same.

Another strange thing is the fact that although the replication thread for the db2->db20 channel was stopped (broken), checking the slave status on db20 showed that Executed_Gtid_Set was moving for all channels even though Retrieved_Gtid_Set for the broken one was stopped! So what was happening there?

This raised my curiosity, so I decided to do some further investigation and created scenarios regarding other strange things that could happen. An interesting one was about the replication filters. In our case, I thought “What would happen in the following scenario … ?”

Let’s say we write a row from db1 to db123.table789. This row is replicated to db10 (let’s say using channel 1) and to db2 (let’s say using channel 2). On channel 1, we filter out the db123.% tables; on channel 2 we don’t. db1 writes the row and the entry to the binary log. db2 writes the row after reading the entry from the binary log, subsequently writes the entry to its own binary log, and replicates this change to db20. This change is also replicated to db10. So now, on db10 (depending on which channel finds the GTID first), it either gets filtered on channel 1 and written to db10’s own binary log as just a start/commit (an empty transaction, with any actual DDL/DML removed), or, if it is read first on channel 2 (db1 -> db2 and then db20 -> db10), it is NOT filtered out and is executed instead. Is this correct? It definitely ISN’T!

Points of interest

You can find answers to the above questions in the points of interest listed below. Although it’s not really clear through the official documentation, this is what happens with GTID replication and multi-source GTID replication:

  • As we know, GTID sets are unique across all nodes in a given cluster. In multi-source replication, Executed_Gtid_Set is common for all channels. This means that regardless of the originating channel, when a GTID transaction is executed it is recorded in all channels’ Executed_Gtid_Set. Although it’s logical (each database is unique, so if a transaction is going to affect a database it shouldn’t be tied to a single channel regardless of the channel it uses), the documentation doesn’t provide much info around this.
  • When we have multi-source, multi-level replication, there are cases where the GTID sets originating from one master can end up on one slave via different replication paths. It’s not clear if it applies any special algorithm (although it doesn’t seem that there could be one), but the preferred method seems to be FIFO. The fastest wins! This means that GTID sets can travel to the slave via different channels, and it’s related to how fast the upper-level slaves can commit changes. In fact, the path doesn’t really matter as it only executes each GTID trx once.
  • Replication filters are global regardless of the channel. This means each filter applies to all channels. This is expected, as we can’t define a replication filter per channel. In order to be able to debug such cases, adding a small replication delay per channel seems like a good idea.

Mar
08
2018
--

Binlog Encryption with Percona Server for MySQL


In this blog post, we’ll look at how to turn on binlog encryption in Percona Server for MySQL.

Why do I need this?

As you probably know, Percona Server for MySQL’s binlog contains sensitive information. Replication uses the binlog to copy events between servers; the binlogs contain all the information from one server so that it can be applied on another. In other words, if somebody has access to a binlog, they have access to all the data in the server. Moreover, that person (or “hacker”) could create a clone copy of our server by just making a replica of it, since in the end they have access to our binlog. This shows how important protecting a binlog really is: leakage of binlogs not only makes a particular table/tablespace or a group of tables accessible to a hacker, but puts literally the whole server at risk. The same is true of the relay log, which is really a copy of the binlog on the slave server.

But have no fear – a new feature comes to the rescue: binary log encryption. Since Percona Server for MySQL version 5.7.20-19 (beta version), it is possible to enable binlog encryption for all the binlogs and relay logs produced by the server.

How do you turn it on?

To start binlog encryption, you need to start the server with --encrypt-binlog=1. This, in turn, requires --master_verify_checksum and --binlog_checksum both to be ON. Also, you need to install one of the keyring plugins.
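A minimal my.cnf sketch (the keyring_file plugin and its path are just one possible choice, not a requirement):

[mysqld]
early-plugin-load      = keyring_file.so
keyring_file_data      = /var/lib/mysql-keyring/keyring
encrypt-binlog         = 1
master_verify_checksum = ON
binlog_checksum        = CRC32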

From now on all the binlogs and relay logs produced by the server get encrypted. However, for the replication to be safe as a whole the connection between servers also has to be encrypted. See https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-encrypted-connections.html for details on how to do this.

Please note that this does not mean that all binlogs in our replication setup get encrypted automatically. Remember, you need to turn on encrypt-binlog on slave servers too, even if they do not produce binlog files: slave servers still produce relay logs when replicating from a master server. Turn on encrypt-binlog on slave servers so that their relay logs also get encrypted.

How does this work in the big picture?

The master encrypts the event before writing it into the binlog. The slave connects to the master and asks for events. The master decrypts the events from the binary log and sends them over to the slave.

Note that events sent between the master and slave servers are not encrypted! This is why the connection between the master and slave needs to use a secure channel, i.e., TLS.

The slave receives events from the master, encrypts them and writes them down into the relay log.

That is why we need to enable encrypt-binlog on a slave. The relay log has to get encrypted too.

Next, the slave decrypts events from the relay log and applies them. After applying an event, the slave encrypts it and writes it down into its own binlog file (if the binary log is enabled on the slave).

In summary, to make our replication secure, we need to:

  • Turn on encrypt-binlog on the master
  • Turn on encrypt-binlog on the slave
  • Ensure the connection between the master and slave uses TLS.

It’s worth noting that servers in replication have no idea if other servers are encrypted or not.

Why do master_verify_checksum and binlog_checksum need to be turned ON?

This is needed for “authenticated encryption”. Simply put, this is how we make sure that what we decrypt has not been changed by a third party. It also checks whether the key that was used to decrypt the event was the correct one.

Digging deeper with mysqlbinlog

Mysqlbinlog is a standalone application that lets you read binlog files. As I write this blog post, it is not capable of decrypting binary logs – at least not by itself. However, it can still read encrypted binlog files when using a running Percona Server for MySQL. Use the option --read-from-remote-server to read the binary log produced by a given server.

Let’s see what happens when we try to read an encrypted binlog with mysqlbinlog without --read-from-remote-server enabled. You will get something like this:

As you can see, it is only possible to read the binary log until event type 9f gets read. This event is the Start_encryption_event. After this event, the rest of the binlog is encrypted. One thing to note is that Start_encryption_event is never propagated in replication. For instance, say the master server is run with --encrypt-binlog. This means that the server writes Start_encryption_event to its binary logs. However, it is never sent to the slave server (the slave has no idea whether the master is encrypted).

Another option you can use with mysqlbinlog is the --force option. It forces mysqlbinlog to read all the events from the binlog, even if they are encrypted. You will see something like this in the output:

As you can see, it is only possible to read the first two events – up to the Start_encryption_event. However, this time we can see that there are other events that follow, which are encrypted.

Running mysqlbinlog (without --read-from-remote-server) on encrypted binary logs may only make sense if we want to check whether a given binary log is encrypted. For point-in-time recovery, and for other purposes that might require reading an encrypted binlog, we would use mysqlbinlog with the --read-from-remote-server option.

For instance, if we want to read binlog master-bin.000001, and Percona Server for MySQL is running on 127.0.0.1, port 3033, with user:robert, password:hard_password, we would use mysqlbinlog like this:

mysqlbinlog --read-from-remote-server --protocol=tcp --host=127.0.0.1 --port=3033 --user=robert --password=hard_password master-bin.000001

When you look at the output of this command, you see something like this:

You can now see the decrypted binlog. One interesting thing to note here is that we do not see our Start_encryption_event (type 9f). This proves my point – Start_encryption_event never leaves the server (we are reading from the server now, as we use --read-from-remote-server).

For more information how to use mysqlbinlog for point-in-time recovery see https://dev.mysql.com/doc/refman/5.7/en/point-in-time-recovery.html.

However, for more modern approaches for point-in-time recovery that do not use mysqlbinlog and make use of parallel appliers, see here:

Have fun with binlog encryption!

Feb
19
2018
--

Archiving MySQL Tables in ClickHouse


In this blog post, I will talk about archiving MySQL tables in ClickHouse for storage and analytics.

Why Archive?

Hard drives are cheap nowadays, but storing lots of data in MySQL is not practical and can cause all sorts of performance bottlenecks. To name just a few issues:

  1. The larger the table and index, the slower the performance of all operations (both writes and reads)
  2. Backup and restore for terabytes of data is more challenging, and if we need to have redundancy (replication slave, clustering, etc.) we will have to store all the data N times

The answer is archiving old data. Archiving does not necessarily mean that the data will be permanently removed. Instead, the archived data can be placed into long-term storage (i.e., AWS S3) or loaded into a special-purpose database that is optimized for storage (with compression) and reporting. The data then remains available.

Actually, there are multiple use cases:

  • Sometimes the data just needs to be stored (i.e., for regulatory purposes) but does not have to be readily available (it’s not “customer facing” data)
  • The data might be useful for debugging or investigation (i.e., application or access logs)
  • In some cases, the data needs to be available for the customer (i.e., historical reports or bank transactions for the last six years)

In all of those cases, we can move the older data away from MySQL and load it into a “big data” solution. Even if the data needs to be available, we can still move it from the main MySQL server to another system. In this blog post, I will look at archiving MySQL tables in ClickHouse for long-term storage and real-time queries.

How To Archive?

Let’s say we have a 650G table that stores the history of all transactions, and we want to start archiving it. How can we approach this?

First, we will need to split this table into “old” and “new”. I assume that the table is not partitioned (partitioned tables are much easier to deal with). For example, if we have data from 2008 (ten years worth) but only need to store data from the last two months in the main MySQL environment, then deleting the old data would be challenging. So instead of deleting 99% of the data from a huge table, we can create a new table and load the newer data into that. Then rename (swap) the tables. The process might look like this:

  1. CREATE TABLE transactions_new LIKE transactions
  2. INSERT INTO transactions_new SELECT * FROM transactions WHERE trx_date > now() - interval 2 month
  3. RENAME TABLE transactions TO transactions_old, transactions_new TO transactions
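Spelled out as SQL (same table and column names as in the list above; note that rows written between the INSERT and the RENAME would be missed, so in practice the cut-over needs a quiet moment or a final catch-up pass):

CREATE TABLE transactions_new LIKE transactions;
INSERT INTO transactions_new
  SELECT * FROM transactions
  WHERE trx_date > NOW() - INTERVAL 2 MONTH;
RENAME TABLE transactions TO transactions_old,
             transactions_new TO transactions;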

Second, we need to move the transactions_old into ClickHouse. This is straightforward — we can pipe data from MySQL to ClickHouse directly. To demonstrate I will use the Wikipedia:Statistics project (a real log of all requests to Wikipedia pages).

Create a table in ClickHouse:

CREATE TABLE wikistat
(
    id bigint,
    dt DateTime,
    project String,
    subproject String,
    path String,
    hits UInt64,
    size UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(dt)
ORDER BY dt
Ok.
0 rows in set. Elapsed: 0.010 sec.

Please note that I’m using the new ClickHouse custom partitioning. It does not require that you create a separate date column to map the table in MySQL to the same table structure in ClickHouse.

Now I can “pipe” data directly from MySQL to ClickHouse:

mysql --quick -h localhost wikistats -NBe \
"SELECT concat(id,',\"',dt,'\",\"',project,'\",\"',subproject,'\",\"',path,'\",',hits,',',size) FROM wikistats" |
clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT CSV"

Third, we need to set up a constant archiving process so that the data is removed from MySQL and transferred to ClickHouse. To do that we can use the “pt-archiver” tool (part of Percona Toolkit). In this case, we first archive to a file and then load that file into ClickHouse. Here is the example:

Remove data from MySQL and load to a file (tsv):

pt-archiver --source h=localhost,D=wikistats,t=wikistats,i=dt --where "dt <= '2018-01-01 0:00:00'"  --file load_to_clickhouse.txt --bulk-delete --limit 100000 --progress=100000
TIME                ELAPSED   COUNT
2018-01-25T18:19:59       0       0
2018-01-25T18:20:08       8  100000
2018-01-25T18:20:17      18  200000
2018-01-25T18:20:26      27  300000
2018-01-25T18:20:36      36  400000
2018-01-25T18:20:45      45  500000
2018-01-25T18:20:54      54  600000
2018-01-25T18:21:03      64  700000
2018-01-25T18:21:13      73  800000
2018-01-25T18:21:23      83  900000
2018-01-25T18:21:32      93 1000000
2018-01-25T18:21:42     102 1100000
...

Load the file to ClickHouse:

cat load_to_clickhouse.txt | clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT TSV"

The newer version of pt-archiver can use a CSV format as well:

pt-archiver --source h=localhost,D=wikitest,t=wikistats,i=dt --where "dt <= '2018-01-01 0:00:00'"  --file load_to_clickhouse.csv --output-format csv --bulk-delete --limit 10000 --progress=10000

How Much Faster Is It?

It is much faster in ClickHouse. Even queries that are served by an index scan in MySQL can be much slower than the equivalent query in ClickHouse.

For example, in MySQL, just counting the number of rows for one year can take 34 seconds (index scan):

mysql> select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00';
+-----------+
| count(*)  |
+-----------+
| 103161991 |
+-----------+
1 row in set (34.82 sec)
mysql> explain select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wikistats
   partitions: NULL
         type: range
possible_keys: dt
          key: dt
      key_len: 6
          ref: NULL
         rows: 227206802
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

In ClickHouse, it only takes 0.062 sec:

:) select count(*) from wikistats where dt between  toDateTime('2017-01-01 00:00:00') and  toDateTime('2017-12-31 00:00:00');
SELECT count(*)
FROM wikistats
WHERE (dt >= toDateTime('2017-01-01 00:00:00')) AND (dt <= toDateTime('2017-12-31 00:00:00'))
┌───count()─┐
│ 103161991 │
└───────────┘
1 rows in set. Elapsed: 0.062 sec. Processed 103.16 million rows, 412.65 MB (1.67 billion rows/s., 6.68 GB/s.)

Size on Disk

In my previous blog post comparing ClickHouse to Apache Spark and MariaDB, I also compared disk size. Usually, we can expect a 5x to 10x decrease in disk size in ClickHouse due to compression. Wikipedia:Statistics, for example, contains actual URIs, which can be quite large due to the article name/search phrase, and these compress very well. If we only store integers, or use MD5 / SHA1 hashes instead of storing the actual URIs, we can expect a much lower compression ratio (i.e., ~3x). Even a 3x compression ratio is still pretty good for long-term storage.
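
If you want to check the size on disk on the ClickHouse side, one way is to query the system.parts table. This is a sketch (column names as in recent ClickHouse releases; the database name matches the examples above):

SELECT
    table,
    formatReadableSize(sum(bytes)) AS size_on_disk
FROM system.parts
WHERE active AND database = 'wikistats'
GROUP BY table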

Conclusion

As the data in MySQL keeps growing, the performance of all queries keeps decreasing. Typically, queries that originally took milliseconds can start taking seconds (or more), and making them fast again requires a lot of changes (application code, MySQL configuration, etc.).

The main goal of archiving the data is to increase performance (“make MySQL fast again”), decrease costs and improve ease of maintenance (backup/restore, cloning the replication slave, etc.). Archiving to ClickHouse allows you to preserve old data and make it available for reports.

Feb
09
2018
--

This Week in Data with Colin Charles 27: Percona Live Tutorials Released and a Comprehensive Review of the FOSDEM MySQL DevRoom

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Percona Live Santa Clara 2018 update: tutorials have been announced. The committee rated more than 300 talks, and easily 70% of the schedule should go live next week as well. In practice, then, you should see about 50 talks announced next week. There’s been great competition: we only have 70 slots in total, so about 1 in 5 talks get picked — talk about a competitive ratio.

FOSDEM

FOSDEM was truly awesome last week. From a Percona standpoint, we had a lot of excellent booth traffic (being outside of the PostgreSQL room on Saturday, and not too far from the MySQL room on Sunday). We gave away bottle openers (useful in Brussels with all the beer; we tried a new design with a magnet so you can attach it to your fridge), stickers and some brochures, but most of all we had plenty of great conversations. There was quite a crowd from Percona, and it was excellent to see the MySQL & Friends DevRoom almost constantly full! A few of us brave souls managed to stay there the whole day, with barely any breaks, so as to enjoy all the talks.

I found the quality of the talks to be extremely high. When it comes to a community-run event, with all content picked by an independent program committee, FOSDEM really sets the bar high. There is plenty of competition to get a good talk in, and I enjoyed everything we picked (yes, I was on the committee too). We’ve had plenty of events in the ecosystem that sort of had “MySQL” or related days, but FOSDEM might be the only one that has really survived. I understand we will have a day of some sort at SCALE16x, but even that has been scaled down. So if you care about the MySQL ecosystem, you will really want to ensure that you are at FOSDEM next year.

This year, we started with the usual MySQL Day on Friday. I could not be present, as I was at the CentOS Dojo, giving a presentation. So, the highlight of Friday for me? The community dinner. Over 80 people showed up, I know there was a waiting list, and lots of people were trying to get tickets at the last minute. Many missed out too; sorry, better luck next year; and also, hopefully, we will get a larger venue going forward. I really thank the organizers for this — we affectionately refer to them as the Belconians (i.e. a handful of Perconians based in Belgium). The conversation, the food, the drink — they were all excellent. It’s good to see representation from all parts of the community: MySQL, Percona, MariaDB, Pythian, and others. So thank you again, Liz, Dimitri, Tom, and Kenny in absentia. I think Tjerk also deserves special mention for always helping (this year with the drinks).

As for FOSDEM itself, beyond the booth, I think the most interesting stuff was the talks. There are video recordings and slides of pretty much all talks, but I will also give you the “Cliff’s Notes” of them here.

MySQL DevRoom talk quick summaries

Beyond WHERE and GROUP BY – Sergei Golubchik

  • EXCEPT is in MariaDB Server 10.3
  • recursive CTEs are good for hierarchical data, graphs and data generation, and are Turing complete (you can even use them to solve Sudoku); see the example after this list
  • non-recursive CTEs can be an alternative syntax for subqueries in the FROM clause
  • Window functions:
    • Normal: one result per row, depend on that row only
    • Aggregate: one result per group, depending on the whole group
    • Window: one result per row, depending on the whole group
  • System versioned tables with AS OF
  • Aggregate stored functions
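
As a quick illustration of the data-generation use case mentioned in the recursive CTE bullet above (MariaDB 10.2+ / MySQL 8.0 syntax; this example is mine, not from the talk):

WITH RECURSIVE seq AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 10
)
SELECT n FROM seq;  -- generates the numbers 1 through 10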

MySQL 8.0 Performance: InnoDB Re-Design – Dimitri Kravtchuk

  • Contention-Aware Transactions Scheduling (CATS), since 8.0.3. Not all transactions are equal, FIFO could not be optimal, unblock the most blocking transactions first
  • CATS (VATS) had a few issues, and there were bugs (they thought everything worked since MariaDB Server had implemented it). They spent about 9 months before fixing everything.
  • Where does CATS help? Workloads hitting row lock contentions. You can monitor via SHOW ENGINE INNODB MUTEX.
  • the main issue comes from repeatable read versus read committed transaction isolation on the same workload. You really need to understand your workload when it comes to VATS.

MySQL 8.0 Roles – Giuseppe Maxia

  • Created like a user, granted like privileges. You need to activate them to use them.
  • Before roles, you created a user, then grant, grant, and more grants… Add another user? Same deal. Lots of repetitive work and a lot of chances to make mistakes.
  • Faster user administration – define a role, assign it many times. Centralized grant handling – grant and revoke privileges to roles, add/edit all user profiles.
  • You need to remember to set the default role.
  • A user can have many roles; default role can be a list of roles.
  • Roles are users without a login – roles are saved in user tables. This is useful from an account lock/unlock perspective.
  • You can grant a user to a user
  • SET ROLE is for session management; SET DEFAULT ROLE is a permanent assignment of a role for a user. SET ROLE DEFAULT means assign the default role for this user for this session
  • The role_edges table reports which roles are assigned to which users. default_roles keeps track of the current default roles assigned to users. A default role may not exist.
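
A minimal sketch of the workflow summarized above (MySQL 8.0 syntax; the role, user and schema names are made up):

CREATE ROLE 'app_read';
GRANT SELECT ON app.* TO 'app_read';            -- grant privileges to the role once
CREATE USER 'alice'@'%' IDENTIFIED BY 'secret';
GRANT 'app_read' TO 'alice'@'%';                -- assign the role to as many users as needed
SET DEFAULT ROLE 'app_read' TO 'alice'@'%';     -- remember to set the default role
-- or activate it for the current session only:
SET ROLE 'app_read';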

Histogram support in MySQL 8.0 – Øystein Grøvlen

  • You can now do ANALYZE TABLE table UPDATE HISTOGRAM ON column WITH n BUCKETS;
  • New storage engine API for sampling (default implementation is full table scan even when sampling)
  • Histogram is stored in a JSON column in the data dictionary. Grab this from the INFORMATION_SCHEMA.
  • Histograms are useful for columns that are not the first column of any index, and used in WHERE conditions of JOIN queries, queries with IN-subqueries, ORDER BY … LIMIT queries. Best fit: low cardinality columns (e.g. gender, orderStatus, dayOfWeek, enums), columns with uneven distribution (skew), stable distribution (do not change much over time)
  • How many buckets? equi-height, 100 buckets should be enough.
  • Histograms are stored in the data dictionary, so will persist over restarts of course.
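
For example, creating and then inspecting a histogram might look like this (the orders table and orderStatus column are made-up names):

ANALYZE TABLE orders UPDATE HISTOGRAM ON orderStatus WITH 100 BUCKETS;

SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, HISTOGRAM
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE TABLE_NAME = 'orders';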

Let’s talk database optimizers – Vicențiu Ciorbaru

TLS for MySQL at Large Scale – Jaime Crespo

  • Literally took 3 lines in the my.cnf to turn on TLS (a sketch of what those lines might look like follows this list)
  • https://dbtree.wikimedia.org
  • They wanted to do a data centre failover and wanted to ensure replication would be encrypted.
  • They didn’t have proper orchestration in place (MySQL could have this too). Every time OpenSSL or MySQL had to be upgraded, the daemon needed restarting. If there was an incompatible change, you had to sync master/replicas too.
  • The automation and orchestration that Wikipedia uses: https://fosdem.org/2018/schedule/event/cumin_automation/ (it is called Cumin: https://wikitech.wikimedia.org/wiki/Cumin)
  • Server support was poor – OpenSSL – so they had to deploy wmf-mysql and wmf-mariadb of their own
  • Currently using MariaDB 10.0, and looking to migrate to MariaDB 10.1
  • Client library pain they’ve had
  • TLSv1.2 from the beginning (2015).
  • 20-50x slower for actual connecting; the impact is less than 5% for the actual query performance. Just fix client libraries, make them use persistent connections. They are now very interested in ProxySQL for this purpose.
  • https://grafana.wikimedia.org/?orgId=1
  • Monty asks, would a double certificate help? Jaime says sure. But he may not actually use double certificates; might not solve CA issues, and the goal is not to restart the server.
  • Monty wonders why not to upgrade to 10.2? “Let’s talk outside because it’s a much larger question.”
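
The “3 lines in the my.cnf” mentioned in the first bullet are presumably something along these lines (file paths are placeholders, not taken from the talk):

[mysqld]
ssl-ca   = /etc/mysql/ssl/ca.pem
ssl-cert = /etc/mysql/ssl/server-cert.pem
ssl-key  = /etc/mysql/ssl/server-key.pem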

MySQL InnoDB Cluster – Miguel Araújo

  • group replication: update everywhere (multi-master), virtually synchronous replication, automatic server failover, distributed recovery, group reconfiguration, GCS (implementation of Paxos – group communication system). HA is a critical factor.
  • mysqlsh: interactive and batch operations. Document store (CRUD and relational access)
  • admin API in mysqlsh: create & manage clusters, hide complexity of configuration/provisioning/orchestration of the InnoDB clusters. Works with JavaScript and Python
  • Usability. HA out of the box.
  • It’s easy to join a new node; new node goes into recovery mode (and as long as you have all the binary logs, this is easy; otherwise start from a backup)
  • SET PERSIST – run a command remotely, and the configuration is persisted in the server
  • Network flapping? Group replication will just reject the node from the cluster if it's flapping too often

Why we’re excited about MySQL 8 – Peter Zaitsev

  • Native data dictionary – atomic, crash safe, DDLs, no more MyISAM system table requirements
  • Fast INFORMATION_SCHEMA
  • utf8mb4 as default character set
  • Security: roles, breakdown of SUPER privileges, password history, faster cached-SHA2 authentication (default), builds using OpenSSL (like Percona Server), skip grants blocks remote connections, logs now encrypted when tablespace encryption enabled
  • Persistent AUTO_INCREMENT
  • auto-managed undo tablespaces – do not use system table space for undo space. Automatically reclaim space on disks.
  • Self-tuning, limited to InnoDB (innodb_dedicated_server to auto-tune)
  • partial in-place update for JSON – update a field in a JSON object without a full rewrite. Good for counters/statuses/timestamps. Update/removal of an element is supported
  • Invisible indexes – test impact of dropping indexes before actually dropping them. Maintained but unused by the optimizer. If not needed or used, then drop away.
  • TmpTable Storage Engine – more efficient storage engine for internal temporary tables. Efficient storage for VARCHAR and VARBINARY columns. Good for GROUP BY queries. Doesn’t support BLOB/TEXT columns yet (this reverts to InnoDB temp table now)
  • Backup locks – prevent operations which may result in inconsistent backups. LOCK INSTANCE FOR BACKUP (something Percona Server has had before)
  • Optimizer histograms – detailed statistics on columns, not just indexes
  • improved cost model for the optimizer – www.unofficialmysqlguide.com
  • Performance Schema – faster (via “fake” indexes), error instrumentation, response time histograms (global & per query), digest summaries
  • select * from sys.session – fast potential replacement for show processlist
  • RESTART (command)
  • SET PERSIST – e.g. change the buffer pool size, and this helps during a restart
  • assumes default storage is SSD now
  • binary log on by default, log_slave_updates enabled by default, and log expires after 30 days by default
  • query cache removed. Look at ProxySQL or some other caching solution
  • native partitioning only – remove partitions from MyISAM or convert to InnoDB
  • resource groups – isolation and better performance (map queries to specific CPU cores; can jail your costly queries, like analytical queries)
  • Feature Requests: better single thread performance, no parallel query support
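
A few of the items above, expressed in SQL (a sketch with made-up object names):

-- persist a configuration change across restarts
SET PERSIST innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;
-- hide an index from the optimizer before deciding whether to drop it
ALTER TABLE orders ALTER INDEX idx_status INVISIBLE;
-- restart the server from SQL
RESTART;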

MySQL Test Framework for Support and Bugs Work – Sveta Smirnova

  • MTR allows you to add multiple connections
  • has commands for flow control

ProxySQL – GTID Consistent Reads – René Cannaò, Nick Vyzas

  • threshold is configurable in increments of 1 second. Replication lag can be monitored with ProxySQL. Want to ensure you don’t have stale reads.
  • Why is GTID important? To guarantee consistency. Auto positioning for restructuring topologies.
  • –session-track-gtids is an important feature which allows sending the GTID for a transaction on the OK packet for a transaction. Not available in MariaDB.
  • There is a ProxySQL Binlog Reader now – it sends GTID information about a MySQL server to all connected ProxySQL instances. Lightweight process to run on your MySQL server.
  • ProxySQL can be configured to enforce GTID consistency for reads on any hostgroup/replication hostgroup.
  • Live demo by René

Turbocharging MySQL with Vitess – Sugu Sougoumarane

  • trend for the cloud: container instances, short-lived containers, tolerate neighbors, discoverability. No good tools yet for Kubernetes.
  • non-ideal options: application-level sharding, NoSQL, paid solutions, NewSQL (CockroachDB, TiDB, Yugabyte)
  • Vitess: leverage MySQL at massive scale, opensource, 8+ years of work, and multiple production examples
  • Square uses Vitess for Square Cash application.
  • Can MySQL run on Docker? Absolutely, many of the companies do huge QPS on Docker.
  • YouTube does a major re-shard every 2-3 months. No one notices nowadays when that happens.
  • app server connects to vtgate, and only underneath it’s a bunch of smaller databases with vttablet + mysqld. The lockserver is what makes it run well in the cloud.
  • pluggable architecture with no compromise on performance: monitoring, health check, ACLs, tracing, more.
  • at most, it adds about 2ms overhead to connections
  • Go coding standards are enforced, unit tests with strict coverage requirements, end-to-end tests, Travis, CodeClimate and Netlify. Readability is king.
  • On February 5 2018, it will be a CNCF project. One year of due diligence. They said there was nothing to compare it with. Looked at maturity and contributors. It’s becoming a truly community-owned project! (CNCF to Host Vitess is already live as of now)
  • roadmap: full cross-shard queries, migration tools, simplify configurability, documentation.
  • full MySQL protocol, but a limited query set – they want to get it to a point where it accepts a full MySQL query.

Orchestrator on Raft – Shlomi Noach

  • Raft: guaranteed to be in-order replication log, an increasing index. This is how nodes choose a leader based on who has the higher index. Get periodic snapshots (node runs a full backup).
  • HashiCorp raft, a Golang raft implementation, used by Consul
  • orchestrator manages topology for HA topologies; also want orchestrator to be highly available. Now with orchestrator/raft, remove the MySQL backend dependency, and you can have data center fencing too. Now you get: better cross-DC deploys, DC-local KV control, and also Kubernetes friendly.
  • n-orchestrator nodes, each node still runs its own backend (either MySQL or SQLite). Orchestrator provides the communication for SQLite between the nodes. Only one (the Raft leader) will handle failovers
  • implementation & deployment @ Github – one node per DC (deployed at 3 different DCs). 1-second raft polling interval. 2 major DCs, one in the cloud. Step-down, raft-yield, SQLite-backed log store, and still a MySQL backend (SQLite backend use case is in the works)
  • They patched the HashiCorp raft library. The library doesn’t care about the identity of nodes, with Github they do want to control the identity of the leader. There is an “active” data center, and locality is important. This is what they mean by raft-yield (picking a candidate leader).
  • The ability for a leader to step down is also something they had to patch.
  • HashiCorp Raft only supports LMDB and another database, so the replication log is now kept in a relational SQLite backed log store. Another patch.
  • once orchestrator can’t run its own self-health check, it recognizes this. The application can tell raft now that it’s stepping down. Takes 5 seconds to step down, and raft then promotes another orchestrator node to be the leader. This is their patch.
  • can also grab leadership
  • DC fencing handles network partitioning.
  • orchestrator is Consul-aware. Upon failover, orchestrator updates Consul KV with the identity of the promoted master.
  • considerations to watch out for: what happens if, upon replay of the Raft log, you hit two failovers for the same cluster? NOW() and otherwise time-based assumptions. Reapplying snapshot/log upon startup
  • roadmap: use Kubernetes (cluster IP based configuration in progress, already container friendly via auto-re-provisioning of nodes via Raft)

MyRocks Roadmaps – Yoshinori Matsunobu

  • Facebook has a large User Database (UDB). Social graph, massively sharded, low latency, automated operations, pure flash storage (constrained by space, not CPU/IOPS)
  • They have a record cache in-front of MySQL – Tao for reads. If cache misses, then it hits the database. And all write requests go thru MySQL. UDB has to be fast to ensure a good user experience.
  • At Facebook they also run 2 instances of MySQL on the same machine, because CPU utilization wasn’t high, and the space savings were awesome.
  • design decisions: clustered index (same as InnoDB), slower for reads, faster for writes (bloom filters, column family), support for transactions including consistency between binlog and MyRocks. Faster data loading/deletes/replication, dynamic options (instead of having to restart mysqld), TTL (comparable to HBase TTL feature, specify the TTL, any data older than time, can be removed), online logical (for recovery purposes) & binary backup (for creating replicas)
  • Pros: smaller space, better cache hit rate, writes are faster so you get faster replication, much smaller bytes written
  • Cons: no statement based replication, GAP locks, foreign keys, full-text index, spatial index support. Need to use case sensitive collations for performance. Reads are slower, especially if the data fits in memory. Dependent on file system and OS; lack of solid direct I/O (uses buffered I/O). You need a newer than 4.6 kernel. Too many tuning options beyond buffer pool such as bloom filter, compactions, etc.
  • https://twitter.com/deniszh/status/960163082642382849
  • Completed InnoDB to MyRocks migration. Saved 50% space in UDB compared to compressed InnoDB.
  • Roadmaps: getting in MariaDB and Percona Server for MySQL. Read Mark’s blog for matching read performance vs InnoDB. Supporting mixed engines. Better replication and bigger instance sizes.
  • mixed engines: InnoDB and MyRocks on the same instance, though a single transaction does not span both engines. Plan to extend xtrabackup to integrate myrocks_hotbackup. Backport gtid_pos_auto_engines from MariaDB?
  • Removing the engine log: having both the binlog and the engine log requires 2PC and ordered commits. Use one log? Either the binlog (or a binlog-like service) or the RocksDB WAL? They rely on the binlog now (semi-sync, binlog consumers); need to determine how much performance is gained by stopping writes to the WAL.
  • Parallel replication apply is important in MySQL 8
  • support bigger instance sizes: a shared-nothing database is not a general purpose database. Today you can get 256GB+ RAM and 10TB+ flash on commodity servers. Why not run one big instance and put everything there? Bigger instances may help general purpose small-mid applications. Then you don’t have to worry about sharding. Atomic transactions, joins and secondary keys will just work. Amazon Aurora today supports a 60TB instance!
  • today: you can start deploying slaves with consistency check. Many status counters for instance monitoring.

ProxySQL internals – René Cannaò

  • reduce latency, scale, maximize throughput. A single instance can handle hundreds of thousands of client connections and thousands of backend servers.
  • threading models: one thread per connection (blocking I/O), thread pooling (non-blocking I/O, scalable).
  • ProxySQL thread pool implementation: known as “MySQL threads”, fixed number of worker threads (configurable), all threads listen on the same port(s), client connections are not shared between threads, all threads perform their own network I/O, and it uses poll() (does that scale? True, but there is a reason why poll over epoll)
  • threads never share client connections – no need for synchronization, thread contention is reduced, each thread calls poll(). Possibly imbalanced load as a con (one thread may have way more connections than another). Is it really a problem? Most of the time, no; connections will automatically balance.
  • poll() is O(N), epoll() is O(1). poll() is faster than epoll() for fewer connections (around 1,000), but performance degrades when there are a lot of connections. So by default it uses poll() instead of epoll(); at around 50,000 connections performance degrades badly – so ProxySQL has auxiliary threads.
  • MySQL_Session() is implemented as a state machine. Stores metadata associated with the client session (running timers, default hostgroup, etc.)

MySQL Point-in-time recovery like a rockstar – Frederic Descamps

Releases

  • Percona Monitoring and Management 1.7.0 (PMM) – This release features improved support for external services, which enables a PMM Server to store and display metrics for any available Prometheus exporter. For example, you could deploy the postgres_exporter and use PMM’s external services feature to store PostgreSQL metrics in PMM. Immediately, you’ll see these new metrics in the Advanced Data Exploration dashboard. Then you could leverage many of the pre-developed PostgreSQL dashboards available on Grafana.com, and with a minimal amount of edits have a working PostgreSQL dashboard in PMM!
  • MariaDB Server 10.1.31 – usual updates to storage engines, and a handful of bug fixes.

Link List

Upcoming appearances

  • SCALE16x – Pasadena, California, USA – March 8-11 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.
