Jul
19
2017
--

Multi-Threaded Slave Statistics

In this blog post, I'll talk about the multi-threaded slave statistics printed in the MySQL error log file.

MySQL version 5.6 and later allows you to execute replicated events using parallel threads. This feature is called Multi-Threaded Slave (MTS), and to enable it you set the slave_parallel_workers variable to a value greater than 1.

Recently, a few customers asked about the meaning of some new statistics printed in their error log files when they enabled MTS. These messages look similar to the example below:

[Note] Multi-threaded slave statistics for channel '': seconds elapsed = 123; events assigned = 57345; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 0 waited (count) when Workers occupied = 0 waited when Workers occupied = 0

The MySQL reference manual doesn't document these statistics. I've filed a bug report asking Oracle to add information about them to the MySQL documentation. I reported this bug as #85747.

Until the documentation is updated, we can use the MySQL source code to get insight into what the statistics mean, and to determine how often they are printed to the error log file. Looking into the rpl_slave.cc file, we find that when you enable MTS – and the log-warnings variable is greater than 1 (log-error-verbosity greater than 2 for MySQL 5.7) – the statistics are printed to the MySQL error log every 120 seconds, an interval determined by a hard-coded constant. The code below shows this:

/*
  Statistics go to the error log every # of seconds when --log-warnings > 1
*/
const long mts_online_stat_period= 60 * 2;

Does this mean that MTS prints statistics to your MySQL error log every 120 seconds (if enabled)? The answer is no. Whether MTS prints statistics within that period depends on your slave's level of activity. The following line in the MySQL code checks the slave's activity level before printing the statistics:

if (rli->is_parallel_exec() && rli->mts_events_assigned % 1024 == 1)

From the above code, you need MTS enabled and the result of the modulo operation between the mts_events_assigned variable and 1024 to equal 1 in order to print the statistics. The mts_events_assigned variable stores the number of events assigned to the parallel queue. If you're replicating a low number of events, or not replicating at all, MySQL won't print the statistics to the error log. On the other hand, if you're replicating a high number of events all the time, the mts_events_assigned variable keeps growing, and each time the remainder of dividing it by 1024 is 1, MySQL prints MTS statistics to the error log roughly every 120 seconds.
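Putting the two conditions together, a minimal Python sketch of the print gate (variable names are illustrative, simplified from the logic in rpl_slave.cc):

```python
# Sketch of the two conditions gating the MTS statistics printout.
MTS_ONLINE_STAT_PERIOD = 60 * 2  # hard-coded 120-second interval

def should_print_stats(mts_events_assigned, now, last_printed):
    """Return True when MTS statistics would go to the error log."""
    # Condition 1: the assigned-event counter hits the modulo mark.
    if mts_events_assigned % 1024 != 1:
        return False
    # Condition 2: enough time has passed since the last printout.
    return now - last_printed >= MTS_ONLINE_STAT_PERIOD

# A busy slave: only every 1024th assigned event (offset 1) can trigger
# a printout, and only once the period has elapsed.
printed = [n for n in range(1, 5000)
           if should_print_stats(n, now=300, last_printed=0)]
```

This shows why a quiet slave never logs the statistics: the counter simply never reaches the next `% 1024 == 1` mark.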

You can find the explanation of these statistics below (collected from information in the source code):

  1. Worker queues filled over overrun level: MTS tends to load balance events between all parallel workers, and the slave_parallel_workers variable determines the number of workers. This statistic shows the level of saturation the workers are suffering. If a parallel worker queue is close to full, this counter is incremented and the worker's replication event is delayed in order to avoid reaching the worker queue limit.
  2. Waited due to a Worker queue full: This statistic is incremented when the coordinator thread must wait because a worker queue is full.
  3. Waited due to the total size: This statistic shows the number of times the coordinator thread slept because it reached the limit of memory available in the worker queues to hold unapplied events. If this statistic keeps increasing in your MySQL error log, you should raise the slave_pending_jobs_size_max variable to reduce coordinator thread waits.
  4. Waited at clock conflicts: In the case of possible dependencies between transactions, this statistic shows the wait time spent on logical timestamp conflict detection and resolution.
  5. Waited (count) when Workers occupied: A counter of how many times the coordinator found the workers filled up with "enough" assignments. The definition of "enough" depends on the scheduler type (per-database or clock-based).
  6. Waited when Workers occupied: The coordinator thread's accumulated waiting time for any worker to become available; this applies solely to the commit-clock scheduler.
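If you want to track these counters over time, the log line is easy to parse. A small illustrative helper (the regular expression follows the message format shown earlier; field selection is an assumption of this sketch):

```python
import re

# Pull two counters out of an MTS statistics line from the error log.
STAT_RE = re.compile(r"seconds elapsed = (\d+); events assigned = (\d+)")

def parse_mts_stats(line):
    """Return (seconds_elapsed, events_assigned), or None for other lines."""
    m = STAT_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

line = ("[Note] Multi-threaded slave statistics for channel '': "
        "seconds elapsed = 123; events assigned = 57345; "
        "worker queues filled over overrun level = 0; "
        "waited due a Worker queue full = 0")
```

Extending the pattern to the remaining counters (queue overruns, waits) is a matter of adding more capture groups.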
Conclusion

Multi-threaded slave is an exciting feature that allows you to replicate events faster and keep in sync with master instances. By changing the log-warnings variable to a value greater than 1, you can get information in the slave's error log file about how the multi-threaded slave is performing.

Dec
18
2015
--

Performance Schema: Measure Multi-Threaded Slave Activity


In many types of database workloads, using a multi-threaded slave from 5.6+ helps improve replication performance. I've had a number of users enable this feature, but have not seen anyone ask how each thread is performing. Here's a quick way with Performance_Schema to measure the amount of multi-threaded slave activity on each thread (after you have already configured MTS on your slave, of course).

First, we need to enable the statements instruments:

slave1> UPDATE setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_statements_%';
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3  Changed: 2  Warnings: 0

Next, let's find the THREAD_ID for our slave workers:

slave1> SELECT THREAD_ID, NAME FROM threads WHERE NAME LIKE '%worker';
+-----------+-------------------------+
| THREAD_ID | NAME                    |
+-----------+-------------------------+
|        28 | thread/sql/slave_worker |
|        29 | thread/sql/slave_worker |
|        30 | thread/sql/slave_worker |
|        31 | thread/sql/slave_worker |
+-----------+-------------------------+
4 rows in set (0.00 sec)

And then we capture measurements:

slave1> SELECT THREAD_ID, EVENT_NAME, COUNT_STAR FROM events_statements_summary_by_thread_by_event_name WHERE THREAD_ID IN (28,29,30,31) AND COUNT_STAR > 0;
+-----------+----------------------+------------+
| THREAD_ID | EVENT_NAME           | COUNT_STAR |
+-----------+----------------------+------------+
|        28 | statement/sql/update |         48 |
|        28 | statement/sql/insert |         24 |
|        28 | statement/sql/delete |         24 |
|        28 | statement/sql/begin  |         24 |
|        29 | statement/sql/update |         68 |
|        29 | statement/sql/insert |         34 |
|        29 | statement/sql/delete |         34 |
|        29 | statement/sql/begin  |         34 |
|        30 | statement/sql/update |       1864 |
|        30 | statement/sql/insert |        932 |
|        30 | statement/sql/delete |        932 |
|        30 | statement/sql/begin  |        932 |
|        31 | statement/sql/update |      40320 |
|        31 | statement/sql/insert |      20160 |
|        31 | statement/sql/delete |      20160 |
|        31 | statement/sql/begin  |      20160 |
+-----------+----------------------+------------+
16 rows in set (0.01 sec)

As you can see above, one of the slave worker threads is doing more work compared to the other three. It is applying the database where sysbench runs on the master with --trx-rate=10, compared to the other three databases, which are only doing --trx-rate=2|5.
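Since each applied transaction issues exactly one BEGIN, the statement/sql/begin COUNT_STAR doubles as a per-thread transaction count. A quick sketch of the workload share per worker, using the counts from the result set above:

```python
# BEGIN counts per worker THREAD_ID, taken from the COUNT_STAR column above.
begin_counts = {28: 24, 29: 34, 30: 932, 31: 20160}

total = sum(begin_counts.values())
# Percentage of all replicated transactions applied by each worker.
share = {tid: round(100 * n / total, 1) for tid, n in begin_counts.items()}
```

Here thread 31 applied over 95% of the transactions, which matches the skewed sysbench transaction rates described above.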

Below is another example, this time including the event times. With a slightly modified sysbench test, I designed the tests to send large updates: one with a low transaction rate, another with a high transaction rate, and the rest in between. You can see specifically with threads 32 and 34 that the former performs fewer UPDATEs, but spends more time per event.

slave1> SELECT THREAD_ID, EVENT_NAME, COUNT_STAR,
    ->   SUM_TIMER_WAIT/1000000 AS SUM_TIMER_WAIT_MS,
    ->   (SUM_TIMER_WAIT/COUNT_STAR)/1000000 AS AVG_TIME_MS
    -> FROM events_statements_summary_by_thread_by_event_name
    -> WHERE THREAD_ID IN (31,32,33,34) AND COUNT_STAR > 0;
+-----------+----------------------+------------+-------------------+--------------+
| THREAD_ID | EVENT_NAME           | COUNT_STAR | SUM_TIMER_WAIT_MS | AVG_TIME_MS  |
+-----------+----------------------+------------+-------------------+--------------+
|        31 | statement/sql/update |          3 |          403.9690 | 134.65633333 |
|        31 | statement/sql/insert |          1 |           91.5340 |  91.53400000 |
|        31 | statement/sql/delete |          1 |           75.9540 |  75.95400000 |
|        31 | statement/sql/begin  |          1 |           11.6800 |  11.68000000 |
|        32 | statement/sql/update |         75 |        25298.5090 | 337.31345333 |
|        32 | statement/sql/insert |         25 |         2221.1410 |  88.84564000 |
|        32 | statement/sql/delete |         25 |         2187.8970 |  87.51588000 |
|        32 | statement/sql/begin  |         25 |          321.0220 |  12.84088000 |
|        33 | statement/sql/update |       4728 |      1008736.6000 | 213.35376481 |
|        33 | statement/sql/insert |       1576 |       138815.0570 |  88.08061992 |
|        33 | statement/sql/delete |       1576 |       136161.9060 |  86.39714848 |
|        33 | statement/sql/begin  |       1576 |        20498.2300 |  13.00649112 |
|        34 | statement/sql/update |      70668 |     12304784.2380 | 174.12101995 |
|        34 | statement/sql/insert |      23556 |      2083454.5350 |  88.44687277 |
|        34 | statement/sql/delete |      23556 |      2041202.7690 |  86.65319957 |
|        34 | statement/sql/begin  |      23556 |       303710.4860 |  12.89312642 |
+-----------+----------------------+------------+-------------------+--------------+
16 rows in set (0.00 sec)
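The derived AVG_TIME_MS column is simply SUM_TIMER_WAIT_MS divided by COUNT_STAR; recomputing it for one row is a quick sanity check on the query:

```python
# Thread 32's statement/sql/update row from the result set above.
count_star = 75
sum_timer_wait_ms = 25298.5090

# AVG_TIME_MS = SUM_TIMER_WAIT_MS / COUNT_STAR
avg_time_ms = sum_timer_wait_ms / count_star
```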

Exploring a bit further, one thing I noticed is that when using binlog_format=ROW for replication, you will only see the BEGIN events instrumented in performance_schema. I think this is a bug, so I reported it here.

slave1> SELECT THREAD_ID, EVENT_NAME, COUNT_STAR,
    ->   SUM_TIMER_WAIT/1000000 AS SUM_TIMER_WAIT_MS,
    ->   (SUM_TIMER_WAIT/COUNT_STAR)/1000000 AS AVG_TIME_MS
    -> FROM events_statements_summary_by_thread_by_event_name
    -> WHERE THREAD_ID IN (40,41,42,43) AND COUNT_STAR > 0;
+-----------+---------------------+------------+-------------------+-------------+
| THREAD_ID | EVENT_NAME          | COUNT_STAR | SUM_TIMER_WAIT_MS | AVG_TIME_MS |
+-----------+---------------------+------------+-------------------+-------------+
|        40 | statement/sql/begin |         16 |          258.6500 | 16.16562500 |
|        41 | statement/sql/begin |         91 |         1526.4320 | 16.77397802 |
|        42 | statement/sql/begin |        589 |        10131.4500 | 17.20110357 |
|        43 | statement/sql/begin |       5022 |        85584.0250 | 17.04182099 |
+-----------+---------------------+------------+-------------------+-------------+
4 rows in set (0.01 sec)

With MySQL 5.7, MTS gained an additional improvement in how parallel replication is applied. By default, transactions are applied in parallel per database. The new slave_parallel_type variable introduces another option, LOGICAL_CLOCK, that allows intra-database parallelization as well. I ran some small tests below with the same workload as above. First with the default DATABASE type:

+-----------+----------------------+------------+-------------------+--------------+
| THREAD_ID | EVENT_NAME           | COUNT_STAR | SUM_TIMER_WAIT_MS | AVG_TIME_MS  |
+-----------+----------------------+------------+-------------------+--------------+
|        65 | statement/sql/update |         51 |         9974.1310 | 195.57119608 |
|        65 | statement/sql/insert |         17 |         2037.7080 | 119.86517647 |
|        65 | statement/sql/delete |         17 |         2144.4800 | 126.14588235 |
|        65 | statement/sql/begin  |         17 |          293.0650 |  17.23911765 |
|        66 | statement/sql/update |        255 |        67131.4980 | 263.26077647 |
|        66 | statement/sql/insert |         85 |         9629.5520 | 113.28884706 |
|        66 | statement/sql/delete |         85 |         9854.2920 | 115.93284706 |
|        66 | statement/sql/begin  |         85 |         1405.0200 |  16.52964706 |
|        67 | statement/sql/update |       1215 |       344712.6330 | 283.71410123 |
|        67 | statement/sql/insert |        405 |        48000.0110 | 118.51854568 |
|        67 | statement/sql/delete |        405 |        53222.3010 | 131.41308889 |
|        67 | statement/sql/begin  |        405 |         6836.9070 |  16.88125185 |
|        68 | statement/sql/update |       5943 |      1820669.3290 | 306.35526317 |
|        68 | statement/sql/insert |       1981 |       241513.1400 | 121.91476022 |
|        68 | statement/sql/delete |       1981 |       245022.2450 | 123.68614084 |
|        68 | statement/sql/begin  |       1981 |        32978.3390 |  16.64731903 |
+-----------+----------------------+------------+-------------------+--------------+

And then with LOGICAL_CLOCK:

+-----------+----------------------+------------+-------------------+--------------+
| THREAD_ID | EVENT_NAME           | COUNT_STAR | SUM_TIMER_WAIT_MS | AVG_TIME_MS  |
+-----------+----------------------+------------+-------------------+--------------+
|        59 | statement/sql/update |       9486 |      2667681.7170 | 281.22303574 |
|        59 | statement/sql/insert |       3162 |       376350.1650 | 119.02282258 |
|        59 | statement/sql/delete |       3162 |       388606.5460 | 122.89897090 |
|        59 | statement/sql/begin  |       3162 |        53988.2600 |  17.07408602 |
|        60 | statement/sql/update |        903 |       362853.2080 | 401.83079513 |
|        60 | statement/sql/insert |        301 |        36507.3090 | 121.28674086 |
|        60 | statement/sql/delete |        301 |        37437.2550 | 124.37626246 |
|        60 | statement/sql/begin  |        301 |         5210.4110 |  17.31033555 |
|        61 | statement/sql/update |         42 |        23394.0330 | 557.00078571 |
|        61 | statement/sql/insert |         14 |         1671.5900 | 119.39928571 |
|        61 | statement/sql/delete |         14 |         1720.1230 | 122.86592857 |
|        61 | statement/sql/begin  |         14 |          246.1490 |  17.58207143 |
+-----------+----------------------+------------+-------------------+--------------+

With LOGICAL_CLOCK and slave_preserve_commit_order enabled:

+-----------+----------------------+------------+-------------------+--------------+
| THREAD_ID | EVENT_NAME           | COUNT_STAR | SUM_TIMER_WAIT_MS | AVG_TIME_MS  |
+-----------+----------------------+------------+-------------------+--------------+
|        71 | statement/sql/update |       8097 |      2216461.7170 | 273.73863369 |
|        71 | statement/sql/insert |       2699 |       322933.1620 | 119.64918933 |
|        71 | statement/sql/delete |       2699 |       326944.2380 | 121.13532345 |
|        71 | statement/sql/begin  |       2699 |        45860.5720 |  16.99169026 |
|        72 | statement/sql/update |        807 |       256668.2730 | 318.05238290 |
|        72 | statement/sql/insert |        269 |        32952.5570 | 122.50021190 |
|        72 | statement/sql/delete |        269 |        33346.3060 | 123.96396283 |
|        72 | statement/sql/begin  |        269 |         4650.1010 |  17.28662082 |
|        73 | statement/sql/update |         33 |         6823.1170 | 206.76112121 |
|        73 | statement/sql/insert |         11 |         1512.5810 | 137.50736364 |
|        73 | statement/sql/delete |         11 |         1593.5790 | 144.87081818 |
|        73 | statement/sql/begin  |         11 |          188.3910 |  17.12645455 |
+-----------+----------------------+------------+-------------------+--------------+

Combining the INSERT, UPDATE and DELETE counts from these three tests, LOGICAL_CLOCK tops the list with around 17k events within 120 seconds of the same workload. The DATABASE type was the slowest, recording only about 12k events committed on the slave.
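For reference, the slave settings behind the three modes compared above look roughly like this (an illustrative my.cnf sketch, not the exact configuration used in these tests; note that slave_preserve_commit_order requires log_bin and log_slave_updates to be enabled on the slave):

```ini
[mysqld]
slave_parallel_workers      = 4
# DATABASE (the default) or LOGICAL_CLOCK
slave_parallel_type         = LOGICAL_CLOCK
# Optional: commit on the slave in binary log order
slave_preserve_commit_order = ON
```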

Lastly, MySQL 5.7 adds a new transaction instrument that can be enabled for metrics collection, so you can also measure the per-transaction rate instead of per event class.

slave1> UPDATE setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_transactions%';
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3  Changed: 2  Warnings: 0
slave1> UPDATE setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

And a quick measurement:

slave1> SELECT THREAD_ID, EVENT_NAME, COUNT_STAR,
    ->   SUM_TIMER_WAIT/1000000 AS SUM_TIMER_WAIT_MS,
    ->   (SUM_TIMER_WAIT/COUNT_STAR)/1000000 AS AVG_TIME_MS
    -> FROM events_transactions_summary_by_thread_by_event_name
    -> WHERE THREAD_ID IN (
    ->     SELECT THREAD_ID FROM replication_applier_status_by_worker
    -> ) AND COUNT_STAR > 0;
+-----------+-------------+------------+-------------------+----------------+
| THREAD_ID | EVENT_NAME  | COUNT_STAR | SUM_TIMER_WAIT_MS | AVG_TIME_MS    |
+-----------+-------------+------------+-------------------+----------------+
|        71 | transaction |       4849 |     50323942.6710 | 10378.21049103 |
|        72 | transaction |        487 |      6421399.4580 | 13185.62517043 |
|        73 | transaction |         17 |       319024.9280 | 18766.17223529 |
+-----------+-------------+------------+-------------------+----------------+
3 rows in set (0.00 sec)

Of course, you can measure not only statement events and transactions – you can also measure waits and stages. Happy instrumenting!

The post Performance Schema: Measure Multi-Threaded Slave Activity appeared first on MySQL Performance Blog.

Jan
29
2015
--

Multi-threaded replication with MySQL 5.6: Use GTIDs!

MySQL 5.6 allows you to execute replicated events in parallel as long as data is split across several databases. This feature is named "Multi-Threaded Slave" (MTS), and it is easy to enable by setting slave_parallel_workers to a value greater than 1. However, if you decide to use MTS without GTIDs, you may run into annoying issues. Let's look at two of them.

Skipping replication errors

When replication stops with an error, a frequent approach is to “ignore now and fix later.” This means you will run SET GLOBAL sql_slave_skip_counter=1 to be able to restart replication as quickly as possible and later use pt-table-checksum/pt-table-sync to resync data on the slave.

Then came the day when I hit this:

mysql> show slave status\G
[...]
Last_SQL_Error: Worker 0 failed executing transaction '' at master log mysql-bin.000017, end_log_pos 1216451; Error 'Duplicate entry '1001' for key 'PRIMARY'' on query. Default database: 'db1'. Query: 'INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 5320, '49123511666-22272014664-85739796464-62261637750-57593947547-00947134109-73607171516-11063345053-55659776318-82888369235', '11400300639-05875856680-20973514928-29618434959-69429576205')'
Exec_Master_Log_Pos: 1005432

I tried to use the trick:

mysql> set global sql_slave_skip_counter=1;
mysql> start slave;

But:

mysql> show slave status\G
[...]
Last_SQL_Error: Worker 0 failed executing transaction '' at master log mysql-bin.000017, end_log_pos 1216451; Error 'Duplicate entry '1001' for key 'PRIMARY'' on query. Default database: 'db1'. Query: 'INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 5320, '49123511666-22272014664-85739796464-62261637750-57593947547-00947134109-73607171516-11063345053-55659776318-82888369235', '11400300639-05875856680-20973514928-29618434959-69429576205')'
Exec_Master_Log_Pos: 1005882

Note that the position reported with Exec_Master_Log_Pos has moved forward, but I still have my duplicate key error. What’s wrong?

The issue is that the positions reported by SHOW SLAVE STATUS are misleading when using MTS. Quoting the documentation about Exec_Master_Log_Pos:

When using a multi-threaded slave (by setting slave_parallel_workers to a nonzero value in MySQL 5.6.3 and later), the value in this column actually represents a “low-water” mark, before which no uncommitted transactions remain. Because the current implementation allows execution of transactions on different databases in a different order on the slave than on the master, this is not necessarily the position of the most recently executed transaction.

So the solution to my problem is first to make sure that there is no execution gap, and only then to skip the offending event. There is a specific statement for the first part:

mysql> start slave until sql_after_mts_gaps;

And now I can finally skip the error and restart replication:

mysql> set global sql_slave_skip_counter=1;
mysql> start slave;
mysql> show slave status\G
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

The last thing to do is of course to resync the slave.

Backups

If you cannot trust the output of SHOW SLAVE STATUS to get the current binlog position, it means that taking a backup from a slave with parallel replication is tricky.

For instance, if you run mysqldump --dump-slave=2 to get the binlog position of the master, mysqldump will first run STOP SLAVE and then SHOW SLAVE STATUS. Is stopping the slave sufficient to avoid execution gaps? Actually, no.

The only option then seems to be: run STOP SLAVE followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS, followed by mysqldump while replication is stopped. Not very handy!

GTIDs to the rescue!

The solution for both issues is to use GTIDs.

They help when you want to skip an event because, when using GTIDs, you must explicitly specify the transaction you will be skipping. It doesn't matter whether there are execution gaps.
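In practice, "skipping" a transaction with GTIDs means committing an empty transaction in its place. A sketch of the procedure (the UUID:sequence below is a placeholder; use the failed transaction's GTID as reported by the replication error):

```sql
STOP SLAVE;
-- Substitute the GTID of the failed transaction here:
SET gtid_next = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1234';
BEGIN; COMMIT;                 -- commit an empty transaction in its place
SET gtid_next = 'AUTOMATIC';
START SLAVE;
```

Because the slave now considers that GTID executed, it will never attempt to apply the offending transaction again, regardless of any execution gaps.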

They also help for backups because mysqldump takes the position from gtid_executed which is updated at each transaction commit (XtraBackup does that too).

Conclusion

If your application uses several databases and if you’re fighting with replication lag, MTS can be a great feature for you. But although GTIDs are not technically necessary, you’ll be exposed to tricky situations if you don’t use them.

Is everything rosy when using both GTIDs and MTS? Not exactly… But that will be the topic for a separate post!

By the way, if you are in the Brussels area this weekend, come see me and other great speakers at the MySQL and friends devroom at FOSDEM!

The post Multi-threaded replication with MySQL 5.6: Use GTIDs! appeared first on MySQL Performance Blog.
