Apr 16, 2015
--

Profiling MySQL queries from Performance Schema

When optimizing queries and investigating performance issues, MySQL comes with built-in support for profiling queries via

SET profiling = 1;

This is already awesome and simple to use, but why consider the PERFORMANCE_SCHEMA alternative?

Because profiling will be removed soon (it is already deprecated in MySQL 5.6 and 5.7), and because the built-in profiling capability can only be enabled per session. This means that you cannot capture profiling information for queries running from other connections. If you are using Percona Server, the profiling option for log_slow_verbosity is a nice alternative; unfortunately, not everyone is using Percona Server.

Now, for a quick demo: I execute a simple query and profile it below. Note that all of these commands are executed from a single session to my test instance.

mysql> SHOW PROFILES;
+----------+------------+----------------------------------------+
| Query_ID | Duration   | Query                                  |
+----------+------------+----------------------------------------+
|        1 | 0.00011150 | SELECT * FROM sysbench.sbtest1 LIMIT 1 |
+----------+------------+----------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> SHOW PROFILE SOURCE FOR QUERY 1;
+----------------------+----------+-----------------------+------------------+-------------+
| Status               | Duration | Source_function       | Source_file      | Source_line |
+----------------------+----------+-----------------------+------------------+-------------+
| starting             | 0.000017 | NULL                  | NULL             |        NULL |
| checking permissions | 0.000003 | check_access          | sql_parse.cc     |        5797 |
| Opening tables       | 0.000021 | open_tables           | sql_base.cc      |        5156 |
| init                 | 0.000009 | mysql_prepare_select  | sql_select.cc    |        1050 |
| System lock          | 0.000005 | mysql_lock_tables     | lock.cc          |         306 |
| optimizing           | 0.000002 | optimize              | sql_optimizer.cc |         138 |
| statistics           | 0.000006 | optimize              | sql_optimizer.cc |         381 |
| preparing            | 0.000005 | optimize              | sql_optimizer.cc |         504 |
| executing            | 0.000001 | exec                  | sql_executor.cc  |         110 |
| Sending data         | 0.000025 | exec                  | sql_executor.cc  |         190 |
| end                  | 0.000002 | mysql_execute_select  | sql_select.cc    |        1105 |
| query end            | 0.000003 | mysql_execute_command | sql_parse.cc     |        5465 |
| closing tables       | 0.000004 | mysql_execute_command | sql_parse.cc     |        5544 |
| freeing items        | 0.000005 | mysql_parse           | sql_parse.cc     |        6969 |
| cleaning up          | 0.000006 | dispatch_command      | sql_parse.cc     |        1874 |
+----------------------+----------+-----------------------+------------------+-------------+
15 rows in set, 1 warning (0.00 sec)

To demonstrate how we can achieve the same with Performance Schema, we first identify our current connection id. In the real world, you might want to get the connection/processlist id of the thread you want to watch, e.g. from SHOW PROCESSLIST.
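
This approach assumes the statement and stage instrumentation and the corresponding history consumers are enabled. A minimal sketch of enabling them (consumer and instrument names as they appear in MySQL 5.6; verify against your version):

UPDATE performance_schema.setup_consumers SET ENABLED = 'YES'
WHERE NAME IN ('events_statements_history_long', 'events_stages_history_long');
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%' OR NAME LIKE 'stage/%';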

mysql> SELECT THREAD_ID INTO @my_thread_id
    -> FROM threads WHERE PROCESSLIST_ID = CONNECTION_ID();
Query OK, 1 row affected (0.00 sec)

Next, we identify the bounding EVENT_IDs for the statement stages. We look for the statement we want to profile using the query below against the events_statements_history_long table. Your LIMIT clause may vary depending on how many queries the server is receiving.

mysql> SELECT THREAD_ID, EVENT_ID, END_EVENT_ID, SQL_TEXT, NESTING_EVENT_ID
    -> FROM events_statements_history_long
    -> WHERE THREAD_ID = @my_thread_id
    ->   AND EVENT_NAME = 'statement/sql/select'
    -> ORDER BY EVENT_ID DESC LIMIT 3 \G
*************************** 1. row ***************************
       THREAD_ID: 13848
        EVENT_ID: 419
    END_EVENT_ID: 434
        SQL_TEXT: SELECT THREAD_ID INTO @my_thread_id
FROM threads WHERE PROCESSLIST_ID = CONNECTION_ID()
NESTING_EVENT_ID: NULL
*************************** 2. row ***************************
       THREAD_ID: 13848
        EVENT_ID: 374
    END_EVENT_ID: 392
        SQL_TEXT: SELECT * FROM sysbench.sbtest1 LIMIT 1
NESTING_EVENT_ID: NULL
*************************** 3. row ***************************
       THREAD_ID: 13848
        EVENT_ID: 353
    END_EVENT_ID: 364
        SQL_TEXT: select @@version_comment limit 1
NESTING_EVENT_ID: NULL
3 rows in set (0.02 sec)

From the results above, we are mostly interested in the EVENT_ID and END_EVENT_ID values from the second row; these will let us pull the stage events of this particular query from the events_stages_history_long table.

mysql> SELECT EVENT_NAME, SOURCE, (TIMER_END-TIMER_START)/1000000000 as 'DURATION (ms)'
    -> FROM events_stages_history_long
    -> WHERE THREAD_ID = @my_thread_id AND EVENT_ID BETWEEN 374 AND 392;
+--------------------------------+----------------------+---------------+
| EVENT_NAME                     | SOURCE               | DURATION (ms) |
+--------------------------------+----------------------+---------------+
| stage/sql/init                 | mysqld.cc:998        |        0.0214 |
| stage/sql/checking permissions | sql_parse.cc:5797    |        0.0023 |
| stage/sql/Opening tables       | sql_base.cc:5156     |        0.0205 |
| stage/sql/init                 | sql_select.cc:1050   |        0.0089 |
| stage/sql/System lock          | lock.cc:306          |        0.0047 |
| stage/sql/optimizing           | sql_optimizer.cc:138 |        0.0016 |
| stage/sql/statistics           | sql_optimizer.cc:381 |        0.0058 |
| stage/sql/preparing            | sql_optimizer.cc:504 |        0.0044 |
| stage/sql/executing            | sql_executor.cc:110  |        0.0008 |
| stage/sql/Sending data         | sql_executor.cc:190  |        0.0251 |
| stage/sql/end                  | sql_select.cc:1105   |        0.0017 |
| stage/sql/query end            | sql_parse.cc:5465    |        0.0031 |
| stage/sql/closing tables       | sql_parse.cc:5544    |        0.0037 |
| stage/sql/freeing items        | sql_parse.cc:6969    |        0.0056 |
| stage/sql/cleaning up          | sql_parse.cc:1874    |        0.0006 |
+--------------------------------+----------------------+---------------+
15 rows in set (0.01 sec)

As you can see, the results are pretty close; not exactly the same, but close. SHOW PROFILE shows Duration in seconds, while the results above are in milliseconds.
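
If you prefer, the two lookups can also be combined into a single query. A hedged sketch that joins the two history tables directly (it assumes the stage rows are still present in the history table and that the SQL_TEXT pattern matches only the statement you care about):

SELECT sg.EVENT_NAME, sg.SOURCE,
       (sg.TIMER_END-sg.TIMER_START)/1000000000 AS 'DURATION (ms)'
FROM performance_schema.events_statements_history_long st
JOIN performance_schema.events_stages_history_long sg
  ON sg.THREAD_ID = st.THREAD_ID
 AND sg.EVENT_ID BETWEEN st.EVENT_ID AND st.END_EVENT_ID
WHERE st.THREAD_ID = @my_thread_id
  AND st.SQL_TEXT LIKE 'SELECT * FROM sysbench.sbtest1%'
ORDER BY sg.EVENT_ID;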

Some limitations to this method, though:

  • As we’ve seen, it takes a few hoops to dish out the information we need. Because we have to identify the statement to profile manually, this procedure may not be easy to port into tools like the sys schema or pstop.
  • It is only possible if Performance Schema is enabled (by default it is enabled since MySQL 5.6.6, yay!)
  • It does not cover all the metrics of the native profiling, e.g. CONTEXT SWITCHES, BLOCK IO, SWAPS.
  • Depending on how busy the server you are testing is, the sizes of the history tables may be too small; you either have to increase the
    performance_schema_events_stages_history_long_size

    variable or lose the history too early (see the my.cnf sketch after this list). Using ps_history might help in this case, though with a little modification to the queries.

  • The resulting Duration per event may vary; I would think this may be due to the additional overhead described in the performance_timers table. In any case, we hope to get this cleared up when this bug is fixed.
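
For reference, a minimal my.cnf sketch for raising the history table sizes mentioned above (these are startup-only options in MySQL 5.6, and the sizes below are arbitrary examples, not recommendations):

[mysqld]
performance_schema_events_statements_history_long_size = 100000
performance_schema_events_stages_history_long_size     = 100000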

Apr 10, 2015
--

Measuring the impact of tcpdump on Very Busy Hosts

A few years back Deva wrote about how to use tcpdump on very busy hosts. That post sparked my interest in exploring how to measure the impact of tcpdump on very busy hosts. In this post, I want to highlight how much of an impact there really is and what options you have to make the query collection much more effective.

Some things you need to know:

  • The test is a sysbench read-only workload, 8 tables, 8 threads, 1000000 rows each with 16G of buffer pool. Dataset fully in memory.
  • sysbench is run on the same host: on a 1Gbps connection, sysbench can saturate the network and therefore affect my network test with netcat, so I decided to run it locally.
  • There are 13 tests, 5 minutes each with a 1-minute interval, varying in how the dump file is captured.
    • The first one, as a baseline, is the MySQL slow query log.
      A:

      mysql -e 'set global long_query_time=0, slow_query_log=1; select sleep(300); set global long_query_time=1, slow_query_log=0;'
    • The second group is tcpdump with the -w option, which means tcpdump itself writes to the capture file.
      B:

      $DUMPCMD -i any -G 300 -W 1 -Z root -w tcpdump.pcap port 3306
      C:

      $DUMPCMD -i any -G 300 -W 1 -Z root -w tcpdump.pcap 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2'
      D:

      $DUMPCMD -i any -G 300 -W 1 -Z root -w tcpdump.pcap 'port 3306 and ( tcp[1] & 7 == 2 or tcp[3] & 7 == 2 )'
    • The third group uses “packet-buffered” mode (the -U option) to see if there is any improvement in response time.
      E:

      $DUMPCMD -i any -G 300 -W 1 -Z root -U -w tcpdump.pcap port 3306
      F:

      $DUMPCMD -i any -G 300 -W 1 -Z root -U -w tcpdump.pcap 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2'
      G:

      $DUMPCMD -i any -G 300 -W 1 -Z root -U -w tcpdump.pcap 'port 3306 and ( tcp[1] & 7 == 2 or tcp[3] & 7 == 2 )'
    • The next group streams the capture to a remote location via netcat.
      H:

      $DUMPCMD -i any -G 300 -W 1 -Z root -w - port 3306 | nc remote_ip 33061
      I:

      $DUMPCMD -i any -G 300 -W 1 -Z root -U -w - port 3306 | nc remote_ip 33062
      J:

      $DUMPCMD -i any -G 300 -W 1 -Z root -U -w - 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' | nc remote_ip 33063
    • The last group, the one most of us are probably accustomed to, pipes the decoded packets to a file.
      K:

      timeout -s KILL 300 $DUMPCMD -i any port 3306 > tcpdump.pcap
      L:

      timeout -s KILL 300 $DUMPCMD -i any 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' > tcpdump.pcap
      M:

      timeout -s KILL 300 $DUMPCMD -i any 'port 3306 and ( tcp[1] & 7 == 2 or tcp[3] & 7 == 2 )' > tcpdump.pcap
    • $DUMPCMD is defined as tcpdump -s 65535 -x -nn -q -tttt
  • In each group there is an AND and an OR variation of the port filtering. I wanted to see how much additional impact the port expressions have, and as you will see below, they do not have a significant impact on performance, but they do affect the number of queries captured.

I’ve graphed the sysbench data during the test and labeled each test for easy comparison on the graphs.

[Graphs: Sysbench TPS and Sysbench Response Time during the tests, with per-test labels]

Of course, I’ve also compared the sizes of the resulting tcpdump captures and the total queries identified when run through pt-query-digest.

[Graph: pcap size and query count per test]

Findings

  • We see that piping decoded packets to a file (K, L, M) has significant overhead in terms of the number of captured queries, response time and read requests completed.
  • Using the slow log has about 30% overhead in response time and nearly a 20% drop in throughput, but it has the highest number of queries captured.
  • Writing captured packets directly to a binary file using the -w option has the lowest overhead in response time, around 10%. Throughput drops depending on how much filtering is involved, though, and there are noticeable stalls when the operating system flushes the page cache. This side effect causes sysbench to drop to 0 reads or even reach response times of several seconds!
  • Streaming packets to a remote server that is capable in terms of network bandwidth and IO performance, combined with the -w option to capture binary data, produces 20-25% overhead in response time and a 10-15% drop in throughput, with no stalls and a number of captured queries close to that of the slow query log.

Summary

Use the tcpdump -w option in all cases and decode later. If you are looking for an overall view of ALL your queries, streaming the tcpdump data to a remote host is also ideal. If you have low bandwidth though, e.g. 100Mbps, this might not be enough, as 5 minutes of binary tcpdump data produced a 31G file. That is roughly a 105MB/s requirement! In that case, consider writing to a separate partition with enough IO.
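
To decode a binary capture later and summarize it, something along these lines should work (file names are placeholders; pt-query-digest reads decoded tcpdump text via its --type tcpdump parser):

tcpdump -r tcpdump.pcap -x -nn -q -tttt 'port 3306' > tcpdump.txt
pt-query-digest --type tcpdump tcpdump.txt > digest-report.txt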

If you are using Percona Server or MariaDB and are only looking to capture a portion of your workload, e.g. table scans or temporary tables on disk, or to rate limit the collection, the extended slow query logging capability in these versions is also an excellent way to capture the data you need.


Oct 14, 2014
--

Recover orphaned InnoDB partition tablespaces in MySQL

A few months back, Michael wrote about reconnecting orphaned *.ibd files using MySQL 5.6. I will show you the same procedure, this time for partitioned tables. An InnoDB partition is a self-contained tablespace in itself, so you can use the same method described in the previous post.

To begin with, I have an example table with a few orphaned partitions and we will reconnect each partition one by one to the original table.

mysql [localhost] {msandbox} (recovery) > SHOW CREATE TABLE t1 \G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
[...]
KEY `h_date` (`h_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (year(h_date))
(PARTITION p0 VALUES LESS THAN (2006) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (2010) ENGINE = InnoDB,
PARTITION px VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (recovery) > SELECT COUNT(*) FROM t1;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)
-rw-rw----.  1 revin revin 176M Oct  8 08:41 t1#P#p0.ibd
-rw-rw----.  1 revin revin 612M Oct  8 08:41 t1#P#p1.ibd
-rw-rw----.  1 revin revin 932M Oct  8 08:42 t1#P#px.ibd

The first step is to create a dummy table and remove partitioning so that we can reattach individual partitions to this table.

mysql [localhost] {msandbox} (recovery) > CREATE TABLE t1_t LIKE t1;
Query OK, 0 rows affected (0.02 sec)
mysql [localhost] {msandbox} (recovery) > ALTER TABLE t1_t REMOVE PARTITIONING;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql [localhost] {msandbox} (recovery) > ALTER TABLE t1_t DISCARD TABLESPACE;
Query OK, 0 rows affected (0.05 sec)


Once the tablespace for our dummy table has been discarded, we copy one of the partitions to take the place of the dummy table’s tablespace. For example, we copy t1#P#p0.ibd as t1_t.ibd into the MySQL data directory, of course taking the file permissions into account afterward.
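
A rough sketch of that copy step from the shell (the source path, the sandbox datadir and the file ownership are assumptions; adjust them to your own instance):

cp '/path/to/orphaned/t1#P#p0.ibd' /sbx/msb/msb_5_6_210/data/recovery/t1_t.ibd
chown mysql:mysql /sbx/msb/msb_5_6_210/data/recovery/t1_t.ibd   # or whichever user runs mysqld

The next step is to import the tablespace into the dummy table.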

mysql [localhost] {msandbox} (recovery) > ALTER TABLE t1_t IMPORT TABLESPACE;
Query OK, 0 rows affected, 1 warning (7.34 sec)


And for the secret sauce, we will exchange our dummy table’s freshly imported tablespace with the target partition in our original table.

mysql [localhost] {msandbox} (recovery) > ALTER TABLE t1 EXCHANGE PARTITION px WITH TABLE t1_t;
Query OK, 0 rows affected (6.42 sec)
mysql [localhost] {msandbox} (recovery) > SELECT COUNT(*) FROM t1;
+----------+
| COUNT(*) |
+----------+
|  8523686 |
+----------+
1 row in set (2.50 sec)


You can do the same with subpartitions, too! Here’s my slightly different table with subpartitions, where I reconnect one of the orphaned tablespaces, t2#P#px#SP#pxsp1.ibd.

mysql [localhost] {msandbox} (recovery) > SHOW CREATE TABLE t2 \G
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
[...]
  KEY `h_date` (`h_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (year(h_date))
SUBPARTITION BY HASH (u_id)
SUBPARTITIONS 2
(PARTITION p0 VALUES LESS THAN (2006) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (2010) ENGINE = InnoDB,
 PARTITION px VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (recovery) > SELECT COUNT(*) FROM t2;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.94 sec)
-rw-rw----.  1 revin revin  92M Oct  8 08:44 t2#P#p0#SP#p0sp0.ibd
-rw-rw----.  1 revin revin  92M Oct  8 08:44 t2#P#p0#SP#p0sp1.ibd
-rw-rw----.  1 revin revin 304M Oct  8 08:44 t2#P#p1#SP#p1sp0.ibd
-rw-rw----.  1 revin revin 316M Oct  8 08:44 t2#P#p1#SP#p1sp1.ibd
-rw-rw----.  1 revin revin 480M Oct  8 08:45 t2#P#px#SP#pxsp0.ibd
-rw-rw----.  1 revin revin 460M Oct  8 08:45 t2#P#px#SP#pxsp1.ibd
mysql [localhost] {msandbox} (recovery) > CREATE TABLE t2_t LIKE t2;
Query OK, 0 rows affected (0.02 sec)
mysql [localhost] {msandbox} (recovery) > ALTER TABLE t2_t REMOVE PARTITIONING;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql [localhost] {msandbox} (recovery) > ALTER TABLE t2_t DISCARD TABLESPACE;
Query OK, 0 rows affected (0.04 sec)


Once again, after copying t2#P#px#SP#pxsp1.ibd to replace t2_t.ibd in the MySQL data directory and setting the correct permissions, we can import it into our dummy table and exchange it into our original table. In this case, on a table with subpartitions, a tablespace is a combined unit of partition and subpartition; in the file name above, P#px is the main partition and SP#pxsp1 the subpartition name. For the exchange command below, we will only need the latter portion.

mysql [localhost] {msandbox} (recovery) > ALTER TABLE t2_t IMPORT TABLESPACE;
Query OK, 0 rows affected, 1 warning (2.49 sec)
mysql [localhost] {msandbox} (recovery) > ALTER TABLE t2 EXCHANGE PARTITION pxsp1 WITH TABLE t2_t;
Query OK, 0 rows affected (3.11 sec)
mysql [localhost] {msandbox} (recovery) > SELECT COUNT(*) FROM t2;
+----------+
| COUNT(*) |
+----------+
|  4546036 |
+----------+
1 row in set (0.94 sec)


But wait, there’s more: did you know that in MySQL 5.7 you can take the full shortcut and directly import all partitions back into the original table? Sweet! In the quick example below on MySQL 5.7.5, I created the same t2 table as above, ran DISCARD TABLESPACE, copied the partition tablespaces from my test 5.6 instance and ran IMPORT TABLESPACE. And done! :-)

mysql [localhost] {msandbox} (test) > SELECT COUNT(*) FROM t2;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (test) > ALTER TABLE t2 DISCARD TABLESPACE;
Query OK, 0 rows affected (0.02 sec)
mysql [localhost] {msandbox} (test) > \! cp -v /sbx/msb/msb_5_6_210/data/test/t2#P#* /sbx/msb/msb_5_7_5/data/test/
`/sbx/msb/msb_5_6_210/data/test/t2#P#p0#SP#p0sp0.ibd' -> `/sbx/msb/msb_5_7_5/data/test/t2#P#p0#SP#p0sp0.ibd'
`/sbx/msb/msb_5_6_210/data/test/t2#P#p0#SP#p0sp1.ibd' -> `/sbx/msb/msb_5_7_5/data/test/t2#P#p0#SP#p0sp1.ibd'
`/sbx/msb/msb_5_6_210/data/test/t2#P#p1#SP#p1sp0.ibd' -> `/sbx/msb/msb_5_7_5/data/test/t2#P#p1#SP#p1sp0.ibd'
`/sbx/msb/msb_5_6_210/data/test/t2#P#p1#SP#p1sp1.ibd' -> `/sbx/msb/msb_5_7_5/data/test/t2#P#p1#SP#p1sp1.ibd'
`/sbx/msb/msb_5_6_210/data/test/t2#P#px#SP#pxsp0.ibd' -> `/sbx/msb/msb_5_7_5/data/test/t2#P#px#SP#pxsp0.ibd'
`/sbx/msb/msb_5_6_210/data/test/t2#P#px#SP#pxsp1.ibd' -> `/sbx/msb/msb_5_7_5/data/test/t2#P#px#SP#pxsp1.ibd'
mysql [localhost] {msandbox} (test) > ALTER TABLE t2 IMPORT TABLESPACE;
Query OK, 0 rows affected, 6 warnings (11.36 sec)
mysql [localhost] {msandbox} (test) > SHOW WARNINGS \G
*************************** 1. row ***************************
  Level: Warning
   Code: 1810
Message: InnoDB: IO Read error: (2, No such file or directory) Error opening './test/t2#P#p0#SP#p0sp0.cfg', will attempt to import without schema verification
[...]


 


Sep 17, 2014
--

Syncing MySQL slave table with pt-online-schema-change

I recently encountered a situation in which, after running Percona Toolkit’s pt-table-checksum on a customer system, 95% of the table on the MySQL master was different on the MySQL slave. Although this table was not a critical part of the infrastructure, from time to time writes to the table from the master would break replication. Additionally, this table has about 6 million rows, and running pt-table-sync would take some time. Rebuilding the slave from a backup of the master would not be an easy option either, since the slave acts as an archive and has a lot more data than the master.

So how did we solve it? With pt-online-schema-change and a NOOP ALTER.

pt-online-schema-change --alter 'ENGINE=INNODB' D=dbname,t=tblname

How is this possible? pt-online-schema-change works by creating a shadow copy of the original table and then copying the rows from the latter to the former. An additional set of TRIGGERs ensures that any further changes to rows that have already been copied to the shadow version will also propagate.

There is one little caveat though: the binlog_format on the master has to be ROW so that the actual row images from the master are copied to the slave. If your binlog_format is set to STATEMENT, you would still end up with the same inconsistency, since the changes would be logged as statements and the NOOP ALTER would simply operate on the slave’s own copy of the table instead of copying rows from the master. You can configure the binlog_format for the alter process via the --set-vars option, i.e.

--set-vars 'binlog_format=ROW'

to make it easier!
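
Putting it together, a hedged sketch of the full command (depending on your pt-online-schema-change version you may also need --execute to actually run it; the database and table names are placeholders):

pt-online-schema-change --alter 'ENGINE=INNODB' \
  --set-vars 'binlog_format=ROW' \
  --execute D=dbname,t=tblname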


Jul 16, 2014
--

High Availability with mysqlnd_ms on Percona XtraDB Cluster

This is the second part of my series on High Availability with mysqlnd_ms. In my first post, “Simple MySQL Master HA with mysqlnd_ms,” I showed a simple HA solution using asynchronous MySQL replication. This time we will see how to leverage an all-primary cluster where you can write to all nodes. In this post I used Percona XtraDB Cluster, but you should also be able to do the same with MySQL NDB Cluster or Tungsten Replicator.

To start with, here is the mysqlnd_ms configuration I used, mysqlnd_ms_mm.ini. All of these files are available from my Github repository. Below, I have three Percona XtraDB Cluster nodes, all defined as masters and no slaves. I’ve configured a roundrobin filter where all connections will happen on the first node, in this case 192.168.56.44. In case the first node fails, the second node will be used, and so forth until no more nodes are available. Another interesting configuration option here is the loop_before_master strategy, whereby if a connection or a statement to the current server fails, it will be retried silently on the remaining nodes before returning an error to the user; more on this below.

{
   "primary": {
     "master": {
       "master_1": {
         "host": "192.168.56.44",
         "port": "3306"
       },
       "master_2": {
         "host": "192.168.56.43",
         "port": "3306"
       },
       "master_3": {
         "host": "192.168.56.42",
         "port": "3306"
       }
     },
     "slave": { },
     "filters": { "roundrobin": [ ] },
     "failover": { "strategy": "loop_before_master", "remember_failed": true }
   }
 }

Similar to my previous post, I also used a custom INI file for PHP, this time aptly named master-master.ini:

mysqlnd_ms.enable = 1
mysqlnd_ms.disable_rw_split = 1
mysqlnd_ms.multi_master = 1
mysqlnd_ms.force_config_usage = 1
mysqlnd_ms.config_file = /home/revin/git/demo-me/phpugph201407/mysqlnd_ms_mm.ini

A new addition to this configuration is mysqlnd_ms.multi_master; when enabled, it allows you to use all nodes or just one and treat the others as passive. The PHP script I used this time is called master-master.php; it is largely similar to master-slave-ng.php with a few differences:

  1. There is no need for the /tmp/PRIMARY_HAS_FAILED sentinel, as all nodes are writable.

  2. There is no need for the /*ms=master*/ SQL hint when validating a connection from the connect_mysql function, since all nodes act as masters.

So here is a quick test, first with the roundrobin filter: after 4 INSERTs, I shut down 192.168.56.44, which sends my connection to the next server in the configuration, 192.168.56.43. When I started 192.168.56.44 back up again, the script resumed connections there. Pretty cool, right?

[revin@forge phpugph201407]$ php -c master-master.ini master-master.php
Last value 3564 from host 192.168.56.44 via TCP/IP and thread id 19
Last value 3565 from host 192.168.56.44 via TCP/IP and thread id 20
Last value 3566 from host 192.168.56.44 via TCP/IP and thread id 21
Last value 3567 from host 192.168.56.44 via TCP/IP and thread id 22
Warning: connect_mysql(): MySQL server has gone away in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63
Warning: connect_mysql(): Error while reading greeting packet. PID=23464 in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63
ERRROR: 192.168.56.43 via TCP/IP [2006] MySQL server has gone away on line 30
Last value 0 from host  and thread id 0
Last value 3568 from host 192.168.56.43 via TCP/IP and thread id 1552
Last value 3569 from host 192.168.56.43 via TCP/IP and thread id 1553
[...]
Last value 3584 from host 192.168.56.43 via TCP/IP and thread id 1568
Last value 3585 from host 192.168.56.44 via TCP/IP and thread id 18

Here’s another test using the random filter, which allows you to write to all nodes: in my mysqlnd_ms_mm.ini above, I just changed roundrobin to random. As you can see, all three nodes were being used, of course in random order. You can also see where I shut down 192.168.56.44, around where the connect_mysql errors appear, and then the server being used again near the bottom after I started it back up. Still pretty cool, right?

[revin@forge phpugph201407]$ php -c master-master.ini master-master.php
Last value 3590 from host 192.168.56.42 via TCP/IP and thread id 2060
Last value 3591 from host 192.168.56.43 via TCP/IP and thread id 1569
Last value 3592 from host 192.168.56.43 via TCP/IP and thread id 1570
Warning: connect_mysql(): MySQL server has gone away in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63
Warning: connect_mysql(): Error while reading greeting packet. PID=23919 in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63
ERRROR: 192.168.56.43 via TCP/IP [2006] MySQL server has gone away on line 30
Last value 0 from host  and thread id 0
Last value 3593 from host 192.168.56.42 via TCP/IP and thread id 2061
Last value 3594 from host 192.168.56.42 via TCP/IP and thread id 2062
Last value 3595 from host 192.168.56.42 via TCP/IP and thread id 2063
Last value 3596 from host 192.168.56.42 via TCP/IP and thread id 2064
Last value 3597 from host 192.168.56.43 via TCP/IP and thread id 1576
Last value 3598 from host 192.168.56.43 via TCP/IP and thread id 1577
Last value 3599 from host 192.168.56.43 via TCP/IP and thread id 1578
Last value 3600 from host 192.168.56.43 via TCP/IP and thread id 1579
Last value 3601 from host 192.168.56.42 via TCP/IP and thread id 2065
Last value 3602 from host 192.168.56.43 via TCP/IP and thread id 1581
Last value 3603 from host 192.168.56.43 via TCP/IP and thread id 1582
Last value 3604 from host 192.168.56.42 via TCP/IP and thread id 2066
Last value 3605 from host 192.168.56.44 via TCP/IP and thread id 19
Last value 3606 from host 192.168.56.43 via TCP/IP and thread id 1583
Last value 3607 from host 192.168.56.44 via TCP/IP and thread id 21

So here are some issues I’ve observed during these tests:

  1. remember_failed during failover does not work as advertised. Supposedly, a failed node should not be used again for subsequent connection requests, but in my tests this was not the case. See more in this bug. This means that if you have 2 out of 3 nodes failed in this scenario, the overhead of testing both dead connections would be too big. Perhaps some sort of in-memory shared TTL could be used to overcome this? I’m not sure.

  2. If you look closely around line 7 of my last output above, the error displayed is kind of misleading. In particular it says ERRROR: 192.168.56.43 via TCP/IP, whereas it was not 192.168.56.43 that failed, it was 192.168.56.44. This is because, under the hood, immediately after a failure the next node is cycled to; this is especially true since we have loop_before_master configured. I surely have a bug in the script that should capture the host_info properly, but this is something to always keep in mind so you don’t keep scratching your head.

So we’ve seen these two possibilities, and they definitely have use cases and advantages. On the other hand, because of the issues found so far (I’ve reported 4 bugs in the PHP bugs database during the course of these tests, including one crash), I recommend making sure you test thoroughly before putting this in production.


Apr 23, 2014
--

Encrypted and incremental MySQL backups with Percona XtraBackup

We’ve recently received a number of questions on how to implement incremental MySQL backups alongside encryption with Percona XtraBackup. Some users thought it was not possible because, with the default --encrypt options in XtraBackup, all files will be encrypted; but alas, that is not the case. This is where the --extra-lsn-dir option becomes useful: it allows you to save the LSN (Log Sequence Number) information to another directory and exclude it from encryption, so you can still use the information needed for incremental backups. Enough talk, let me show you.

Because you would usually want to script your backup and restore procedure, I’ll use variables here as well to make that more familiar. First, I create 3 folders where my backups will be stored: ‘full’ for full backups, ‘incr’ for incremental backups, and ‘lsns’ to store an extra copy of my xtrabackup_checkpoints file via --extra-lsn-dir.

mkdir -p /ssd/msb/msb_5_5_360/bkp/full
mkdir -p /ssd/msb/msb_5_5_360/bkp/incr
mkdir -p /ssd/msb/msb_5_5_360/bkp/lsns

Second, to have better control of where my backups go, I prefer assigning timestamped folders myself and using the --no-timestamp option to innobackupex.

CURDATE=$(date +%Y-%m-%d_%H_%M_%S)

Then manually create the specific directory where the backup’s xtrabackup_checkpoints file would be saved:

mkdir -p /ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE

Of course, I need an encryption key for my encrypted backups. In this case, taking the example from the manual, I used openssl to generate a random key. You can use your own key string as long as its size conforms to what the --encrypt algorithm you chose requires.

echo -n $(
   openssl enc -aes-256-cbc -pass pass:Password -P -md sha1 \
   | grep iv | cut -d'=' -f2
) > /ssd/msb/msb_5_5_360/bkp/backups.key

Next, I would run my full backup:

innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --extra-lsndir=/ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE \
   --encrypt=AES256 --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   --no-timestamp /ssd/msb/msb_5_5_360/bkp/full/$CURDATE

The output says my full backup is saved to:

innobackupex: Backup created in directory '/ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46'
140423 01:20:55  innobackupex: Connection to database server closed
140423 01:20:55  innobackupex: completed OK!

Now here’s the trick: because the full backup is encrypted, we will use the xtrabackup_checkpoints file that xtrabackup saved separately to the --extra-lsn-dir path we specified above to get the LSN and use that for our next incremental backup.

LAST_LSN=$(
   cat /ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE/xtrabackup_checkpoints \
   | grep to_lsn | cut -d'=' -f2
)
CURDATE=$(date +%Y-%m-%d_%H_%M_%S)
mkdir /ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE

Above, we get the LSN value and assign it to a variable. Similarly, we create a new CURDATE string for our incremental backup and a new directory for its xtrabackup_checkpoints file. If you plan to create another incremental backup based off the one we are about to take now, you will use this next xtrabackup_checkpoints file to get LAST_LSN.

With the upcoming Percona XtraBackup 2.2.1, you will no longer need --extra-lsn-dir nor have to parse the xtrabackup_checkpoints file for this purpose. A new feature that allows the user to save backup metadata to an InnoDB table will be available.

So, now that we have our $LAST_LSN value, we execute our incremental backup with the command:

innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --extra-lsndir=/ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE \
   --encrypt=AES256 --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   --no-timestamp --incremental --incremental-lsn $LAST_LSN \
   /ssd/msb/msb_5_5_360/bkp/incr/$CURDATE

Again, based on the output, my backup was created at:

innobackupex: Backup created in directory '/ssd/msb/msb_5_5_360/bkp/incr/2014-04-23_01_21_00'
140423 01:21:47  innobackupex: Connection to database server closed
140423 01:21:47  innobackupex: completed OK!

Now we have a full backup and an incremental backup. Of course, to make sure our backups are usable, we’d like to validate them. To do that, our first step is to decrypt both the full and the incremental backups. innobackupex has another handy --decrypt option for that; you can even use --parallel to make it faster.

innobackupex --decrypt=AES256 \
   --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46
innobackupex --decrypt=AES256 \
   --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   /ssd/msb/msb_5_5_360/bkp/incr/2014-04-23_01_21_00
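
If your XtraBackup version supports parallel decryption, a hedged variant of the same command would look like this (the thread count is an arbitrary example):

innobackupex --decrypt=AES256 --parallel=4 \
   --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46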

Once the backups are decrypted, we can go through the usual process of preparing the full and incremental backups as described in the manual.

innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --apply-log --redo-only /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46
innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --apply-log --redo-only /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46 \
   --incremental-dir=/ssd/msb/msb_5_5_360/bkp/incr/2014-04-23_01_21_00
innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --apply-log /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46


Jan 17, 2014
--

keepalived with reader and writer VIPs for Percona XtraDB Cluster

This is a followup to Jay Janssen’s October post, “Using keepalived for HA on top of Percona XtraDB Cluster.” We recently got a request where the customer has 2 VIPs (Virtual IP addresses), one for a reader and one for a writer, on a cluster of 3 nodes. They wanted to keep it simple, with low latency, and without requiring an external node resource the way HAProxy would.

keepalived is a simple load balancer with HA capabilities, which means it can proxy TCP services behind it and at the same time, keep itself highly available using VRRP as failover mechanism. This post is about taking advantage of the VRRP capabilities built into keepalived to intelligently manage your PXC VIPs.

While Yves Trudeau also wrote a very interesting and somewhat similar solution using ClusterIP and Pacemaker to load balance VIPs, they have different use cases. Both solutions avoid the latency of an external proxy or load balancer, but unlike ClusterIP, connections to the desired VIP with keepalived go to a single node, which means a little less work for each node deciding whether it should respond to the request. ClusterIP is good if you want to send writes to all nodes in a calculated distribution, while with our keepalived option each VIP is at best assigned to only a single node; depending on your workload, each will have advantages and disadvantages.

The OS I used was CentOS 6.4 with keepalived 1.2.7, available in the yum repositories; however, it’s difficult to troubleshoot failover behavior with VRRP_Instance weights without seeing them from keepalived directly. So I used a custom build with a patch for the --vrrp-status option, which allows me to monitor something like this:

[root@pxc01 keepalived]# keepalived --vrrp-status
VRRP Instance             : writer_vip
   Interface              : eth5
   Virtual Router ID      : 60
   State                  : BACKUP
   Virtual IP address     : 192.168.56.83
   Advertisement interval : 1 sec
   Preemption             : Enabled, delay 0 secs
   Priority               : 101
   Effective Priority     : 101
   Authentication         : NONE
   Master router          : 192.168.56.44 priority 151
   Master down interval   : 3.6
VRRP Instance             : reader_vip
   Interface              : eth5
   Virtual Router ID      : 61
   State                  : MASTER
   Virtual IP address     : 192.168.56.84
   Advertisement interval : 1 sec
   Preemption             : Enabled, delay 0 secs
   Priority               : 101
   Effective Priority     : 181
   Authentication         : NONE
   Master router          : 192.168.56.42 (local)

So first, let’s compile keepalived from source; the Github branch here is where the status patch is available.

cd ~
git clone https://github.com/jonasj76/keepalived.git
cd keepalived
git checkout 5c5b2cc51760967c92b968d6e886ab6ecc2ee86d
./configure
make && make install

Install the custom tracker script below; because compiling keepalived above installs it under /usr/local/bin, I put this script there as well. One would note that this script looks completely redundant, and it’s true, but beware that keepalived does not validate its configuration, especially track_scripts, so I prefer to keep it in a separate bash script so I can easily debug misbehavior. Of course, when all is working well, you can always merge this into the keepalived.conf file.

#!/bin/bash
# Modify these addresses to match your reader and writer VIPs
WRITER_VIP=192.168.56.83
READER_VIP=192.168.56.84
# Make sure your clustercheck script also works
PXC_CHECK='/usr/bin/clustercheck clustercheck password 0'
SCRIPT=$1
WEIGHT=101
case $SCRIPT in
   'bad_pxc')
      $PXC_CHECK || exit 1
      ;;
   'nopreempt_writer')
      [[ "$(hostname|cut -d'.' -f1)" != 'pxc01' && $(ip ad sh|grep $WRITER_VIP) && $(ip ad sh|grep $READER_VIP|grep -c inet) -eq 0 ]] || exit 1
      ;;
   'nopreempt_reader')
      [[ "$(hostname|cut -d'.' -f1)" != 'pxc02' && $(ip ad sh|grep $READER_VIP) && $(ip ad sh|grep $WRITER_VIP|grep -c inet) -eq 0 ]] || exit 1
      ;;
   'repel_writer')
      [ $(ip ad sh|grep $WRITER_VIP|grep -c inet) -eq 0 ] || exit 1
      ;;
esac
exit 0

And below is my /etc/keepalived.conf:

vrrp_script nopreempt_writer_vip {
        script "/usr/local/bin/pxc-track nopreempt_writer"
        interval 2
}
vrrp_script nopreempt_reader_vip {
        script "/usr/local/bin/pxc-track nopreempt_reader"
        interval 2
}
vrrp_script repel_writer_vip {
	script "/usr/local/bin/pxc-track repel_writer"
	interval 2
}
vrrp_script bad_pxc {
	script "/usr/local/bin/pxc-track bad_pxc"
	interval 2
}
vrrp_instance writer_vip {
	interface eth5
	state BACKUP
	virtual_router_id 60
	priority 101
	virtual_ipaddress {
		192.168.56.83
	}
	track_script {
		nopreempt_writer_vip weight 50
		bad_pxc weight -100
	}
	track_interface {
		eth5
	}
        notify_master "/bin/echo 'writer now master' > /tmp/keepalived-w.state"
        notify_backup "/bin/echo 'writer now backup' > /tmp/keepalived-w.state"
        notify_fault "/bin/echo 'writer now fault' > /tmp/keepalived-w.state"
}
vrrp_instance reader_vip {
	interface eth5
	state BACKUP
	virtual_router_id 61
	priority 101
	virtual_ipaddress {
		192.168.56.84
	}
	track_script {
		repel_writer_vip weight 30
		nopreempt_reader_vip weight 50
		bad_pxc weight -100
	}
	track_interface {
		eth5
	}
	! This does not work properly if we stop the MySQL process
	! VIP seems to stick on the node so we have separate nopreempt_* track_scripts
	!nopreempt
        notify_master "/bin/echo 'reader now master' > /tmp/keepalived-r.state"
        notify_backup "/bin/echo 'reader now backup' > /tmp/keepalived-r.state"
        notify_fault "/bin/echo 'reader now fault' > /tmp/keepalived-r.state"
}

There are a number of things you can change here, like removing or modifying the notify_* clauses to fit your needs, or sending SMTP notifications during VIP failovers. I also prefer the initial state of the VRRP_Instances to be BACKUP instead of MASTER and let the voting at runtime dictate where the VIPs should go.

The configuration ensures that the reader and writer will not share a single node if more than one is available in the cluster. Even though the writer VIP prefers pxc01 in my example, this does not really matter much and only makes a difference when the reader VIP is not in the picture; there is no automatic failback, thanks to the nopreempt_* track_scripts.

Now, to see it in action: after starting the cluster and keepalived in the order pxc01, pxc02, pxc03, I have these statuses and weights:

[revin@forge ~]$ for h in pxc01 pxc02 pxc03; do echo "$h $(ssh root@$h 'cat /tmp/keepalived-w.state /tmp/keepalived-r.state'|xargs)"; done
pxc01 writer now master reader now backup
pxc02 writer now backup reader now master
pxc03 writer now backup reader now backup
pxc01 2014-01-15_20_58_23 writer_vip 161 reader_vip 101
pxc02 2014-01-15_20_58_28 writer_vip 101 reader_vip 131
pxc03 2014-01-15_20_58_36 writer_vip 131 reader_vip 131

The writer is on pxc01 and the reader on pxc02; even though the reader VIP scores on pxc02 and pxc03 match, the VIP remains on pxc02 because of our nopreempt_* script. Let’s see what happens if I stop MySQL on pxc02:

[revin@forge ~]$ for h in pxc01 pxc02 pxc03; do echo "$h $(ssh root@$h 'cat /tmp/keepalived-w.state /tmp/keepalived-r.state'|xargs)"; done
pxc01 writer now master reader now backup
pxc02 writer now backup reader now backup
pxc03 writer now backup reader now master
pxc01 2014-01-15_21_01_17 writer_vip 161 reader_vip 101
pxc02 2014-01-15_21_01_24 writer_vip 31 reader_vip 31
pxc03 2014-01-15_21_01_36 writer_vip 101 reader_vip 181

The reader VIP moved to pxc03 and the weights changed: the pxc02 reader dropped by 100 and on pxc03 it gained 50; again, we set this higher for non-preemption. Now let’s stop MySQL on pxc03:

[revin@forge ~]$ for h in pxc01 pxc02 pxc03; do echo "$h $(ssh root@$h 'cat /tmp/keepalived-w.state /tmp/keepalived-r.state'|xargs)"; done
pxc01 writer now master reader now master
pxc02 writer now backup reader now backup
pxc03 writer now backup reader now backup
pxc01 2014-01-15_21_04_43 writer_vip 131 reader_vip 101
pxc02 2014-01-15_21_04_49 writer_vip 31 reader_vip 31
pxc03 2014-01-15_21_04_56 writer_vip 31 reader_vip 31

All our VIPs are now on pxc01; let’s start MySQL on pxc02:

[revin@forge ~]$ for h in pxc01 pxc02 pxc03; do echo "$h $(ssh root@$h 'cat /tmp/keepalived-w.state /tmp/keepalived-r.state'|xargs)"; done
pxc01 writer now master reader now backup
pxc02 writer now backup reader now master
pxc03 writer now backup reader now backup
pxc01 2014-01-15_21_06_41 writer_vip 161 reader_vip 101
pxc02 2014-01-15_21_06_50 writer_vip 101 reader_vip 131
pxc03 2014-01-15_21_06_55 writer_vip 31 reader_vip 31

Our reader is back on pxc02 and the writer remains intact. When both VIPs end up on a single node (i.e. the last node standing) and a second node comes up, the reader moves, not the writer; this is to prevent any risk of breaking connections that may be writing to the node currently owning the writer VIP.


May 1, 2013
--

Follow these basics when migrating to Percona XtraDB Cluster for MySQL

Galera/Percona XtraDB Cluster (PXC) for MySQL is a hot thing right now and some users jump right in without enough testing. Consequently, they’re more likely to suffer failures or issues that prevent them from moving forward. If you are thinking of migrating your workload to Percona XtraDB Cluster, make sure to go through these basics.

log_slave_updates is REQUIRED

You need log_slave_updates enabled on the cluster node acting as async slave for replicated events from the async master to be applied to the other nodes, that is, if you have more than one PXC node. This is because before Galera can create writesets for the replicated events, binlog events must be generated for the transactions first. Under normal async replication, an event will not be written to the slave’s binary log unless log_slave_updates is enabled; Percona XtraDB Cluster is similar in that if you want an async event replicated to the whole cluster, you have to have the same option enabled.
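
As a reference, a minimal my.cnf sketch for the PXC node acting as the async slave (the option names are standard MySQL/PXC settings; the values are examples only):

[mysqld]
server_id         = 10
log_bin           = mysql-bin
log_slave_updates = 1
binlog_format     = ROW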

MyISAM in PXC May Lead to Inconsistencies and May Not Even Work!

MyISAM tables are supported within Percona XtraDB Cluster; however, MyISAM has only basic support, primarily because the storage engine is non-transactional and so PXC cannot guarantee the data will remain consistent within the cluster. Also, at the time of this writing, MyISAM is not being replicated at all from the async stream, which I reported in this bug. This would be a showstopper for anyone who wants to migrate but still has MyISAM tables. You can still try by filtering out MyISAM tables, though, if you can leave them behind. Lastly, once that bug is fixed, if you still have MyISAM tables you wish to keep running under PXC, wsrep_replicate_myisam allows you to do so. However, if you can, you should consider moving to InnoDB altogether. There are very few reasons to stay with MyISAM nowadays, e.g. if you have FULLTEXT indexes that you simply cannot replace in the short term.

Control Your Auto-Incrementing Columns

PXC/Galera controls auto-incrementing values internally within the cluster; this is to avoid collisions when INSERTs happen on more than a single node. However, this may work differently when replicating from an async master, for example as described in these two bugs. Galera uses writesets to replicate cluster events to the other nodes; in essence these are RBR events, plus a few additional structures used for certification. Having said that, it would be good if your async master can use ROW-based binlog format as well to achieve better consistency. If you have an async master <= 5.0 though, you can work around this by turning off wsrep_auto_increment_control on the Percona XtraDB Cluster nodes. Note that with the latter, make sure not to forget turning the feature back on when you switch to the new cluster, especially if you are planning to write on multiple nodes.
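
For reference, the workaround mentioned above is just a dynamic variable you can toggle on the PXC nodes (and remember to turn it back ON after the cutover):

SET GLOBAL wsrep_auto_increment_control = OFF;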

Have PRIMARY KEYS

If you still have tables without PRIMARY KEYs, then it’s time to make one for them. Galera does not work well with those, and even though there is basic support when wsrep_certify_nonPK is enabled, you can still hit issues such as the automatic creation of primary keys for use during certification becoming non-deterministic. Although the previous bug has been fixed in the latest release (5.5.30-23.7.4), a table without a PK imposes additional overhead, and because cluster performance is somewhat dependent on the slowest node, this overhead can easily become visible on the whole cluster and affect your async replication.
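
A hedged helper query to find such tables before migrating (it checks information_schema for base tables that have no PRIMARY KEY constraint):

SELECT t.TABLE_SCHEMA, t.TABLE_NAME
FROM information_schema.TABLES t
LEFT JOIN information_schema.TABLE_CONSTRAINTS c
  ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
 AND c.TABLE_NAME = t.TABLE_NAME
 AND c.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE t.TABLE_TYPE = 'BASE TABLE'
  AND t.TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema')
  AND c.CONSTRAINT_NAME IS NULL;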

Be Prepared for some Latency

PXC can take workloads, however not just any workload: it shines with small transactions but not with big ones. If you are consistently running overnight reporting jobs and sending them through the replication stream, expect some replication lag. This is because synchronous replication inside PXC has additional overhead, which means the SQL_THREAD will not be able to execute events as fast; on top of that, other factors affect async replication, such as whether your workload is CPU or IO bound. Peter wrote some good details about it here.

If you have encountered any other issues replicating to Percona XtraDB Cluster, I’d like to hear your thoughts and experiences in the comments :)

