Nov
30
2012
--

Percona Live London 2012: Percona XtraDB Cluster in a nutshell

On Monday, Seppo Jaakola from Codership and I will deliver a tutorial on Percona XtraDB Cluster (Percona Server + Galera).

Already in London and you don’t know how to consume the hotel’s bandwidth? If you attend this tutorial, you can download the VirtualBox appliance we will use during the session.

Two versions are available:

How to test them:

  1. Open the downloaded file with VirtualBox
  2. Install all 3 machines
  3. Boot all 3 of them.
  4. Log in to one of them (root / vagrant) and ping the other machines (e.g. ping node2)
  5. If it works, you are already set up for Monday afternoon.

See you in London!

Nov
30
2012
--

What Operating System do you run MySQL on in Production

This Friday we’ll have two polls instead of one to learn about the operating systems we’re using to run MySQL. The first poll is about production use and the second is about development. I know many developers run MySQL on Linux in production but use Windows or Mac OS X for development. Please select the operating system which actually runs your MySQL server, not the base operating system: if you’re running virtualized Linux on your Windows machine, please select Linux as the OS used for development.

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.
Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

Thank you for Participating!

Nov
28
2012
--

Replication of the NOW() function (also, time travel)

Notice the result of the NOW() function in the following query. The query was run on a real database server and I didn’t change the clock of the server or change anything in the database configuration settings.

mysql> SELECT NOW(),SYSDATE();
+---------------------+---------------------+
| NOW()               | SYSDATE()           |
+---------------------+---------------------+
| 1999-01-01 00:00:00 | 2012-11-29 05:50:03 |
+---------------------+---------------------+
1 row in set (0.00 sec)

You may proceed to party like it is 1999.

How can the NOW() function return a value in the past?

The “secret” is the TIMESTAMP variable, which is a special MySQL variable that can be set by the MySQL client. MySQL adds special events in the binary log which set the TIMESTAMP and INSERT_ID (which is used for AUTO_INCREMENT replication) to the correct values to ensure that statements replicate properly.
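For example, the INSERT_ID session variable works the same way for AUTO_INCREMENT values. A minimal sketch (the table t here is purely for illustration):

CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY);
SET INSERT_ID = 1000;        -- the next auto-generated id will be 1000
INSERT INTO t VALUES (NULL);
SELECT LAST_INSERT_ID();     -- returns 1000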

Here is the SQL to produce the above output:

SET TIMESTAMP=UNIX_TIMESTAMP('1999-01-01');
SELECT NOW(),SYSDATE();

Notice that SYSDATE() returns the correct date. This means the SYSDATE() function is not replication safe by default. If you want to use the SYSDATE() function in your app in DML statements, make sure you use the --sysdate-as-now option to mysqld.
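If you want to try the time travel yourself, you can return the session to the real clock by resetting the variable (SET TIMESTAMP = DEFAULT should work on recent servers; SET TIMESTAMP = 0 is the traditional spelling):

SET TIMESTAMP = UNIX_TIMESTAMP('1999-01-01');
SELECT NOW();              -- 1999-01-01 00:00:00
SET TIMESTAMP = DEFAULT;   -- or SET TIMESTAMP = 0 on older versions
SELECT NOW();              -- back to the current date and time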

Nov
26
2012
--

Announcing Percona Playback 0.5

Percona is glad to announce the release of Percona Playback 0.5 on November 26th, 2012. Downloads are available from our download site and Percona Software Repositories.

Percona Playback is a tool for replaying the load of one database server on another. Currently it can read queries from MySQL query log and tcpdump files and run them against another MySQL server. With Percona Playback you can measure how a server or database upgrade, a my.cnf change, or a schema change affects the overall performance of your database server.

This release should be considered Alpha quality.

New features:

  • Reports have been extended and now return the list of queries whose response time was longer than on the original server. Additional statistics, such as the query types (SELECT, UPDATE, …) with the proportion of faster and slower statements for each category, have been added as well.
  • Percona Playback now supports reading the slow query log from standard input (stdin) instead of a log file.
  • Binary packages were prepared for this release; the full list is available in our documentation.

Bugs fixed:

  • Multi-line queries in the slow query log didn’t work correctly. Bug fixed #1035217 (Stewart Smith).
  • Raw tcpdump format wasn’t documented. Bug fixed #1034527 (Hrvoje Matijakovic).

Replaying the load with Percona Playback and tcpdump

Miguel already showed in his post how to use Percona Playback with the query_log plugin. The following example shows how to capture traffic with tcpdump and create a dump file that will later be used with Percona Playback.

On the production server we start the tcpdump with the -w option to collect packets into the raw tcpdump file:

tcpdump -i any port 3306 -w example.dump

After we collect some traffic we can stop tcpdump and use the dump file to replay the load on the test server. Warning: Percona Playback will run all the queries, not just the SELECTs but the UPDATEs, DELETEs and INSERTs as well, so be sure not to run this against a production database unless you don’t need the data anymore.

percona-playback --input-plugin=tcpdump --tcpdump-file=example.dump \
--tcpdump-mode=accurate --db-plugin=libmysqlclient --mysql-host=10.8.2.10 \
--mysql-username=root --mysql-password=passW0rd --mysql-schema=imdb

This example uses accurate tcpdump-mode, which means Percona Playback will preserve query execution times and pause between queries to play back the same load that was recorded on production with the same timing.
After the dump file has been replayed on the server, Percona Playback shows the playback statistics in a report like this:

Executed 1271 queries
Spent 00:00:59.856656 executing queries versus an expected 00:00:47.085301 time.
61 queries were quicker than expected, 21 were slower
A total of 0 queries had errors.
Expected 30298 rows, got 30298 (a difference of 0)
Number of queries where number of rows differed: 0. 

We can use this information to decide whether the test server needs further tuning or whether it will be able to sustain the production load.

If you encounter any bugs, please report them to our bug tracking system.

Nov
26
2012
--

Realtime stats to pay attention to in Percona XtraDB Cluster and Galera

I learn more and more about Galera every day.  As I learn more, I try to keep my myq_gadgets toolkit up to date with what I consider important to keep an eye on on a PXC node.  In that spirit, just today I pushed some changes to the ‘wsrep’ report, and I thought I’d go over some of the status variables and metrics tracked there, with the aim of showing folks what they should be watching (at least in my opinion, this is subject to change!).

First, let’s take a look at the output:

[root@node3 ~]# myq_status -t 1 wsrep
Wsrep    Cluster        Node           Queue   Ops     Bytes     Flow        Conflct
    time  name P cnf  #  name  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt dst lcf bfa
19:17:01 trime P   3  3 node3 Sync T/T   0   0  35  40  54K  61K 0.0   0  17   0   2
19:17:03 trime P   3  3 node3 Sync T/T   0   0  70  85 107K 124K 0.0   0  13   0   2
19:17:04 trime P   3  3 node3 Sync T/T   0   0  72  81 111K 121K 0.0   0  16   0   3
19:17:05 trime P   3  3 node3 Sync T/T   0   0  70  85 108K 124K 0.0   0  17   0   4
19:17:06 trime P   3  3 node3 Sync T/T   0   0  66  82 100K 124K 0.0   0  17   0   3
19:17:07 trime P   3  3 node3 Sync T/T   0   0  68  78 105K 117K 0.0   0  22   0   0
19:17:08 trime P   3  3 node3 Sync T/T   0   0  65  93 101K 135K 0.0   0  14   1   5
19:17:09 trime P   3  3 node3 Sync T/T   0   0  73  83 111K 125K 0.0   0  19   0   3
19:17:10 trime P   3  3 node3 Sync T/T   0   0  30  46  46K  66K 0.0   0  10   0   2
19:17:12 trime P   3  3 node3 Sync T/T   0   0  64  80  97K 120K 0.0   0  19   0   4
19:17:13 trime P   3  3 node3 Sync T/T   0   0  69  88 106K 131K 0.0   0  28   0   1
19:17:14 trime P   3  3 node3 Sync T/T   0   0  70  83 106K 121K 0.0   0  11   0   3
19:17:15 trime P   3  3 node3 Sync T/T   0   0  72  84 111K 126K 0.0   0  15   0   3

As I’ve mentioned before, myq_status gives an iostat-like output of your server.  This tool takes what are usually global counters in SHOW GLOBAL STATUS, calculates the change each second, and reports that.  There are lots of other reports it can run, but this one is focused on ‘wsrep%’ status variables.
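If you want to look at the raw counters the report is built from, they all come from a single place; this is what the tool samples once a second before printing the per-second difference:

SHOW GLOBAL STATUS LIKE 'wsrep%';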

It’s important to note that this reflects the status of a single PXC node in my cluster, node3 to be precise, so some information is cluster-wide, other information is specific to this particular node.  I tend to open a window for each node and run the tool on each so I can see things across the entire cluster at a glance.  Sometime in the future, I’d like to build a tool that polls every cluster node, but that’s not available currently.

Let’s go through the columns.

Cluster

There are 4 columns in the cluster section, and it’s important to understand that this tool currently only connects to a single node (by default, localhost).  The state of the cluster could be divergent across multiple nodes, so be careful not to assume all nodes have these values!

name

The cluster’s name (first 5 characters).  This is wsrep_cluster_name.

P

Either P for Primary or N for Non-primary.  This is the state of this partition of the cluster.  If a cluster gets split-brained, then only a partition with a quorum (more than half of the nodes) will remain Primary.  Non-primary partitions are the remaining minority and will not allow database operations.

cnf

This is wsrep_cluster_conf_id — the version # of the cluster configuration.  This changes every time a node joins or leaves the cluster.  Seeing high values here may indicate that you have nodes frequently dropping out of and rejoining the cluster, and you may need to retune some node timeouts to prevent this.

#

The number of nodes in the cluster.

Node

This is state data about the local node that the tool happens to be connected to.

name

The name of this local node (first 5 characters).  This is handy when you have this tool running in several windows on several nodes.

cmt

This is the wsrep_local_state_comment — basically a plaintext word describing the state of the node in terms of the rest of the cluster.  ‘Sync’ (Synced) is what you want to see, but ‘Dono’ (Donor), ‘Join’ (Joiner), and others are possible.  This is handy to quickly spot which node was elected to donate either SST or IST to another node entering the cluster.

sta

Short for state, this is two True/False values (T/F) for wsrep_ready and wsrep_connected.  These are somewhat redundant with the local_state value, so I may remove them in the future.

Queue

This is information about the replication queues in both directions.

The ‘Up’ queue is outbound replication.  This generally increases when some other node is having difficulty receiving replication events.

The ‘Dn’ (down) queue is inbound replication.  Positive values here can be an indicator that this node is slow to apply replication writesets.

Ops

Ops are simply replicated transactions or writesets.  Up is outbound, i.e., where this node was the originator of the transaction.  Dn is inbound (down), that is, transactions coming from other nodes in the cluster.

Bytes

Just like Ops, but in Bytes instead of transaction counts.  I have seen production clusters with performance issues where the Ops and Bytes dropped to zero on all the nodes for a few seconds, and then a massive 90M+ replication transaction came through.  Using the Up and Dn columns, I could easily see which node was the originator of the transaction.

Flow

Flow gives some information about Flow Control events.  Galera has some sophisticated ways of metering replication so lag does not become a problem.

pau

wsrep_flow_control_paused — This is the amount of time since the last time SHOW GLOBAL STATUS was run that replication was paused due to flow control.  This is a general indicator that flow control is slowing replication (and hence overall cluster writes) down.

snt

wsrep_flow_control_sent — how many flow control events were SENT from this node.  Handy to find the node slowing the others down.

dst

This doesn’t really belong under the Flow group.  This is wsrep_cert_deps_distance — a general indicator of how many parallel replication threads you could use.  In practice I haven’t found this extremely helpful yet and I may remove it in the future.  I think being aware of how flow control works and watching flow control events and queue sizes is a better way to detect replication lag; this really just tells you whether multi-threaded replication could help improve replication speed at all.

Conflct

Replication conflicts, as described in my last post.  lcf is local certification failures, and bfa is brute force aborts.  This should help you see whether or not these conflicts are happening.

Interpreting the results

Let’s look at that output again and make some observations about our cluster and this node:

[root@node3 ~]# myq_status -t 1 wsrep
Wsrep    Cluster        Node           Queue   Ops     Bytes     Flow        Conflct
    time  name P cnf  #  name  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt dst lcf bfa
19:17:01 trime P   3  3 node3 Sync T/T   0   0  35  40  54K  61K 0.0   0  17   0   2
19:17:03 trime P   3  3 node3 Sync T/T   0   0  70  85 107K 124K 0.0   0  13   0   2
19:17:04 trime P   3  3 node3 Sync T/T   0   0  72  81 111K 121K 0.0   0  16   0   3
19:17:05 trime P   3  3 node3 Sync T/T   0   0  70  85 108K 124K 0.0   0  17   0   4
19:17:06 trime P   3  3 node3 Sync T/T   0   0  66  82 100K 124K 0.0   0  17   0   3
19:17:07 trime P   3  3 node3 Sync T/T   0   0  68  78 105K 117K 0.0   0  22   0   0
19:17:08 trime P   3  3 node3 Sync T/T   0   0  65  93 101K 135K 0.0   0  14   1   5
19:17:09 trime P   3  3 node3 Sync T/T   0   0  73  83 111K 125K 0.0   0  19   0   3
19:17:10 trime P   3  3 node3 Sync T/T   0   0  30  46  46K  66K 0.0   0  10   0   2
19:17:12 trime P   3  3 node3 Sync T/T   0   0  64  80  97K 120K 0.0   0  19   0   4
19:17:13 trime P   3  3 node3 Sync T/T   0   0  69  88 106K 131K 0.0   0  28   0   1
19:17:14 trime P   3  3 node3 Sync T/T   0   0  70  83 106K 121K 0.0   0  11   0   3
19:17:15 trime P   3  3 node3 Sync T/T   0   0  72  84 111K 126K 0.0   0  15   0   3

We can see we are connected to the node identified as ‘node3’.  Our cluster is Primary and there are 3 nodes total belonging to it.

There isn’t any replication queue activity, and I find this is common except during cluster stalls.  There are clearly a fair number of transactions being replicated to and from this node: approximately 100K worth of data outbound, and just a hair more than that coming in.

Our replication is performing well, because the Flow control columns are zeroes, but we do see some replication conflicts.  Mostly these are brute force aborts, but I was able to see the (very occasional) local certification failure.  This makes sense to me because the replication inbound queue always reports as empty, so it seems that replication is being applied nearly immediately.  Local certification failures only happen when the inbound queue is > 0.  Brute force aborts, by contrast, happen when applying replicated writesets rolls back locally open transactions.

In fact, this is a sysbench test that is running at full speed (these are VMs, so that’s not particularly fast) on two of my three nodes, and more slowly on the third.  I had to decrease my table size from 250k rows to 2.5k to start seeing replication conflicts regularly.

Hopefully this is helpful for you to get an idea of how to observe and interpret Galera status variables.

Nov
24
2012
--

What CPU Architecture do you prefer for running MySQL In production

Time for another poll! I wonder what CPU architecture you use for running MySQL in production. I guess most of the game here is between Intel and AMD x86-64 variants, though I wonder if there is still a lot of use for others in the wild. If I am missing some architecture which is still alive in your data center, please post a comment and I will add it to the list.
Please select all that apply. If you would like to share what works better for you in the comments, I would appreciate it.

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

Nov
23
2012
--

Full table scan vs full index scan performance

Earlier this week, Cédric blogged about how easily we can get confused between a covering index and a full index scan in the EXPLAIN output. While a covering index (seen with EXPLAIN as Extra: Using index) is a very interesting performance optimization, a full index scan (type: index) is, according to the documentation, the 2nd worst possible execution plan after a full table scan.
If it is obvious that a full table scan is not good for performance, how much can we expect to gain if we can switch to a full index scan? In other words, is a full table scan always the worst possible execution plan, and should it be avoided at all costs?

Let’s take the employees database, and slightly modify the employees table:

mysql> ALTER TABLE employees ADD INDEX idx_first (first_name),ENGINE=InnoDB;

And then let’s consider this query:

SELECT * FROM employees ORDER BY first_name;

This query can of course be executed by running a full table scan, but we could also take advantage of the idx_first index, which will translate to a full index scan.
Let’s see how the optimizer will execute it:

mysql> EXPLAIN SELECT * FROM employees ORDER BY first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 300363
        Extra: Using filesort

Surprising? The optimizer preferred a full table scan, and it did not even consider scanning the idx_first index as a relevant choice (possible_keys: NULL).

What do we get if we force the optimizer to use the index?

mysql> EXPLAIN SELECT * FROM employees FORCE INDEX(idx_first) ORDER BY first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: index
possible_keys: NULL
          key: idx_first
      key_len: 16
          ref: NULL
         rows: 300363
        Extra:

Honestly, it looks better: the number of rows is the same, but we get a full index scan instead of a full table scan and no filesort. But predicting how a query will perform by only looking at the execution plan is not enough; we must run both queries and compare execution times.

First case: the employees table does not fit in memory
With the full table scan, the query runs in about 4s.
With the full index scan, it runs in about 30s.

So the optimizer was right after all. But why? Simply because all access patterns are not equal. When we are doing a full table scan, we are doing sequential reads, which are quite fast even with slow disks. But when we are using the index, we first have to do a full index scan (fast sequential reads, no problem) and then lots of random reads to fetch rows by index value. And random reads are orders of magnitude slower than sequential reads.
The optimizer has this knowledge, so it is able to pick the right execution plan.

Second case: the employees table fits in memory
With the full table scan, the query runs in about 3.3s.
With the full index scan, the query runs in about 2.6s.

We can see here a limitation of the optimizer: it does not know what kind of media the data is stored on. If it is stored on spinning disks, the assumption that random reads are much slower than sequential reads is correct, but that is no longer the case if data is stored in memory. That’s why, when execution plans look similar, you should always execute the query to really see which execution plan should be chosen. Here, if we know that the table will always be in memory, we should add the FORCE INDEX hint to ensure optimal response time.
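To reproduce the two timings yourself, simply run both variants and compare; a minimal sketch (SQL_NO_CACHE is only there to keep the query cache from skewing repeated runs):

-- Plan chosen by the optimizer: full table scan + filesort
SELECT SQL_NO_CACHE * FROM employees ORDER BY first_name;

-- Forced plan: full index scan on idx_first, no filesort
SELECT SQL_NO_CACHE * FROM employees FORCE INDEX(idx_first) ORDER BY first_name;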

Now let’s modify the query by selecting only the first_name field instead of selecting all the fields:

mysql> explain SELECT first_name FROM employees ORDER BY first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: index
possible_keys: NULL
          key: idx_first
      key_len: 16
          ref: NULL
         rows: 300584
        Extra: Using index

The optimizer chooses the full index scan. It is a good choice because the index now covers the query, which means that reading the index is enough to get the results.

Conclusions:

  • For a non-covering index, the difference between a full table scan and an execution plan based on a full index scan is basically the difference between sequential reads and random reads: it can be close if you have fast storage or it can be very different if you have slow storage.
  • A full index scan can become interesting when the index is covering.
  • Don’t forget to measure response time when you are trying different execution plans. It is too easy to get focused on having a good-looking execution plan, but the end user only cares about response time!

Nov
22
2012
--

Get Me Some Query Logs!

One of my favorite tools in the Percona Toolkit is pt-query-digest.  This tool is indispensable for identifying your top SQL queries, and analyzing which queries are accounting for your database load.

But the report you get from pt-query-digest is only as good as the log of queries you give it as input.  You need a large enough sample of query logs, collected over a period of time when you have representative traffic on your database.

You also need the log to include all the queries, not just those that take more than N seconds.  The reason is that some queries are individually quick, and would not be logged if you set the long_query_time configuration variable to 1 or more seconds.  You want that threshold to be 0 seconds while you’re collecting logs.

However, activating such high-volume query log collection can be costly.  Every statement executed on your  database will cause file I/O, even when the query itself was served out of your buffer pool memory.  That’s a lot of overhead, so we need to be careful about how and when we collect logs, and for how long we leave that running.

I’ve put together a simple shell script to help automate this.  I have given it the functional but unimaginative name full-slow-log.

The script configures full logging, then sleeps for a number of seconds to allow queries to be collected in the logs.  After it finishes sleeping, or if you interrupt the script, the script restores log configuration back to the values they started with.

$ full-slow-log [ -v ] [ -s seconds ] [ -c config ]
  • -v is for verbose output.
  • -s seconds allows you to specify the number of seconds to sleep.  The default is 5 seconds, which is probably too short for most sites, but the value is chosen to be as low impact as possible if you forget to give another value.
  • -c config allows you to specify a MySQL config file other than $HOME/.my.cnf, so you can store host, user, and password.

Here’s an example of running it with verbose output:

$ full-slow-log -v
Discovering slow_query_log=1
Discovering slow_query_log_file=mysql-slow.log
Discovering long_query_time=60.000000
Setting long_query_time=0
Setting slow_query_log_file=mysql-slow.log-full-20121122112413
Setting slow_query_log=1
Flushing slow query log
Sleeping 5 seconds... done.
Restoring slow_query_log_file=mysql-slow.log
Restoring long_query_time=60.000000
Restoring slow_query_log=1
Flushing logs during restore

Notice that the script also redirects the slow query log to a new file, with a filename based on the timestamp.  This is so you have a distinct file that contains only the specific time range of logs you collected.

The restoration of settings is in a “trap” which is a shell scripting feature that serves as both an exit handler and signal handler.  So if you interrupt the script before it’s done, you have some assurance that it will do the right thing to restore settings anyway.
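For reference, the statements the script issues boil down to roughly the following sketch (the timestamped file name and the restored values will match whatever your server reports, as in the verbose output above):

-- Discover the current values first (SHOW GLOBAL VARIABLES), then:
SET GLOBAL long_query_time     = 0;
SET GLOBAL slow_query_log_file = 'mysql-slow.log-full-20121122112413';
SET GLOBAL slow_query_log      = 1;
FLUSH SLOW LOGS;   -- use FLUSH LOGS if your server does not support FLUSH SLOW LOGS

-- ... sleep while queries are collected ...

SET GLOBAL long_query_time     = 60;                -- restore the discovered value
SET GLOBAL slow_query_log_file = 'mysql-slow.log';  -- restore the original file
SET GLOBAL slow_query_log      = 1;
FLUSH SLOW LOGS;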

My full-slow-log script is now available on a Github project (along with a few other experimental scripts I have written).  See https://github.com/billkarwin/bk-tools

I hope you find this script a useful complement to my upcoming talks at the Percona Live MySQL Conference in London, UK on December 3-4, 2012:

If you can make it to London in December, we’d love to see you there!  If not, look for future Percona Live conferences.

Tweet the link to this blog for a chance to win a free full conference pass. Make sure to use hashtag #perconalive! Winner will be chosen at the end of the day. 

Nov
21
2012
--

Speaking at Percona Live London 2012: Percona XtraDB Cluster in a nutshell

The Beast: Percona XtraDB Cluster!

If you are interested in a great and revolutionary HA solution for MySQL, or even if you already have some experience with PXC and/or Galera, I invite you to attend the tutorial I will deliver at Percona Live MySQL Conference London 2012 (PLUK2012) on December 3rd: Percona XtraDB Cluster in a nutshell.

A small amount of theory and a lot of exercises and examples to really tame the beast that PXC is!

A Codership engineer will also be present and happy to help me answer your questions.

Note to attendees: please bring your laptop with a recent VirtualBox installed and enough free space to store 3 VMs (+/-2G per VM). As said before, this session will be almost 100% hands-on!

Last but not least, tweet the link to this blog for a chance to win a free full conference pass. Make sure to use hashtag #perconalive! Winner will be chosen at the end of the day.

Nov
20
2012
--

Understanding Multi-node writing conflict metrics in Percona XtraDB Cluster and Galera

I have previously addressed how multi-node writing causes unexpected deadlocks in PXC; at least, it is unexpected unless you know how Galera replication works.  This is a complicated topic and I personally feel like I’m only just starting to wrap my head around it.

The magic of Galera replication

The short of it is that Galera replication is not doing a full 2-phase commit to replicate (where all nodes commit synchronously with the client), but is instead something Codership calls “virtually synchronous” replication.  This could be either clever marketing or clever engineering, at least at face value.  However, I believe it really is clever engineering and that it is probably the best compromise for performance and data protection out there.

There’s likely a lot more depth we could cover in this definition, but fundamentally “virtually synchronous replication” means:

  • Writesets (or “transactions”) are replicated to all available nodes in the cluster on commit (and enqueued on each).
  • Enqueued writesets are applied on those nodes independently and asynchronously from the original commit on the source node.  And:
  • At this point the transaction can and should be considered permanent in the cluster.  But how can that be true if it has not yet been applied everywhere?  Because:
    • Galera can do conflict detection between different writesets, so enqueued (but not yet committed) writesets are protected from local conflicting commits until our replicated writeset is committed. AND:
    • When the writeset is actually applied on a given node, any locking conflicts it detects with open (not-yet-committed) transactions on that node cause that open transaction to get rolled back.
    • Writesets being applied by replication threads always win.

So why is this “virtually synchronous”?  Because simply getting our writesets to every node on commit means that they are guaranteed to apply — therefore we don’t have to force all nodes to commit simultaneously to guarantee cluster consistency as you would in a two-phase commit architecture.

Seeing when replication conflicts happen

This brings me to my topic for today, the mysterious SHOW GLOBAL STATUS variables called:

  • wsrep_local_cert_failures
  • wsrep_local_bf_aborts

I found that understanding these helped me understand Galera replication better.  If you are experiencing the “unexpected deadlocks” problem, then you are likely seeing one or both of these counters increase over time, but what do they mean?

Actually, they are two sides to the same coin (kind of).  Both apply to some local transaction getting aborted and rolled back, and the difference comes down to when and how that transaction conflict was detected.  It turns out there are two possible ways:

wsrep_local_cert_failures

The Galera documentation states that this is the:

Total number of local transactions that failed certification test.

What is a local certification test?  It’s quite simple: on COMMIT, Galera takes the writeset for this transaction and does conflict detection against all pending writesets in the local queue on this node.  If there is a conflict, the deadlock-on-COMMIT error happens (which shouldn’t happen with normal InnoDB), the transaction is rolled back, and this counter is incremented.

To put it another way, some other conflicting write from some other node was committed before we could commit, and so we must abort.

This local certification failure is only triggered by a Galera writeset comparison operation comparing a given to-be-committed writeset to all other writesets enqueued on the local node.  The local transaction always loses.
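To picture it, imagine two clients updating the same row at about the same time on two different nodes (the accounts table and the node names here are made up). The client that commits second sees the familiar deadlock error:

-- Session A, connected to node1:
BEGIN;
UPDATE accounts SET balance = 10 WHERE id = 1;

-- Session B, connected to node2, before A commits:
BEGIN;
UPDATE accounts SET balance = 20 WHERE id = 1;

-- A commits first; its writeset is replicated to node2 and wins:
COMMIT;

-- B now tries to commit and is rolled back:
COMMIT;
-- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Whether node2 counts this as a local certification failure (A’s writeset was still sitting in the local queue when B hit COMMIT) or as the brute force abort described next (A’s writeset was being applied while B’s transaction was still open) depends purely on timing; the client-visible error is the same either way.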

wsrep_local_bf_aborts

Again, the Galera documentation states that this is the:

Total number of local transactions that were aborted by slave transactions while in execution.

This kind of sounds like the same thing, but this is actually an abort from the opposite vector:  instead of a local transaction triggering the failure on commit, this is triggered by Galera replication threads applying replicated transactions.

To be clearer: a transaction was open on this node (not yet committed), and a conflicting writeset from some other node that was being applied caused a locking conflict.  Again, first committed (from some other node) wins, so our open transaction is again rolled back.  “bf” stands for brute force: any transaction can get aborted by Galera any time it is necessary.

Note that this conflict happens only when the replicated writeset (from some other node) is being applied, not when it’s just sitting in the queue.  If our local transaction got to its COMMIT and this conflicting writeset was in the queue, then it should fail the local certification test instead.

A brute force abort is only triggered by a locking conflict between a writeset being applied by a slave thread and an open transaction on the node, not by a Galera writeset comparison as in the local certification failure.
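Both counters are ordinary SHOW GLOBAL STATUS values, so you can watch them on each node; they only ever increase, so compare samples taken some time apart:

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_local_cert_failures', 'wsrep_local_bf_aborts');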

Testing it all out

So this is the part of the post where I wanted to show that these counters were being incremented using an example from my last post.  Those examples should trigger brute force aborts, but they didn’t seem to increment either of these counters on any of my testing nodes.   Codership agrees this seems like a bug and is investigating.  I’ll update this post if and when an actual bug is opened, but I have seen these counters being incremented in the wild, so any bug is likely some edge case.

By the way, I can’t think of how to reliably produce local certification errors without just a lot of fast modifications to a single row, because those depend on the replication queue being non-empty and I don’t know any way to pause the Galera queue for a controlled experiment.

 
