Aug 31, 2015

Apple And Cisco Ink Nebulous Enterprise Partnership

Apple playing nicely with enterprise companies is a sight for sore eyes. The notion that Microsoft has the enterprise on lockdown is dissipating. Huge enterprise player Cisco and Apple announced a “Fast Lane” for iOS enterprise users, which promises a more streamlined and optimized experience for those enterprise customers using Cisco networks and products. There aren’t a lot… Read More

Aug 31, 2015

High-load clusters and desynchronized nodes on Percona XtraDB Cluster

There can be a lot of confusion, and a lack of planning, around Percona XtraDB Cluster nodes becoming desynchronized for various reasons. This can happen in a few ways.

When I say “desynchronized” I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening. For example, a node taking a backup would set wsrep_desync=ON during the backup and potentially fall some amount behind on replication.

Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.

In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation.  For an example of this, see my last blog about IST.

If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.

Example setup

Let us take my typical 3-node cluster with a workload on node1. We are taking a blocking backup of some kind on node3, so we execute the following steps:

  1. node3> set global wsrep_desync=ON;
  2. Node3’s “backup” starts, beginning with FLUSH TABLES WITH READ LOCK;
  3. Galera applying is paused on node3 and the wsrep_local_recv_queue grows some amount
  4. Node3’s “backup” finishes, ending with UNLOCK TABLES;
  5. node3> set global wsrep_desync=OFF;
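
The steps above can be sketched as a shell session. This is a dry-run illustration (each command is printed, not executed), and run_backup is a placeholder for whatever backup tool you actually use:

```shell
# Dry-run sketch of the backup steps. "run" only prints each command so the
# sequence is visible; run_backup stands in for your real backup tool.
# Note: in practice FLUSH TABLES WITH READ LOCK must be held open in the
# same session as the backup -- a one-shot `mysql -e` would release the
# lock as soon as the client disconnects.
PLAN=""
run() { PLAN="$PLAN$* ; "; echo "+ $*"; }

run mysql -h node3 -e "SET GLOBAL wsrep_desync=ON"
run mysql -h node3 -e "FLUSH TABLES WITH READ LOCK"
run run_backup                 # Galera applying pauses; recv queue grows
run mysql -h node3 -e "UNLOCK TABLES"
run mysql -h node3 -e "SET GLOBAL wsrep_desync=OFF"
```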

During the backup

This covers steps 1 through 3 above. My node1 is unaffected by the backup on node3; I can see it averaging 5-6k writesets (transactions) per second, just as it did before we began:

[Graph: node1 write throughput, steady around 5-6k writesets per second]

node2 is also unaffected:

[Graph: node2 write throughput, unaffected]

but node3 is not applying and its queue is building up:

[Graph: node3 apply rate stalled; wsrep_local_recv_queue growing]

Unlock tables, still wsrep_desync=ON

Let’s examine briefly what happens when node3 is permitted to start applying, but wsrep_desync stays enabled:

[Graph: node1 write throughput after UNLOCK TABLES, essentially unchanged]

node1’s performance is pretty much the same; node3 is not using flow control yet. However, there is a problem:

[Graph: node3 replication queue still growing, not draining]

It’s hard to notice, but node3 is NOT catching up; instead it is falling further behind!  We have potentially created a situation where node3 may never catch up.

The PXC nodes were close enough to the red line of performance that node3 can apply writesets only about as fast as new transactions arrive on node1 (and somewhat slower until it warms up a bit).

This represents a serious concern in PXC capacity planning:

Nodes do not only need to be fast enough to handle normal workload, but also to catch up after maintenance operations or failures cause them to fall behind.

Experienced MySQL DBAs will realize this isn’t all that different from Master/Slave replication.
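
A back-of-envelope sketch of that catch-up math, with purely illustrative numbers (not measurements from this test): a backlog drains only at the difference between a node's apply rate and the incoming write rate.

```shell
# Catch-up time explodes as the apply rate and incoming rate converge.
# Numbers below are illustrative, not from the test in this post.
backlog=1000000        # writesets queued while the node was desynced
apply_rate=5500        # node's maximum apply rate, writesets/s
incoming_rate=5000     # new writesets/s still arriving from the cluster
secs=$(( backlog / (apply_rate - incoming_rate) ))
echo "catch-up time: ${secs}s"
```

If incoming_rate equals apply_rate, the denominator is zero and the node never catches up, which is exactly the situation described above.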

Flow control as a way to recover

So here’s the trick: if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3’s apply rate.  This gives node3 a good chance of catching up, but the tradeoff is reduced write throughput for the whole cluster.  Let’s see what this looks like in the context of all of our steps.  wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20 PM:

[Graph: cluster write throughput dropping as node3 begins sending flow control]

[Graph: node3 replication queue peaking around 12:20 PM, then slowly draining]

So at the moment node3 starts using flow control to keep from falling further behind, our write throughput (in this specific environment and workload) drops by approximately one third (YMMV).  The cluster will remain in this state until node3 catches up and returns to the ‘Synced’ state.  This catch-up is still happening as I write this post, almost 4 hours after it started, and will likely take another hour or two to complete.

I can see a more real-time representation of this by using myq_status on node1, summarizing every minute:

[root@node1 ~]# myq_status -i 1m wsrep
mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
19:58:47 P   5  3 Sync 0.9ms 3128 2.0M   0   27 213b   0 25.4s   0   0   0 3003k  16k  62%
19:59:47 P   5  3 Sync 1.1ms 3200 2.1M   0   31 248b   0 18.8s   0   0   0 3003k  16k  62%
20:00:47 P   5  3 Sync 0.9ms 3378 2.2M  32   27 217b   0 26.0s   0   0   0 3003k  16k  62%
20:01:47 P   5  3 Sync 0.9ms 3662 2.4M  32   33 266b   0 18.9s   0   0   0 3003k  16k  62%
20:02:47 P   5  3 Sync 0.9ms 3340 2.2M  32   27 215b   0 27.2s   0   0   0 3003k  16k  62%
20:03:47 P   5  3 Sync 0.9ms 3193 2.1M   0   27 215b   0 25.6s   0   0   0 3003k  16k  62%
20:04:47 P   5  3 Sync 0.9ms 3009 1.9M  12   28 224b   0 22.8s   0   0   0 3003k  16k  62%
20:05:47 P   5  3 Sync 0.9ms 3437 2.2M   0   27 218b   0 23.9s   0   0   0 3003k  16k  62%
20:06:47 P   5  3 Sync 0.9ms 3319 2.1M   7   28 220b   0 24.2s   0   0   0 3003k  16k  62%
20:07:47 P   5  3 Sync 1.0ms 3388 2.2M  16   31 251b   0 22.6s   0   0   0 3003k  16k  62%
20:08:47 P   5  3 Sync 1.1ms 3695 2.4M  19   39 312b   0 13.9s   0   0   0 3003k  16k  62%
20:09:47 P   5  3 Sync 0.9ms 3293 2.1M   0   26 211b   0 26.2s   0   0   0 3003k  16k  62%

This reports around 20-25 seconds of flow control every minute, which is consistent with the ~1/3 performance reduction we see in the graphs above.
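
A quick cross-check of that arithmetic:

```shell
# myq_status shows roughly 20-25s of flow-control pause per 60-second
# interval; that fraction should line up with the ~1/3 drop in write
# throughput seen on the graphs.
pause_per_min=22.5     # midpoint of the observed 20-25s range
frac=$(awk -v p="$pause_per_min" 'BEGIN { printf "%.2f", p / 60 }')
echo "fraction of each minute paused: $frac"
```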

Watching node3 the same way proves it is sending the flow control (FlowC snt):

mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
17:38:09 P   5  3 Dono 0.8ms    0   0b   0 4434 2.8M 16m 25.2s  31   0   0 18634  16k  80%
17:39:09 P   5  3 Dono 1.3ms    0   0b   1 5040 3.2M 16m 22.1s  29   0   0 37497  16k  80%
17:40:09 P   5  3 Dono 1.4ms    0   0b   0 4506 2.9M 16m 21.0s  31   0   0 16674  16k  80%
17:41:09 P   5  3 Dono 0.9ms    0   0b   0 5274 3.4M 16m 16.4s  27   0   0 22134  16k  80%
17:42:09 P   5  3 Dono 0.9ms    0   0b   0 4826 3.1M 16m 19.8s  26   0   0 16386  16k  80%
17:43:09 P   5  3 Jned 0.9ms    0   0b   0 4957 3.2M 16m 18.7s  28   0   0 83677  16k  80%
17:44:09 P   5  3 Jned 0.9ms    0   0b   0 3693 2.4M 16m 27.2s  30   0   0  131k  16k  80%
17:45:09 P   5  3 Jned 0.9ms    0   0b   0 4151 2.7M 16m 26.3s  34   0   0  185k  16k  80%
17:46:09 P   5  3 Jned 1.5ms    0   0b   0 4420 2.8M 16m 25.0s  30   0   0  245k  16k  80%
17:47:09 P   5  3 Jned 1.3ms    0   0b   1 4806 3.1M 16m 21.0s  27   0   0  310k  16k  80%

There are a lot of flow control messages (around 30) per minute.  These are rapid ON/OFF toggles of flow control, where writes are briefly delayed, rather than a steady “you can’t write” for 20 seconds straight.

Interestingly, node3 also spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (which has the same meaning here as during an IST).

Does it matter?

As always, it depends.

If these are web requests and suddenly the database can only handle ~66% of the traffic, that’s likely a problem, but maybe it just slows down the website somewhat.  I want to emphasize that WRITES are what is affected here.  Reads on any and all nodes should be normal (though you probably don’t want to read from node3 since it is so far behind).

If this were some queue processing with reduced throughput, I’d expect it to catch up later.

This can only be answered for your application, but the takeaways for me are:

  • Don’t underestimate your capacity requirements
  • Being at the redline normally means you are well past the redline for abnormal events.
  • Plan for maintenance and failure recoveries
  • Where possible, build queuing into your workflows so diminished throughput in your architecture doesn’t generate failures.

Happy clustering!

Graphs in this post courtesy of VividCortex.

The post High-load clusters and desynchronized nodes on Percona XtraDB Cluster appeared first on Percona Data Performance Blog.

Aug 29, 2015

The SaaS Success Database

What does it take to build a billion-dollar SaaS enterprise-software company? We gave a 30,000-foot answer to this complex — and fascinating — question in a recent TechCrunch post, The SaaS Adventure. To recap: We’ve observed seven key phases in most SaaS companies’ go-to-market success. We dubbed this journey the “SaaS Adventure,” which is broadly how we… Read More

Aug 28, 2015

The Math Behind SaaS Startup Customer Lifetime Value

One of the most critical metrics for software companies — but also one of the most difficult to measure — is the lifetime value of their customers (LTV). The lifetime value dictates how a company should spend its marketing and sales dollars. Unfortunately, many early stage startups struggle to measure LTV, because they haven’t been around very long and, consequently… Read More

Aug 28, 2015

Percona Toolkit 2.2.15 is now available

Percona is pleased to announce the availability of Percona Toolkit 2.2.15, released August 28, 2015. Percona Toolkit is a collection of advanced command-line tools to perform a variety of MySQL server and system tasks that are too difficult or complex for DBAs to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release is the current GA (Generally Available) stable release in the 2.2 series. It includes multiple bug fixes as well as continued preparation for MySQL 5.7 compatibility. Full details are below. Downloads are available here and from the Percona Software Repositories.

New Features:

  • Added the --max-flow-ctl option to pt-online-schema-change and pt-archiver, with a value set in percent. When a Percona XtraDB Cluster node is heavily loaded, it sends flow control signals to the other nodes to stop sending transactions so it can catch up. When the average time spent in this state (in percent) exceeds the maximum provided in the option, the tool pauses until it falls below the threshold again. The default is no flow control checking.
  • Added the --sleep option for pt-online-schema-change to avoid performance problems. The option accepts float values in seconds.
  • Implemented ability to specify --check-slave-lag multiple times for pt-archiver. The following example enables lag checks for two slaves:
    pt-archiver --no-delete --where '1=1' --source h=oltp_server,D=test,t=tbl --dest h=olap_server --check-slave-lag h=slave1 --check-slave-lag h=slave2 --limit 1000 --commit-each
  • Added the --rds option to pt-kill, which makes the tool use Amazon RDS procedure calls instead of the standard MySQL kill command.
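
As a rough illustration of the kind of check --max-flow-ctl implies (this is not the tool's actual code, and the counter values below are invented), the flow control percentage can be derived from two samples of the wsrep_flow_control_paused_ns status counter:

```shell
# Sketch: derive the share of time a node spent paused by flow control
# from two samples of wsrep_flow_control_paused_ns, and compare it to the
# percentage threshold. Sample values are made up for illustration; the
# tool's real implementation may differ.
paused_t0=1000000000        # wsrep_flow_control_paused_ns, first sample
paused_t1=4000000000        # same counter sampled 10 seconds later
interval_ns=10000000000     # 10s between samples, in nanoseconds
threshold=25                # as passed via --max-flow-ctl (percent)

pct=$(( (paused_t1 - paused_t0) * 100 / interval_ns ))
echo "flow control: ${pct}%"
if [ "$pct" -gt "$threshold" ]; then
  echo "tool would pause until flow control drops below ${threshold}%"
fi
```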

Bugs Fixed:

  • Fixed bug 1042727: pt-table-checksum doesn’t reconnect the slave $dbh
    Before, the tool would die if any slave connection was lost. Now the tool waits forever for slaves.
  • Fixed bug 1056507: pt-archiver --check-slave-lag aggressiveness
    The tool now checks replication lag every 100 rows instead of every row, which significantly improves efficiency.
  • Fixed bug 1215587: Adding underscores to constraints when using pt-online-schema-change can create issues with constraint name length
    Before, multiple schema changes led to underscores stacking up on the name of the constraint until it reached the 64-character limit. Now there is a limit of two underscores in the prefix, then the tool alternately removes or adds one underscore, attempting to make the name unique.
  • Fixed bug 1277049: pt-online-schema-change can’t connect with comma in password
    For all tools, documented that commas in passwords provided on the command line must be escaped.
  • Fixed bug 1441928: Unlimited chunk size when using pt-online-schema-change with --chunk-size-limit=0 inhibits checksumming of single-nibble tables
    When comparing table size with the slave table, the tool now ignores --chunk-size-limit if it is set to zero to avoid multiplying by zero.
  • Fixed bug 1443763: Update documentation and/or implementation of pt-archiver --check-interval
    Fixed the documentation for --check-interval to reflect its correct behavior.
  • Fixed bug 1449226: pt-archiver dies with “MySQL server has gone away” when --innodb_kill_idle_transaction is set to a low value and --check-slave-lag is enabled
    The tool now sends a dummy SQL query to avoid timing out.
  • Fixed bug 1446928: pt-online-schema-change not reporting meaningful errors
    The tool now produces meaningful errors based on text from MySQL errors.
  • Fixed bug 1450499: ReadKeyMini causes pt-online-schema-change session to lock under some circumstances
    Removed ReadKeyMini, because it is no longer necessary.
  • Fixed bug 1452914: --purge and --no-delete are mutually exclusive, but still allowed to be specified together by pt-archiver
    The tool now issues an error when --purge and --no-delete are specified together.
  • Fixed bug 1455486: pt-mysql-summary is missing the --ask-pass option
    Added the --ask-pass option to the tool.
  • Fixed bug 1457573: pt-mysql-summary fails to download pt-diskstats pt-pmp pt-mext pt-align
    Added the -L option to curl and changed download address to use HTTPS.
  • Fixed bug 1462904: pt-duplicate-key-checker doesn’t support triple quote in column name
    Updated TableParser module to handle literal backticks.
  • Fixed bug 1488600: pt-stalk doesn’t check TokuDB status
    Implemented status collection similar to how it is performed for InnoDB.
  • Fixed bug 1488611: various testing bugs related to newer Perl versions

Details of the release can be found in the release notes and the 2.2.15 milestone on Launchpad. Bugs can be reported on the Percona Toolkit launchpad bug tracker.

The post Percona Toolkit 2.2.15 is now available appeared first on Percona Data Performance Blog.

Aug 27, 2015

Say Hello To Windows 10 Build 10532

If you are part of the Windows Insider program, Microsoft’s tool to get new code into the hands of its hardcore users ahead of the general public, you get a treat today: Windows 10 build 10532. Yes, another Windows 10 build. So, what’s new? Core to the new build are improved menus, which should provide a bit more harmonious fit. Also up is the ability to share notes from the… Read More

Aug 27, 2015

Narvar, A Service That Improves Online Post-Purchase Experiences, Raises $10M

Narvar, a startup that enables companies to better engage with customers after online purchases, said it has raised $10 million. Narvar provides companies with software that improves the post-purchase experience. That can include a better interface when it comes to shipping, more detailed text updates, and then of course options to return products and buy new ones. Those updates can even also… Read More

Aug 26, 2015

Google Hopes Open Source Will Give Its Cloud A Path To The Enterprise

“Google is not an enterprise company and we are trying to become cognizant of what the enterprise needs,” Craig McLuckie, Google’s product manager in charge of its Kubernetes and Google Container Engine projects, acknowledged during a panel discussion at the OpenStack Foundation’s annual Silicon Valley event today. Read More

Aug 26, 2015

Talking Drupal #102 – Headless Drupal – Twit.tv

Topics

  • Overview of Twit.tv 
  • Goals of the new Twit.tv
  • The API
  • Overall strategy
  • Details of implementation
  • Challenges
  • Building Blocks
  • Drupal 7 RESTful

Resources

  • Twit – http://www.twit.tv 
  • Four Kitchens Launch Announcement – http://fourword.fourkitchens.com/article/twittv-launches-content-api-and-headless-drupal-site
  • Twit.tv API Documentation – http://docs.twittv.apiary.io/
  • Introducing Saucier – https://fourword.fourkitchens.com/article/introducing-saucier
  • TWiT 3scale API Registration: https://twit-tv.3scale.net/
  • Rest Easy Tutorial Series: http://fourword.fourkitchens.com/article/series/rest-easy 
  • Decoupled benefits: https://www.youtube.com/watch?v=6eJj5UrUUpU 
  • API Design The Musical: https://www.youtube.com/watch?v=2yAMl8D0IFM 
  • Twit.tv Case Study: http://fourkitchens.com/our-work/twit-tv/ 
  • RESTful Module: https://github.com/RESTful-Drupal/restful
  • David Diers on iTunes – https://itunes.apple.com/us/artist/david-diers/id852703419
  • https://www.youtube.com/channel/UC-ccFbSsEQo8tbSDSKoluYg?sub_confirmation=1

Module of the Week

RESTful – www.drupal.org/project/restful

Hosts

  • Stephen Cross – www.ParallaxInfoTech.com @stephencross
  • John Picozzi – www.oomphinc.com @johnpicozzi
  • Matt Grill – www.Fourkitchens.com @alwaysworking
  • David Diers – www.fourkitchens.com www.daviddiers.com  @beautyhammer