MySQL and geospatial programming: An introduction to GIS

Geographic information systems (GIS) are used by application developers to incorporate geographic information into their datasets, powering features such as finding the address of the best steakhouse in town or the locations of local points of historical interest (the list is endless). In a nutshell, GIS captures, stores, checks and displays data related to positions on the Earth’s surface.

Next Wednesday I’m going to provide an introduction to GIS functionality in MySQL, along with an overview of recent changes. I invite you to register now for this free webinar titled “MySQL and Geospatial Programming.” It starts at 10 a.m. Pacific time on March 18.

In this talk I’ll discuss:

  • The various tasks that deal with geocoding
  • How MySQL can solve problems such as points within a radius (e.g., “Find the 10 closest coffee shops”), among others
  • Highlights from some of the new features bundled in the upcoming 5.7 release, and what benefits they can bring to your applications
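As a taste of the kind of problem these functions solve, here is a minimal Python sketch of the “points within a radius” task using the haversine great-circle formula over a plain list of points. This is my own illustration (in MySQL you would use spatial functions over an indexed table instead), and all names and coordinates below are made up:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def shops_within_radius(shops, lat, lon, radius_km, limit=10):
    """Return up to `limit` (distance_km, name) pairs within radius, closest first."""
    hits = [(haversine_km(lat, lon, s_lat, s_lon), name)
            for name, s_lat, s_lon in shops]
    return sorted(h for h in hits if h[0] <= radius_km)[:limit]
```

A real deployment would push this filtering into the database so an index can prune candidates, but the math is the same.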

I’m happy to field questions in advance in the comments section below. This webinar, like all Percona webinars, will be recorded. In addition to the video, my slides will also be available for download.

Register now and I’ll talk to you next week!

The post MySQL and geospatial programming: An introduction to GIS appeared first on MySQL Performance Blog.


Managing big data? Say ‘hello’ to HP Vertica

Over the past few months, I’ve seen an increase in the following use case while working on performance and schema review engagements:

I need to store exponentially increasing amounts of data and analyze all of it in real-time.

This is also known simply as: “We have big data.” Typically, this data is used for user interaction analysis, ad tracking, or other common clickstream applications. However, it can also be seen in threat assessment (DDoS mitigation, etc.), financial forecasting, and other applications. While MySQL (and other OLTP systems) can handle this to a degree, it is by no means their forte. Some of the pain points include:

  • Cost of rapidly increasing, expensive disk storage (OLTP disks need to be fast == $$)
  • Performance decrease as the data size increases
  • Wasted hardware resources (excess I/O, etc)
  • Impact on other time-sensitive transactions (i.e. OLTP workload)

While there are many approaches to this problem – and oftentimes the solution is actually a hybrid of many individually tailored components – a solution that I have seen more frequently in recent work is HP Vertica.

At a 30,000-foot view, Vertica is built around the following principles:

  • Columnar data store
  • Highly compressed data
  • Clustered solution for both availability and scalability
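To make the first two principles concrete, here is a toy Python sketch (my own illustration, not Vertica’s actual storage format) of why a columnar layout compresses so well: values from one column sit together, so sorted, repetitive columns collapse under run-length encoding:

```python
def to_columns(rows, col_names):
    """Pivot row-oriented tuples into a dict of column -> list of values."""
    return {name: [row[i] for row in rows] for i, name in enumerate(col_names)}

def rle_encode(values):
    """Run-length encode one column: [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

# Clickstream-style rows: the sorted "day" column compresses from 4 entries to 2 runs
rows = [("2013-05-01", "click"), ("2013-05-01", "click"),
        ("2013-05-01", "view"), ("2013-05-02", "view")]
cols = to_columns(rows, ["day", "event"])
print(rle_encode(cols["day"]))  # [('2013-05-01', 3), ('2013-05-02', 1)]
```

An aggregate like `COUNT(*) GROUP BY day` can then scan just that one compressed column instead of every row.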

Over the next few weeks, I’ll discuss several aspects of Vertica including:

  • Underlying architecture and concepts
  • Basic installation and use
  • Different data loading techniques
  • Various maintenance/operational procedures
  • Some comparisons vs. traditional OLTP (MySQL) performance
  • Some potential use-cases
  • Integration with other tools (such as Hadoop)

While Vertica is by no means the silver bullet that will solve all of your needs, it may prove to be a very valuable tool in your overall approach to managing big data.



Integrating pt-online-schema-change with a Scripted Deployment

Recently, I helped a client that was having issues with deployments causing locking in their production databases.  At a high level, the two key components used in the environment were:

  • Capistrano (scripted deployments) [website]
  • Liquibase (database version control) [website]

They used a CLI call to Liquibase as a sub-task within a larger deployment task.  The goal of this engagement was to modify that sub-task to run Liquibase in a non-blocking fashion, as opposed to the default behavior of running native ALTERs against the database.

As I wasn’t very familiar with Liquibase, I took this opportunity to learn more about it and it seems like a very valuable tool.  Essentially, it does the following:

  • Adds two version control tables to your schema
  • Reads an XML “changelog” file that has the schema changes
  • Verifies which changes have yet to be applied
  • Applies the changes in serial to the tables
  • Records each change in the version control table as it happens
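That bookkeeping can be sketched roughly in Python (a simplified model I’m assuming for illustration: a set of applied changeset ids stands in for the version control table, and `execute` stands in for running the SQL):

```python
def pending_changes(changelog, applied_ids):
    """Return changesets not yet recorded as applied, in changelog-file order."""
    return [c for c in changelog if c["id"] not in applied_ids]

def deploy(changelog, applied_ids, execute):
    """Apply pending changes serially, recording each one as it succeeds."""
    for change in pending_changes(changelog, applied_ids):
        execute(change["sql"])          # run the schema change
        applied_ids.add(change["id"])   # i.e. insert into the version control table
```

The serial, one-ALTER-at-a-time nature of that loop is exactly what becomes painful on large tables, as described next.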

While this is a very powerful tool, as your dataset grows it can be problematic.  As each change is run as an independent ALTER statement, consider a large (several-million-row) table that you are updating with multiple indexes – this can result in several lengthy blocking operations that impact your application in other ways.

My first thought when hearing that table alters were locking up the production application was naturally to use Percona Toolkit’s pt-online-schema-change – but the challenge lay in the integration of the tools.  After some investigation and discussion, I found two approaches that seemed feasible to get pt-osc integrated:

  1. Write a custom plugin for Liquibase
  2. Utilize the updateSQL action to fetch the raw SQL and parse/process each statement

Due to time constraints, we landed on option 2.  In pseudocode, here is the approach that we took:

# Output is the SQL string of all commands that Liquibase would run
sql = liquibase updateSQL
foreach sql as line
  switch (line)
    case blank/comment:
      skip
    case ALTER TABLE:
      line = convertCreateToAlter(line)  # rewrite the ALTER as a pt-osc call
      execute(line)
    default:
      execute(line)  # run all other statements natively
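In Python (the actual proof of concept was Ruby), the pseudocode above might look like the sketch below. The regex, the host/db defaults, and the helper names are all illustrative, and a real pt-osc invocation needs more options:

```python
import re

def convert_alter_to_ptosc(stmt, host="localhost", db="mydb"):
    """Turn 'ALTER TABLE tbl <changes>' into a pt-online-schema-change call."""
    m = re.match(r"ALTER\s+TABLE\s+`?(\w+)`?\s+(.+?);?\s*$", stmt, re.I | re.S)
    if not m:
        return None
    table, changes = m.groups()
    return ('pt-online-schema-change --alter "%s" --execute h=%s,D=%s,t=%s'
            % (changes, host, db, table))

def process_update_sql(update_sql):
    """Route each statement: skip comments, rewrite ALTERs, pass the rest through."""
    commands = []
    for line in update_sql.splitlines():
        line = line.strip()
        if not line or line.startswith("--"):
            continue  # blank/comment
        ptosc = convert_alter_to_ptosc(line)
        commands.append(ptosc if ptosc else line)
    return commands
```

The non-ALTER statements (including Liquibase’s own version control inserts) still run natively, which preserves the bookkeeping.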

This turned out to be exactly what the client needed, as it allowed the code deployment to still utilize the version control methods that were understood, but in a non-blocking manner.  The main caveat with this approach is that the application code needs to be backward- and forward-compatible with the changes.  Because you may now have a long-running background process, depending on when your code is actually deployed in the complete process, it may need to handle the database in different states.

Although this is a frequent scenario with deployments anyway, I just wanted to note that oftentimes when changing schema, the application must be able to handle the database in different states.

Here is the link to the (sanitized) proof-of-concept code and some samples I used for testing:  Please note that I am by no means a Ruby developer, so I assume there are plenty of coding optimizations that can be made – this was more of an exercise in combining the tools than a Ruby showcase.



pt-online-schema-change and binlog_format

Statement-based, row-based, or mixed?  We’ve all seen this discussed at length, so I’m not trying to rehash tired arguments.  At a high level, the difference is simple:

  1. Statement based replication (SBR) replicates the SQL statements to the slave to be replayed
  2. Row based replication (RBR) replicates the actual rows changed to the slave to be replayed
  3. Mixed mode uses RBR in the event of a non-deterministic statement, otherwise uses SBR

Recently, I worked with a client to optimize their use of pt-online-schema-change and keep replication delay to a minimum.  We found that using RBR in conjunction with a smaller chunk-time gave the best results in their environment due to reduced I/O on the slave, but I wanted to recreate the test locally to see how it looked in the generic sense (using sysbench for data and load).

Here was my local setup:

  1. 3 VMs (CentOS 6.4, 512M RAM, Macbook Pro host)
  2. 1 VM was for sysbench and pt-osc
  3. 1 master, 1 slave (Percona Server 5.5.30, 256M buffer pool)

And here is the base test that I ran:

  1. Populate the db with four 1-million-row tables
  2. Restart MySQL to ensure an empty buffer pool (no LRU dump/restore)
  3. Run pt-osc against the master and capture diagnostics on slave
  4. Repeat with both SBR and RBR

Visually, the most telling difference between the two baselines comes from comparing the slave IOPs when using SBR vs RBR:

Statement Based – IOPs (iostat -mx 1)

Row Based – IOPs (iostat -mx 1)


While the write operations look similar in both, you can see the dramatic difference in the read operations: they are almost negligible when using row-based. I had assumed there would be high read IOPs since the buffer pool was empty, so I also verified Innodb_buffer_pool_reads (reads that missed the buffer pool and went to disk) on both:

SBR – Buffer Pool Reads (from disk)

RBR – Buffer Pool Reads (from disk)


Looking at how pt-osc and the buffer pool operate, these results make sense for this workload.  Here is the basic process for pt-osc:

  1. Create a new shadow table based on the original table
  2. Apply the alters to the new shadow table
  3. Add triggers to the original table to track changes
  4. Run INSERT … SELECT CHUNK against the original table to populate the shadow table
  5. Rename the tables after all records are copied
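The five steps above can be sketched as the statement sequence below. This is a simplified Python illustration of the flow, not pt-osc’s real output: the actual tool sizes chunks dynamically, adds throttling and foreign-key handling, and the trigger bodies are elided here:

```python
def ptosc_plan(table, alter, pk="id", max_pk=5000, chunk_size=1000):
    """Simplified SQL plan for an online schema change on `table`."""
    new = "_%s_new" % table
    plan = [
        "CREATE TABLE %s LIKE %s" % (new, table),   # 1. shadow table
        "ALTER TABLE %s %s" % (new, alter),         # 2. apply the alter
    ]
    # 3. Triggers mirror ongoing writes into the shadow table (bodies elided)
    for op in ("INSERT", "UPDATE", "DELETE"):
        plan.append("CREATE TRIGGER pt_osc_%s AFTER %s ON %s FOR EACH ROW ..."
                    % (op.lower(), op, table))
    # 4. Copy existing rows in primary-key chunks
    for lo in range(1, max_pk + 1, chunk_size):
        plan.append("INSERT LOW_PRIORITY IGNORE INTO %s SELECT * FROM %s "
                    "WHERE %s BETWEEN %d AND %d"
                    % (new, table, pk, lo, lo + chunk_size - 1))
    # 5. Atomically swap the tables
    plan.append("RENAME TABLE %s TO _%s_old, %s TO %s" % (table, table, new, table))
    return plan
```

Step 4 is the part whose replication behavior differs so much between SBR and RBR, as discussed below.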

In the case of SBR, the actual INSERT … SELECT chunk is replayed on the slave verbatim, meaning that the chunk will need to be read from disk if not in the buffer pool in order to insert it into the new table.  However, when using RBR, you simply send the rows to the slave.  As this is a new table, there is nothing to read from disk, so InnoDB simply writes the new page and it eventually gets flushed to disk.

This is where the --set-vars switch can come in handy with pt-online-schema-change.  Picture this scenario:

  1. The buffer pool is undersized (due to lack of memory compared to data size)
  2. You are altering a table that is mostly archive data (i.e. not hot data)
  3. The slave is nearly IO bound already as it is actively serving read traffic

In this case, adding extra read IOPs to the slave could be a performance killer.  So assuming binlog_format=mixed on the master (so the session can be switched to row), you can use --set-vars like this to run the alter using RBR and save on IOPs:

pt-online-schema-change --alter="ENGINE=InnoDB" --set-vars="binlog_format=row" --execute h=master,D=db,t=tbl

Keep in mind, RBR isn’t going to give you the same results in the course of normal replication.  When replicating changes to the slave, the page is requested from the buffer pool and read from disk if not present so that the changes can be applied to it.  This can still be a win (think of a very complicated query that returns only a few rows), but you won’t see the dramatic difference that you do when using this approach with pt-osc.



MySQL 5.6 – InnoDB Memcached Plugin as a caching layer

A common practice to offload traffic from MySQL 5.6 is to use a caching layer to store expensive result sets or objects.  Some typical use cases include:

  • Complicated query result set (search results, recent users, recent posts, etc)
  • Full page output (relatively static pages)
  • Full objects (user or cart object built from several queries)
  • Infrequently changing data (configurations, etc)

In pseudo-code, here is the basic approach:

data = fetchCache(key)
if (data) {
  return data
}
data = callExpensiveFunction(params)
storeCache(data, key)
return data
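The same cache-aside pattern as runnable Python, using a plain dict in place of a real memcached client (illustrative only; names are mine):

```python
def cached(cache, key, expensive_fn, *params):
    """Return the cached value for key, computing and storing it on a miss."""
    data = cache.get(key)          # fetchCache(key)
    if data is not None:
        return data
    data = expensive_fn(*params)   # callExpensiveFunction(params)
    cache[key] = data              # storeCache(data, key)
    return data
```

Note the idiom only ever pays the expensive cost on a miss; everything that follows is about what happens when many misses land at once.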

Memcached is a very popular (and proven) option used in production as a caching layer.  While very fast, one major potential shortcoming of memcached is that it is not persistent.  While a common design consideration when using a cache layer is that “data in cache may go away at any point”, this can result in painful warmup time and/or costly cache stampedes.

Cache stampedes can be mitigated through application approaches (semaphores, pre-expiring and re-populating, etc.), but those approaches are geared more toward single-key expiration or eviction.  They can’t help overall warmup time when the entire cache is cleared (think restarting a memcached node).  This is where a persistent cache can be invaluable.
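As a sketch of the semaphore idea for a single key (my own illustration, assuming an in-process dict cache): the first thread to miss takes a per-key lock and recomputes, while later callers re-check the cache under the lock instead of stampeding the backend:

```python
import threading

class StampedeGuard:
    def __init__(self):
        self.cache = {}
        self.locks = {}
        self.meta_lock = threading.Lock()

    def _lock_for(self, key):
        # Serialize creation of per-key locks
        with self.meta_lock:
            return self.locks.setdefault(key, threading.Lock())

    def get(self, key, compute):
        val = self.cache.get(key)
        if val is not None:
            return val
        with self._lock_for(key):        # semaphore around the recompute
            val = self.cache.get(key)    # re-check: another thread may have won
            if val is None:
                val = compute()
                self.cache[key] = val
            return val
```

As the post argues, this protects one hot key but does nothing for a cold cache where every key misses at once.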

Enter MySQL 5.6 with the memcached plugin

As part of the standard MySQL 5.6 GA distribution, there is a memcached plugin included in the base plugin directory (/usr/lib64/mysql/plugin/) that can be stopped and started at runtime.  In a nutshell, here is how one would start the memcached plugin:

mysql> install plugin daemon_memcached soname '';

In an effort to not re-invent the wheel, here is a link to the full documentation for setting up the plugin:

As a quick benchmark, I ran batches of fetch and store operations against both a standard memcached instance and a minimally tuned MySQL 5.6 instance running the memcached plugin.  Here are some details about the test:

  • Minimal hardware (vBox instances on MacBook Pro)
    • CentOS 6.4
    • Single core VM
    • 528M RAM
    • Host-Only network
    • 1 box with http/php, 1 box with memcached or mysql started
  • PHP script
    • Zend framework
    • libmemcached PECL module
    • Zend_Cache_Backend_Libmemcached

Here is the rough code for this benchmark:

// Identical config/code for memcached vs InnoDB
$frontendOpts = array(
  'caching' => true,
  'lifetime' => 3600,
  'automatic_serialization' => true
);

$memcacheOpts = array(
  'servers' => array(array(
      'host'   => '',
      'port'   => 11211,
      'weight' => 1,
  )),
  'client' => array(
    'compression' => true,
  ),
);

$cache = Zend_Cache::factory('Core', 'Libmemcached', $frontendOpts, $memcacheOpts);

// Store 100,000 objects ($timer is a simple stopwatch helper from the harness)
$timer->start();
for ($i = 0; $i < 100000; $i++) {
  $cache->save(new Object(), "key_$i");
}
$totalTimeStore = $timer->stop();
$avgTimeStore = $totalTimeStore / 100000;

// Fetch each of the 100,000 keys 10 times (1,000,000 fetches total)
$timer->start();
for ($n = 0; $n < 10; $n++) {
  for ($i = 0; $i < 100000; $i++) {
    $obj = $cache->load("key_$i");
  }
}
$totalTimeFetch = $timer->stop();
$avgTimeFetch = $totalTimeFetch / 1000000;

While this benchmark doesn’t exercise multi-threading or other advanced operations, it uses identical code to eliminate variation from the client libraries.  The only change between runs is on the remote server (stop/start memcached, stop/start the plugin).

As expected, there is a slowdown for write operations when using the InnoDB version.  But there is also a slight increase in the average fetch time.  Here are the raw results from this test run (100,000 store operations, 1,000,000 fetch operations):

Standard Memcache:

Storing [100,000] items:

60486 ms total
0.60486 ms per/cmd
0.586 ms min per/cmd
0.805 ms max per/cmd
0.219 ms range per/cmd

Fetching [1,000,000] items:

288257 ms total
0.288257 ms per/cmd
0.2843 ms min per/cmd
0.3026 ms max per/cmd
0.0183 ms range per/cmd

InnoDB Memcache:

Storing [100,000] items:

233863 ms total
2.33863 ms per/cmd
1.449 ms min per/cmd
7.324 ms max per/cmd
5.875 ms range per/cmd

Fetching [1,000,000] items:

347181 ms total
0.347181 ms per/cmd
0.3208 ms min per/cmd
0.4159 ms max per/cmd
0.0951 ms range per/cmd

InnoDB MySQL Select (same table):

Fetching [1,000,000] items:

441573 ms total
0.441573 ms per/cmd
0.4327 ms min per/cmd
0.5129 ms max per/cmd
0.0802 ms range per/cmd

Keep in mind that the entire data set fits into the buffer pool, so there are no reads from disk.  However, there is write activity stemming from the fact that this is using InnoDB under the hood (redo logs, etc).

Based on the above numbers, here are the relative differences:

  • InnoDB store operation was 280% higher (~1.73 ms/op)
  • InnoDB fetch operation was 20% higher (~.06 ms/op)
  • MySQL Select showed 27% increase over InnoDB fetch (~.09 ms/op)
    • This replaced $cache->load() with $db->query("SELECT * FROM memcached.container WHERE id='key_id'");
    • id is PK of the container table
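For the record, the percentage deltas can be recomputed directly from the raw per-command averages above (they land a touch above the rounded figures in the list):

```python
# Raw per-command averages (ms) from the benchmark output above
mc_store, innodb_store = 0.60486, 2.33863
mc_fetch, innodb_fetch = 0.288257, 0.347181
select_fetch = 0.441573

def pct_increase(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100

print(round(pct_increase(innodb_store, mc_store)))      # store: ~287% higher
print(round(pct_increase(innodb_fetch, mc_fetch)))      # fetch: ~20% higher
print(round(pct_increase(select_fetch, innodb_fetch)))  # SELECT vs plugin: ~27% higher
```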

While there are increases in both operations, there are some tradeoffs to consider:

  • Cost of additional memcached hardware
  • Cost of operations time to maintain an additional system
  • Impact of warmup time to application
  • Cost of disk space on database server

Now, there are definitely other NoSQL options for persistent cache out there (Redis, Couchbase, etc), but they are outside the scope of this investigation and would require different client libraries and benchmark methodology.

My goal here was to compare a transparent switch (in terms of code) and experiment with the memcache plugin.  Even the use of HandlerSocket would require coding changes (which is why it was also left out of the discussion).

