Feb
29
2016
--

EXPLAIN FORMAT=JSON: nested_loop makes JOIN hierarchy transparent

Once again it's time for another EXPLAIN FORMAT=JSON is cool! post. This post discusses how the nested_loop element of EXPLAIN FORMAT=JSON output makes the JOIN operation hierarchy transparent.

The regular EXPLAIN command lists each table that participates in a JOIN operation on a single row. This works perfectly for simple queries:

mysql> explain select * from employees join titles join salaries\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299379
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: titles
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 442724
     filtered: 100.00
        Extra: Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2745434
     filtered: 100.00
        Extra: Using join buffer (Block Nested Loop)
3 rows in set, 1 warning (0.00 sec)

You can see that the first accessed table was employees, then titles, and finally salaries. Everything is clear.

EXPLAIN FORMAT=JSON in this case puts everything into the nested_loop array (even though “MySQL isn’t limited to nested-loop joins”):

mysql> explain format=json select * from employees join titles join salaries\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "7.277755124e16"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "employees",
          "access_type": "ALL",
          "rows_examined_per_scan": 299379,
          "rows_produced_per_join": 299379,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "929.00",
            "eval_cost": "59875.80",
            "prefix_cost": "60804.80",
            "data_read_per_join": "13M"
          },
          "used_columns": [
            "emp_no",
            "birth_date",
            "first_name",
            "last_name",
            "gender",
            "hire_date"
          ]
        }
      },
      {
        "table": {
          "table_name": "titles",
          "access_type": "ALL",
          "rows_examined_per_scan": 442724,
          "rows_produced_per_join": 132542268396,
          "filtered": "100.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "62734.88",
            "eval_cost": "26508453679.20",
            "prefix_cost": "26508577218.88",
            "data_read_per_join": "7T"
          },
          "used_columns": [
            "emp_no",
            "title",
            "from_date",
            "to_date"
          ]
        }
      },
      {
        "table": {
          "table_name": "salaries",
          "access_type": "ALL",
          "rows_examined_per_scan": 2745434,
          "rows_produced_per_join": 363886050091503872,
          "filtered": "100.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "314711040856.92",
            "eval_cost": "7.277721002e16",
            "prefix_cost": "7.277755124e16",
            "data_read_per_join": "5171P"
          },
          "used_columns": [
            "emp_no",
            "salary",
            "from_date",
            "to_date"
          ]
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.00 sec)

For a simple query this output does not add much, apart from the cost information and details about the used columns and the efficiency of composite indexes.

But what if you not only join tables, but also use other SQL language features? For example, for the query below, which has two JOIN operations and two subqueries, a regular EXPLAIN returns this plan:

mysql> explain select * from employees join dept_manager using (emp_no) where emp_no in (select emp_no from (select emp_no, salary from salaries where emp_no in (select emp_no from titles where title like '%manager%') group by emp_no, salary having salary > avg(salary) ) t )\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <subquery2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: dept_manager
   partitions: NULL
         type: ref
possible_keys: PRIMARY,emp_no
          key: PRIMARY
      key_len: 4
          ref: <subquery2>.emp_no
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: employees
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: <subquery2>.emp_no
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 4. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: <derived3>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9
     filtered: 100.00
        Extra: NULL
*************************** 5. row ***************************
           id: 3
  select_type: DERIVED
        table: titles
   partitions: NULL
         type: index
possible_keys: PRIMARY,emp_no
          key: emp_no
      key_len: 4
          ref: NULL
         rows: 442724
     filtered: 7.51
        Extra: Using where; Using index; Using temporary; Using filesort; LooseScan
*************************** 6. row ***************************
           id: 3
  select_type: DERIVED
        table: salaries
   partitions: NULL
         type: ref
possible_keys: PRIMARY,emp_no
          key: PRIMARY
      key_len: 4
          ref: employees.titles.emp_no
         rows: 9
     filtered: 100.00
        Extra: NULL
6 rows in set, 1 warning (0.00 sec)

It’s pretty hard to understand which part is a subquery and which is not. It is also hard to find out whether DERIVED belongs to the first JOIN or to the second. And I am not quite sure why <subquery2> was marked as PRIMARY, which is supposed to indicate the “Outermost SELECT”.

The real issue here is that the internal representation of a JOIN is hierarchical, and MySQL Server (as in the case of UNION) has trouble representing such an object as a “flat” table. EXPLAIN FORMAT=JSON, with its hierarchical nature, can help us in this case.

mysql> explain format=json  select * from employees join dept_manager using (emp_no) where emp_no in (select emp_no from (select emp_no, salary from salaries where emp_no in (select emp_no from titles where title like '%manager%') group by emp_no, salary having salary > avg(salary) ) t )\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "39.45"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "<subquery2>",
          "access_type": "ALL",
          "materialized_from_subquery": {
            "using_temporary_table": true,
            "query_block": {
              "table": {
                "table_name": "t",
                "access_type": "ALL",
                "rows_examined_per_scan": 9,
                "rows_produced_per_join": 9,
                "filtered": "100.00",
                "cost_info": {
                  "read_cost": "10.45",
                  "eval_cost": "1.80",
                  "prefix_cost": "12.25",
                  "data_read_per_join": "144"
                },
                "used_columns": [
                  "emp_no",
                  "salary"
                ],
                "materialized_from_subquery": {
                  "using_temporary_table": true,
                  "dependent": false,
                  "cacheable": true,
                  "query_block": {
                    "select_id": 3,
                    "cost_info": {
                      "query_cost": "176246.11"
                    },
                    "grouping_operation": {
                      "using_temporary_table": true,
                      "using_filesort": true,
                      "cost_info": {
                        "sort_cost": "9.54"
                      },
                      "nested_loop": [
                        {
                          "table": {
                            "table_name": "titles",
                            "access_type": "index",
                            "possible_keys": [
                              "PRIMARY",
                              "emp_no"
                            ],
                            "key": "emp_no",
                            "used_key_parts": [
                              "emp_no"
                            ],
                            "key_length": "4",
                            "rows_examined_per_scan": 442724,
                            "rows_produced_per_join": 33229,
                            "filtered": "7.51",
                            "using_index": true,
                            "loosescan": true,
                            "cost_info": {
                              "read_cost": "3380.56",
                              "eval_cost": "6645.94",
                              "prefix_cost": "63199.96",
                              "data_read_per_join": "2M"
                            },
                            "used_columns": [
                              "emp_no",
                              "title",
                              "from_date"
                            ],
                            "attached_condition": "(`employees`.`titles`.`title` like '%manager%')"
                          }
                        },
                        {
                          "table": {
                            "table_name": "salaries",
                            "access_type": "ref",
                            "possible_keys": [
                              "PRIMARY",
                              "emp_no"
                            ],
                            "key": "PRIMARY",
                            "used_key_parts": [
                              "emp_no"
                            ],
                            "key_length": "4",
                            "ref": [
                              "employees.titles.emp_no"
                            ],
                            "rows_examined_per_scan": 9,
                            "rows_produced_per_join": 9,
                            "filtered": "100.00",
                            "cost_info": {
                              "read_cost": "49622.62",
                              "eval_cost": "1.91",
                              "prefix_cost": "176236.57",
                              "data_read_per_join": "152"
                            },
                            "used_columns": [
                              "emp_no",
                              "salary",
                              "from_date"
                            ]
                          }
                        }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      },
      {
        "table": {
          "table_name": "dept_manager",
          "access_type": "ref",
          "possible_keys": [
            "PRIMARY",
            "emp_no"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "emp_no"
          ],
          "key_length": "4",
          "ref": [
            "<subquery2>.emp_no"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 9,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "9.00",
            "eval_cost": "1.80",
            "prefix_cost": "23.05",
            "data_read_per_join": "144"
          },
          "used_columns": [
            "dept_no",
            "emp_no",
            "from_date",
            "to_date"
          ]
        }
      },
      {
        "table": {
          "table_name": "employees",
          "access_type": "eq_ref",
          "possible_keys": [
            "PRIMARY"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "emp_no"
          ],
          "key_length": "4",
          "ref": [
            "<subquery2>.emp_no"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 1,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "1.00",
            "eval_cost": "0.20",
            "prefix_cost": "39.45",
            "data_read_per_join": "48"
          },
          "used_columns": [
            "emp_no",
            "birth_date",
            "first_name",
            "last_name",
            "gender",
            "hire_date"
          ]
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.01 sec)

First, we see that all our tables, JOIN operations and subqueries are in the nested_loop array:

"nested_loop": [
      {
        "table": {
          "table_name": "<subquery2>",
...
      {
        "table": {
          "table_name": "dept_manager",
...
      {
        "table": {
          "table_name": "employees",
...
      }
    ]

Then we see that the first table, <subquery2>, was materialized_from_subquery:

"table": {
          "table_name": "<subquery2>",
          "access_type": "ALL",
          "materialized_from_subquery": {
...

This, in turn, was produced by another materialized_from_subquery:

"table": {
          "table_name": "<subquery2>",
          "access_type": "ALL",
          "materialized_from_subquery": {
...
                "materialized_from_subquery": {
...

This innermost subquery performs a grouping_operation on another nested_loop (a JOIN) of the tables titles and salaries:

"grouping_operation": {
                      "using_temporary_table": true,
                      "using_filesort": true,
                      "cost_info": {
                        "sort_cost": "9.54"
                      },
                      "nested_loop": [
                        {
                          "table": {
                            "table_name": "titles",
...
                        },
                        {
                          "table": {
                            "table_name": "salaries",
...

Now we have a better picture of how the query was optimized: the tables titles and salaries were joined first, then GROUP BY was executed on the result, then the result was materialized and queried. The result of the query select emp_no from <materialized> t was materialized again as <subquery2>, and only after that was it joined with the two other tables.
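If you prefer to explore such plans programmatically, a short script can walk the JSON document and print this hierarchy as an indented tree. The sketch below is only an illustration (it is not part of MySQL, and it only handles the keys that appear in this post: nested_loop, table, materialized_from_subquery and grouping_operation); it expects the JSON produced by EXPLAIN FORMAT=JSON saved to a file.

#!/usr/bin/env python
# explain_tree.py -- print the JOIN hierarchy of an EXPLAIN FORMAT=JSON plan.
# Usage: python explain_tree.py plan.json
import json
import sys

def walk(node, depth=0):
    indent = "  " * depth
    if "table" in node:
        t = node["table"]
        print("%stable: %s (%s)" % (indent, t.get("table_name"), t.get("access_type")))
        # A table may itself be built from a subquery (materialization).
        if "materialized_from_subquery" in t:
            print("%s  materialized from:" % indent)
            walk(t["materialized_from_subquery"]["query_block"], depth + 2)
    if "grouping_operation" in node:
        print("%sgrouping_operation" % indent)
        walk(node["grouping_operation"], depth + 1)
    if "nested_loop" in node:
        print("%snested_loop" % indent)
        for item in node["nested_loop"]:
            walk(item, depth + 1)

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        plan = json.load(f)
    walk(plan["query_block"])

For the plan above it would print the outer nested_loop with its three tables, the materialization of <subquery2>, and the inner grouping_operation over titles and salaries.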

Conclusion: EXPLAIN FORMAT=JSON helps to understand how complex queries are optimized.

Feb
29
2016
--

Gmail Gets Improved Security Features For Business Users

Google announced a number of new security features for Gmail users in the enterprise today. Last year, the company launched its Data Loss Prevention (DLP) feature for Google Apps Unlimited users that helps businesses keep sensitive data out of emails. Today, it’s launching the first major update of this service at the RSA Conference in San Francisco. The DLP feature allows businesses to… Read More

Feb
29
2016
--

IBM Adds Post-Cyber Attack Planning With Resilient Systems Acquisition

The RSA security conference is being held this week in San Francisco where security pros come together to discuss strategy. IBM made several security announcements this morning ahead of the conference, headlined by the purchase of Resilient Systems. Instead of trying to prevent an attack, Resilient gives customers a plan to deal with a breach after it’s happened. While IBM offers… Read More

Feb
29
2016
--

MongoLab Changes Its Name To mLab As It looks Beyond Database Management

MongoLab has long been the de facto service for deploying and managing MongoDB databases, but now that the company has cornered this market, it’s looking to expand into new areas. Before launching any new products, though, the company decided to change its name from MongoLab to mLab. MongoLab/mLab CEO and co-founder Will Shulman tells me that he always saw MongoLab as the first part of… Read More

Feb
28
2016
--

Graphing MySQL performance with Prometheus and Grafana

This post explains how you can quickly start using trending tools such as Prometheus and Grafana for monitoring and graphing MySQL and system performance.

I will try to keep this post as short as possible, so you can quickly set things up before getting bored. I plan to cover the details in the next few posts. Here I am going to walk through the installation process in order to end up with some really useful and good-looking graphs.

Overview

Prometheus is an open-source service monitoring system and time-series database. In short, the quite efficient daemon scrapes metrics from remote machines over HTTP and stores the data in its local time-series database. Prometheus provides a simple web interface, a very powerful query language, an HTTP API, etc. However, the storage is not designed to be durable for the time being.

The remote machines need to run exporters to expose metrics to Prometheus. We will be using the following two:

  • node_exporter – for the operating system metrics
  • mysqld_exporter – for the MySQL metrics

Grafana is an open-source, feature-rich metrics dashboard and graph editor for Graphite, Elasticsearch, OpenTSDB, Prometheus and InfluxDB. It is a powerful tool for visualizing large-scale measurement data and is designed to work with time series. Grafana supports different types of graphs, allows for custom representations of individual metrics on a graph, and supports various methods of authentication, including LDAP.

Diagram

Here is a diagram of the setup we are going to use:
Prometheus + Grafana diagram

Prometheus setup

Install Prometheus on the monitor host.

Get the latest tarball from GitHub:

wget https://github.com/prometheus/prometheus/releases/download/0.17.0rc2/prometheus-0.17.0rc2.linux-amd64.tar.gz
mkdir /opt/prometheus
tar zxf prometheus-0.17.0rc2.linux-amd64.tar.gz -C /opt/prometheus --strip-components=1

Create a simple config:

cat << EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval:     5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: linux
    target_groups:
      - targets: ['192.168.56.107:9100']
        labels:
          alias: db1
  - job_name: mysql
    target_groups:
      - targets: ['192.168.56.107:9104']
        labels:
          alias: db1
EOF

where 192.168.56.107 is the IP address of the db host we are going to monitor, and db1 is its short name. Note that the "alias" label is important here, because the predefined dashboards below rely on it to produce per-host graphs.

Start Prometheus in the foreground:

[root@centos7 ~]# cd /opt/prometheus
[root@centos7 prometheus]# ./prometheus
prometheus, version 0.17.0rc2 (branch: release-0.17, revision: 667c221)
  build user:       fabianreinartz@macpro
  build date:       20160205-13:35:53
  go version:       1.5.3
INFO[0000] Loading configuration file prometheus.yml     source=main.go:201
INFO[0000] Loading series map and head chunks...         source=storage.go:297
INFO[0000] 0 series loaded.                              source=storage.go:302
WARN[0000] No AlertManager configured, not dispatching any alerts  source=notification.go:165
INFO[0000] Starting target manager...                    source=targetmanager.go:114
INFO[0000] Target manager started.                       source=targetmanager.go:168
INFO[0000] Listening on :9090                            source=web.go:239

Now we can access Prometheus’ built-in web interface at http://monitor_host:9090.

Prometheus web interface
If you look at the Status page in the top menu, you will see that our monitoring targets are still down. Now let’s set them up: the Prometheus exporters.

Prometheus exporters setup

Install the exporters on the db host. Of course, you can use the same monitor host for the experiment. Obviously, this node must be running MySQL.

Download the node_exporter and mysqld_exporter releases:

wget https://github.com/prometheus/node_exporter/releases/download/0.12.0rc3/node_exporter-0.12.0rc3.linux-amd64.tar.gz
wget https://github.com/prometheus/mysqld_exporter/releases/download/0.7.1/mysqld_exporter-0.7.1.linux-amd64.tar.gz
mkdir /opt/prometheus_exporters
tar zxf node_exporter-0.12.0rc3.linux-amd64.tar.gz -C /opt/prometheus_exporters
tar zxf mysqld_exporter-0.7.1.linux-amd64.tar.gz -C /opt/prometheus_exporters

Start node_exporter in the foreground:

[root@centos7 ~]# cd /opt/prometheus_exporters
[root@centos7 prometheus_exporters]# ./node_exporter
INFO[0000] No directory specified, see --collector.textfile.directory  source=textfile.go:57
INFO[0000] Enabled collectors:                           source=node_exporter.go:146
INFO[0000]  - filesystem                                 source=node_exporter.go:148
INFO[0000]  - loadavg                                    source=node_exporter.go:148
INFO[0000]  - time                                       source=node_exporter.go:148
INFO[0000]  - vmstat                                     source=node_exporter.go:148
INFO[0000]  - diskstats                                  source=node_exporter.go:148
INFO[0000]  - filefd                                     source=node_exporter.go:148
INFO[0000]  - mdadm                                      source=node_exporter.go:148
INFO[0000]  - meminfo                                    source=node_exporter.go:148
INFO[0000]  - netdev                                     source=node_exporter.go:148
INFO[0000]  - textfile                                   source=node_exporter.go:148
INFO[0000]  - entropy                                    source=node_exporter.go:148
INFO[0000]  - stat                                       source=node_exporter.go:148
INFO[0000]  - uname                                      source=node_exporter.go:148
INFO[0000]  - conntrack                                  source=node_exporter.go:148
INFO[0000]  - netstat                                    source=node_exporter.go:148
INFO[0000]  - sockstat                                   source=node_exporter.go:148
INFO[0000]  - version                                    source=node_exporter.go:148
INFO[0000] Starting node_exporter v0.12.0rc3 at :9100    source=node_exporter.go:167

Unlike node_exporter, mysqld_exporter needs MySQL credentials. The following privileges should be sufficient:

mysql> GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'prom'@'localhost' identified by 'abc123';
mysql> GRANT SELECT ON performance_schema.* TO 'prom'@'localhost';

Create .my.cnf and start mysqld_exporter in the foreground:

[root@centos7 ~]# cd /opt/prometheus_exporters
[root@centos7 prometheus_exporters]# cat << EOF > .my.cnf
[client]
user=prom
password=abc123
EOF
[root@centos7 prometheus_exporters]#
[root@centos7 prometheus_exporters]# ./mysqld_exporter -config.my-cnf=".my.cnf"
INFO[0000] Starting Server: :9104                        file=mysqld_exporter.go line=1997

At this point we should see our endpoints are up and running on the Prometheus Status page:
Prometheus status page
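Before moving on to Grafana, it can be handy to confirm from the monitor host that both exporters actually answer on their ports. The snippet below is just a convenience check, not part of Prometheus; it assumes the standard /metrics path and the ports shown in the logs above, and that 192.168.56.107 is your db host.

#!/usr/bin/env python
# check_exporters.py -- quick sanity check that the exporters serve metrics.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

DB_HOST = "192.168.56.107"  # the host running node_exporter and mysqld_exporter
ENDPOINTS = {
    "node_exporter": "http://%s:9100/metrics" % DB_HOST,
    "mysqld_exporter": "http://%s:9104/metrics" % DB_HOST,
}

for name, url in sorted(ENDPOINTS.items()):
    try:
        body = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        metrics = [l for l in body.splitlines() if l and not l.startswith("#")]
        print("%-16s OK (%d metric lines), e.g. %s" % (name, len(metrics), metrics[0]))
    except Exception as exc:
        print("%-16s FAILED: %s" % (name, exc))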

Grafana setup

Install Grafana on the monitor host.

Grafana has RPM and DEB packages. The installation is as simple as installing one package.
On an RPM-based system:

yum install https://grafanarel.s3.amazonaws.com/builds/grafana-2.6.0-1.x86_64.rpm

or on an APT-based one:

wget https://grafanarel.s3.amazonaws.com/builds/grafana_2.6.0_amd64.deb
apt-get install -y adduser libfontconfig
dpkg -i grafana_2.6.0_amd64.deb

Edit the last section of /etc/grafana/grafana.ini so that it ends as follows:

[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards

Percona has built predefined Grafana dashboards for Prometheus for you.

Let’s get them deployed:

git clone https://github.com/percona/grafana-dashboards.git
cp -r grafana-dashboards/dashboards /var/lib/grafana

Finally, start Grafana:

service grafana-server start

At this point, we are one step away from being done. Log in to the Grafana web interface at http://monitor_host:3000 (admin/admin).

Go to Data Sources and add one for Prometheus:
Grafana datasource

Now check out the dashboards and graphs. For example, choose "System Overview" and the period "Last 5 minutes" in the top-right corner. You should see something similar to this:
Grafana screen
If your graphs are not populating, ensure the system time is correct on the monitor host.

Samples

Here are some real-world samples:



Enjoy!

Conclusion

Prometheus and Grafana are a great tandem for enabling monitoring and graphing capabilities for MySQL. The tools are pretty easy to deploy, and they are designed for time series data with high efficiency in mind. In the next blog posts I will talk more about the technical aspects, problems and related topics.

Feb
28
2016
--

How Smart Founders Can Take Advantage Of The Platform Shift

About 15 years ago, the Internet triggered a platform shift in the delivery of enterprise software; a wave of cloud startups quickly unseated the kings of the server era. Now, another platform evolution — the mobile era — is upon us. For nimble and opportunistic founders, the shift toward mobile is a monumental opportunity to stake ownership in a newly dominant platform. Read More

Feb
27
2016
--

Bookbub

In February 2016, I ran a week-long promotion with Bookbub. For those unfamiliar with the name, it's a service that emails readers to inform them of free or heavily discounted books. It is reckoned to be the best promotional service for a book, guaranteed to reach tens or hundreds of thousands of readers interested in a specific genre – in my case, Fantasy.

Bookbub US is notoriously difficult to get accepted into. They are very picky to guarantee their readers get the best deals. It's not uncommon to be rejected several times before getting a slot. But this time, I opted for Bookbub non-US. Easier to get into, but smaller audiences.

I paid $140 to get Necromancer mailed to readers in the UK, Canada and India. The $3.99 ebook would be discounted to 99c, a total steal. The discount would be available for 7 days on Amazon, Kobo, Apple, Nook and Google Play. Bookbub predicted I would sell 410 books.

Let's see how I did…

Here's the very typical sales graph (from Amazon), where one sees a huge spike on day 1, when all those readers open their Bookbub emails. The sales taper off as the week goes on.

Bookbub_KDP

To put this into perspective: currently, 18 months after this book's launch, I am selling about 1 copy/day on Amazon. That first day I sold 231! I was still selling about 16-20 per day at the end of the week.

In total, across all retailers, I sold 529 books. In a week. Now, Stephen King probably sells this many of a single book per day, or even per hour, but this is a big number for me.

Let's see how this breaks down…

BookbubRetailers

BookbubCountries

So what did I make? Total Earnings: $217. This gave me a profit of $77, but the goal of the promotion was to get my book into the hands of more readers, especially outside of the US. More readers means more potential fans, more reviews, more word-of-mouth sales, etc. Making a profit on the promotion is simply good business sense.

For any authors out there wondering about Bookbub, I think my results speak for themselves.

 

Feb
26
2016
--

Monitoring MongoDB Response Time

In this blog post, we’ll discuss how using Prometheus can help with monitoring MongoDB response time. I am currently comparing the performance of different storage engines on Percona Server for MongoDB, using a slightly customized version of Tim Callaghan’s sysbench-mongodb. Since I’m interested in measuring the response time of database operations, I created a very simple exporter of response time data for Prometheus.

My first approach to measuring MongoDB response time was inspired by Ignacio Nin’s work on tcprstat, some years ago – and by the way the VividCortex query agent works (which is not surprising, since, to the best of my knowledge, Baron inspired tcprstat in the first place).

With this in mind, I created mongo-response-time, whose only function is to print to stdout the response time of every MongoDB query seen on the wire, along with a timestamp at one-second resolution. My thanks go to the programmers of Facebook’s Go projects, as their code helped me hit the ground running.

As a first approach this was useful enough for my needs, and here is an example of a basic graph created from data generated by it:

I had to use a log scale, as otherwise the graph was just a thick bar near the bottom with a few outliers above. This is already useful, but it does not scale well. As an example, a sysbench-mongodb run of about an hour produced a CSV file with a little over eight million data points. Larger runs (like 24 hours) are just too difficult to handle with R (in one case, even though I had enough memory in my box to hold more than three copies of the file, read.csv aborted after running out of memory – if this happens to you, I suggest using RMySQL instead, which seems more memory-efficient than read.csv for ingesting large amounts of data).

For a second approach, I decided to live with less fidelity and settled for some quantiles and a max. For this, I created a simple Prometheus exporter that exposes the 0.5, 0.9 and 0.99 quantiles, and also the max response time, for every five-second period.
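To give a rough idea of what that aggregation looks like, here is an offline sketch of it in Python. It is only an illustration, not the exporter itself (which is written in Go against the Prometheus client); the input format is an assumption – one "<response_time> <unix_timestamp>" pair per line – so adjust the parsing to whatever your collector actually prints.

#!/usr/bin/env python
# Aggregate per-query response times into p50/p90/p99 and max per 5-second window.
import sys
from collections import defaultdict

WINDOW = 5  # seconds

def quantile(sorted_values, q):
    # Nearest-rank quantile; good enough for an illustration.
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

buckets = defaultdict(list)
for line in sys.stdin:
    parts = line.split()
    if len(parts) < 2:
        continue
    rt, ts = float(parts[0]), int(float(parts[1]))
    buckets[ts - ts % WINDOW].append(rt)

for window_start in sorted(buckets):
    values = sorted(buckets[window_start])
    print("%d p50=%.3f p90=%.3f p99=%.3f max=%.3f" % (
        window_start,
        quantile(values, 0.50),
        quantile(values, 0.90),
        quantile(values, 0.99),
        values[-1],
    ))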

With it, I was able to visualize the MongoDB response time data in Grafana in a way that is affordable and good enough for my needs, as can be seen in the following graphs:

The quantiles are calculated on the client side, using the Summary type from Prometheus’ Go client. The exporter also provides the same quantiles, but through a Histogram, which has the advantage of being more lightweight on clients. I decided to use the Summary as the source for this graph, as its impact on the machine seems negligible for now, and I find its data a bit more reliable (compared to calculating quantiles per period from the full data set in R). You can see how the max (a Gauge, glad you asked!) is useful to have, as it lets you find out about outliers that even the .99 quantile misses (which is expected, by the way).

If you want to try this out, you can find darwin and linux binaries here, and if you hit any problems, please reply here or email me directly at fernando (dot) ipar at Percona’s domain.

Happy monitoring!

Feb
25
2016
--

Percona Back to Basics: MongoDB updates

Welcome to the first in a new series of MongoDB blogs. These blogs will cover a selection of topics, including:

  • How-tos
  • New releases and new features
  • Getting back to the basics
  • Solutions from the field

In this first blog, we’ll discuss MongoDB updates. You can use the update method to update documents in a collection. MongoDB updates are well-covered in the MongoDB documentation, but there are some cases we should review for clarity, and to understand how and when to use them.

In this blog post, we’ll cover:

  • To $set or $inc
  • What about the engine differences?
  • What is a move?
  • How can I limit updates when a customer wants to make massive changes and remain online?

$set vs $inc

The $set and $inc options are a throwback to the MMAPv1 storage engine, but they are a consideration for optimization rather than a rule. If we know we want to add 100 to some value in a document that is 200k in size, it could cost many times more disk IO to update the entire document (using $set). The question is: how much more efficient is $inc? The manual talks about it being faster because it writes less, and says that moves are more costly (we’ll cover them in a second). However, it doesn’t give the technical logic behind this argument.

$set could update 3200 to 3300 with no issue, and would not initiate a move (in MMAP). However, anything that adds an entry to an array, adds a subdocument, adds characters to a string, adds new fields, etc., might cause a move. The larger issue at hand is that $set requires you to fetch the data first to be able to set it, while $inc lets you blindly increment the data. In practice, this might look something like:

db.logins.update({"user": user}, {"$inc": {"login_count": 1}});

Doing the same with $set, which requires fetching the document first, might look like this:

user_data = db.logins.findOne({"user": user})
db.logins.update(
    {"user": user_data.user},
    {"$set": {"login_count": user_data.login_count + 1}}
)

With regards to incrementing data, BSON is designed to advertise the length of a field at the start of each field, making it easy to skip over bytes you don’t need to read, parse and consider. As the cursor is at a very specific offset, it can change a number since it will still take the same storage size – meaning nothing else in the document needs to be shifted around. The important point is the number of seeks we need to make on a document. With BSON, if we want to update the 900th field, we would make 900 “jumps” to get to the correct position. JSON on the other hand, would read the whole document into memory and then parse each and every bracket. This requires significantly more CPU.

For BSON, the application must spend some application CPU to move between BSON and native types – but this isn’t a deal breaker: the CPU on apps is more scalable.
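A quick way to convince yourself that an in-place numeric change really does keep the encoded size stable is to encode a couple of documents with the bson package that ships with PyMongo. This is just an illustration (bson.encode needs PyMongo 3.9 or later); it is not how the server applies updates, but it shows why a changed number fits in place while a grown string does not.

# Requires the bson package bundled with PyMongo (bson.encode: PyMongo >= 3.9).
import bson

a = bson.encode({"login_count": 3200})
b = bson.encode({"login_count": 3300})
print(len(a), len(b))        # identical lengths: the new value fits in place

base  = bson.encode({"name": "dmurphy"})
grown = bson.encode({"name": "dmurphyX"})   # one extra character...
print(len(base), len(grown)) # ...and the encoded document grows, which is what
                             # can trigger a move under MMAPv1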

What about engines?

There are cases where $set could be optimal – especially if the storage engine uses a fast-update concept (also known as "read-less"). What this means is that we can just write blindly to the document space, making the changes we want. If the space needed is the same as what is available, we might even be able to avoid a move or a restructure of the document. This is true in TokuMX, PerconaFT, or MMAPv1. However, in other engines – such as systems built on LSM structures, like WiredTiger and RocksDB – this is impossible (you can read more about LSMs later, but the way an insert or update works is largely the same). Such an engine will append a new copy of the full record to the end of a file, which is very fast because it doesn’t need to look for a free item of the right size in a free list.

The downside is that using $set to append a field, or $inc to increase a counter, is much more expensive, as it executes a full document read and a complete document write. This is why the type of storage engine is critically important when explaining methods for updating documents and the expected overhead.

What is a move?

A move occurs when a document is using 64k, but an update would make it 65k. Since this is larger, the new document will not fit in the existing location. This means that, from a storage perspective, an update becomes a read, an insert, and a delete. In some engines this might be fine (for example, RocksDB will just mark the delete for later), but in other engines (i.e., LSM-based engines) too many reads can force the engine to clean up when the history list gets too long. This forced overhead is one of the reasons that LSM read operations can get bogged down, while writes are very fast.

It could be said that the default LSM state is that it needs to perform a read in any case. For the memory map, however, this means the write lock has escalated and could be many times more expensive than a non-move.

Limiting the effect of massive updates

Let's assume that we are a shopping cart system, and we have the following document structure in a collection with 100 million documents:

{
 _id : ObjectId(),
 accountName: "dmurphy",  // unique, immutable id, enforced in app
 v : 1,
 address: "123 nowhere",
 state: "tx",
 zip: 78634,
 phone: "123-124-1234",
 password: "9dd4e461268c8034f5c8564e155c67a6"
}

This has worked well for tracking, but now we want to support users having multiple addresses. We have a million users, and we want to force them to the new format, as having mixed types for a long time could be an issue. (There are cases and designs that help a client be intelligent and self-updating; however, that is outside the scope of this blog.)

The document now should be:

{
   _id : ObjectId(),
   accountName: "dmurphy",  // unique, immutable id, enforced in app
   v : 2,
   addresses: [
       {address: "123 nowhere",state: "tx",zip: 78634,phone1: "123-124-1234"}
   ],
   password: "9dd4e461268c8034f5c8564e155c67a6"
}

There are a couple of reasons for this selection. You should NEVER reuse an indexed field with different data types. MongoDB can technically store both types; however, the index could return incorrect data or, due to scan order, cause user confusion by not matching the types you might think it would. In MongoDB, "123" is nothing like 123. Therefore, depending on your query, you might not get all the expected results. Also, we incremented the version to "2", so that if you were programmatically checking and fixing versions in your application, you would know whether it still needs to be done. That model does not work for inactive users, however, which is more relevant to this example. This means we have two ways we could scope our update:

Option 1:

{v:{$lt:2}}

Option 2:

{addresses:{$exists: 0}}

Option 1 is much more secure and exact, while option 2 is based more on the outcome. We would want to use option 1 for clarity and repeatability, but how do we ensure it doesn't update all 100 million documents in one go? The IO needed, and the impact on the system, would be far too expensive – for example, it could fill the oplog so much that a restore becomes impossible. Here is one approach:

function parseNS(ns){
    // Expects that we are forcing people not to violate the rules and not to use
    // "foodb.foocollection.month.day.year"; if they do, they need to pass an array.
    if (ns instanceof Array){
        database = ns[0];
        collection = ns[1];
    }
    else{
        tNS = ns.split(".");
        if (tNS.length > 2){
            print('ERROR: NS had more than 1 period in it, please pass as an [ "dbname","coll.name.with.dots"] !');
            return false;
        }
        database = tNS[0];
        collection = tNS[1];
    }
    return {database: database, collection: collection};
}
function buildAddressDelta(doc){
    addresses = [];
    addresses[0] = { address: doc.address, state: doc.state, zip: doc.zip, phone1: doc.phone };
    // Set "v" (the version field the query filters on) so the document stops matching {v:{$lt:2}}.
    delta = {"$set": {"addresses": addresses, "v": 2}, "$unset": { "address": "", "state": "", "zip": "", "phone": "" }};
    return delta;
}
function updateWithPauses(ns, query, delta, batchSize, pauseMS){
    count = 0;
    objNS = parseNS(ns);
    if (objNS == false){ return false; }
    // NOTE: the delta argument is not used yet; buildAddressDelta() is called directly (see the text below).
    totalToProcess = db.getSiblingDB(objNS.database).getCollection(objNS.collection).find(query).count();
    while(db.getSiblingDB(objNS.database).getCollection(objNS.collection).find(query).count() > 0){
        db.getSiblingDB(objNS.database).getCollection(objNS.collection).find(query).limit(batchSize).forEach(function(doc){
            active_error = false;
            try {
                db.getSiblingDB(objNS.database).getCollection(objNS.collection).update({_id: doc._id}, buildAddressDelta(doc));
            }
            catch(err){
                print("Found error when updating _id: " + doc._id + " : " + err.message);
                active_error = true;
            }
            if (! active_error) { count++; }
            if (count % batchSize == 0) {
                print("Processed " + count + " of " + totalToProcess + " with " + (totalToProcess - count) + " to go approximately.");
                sleep(pauseMS);
            }
        });
    }
}

In this example, the specific bit is "buildAddressDelta", and the more generic part is "updateWithPauses". A future improvement would be to turn "buildAddressDelta" into a "buildDelta" function and pass it an array of deltas to apply. As you can see, the delta adds the new array of addresses with the current address as its only member, updates the version, and unsets the old fields – which should be pretty straightforward. Our focus here is more on the "updateWithPauses" script, which does a few things:

  1. Splitting and setting up an NS object for ease-of-use
  2. Finding out if we still have documents to change (and quitting when it’s done)
  3. Getting one document at a time and updating it (we could also build up a bulk operation per batch).
  4. Forcing a pause and reporting each time we hit a batchSize (% in JS means modulo)

It is possible to do more, but this is safe and slows down naturally by not batching the writes, while still doing a forced yield from time to time. You can also safely bail out of the shell to stop the process if it is impacting the system too much, and restart it later, as on each loop it looks up just the documents that still need to be changed.
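If you would rather drive the same migration from application code instead of the mongo shell, the pattern translates almost directly. The sketch below uses PyMongo and the same assumptions as the example above (the v version field and the old address fields); the database and collection names are illustrative, not something defined in this post.

#!/usr/bin/env python
# Batched, throttled schema migration: the same idea as updateWithPauses, sketched with PyMongo.
import time
from pymongo import MongoClient

BATCH_SIZE = 500
PAUSE_SECONDS = 1.0
QUERY = {"v": {"$lt": 2}}

def build_address_delta(doc):
    return {
        "$set": {
            "addresses": [{
                "address": doc.get("address"),
                "state": doc.get("state"),
                "zip": doc.get("zip"),
                "phone1": doc.get("phone"),
            }],
            "v": 2,
        },
        "$unset": {"address": "", "state": "", "zip": "", "phone": ""},
    }

coll = MongoClient()["shop"]["users"]   # database/collection names are illustrative
processed = 0
while True:
    batch = list(coll.find(QUERY).limit(BATCH_SIZE))
    if not batch:
        break                           # nothing left to migrate
    for doc in batch:
        coll.update_one({"_id": doc["_id"]}, build_address_delta(doc))
        processed += 1
    print("Processed %d documents so far" % processed)
    time.sleep(PAUSE_SECONDS)           # natural throttle between batches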

Conclusion

Hopefully, this blog has helped to demonstrate some of the finer points of MongoDB updates. The MongoDB documentation has comprehensive coverage of the processes, but don’t hesitate to ping us for specific questions!

 

Feb
25
2016
--

High availability with asynchronous replication… and transparent R/W split

In this post, the first of a MaxScale series, I describe how to use MariaDB's MaxScale and the MySQL Utilities with MySQL asynchronous replication.

When we talk about high availability with asynchronous replication, we always think about MHA or PRM. But if we want to transparently use the slave(s) for READs, what can we use?

Description:

  • Three MySQL servers, but one has very limited resources and will never be able to handle the production load. In fact this node is used for backup and some back-office queries.
  • We would like to use one of the nodes as a master and the other two as slaves, but only one will be addressed by the application for the READs. If needed, that same node will become the master.
  • The application doesn’t handle READ and WRITE connections, and it’s impossible to change it.

To achieve our goals, we will use MaxScale and its R/W split filter. When using MaxScale and asynchronous replication with MariaDB, it's possible to use MariaDB's replication manager, which is a wonderful tool written in Go. Unfortunately, this tool doesn't support standard MySQL. To replace it, I used Oracle's MySQL Utilities instead.

Our three nodes are:

  • percona1 (master)
  • percona2 (powerful slave)
  • percona3 (weak slave)

It's mandatory in this solution to use GTID, as it's the only method supported by the MySQL Utilities we are using.

This is the MaxScale configuration:

[maxscale]
threads=4
[Splitter Service]
type=service
router=readwritesplit
servers=percona1, percona2
user=maxscale
passwd=264D375EC77998F13F4D0EC739AABAD4
[Splitter Listener]
type=listener
service=Splitter Service
protocol=MySQLClient
port=3306
socket=/tmp/ClusterMaster
[percona1]
type=server
address=192.168.90.2
port=3306
protocol=MySQLBackend
[percona2]
type=server
address=192.168.90.3
port=3306
protocol=MySQLBackend
[percona3]
type=server
address=192.168.90.4
port=3306
protocol=MySQLBackend
[Replication Monitor]
type=monitor
module=mysqlmon
servers=percona1, percona2, percona3
user=maxscale
passwd=264D375EC77998F13F4D0EC739AABAD4
monitor_interval=1000
script=/usr/local/bin/failover.sh
events=master_down
[CLI]
type=service
router=cli
[CLI Listener]
type=listener
service=CLI
protocol=maxscaled
address=localhost
port=6603

As you can see, the Splitter Service contains only the two nodes able to handle the load.

And to perform the failover, in the Replication Monitor section, we define a script to use when the master is down.

That script calls mysqlrpladmin from the mysql-utilities.

In the script, we also define the following line to make sure the weak slave will never become the master:

never_master=192.168.90.4

When everything is set up and running, you should see something like this:

# maxadmin -pmariadb list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
percona1           | 192.168.90.2    |  3306 |          15 | Master, Running
percona2           | 192.168.90.3    |  3306 |        1025 | Slave, Running
percona3           | 192.168.90.4    |  3306 |           0 | Slave, Running
-------------------+-----------------+-------+-------------+--------------------

So as you can see, MaxScale discovers on its own which server is the master; this doesn't need to be specified in the configuration.

You can also use the mysqlrpladmin utility to verify the cluster's health:

# /usr/bin/mysqlrpladmin --rpl-user=repl:replpercona --master=manager:percona@192.168.90.2:3306 --slaves=manager:percona@192.168.90.3:3306,manager:percona@192.168.90.4:3306  health
# Checking privileges.
#
# Replication Topology Health:
+---------------+-------+---------+--------+------------+---------+
| host          | port  | role    | state  | gtid_mode  | health  |
+---------------+-------+---------+--------+------------+---------+
| 192.168.90.2  | 3306  | MASTER  | UP     | ON         | OK      |
| 192.168.90.3  | 3306  | SLAVE   | UP     | ON         | OK      |
| 192.168.90.4  | 3306  | SLAVE   | UP     | ON         | OK      |
+---------------+-------+---------+--------+------------+---------+

Try it with --verbose!

When we test with sysbench and stop the master, we can see some errors due to disconnects. Also, during the promotion of the new master, sysbench can't reconnect:

[  20s] queue length: 0, concurrency: 0
[  21s] threads: 8, tps: 2.00, reads: 28.00, writes: 8.00, response time: 107.61ms (95%), errors: 0.00, reconnects:  0.00
[  21s] queue length: 0, concurrency: 0
[  22s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  22s] queue length: 0, concurrency: 0
[  23s] threads: 8, tps: 1.00, reads: 14.00, writes: 4.00, response time: 100.85ms (95%), errors: 0.00, reconnects:  0.00
[  23s] queue length: 0, concurrency: 0
[  24s] threads: 8, tps: 0.00, reads: 11.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  24s] queue length: 0, concurrency: 1
[  25s] threads: 8, tps: 1.00, reads: 3.00, writes: 4.00, response time: 235.41ms (95%), errors: 0.00, reconnects:  0.00
[  25s] queue length: 0, concurrency: 0
[  26s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  26s] queue length: 0, concurrency: 0
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
[  27s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  27s] queue length: 0, concurrency: 3
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
[  28s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  28s] queue length: 0, concurrency: 4
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
[  29s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  29s] queue length: 0, concurrency: 5
[  30s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  30s] queue length: 0, concurrency: 5
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
[  31s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  31s] queue length: 0, concurrency: 7
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
WARNING: Both max-requests and max-time are 0, running endless test
sysbench 0.5:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Target transaction rate: 1/sec
Report intermediate results every 1 second(s)
Random number generator seed is 0 and will be ignored
Threads started!
FATAL: unable to connect to MySQL server, aborting...
FATAL: error 1045: failed to create new session
PANIC: unprotected error in call to Lua API (Failed to connect to the database)
WARNING: Both max-requests and max-time are 0, running endless test
sysbench 0.5:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Target transaction rate: 1/sec
Report intermediate results every 1 second(s)
Random number generator seed is 0 and will be ignored
Threads started!
[   1s] threads: 8, tps: 1.99, reads: 27.93, writes: 7.98, response time: 211.49ms (95%), errors: 0.00, reconnects:  0.00
[   1s] queue length: 0, concurrency: 0
[   2s] threads: 8, tps: 1.00, reads: 14.00, writes: 4.00, response time: 51.01ms (95%), errors: 0.00, reconnects:  0.00
[   2s] queue length: 0, concurrency: 0
[   3s] threads: 8, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[   3s] queue length: 0, concurrency: 0
[   4s] threads: 8, tps: 1.00, reads: 13.99, writes: 4.00, response time: 80.28ms (95%), errors: 0.00, reconnects:  0.00

It took 8 seconds to fail over automatically.
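If you want to measure that window yourself without sysbench, a small probe that keeps writing through the MaxScale listener and timestamps the failures gives a good approximation. This is only a sketch: the MaxScale host, the credentials and the scratch table test.probe are illustrative (they are not defined in this post), it assumes the listener on port 3306 from the configuration above, and it uses the third-party PyMySQL driver.

#!/usr/bin/env python
# Probe the MaxScale listener with one write per second and report the write-outage window.
# Assumes a scratch table: CREATE TABLE test.probe (id INT PRIMARY KEY AUTO_INCREMENT, ts DOUBLE);
import time
import pymysql

MAXSCALE = {"host": "maxscale_host",    # replace with the address of your MaxScale host
            "port": 3306, "user": "app", "password": "secret", "database": "test"}

last_ok = None
while True:
    start = time.time()
    try:
        conn = pymysql.connect(connect_timeout=2, **MAXSCALE)
        with conn.cursor() as cur:
            cur.execute("INSERT INTO probe (ts) VALUES (%s)", (start,))
        conn.commit()
        conn.close()
        if last_ok is not None and start - last_ok > 2:
            print("write outage of about %.1f seconds ended at %s"
                  % (start - last_ok, time.strftime("%H:%M:%S")))
        last_ok = start
    except pymysql.MySQLError as exc:
        print("%s write failed: %s" % (time.strftime("%H:%M:%S"), exc))
    time.sleep(max(0, 1 - (time.time() - start)))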

Then we can see the status of the servers:

# maxadmin -pmariadb list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
percona1           | 192.168.90.2    |  3306 |          17 | Down
percona2           | 192.168.90.3    |  3306 |        1025 | Master, Running
percona3           | 192.168.90.4    |  3306 |           0 | Slave, Running
-------------------+-----------------+-------+-------------+--------------------

If we restart percona1, we now see:

# maxadmin -pmariadb list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
percona1           | 192.168.90.2    |  3306 |          17 | Running
percona2           | 192.168.90.3    |  3306 |        1025 | Master, Running
percona3           | 192.168.90.4    |  3306 |           0 | Slave, Running
-------------------+-----------------+-------+-------------+--------------------

To add the node back into the asynchronous replication topology as a slave, we need to use another MySQL utility, mysqlreplicate:

# mysqlreplicate --master=manager:percona@192.168.90.3 --slave=manager:percona@192.168.90.2 --rpl-user=repl:replpercona
# master on 192.168.90.3: ... connected.
# slave on 192.168.90.2: ... connected.
# Checking for binary logging on master...
# Setting up replication...
# ...done.

This is the source of failover.sh:

#!/bin/bash
# failover.sh
# wrapper script to mysqlrpladmin
# user:password pair, must have administrative privileges.
user=manager:percona
# user:password pair, must have REPLICATION SLAVE privileges.
repluser=repl:replpercona
never_master=192.168.90.4
ARGS=$(getopt -o '' --long 'event:,initiator:,nodelist:' -- "$@")
eval set -- "$ARGS"
while true; do
    case "$1" in
        --event)
            shift;
            event=$1
            shift;
        ;;
        --initiator)
            shift;
            initiator=$1
            shift;
        ;;
        --nodelist)
            shift;
            nodelist=$1
            shift;
        ;;
        --)
            shift;
            break;
        ;;
    esac
done
# find the candidates
for i in $(echo $nodelist | sed 's/,/\n/g')
do
  if [[ "$i" =~ "$never_master" ]]
  then
     # do nothing
     echo nothing >/dev/null
  else
     if [[ "$i" =~ "$initiator" ]]
     then
	# do nothing
        echo nothing >/dev/null
     else
        candidates="$candidates,${user}@${i}"
     fi
  fi
  if [[ "$i" =~ "$initiator" ]]
  then
     # do nothing
     echo nothing >/dev/null
  else
     slaves="$slaves,${user}@${i}"
  fi
done
cmd="/usr/bin/mysqlrpladmin --rpl-user=$repluser --slaves=${slaves#?} --candidates=${candidates#?} failover"
# uncomment following line for debug
#echo $cmd >> /tmp/fred
eval $cmd

In the next post, we will focus on the monitoring module used in this configuration.
