Jun 26, 2017

Facebook, Microsoft, YouTube and Twitter form Global Internet Forum to Counter Terrorism

Today Facebook, Microsoft, YouTube and Twitter collectively announced a new partnership aimed at reducing the accessibility of internet services to terrorists. The new Global Internet Forum to Counter Terrorism adds structure to existing efforts by the companies to target and remove recruiting materials for terror groups from major web platforms. Together, the four tech leaders say they…

Jun 26, 2017

Amazon said to be working on translation services for AWS customers

Amazon is working on an offering that would allow developers building apps and websites using AWS to translate their content to multiple languages, CNBC reports. The machine translation tech used to provide the multi-lingual versions of client products would be based on tech Amazon uses across its own products, the report claims. Translation services are a key competitive offering for Amazon…

Jun 24, 2017

Headstart wants to better analyze candidates to fit them with the best jobs

Nicholas Shekerdemian has a pretty typical startup story: he dropped out of college, met up with a technical co-founder and then ended up starting a company. But Shekerdemian, who at the time was helping match English teachers with Chinese citizens, wanted to start a company that would solve his own problem: actually getting a job at a cookie cutter company where everyone else applies. So…

Jun 23, 2017

Percona XtraDB Cluster, Galera Cluster, MySQL Group Replication High Availability Webinar: Q & A

Thank you for attending the Wednesday, June 21, 2017 high availability webinar titled Percona XtraDB Cluster, Galera Cluster, MySQL Group Replication. In this blog, I will provide answers to the Q & A for that webinar.

You can find the slides and a recording of the webinar here.

Is there a minimum MySQL server version for Group Replication?

MySQL Group Replication has been GA since MySQL Community Server 5.7.17. This is the lowest version you should use for the Group Replication feature; anything earlier is a beta version of it.

Since 5.7.17 was the GA release, it’s strongly recommended you use the latest 5.7 minor release. Bugs get fixed and features added in each of the minor releases (as can be seen in the Limitations section in the slide deck).

In MySQL 5.6 and earlier versions, Group Replication is not supported. Note that Percona Server for MySQL 5.7.17 and beyond also ships with Group Replication.
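For reference, here is a minimal sketch of verifying the server version and loading the Group Replication plugin on 5.7.17 or later. The group_replication_* group settings required before starting replication are omitted and must be configured separately:

-- confirm the server is at least 5.7.17
SELECT VERSION();

-- load the plugin shipped with MySQL / Percona Server for MySQL 5.7.17+
INSTALL PLUGIN group_replication SONAME 'group_replication.so';

-- check the plugin status, then start replication once the group is configured
SELECT PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS WHERE PLUGIN_NAME = 'group_replication';
START GROUP_REPLICATION;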

Can I use Percona XtraDB Cluster with MariaDB v10.2, or must I use Percona Server for MySQL?

Percona XtraDB Cluster is Percona Server for MySQL and Percona XtraBackup with the modified Galera library. You cannot run Percona XtraDB Cluster on MariaDB.

However, as Percona XtraDB Cluster is open source, it is possible for MariaDB/Codership to incorporate our modifications into their codebase.

If Percona XtraDB Cluster only allows InnoDB tables, how do we typically deal with applications that need to use MyISAM tables?

You cannot use MyISAM with Percona XtraDB Cluster, Galera or Group Replication. There is experimental MyISAM support in Galera/Percona XtraDB Cluster, but we strongly recommend against using it in production: it effectively executes all statements in Total Order Isolation, which results in poor performance.
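If the application only uses MyISAM out of habit or legacy defaults, the usual path is to convert those tables to InnoDB before migrating. A minimal sketch (schema and table names are hypothetical):

-- find tables still using MyISAM
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE ENGINE = 'MyISAM'
  AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema');

-- convert a table to InnoDB
ALTER TABLE app.legacy_table ENGINE = InnoDB;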

What is a typical business use case for Group Replication? I specifically like the write-ordering feature.

Typical use cases are:

  • Environments with strict durability requirements
  • Writing to multiple nodes simultaneously while keeping the data consistent
  • Reducing failover time
  • Read-scaling on the other nodes, where reading stale data would be a problem for the application (as opposed to standard asynchronous replication)

The use cases for Galera and Percona XtraDB Cluster are similar.

Where do you run ProxySQL, on a separate server? We are using HAProxy.

You can deploy ProxySQL in many different ways. One common method of installation is to run ProxySQL on a separate layer of servers (ensuring there is failover on this layer). Another commonly used method is to run a ProxySQL daemon on every application server.
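As an illustration, here is a minimal sketch of registering cluster nodes through ProxySQL's admin interface (the hostgroup and IP addresses are hypothetical):

-- on the ProxySQL admin interface (port 6032 by default)
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (0, '10.0.0.1', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (0, '10.0.0.2', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (0, '10.0.0.3', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;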

Do you support KVM?

Yes, there are no limitations on virtualization solutions.

Can you give some examples of an “arbitrator”?

Some useful links:

What does Percona XtraDB add to make it more performant than InnoDB?

The scalability and performance improvements of Percona XtraDB are listed on the Percona Server for MySQL documentation page: https://www.percona.com/doc/percona-server/LATEST/index.html

How scalable is Percona XtraDB Cluster storage-wise? Do we have any limitations?

Storage happens through the storage engine (which is InnoDB). Percona XtraDB Cluster does not have any different limitations than Percona Server for MySQL or MySQL.

However, we also need to consider the practical side of things: the larger the cluster gets, the longer certain operations take. For example, when adding a new node to the cluster, another node must act as the donor and provide all of the data. This takes substantially longer with larger datasets, so certain operational aspects can become more complex.
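If controlling which node serves the state transfer matters operationally (for example, to avoid taking a busy node out of rotation), Galera and Percona XtraDB Cluster let you name a preferred donor. A minimal my.cnf sketch (the node name is hypothetical):

[mysqld]
# prefer node2 as the SST donor when this node joins the cluster
wsrep_sst_donor=node2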

Is there any development to add multiple nodes simultaneously?

No, at the moment only one node can join the cluster at a time. Other nodes automatically wait until that node has finished before joining.

Why does Galera say we cannot use READ COMMITTED isolation for multimaster mode, even though we can start the cluster with READ-COMMITTED?

You can use READ-COMMITTED as transaction isolation level. The limitation is that you cannot use SERIALIZABLE: http://galeracluster.com/documentation-webpages/isolationlevels.html.

Galera Cluster and MariaDB currently do not prevent a user from using this transaction isolation level. Percona XtraDB Cluster implemented the strict mode to prevent these operations: https://www.percona.com/doc/percona-xtradb-cluster/LATEST/features/pxc-strict-mode.html#explicit-table-locking
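To make the difference concrete, here is a minimal sketch: the first statement is allowed, while the second is rejected on Percona XtraDB Cluster when pxc_strict_mode is ENFORCING (the default):

-- supported isolation level
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- not supported for multi-master use; blocked when pxc_strict_mode=ENFORCING
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;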

MariaDB 10.2 fixed the CHECK constraints issue. When will Percona fix this issue?

There are currently no plans to support CHECK constraints in Percona Server for MySQL (and therefore Percona XtraDB Cluster as well).

As Percona Server is effectively a fully backwards-compatible (but modified) MySQL Community Server, CHECK constraints are a feature that would normally be implemented in MySQL Community Server first.
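Until then, a common workaround is to emulate the constraint with a BEFORE INSERT trigger that raises an error (table, column and trigger names below are hypothetical):

CREATE TABLE orders (
  id       INT PRIMARY KEY,
  quantity INT NOT NULL
) ENGINE=InnoDB;

DELIMITER //
CREATE TRIGGER orders_quantity_check BEFORE INSERT ON orders
FOR EACH ROW
BEGIN
  -- emulate CHECK (quantity > 0)
  IF NEW.quantity <= 0 THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'quantity must be positive';
  END IF;
END//
DELIMITER ;

A matching BEFORE UPDATE trigger is needed if updates must be validated as well.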

Can you share your performance benchmark git repository (if you have one)?

We don’t have this performance benchmark in a git repository. You can find detailed information about it in this blog post: Performance improvements in Percona XtraDB Cluster 5.7.17-29.20.

On your slide pointing to scalability charts, how many nodes did you run your test against?

We used a three-node cluster for this performance benchmark.

The product uses master-master replication. As such, what do you mean when you talk about failover in such a configuration?
Where do you maintain the cluster state?

All technologies automatically maintain the cluster state as you add and remove nodes.
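On Galera and Percona XtraDB Cluster, for example, the current cluster state is visible through the wsrep status variables; a minimal sketch:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';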

What are the network/IP requirements for ProxySQL?

There are no specific requirements. More documentation about ProxySQL can be found here: https://github.com/sysown/proxysql/wiki.

Jun 23, 2017

Algorithmia raises $10.5M Series A round led by Google’s new AI fund

Word recently spread that Google had quietly launched a new fund for investing into AI companies. Now this fund has made its first (or at least its first public) investment, leading a $10.5 million Series A into Algorithmia, a marketplace and enterprise solution that allows developers to easily tap into its catalog of 3,500 algorithms, functions and machine-learning models.

Jun 22, 2017

ClickHouse in a General Analytical Workload (Based on a Star Schema Benchmark)

In this blog post, we’ll look at how ClickHouse performs in a general analytical workload using the star schema benchmark test.

We have mentioned ClickHouse in some recent posts (ClickHouse: New Open Source Columnar Database, Column Store Database Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark), where it showed excellent results. ClickHouse by itself seems to be an event-oriented RDBMS, as its name suggests (clicks). Its primary use case, Yandex Metrica (a system similar to Google Analytics), also points to an event-based nature. We can also see that there is a requirement for date-stamped columns.

It is possible, however, to use ClickHouse in a general analytical workload. This blog post shares my findings. For these tests, I used the Star Schema Benchmark, slightly modified so that it can handle ClickHouse specifics.

First, let’s talk about schemas. We need to adjust to ClickHouse data types. For example, the biggest fact table in SSB is “lineorder”. Below is how it is defined for Amazon RedShift (as taken from https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-create-test-data.html):

CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);

For ClickHouse, the table definition looks like this:

CREATE TABLE lineorderfull (
        LO_ORDERKEY             UInt32,
        LO_LINENUMBER           UInt8,
        LO_CUSTKEY              UInt32,
        LO_PARTKEY              UInt32,
        LO_SUPPKEY              UInt32,
        LO_ORDERDATE            Date,
        LO_ORDERPRIORITY        String,
        LO_SHIPPRIORITY         UInt8,
        LO_QUANTITY             UInt8,
        LO_EXTENDEDPRICE        UInt32,
        LO_ORDTOTALPRICE        UInt32,
        LO_DISCOUNT             UInt8,
        LO_REVENUE              UInt32,
        LO_SUPPLYCOST           UInt32,
        LO_TAX                  UInt8,
        LO_COMMITDATE           Date,
        LO_SHIPMODE             String
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER),8192);

From this we can see that we need to use data types like UInt8 and UInt32, which are somewhat unusual in the database world.

The second table (RedShift definition):

CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

For ClickHouse, I defined it as:

CREATE TABLE customerfull (
        C_CUSTKEY       UInt32,
        C_NAME          String,
        C_ADDRESS       String,
        C_CITY          String,
        C_NATION        String,
        C_REGION        String,
        C_PHONE         String,
        C_MKTSEGMENT    String,
        C_FAKEDATE      Date
)Engine=MergeTree(C_FAKEDATE,(C_CUSTKEY),8192);

For reference, the full schema for the benchmark is here: https://github.com/vadimtk/ssb-clickhouse/blob/master/create.sql.

For this table, we need to define a rudimentary column C_FAKEDATE Date in order to use ClickHouse’s most advanced engine (MergeTree). I was told by the ClickHouse team that they plan to remove this limitation in the future.

To generate data acceptable by ClickHouse, I made modifications to ssb-dbgen. You can find my version here: https://github.com/vadimtk/ssb-dbgen. The most notable change is that ClickHouse can’t accept dates in CSV files formatted as “19971125”. It has to be “1997-11-25”. This is something to keep in mind when loading data into ClickHouse.

It is possible to do some preformatting during the load, but I don’t have experience with that. A common approach is to create a staging table with data types that match the loaded data, and then convert them using SQL functions when inserting into the main table.
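As a hedged sketch of that staging approach (the table and column names are hypothetical, not part of the benchmark schema), the raw "YYYYMMDD" string can be rewritten into a proper Date on insert:

CREATE TABLE events_staging (
    id        UInt32,
    event_day String        -- raw value such as '19971125' loaded from CSV
) ENGINE = Log;

CREATE TABLE events (
    id        UInt32,
    event_day Date
) ENGINE = MergeTree(event_day, (id), 8192);

INSERT INTO events
SELECT
    id,
    toDate(concat(substring(event_day, 1, 4), '-',
                  substring(event_day, 5, 2), '-',
                  substring(event_day, 7, 2))) AS event_day
FROM events_staging;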

Hardware Setup

One of the goals of this benchmark is to see how ClickHouse scales on multiple nodes. I used a setup of one node, and then compared it to a setup of three nodes. Each node has 24 cores of “Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz” CPUs, and the data is located on very fast PCIe flash storage.

For the SSB benchmark I use a scale factor of 2500, which provides (in raw data):

  • Table lineorder: 15 billion rows, raw size 1.7 TB
  • Table customer: 75 million rows

When loaded into ClickHouse, the table lineorder takes 464GB, which corresponds to a 3.7x compression ratio.

We compare a one-node (table names lineorderfull, customerfull) setup vs. a three-node (table names lineorderd, customerd) setup.

Single Table Operations

Query:

SELECT
    toYear(LO_ORDERDATE) AS yod,
    sum(LO_REVENUE)
FROM lineorderfull
GROUP BY yod

One node:

7 rows in set. Elapsed: 9.741 sec. Processed 15.00 billion rows, 90.00 GB (1.54 billion rows/s., 9.24 GB/s.)

Three nodes:

7 rows in set. Elapsed: 3.258 sec. Processed 15.00 billion rows, 90.00 GB (4.60 billion rows/s., 27.63 GB/s.)

We see a speed up of practically three times. Handling 4.6 billion rows/s is blazingly fast!

One Table with Filtering

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorderfull
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)

One node:

1 rows in set. Elapsed: 3.175 sec. Processed 2.28 billion rows, 18.20 GB (716.60 million rows/s., 5.73 GB/s.)

Three nodes:

1 rows in set. Elapsed: 1.295 sec. Processed 2.28 billion rows, 18.20 GB (1.76 billion rows/s., 14.06 GB/s.)

It’s worth mentioning that during the execution of this query, ClickHouse was able to use ALL 24 cores on each box. This confirms that ClickHouse is a massively parallel processing system.

Two Tables (Independent Subquery)

In this case, I want to show how ClickHouse handles independent subqueries:

SELECT sum(LO_REVENUE)
FROM lineorderfull
WHERE LO_CUSTKEY IN
(
    SELECT C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
    WHERE C_REGION = 'ASIA'
)

One node:

1 rows in set. Elapsed: 28.934 sec. Processed 15.00 billion rows, 120.00 GB (518.43 million rows/s., 4.15 GB/s.)

Three nodes:

1 rows in set. Elapsed: 14.189 sec. Processed 15.12 billion rows, 121.67 GB (1.07 billion rows/s., 8.57 GB/s.)

We do not see, however, a close to 3x speedup on three nodes, because of the data transfer required to match LO_CUSTKEY with C_CUSTKEY.

Two Tables JOIN

With a subquery using columns to return results, or for GROUP BY, things get more complicated. In this case we want to GROUP BY the column from the second table.

First, ClickHouse doesn’t support traditional subquery syntax, so we need to use JOIN. For JOINs, ClickHouse also strictly prescribes how the query must be written (a limitation that is also expected to change in the future). Our JOIN should look like:

SELECT
    C_REGION,
    sum(LO_EXTENDEDPRICE * LO_DISCOUNT)
FROM lineorderfull
ANY INNER JOIN
(
    SELECT
        C_REGION,
        C_CUSTKEY AS LO_CUSTKEY
    FROM customerfull
) USING (LO_CUSTKEY)
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)
GROUP BY C_REGION

One node:

5 rows in set. Elapsed: 31.443 sec. Processed 2.35 billion rows, 28.79 GB (74.75 million rows/s., 915.65 MB/s.)

Three nodes:

5 rows in set. Elapsed: 25.160 sec. Processed 2.58 billion rows, 33.25 GB (102.36 million rows/s., 1.32 GB/s.)

In this case the speedup is not even twofold. This is a consequence of the random data distribution in the tables lineorderd and customerd. Both tables were defined as:

CREATE TABLE lineorderd AS lineorder ENGINE = Distributed(3shards, default, lineorder, rand());
CREATE TABLE customerd AS customer ENGINE = Distributed(3shards, default, customer, rand());

Here rand() specifies that records are distributed randomly across the three nodes. When we perform a JOIN on LO_CUSTKEY=C_CUSTKEY, matching records might be located on different nodes. One way to deal with this is to distribute the data by the join key, so that matching rows land on the same node. For example:

CREATE TABLE lineorderLD AS lineorderL ENGINE = Distributed(3shards, default, lineorderL, LO_CUSTKEY);
CREATE TABLE customerLD AS customerL ENGINE = Distributed(3shards, default, customerL, C_CUSTKEY);

Three Tables JOIN

This is where it becomes very complicated. Let’s consider the query that you would normally write:

SELECT sum(LO_REVENUE),P_MFGR, toYear(LO_ORDERDATE) yod FROM lineorderfull ,customerfull,partfull WHERE C_REGION = 'ASIA' and
LO_CUSTKEY=C_CUSTKEY and P_PARTKEY=LO_PARTKEY GROUP BY P_MFGR,yod ORDER BY P_MFGR,yod;

With ClickHouse’s limitations on JOIN syntax, the query becomes:

SELECT
    sum(LO_REVENUE),
    P_MFGR,
    toYear(LO_ORDERDATE) AS yod
FROM
(
    SELECT
        LO_PARTKEY,
        LO_ORDERDATE,
        LO_REVENUE
    FROM lineorderfull
    ALL INNER JOIN
    (
        SELECT
            C_REGION,
            C_CUSTKEY AS LO_CUSTKEY
        FROM customerfull
    ) USING (LO_CUSTKEY)
    WHERE C_REGION = 'ASIA'
)
ALL INNER JOIN
(
    SELECT
        P_MFGR,
        P_PARTKEY AS LO_PARTKEY
    FROM partfull
) USING (LO_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

By writing queries this way, we force ClickHouse to use the prescribed JOIN order — at this moment there is no optimizer in ClickHouse and it is totally unaware of data distribution.

There is also not much speedup when we compare one node vs. three nodes:

One node execution time:

35 rows in set. Elapsed: 697.806 sec. Processed 15.08 billion rows, 211.53 GB (21.61 million rows/s., 303.14 MB/s.)

Three nodes execution time:

35 rows in set. Elapsed: 622.536 sec. Processed 15.12 billion rows, 211.71 GB (24.29 million rows/s., 340.08 MB/s.)

There is a way to make the query faster for this 3-way JOIN, however. (Thanks to Alexander Zaytsev from https://www.altinity.com/ for help!)

Optimized query:

SELECT
    sum(revenue),
    P_MFGR,
    yod
FROM
(
    SELECT
        LO_PARTKEY AS P_PARTKEY,
        toYear(LO_ORDERDATE) AS yod,
        SUM(LO_REVENUE) AS revenue
    FROM lineorderfull
    WHERE LO_CUSTKEY IN
    (
        SELECT C_CUSTKEY
        FROM customerfull
        WHERE C_REGION = 'ASIA'
    )
    GROUP BY
        P_PARTKEY,
        yod
)
ANY INNER JOIN partfull USING (P_PARTKEY)
GROUP BY
    P_MFGR,
    yod
ORDER BY
    P_MFGR ASC,
    yod ASC

Optimized query time:

One node:

35 rows in set. Elapsed: 106.732 sec. Processed 15.00 billion rows, 210.05 GB (140.56 million rows/s., 1.97 GB/s.)

Three nodes:

35 rows in set. Elapsed: 75.854 sec. Processed 15.12 billion rows, 211.71 GB (199.36 million rows/s., 2.79 GB/s.)

That’s an improvement of about 6.5 times compared to the original query. This shows the importance of understanding data distribution, and writing the optimal query to process the data.

Another option for dealing with JOIN complexity, and to improve performance, is to use ClickHouse’s dictionaries. These dictionaries are described here: https://www.altinity.com/blog/2017/4/12/dictionaries-explained.

I will review dictionary performance in future posts.

Another traditional way to deal with JOIN complexity in an analytics workload is to use denormalization. We can move some columns (for example, P_MFGR from the last query) to the fact table (lineorder).
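As a hedged sketch of that denormalization (applied to newly loaded data; the ClickHouse version tested here has no UPDATE, so existing rows would need to be reloaded), the manufacturer column moves into the fact table and the JOIN disappears:

ALTER TABLE lineorderfull ADD COLUMN P_MFGR String;

SELECT
    sum(LO_REVENUE),
    P_MFGR,
    toYear(LO_ORDERDATE) AS yod
FROM lineorderfull
GROUP BY P_MFGR, yod
ORDER BY P_MFGR, yod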

Observations

  • ClickHouse can handle general analytical queries (it requires special schema design and considerations, however)
  • Linear speedup is possible, but it depends on query design and requires advanced planning — proper speedup depends on data locality
  • ClickHouse is blazingly fast (beyond what I’ve seen before) because it can use all available CPU cores for a query: as shown above, it used 24 cores on a single server and 72 cores across three nodes
  • Multi-table JOINs are cumbersome and require manual work to achieve better performance, so consider using dictionaries or denormalization

Jun 21, 2017

Talking Drupal #150 – Drupal GovCon

www.talkingdrupal.com/150

In Episode #150 we talk with the organizers of Drupal GovCon. Learn why this Drupal Camp is called “GovCon”.

Show Topics

  • July 31 – August 2, Bethesda, MD
  • Not a government-focused conference
  • Camp size and attendees
  • Sessions
  • Beyond the Sessions
  • History of GovCon
  • Organizing a camp

Resources

www.drupalgovcon.org

Hosts

Stephen Cross – www.ParallaxInfoTech.com @stephencross

Nic Laflin – www.nLightened.net @nicxvan

John Picozzi – www.oomphinc.com @johnpicozzi

Guest

Kirsten Burgard – www.drupal.org/u/bendygirl @uniongal

Daniel Schiavone – www.drupal.org/u/schiavone @schiavo

Jun 21, 2017

Percona Monitoring and Management 1.1.5 is Now Available

Percona Monitoring and Management (PMM)

Percona announces the release of Percona Monitoring and Management 1.1.5 on June 21, 2017.

For installation instructions, see the Deployment Guide.


Changes in PMM Server

  • PMM-667: Fixed the Latency graph in the ProxySQL Overview dashboard to plot microsecond values instead of milliseconds.

  • PMM-800: Fixed the InnoDB Page Splits graph in the MySQL InnoDB Metrics Advanced dashboard to show correct page merge success ratio.

  • PMM-1007: Added links to Query Analytics from MySQL Overview and MongoDB Overview dashboards. The links also pass selected host and time period values.

    NOTE: These links currently open QAN2, which is still considered experimental.

Changes in PMM Client

  • PMM-931: Fixed pmm-admin script when adding MongoDB metrics monitoring for secondary in a replica set.

About Percona Monitoring and Management

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

A live demo of PMM is available at pmmdemo.percona.com.

Please provide your feedback and questions on the PMM forum.

If you would like to report a bug or submit a feature request, use the PMM project in JIRA.

Jun 21, 2017

Tracing MongoDB Queries to Code with Cursor Comments

In this short blog post, we will discuss a helpful feature for tracing MongoDB queries: Cursor Comments.

Cursor Comments

Much like other database systems, MongoDB supports the ability for application developers to set comment strings on their database queries using the Cursor Comment feature. This feature is very useful for both DBAs and developers for quickly and efficiently tying a MongoDB query found on the database server to a line of code in the application source.

Once Cursor Comments are set in application code, they can be seen in the following areas on the server:

  1. The db.currentOp() shell command. If auth is enabled, this requires a role that has the ‘inprog’ privilege.
  2. Profiles in the system.profile collection (per-db), if profiling is enabled.
  3. The QUERY log component.

Note: the Cursor Comment string shows as the field “query.comment” in the Database Profiler output, and as the field originatingCommand.comment in the output of the db.currentOp() command.
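For example, assuming the field paths above and a hypothetical comment value, both places can be filtered by the comment directly:

// in-progress operations carrying this cursor comment
db.currentOp({ "originatingCommand.comment": "checkout-service" })

// saved profiles carrying the same comment
db.system.profile.find({ "query.comment": "checkout-service" })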

This is fantastic because this makes comments visible in the areas commonly used to find performance issues!

Often it is very easy to find a slow query on the database server, but it is difficult to target the exact area of a large application that triggers the slow query. This can all be changed with Cursor Comments!

Python Example

Below is a snippet of Python code implementing a cursor comment on a simple query to the collection “test.test”. (Most other languages and MongoDB drivers should work similarly if you do not use Python.)

My goal in this example is to get the MongoDB Profiler to log a custom comment, and then we will read it back manually from the server afterward to confirm it worked.

In this example, I include the following pieces of data in my comment:

  1. The Python class
  2. The Python method that executed the query
  3. The file Python was executing
  4. The line of the file Python was executing

Unfortunately, three of the four useful details above are not built-in variables in Python, so the “inspect” module is required to fetch those details. Using the “inspect” module and setting a cursor comment for every query in an application is a bit clunky, so it is best to create a method to do this. I made a class-method named “find_with_comment” in this example code to do this. This method performs a MongoDB query and sets the cursor comment automagically, finally returning a regular pymongo cursor object.

Below is the simple Python example script. It connects to a Mongod on localhost:27017, and demonstrates everything for us. You can run this script yourself if you have the “pymongo” Python package installed.

Script:

from inspect import currentframe, getframeinfo
from pymongo import MongoClient
class TestClass:
    def find_with_comment(self, conn, query, db, coll):
        frame      = getframeinfo(currentframe().f_back)
        comment    = "%s:%s;%s:%i" % (self.__class__.__name__, frame.function, frame.filename, frame.lineno)
        collection = conn[db][coll]
        return collection.find(query).comment(comment)
    def run(self):
        uri   = "localhost:27017"
        conn  = MongoClient(uri)
        query = {'user.name': 'John Doe'}
        for doc in self.find_with_comment(conn, query, 'test', 'test'):
            print doc
        conn.close()
if __name__  == "__main__":
    t = TestClass()
    t.run()

There are a few things to explain in this code:

  1. Line #6-10: The “find_with_comment” method runs a pymongo query and handles adding our special cursor comment string. This method takes in the connection, query and db + collection names as arguments.
  2. Line #7: uses the “inspect” module to read the calling Python “frame” so we can fetch the file and line number that issued the query.
  3. Line #12-18: The “run” method makes a database connection, runs the “find_with_comment” method with a query, prints the results and closes the connection. This method is just boilerplate to run the example.
  4. Line #20-21: This code instantiates TestClass and calls the “run” method to run our test.

Trying It Out

Before running this script, enable database profiling mode “2” on the “test” database. This is the database the script will query. The profiling mode “2” causes MongoDB to profile all queries:

$ mongo --port=27017
> use test
switched to db test
> db.setProfilingLevel(2)
{ "was" : 1, "slowms" : 100, "ratelimit" : 1, "ok" : 1 }
> quit()

Now let’s run the script. There should be no output from the script; it only runs a find query to generate a profile.

I saved the script as cursor-comment.py and ran it like this from my Linux terminal:

$ python cursor-comment.py
$

Now, let’s see if we can find any Profiles containing the “query.comment” field:

$ mongo --port=27017
> use test
> db.system.profile.find({ "query.comment": {$exists: true} }, { query: 1 }).pretty()
{
	"query" : {
		"find" : "test",
		"filter" : {
			"user.name" : "John Doe"
		},
		"comment" : "TestClass:run;cursor-comment.py:16"
	}
}

Now we know the exact class, method, file and line number that ran this profiled query! Great!

From this Profile we can conclude that the class-method “TestClass:run” initiated this MongoDB query from Line #16 of cursor-comment.py. Imagine this was a query that slowed down your production system and you need to know the source quickly. The usefulness of this feature/workflow becomes obvious, fast.

More on Python “inspect”

Instead of constructing a custom comment like the example above, you can also use Python “inspect” to collect the Python source code comment that precedes the code that is running. This might be useful for projects that have code comments that would be more useful than class/method/file/line number. As the comment is a string, the sky is the limit on what you can set!

Read about the .getcomments() method of “inspect” here: https://docs.python.org/2/library/inspect.html#inspect.getcomments
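As a minimal sketch of what it returns (the function below is hypothetical), getcomments() hands back the comment block written directly above a function's definition:

from inspect import getcomments

# Fetch all active users; results are cached upstream, so this query is cheap.
def find_active_users():
    pass

# Prints the comment line above, including the leading "#".
print(getcomments(find_active_users))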

Aggregation Comments

MongoDB 3.5/3.6 added support for comments in aggregations. This is a great new feature, as aggregations are often heavy operations that would be useful to tie to a line of code as well!

This can be used by adding a “comment” field to your “aggregate” server command, like so:

db.runCommand({
  aggregate: "myCollection",
  pipeline: [
    { $match: { _id: "foo" } }
  ],
  comment: "fooMatch"
})

See more about this new feature in the following MongoDB tickets: SERVER-28128 and DOCS-10020.

Conclusion

Hopefully this blog gives you some ideas on how this feature can be useful in your application. Start adding comments to your application today!
