An Overview of Sharding in PostgreSQL and How it Relates to MongoDB’s

PostgreSQL LogoA couple of weeks ago I presented at Percona University São Paulo about the new features in PostgreSQL that allow the deployment of simple shards. I’ve tried to summarize the main points in this post, as well as providing an introductory overview of sharding itself. Please note I haven’t included any third-party extensions that provide sharding for PostgreSQL in my discussion below.

Partitioning in PostgreSQL

In a nutshell, until not long ago there wasn’t a dedicated, native feature in PostgreSQL for table partitioning. Not that that prevented people from doing it anyway: the PostgreSQL community is very creative. There’s a table inheritance feature in PostgreSQL that allows the creation of child tables with the same structure as a parent table. That, combined with the employment of proper constraints in each child table along with the right set of triggers in the parent table, has provided practical “table partitioning” in PostgreSQL for years (and still works). Here’s an example:

Using table inheritance

CREATE TABLE temperature (
  city_id INT NOT NULL,
  temp DECIMAL(5,2) NOT NULL

Figure 1a. Main (or parent) table

CREATE TABLE temperature_201901 (CHECK (timestamp >= DATE '2019-01-01' AND timestamp <= DATE '2019-01-31')) INHERITS (temperature);
CREATE TABLE temperature_201902 (CHECK (timestamp >= DATE '2019-02-01' AND timestamp <= DATE '2019-02-28')) INHERITS (temperature);
CREATE TABLE temperature_201903 (CHECK (timestamp >= DATE '2019-03-01' AND timestamp <= DATE '2019-03-31')) INHERITS (temperature);

Figure 1b. Child tables inherit the structure of the parent table and are limited by constraints

CREATE OR REPLACE FUNCTION temperature_insert_trigger()
    IF ( NEW.timestamp >= DATE '2019-01-01' AND NEW.timestamp <= DATE '2019-01-31' ) THEN INSERT INTO temperature_201901 VALUES (NEW.*);
    ELSIF ( NEW.timestamp >= DATE '2019-02-01' AND NEW.timestamp <= DATE '2019-02-28' ) THEN INSERT INTO temperature_201902 VALUES (NEW.*);
    ELSIF ( NEW.timestamp >= DATE '2019-03-01' AND NEW.timestamp <= DATE '2019-03-31' ) THEN INSERT INTO temperature_201903 VALUES (NEW.*);
    ELSE RAISE EXCEPTION 'Date out of range!';
    END IF;
LANGUAGE plpgsql;

Figure 1c. A function that controls in which child table a new entry should be added according to the timestamp field

CREATE TRIGGER insert_temperature_trigger
    BEFORE INSERT ON temperature
    FOR EACH ROW EXECUTE PROCEDURE temperature_insert_trigger();

Figure 1d. A trigger is added to the parent table that calls the function above when an INSERT is performed

The biggest drawbacks for such an implementation was related to the amount of manual work needed to maintain such an environment (even though a certain level of automation could be achieved through the use of 3rd party extensions such as pg_partman) and the lack of optimization/support for “distributed” queries. The PostgreSQL optimizer wasn’t advanced enough to have a good understanding of partitions at the time, though there were workarounds that could be used such as employing constraint exclusion.

Declarative partitioning

About 1.5 year ago, PostgreSQL 10 was released with a bunch of new features, among them native support for table partitioning through the new declarative partitioning feature. Here’s how we could partition the same temperature table using this new method:

CREATE TABLE temperature (
  city_id INT NOT NULL,
  temp DECIMAL(5,2) NOT NULL
) PARTITION BY RANGE (timestamp);

Figure 2a. Main table structure for a partitioned table

CREATE TABLE temperature_201901 PARTITION OF temperature FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE temperature_201902 PARTITION OF temperature FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
CREATE TABLE temperature_201903 PARTITION OF temperature FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

Figure 2b. Tables defined as partitions of the main table; with declarative partitioning, there was no need for triggers anymore.

It still missed the greater optimization and flexibility needed to consider it a complete partitioning solution. It wasn’t possible, for example, to perform an UPDATE that would result in moving a row from one partition to a different one, but the foundation had been laid. Fast forward another year and PostgreSQL 11 builds on top of this, delivering additional features like:

  • the possibility to define a default partition, to which any entry that wouldn’t fit a corresponding partition would be added to.
  • having indexes added to the main table “replicated” to the underlying partitions, which improved declarative partitioning usability.
  • support for Foreign Keys

These are just a few of the features that led to a more mature partitioning solution.

Sharding in PostgreSQL

By now you might be reasonably questioning my premise, and that partitioning is not sharding, at least not in the sense and context you would have expected this post to cover. In fact, PostgreSQL has implemented sharding on top of partitioning by allowing any given partition of a partitioned table to be hosted by a remote server. The basis for this is in PostgreSQL’s Foreign Data Wrapper (FDW) support, which has been a part of the core of PostgreSQL for a long time. While technically possible to implement, we just couldn’t make practical use of it for sharding using the table inheritance + triggers approach. Declarative partitioning allowed for much better integration of these pieces making sharding – partitioned tables hosted by remote servers – more of a reality in PostgreSQL.

CREATE TABLE temperature_201904 (
  city_id INT NOT NULL,
  temp DECIMAL(5,2) NOT NULL

Figure 3a. On the remote server we create a “partition” – nothing but a simple table

CREATE EXTENSION postgres_fdw;
GRANT USAGE ON FOREIGN DATA WRAPPER postgres_fdw to app_user;
    OPTIONS (dbname 'postgres', host 'shard02', port
CREATE USER MAPPING for app_user SERVER shard02
    OPTIONS (user 'fdw_user', password 'secret');

Figure 3b. On the local server the preparatory steps involve loading the postgres_fdw extension, allowing our local application user to use that extension, creating an entry to access the remote server, and finally mapping that user with a user in the remote server (fdw_user) that has local access to the table we’ll use as a remote partition.

CREATE FOREIGN TABLE temperature_201904 PARTITION OF temperature
    FOR VALUES FROM ('2019-04-01') TO ('2019-05-01')
    SERVER remoteserver01;

Figure 3c. Now it’s simply a matter of creating a proper partition of our main table in the local server that will be linked to the table of the same name in the remote server

You can read more about postgres_fdw in Foreign Data Wrappers in PostgreSQL and a closer look at postgres_fdw.

When does it make sense to partition a table?

There are a several principle reasons to partition a table:

  1. When a table grows so big that searching it becomes impractical even with the help of indexes (which will invariably become too big as well).
  2. When data management is such that the target data is often the most recently added and/or older data is constantly being purged/archived, or even not being searched anymore (at least not as often).
  3. If you are loading data from different sources and maintaining it as a data warehousing for reporting and analytics.
  4. For a less expensive archiving or purging of massive data that avoids exclusive locks on the entire table.

When should we resort to sharding?

Here are a couple of classic cases:

  1. To scale out (horizontally), when even after partitioning a table the amount of data is too great or too complex to be processed by a single server.
  2. Use cases where the data in a big table can be divided into two or more segments that would benefit the majority of the search patterns. A common example used to describe a scenario like this is that of a company whose customers are evenly spread across the United States and searches to a target table involves the customer ZIP code. A shard then could be used to host entries of customers located on the East coast and another for customers on the West coast.

Note though this is by no means an extensive list.

How should we shard the data?

With sharding (in this context) being “distributed” partitioning, the essence for a successful (performant) sharded environment lies in choosing the right shard key – and by “right” I mean one that will distribute your data across the shards in a way that will benefit most of your queries. In the example above, using the customer ZIP code as shard key makes sense if an application will more often be issuing queries that will hit one shard (East) or the other (West). However, if most queries would filter by, say, birth date, then all queries would need to be run through all shards to recover the full result set. This could easily backfire on performance with the shard approach, by not selecting the right shard key or simply by having such a heterogeneous workload that no shard key would be able to satisfy it.

It only ever makes sense to shard if the nature of the queries involving the target table(s) is such that distributed processing will be the norm and constitute an advantage far greater than any overhead caused by a minority of queries that rely on JOINs involving multiple shards. Due to the distributed nature of sharding such queries will necessarily perform worse if compared to having them all hosted on the same server.

Why not simply rely on replication or clustering?

Sharding should be considered in those situations where you can’t efficiently break down a big table through data normalization or use an alternative approach and maintaining it on a single server is too demanding. The table is then partitioned and the partitions distributed across different servers to spread the load across many servers. It doesn’t need to be one partition per shard, often a single shard will host a number of partitions.

Note how sharding differs from traditional “share all” database replication and clustering environments: you may use, for instance, a dedicated PostgreSQL server to host a single partition from a single table and nothing else. However, these data scaling technologies may well complement each other: a PostgreSQL database may host a shard with part of a big table as well as replicate smaller tables that are often used for some sort of consultation (read-only), such as a price list, through logical replication.

How does sharding in PostgreSQL relates to sharding in MongoDB®?

MongoDB® tackles the matter of managing big collections straight through sharding: there is no concept of local partitioning of collections in MongoDB. In fact, the whole MongoDB scaling strategy is based on sharding, which takes a central place in the database architecture. As such, the sharding process has been made as transparent to the application as possible: all a DBA has to do is to define the shard key.

Instead of connecting to a reference database server the application will connect to an auxiliary router server named mongos which will process the queries and request the necessary information to the respective shard. It knows which shard contains what because they maintain a copy of the metadata that maps chunks of data to shards, which they get from a config server, another important and independent component of a MongoDB sharded cluster. Together, they also play a role in maintaining good data distribution across the shards, actively splitting and migrating chunks of data between servers as needed.

In PostgreSQL the application will connect and query the main database server. There isn’t an intermediary router such as the mongos but PostgreSQL’s query planner will process the query and create an execution plan. When data requested from a partitioned table is found on a remote server PostgreSQL will request the data from it, as shown in the EXPLAIN output below:

   Remote SQL: UPDATE public.emp SET sal = $2 WHERE ctid = $1
   ->  Nested Loop  (cost=100.00..300.71 rows=669 width=118)
     	Output: emp.empno, emp.ename, emp.job, emp.mgr, emp.hiredate, (emp.sal * '1.1'::double precision), emp.comm, emp.deptno, emp.ctid, salgrade.ctid
     	Join Filter: ((emp.sal > (salgrade.losal)::double precision) AND (emp.sal < (salgrade.hisal)::double precision)) ->  Foreign Scan on public.emp  (cost=100.00..128.06 rows=602 width=112)
           	Output: emp.empno, emp.ename, emp.job, emp.mgr, emp.hiredate, emp.sal, emp.comm, emp.deptno, emp.ctid

Figure 4: excerpt of an EXPLAIN plan that involves processing a query in a remote server

Note in the above query the mention “Remote SQL”. A lot of optimizations have been made in the execution of remote queries in PostgreSQL 10 and 11, which contributed to mature and improve the sharding solution. Among them is support for having grouping and aggregation operations executed on the remote server itself (“push down”) rather than recovering all rows and having them processed locally.

What is missing in PostgreSQL implementation?

There is, however, still room for improvement. In terms of remote execution, reports from the community indicate not all queries are performing as they should. For example, in some cases the PostgreSQL planner is not performing a full push-down, resulting in shards transferring more data than required. Parallel scheduling of queries that touch multiple shards is not yet implemented: for now, the execution is taking place sequentially, one shard at a time, which takes longer to complete. When it comes to the maintenance of partitioned and sharded environments, changes in the structure of partitions are still complicated and not very practical. For example, when you add a new partition to a partitioned table with an appointed default partition you may need to detach the default partition first if it contains rows that would now fit in the new partition, manually move those to the new partition, and finally re-attach the default partition back in place.

But that is all part of a maturing technology. We’re looking forward to PostgreSQL 12 and what it will bring in the partitioning and sharding fronts.

Image based on photos by Leonardo Quatrocchi from Pexels



Webinar Wed 7/19: MongoDB Sharding

MongoDB shard zones

MongoDB shard zonesPlease join Percona’s Senior Support Engineer, Adamo Tonete as he presents MongoDB Sharding 101 on July 19th, 2018, at 12:30 PM PDT (UTC-7) / 3:30 PM EDT (UTC-4).


This tutorial is a continuation of advanced topics for the DBA. In it, we will share best practices and tips on how to perform the most common activities.

In this tutorial, we are going to cover MongoDB sharding.

Register Now

The post Webinar Wed 7/19: MongoDB Sharding appeared first on Percona Database Performance Blog.


Configuring PMM Monitoring for MongoDB Cluster

In this blog, we will see how to configure Percona Monitoring and Management (PMM) monitoring for a MongoDB cluster. It’s very simple, like adding a replica set or standalone instances to PMM Monitoring. 

For this example, I have used docker to create PMM Server and MongoDB sharded cluster containers. If you want the steps I used to create the MongoDB® cluster environment using docker, I have shared them at the end of this blog. You can refer to this if you would like to create the same set up.

Configuring PMM Clients

For PMM installations, you can check these links for PMM installation and pmm-client setup. The following are the members of the MongoDB cluster:

config db: (replSet name - mongors1conf)  
Shard1: (replSet name - mongors1)  
Shard1: (replSet name - mongors2)  

In this setup, I installed the pmm-client on mongos1 server. Then I added an agent to monitor MongoDB metrics with cluster option as shown in the next code block. I named the cluster “mongoClusterPMM” (Note: you have to be root user or need sudo access to execute the



root@90262f1360a0:/# pmm-admin add  mongodb --cluster mongoClusterPMM
[linux:metrics]   OK, now monitoring this system.
[mongodb:metrics] OK, now monitoring MongoDB metrics using URI localhost:27017
[mongodb:queries] OK, now monitoring MongoDB queries using URI localhost:27017
[mongodb:queries] It is required for correct operation that profiling of monitored MongoDB databases be enabled.
[mongodb:queries] Note that profiling is not enabled by default because it may reduce the performance of your MongoDB server.
[mongodb:queries] For more information read PMM documentation (
root@90262f1360a0:/# pmm-admin list
pmm-admin 1.11.0
PMM Server      | (password-protected)
Client Name     | 90262f1360a0
Client Address  | 
Service Manager | unix-systemv
---------------- ------------- ----------- -------- ---------------- ------------------------
---------------- ------------- ----------- -------- ---------------- ------------------------
mongodb:queries  90262f1360a0  -           YES      localhost:27017  query_examples=true     
linux:metrics    90262f1360a0  42000       YES      -                                        
mongodb:metrics  90262f1360a0  42003       YES      localhost:27017  cluster=mongoClusterPMM 

As you can see, I used the 

pmm-admin add mongodb [options]

 command which enables monitoring for system, MongoDB metrics and queries. You need to enable profiler to monitor MongoDB queries. Use the next command to enable it at database level:

use db_name

Check this blog to know more about QAN setup and details. If you want to enable only MongoDB metrics, rather than queries to be monitored, then you can use the command 

pmm-admin add  mongodb:metrics [options]

 . After this, go to the PMM homepage (in my case localhost:8080) in your browser and select MongoDB Cluster Summary from the drop down list under the MongoDB option. The below screenshot shows the MongoDB Cluster—“mongoClusterPMM” statistics— collected by the agent that we added in mongos1 server. 

Percona Monitoring and Management for a MongoDB cluster

Did we miss something here? And do you see any metrics in the dashboard above except “Balancer Enabled” and “Chunks Balanced”?

No. This is because, PMM doesn’t have enough data to show in the dashboard. The shards are not added to the cluster yet and as you can see it displays 0 under shards. Let’s add two shard replica sets mongors1 and mongors2 in the mongos1 instance, and enable sharding to the database to complete the cluster setup as follows:

mongos> sh.addShard("mongors1/mongors1n1:27017,mongors1n2:27017,mongors1n3:27017")
{ "shardAdded" : "mongors1", "ok" : 1 }
mongos> sh.addShard("mongors2/mongors2n1:27017,mongors2n2:27017,mongors2n3:27017")
{ "shardAdded" : "mongors2", "ok" : 1 }

Now, I’ll add some data, collection and shard keys, and enable sharding so that we can see some statistics in the dashboard:

use vinodh
db.testColl.insertMany([{id1:1,name:"test insert”},{id1:2,name:"test insert"},{id1:3,name:"insert"},{id1:4,name:"insert"}])
sh.shardCollection("vinodh.testColl", {id1:1})

At last! Now you can see statistics in the graph for the MongoDB cluster:

PMM Statistics for a MongoDB Cluster 

Full Cluster Monitoring

We are not done yet. We have just added an agent to monitor mongos1 instance, and so only the cluster related statistics and QAN are collected through this node. For monitoring all nodes in the MongoDB cluster, we need to configure all nodes in PMM under “mongoClusterPMM” cluster. This will tell PMM that the configured nodes are part of the same cluster. We could also monitor the replica set related metrics for the members in config DBs and shards. Let’s add the monitoring agents in mongos1 server to monitor all MongoDB instances remotely. We’ll use these commands:

pmm-admin add mongodb:metrics --uri "mongodb://mongocfg1:27017/admin?replicaSet=mongors1conf" mongocfg1replSet --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongocfg2:27017/admin?replicaSet=mongors1conf" mongocfg2replSet --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongocfg3:27017/admin?replicaSet=mongors1conf" mongocfg3replSet --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors1n1:27017/admin?replicaSet=mongors1" mongors1replSetn1 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors1n2:27017/admin?replicaSet=mongors1" mongors1replSetn2 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors1n3:27017/admin?replicaSet=mongors1" mongors1replSetn3 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors2n1:27017/admin?replicaSet=mongors2" mongors2replSetn1 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors2n2:27017/admin?replicaSet=mongors2" mongors2replSetn2 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors2n3:27017/admin?replicaSet=mongors2" mongors2replSetn3 --cluster  mongoClusterPMM

Once you have added them, you can check the agent’s (mongodb-exporter) status as follows:

root@90262f1360a0:/# pmm-admin list
pmm-admin 1.11.0
PMM Server      | (password-protected)
Client Name     | 90262f1360a0
Client Address  | 
Service Manager | unix-systemv
---------------- ------------------ ----------- -------- ----------------------- ------------------------
SERVICE TYPE     NAME               LOCAL PORT  RUNNING  DATA SOURCE             OPTIONS                 
---------------- ------------------ ----------- -------- ----------------------- ------------------------
mongodb:queries  90262f1360a0       -           YES      localhost:27017         query_examples=true     
linux:metrics    90262f1360a0       42000       YES      -                                               
mongodb:metrics  90262f1360a0       42003       YES      localhost:27017         cluster=mongoClusterPMM 
mongodb:metrics  mongocfg1replSet   42004       YES      mongocfg1:27017/admin   cluster=mongoClusterPMM 
mongodb:metrics  mongocfg2replSet   42005       YES      mongocfg2:27017/admin   cluster=mongoClusterPMM 
mongodb:metrics  mongocfg3replSet   42006       YES      mongocfg3:27017/admin   cluster=mongoClusterPMM 
mongodb:metrics  mongors1replSetn1  42007       YES      mongors1n1:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors1replSetn3  42008       YES      mongors1n3:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors2replSetn1  42009       YES      mongors2n1:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors2replSetn2  42010       YES      mongors2n2:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors2replSetn3  42011       YES      mongors2n3:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors1replSetn2  42012       YES      mongors1n2:27017/admin  cluster=mongoClusterPMM

So now you can monitor every member of the MongoDB cluster including their replica set and shard statistics. The next screenshot shows one of the members from the replica set under MongoDB replica set dashboard. You can select this from dashboard in this way: Cluster: mongoClusterPMM ? Replica Set: mongors1 ? Instance: mongors1replSetn2]:

 PMM MongoDB Replica Set Dashboard

MongoDB cluster docker setup

As I said in the beginning of this blog, I’ve provided the steps for the MongoDB cluster setup. Since this is for testing, I have used very simple configuration to setup the cluster environment using docker-compose. Before creating the MongoDB cluster, create the network in docker for the cluster nodes and PMM to connect each other like this:

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker network create mongo-cluster-nw

I used the docker-compose.yaml file to create the docker environment here:

version: '2'
    container_name: mongors1n1
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
      - mongo-cluster
      - 27047:27017
      - "27017"
      TERM: xterm
    container_name: mongors1n2
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
      - mongo-cluster
      - 27048:27017
      - "27017"
      TERM: xterm
    container_name: mongors1n3
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
      - mongo-cluster
      - 27049:27017
      - "27017"
      TERM: xterm
    container_name: mongors2n1
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
      - mongo-cluster
      - 27057:27017
      - "27017"
      TERM: xterm
    container_name: mongors2n2
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
      - mongo-cluster
      - 27058:27017
      - "27017"
      TERM: xterm
    container_name: mongors2n3
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
      - mongo-cluster
      - 27059:27017
      - "27017"
      TERM: xterm
    container_name: mongocfg1
    image: mongo:3.4
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
      - mongo-cluster
      - 27025:27017
      TERM: xterm
      - "27017"
    container_name: mongocfg2
    image: mongo:3.4
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
      - mongo-cluster
      - 27024:27017
      TERM: xterm
      - "27017"
    container_name: mongocfg3
    image: mongo:3.4
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
      - mongo-cluster
      - 27023:27017
      TERM: xterm
      - "27017"
    container_name: mongos1
    image: mongo:3.4
    - mongocfg1
      - mongocfg2
    command: mongos --configdb mongors1conf/mongocfg1:27017,mongocfg2:27017,mongocfg3:27017 --port 27017
      - mongo-cluster
      - 27019:27017
      - "27017"
      name: mongo-cluster-nw

In the docker compose file, above, if you are using docker-compose version >=3.x, then you can use the option for “networks” in the yaml file like this:

    name: mongo-cluster-nw

Now start the containers and configure the cluster quickly as follows:

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker-compose up -d
Creating mongocfg1 ... 
Creating mongos1 ... 
Creating mongors2n3 ... 
Creating mongocfg2 ... 
Creating mongors1n1 ... 
Creating mongocfg3 ... 
Creating mongors2n2 ... 
Creating mongors1n3 ... 
Creating mongors2n1 ... 
Creating mongors1n2 ...

ConfigDB replica set setup

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker exec -it mongocfg1 mongo --quiet
> rs.initiate({  _id: "mongors1conf",configsvr: true,
...   members: [{ _id:0, host: "mongocfg1" },{_id:1, host: "mongocfg2" },{ _id:2, host: "mongocfg3"}]
...   })
{ "ok" : 1 }

Shard1 replica set setup

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker exec -it mongors1n1 mongo --quiet
> rs.initiate( { _id: "mongors1", members:[ {_id: 0, host: "mongors1n1"}, {_id:1, host: "mongors1n2"}, {_id:2, host: "mongors1n3"}] })
{ "ok" : 1 }

Shard2 replica set setup:

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker exec -it mongors2n1 mongo --quiet
> rs.initiate({ _id: "mongors2",  members: [ { _id: 0, host: "mongors2n1"}, { _id:1, host: "mongors2n2"}, { _id:2, host: "mongors2n3"}] })
{ "ok" : 1 }

I hope that this blog helps you to setup a MongoDB cluster and also to configure PMM to monitor it! Have a great day!

The post Configuring PMM Monitoring for MongoDB Cluster appeared first on Percona Database Performance Blog.


Zone Based Sharding in MongoDB

MongoDB shard zones

MongoDB shard zonesIn this blog post, we will discuss about how to use zone based sharding to deploy a sharded MongoDB cluster in a customized manner so that the queries and data will be redirected per geographical groupings. This feature of MongoDB is a part of its Data Center Awareness, that allows queries to be routed to particular MongoDB deployments considering physical locations or configurations of mongod instances.

Before moving on, let’s have an overview of this feature. You might already have some questions about zone based sharding. Was it recently introduced? If zone-based sharding is something we should use, then what about tag-aware sharding?

MongoDB supported tag-aware sharding from even the initial versions of MongoDB. This means tagging a range of shard keys values, associating that range with a shard, and redirecting operations to that specific tagged shard. This tag-aware sharding, since version 3.4, is referred to as ZONES. So, the only change is its name, and this is the reason sh.addShardTag(shard, tag) method is being used.

How it works

  1. With the help of a shard key, MongoDB allows you to create zones of sharded data – also known as shard zones.
  2. Each zone can be associated with one or more shards.
  3. Similarly, a shard can associate with any number of non-conflicting zones.
  4. MongoDB migrates chunks to the zone range in the selected shards.
  5. MongoDB routes read and write to a particular zone range that resides in particular shards.

Useful for what kind of deployments/applications?

  1. In cases where data needs to be routed to a particular shard due to some hardware configuration restrictions.
  2. Zones can be useful if there is the need to isolate specific data to a particular shard. For example, in the case of GDPR compliance that requires businesses to protect data and privacy for an individual within the EU.
  3. If an application is being used geographically and you want a query to route to the nearest shards for both reads and writes.

Let’s consider a Scenario

Consider the scenario of a school where students are experts in Biology, but most students are experts in Maths. So we have more data for the maths students compare to Biology students. In this example, deployment requires that Maths students data should route to the shard with the better configuration for a large amount of data. Both read and write will be served by specific shards.  All the Biology students will be served by another shard. To implement this, we will add a tag to deploy the zones to the shards.

For this scenario we have an environment with:

DB: “school”

Collection: “students”

Fields: “sId”, “subject”, “marks” and so on..

Indexed Fields: “subject” and “sId”

We enable sharding:


And create a shardkey: “subject” and “sId” 

sh.shardCollection("school.students", {subject: 1, sId: 1});

We have two shards in our test environment


{  "_id" : "shard0000",  "host" : "",  "state" : 1 }
{  "_id" : "shard0001",  "host" : "",  "state" : 1 }

Zone Deployment

1) Disable balancer

To prevent migration of the chunks across the cluster, disable the balancer for the “students” collection:

mongos> sh.disableBalancing("school.students")
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

Before proceeding further make sure the balancer is not running. It is not a mandatory process, but it is always a good practice to make sure no migration of chunks takes place while configuring zones

mongos> sh.isBalancerRunning()

2) Add shard to the zone

A zone can be associated with a particular shard in the form of a tag, using the sh.addShardTag(), so a tag will be added to each shard. Here we are considering two zones so the tags “MATHS” and “BIOLOGY” need to be added.

mongos> sh.addShardTag( "shard0000" , "MATHS");
{ "ok" : 1 }
mongos> sh.addShardTag( "shard0001" , "BIOLOGY");
{ "ok" : 1 }

We can see zones are assigned in the form of tags as required against each shard.

mongos> sh.status()
        {  "_id" : "shard0000",  "host" : "",  "state" : 1,  "tags" : [ "MATHS" ] }
        {  "_id" : "shard0001",  "host" : "",  "state" : 1,  "tags" : [ "BIOLOGY" ] }

3) Define ranges for each zone

Each zone covers one or more ranges of shard key values. Note: each range a zone covers is always inclusive of its lower boundary and exclusive of its upper boundary.

mongos> sh.addTagRange(
	{ "subject" : "maths", "sId" : MinKey},
	{ "subject" : "maths", "sId" : MaxKey},
{ "ok" : 1 }
mongos> sh.addTagRange(
	{ "subject" : "biology", "sId" : MinKey},
	{ "subject" : "biology", "sId" : MaxKey},
{ "ok" : 1 }

4) Enable balancer

Now enable the balancer so the chunks will migrate across the shards as per the requirement and all the read and write queries will be routed to the particular shards.

mongos> sh.enableBalancing("school.students")
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
mongos> sh.isBalancerRunning()

Let’s check how documents get routed as per the tags:

We have inserted 6 documents, 4 documents with “subject”:”maths” and 2 documents with “subject”:”biology”

mongos> db.students.find({"subject":"maths"}).count()
mongos> db.students.find({"subject":"biology"}).count()

Checking the shard distribution for the students collection:

mongos> db.students.getShardDistribution()
Shard shard0000 at
data : 236B docs : 4 chunks : 4
estimated data per chunk : 59B
estimated docs per chunk : 1
Shard shard0001 at
data : 122B docs : 2 chunks : 1
estimated data per chunk : 122B
estimated docs per chunk : 2

So in this test case, all the queries for the students collection have routed as per the tag used, with four documents inserted into shard0000 and two documents inserted to shard0001.

Any queries related to MATHS will route to shard0000 and queries related to BIOLOGY will route to shard0001, hence the load will be distributed as per the configuration of the shard, keeping the database performance optimized.

Sharding MongoDB using zones is a great feature provided by MongoDB. With the help of zones, data can be isolated to the specific shards. Or if we have any kind of hardware or configuration restrictions to the shards, it is a possible solution for routing the operations.

The post Zone Based Sharding in MongoDB appeared first on Percona Database Performance Blog.


Common MongoDB Topologies

MongoDB Topologies

In this blog post, we’ll look at some common MongoDB topologies used in database deployments.

The question of the best architecture for MongoDB will arise in your conversations between developers and architects. In this blog, we wanted to go over the main sharded and unsharded designs, with their pros and cons.

We will first look at “Replica Sets.” Replica sets are the most basic form of high availability (HA) in MongoDB, and the building blocks for sharding. From there, we will cover sharding approaches and if you need to go that route.

Replica Set

Replica Set

From the MongoDB manual:

replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments.

Short of sharding, this is the ideal way to run MongoDB. Things like high availability, failover and recovery become automated with no action typically needed. If you expect large growth or more than 200G of data, you should consider using this plus sharding to reduce your mean time to recovery on a restore from backup.


  • Elections happen automatically and unnoticed by application setup with retry
  • Rebuilding a new node, or adding an additional read-only node, is as easy as  “rs.add(‘hostname’)”
  • Can skip building indexes to improve write speed
  • Can have members
    • hidden in other geographic location
    • delayed replication
    • analytics nodes via taggings


  • Depending on the size of the oplog used, you can use  10-100+% more space to hold to change data for replication
  • You must scale up not out meaning more expensive hardware
  • Recovery using a sharded approach is faster than having is all on a single node ( parallelism)

Flat Mongos (not load balanced)

Flat Mongos

This is one of MongoDB’s more suggested deployment designs. To understand why, we should talk about the driver and the fact that it supports a CSV list of mongos hosts for fail-over.

You can’t distribute writes in a single replica set. Instead, they all need to go to the primary node. You can distribute reads to the secondaries using Read Preferences. The driver keeps track of what is a primary and what is a secondary and routes queries appropriately.

Conceptually, the driver should have connections bucketed into the mongos they go to. This allowed the 3.0+ driver to be semi-stateless and ask any connection to a specific mongos to preform a getMore to that mongos. In theory, this allows slightly more concurrency. Realistically you only use one mongos, since this is only a fail-over system.


  • Mongos is on its own gear, so it will not run the application out of memory
  • If Mongos doesn’t respond, the driver “fails-over” to the next in the list
  • Can be put closer to the database or application depending on your network and sorting needs


  • You can’t use mongos in a list evenly, so it is only good for fail-over (not evenness) in most drivers. Please read specific drivers for support, and test thoroughly.

Load Balanced (preferred if possible)

Load Balanced

According to the Mongo docs:

You may also deploy a group of mongos instances and use a proxy/load balancer between the application and the mongos. In these deployments, you must configure the load balancer for client affinity so that every connection from a single client reaches the same mongos.

This is the model used by platforms such as ObjectRocket. In this pattern, you move mongos nodes to their own tier but then put them behind a load-balancer. In this design, you can even out the use of mongos by using a least-connection system. The challenge, however, is new drivers have issues with getMores. By this we mean the getMore selects a new random connection, and the load balancer can’t be sure which mongos should get it. Thus it has a one in N (number of mongos) chance of selecting the right one, or getting a “Cursor Not Found” error.


  • Ability to have an even use of mongos
  • Mongos are separated from each other and the applications to prevent memory and CPU contention
  • You can easily remove or add mongos to help scale the layer without code changes
  • High availability at every level (multiple mongos, multiple configs, ReplSet for high availability and even multiple applications for app failures)


  • If batching is used, unless switched to an IP pinning algorithm (which loses evenness) you can get “Cursor Not Found” errors due to the wrong mongos getting getMore and bulk connector connections

App-Centric Mongos

Appcentric Mongos

By and large, this is one of the most typical deployment designs for MongoDB sharding. In it, we have each application host talking to a mongos on the local network interface. This ensures there is very little latency to the application from the mongos.

Additionally, this means if a mongos fails, at most its own host is affected instead of the wider range of all application hosts.


  • Local mongos on the loopback interface mean low to no latency
  • Limited scope of outage if this mongos fails
  • Can be geographically farther from the data storage in cases where you have a DR site


  • Mongos is a memory hog; you could steal from your application memory to support running it here
    • Made worse with large batches, many connections, and sorting
  • Mongos is single-threaded and could become a bottleneck for your application
  • It is possible for a slow network to cause bad decision making, including duplicate databases on different shards. The functional result is data writing intermittently to two locations, and a DBA must remediate that at some point (think MMM VIP ping pong issues)
  • All sorting and limits are applied on the application host. In cases where the sort uses an index this is OK, but if not indexed the entire result set must be held in memory by mongos and then sorted, then returned the limited number of results to the client. This is the typical cause of mongos OOM’s errors due to the memory issues listed before.


The topologies are above cover many of the deployment needs for MongoDB environments. Hope this helps, and list any questions in the comments below.


Webinar Wednesday, October 18, 2017: How to Scale with MongoDB

Scale with MongoDB

Scale with MongoDBJoin Percona’s Senior Technical Services Engineer Adamo Tonete as he presents How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

In this webinar, we will talk about how to scale with MongoDB, up to thousands of writes and reads per second. What are the common issues when you scale with MongoDB? Is it better to shard or to add further secondaries?

We will walk through many common scaling situations, and through the steps needed to deploy a sharded cluster: from a single instance to a sharded environment. We will also talk about common mistakes/pitfalls a company can make when scaling its database – and how to avoid such situations.

Register for the webinar.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL database administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24×7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three has moved to NoSQL technologies without giving up relational databases. He likes to play video games and study everything that is related to engines. Adamo lives with his wife in São Paulo, Brazil.


When Should I Enable MongoDB Sharding?

MongoDB Sharding

MongoDB ShardingIn this blog post, we will talk about MongoDB sharding and walk through the main reasons why you should start a cluster (independent of the approach you have chosen).

Note: I will cover this subject in my webinar How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

Sharding is the most complex architecture you can deploy using MongoDB, and there are two main approaches as to when to shard or not. The first is to configure the cluster as soon as possible – when you predict high throughput and fast data growth.

The second says you should use a cluster as the best alternative when the application demands more resources than the replica set can offer (such as low memory, an overloaded disk or high processor load). This approach is more corrective than preventative, but we will discuss that in the future.

1) Disaster recovery plan

Disaster recovery (DR) is a very delicate topic: how long would you tolerate an outage? If necessary, how long would it take you to restore the entire database? Depending on the database size and on disk speed, a backup/restore process might take hours or even days!
There is no hard number in Gigabytes to justify a cluster. But in general, you should engage when the database is more than 200GB the backup and restore processes might take a while to finish.
Let’s consider the case where we have a replica set with a 300GB database. The full restore process might last around four hours, whereas if the database has two shards, it will take about two hours – and depending on the number of shards we can improve that time. Simple math: if there are two shards, the restore process takes half of the time to restore when compared to a single replica set.

2) Hardware limitations

Disk and memory are inexpensive nowadays. However, this is not true when companies need to scale out to high numbers (such as TB of RAM). Suppose your cloud provider can only offer you up to 5,000 IOPS in the disk subsystem, but the application needs more than that to work correctly. To work around this performance limitation, it is better to start a cluster and divide the writes among instances. That said, if there are two shards the application will have 10000 IOPS available to use for writes and reads in the disk subsystem.

3) Storage engine limitations

There are a few storage engine limitations that can be a bottleneck in your use case. MMAPv2 does have a lock per collection, while WiredTiger has tickets that will limit the number of writes and reads happening concurrently. Although we can tweak the number of tickets available in WiredTiger, there is a virtual limit – which means that changing the available tickets might generate processor overload instead of increasing performance. If one of these situations becomes a bottleneck in your system, you start a cluster. Once you shard the collection, you distribute the load/lock among the different instances.

4) Hot data vs. cold data

Several databases only work with a small percentage of the data being stored. This is called hot data or working set. Cold data or historical data is rarely read, and demands considerable system resources when it is. So why spend money on expensive machines that only store cold data or low-value data? With a cluster deployment we can choose where the cold data is stored, and use cheap devices and disks to do so. The same is true for hot data – we can use better machines to have better performance. This methodology also speeds up writes and reads on the hot data, as the indexes are smaller and add less overhead to the system.

5) Geo-distributed data

It doesn’t matter whether this need comes from application design or legal compliance. If the data must stay within continent or country borders, a cluster helps make that happen. It is possible to limit data localization so that it is stored solely in a specific “part of the world.” The number of shards and their geographic positions is not essential for the application, as it only views the database. This is commonly used in worldwide companies for better performance, or simply to comply with the local law.

6) Infrastructure limitations

Infrastructure and hardware limitations are very similar. When thinking about infrastructure, however, we focus on specific cases when the instances should be small. An example is running MongoDB on Mesos. Some providers only offer a few cores and a limited amount of RAM. Even if you are willing to pay more for that, it is not possible to purchase more than they offer as their products. A cluster provides the option to split a small amount of data among a lot of shards, reaching the same performance a big and expensive machine provides.

7) Failure isolation

Consider that a replica set or a single instance holds all the data. If for any reason this instance/replica set goes down, the whole application goes down. In a cluster, if we lose one of the five shards, 80% of the data is still available. Running a few shards helps to isolate failures. Obviously, running a bunch of instances makes the cluster prone to have a failed instance, but as each shard must have at least three instances the probability of the entire shard being down is minimal. For providers that offer different zones, it is good practice to have different members of the shard in different availability zones (or even different regions).

8) Speed up queries

Queries can take too long, depending on the number of reads they perform. In a clustered deployment, queries can run in parallel and speed up the query response time. If a query runs in ten seconds in a replica set, it is very likely that the same query will run in five to six seconds if the cluster has two shards, and so on.

I hope this helps with MongoDB sharding. Having a cluster solves several other problems as well, and we have listed only a few of them. Don’t miss our webinar regarding scaling out MongoDB next Wednesday, October 18, 2017!


Percona Live Featured Tutorial with Derek Downey, David Turner and René Cannaò — ProxySQL Tutorial

Percona Live Featured Tutorial

Welcome to another post in the series of Percona Live featured tutorial speakers blogs! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Derek Downey (OSDB Practice Advocate, Pythian), David Turner (Storage SRE, Uber) and René Cannaò (MySQL SRE, Dropbox / ProxySQL). Their session is ProxySQL Tutorial. There is a stigma attached to database proxies when it comes to MySQL. This tutorial hopes to blow away that stigma by showing you what can be done with a proxy designed from the ground up to perform. I had a chance to speak with Derek, David and René and learn a bit more about ProxySQL:

Percona: How did you get into database technology? What do you love about it?

Percona Live Featured Tutorial
Derek Downey

Derek: I took a relational database course in college based on Oracle. Set theory and the relational model made a lot of sense to me. After a few years as a web developer at a small company, I transitioned to a hybrid SysAdmin/DBA role and got my first taste of the potential of “the cloud” (and some of the drawbacks).

I really came to understand that data is the lifeblood of any organization, and making sure it is always available through any disaster – from human error to hurricanes – is a unique and exciting challenge.

You should never notice the DBA if they’re doing their job right. There’s not much praise for a DBA on a job well done. But it’s a vital position to keep a company running. And that suits me just fine.

David: I started working for the Advanced Projects Group at the University of Missouri, now known as MOREnet. They were largely responsible for connecting all of the libraries and schools in the state to the Internet. I was initially helping them with their Internet presence as a webmaster. Later they needed help with their databases. I got very excited about working with Oracle at the time, and decided to join that team.

My relationship with MySQL started primarily because the cost of sharding Oracle was so high. Additionally, MySQL’s replication allowed us to use slaves. Oracle’s Dataguard/standby options wouldn’t allow reads from the slaves at that time. Lastly, MySQL was sort of “wild west” fun, since it lacked so many other features that Oracle had long ago. You had to get creative. It has been humbling to see how much innovation has come from the community and how far MySQL has come. And this is only the beginning!

Percona Live Featured Tutorial
René Cannaò

René: My career followed the classic path of a system administrator that ends up becoming a DBA. I used to work for a few companies as webmaster,  and finally as SysAdmin for a web hosting company. I always saw a similar pattern: “the bottleneck is in the database.” Nobody ever knew why the database was the bottleneck. I volunteered to improve the performance of this “unknown system.” Learning was a fun experience, and the result was extremely rewarding. I love understanding how databases operate and their internals. This is the only way to be able to get the maximum performance: “scientia potentia est”!

Percona: Your tutorial is called “ProxySQL Tutorial.” What exactly is ProxySQL, and what does it do?

Derek: I’ll leave it to René, the creator of ProxySQL, to give more detail on exactly what it is. But for a DBA trying to ensure their data is always available, it is a new and exciting tool in our toolbox.

René: ProxySQL is the MySQL data gateway. It’s the Stargate that can inspect, control, transform, manage, and route all traffic between clients and database servers. It builds reliable and fault-tolerant networks. It is a software bridge that empowers MySQL DBAs, built by DBAs for DBAs, allowing them to control all MySQL traffic where previously such traffic could not be controlled either on the client side (normally the developers’ realm) or server side (where there are not enough tools).

Percona Live Featured Tutorial
David Turner

David: Architecturally, ProxySQL is a separate process between the client and the database. Because traffic passes through it, ProxySQL can become many things (three of which got my attention). It can be a multiplexer, a filter, and a replicator.

Multiplexers reduce many signals down to a few. Web servers often open many static connections to MySQL. Since MySQL can only support a limited number of connections before performance suffers, ProxySQL’s ability to transparently manage tens of thousands of connections while only opening a few to the database is a great feature.

Administrators can update ProxySQL to filter and even rewrite queries based on patterns they decide on. As someone that has worked in operations and seen how long it can take to disable misbehaving applications, this is a very compelling feature. With ProxySQL in place, I can completely block a query from the database in no time.

ProxySQL’s replication or mirroring capability means that all of the queries sent to one database can now be sent to N databases. As someone that has to roll out new versions of MySQL, test index changes, and benchmark hardware this is also a compelling feature.

Percona: What are the advantages of using ProxySQL in a database environment?

René: ProxySQL is the bridge between the clients and the servers. It creates two layers, and controls all the communication between the two. Sitting in the middle, ProxySQL provides a lot of advantages normally impossible to achieve in a standard database environment, such as throttling or blocking queries, rewriting queries, implementing sharding, read/write splitting, caching, duplicating traffic, handling backend failures, failovers, integration with HA solutions, generating real-time statistics, etc. All this, without any application change, and completely transparent to the application.

Derek: For me, ProxySQL decouples the application from the data layer. This provides more control over the backend database environment to the DBA in regards to queries, maintenance and failovers, without impact to the application.

David: In addition to the roles noted above, ProxySQL can be part of a failover solution, routing queries to a new master when the current master fails. Other advantages are splitting queries over multiple databases to distribute the load, provide additional metrics, etc.

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Derek: This tutorial highlights what ProxySQL is trying to achieve, and discusses how to add ProxySQL to common architectures environments. Attendees will get hands-on experience with the technology, and learn how to install and configure ProxySQL to achieve query rewriting, seamless failover, and query mirroring.

David: Not only hands-on experience with ProxySQL, but an understanding of how much they can leverage with it. The more I use ProxySQL, the more advantages I see to it. For example, I did not realize that by clustering ProxySQL processes I can distribute the query matching and rewrites over many hosts, as well as use them as a caching layer.

René: ProxySQL is built upon very innovative technologies. Certain architectural concepts like hostgroups, chaining of query rules, granular routing and sharding, query retries and the very powerful Admin interface are concepts not always intuitive for DBAs with experience using other proxies. This tutorial helps understand these concepts, and attendees get hand-on experience in the configuration of ProxySQL in various scenarios.

Percona: What are you most looking forward to at Percona Live?

David: First, the people. Next, everything everyone is working on. We’re really lucky to work in such an innovative and collaborative industry. As databases evolve, we are on the ground floor of their evolution. What an exciting place to be.

Derek: I am mostly looking forward to reconnecting with my peers in the MySQL community. Both ones I’ve formerly worked with or previously met at Percona Live, as well as meeting new open source database professionals and hearing how they are providing solutions for their companies.

René: I am looking forward to attending sessions regarding new features in MySQL 8 and other new technologies. But moreover, I am excited to interact with MySQL users and get more input on how to improve ProxySQL so that it can become an indispensable tool in any MySQL environment.

Register for Percona Live Data Performance Conference 2017, and see Derek, David and René present their ProxySQL Tutorial. Use the code FeaturedTutorial and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.


MySQL Sharding Models for SaaS Applications

MySQL Sharding Models

MySQL Sharding ModelsIn this blog post, I’ll discuss MySQL sharding models, and how they apply to SaaS application environments.

MySQL is one of the most popular database technologies used to build many modern SaaS applications, ranging from simple productivity tools to business-critical applications for the financial and healthcare industries.

Pretty much any large scale SaaS application powered by MySQL uses sharding to scale. In this blog post, we will discuss sharding choices as they apply to these kinds of applications.

In MySQL, unlike in some more modern technologies such as MongoDB, there is no standard sharding implementation that the vast majority of applications use. In fact, if anything “no standard” is the standard. The common practice is to roll your own sharding framework, as famous MySQL deployments such as Facebook and Twitter have done. MySQL Cluster – the MySQL software that has built-in Automatic Sharding functionality – is rarely deployed (for a variety of reasons). MySQL Fabric, which has been the official sharding framework, has no traction either.

When sharding today, you have a choice of rolling your own system from scratch, using comprehensive sharding platform such as Vitess or using a proxy solution to assist you with sharding. For proxy solutions, MySQL Router is the official solution. But in reality, third party solutions such as open source ProxySQL, commercial ScaleArc and semi-commercial (BSL)  MariaDB MaxScale are widely used. Keep in mind, however, that traffic routing is only one of the problems that exist in large scale sharding implementations.

Beneath all these “front end” choices for sharding on the application database connection framework or database proxy, there are some lower level decisions that you’ve got to make. Namely, around how your data is going to be led out and organized on the MySQL nodes.

When it comes to SaaS applications, at least one answer is simple. It typically makes sense to shard your data by “customer” or “organization” using some sort of mapping tables. In the vast majority of cases, single node (or replicated cluster) should be powerful enough to handle all the data and load coming from each customer.

What Should I Ask Myself Now?

The next set questions you should ask yourself are around your SaaS applications:

  • How much revenue per customer are you generating?
  • Do your customers (or regulations) require data segregation?
  • Are all the customers about the same, or are there outliers?
  • Are all your customers running the same database schema?

I address the answers in the sections below.

How Much Revenue?

How much revenue per customer you’re generating is an important number. It defines how much infrastructure costs per customer you can afford. In the case of “freemium” models, and customers generating less than $1 a month an average, you might need to ensure low overhead per customer (even if you have to compromise on customer isolation).

How much revenue per customer you’re generating is an important number. It defines how much infrastructure costs per customer you can afford. In the case of “freemium” models, and customers generating less than $1 a month an average, you might need to ensure low overhead per customer (even if you have to compromise on customer isolation).

Typically with low revenue customers, you have to co-locate the data inside the same MySQL instance (potentially even same tables). In the case of high revenue customers, isolation in separate MySQL instances (or even containers or virtualized OS instances) might be possible.

Data Segregation?

Isolation is another important area of consideration. Some enterprise customers might require that their data is physically separate from others. There could also be government regulations in play that require customer data to be stored in a specific physical location. If this is the case, you’re looking at completely dedicated customer environments. Or at the very least, separate database instances (which come with additional costs).

Customer Types?

Customer size and requirements are also important. A system designed to handle all customers of approximately the same scale (for example, personal accounting) is going to be different than if you are in the business of blog hosting. Some blogs might be 10,000 times more popular than the average.

Same Database Schema?

Finally, there is a there is the big question of whether all your customers are running the same database schema and same software version. If you want to support different software versions (if your customers require a negotiated maintenance window for software upgrades, for example) or different database schemas (if the schema is dependent on the custom functionality and modules customers might use, for example), keeping such customers in different MySQL schemas make sense.

Sharding Models

This gets us to the following sharding isolation models, ranging from lowest to highest:

  • Customers Share Schemas. This is the best choice when you have very large numbers of low-revenue customers. In this case, you would map multiple customers to the same set of tables, and include something like a customer_id field in them to filter customer data. This approach minimizes customer overhead and reduces customer isolation. It’s harder to backup/restore data for individual customers, and it is easier to introduce coding mistakes that can access other customers data. This method does not mean there is only one schema, but that there is a one-to-many relationship between schemas and customers.  For example, you might have 100 schema’s per MySQL instance, each handling 1000 to 10000 customers (depending on the application). Note that with a well-designed sharding implementation, you should be able to map customers individually to schemas. This allows you to have key customer data stored in dedicated schemas, or even on dedicated nodes.
  • Schema per Customer. This is probably the most common sharding approach in MySQL powered SaaS applications. Especially ones that have substantial revenue ($10+ per month / per customer). In this model, each customer’s data is stored in its own schema (database). This makes it very easy to backup/restore individual customers. It allows customers to have different schemas (i.e., add custom tables). It also allows them to run different versions of the application if desired. This approach allows the application server to use different MySQL users connecting on behalf of different customers, which adds an extra level of protection from accidental (or intentional) access of data that belongs to different customers. The schema per customer approach also makes it easier to move the shards around, and limits maintenance impact. The downside of this approach is higher overhead. It also results in a large number of tables per instance, and potentially larger numbers of files (which can be hard to manage).
  • Database Instance per Customer. You achieve even better isolation by having a MySQL instance per customer. This approach, however, increases overhead even further. The recent rise of light virtualization technologies and containers has reduced its usage.
  • OS Instance/Container per Customer. This approach allows you to improve isolation even further. It can be used for any customer, but can also be applied to selected customers in a model that uses Schema per Customer model for a majority of them.  Dedicated OS Instance, with improved isolation and better performance SLAs, might be a feature of some premium customer tiers. This method not only allows better isolation, but it also let’s you handle outliers better. You might chose to run a majority of your customers on the hardware (or cloud instance) that has best price/performance numbers, and also place some of the larger customers on the highest performance nodes.
  • Environment per customer. Finally, if you take this all the way you can build completely separate environments for customers. This includes databases, application servers and other required components. This is especially useful if you need to deploy the application close to the customer – which includes the appliance model, or deployment in the customer’s data center or cloud provider. This also allows you to accommodate customers if their data must be stored in a specific location. This is often due to government regulations. It is worth noting that many SaaS applications, even if they do not quite have one environment per customer, have multiple independent environments. These are often hosted in different locations or availability zones. Such setups allow you to reduce the impact of large-scale failures to only a portion of your customers. This avoids overloading your customer service group and allowing the operational organization to focus on repairing smaller environments.

The farther you go down this route – from the shared schema to an environment per customer – the more important is to have a high level of automation. With a shared schema, you often can get by with little automation (and some environments manually set up) and all the schema’s pre-created. If customer sign up requires setting up dedicated database instance or the whole environment, manual implementation doesn’t scale. For this type of setup, you need state-of-the-art automation and orchestration.


I hope this helps you to understand your options for MySQL sharding models. Each of the different sharding models for SaaS applications powered by MySQL have benefits and drawbacks. As you can see, many of these approaches require you to work with a large number of tables in the MySQL – this will be the topic of one of my next posts!


MongoDB PIT Backups: Part 2

MongoDB PIT Backups

This blog post is the second in a series covering MongoDB PIT backups. You can find the first part here.

Sharding Makes Everything Fun(ner)

The first blog post in this series looked at MongoDB backups in a simple single-replica set environment. In this post, we’ll look at the scale-out use case. When sharding, we have exactly the same problem as we do on a single replica set. However, now the problem is multiplied by the number of replica sets in the cluster. Additionally, we have a bonus problem: each replica set has unique data. That means to get a truly consistent snapshot of the cluster, we need to orchestrate our backups to capture a single consistent point in time. Just so we’re on the same page, that means that every replica set needs to stop their backups at, or near, the same time that the slowest replica set stops. Are you sufficiently confused now? Let me get to a basic concept that I forgot to cover in the first post, and then I’ll give you a simple description of the problem.

Are you Write Concerned?

So far, I’ve neglected to talk about the very important role of “write concern” when taking consistent backups. In MongoDB, the database is not durable by default. By “durable,” I mean “on disk” when the database acknowledges receipt of an operation from your application. There are most likely several reasons for this. Most likely the biggest one originally was probably throughput given a lack of concurrency.

However, the side effect is possible data loss due to loss of operations applied only in memory. Changing the write concern to “journaled” (

j : true

) will change this behavior so that MongoDB journals changes before acknowledging them (you also need to be running with journal enabled).

TIP: For true durability in a replica set, you should use a write concern of “majority” for operations and the writeConcernMajorityJournalDefault : true on all replica set members (new to v3.4). This has the added benefit of greatly decreasing the chance of rollback after an election.

Wow, you’re inconsistent

At the risk of being repetitive, the crux of this issue is that we need to run a backup on every shard (replica set). This is necessary because every shard has a different piece of the data set. Each piece of that data set is necessary to get an entire picture of the data set for the cluster (and thus, your application). Since we’re using mongodump, we’ll only have a consistent snapshot at the point in time when the backup completes. This means we must end each shard’s backup at a consistent point in time. We cannot expect that the backup will complete in exactly the same amount of time on every shard, which is what we’ll need for a consistent point in time across the cluster. This means that Shard1 might have a backup that is consistent to 12:05 PM, and another shard that is consistent to 12:06 PM. In a high traffic environment (the kind that’s likely to need horizontal scale), this could mean thousands of lost documents. Here’s a diagram:

MongoDB PIT Backups
MongoDB PIT Backups


Here’s the math to illustrate the problem:

  • Shard1’s backup will contain 30,000 documents ((100 docs * 60 secs) * 5 mins)
  • Shard2’s backup will contain 36,000 documents ((100 docs * 60 secs) * 6 mins)

In this example, to get a consistent point in time you’d need to remove all insert, update and delete operations that happened on Shard 2 from the time that Shard 1’s backup completed (6,000 documents). This means examining the timestamp of every operation in the oplog and reversing it’s operation. That’s a very intensive process, and will be unique for every mongodump that’s executed. Furthermore, this is a pretty tricky thing to do. The repeatable and much more efficient method is to have backups that finish in a consistent state, ready to restore when needed.

Luckily, Percona has you covered!

You’re getting more consistent

Having data is important, but knowing what data you have is even more important. Here’s how you can be sure you know what you have in your MongoDB backups:

David Murphy has released his MongoDB Consistent Backup Tool in the Percona Labs github account, and has written a very informative blog post about it. My goal with these blog posts is to make it even easier to understand the problem and how to solve it. We’ve already had an exhaustive discussion about the problem on both small and large scales. How about the solution?

It’s actually pretty simple. The solution, at a basic level, is to use a simple algorithm to decide when a cluster-wide consistent point-in-time can be reached. In the MongoDB Consistent Backup tool, this is done by the backup host kicking off backups on a “known good member” of each shard (that’s a pretty cool feature by itself) and then tracking the progress of each dump. At the same time the backup is kicked off, the backup host kicks off a separate thread that tails the oplog on each “known good member” until the mongodump on the slowest shard completes. By using this method, we have a very simple way of deciding when we can get a cluster-wide consistent snapshot. In other words, when the slowest member completes their piece of the workload. Here’s the same workload from Figure 4, but with the MongoDB Consistent Backup Tool methodology:

MongoDB PIT Backups
MongoDB PIT Backups


TIP: The amount of time that it takes to perform these backups is often decided by two factors:

  1. How evenly distributed the data is across the shards (balanced)
  2. How much data each shard contains (whether or not it’s balanced).

The takeaway here is that you may need to shard so that each shard has a manageable volume of data. This allows you to hit your backup/restore windows more easily.

…The Proverbial “Monkey Wrench”

There’s always a “gotcha” just when you think you’ve got your mind around any difficult concept. Of course, this is no different.

There is one very critical concept in sharding that we didn’t cover: tracking what data lies on which shard. This is important for routing the workload to the right place, and balancing the data across the shards. In MongoDB, this is completed by the config servers. If you cannot reach (or recover) your config servers, your entire cluster is lost! For obvious reasons, you need to back them up as well. With the Percona Labs MongoDB Consistent Backup Tool, there are actually two modes used to backup config servers: v3.2 and greater, and legacy. The reason is that in v3.2, config servers went from mirrors to a standard replica set. In v3.2 mode, we just treat the config servers like another replica set. They have their own mongodump and oplog tail thread. They get a backup that is consistent to the same point in time as all other shards in the cluster. If you’re on a version of MongoDB prior to v3.2, and you’re interested in an explanation of legacy mode, please refer back to David’s blog post.

The Wrap Up

We’ve examined the problems with getting consistent backups in a running MongoDB environment in this and the previous blog posts. Whether you have a single replica set or a sharded cluster, you should have a much better understanding of what the problems are and how Percona has you covered. If you’re still confused, or you’d just like to ask some additional questions, drop a comment in the section below. Or shoot me a tweet @jontobs, and I’ll make sure to get back to you.

Powered by WordPress | Theme: Aeros 2.0 by