Streaming MongoDB Backups Directly to S3

If you have ever had to take a quick ad-hoc backup of your MongoDB databases but did not have enough local disk space for it, this blog post may provide some handy tips to save you from headaches.

A common practice is to prepare a backup locally first and only then copy it to the cloud or to a dedicated backup server.

Fortunately, there are ways to skip the local storage entirely and stream MongoDB backups directly to the destination. At the same time, the common goal is to save both the network bandwidth and storage space (cost savings!) while not overloading the CPU capacity on the production database. Therefore, applying on-the-fly compression is essential.

In this article, I will show some simple examples to help you quickly do the job.

Prerequisites for streaming MongoDB backups

You will need an account for one of the providers offering object storage compatible with Amazon S3. I used Wasabi in my tests as it offers very easy registration for a trial and takes just a few minutes to get started if you want to test the service.

The second requirement is a tool for managing the data from the Linux command line. The two most popular ones, s3cmd and the AWS CLI, are sufficient, and I will show examples using both.

Installation and setup will depend on your OS and the S3 provider specifics. Please refer to the documentation below to proceed, as I will not cover the installation details here.

* https://s3tools.org/s3cmd
* https://docs.aws.amazon.com/cli/index.html

Backup tools

Two main tools are provided with the MongoDB packages, mongodump and mongoexport, and both do a logical backup.

Compression tool

We all know gzip or bzip2 are installed by default on almost every Linux distro. However, I find zstd way more efficient, so I’ll use it in the examples.


I believe real-case examples are best if you wish to test something similar, so here they are.

Mongodump & s3cmd – Single database backup

  • Let’s create a bucket dedicated to MongoDB data backups:
$ s3cmd mb s3://mbackups
Bucket 's3://mbackups/' created

  • Now, do a simple dump of one example database using the --archive option, which changes the behavior from storing collection data in separate files on disk to streaming the whole backup to standard output (STDOUT) in a common archive format. At the same time, the stream gets compressed on the fly and sent to the S3 destination.
  • Note that the command below does not create a consistent backup with regard to ongoing writes, as it does not contain the oplog.
$ mongodump --db=db2 --archive| zstd | s3cmd put - s3://mbackups/$(date +%Y-%m-%d.%H-%M)/db2.zst
2023-02-07T19:33:58.138+0100 writing db2.products to archive on stdout
2023-02-07T19:33:58.140+0100 writing db2.people to archive on stdout
2023-02-07T19:33:59.364+0100 done dumping db2.people (50474 documents)
2023-02-07T19:33:59.977+0100 done dumping db2.products (516784 documents)
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-33/db2.zst' [part 1 of -, 15MB] [1 of 1]
15728640 of 15728640 100% in 1s 8.72 MB/s done
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-33/db2.zst' [part 2 of -, 1491KB] [1 of 1]
1527495 of 1527495 100% in 0s 4.63 MB/s done

  • After the backup is done, let’s verify its presence in S3:
$ s3cmd ls -H s3://mbackups/2023-02-07.19-33/
2023-02-07 18:34 16M s3://mbackups/2023-02-07.19-33/db2.zst
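
One caveat with pipelines like the one above: by default the shell reports only the exit status of the last command, so a mongodump that dies halfway can still leave an uploaded object behind. Below is a minimal sketch of a wrapper that surfaces such failures, assuming bash. The function names backup_key and stream_backup are hypothetical and not part of mongodump, zstd, or s3cmd.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of mongodump, zstd, or s3cmd.

# Build the timestamped object key used in the examples above.
# Usage: backup_key BUCKET NAME [TIMESTAMP]
backup_key() {
  local bucket="$1" name="$2"
  local ts="${3:-$(date +%Y-%m-%d.%H-%M)}"
  printf 's3://%s/%s/%s.zst\n' "$bucket" "$ts" "$name"
}

# Stream one database to S3 and fail if ANY stage of the pipeline fails.
# Usage: stream_backup DB BUCKET
stream_backup() {
  local db="$1" bucket="$2" key
  key="$(backup_key "$bucket" "$db")"
  (
    set -o pipefail   # bash: pipeline status is that of the last failing command
    mongodump --db="$db" --archive | zstd | s3cmd put - "$key"
  ) || {
    echo "backup of $db failed; $key may be incomplete" >&2
    return 1
  }
  echo "$key"
}
```

Calling stream_backup db2 mbackups prints the object key on success. On failure, any object already uploaded under that key should be treated as unusable.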

Mongorestore & s3cmd – Database restore directly from S3

The mongorestore command below uses the --archive option as well, which allows us to stream the backup directly into it:

$ s3cmd get --no-progress s3://mbackups/2023-02-07.20-14/db2.zst - |zstd -d | mongorestore --archive --drop
2023-02-08T00:42:41.434+0100 preparing collections to restore from
2023-02-08T00:42:41.480+0100 reading metadata for db2.people from archive on stdin
2023-02-08T00:42:41.480+0100 reading metadata for db2.products from archive on stdin
2023-02-08T00:42:41.481+0100 dropping collection db2.people before restoring
2023-02-08T00:42:41.502+0100 restoring db2.people from archive on stdin
2023-02-08T00:42:42.130+0100 dropping collection db2.products before restoring
2023-02-08T00:42:42.151+0100 restoring db2.products from archive on stdin
2023-02-08T00:42:43.217+0100 db2.people 16.0MB
2023-02-08T00:42:43.217+0100 db2.products 12.1MB
2023-02-08T00:42:43.654+0100 db2.people 18.7MB
2023-02-08T00:42:43.654+0100 finished restoring db2.people (50474 documents, 0 failures)
2023-02-08T00:42:46.218+0100 db2.products 46.3MB
2023-02-08T00:42:48.758+0100 db2.products 76.0MB
2023-02-08T00:42:48.758+0100 finished restoring db2.products (516784 documents, 0 failures)
2023-02-08T00:42:48.758+0100 no indexes to restore for collection db2.products
2023-02-08T00:42:48.758+0100 no indexes to restore for collection db2.people
2023-02-08T00:42:48.758+0100 567258 document(s) restored successfully. 0 document(s) failed to restore.

Mongodump & s3cmd – Full backup

The command below provides a consistent point-in-time snapshot thanks to the --oplog option:

$ mongodump --port 3502 --oplog --archive | zstd | s3cmd put - s3://mbackups/$(date +%Y-%m-%d.%H-%M)/full_dump.zst
2023-02-13T00:05:54.080+0100 writing admin.system.users to archive on stdout
2023-02-13T00:05:54.083+0100 done dumping admin.system.users (1 document)
2023-02-13T00:05:54.084+0100 writing admin.system.version to archive on stdout
2023-02-13T00:05:54.085+0100 done dumping admin.system.version (2 documents)
2023-02-13T00:05:54.087+0100 writing db1.products to archive on stdout
2023-02-13T00:05:54.087+0100 writing db2.products to archive on stdout
2023-02-13T00:05:55.260+0100 done dumping db2.products (284000 documents)
upload: '<stdin>' -> 's3://mbackups/2023-02-13.00-05/full_dump.zst' [part 1 of -, 15MB] [1 of 1]
2023-02-13T00:05:57.068+0100 [####################....] db1.products 435644/516784 (84.3%)
15728640 of 15728640 100% in 1s 9.63 MB/s done
2023-02-13T00:05:57.711+0100 [########################] db1.products 516784/516784 (100.0%)
2023-02-13T00:05:57.722+0100 done dumping db1.products (516784 documents)
2023-02-13T00:05:57.723+0100 writing captured oplog to
2023-02-13T00:05:58.416+0100 dumped 136001 oplog entries
upload: '<stdin>' -> 's3://mbackups/2023-02-13.00-05/full_dump.zst' [part 2 of -, 8MB] [1 of 1]
8433337 of 8433337 100% in 0s 10.80 MB/s done

$ s3cmd ls -H s3://mbackups/2023-02-13.00-05/full_dump.zst
2023-02-12 23:05 23M s3://mbackups/2023-02-13.00-05/full_dump.zst

Mongorestore & s3cmd – Full backup restore

By analogy, mongorestore uses the --oplogReplay option to apply the oplog contained in the archived stream:

$ s3cmd get --no-progress s3://mbackups/2023-02-13.00-05/full_dump.zst - | zstd -d | mongorestore --port 3502 --archive --oplogReplay
2023-02-13T00:07:25.977+0100 preparing collections to restore from
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "db1", skipping...
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "db2", skipping...
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "", skipping...
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "admin", skipping...
2023-02-13T00:07:25.988+0100 reading metadata for db1.products from archive on stdin
2023-02-13T00:07:25.988+0100 reading metadata for db2.products from archive on stdin
2023-02-13T00:07:26.006+0100 restoring db2.products from archive on stdin
2023-02-13T00:07:27.651+0100 db2.products 11.0MB
2023-02-13T00:07:28.429+0100 restoring db1.products from archive on stdin
2023-02-13T00:07:30.651+0100 db2.products 16.0MB
2023-02-13T00:07:30.652+0100 db1.products 14.4MB
2023-02-13T00:07:33.652+0100 db2.products 32.0MB
2023-02-13T00:07:33.652+0100 db1.products 18.0MB
2023-02-13T00:07:36.651+0100 db2.products 37.8MB
2023-02-13T00:07:36.652+0100 db1.products 32.0MB
2023-02-13T00:07:37.168+0100 db2.products 41.5MB
2023-02-13T00:07:37.168+0100 finished restoring db2.products (284000 documents, 0 failures)
2023-02-13T00:07:39.651+0100 db1.products 49.3MB
2023-02-13T00:07:42.651+0100 db1.products 68.8MB
2023-02-13T00:07:43.870+0100 db1.products 76.0MB
2023-02-13T00:07:43.870+0100 finished restoring db1.products (516784 documents, 0 failures)
2023-02-13T00:07:43.871+0100 restoring users from archive on stdin
2023-02-13T00:07:43.913+0100 replaying oplog
2023-02-13T00:07:45.651+0100 oplog 2.14MB
2023-02-13T00:07:48.651+0100 oplog 5.68MB
2023-02-13T00:07:51.651+0100 oplog 9.34MB
2023-02-13T00:07:54.651+0100 oplog 13.0MB
2023-02-13T00:07:57.651+0100 oplog 16.7MB
2023-02-13T00:08:00.651+0100 oplog 19.7MB
2023-02-13T00:08:03.651+0100 oplog 22.7MB
2023-02-13T00:08:06.651+0100 oplog 25.3MB
2023-02-13T00:08:09.651+0100 oplog 28.1MB
2023-02-13T00:08:12.651+0100 oplog 30.8MB
2023-02-13T00:08:15.651+0100 oplog 33.6MB
2023-02-13T00:08:18.651+0100 oplog 36.4MB
2023-02-13T00:08:21.651+0100 oplog 39.1MB
2023-02-13T00:08:24.651+0100 oplog 41.9MB
2023-02-13T00:08:27.651+0100 oplog 44.7MB
2023-02-13T00:08:30.651+0100 oplog 47.5MB
2023-02-13T00:08:33.651+0100 oplog 50.2MB
2023-02-13T00:08:36.651+0100 oplog 53.0MB
2023-02-13T00:08:38.026+0100 applied 136001 oplog entries
2023-02-13T00:08:38.026+0100 oplog 54.2MB
2023-02-13T00:08:38.026+0100 no indexes to restore for collection db1.products
2023-02-13T00:08:38.026+0100 no indexes to restore for collection db2.products
2023-02-13T00:08:38.026+0100 800784 document(s) restored successfully. 0 document(s) failed to restore.

Mongoexport – Export all collections from a given database, compress, and save directly to S3

Another example uses mongoexport to create regular JSON dumps; this is also not a consistent backup if writes are ongoing.

$ ts=$(date +%Y-%m-%d.%H-%M)
$ mydb="db2"
$ mycolls=$(mongo --quiet $mydb --eval "db.getCollectionNames().join('\n')")

$ for i in $mycolls; do mongoexport -d $mydb -c $i |zstd| s3cmd put - s3://mbackups/$ts/$mydb/$i.json.zst; done
2023-02-07T19:30:37.163+0100 connected to: mongodb://localhost/
2023-02-07T19:30:38.164+0100 [#######.................] db2.people 16000/50474 (31.7%)
2023-02-07T19:30:39.164+0100 [######################..] db2.people 48000/50474 (95.1%)
2023-02-07T19:30:39.166+0100 [########################] db2.people 50474/50474 (100.0%)
2023-02-07T19:30:39.166+0100 exported 50474 records
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-30/db2/people.json.zst' [part 1 of -, 4MB] [1 of 1]
4264922 of 4264922 100% in 0s 5.71 MB/s done
2023-02-07T19:30:40.015+0100 connected to: mongodb://localhost/
2023-02-07T19:30:41.016+0100 [##......................] db2.products 48000/516784 (9.3%)
2023-02-07T19:30:42.016+0100 [######..................] db2.products 136000/516784 (26.3%)
2023-02-07T19:30:43.016+0100 [##########..............] db2.products 224000/516784 (43.3%)
2023-02-07T19:30:44.016+0100 [##############..........] db2.products 312000/516784 (60.4%)
2023-02-07T19:30:45.016+0100 [##################......] db2.products 408000/516784 (78.9%)
2023-02-07T19:30:46.016+0100 [#######################.] db2.products 496000/516784 (96.0%)
2023-02-07T19:30:46.202+0100 [########################] db2.products 516784/516784 (100.0%)
2023-02-07T19:30:46.202+0100 exported 516784 records
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-30/db2/products.json.zst' [part 1 of -, 11MB] [1 of 1]
12162655 of 12162655 100% in 1s 10.53 MB/s done

$ s3cmd ls -H s3://mbackups/$ts/$mydb/
2023-02-07 18:30 4M s3://mbackups/2023-02-07.19-30/db2/people.json.zst
2023-02-07 18:30 11M s3://mbackups/2023-02-07.19-30/db2/products.json.zst
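
The for loop above relies on shell word splitting, which works for simple collection names. A slightly more defensive sketch reads one name per line instead. The names export_keys and export_all are hypothetical, and the legacy mongo shell is assumed to be available as in the example above.

```shell
# Hypothetical helpers for illustration only.

# Read collection names (one per line) on stdin and print the S3 key for each.
# Usage: ... | export_keys BUCKET TIMESTAMP DB
export_keys() {
  local bucket="$1" ts="$2" db="$3" coll
  while IFS= read -r coll; do
    [ -n "$coll" ] || continue            # skip empty lines
    printf 's3://%s/%s/%s/%s.json.zst\n' "$bucket" "$ts" "$db" "$coll"
  done
}

# Export every collection of DB, one compressed JSON object per collection.
# Usage: export_all BUCKET DB
export_all() {
  local bucket="$1" db="$2" ts coll key
  ts="$(date +%Y-%m-%d.%H-%M)"
  mongo --quiet "$db" --eval "db.getCollectionNames().join('\n')" |
  while IFS= read -r coll; do
    [ -n "$coll" ] || continue
    key="$(printf '%s\n' "$coll" | export_keys "$bucket" "$ts" "$db")"
    # </dev/null keeps mongoexport from touching the loop's input stream
    mongoexport -d "$db" -c "$coll" < /dev/null | zstd | s3cmd put - "$key"
  done
}
```

With the bucket and database from the example, export_all mbackups db2 would upload one .json.zst object per collection under a single timestamp.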

Mongoimport & s3cmd – Import single collection under a different name

$ s3cmd get --no-progress s3://mbackups/2023-02-08.00-49/db2/people.json.zst - | zstd -d | mongoimport -d db2 -c people_copy
2023-02-08T00:53:48.355+0100 connected to: mongodb://localhost/
2023-02-08T00:53:50.446+0100 50474 document(s) imported successfully. 0 document(s) failed to import.

Mongodump & AWS S3 – Backup database

$ mongodump --db=db2 --archive | zstd | aws s3 cp - s3://mbackups/backup1/db2.zst
2023-02-08T11:34:46.834+0100 writing db2.people to archive on stdout
2023-02-08T11:34:46.837+0100 writing db2.products to archive on stdout
2023-02-08T11:34:47.379+0100 done dumping db2.people (50474 documents)
2023-02-08T11:34:47.911+0100 done dumping db2.products (516784 documents)

$ aws s3 ls --human-readable mbackups/backup1/
2023-02-08 11:34:50 16.5 MiB db2.zst

Mongorestore & AWS S3 – Restore database

$ aws s3 cp s3://mbackups/backup1/db2.zst - | zstd -d | mongorestore --archive --drop
2023-02-08T11:37:08.358+0100 preparing collections to restore from
2023-02-08T11:37:08.364+0100 reading metadata for db2.people from archive on stdin
2023-02-08T11:37:08.364+0100 reading metadata for db2.products from archive on stdin
2023-02-08T11:37:08.365+0100 dropping collection db2.people before restoring
2023-02-08T11:37:08.462+0100 restoring db2.people from archive on stdin
2023-02-08T11:37:09.100+0100 dropping collection db2.products before restoring
2023-02-08T11:37:09.122+0100 restoring db2.products from archive on stdin
2023-02-08T11:37:10.288+0100 db2.people 16.0MB
2023-02-08T11:37:10.288+0100 db2.products 13.8MB
2023-02-08T11:37:10.607+0100 db2.people 18.7MB
2023-02-08T11:37:10.607+0100 finished restoring db2.people (50474 documents, 0 failures)
2023-02-08T11:37:13.288+0100 db2.products 47.8MB
2023-02-08T11:37:15.666+0100 db2.products 76.0MB
2023-02-08T11:37:15.666+0100 finished restoring db2.products (516784 documents, 0 failures)
2023-02-08T11:37:15.666+0100 no indexes to restore for collection db2.products
2023-02-08T11:37:15.666+0100 no indexes to restore for collection db2.people
2023-02-08T11:37:15.666+0100 567258 document(s) restored successfully. 0 document(s) failed to restore.

In the above examples, I used both the mongodump/mongorestore and mongoexport/mongoimport tools to back up and recover MongoDB data directly to and from S3 object storage, streaming and compressing along the way. These methods are simple, fast, and resource-friendly. I hope they will be useful when you are looking for options to use in your backup scripts or ad-hoc backup tasks.

Additional tools

Here, I would like to mention that there are other free and open source backup solutions you may try, including Percona Backup for MongoDB (PBM), which now offers both logical and physical backups.

With the Percona Server for MongoDB variant, you may also stream hot physical backups directly to S3 storage. It is as easy as this:

mongo > db.runCommand({createBackup: 1, s3: {bucket: "mbackups", path: "my_physical_dump1", endpoint: "s3.eu-central-2.wasabisys.com"}})
{ "ok" : 1 }

$ s3cmd du -H s3://mbackups/my_physical_dump1/
138M 26 objects s3://mbackups/my_physical_dump1/

For a sharded cluster, you should use PBM instead to get consistent backups.

Btw, don’t forget to check out the MongoDB best backup practices!


Percona XtraDB Cluster on Amazon EC2 and Two Interesting Changes in PXC 8.0

This article outlines the basic configurations for setting up and deploying Percona XtraDB Cluster 8.0 (PXC) on Amazon EC2, as well as what is new in the setup compared to Percona XtraDB Cluster 5.7.

What is Percona XtraDB Cluster an ideal fit for?

Percona XtraDB Cluster is a cost-effective, high-performance clustering solution for mission-critical data. It combines all the improvements and functionality found in MySQL 8 with Percona Server for MySQL’s enterprise features and Percona’s upgraded Galera library.

A Percona XtraDB Cluster environment is an ideal fit for applications that require five-nines (99.999%) uptime with high read workloads, and for industries such as finance or healthcare that require in-house or externally dedicated database resources.

How is a three-node cluster configured in an EC2 environment?

To describe the setup procedure, I’ll be using Amazon EC2 instances to build the environment; depending on the business requirements, alternative infrastructure may be used to build the cluster.

Amazon EC2 is designed to provide uptime and high availability. When designing an architecture in EC2, it is preferable to place at least one node in a different Availability Zone, so that the loss of an entire AZ does not take the whole cluster and its data with it.

If a node is placed in a different region, the cluster can also survive the loss of an entire region and its data. The nodes and regions should have good network connectivity, because network latency between the two regions affects synchronous replication write latency. Alternatively, an async replica in a different region is an option.

I’m not going into too much depth on Amazon EC2 to keep this blog brief and readable. 

To build the three-node Percona XtraDB Cluster 8.0 environment, we first spin up the following three nodes in EC2. I’m using Amazon Linux, but you can also use Ubuntu or any of the Percona-supported operating systems.

It is advised that a cluster’s nodes all have the same configuration. 

PXCNode1   IP Address:

PXCNode2   IP Address:

PXCNode3  IP Address:

To install the Percona repository on all three nodes, use the following command.

$ sudo yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm

Enable the Percona XtraDB Cluster 8.0 repository on all three nodes by running the following command.

$ sudo percona-release setup pxc-80
* Disabling all Percona Repositories
* Enabling the Percona XtraDB Cluster 8.0 repository
* Enabling the Percona Tools repository
<*> All done!

Using the following command, install the Percona XtraDB Cluster packages and software on all three nodes.

$ sudo yum install percona-xtradb-cluster

Before starting the nodes, update the basic variables listed below on each node.

The following default variables must be modified on a first installation. Those coming from PXC 5.7 might wonder why wsrep_sst_auth is missing. The wsrep_sst_auth variable was removed in PXC 8 because it raised security concerns: the user and password were saved in the config file and were easily visible to OS users.

In PXC 8, a temporary user is created when a new node joins the existing cluster. For additional details on this security enhancement, check this article.

$ vi /etc/my.cnf
######## wsrep ###############
#If wsrep_node_name is not specified,  then system hostname will be used
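
The variable list itself is not reproduced above. As a rough illustration only, the sketch below prints a wsrep section of the kind commonly used with PXC 8.0. The function name render_wsrep_cnf, the cluster name, the provider path (typical for RPM-based systems), and the 10.0.0.x addresses are all assumptions to adapt, not values from this setup.

```shell
# Hypothetical helper: print a wsrep section for one node.
# Usage: render_wsrep_cnf NODE_NAME NODE_IP COMMA_SEPARATED_CLUSTER_IPS
render_wsrep_cnf() {
  local node_name="$1" node_ip="$2" cluster_ips="$3"
  cat <<EOF
[mysqld]
# Assumed library path for RPM-based systems; verify it on your host.
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_name=pxc-cluster
wsrep_cluster_address=gcomm://${cluster_ips}
wsrep_node_name=${node_name}
wsrep_node_address=${node_ip}
wsrep_sst_method=xtrabackup-v2
pxc_strict_mode=ENFORCING
binlog_format=ROW
EOF
}

# Example with placeholder addresses:
# render_wsrep_cnf pxc-node1 10.0.0.1 10.0.0.1,10.0.0.2,10.0.0.3
```

Note that the output contains no wsrep_sst_auth line, in keeping with the PXC 8 change described above.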

Percona XtraDB Cluster nodes utilize the following ports by default, so we must open them and ensure that the nodes can communicate with one another.

3306 is used for MySQL client connections and SST (State Snapshot Transfer) via mysqldump.

4444 is used for SST via Percona XtraBackup.

4567 is used for write-set replication traffic (over TCP) and multicast replication (over TCP and UDP).

4568 is used for IST (Incremental State Transfer).

For example, to test access on port 4444, listen on one node and connect from another.

Node 1
#  socat - TCP-LISTEN:4444

Node 2
# echo "hello" | socat - TCP:
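
To check all four ports in one go, a small loop is enough. The sketch below assumes bash (it uses the /dev/tcp pseudo-device) and the coreutils timeout command; pxc_ports and check_pxc_ports are hypothetical names. It only tests TCP, and a listener must already be running on the target port for the check to succeed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# The default PXC ports listed above.
pxc_ports() {
  printf '%s\n' 3306 4444 4567 4568
}

# Try a TCP connection to each port on HOST; print one line per port.
# Usage: check_pxc_ports HOST [TIMEOUT_SECONDS]
check_pxc_ports() {
  local host="$1" secs="${2:-3}" port rc=0
  for port in $(pxc_ports); do
    if timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
      rc=1
    fi
  done
  return "$rc"
}
```

The function returns non-zero if any port is unreachable, so it can gate a provisioning script before the first node is bootstrapped.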

How is the first node bootstrapped?

After configuring each PXC node, you must bootstrap the cluster starting with the first node. All of the data that you wish to replicate to additional nodes must be present on the first node. 

Run the command to bootstrap the first node.

# systemctl start mysql@bootstrap.service

Since this is a brand new install, a temporary password is generated for the ‘root’ MySQL user, which we can find in mysqld.log.

# grep -i "A temporary password is generated " /var/log/mysqld.log
2022-09-13T06:52:37.700818Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: PmqQiGf*#6iy

Reset the temporary password using the following statement.

$ mysql -uroot -p
Enter password:
mysql> SET PASSWORD = 'GdKG*12#ULmE';
Query OK, 0 rows affected (0.03 sec)

How can the cluster’s remaining nodes be joined?

Before starting node2, you must copy the SSL certificates from node1 to node2 (and to node3). PXC 8 by default encrypts all replication communication, so this is a critical step that most users miss, causing cluster startup failures.

PXCNode1# scp /var/lib/mysql/*.pem
PXCNode2# chown -R mysql:mysql /var/lib/mysql/*.pem

Start the node.

PXCNode2# systemctl start mysql

Verify the following status variables and make sure they appear as below once the node has been added to the cluster.

PXCNode2$ mysql -uroot -p -e " show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected','wsrep_ready');"
Enter password:
| Variable_name             | Value   |
| wsrep_cluster_size        | 2       |
| wsrep_cluster_status      | Primary |
| wsrep_connected           | ON      |
| wsrep_local_state         | 4       |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |

The third node may be added to the cluster using the same procedures, and its status will then look as follows.

PXCNode3$ mysql -uroot -p -e " show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected','wsrep_ready');"
Enter password:
| Variable_name             | Value   |
| wsrep_cluster_size        | 3       |
| wsrep_cluster_status      | Primary |
| wsrep_connected           | ON      |
| wsrep_local_state         | 4       |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
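
For scripted checks, the same status variables can be validated automatically. Here is a minimal sketch, assuming the status rows arrive as tab-separated name/value pairs (the format the mysql client prints with -N -B); wsrep_healthy is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the node reports a healthy cluster
# of the expected size. Reads "Variable_name<TAB>Value" rows on stdin.
# Usage: ... | wsrep_healthy EXPECTED_CLUSTER_SIZE
wsrep_healthy() {
  local expected_size="$1"
  awk -F'\t' -v want="$expected_size" '
    $1 == "wsrep_cluster_size"        { size = $2 }
    $1 == "wsrep_cluster_status"      { status = $2 }
    $1 == "wsrep_local_state_comment" { state = $2 }
    $1 == "wsrep_ready"               { ready = $2 }
    END {
      ok = (size == want && status == "Primary" && state == "Synced" && ready == "ON")
      exit !ok
    }
  '
}

# Example (prompts for the password):
# mysql -uroot -p -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | wsrep_healthy 3
```

The exit status is 0 only when all four conditions hold, which makes the function usable in an if statement or a monitoring hook.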

Additional supporting factors

Additionally, a failover layer such as ProxySQL is advised; it helps remove failing nodes from the active read pool and, in some cases, shift the primary.

It is advised to have a backup in place, with the open-source tool Percona XtraBackup taking physical copies of the dataset that are significantly faster to recover. It is strongly advised to back up binary logs using mysqlbinlog in order to do point-in-time recovery. Backups should be encrypted, compressed, and transferred to S3 as soon as possible.

To copy binlogs to an S3 bucket, the command looks like this.

aws s3 sync /backups/binlogs/ s3://backup-bucket/
upload: ../../../../backups/binlogs/binlog.000001.gz.gpg to 
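
When preparing binlogs for upload, it is usually safer to skip the file that is still being written. Here is a minimal sketch that lists only the closed binlogs in a directory, assuming the default binlog.NNNNNN naming. The name closed_binlogs is a hypothetical helper, and the gzip/gpg step in the comment is just one possible way to produce files like binlog.000001.gz.gpg, with a placeholder recipient and output directory.

```shell
# Hypothetical helper: print every binlog file name in DIR except the newest,
# which is assumed to be the one still being written.
# Usage: closed_binlogs DIR
closed_binlogs() {
  local dir="$1"
  ls "$dir" |
    grep -E '^binlog\.[0-9]+$' |   # ignore binlog.index and processed files
    sort -t . -k 2,2n |            # numeric order of the sequence number
    sed '$d'                       # drop the last (active) file
}

# One possible follow-up (recipient and output directory are placeholders):
# for f in $(closed_binlogs /backups/binlogs); do
#   gzip -c "/backups/binlogs/$f" |
#     gpg --batch --encrypt --recipient backup@example.com \
#         --output "/backups/outgoing/$f.gz.gpg"
# done
```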

For query analytics and time-based database performance insights, the open-source tool Percona Monitoring and Management is highly recommended. It can be deployed from the Amazon Marketplace and needs to run on a separate host. It monitors the operating system and MySQL metrics and provides sophisticated query analytics.


Amazon S3 Storage Lens gives IT visibility into complex S3 usage

As your S3 storage requirements grow, it gets harder to understand exactly what you have, and this is especially true when it crosses multiple regions. This could have broad implications for administrators, who are forced to build their own solutions to get that missing visibility. AWS changed that this week when it announced a new product called Amazon S3 Storage Lens, a way to understand highly complex S3 storage environments.

The tool provides analytics that help you understand what’s happening across your S3 object storage installations, and to take action when needed. As the company wrote in a blog post describing the new service, “This is the first cloud storage analytics solution to give you organization-wide visibility into object storage, with point-in-time metrics and trend lines as well as actionable recommendations.”

Amazon S3 Storage Lens Console

Image Credits: Amazon

The idea is to present a set of 29 metrics in a dashboard that help you “discover anomalies, identify cost efficiencies and apply data protection best practices,” according to the company. IT administrators can get a view of their storage landscape and can drill down into specific instances when necessary, such as if there is a problem that requires attention. The product comes out of the box with a default dashboard, but admins can also create their own customized dashboards, and even export S3 Lens data to other Amazon tools.

For companies with complex storage requirements, as in thousands or even tens of thousands of S3 storage instances, who have had to kludge together ways to understand what’s happening across the systems, this gives them a single view across it all.

S3 Storage Lens is now available in all AWS regions, according to the company.


Quilt Data launches from stealth with free portal to access petabytes of public data

Quilt Data‘s founders, Kevin Moore and Aneesh Karve, have been hard at work for the last four years building a platform to search for data quickly across vast repositories on AWS S3 storage. The idea is to give data scientists a way to find data in S3 buckets, then package that data in forms that a business can use. Today, the company launched out of stealth with a free data search portal that not only proves what they can do, but also provides valuable access to 3.7 petabytes of public data across 23 S3 repositories.

The public data repository includes publicly available Amazon review data along with satellite images and other high-value public information. The product works like any search engine, where you enter a query, but instead of searching the web or an enterprise repository, it finds the results in S3 storage on AWS.

The results include not only the data you are looking for but also all of the information around the data, such as Jupyter notebooks, the standard workspace that data scientists use to build machine learning models. Data scientists can then use this as the basis for building their own machine learning models.

The public data, which includes more than 10 billion objects, is a resource that data scientists should greatly appreciate, but Quilt Data is offering access to this data out of more than pure altruism. It’s doing so because it wants to show what the platform is capable of, and in the process hopes to get companies to use the commercial version of the product.

Quilt Data search results with data about the data found (Image: Quilt Data)

Customers can try Quilt Data for free or subscribe to the product in the Amazon Marketplace. The company charges a flat rate of $550 per month for each S3 bucket. It also offers an enterprise version with priority support, custom features and education and on-boarding for $999 per month for each S3 bucket.

The company was founded in 2015 and was a member of the Y Combinator Summer 2017 cohort. The company has received $4.2 million in seed money so far from Y Combinator, Vertex Ventures, Fuel Capital and Streamlined Ventures, along with other unnamed investors.


New tools could help prevent Amazon S3 data leaks

If you do a search for Amazon S3 breaches due to customer error of leaving the data unencrypted, you’ll see a long list that includes a DoD contractor, Verizon (the owner of this publication) and Accenture, among the more high profile examples. Today, AWS announced a new set of five tools designed to protect customers from themselves and ensure (to the extent possible) that the data in S3…


Minio scores $20 million Series A to build a neutral object storage layer

Minio has a plan to become the neutral object storage layer, while still maintaining Amazon S3 object storage compatibility. That may seem like an odd strategy, but as Anand Babu Periasamy, co-founder and CEO of Minio, points out, there is a clear market need.
By building a solution that enables customers to store data across a variety of solutions including S3, he believes he is giving…


The day Amazon S3 storage stood still

By now you’ve probably heard that Amazon’s S3 storage service went down in its Northern Virginia datacenter for the better part of 4 hours yesterday, and took parts of a bunch of prominent websites and services with it. It’s worth noting that as of this morning, the Amazon dashboard was showing everything was operating normally. While yesterday’s outage was a big deal…


AWS drops its storage prices and launches new cold storage retrieval options

Amazon Web Services (AWS) today announced a significant price drop for some of its storage services. In addition, it also launched a few new features for developers who want to use its Glacier cold storage service. The new prices that most developers will likely care about are those for S3, AWS’ main cloud storage service. Instead of six pricing tiers, S3 will now use three: 0-50 TB;…


MySQL Backup tools used by Percona Remote DBA for MySQL

As part of the Percona Remote DBA for MySQL service, we recognize that reliable backups are one of the most important things we can bring to the table. In my experience handling emergencies, the single worst thing that can happen is finding out you don’t have backups available when some sort of data loss or catastrophic event occurs.

With our Remote DBA service we can take care of backups for you. What follows are some of the internals of our implementation.

What kind of outages can happen?

  • Someone runs UPDATE or DELETE and forgets the WHERE clause, or the filters weren’t quite right
  • The application had a bug causing data to be removed or overwritten
  • A table (or entire schema) was dropped accidentally
  • Your InnoDB table is corrupted and MySQL shuts down
  • Your server or RAID controller crashes and all data is lost on that server
  • A disk failed, and the RAID array does not recover
  • You run into an InnoDB corruption bug that propagates via replication (not common, but does happen)
  • You lose your entire SAN and all your DB servers were located there. Let’s hope your backups are somewhere else!
  • You lose a PSU or network switch in your datacenter and some or all of your servers go down in that location
  • Your entire datacenter loses power and the generators do not start, which happens more often than you might think

What tools do we use in Remote DBA?

We have these major components:
  1. Percona XtraBackup for MySQL for binary backups
  2. mydumper for logical backups
  3. mysqlbinlog 5.6
  4. Amazon S3
  5. monitoring for all the above

Philosophy on backups

  • It is a good idea to schedule both logical and binary backups. They each have their use cases and add redundancy to your backups. If there is an issue with one backup tool, it is unlikely to affect the other.
  • Store your backups on more than one server.
  • In addition to local copies, store backups offsite. Look at the cost of S3 or S3+Glacier; it’s worth the peace of mind!
  • Test your backups, and if you have a test environment, load them there periodically. You can also spin up an EC2 instance to load your backups onto. In addition, you can roll forward 24 hours of binlogs as a good test.
  • Store your binlogs off your primary server so you can perform point-in-time recovery.
  • Store your binlogs offsite for disaster recovery scenarios.
  • Run pt-table-checksum periodically (e.g. once a month) and make sure your servers’ data stays consistent. Checksumming is important, as backups are typically pulled off a slave and it’s vital that it has the same data.

How do we use these components to give our customers reliable backups?

Think about the 10 example outages listed above. Each tool has its strong points given the conditions.

Percona XtraBackup for MySQL for binary backups.

Strong Points:
  • It can restore an entire server very fast. Often the limiter of how fast this can be restored to another server, is how fast you can transfer data over your network. If you have 1GB network and you have 1TB of data, it could take awhile.
  • It can compress the DB on the fly
  • It can back up a server at approximately the maximum rate the server allows, given its IO system
  • It can typically execute a backup with little to no major impact on the server. For example in xtrabackup 2.0.5+, the time taken for “FLUSH TABLES WITH READ LOCK” is normally under 1 second.
  • If you have a lot of non-transactional tables (e.g. MyISAM), use the --rsync option. This rsyncs a copy of all the frm files and all the MYD/MYI files, then does a second rsync while under a global lock. Where you may previously have been locked for hours because of many non-transactional tables, you can now be locked for under a second. Even with InnoDB only, this can greatly cut down the lock time by syncing the frm files.
  • Enable --slave-info when backing up from a slave so you know your position in the master’s binary logs
  • The --compress option compresses on the fly using qpress under the hood.
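To put a rough number on how long a network-bound restore "could take", here is a back-of-the-envelope sketch. It assumes that "1GB network" means a 1 Gbit/s link and that the link is the only bottleneck. The function name and the `efficiency` parameter are illustrative and not part of any Percona tool.

```python
def transfer_hours(data_bytes: float, link_bits_per_sec: float,
                   efficiency: float = 1.0) -> float:
    """Estimate the hours needed to move data_bytes over a network link.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1) for
    protocol overhead and contention; 1.0 means the full line rate.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    one_tb = 1e12    # 1 TB (decimal), assumed data size
    one_gbit = 1e9   # 1 Gbit/s, assumed link speed
    print(f"{transfer_hours(one_tb, one_gbit):.1f} hours at full line rate")
    print(f"{transfer_hours(one_tb, one_gbit, efficiency=0.7):.1f} hours at 70% of line rate")
```

Under these assumptions the copy alone takes a little over two hours before any apply-log or startup work, which is why available bandwidth should be part of any restore-time estimate.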
When do we typically use xtrabackup restores:
  • Setting up new slaves
  • When we lose an entire server due to hardware failure, corruption, etc
  • When the majority of data on the server was lost, e.g. there is one primary schema and that schema was dropped. Basically, when restoring the binary backup may take less time than trying to load a logical backup.

Restoring your data from backup is another topic. Piecing together data after accidental data loss is one of Percona’s specialties, and there are many different techniques depending on the scenario. I will go through some of these in detail in a future blog post.

Mydumper for logical backups

Strong Points:
  • Very fast for logical backups – compared to mysqldump
  • Consistent backups between myisam and innodb tables. Global read lock only held until myisam tables are dumped.
    • We are researching how we could further improve lock times here when non-transactional tables are present
  • Almost no locking, if not using myisam tables
  • Built in compression
  • Each table is dumped to a separate file. This is very important to make restoring single tables easy. You can quickly restore a single table, instead of restoring your entire backup just to find a tiny table you need. This is actually the most common type of restore needed, so it’s important to make this operation as painless as possible.
  • Compressed mydumper typically 3x-5x smaller vs compressed xtrabackup
  • Typically we upload mydumper backups to s3 vs xtrabackup given the time needed to upload/download. Though it depends on the available bandwidth and should be factored into your restore time.


  • You can’t rely on mydumper to dump schemas. It does not handle views/triggers/procedures etc. Run it with --no-schemas, and instead use mysqldump for the schemas and rely on mydumper for data only.
  • You will have to compile it yourself as binary packages aren’t distributed
  • Be careful with importing a dump from a server running in a different timezone. We have a fix here.
Details on how we dump schemas:
  • loop through each DB
    • write out ALTER DATABASE DEFAULT CHARACTER SET <charset> to the schema file, putting in the current charset
    • mysqldump … -d -R --skip-triggers, out to the schema file
    • create a schema-post file that has the triggers # mysqldump … -d -t
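The schema-dump loop above can be sketched as a small command planner. This is only an illustration under stated assumptions: the output file names (`<db>-schema.sql`, `<db>-schema-post.sql`) and the function name are hypothetical, connection options are left out, and the commands are built as strings rather than executed. The mysqldump flags are the ones named in the list: `-d` (no row data), `-R` (routines), `--skip-triggers`, and `-t` (no CREATE TABLE statements).

```python
import shlex


def schema_dump_plan(databases: dict) -> dict:
    """Return, per database, the ALTER statement and the two mysqldump commands.

    databases maps a database name to its current default character set.
    File names are hypothetical; host/user/password options are omitted.
    """
    plan = {}
    for db, charset in databases.items():
        schema_file = shlex.quote(f"{db}-schema.sql")
        post_file = shlex.quote(f"{db}-schema-post.sql")
        plan[db] = {
            # Written first into the schema file so the charset is preserved.
            "alter": f"ALTER DATABASE `{db}` DEFAULT CHARACTER SET {charset};",
            # Structure and routines only; triggers are held back for later.
            "schema": f"mysqldump -d -R --skip-triggers {shlex.quote(db)} >> {schema_file}",
            # No data and no CREATE TABLE, which leaves only the triggers.
            "schema_post": f"mysqldump -d -t {shlex.quote(db)} > {post_file}",
        }
    return plan


if __name__ == "__main__":
    for db, steps in schema_dump_plan({"shop": "utf8mb4"}).items():
        print(db)
        for name, command in steps.items():
            print(f"  {name}: {command}")
```

Keeping triggers in a separate post file means they are only created after the data is loaded, so they do not fire during the import.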
How to restore mydumper data:
  • Load the schema file
  • Run myloader --threads=x
  • Load the schema-post file
I will get into specifics on the tips/tricks to restore data in a future blog post.
  • Run with --kill-long-queries to avoid nasty problems with “FLUSH TABLES WITH READ LOCK”
  • --compress compresses each table file and should typically be enabled by default. The time needed to uncompress is not a limiting factor on restore time when done inline.
When do we typically use mydumper restores:
  • Restoring a single file
  • Restoring a single schema or rolling forward a single schema to a point in time
  • Restoring data while automatically replicating out to all slaves

mysqlbinlog 5.6

Last year Percona IT director Tamas Kozak had a great blog post that showed how mysqlbinlog in 5.6 could be used. With mysqlbinlog 5.6, you can now pull binary logs in real time to another server using “mysqlbinlog … --read-from-remote-server --raw --stop-never”

  • Useful to mirror the binlogs on the master to a second server.
  • Allows you to roll forward backups even after losing the master
  • Very useful for disaster recovery.
  • You can have your backups in S3 and mysqlbinlog --stop-never running on a small EC2 instance. This allows for a very low-cost disaster recovery plan that ensures you will not lose data even in the worst-case scenarios.
  • Takes very few resources to run, can run almost anywhere with disk space, and writes out binlog files sequentially.
Tips/Tricks (how we run this):
  • Ensure it stays running, restart it if it appears to be hanging
  • Verify the file is the same on master and slave
  • Re-transfer files that are partially transferred
  • Compress the files after successful transfer
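The verify-then-compress steps above can be sketched as follows. This is a minimal illustration, not the script Percona runs: it assumes you already have a SHA-256 checksum of the binlog as it exists on the master, it uses gzip as a stand-in compressor, and the function names are made up for the example. It should only be applied to binlogs that have already been rotated, not to the file mysqlbinlog is still writing.

```python
import gzip
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_if_verified(mirrored: Path, master_sha256: str) -> Optional[Path]:
    """Gzip the mirrored binlog only if it matches the master's checksum.

    Returns the path of the .gz file, or None when the checksums differ,
    which signals that the file should be transferred again.
    """
    if sha256_of(mirrored) != master_sha256:
        return None
    target = mirrored.with_name(mirrored.name + ".gz")
    with mirrored.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    mirrored.unlink()  # keep only the compressed copy
    return target
```

A caller would loop over the completed binlog files, re-transfer any file for which the function returns None, and leave the compressed results in place for later roll-forward.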

Amazon S3 for MySQL

I discuss S3 here but other cloud based storage can be used as well. S3 is just the most popular in this category and is in wide use.
  • s3cmd: we have been using the version from GitHub, mostly for multi-part upload support. This prevents us from having to split files up before uploading to S3.
  • There is a released alpha of this version here
  • You can now set bucket lifecycle properties so data over X days old is archived to Glacier and data over Y days old is removed. This is a very convenient feature that lets you store long-term backups cost-effectively with little additional work
  • --add-header=x-amz-server-side-encryption:AES256 to use the server side encryption feature which helps with some types of compliance. We also have the capability to encrypt all files with gpg prior to upload via a separate script
  • use_https = True, especially if your data is not encrypted before transfer


Monitoring

  • Monitoring is the most important piece to tie all of these processes together. We employ NSCA Nagios alerts for all of the backup processes.
  • freshness_threshold should be set so that you are alerted if your NSCA check hasn’t reported in within a certain period. For example, if you back up once a day, a good threshold could be 36 hours.
  • For our mysqlbinlog processes, we send NSCA alerts every 30 seconds and alert when nothing has been received for somewhere between 15 minutes and 1 hour
  • If backups throw an error and are aborted, we send a critical alert immediately to be investigated
  • The number one cause of backup alerts is problems with “FLUSH TABLES WITH READ LOCK”, namely when a select is blocking the flush from completing and queuing all requests behind it. Our current solution is a guardian process that runs during a backup and kills any process that causes the flush to stall. We are also researching other ways that could improve this in the future.
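The freshness idea can be illustrated independently of Nagios. The sketch below is an assumption-laden stand-in for what a freshness check does: it compares the time of the last passive check-in against a threshold and reports whether the check is stale. The function and constant names are hypothetical; in Nagios itself freshness_threshold is expressed in seconds.

```python
from datetime import datetime, timedelta

# Thresholds taken from the examples above: 36 hours for a daily backup,
# and the lower bound of 15 minutes for the mysqlbinlog mirror.
DAILY_BACKUP_THRESHOLD = timedelta(hours=36)
BINLOG_MIRROR_THRESHOLD = timedelta(minutes=15)


def is_stale(last_check_in: datetime, now: datetime, threshold: timedelta) -> bool:
    """Return True when no check-in has arrived within the threshold."""
    return now - last_check_in > threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 3, 12, 0)
    last_backup = datetime(2024, 1, 1, 12, 0)  # 48 hours ago
    print("backup stale:", is_stale(last_backup, now, DAILY_BACKUP_THRESHOLD))
    # The same threshold expressed in seconds, as a monitoring config would want it.
    print("threshold seconds:", int(DAILY_BACKUP_THRESHOLD.total_seconds()))
```

Setting the threshold to 1.5 times the backup interval leaves room for a backup that runs long without hiding a backup that never ran.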

Other details on Percona Remote DBA for MySQL backup systems for future posts

  • Detailed strategies for different types of restores
  • Retention strategies for dailies, weeklies, and long-term backups
  • Decompressing Percona XtraBackup for MySQL in parallel using all your available resources
  • Downloading from S3 in parallel
  • Parallel encryption/decryption
  • Hardlinking of backups. Given that both our mydumper and xtrabackup backups are separated by file, files that don’t change can easily be hardlinked, typically saving 20-80% of space locally
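As a rough illustration of the hardlinking idea in the last item, the sketch below walks a new backup directory and replaces every file that is byte-identical to the file at the same relative path in the previous backup with a hard link. It is not Percona's implementation: the function name is made up, it assumes both directories live on the same filesystem, and it assumes backup files are never modified in place after they are written.

```python
import filecmp
import os
from pathlib import Path


def hardlink_unchanged(previous: Path, current: Path) -> int:
    """Hard-link files in current that are identical to those in previous.

    Returns the number of bytes that no longer need their own copy.
    """
    saved = 0
    # Materialize the listing first because the loop creates temporary files.
    for new_file in list(current.rglob("*")):
        if new_file.is_symlink() or not new_file.is_file():
            continue
        old_file = previous / new_file.relative_to(current)
        if not old_file.is_file():
            continue
        if os.path.samefile(old_file, new_file):
            continue  # already linked to the previous backup
        if not filecmp.cmp(old_file, new_file, shallow=False):
            continue  # contents differ, keep the new copy
        size = new_file.stat().st_size
        temp_link = new_file.with_name(new_file.name + ".hardlink-tmp")
        os.link(old_file, temp_link)     # create the link under a temporary name
        os.replace(temp_link, new_file)  # then swap it in atomically
        saved += size
    return saved
```

Comparing full contents is slow on large backups; comparing stored checksums instead would be a reasonable variation, at the cost of maintaining a checksum manifest.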

The post MySQL Backup tools used by Percona Remote DBA for MySQL appeared first on MySQL Performance Blog.
