Feb 23, 2023

Streaming MongoDB Backups Directly to S3


If you have ever had to take a quick ad-hoc backup of your MongoDB databases but did not have enough free space on the local disk to do so, this blog post may provide some handy tips to save you from headaches.

It is common practice to prepare a backup locally first and only later copy it to the cloud or to a dedicated backup server.

Fortunately, there are ways to skip the local storage entirely and stream MongoDB backups directly to the destination. At the same time, the goal is usually to save both network bandwidth and storage space (cost savings!) without overloading the CPU on the production database. Therefore, applying on-the-fly compression is essential.

In this article, I will show some simple examples to help you quickly do the job.

Prerequisites for streaming MongoDB backups

You will need an account for one of the providers offering object storage compatible with Amazon S3. I used Wasabi in my tests as it offers very easy registration for a trial and takes just a few minutes to get started if you want to test the service.

The second thing you need is a tool for managing the data from a Linux command line. The two most popular ones — s3cmd and the AWS CLI — are sufficient, and I will show examples using both.

Installation and setup will depend on your OS and the S3 provider specifics. Please refer to the documentation below to proceed, as I will not cover the installation details here.

* https://s3tools.org/s3cmd
* https://docs.aws.amazon.com/cli/index.html
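
For reference, once configured (e.g., via s3cmd --configure), a minimal ~/.s3cfg for an S3-compatible provider boils down to just a few lines. This is a sketch with placeholder keys and Wasabi's endpoint; adjust both to your provider:

$ cat ~/.s3cfg
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
host_base = s3.wasabisys.com
host_bucket = %(bucket)s.s3.wasabisys.com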

Backup tools

Two main tools are provided with the MongoDB packages, and both perform logical backups: mongodump (restored with mongorestore) and mongoexport (imported back with mongoimport).

Compression tool

We all know gzip and bzip2, which are installed by default on almost every Linux distro. However, I find zstd way more efficient, so I’ll use it in the examples.
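
To give an idea of the knobs involved (the values here are illustrative, not a recommendation), zstd lets you trade compression ratio for CPU with the level flag and parallelize across cores with -T:

# Example only: compression level 3 with 4 worker threads
$ mongodump --db=db2 --archive | zstd -3 -T4 | s3cmd put - s3://mbackups/db2.zst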

Examples

I believe real-world examples are best if you wish to test something similar, so here they are.

Mongodump & s3cmd – Single database backup

  • Let’s create a bucket dedicated to MongoDB data backups:
$ s3cmd mb s3://mbackups
Bucket 's3://mbackups/' created

  • Now, do a simple dump of one example database using the --archive option, which changes the behavior from storing collection data in separate files on disk to streaming the whole backup to standard output (STDOUT) in a common archive format. At the same time, the stream gets compressed on the fly and sent to the S3 destination.
  • Note that the command below does not create a backup that is consistent with regard to ongoing writes, as it does not include the oplog.
$ mongodump --db=db2 --archive | zstd | s3cmd put - s3://mbackups/$(date +%Y-%m-%d.%H-%M)/db2.zst
2023-02-07T19:33:58.138+0100 writing db2.products to archive on stdout
2023-02-07T19:33:58.140+0100 writing db2.people to archive on stdout
2023-02-07T19:33:59.364+0100 done dumping db2.people (50474 documents)
2023-02-07T19:33:59.977+0100 done dumping db2.products (516784 documents)
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-33/db2.zst' [part 1 of -, 15MB] [1 of 1]
15728640 of 15728640 100% in 1s 8.72 MB/s done
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-33/db2.zst' [part 2 of -, 1491KB] [1 of 1]
1527495 of 1527495 100% in 0s 4.63 MB/s done

  • After the backup is done, let’s verify its presence in S3:
$ s3cmd ls -H s3://mbackups/2023-02-07.19-33/
2023-02-07 18:34 16M s3://mbackups/2023-02-07.19-33/db2.zst
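
Before relying on such a dump, it is worth a quick sanity check that costs no disk space: stream the object back and let zstd test-decompress it (the output is discarded; only the integrity of the compressed stream is verified):

$ s3cmd get --no-progress s3://mbackups/2023-02-07.19-33/db2.zst - | zstd -t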

Mongorestore & s3cmd – Database restore directly from S3

The mongorestore command below uses the --archive option as well, which allows us to stream the backup directly into it:

$ s3cmd get --no-progress s3://mbackups/2023-02-07.20-14/db2.zst - | zstd -d | mongorestore --archive --drop
2023-02-08T00:42:41.434+0100 preparing collections to restore from
2023-02-08T00:42:41.480+0100 reading metadata for db2.people from archive on stdin
2023-02-08T00:42:41.480+0100 reading metadata for db2.products from archive on stdin
2023-02-08T00:42:41.481+0100 dropping collection db2.people before restoring
2023-02-08T00:42:41.502+0100 restoring db2.people from archive on stdin
2023-02-08T00:42:42.130+0100 dropping collection db2.products before restoring
2023-02-08T00:42:42.151+0100 restoring db2.products from archive on stdin
2023-02-08T00:42:43.217+0100 db2.people 16.0MB
2023-02-08T00:42:43.217+0100 db2.products 12.1MB
2023-02-08T00:42:43.217+0100
2023-02-08T00:42:43.654+0100 db2.people 18.7MB
2023-02-08T00:42:43.654+0100 finished restoring db2.people (50474 documents, 0 failures)
2023-02-08T00:42:46.218+0100 db2.products 46.3MB
2023-02-08T00:42:48.758+0100 db2.products 76.0MB
2023-02-08T00:42:48.758+0100 finished restoring db2.products (516784 documents, 0 failures)
2023-02-08T00:42:48.758+0100 no indexes to restore for collection db2.products
2023-02-08T00:42:48.758+0100 no indexes to restore for collection db2.people
2023-02-08T00:42:48.758+0100 567258 document(s) restored successfully. 0 document(s) failed to restore.
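
A handy variation, sketched here rather than taken from the run above: the same archive can be restored under a different database name with mongorestore's namespace remapping options (the target name db2_restored is just an example):

$ s3cmd get --no-progress s3://mbackups/2023-02-07.20-14/db2.zst - | zstd -d | mongorestore --archive --nsFrom='db2.*' --nsTo='db2_restored.*'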

Mongodump & s3cmd – Full backup

The command below provides a consistent, point-in-time backup thanks to the --oplog option:

$ mongodump --port 3502 --oplog --archive | zstd | s3cmd put - s3://mbackups/$(date +%Y-%m-%d.%H-%M)/full_dump.zst
2023-02-13T00:05:54.080+0100 writing admin.system.users to archive on stdout
2023-02-13T00:05:54.083+0100 done dumping admin.system.users (1 document)
2023-02-13T00:05:54.084+0100 writing admin.system.version to archive on stdout
2023-02-13T00:05:54.085+0100 done dumping admin.system.version (2 documents)
2023-02-13T00:05:54.087+0100 writing db1.products to archive on stdout
2023-02-13T00:05:54.087+0100 writing db2.products to archive on stdout
2023-02-13T00:05:55.260+0100 done dumping db2.products (284000 documents)
upload: '<stdin>' -> 's3://mbackups/2023-02-13.00-05/full_dump.zst' [part 1 of -, 15MB] [1 of 1]
2023-02-13T00:05:57.068+0100 [####################....] db1.products 435644/516784 (84.3%)
15728640 of 15728640 100% in 1s 9.63 MB/s done
2023-02-13T00:05:57.711+0100 [########################] db1.products 516784/516784 (100.0%)
2023-02-13T00:05:57.722+0100 done dumping db1.products (516784 documents)
2023-02-13T00:05:57.723+0100 writing captured oplog to
2023-02-13T00:05:58.416+0100 dumped 136001 oplog entries
upload: '<stdin>' -> 's3://mbackups/2023-02-13.00-05/full_dump.zst' [part 2 of -, 8MB] [1 of 1]
8433337 of 8433337 100% in 0s 10.80 MB/s done

$ s3cmd ls -H s3://mbackups/2023-02-13.00-05/full_dump.zst
2023-02-12 23:05 23M s3://mbackups/2023-02-13.00-05/full_dump.zst
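
One caution if you wrap this into a script: by default, a pipeline's exit status is that of its last command, so a failing mongodump could silently produce a truncated upload. A minimal sketch of a safer wrapper, reusing the port and bucket from above:

#!/bin/bash
# Sketch of a cron-friendly full-dump wrapper; adjust the port and bucket to your setup.
set -euo pipefail   # abort if any stage of the pipeline fails
dest="s3://mbackups/$(date +%Y-%m-%d.%H-%M)/full_dump.zst"
mongodump --port 3502 --oplog --archive | zstd | s3cmd put - "$dest"
echo "Backup uploaded to $dest"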

Mongodump & s3cmd – Full backup restore

By analogy, mongorestore uses the --oplogReplay option to apply the oplog captured in the archived stream:

$ s3cmd get --no-progress s3://mbackups/2023-02-13.00-05/full_dump.zst - | zstd -d | mongorestore --port 3502 --archive --oplogReplay
2023-02-13T00:07:25.977+0100 preparing collections to restore from
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "db1", skipping...
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "db2", skipping...
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "", skipping...
2023-02-13T00:07:25.977+0100 don't know what to do with subdirectory "admin", skipping...
2023-02-13T00:07:25.988+0100 reading metadata for db1.products from archive on stdin
2023-02-13T00:07:25.988+0100 reading metadata for db2.products from archive on stdin
2023-02-13T00:07:26.006+0100 restoring db2.products from archive on stdin
2023-02-13T00:07:27.651+0100 db2.products 11.0MB
2023-02-13T00:07:28.429+0100 restoring db1.products from archive on stdin
2023-02-13T00:07:30.651+0100 db2.products 16.0MB
2023-02-13T00:07:30.652+0100 db1.products 14.4MB
2023-02-13T00:07:30.652+0100
2023-02-13T00:07:33.652+0100 db2.products 32.0MB
2023-02-13T00:07:33.652+0100 db1.products 18.0MB
2023-02-13T00:07:33.652+0100
2023-02-13T00:07:36.651+0100 db2.products 37.8MB
2023-02-13T00:07:36.652+0100 db1.products 32.0MB
2023-02-13T00:07:36.652+0100
2023-02-13T00:07:37.168+0100 db2.products 41.5MB
2023-02-13T00:07:37.168+0100 finished restoring db2.products (284000 documents, 0 failures)
2023-02-13T00:07:39.651+0100 db1.products 49.3MB
2023-02-13T00:07:42.651+0100 db1.products 68.8MB
2023-02-13T00:07:43.870+0100 db1.products 76.0MB
2023-02-13T00:07:43.870+0100 finished restoring db1.products (516784 documents, 0 failures)
2023-02-13T00:07:43.871+0100 restoring users from archive on stdin
2023-02-13T00:07:43.913+0100 replaying oplog
2023-02-13T00:07:45.651+0100 oplog 2.14MB
2023-02-13T00:07:48.651+0100 oplog 5.68MB
2023-02-13T00:07:51.651+0100 oplog 9.34MB
2023-02-13T00:07:54.651+0100 oplog 13.0MB
2023-02-13T00:07:57.651+0100 oplog 16.7MB
2023-02-13T00:08:00.651+0100 oplog 19.7MB
2023-02-13T00:08:03.651+0100 oplog 22.7MB
2023-02-13T00:08:06.651+0100 oplog 25.3MB
2023-02-13T00:08:09.651+0100 oplog 28.1MB
2023-02-13T00:08:12.651+0100 oplog 30.8MB
2023-02-13T00:08:15.651+0100 oplog 33.6MB
2023-02-13T00:08:18.651+0100 oplog 36.4MB
2023-02-13T00:08:21.651+0100 oplog 39.1MB
2023-02-13T00:08:24.651+0100 oplog 41.9MB
2023-02-13T00:08:27.651+0100 oplog 44.7MB
2023-02-13T00:08:30.651+0100 oplog 47.5MB
2023-02-13T00:08:33.651+0100 oplog 50.2MB
2023-02-13T00:08:36.651+0100 oplog 53.0MB
2023-02-13T00:08:38.026+0100 applied 136001 oplog entries
2023-02-13T00:08:38.026+0100 oplog 54.2MB
2023-02-13T00:08:38.026+0100 no indexes to restore for collection db1.products
2023-02-13T00:08:38.026+0100 no indexes to restore for collection db2.products
2023-02-13T00:08:38.026+0100 800784 document(s) restored successfully. 0 document(s) failed to restore.
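
If you ever need to stop replaying before the end of the captured oplog (for example, to recover to just before an accidental write), mongorestore accepts --oplogLimit with a <timestamp-in-seconds>[:ordinal] boundary. A sketch with a placeholder timestamp:

$ s3cmd get --no-progress s3://mbackups/2023-02-13.00-05/full_dump.zst - | zstd -d | mongorestore --port 3502 --archive --oplogReplay --oplogLimit 1676240000:1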

Mongoexport – Export all collections from a given database, compress, and save directly to S3

Another example uses mongoexport to create regular JSON dumps. Note that this, too, is not a consistent backup if writes are ongoing.

$ ts=$(date +%Y-%m-%d.%H-%M)
$ mydb="db2"
$ mycolls=$(mongo --quiet $mydb --eval "db.getCollectionNames().join('\n')")

$ for i in $mycolls; do mongoexport -d $mydb -c $i | zstd | s3cmd put - s3://mbackups/$ts/$mydb/$i.json.zst; done
2023-02-07T19:30:37.163+0100 connected to: mongodb://localhost/
2023-02-07T19:30:38.164+0100 [#######.................] db2.people 16000/50474 (31.7%)
2023-02-07T19:30:39.164+0100 [######################..] db2.people 48000/50474 (95.1%)
2023-02-07T19:30:39.166+0100 [########################] db2.people 50474/50474 (100.0%)
2023-02-07T19:30:39.166+0100 exported 50474 records
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-30/db2/people.json.zst' [part 1 of -, 4MB] [1 of 1]
4264922 of 4264922 100% in 0s 5.71 MB/s done
2023-02-07T19:30:40.015+0100 connected to: mongodb://localhost/
2023-02-07T19:30:41.016+0100 [##......................] db2.products 48000/516784 (9.3%)
2023-02-07T19:30:42.016+0100 [######..................] db2.products 136000/516784 (26.3%)
2023-02-07T19:30:43.016+0100 [##########..............] db2.products 224000/516784 (43.3%)
2023-02-07T19:30:44.016+0100 [##############..........] db2.products 312000/516784 (60.4%)
2023-02-07T19:30:45.016+0100 [##################......] db2.products 408000/516784 (78.9%)
2023-02-07T19:30:46.016+0100 [#######################.] db2.products 496000/516784 (96.0%)
2023-02-07T19:30:46.202+0100 [########################] db2.products 516784/516784 (100.0%)
2023-02-07T19:30:46.202+0100 exported 516784 records
upload: '<stdin>' -> 's3://mbackups/2023-02-07.19-30/db2/products.json.zst' [part 1 of -, 11MB] [1 of 1]
12162655 of 12162655 100% in 1s 10.53 MB/s done

$ s3cmd ls -H s3://mbackups/$ts/$mydb/
2023-02-07 18:30 4M s3://mbackups/2023-02-07.19-30/db2/people.json.zst
2023-02-07 18:30 11M s3://mbackups/2023-02-07.19-30/db2/products.json.zst
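
mongoexport can also stream just a subset of a collection; the --query filter below is made up purely for illustration:

$ mongoexport -d db2 -c people -q '{"age": {"$gte": 30}}' | zstd | s3cmd put - s3://mbackups/adhoc/people_30plus.json.zst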

Mongoimport & s3cmd – Import single collection under a different name

$ s3cmd get --no-progress s3://mbackups/2023-02-08.00-49/db2/people.json.zst - | zstd -d | mongoimport -d db2 -c people_copy
2023-02-08T00:53:48.355+0100 connected to: mongodb://localhost/
2023-02-08T00:53:50.446+0100 50474 document(s) imported successfully. 0 document(s) failed to import.

Mongodump & AWS S3 – Backup database

$ mongodump --db=db2 --archive | zstd | aws s3 cp - s3://mbackups/backup1/db2.zst
2023-02-08T11:34:46.834+0100 writing db2.people to archive on stdout
2023-02-08T11:34:46.837+0100 writing db2.products to archive on stdout
2023-02-08T11:34:47.379+0100 done dumping db2.people (50474 documents)
2023-02-08T11:34:47.911+0100 done dumping db2.products (516784 documents)

$ aws s3 ls --human-readable mbackups/backup1/
2023-02-08 11:34:50 16.5 MiB db2.zst
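
One caveat with streaming uploads through the AWS CLI: since the stream length is unknown up front, uploads larger than roughly 50GB need --expected-size (a size hint in bytes) so the CLI can pick suitable multipart part sizes. For example, hinting at ~100GB:

$ mongodump --db=db2 --archive | zstd | aws s3 cp --expected-size 107374182400 - s3://mbackups/backup1/db2.zst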

Mongorestore & AWS S3 – Restore database

$ aws s3 cp s3://mbackups/backup1/db2.zst - | zstd -d | mongorestore --archive --drop
2023-02-08T11:37:08.358+0100 preparing collections to restore from
2023-02-08T11:37:08.364+0100 reading metadata for db2.people from archive on stdin
2023-02-08T11:37:08.364+0100 reading metadata for db2.products from archive on stdin
2023-02-08T11:37:08.365+0100 dropping collection db2.people before restoring
2023-02-08T11:37:08.462+0100 restoring db2.people from archive on stdin
2023-02-08T11:37:09.100+0100 dropping collection db2.products before restoring
2023-02-08T11:37:09.122+0100 restoring db2.products from archive on stdin
2023-02-08T11:37:10.288+0100 db2.people 16.0MB
2023-02-08T11:37:10.288+0100 db2.products 13.8MB
2023-02-08T11:37:10.288+0100
2023-02-08T11:37:10.607+0100 db2.people 18.7MB
2023-02-08T11:37:10.607+0100 finished restoring db2.people (50474 documents, 0 failures)
2023-02-08T11:37:13.288+0100 db2.products 47.8MB
2023-02-08T11:37:15.666+0100 db2.products 76.0MB
2023-02-08T11:37:15.666+0100 finished restoring db2.products (516784 documents, 0 failures)
2023-02-08T11:37:15.666+0100 no indexes to restore for collection db2.products
2023-02-08T11:37:15.666+0100 no indexes to restore for collection db2.people
2023-02-08T11:37:15.666+0100 567258 document(s) restored successfully. 0 document(s) failed to restore.

In the above examples, I used both the mongodump/mongorestore and mongoexport/mongoimport tool pairs to back up and restore MongoDB data directly to and from S3 object storage, streaming and compressing along the way. These methods are simple, fast, and resource-friendly. I hope they prove useful when you are looking for options to use in your backup scripts or ad-hoc backup tasks.

Additional tools

Here, I would like to mention that there are other free and open source backup solutions you may try, including Percona Backup for MongoDB (PBM), which now offers both logical and physical backups.

With the Percona Server for MongoDB variant, you may also stream hot physical backups directly to S3 storage:

https://docs.percona.com/percona-server-for-mongodb/6.0/hot-backup.html#streaming-hot-backups-to-a-remote-destination

It is as easy as this:

mongo > db.runCommand({createBackup: 1, s3: {bucket: "mbackups", path: "my_physical_dump1", endpoint: "s3.eu-central-2.wasabisys.com"}})
{ "ok" : 1 }

$ s3cmd du -H s3://mbackups/my_physical_dump1/
138M 26 objects s3://mbackups/my_physical_dump1/
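
Restoring such a physical dump is mostly a matter of pulling the files back and pointing mongod at them; fetching the whole directory is a single recursive get (the local path is just an example):

$ s3cmd get --recursive s3://mbackups/my_physical_dump1/ /var/lib/mongo-restore/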

For a sharded cluster, though, you should rather use PBM to get consistent backups.

By the way, don’t forget to check out the MongoDB best backup practices!

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

Download Percona Distribution for MongoDB today!

Nov 25, 2022

Percona Operator for MongoDB Backup and Restore on S3-Compatible Storage – Backblaze

One of the main features that I like about the Percona Operator for MongoDB is the integration with the Percona Backup for MongoDB (PBM) tool and the ability to back up/restore the database without manual intervention. The Operator allows backing up the DB to S3-compatible cloud storage, so you can use AWS, Azure, etc.

One of our customers asked about the integration between Backblaze and the Operator for backup and restore purposes. I looked into it and found that Backblaze B2 is S3-compatible and provides a free account with 10GB of cloud storage, so I jumped into testing it with our Operator. I also saw in our forum that a few users are using Backblaze cloud storage. So I am writing this blog post for anyone who wants to test or use Backblaze S3-compatible cloud storage with our Operator and PBM.

S3-compatible storage configuration

The Operator supports backup to S3-compatible storage. The steps for backup to AWS or Azure Blob are given here, so you can try those as well. In this blog, let me focus on configuring B2 cloud storage as the backup location and restoring from it to another deployment.

Let’s configure a Percona Server for MongoDB (PSMDB) sharded cluster using the Operator (with the minimal config as explained here). I used PSMDB Operator v1.12.0 and PBM 1.8.1 for the test below. You can sign up for the free account here – https://www.backblaze.com/b2/cloud-storage-b.html – and then log in to your account. First, create a key pair to access the storage from your Operator, in the “App Keys” tab:

Add Application Key GUI


Then create a bucket with your desired name and note down the S3-compatible storage details, such as the bucket name (shown in the picture below) and the endpointUrl to which the backup files will be sent. The endpointUrl can be obtained from the provider; the region is embedded in the prefix of the endpointUrl value.

Create Bucket


Deploy the cluster

Now let’s download the Operator from GitHub (I used v1.12.0) and configure the files for deploying the MongoDB sharded cluster. Here, I am using cr-minimal.yaml to deploy a very minimal setup: a single-member replica set per shard, the config DB, and one mongos.

#using an alias for the kubectl command
$ alias "k=kubectl"
$ cd percona-server-mongodb-operator

# Add a backup section in the cr file as shown below. Use the appropriate values from your setup
$ cat deploy/cr-minimal.yaml
apiVersion: psmdb.percona.com/v1-12-0
kind: PerconaServerMongoDB
metadata:
  name: minimal-cluster
spec:
  crVersion: 1.12.0
  image: percona/percona-server-mongodb:5.0.7-6
  allowUnsafeConfigurations: true
  upgradeOptions:
    apply: 5.0-recommended
    schedule: "0 2 * * *"
  secrets:
    users: minimal-cluster
  replsets:
  - name: rs0
    size: 1
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 3Gi
  sharding:
    enabled: true
    configsvrReplSet:
      size: 1
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi
    mongos:
      size: 1

  backup:
    enabled: true
    image: percona/percona-backup-mongodb:1.8.1
    serviceAccountName: percona-server-mongodb-operator
    pitr:
      enabled: false
      compressionType: gzip
      compressionLevel: 6
    storages:
      s3-us-west:
        type: s3
        s3:
          bucket: psmdbbackupBlaze
          credentialsSecret: my-cluster-name-backup-s3
          region: us-west-004
          endpointUrl: https://s3.us-west-004.backblazeb2.com/
#          prefix: ""
#          uploadPartSize: 10485760
#          maxUploadParts: 10000
#          storageClass: STANDARD
#          insecureSkipTLSVerify: false


The backup-s3.yaml file contains the key details needed to access the B2 cloud storage. Base64-encode the Key ID and Application Key (retrieved from Backblaze as mentioned above) as follows, for use inside backup-s3.yaml. The secret name my-cluster-name-backup-s3 must be unique; it is what the other yaml files use to refer to these credentials:

# First use base64 to encode your key id and access key. Use echo -n so the
# trailing newline does not become part of the encoded secret:
$ echo -n "key-sample" | base64 --wrap=0
XXXX==
$ echo -n "access-key-sample" | base64 --wrap=0
XXXXYYZZ==

$ cat deploy/backup-s3.yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-name-backup-s3
type: Opaque
data:
 AWS_ACCESS_KEY_ID: XXXX==
 AWS_SECRET_ACCESS_KEY: XXXXYYZZ==
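
Alternatively, you can skip the manual base64 handling entirely and let kubectl build the equivalent Secret object (same placeholder keys as above):

$ kubectl create secret generic my-cluster-name-backup-s3 \
    --from-literal=AWS_ACCESS_KEY_ID="key-sample" \
    --from-literal=AWS_SECRET_ACCESS_KEY="access-key-sample"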


Then deploy the cluster as mentioned below and deploy backup-s3.yaml as well.

$ k apply -f ./deploy/bundle.yaml
customresourcedefinition.apiextensions.k8s.io/perconaservermongodbs.psmdb.percona.com created
customresourcedefinition.apiextensions.k8s.io/perconaservermongodbbackups.psmdb.percona.com created
customresourcedefinition.apiextensions.k8s.io/perconaservermongodbrestores.psmdb.percona.com created
role.rbac.authorization.k8s.io/percona-server-mongodb-operator created
serviceaccount/percona-server-mongodb-operator created
rolebinding.rbac.authorization.k8s.io/service-account-percona-server-mongodb-operator created
deployment.apps/percona-server-mongodb-operator created

$ k apply -f ./deploy/cr-minimal.yaml
perconaservermongodb.psmdb.percona.com/minimal-cluster created

$ k apply -f ./deploy/backup-s3.yaml 
secret/my-cluster-name-backup-s3 created

After starting the Operator and applying the yaml files, the setup looks like this:

$ k get pods
NAME                                               READY   STATUS    RESTARTS   AGE
minimal-cluster-cfg-0                              2/2     Running   0          39m
minimal-cluster-mongos-0                           1/1     Running   0          70m
minimal-cluster-rs0-0                              2/2     Running   0          38m
percona-server-mongodb-operator-665cd69f9b-44tq5   1/1     Running   0          74m

$ k get svc
NAME                     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)     AGE
kubernetes               ClusterIP   10.96.0.1     <none>        443/TCP     76m
minimal-cluster-cfg      ClusterIP   None          <none>        27017/TCP   72m
minimal-cluster-mongos   ClusterIP   10.100.7.70   <none>        27017/TCP   72m
minimal-cluster-rs0      ClusterIP   None          <none>        27017/TCP   72m


Backup

After deploying the cluster, the DB is ready for backup at any time. Besides the scheduled backups, you can create a backup-custom.yaml file to take a backup whenever you need one (you will need to provide a unique backup name each time, or else the new backup will not start). Our backup yaml file looks like the one below:

$ cat deploy/backup/backup-custom.yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
  - delete-backup
  name: backup1
spec:
  clusterName: minimal-cluster
  storageName: s3-us-west
#  compressionType: gzip
#  compressionLevel: 6

Now load some data into the database and then start the backup:

$ k apply -f deploy/backup/backup-custom.yaml 
perconaservermongodbbackup.psmdb.percona.com/backup1 configured

The backup progress looks like the below:

$ k get perconaservermongodbbackup.psmdb.percona.com
NAME      CLUSTER           STORAGE      DESTINATION            STATUS      COMPLETED   AGE
backup1   minimal-cluster   s3-us-west   2022-09-08T03:21:58Z   requested               43s
$ k get perconaservermongodbbackup.psmdb.percona.com
NAME      CLUSTER           STORAGE      DESTINATION            STATUS      COMPLETED   AGE
backup1   minimal-cluster   s3-us-west   2022-09-08T03:22:19Z   requested               46s
$ k get perconaservermongodbbackup.psmdb.percona.com
NAME      CLUSTER           STORAGE      DESTINATION            STATUS    COMPLETED   AGE
backup1   minimal-cluster   s3-us-west   2022-09-08T03:22:19Z   running               49s

Here, if you have any issues with the backup, you can view the backup logs from the backup agent sidecar as follows:

$ k logs pod/minimal-cluster-rs0-0 -c backup-agent

To start another backup, edit backup-custom.yaml and change the backup name (here, name: backup2), then apply it:

$ k apply -f deploy/backup/backup-custom.yaml 
perconaservermongodbbackup.psmdb.percona.com/backup2 configured 

Monitor the backup process (you can use the -w option to watch the progress continuously). It should eventually show the status as ready:

$ k get perconaservermongodbbackup.psmdb.percona.com -w
NAME      CLUSTER           STORAGE      DESTINATION            STATUS   COMPLETED   AGE
backup1   minimal-cluster   s3-us-west   2022-09-08T03:22:19Z   ready    12m         14m
backup2   minimal-cluster   s3-us-west                                               8s
backup2   minimal-cluster   s3-us-west   2022-09-08T03:35:56Z   requested               21s
backup2   minimal-cluster   s3-us-west   2022-09-08T03:35:56Z   running                 26s
backup2   minimal-cluster   s3-us-west   2022-09-08T03:35:56Z   ready       0s          41s

In the bucket on Backblaze, you can see the backup files as they were uploaded by the backup:

Backup Files in B2 cloud Storage


Restore

You can restore the cluster from a backup into another similar deployment or into the same cluster. List the backups and restore one of them as follows. The restore-custom.yaml configuration names the backup to restore. If you are restoring into another deployment, you can instead include the backupSource section (commented out below for reference), which tells the restore process where to find the backup; in that case, make sure you also create the secret my-cluster-name-backup-s3 in the target cluster so the backup can be accessed.

$ cat deploy/backup/restore-custom.yaml 
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore2
spec:
  clusterName: minimal-cluster
  backupName: backup2
#  pitr:
#    type: date
#    date: YYYY-MM-DD HH:MM:SS
#  backupSource:
#    destination: s3://S3-BACKUP-BUCKET-NAME-HERE/BACKUP-DESTINATION
#    s3:
#      credentialsSecret: my-cluster-name-backup-s3
#      region: us-west-004
#      bucket: S3-BACKUP-BUCKET-NAME-HERE
#      endpointUrl: https://s3.us-west-004.backblazeb2.com/
#      prefix: ""
#    azure:
#      credentialsSecret: SECRET-NAME
#      prefix: PREFIX-NAME
#      container: CONTAINER-NAME

Listing the backup:

$ k get psmdb-backup
NAME      CLUSTER           STORAGE      DESTINATION            STATUS   COMPLETED   AGE
backup1   minimal-cluster   s3-us-west   2022-09-08T03:22:19Z   ready    3h5m        3h6m
backup2   minimal-cluster   s3-us-west   2022-09-08T03:35:56Z   ready    171m        172m
backup3   minimal-cluster   s3-us-west   2022-09-08T04:16:39Z   ready    130m        131m


To verify the restore process, I wrote some data into the collection vinodh.testData after the backup and before the restore, so the newly inserted document shouldn’t be there after the restore:

# Using mongosh from the mongo container to see the data
# Listing data from collection vinodh.testData
$ kubectl run -i --rm --tty mongo-client --image=mongo:5.0.7 --restart=Never -- bash -c "mongosh --host=10.96.30.92 --username=root --password=password --authenticationDatabase=admin --eval \"db.getSiblingDB('vinodh').testData.find()\" --quiet "
If you don't see a command prompt, try pressing enter.
[ { _id: ObjectId("631956cc70e60e9ed3ecf76d"), id: 1 } ]
pod "mongo-client" deleted

Inserting a document into it:

$ kubectl run -i --rm --tty mongo-client --image=mongo:5.0.7 --restart=Never -- bash -c "mongosh --host=10.96.30.92 --username=root --password=password --authenticationDatabase=admin --eval \"db.getSiblingDB('vinodh').testData.insert({id:2})\" --quiet "
If you don't see a command prompt, try pressing enter.
DeprecationWarning: Collection.insert() is deprecated. Use insertOne, insertMany, or bulkWrite.
{
  acknowledged: true,
  insertedIds: { '0': ObjectId("631980fe07180f860bd22534") }
}
pod "mongo-client" delete

Listing it again to verify:

$ kubectl run -i --rm --tty mongo-client --image=mongo:5.0.7 --restart=Never -- bash -c "mongosh --host=10.96.30.92 --username=root --password=password --authenticationDatabase=admin --eval \"db.getSiblingDB('vinodh').testData.find()\" --quiet "
If you don't see a command prompt, try pressing enter.
[
  { _id: ObjectId("631956cc70e60e9ed3ecf76d"), id: 1 },
  { _id: ObjectId("631980fe07180f860bd22534"), id: 2 }
]
pod "mongo-client" deleted

Running restore as follows:

$ k apply -f deploy/backup/restore-custom.yaml
perconaservermongodbrestore.psmdb.percona.com/restore2 created
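
The restore object can be watched the same way as the backups; assuming the psmdb-restore short name (analogous to the psmdb-backup one used above), that is:

$ k get psmdb-restore -w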

Now check the data in the vinodh.testData collection again to verify that the restore completed properly. The output below proves the collection was restored from the backup, as it lists only the record that existed at backup time:

$ kubectl run -i --rm --tty mongo-client --image=mongo:5.0.7 --restart=Never -- bash -c "mongosh --host=minimal-cluster-mongos --username=root --password=password --authenticationDatabase=admin --eval \"db.getSiblingDB('vinodh').testData.find()\" --quiet "
If you don't see a command prompt, try pressing enter.
[ { _id: ObjectId("631956cc70e60e9ed3ecf76d"), id: 1 } ]


Hope this helps you! Now you can try the same on your end and use Backblaze in production if it suits your requirements. I haven’t tested the network performance yet. If you have used Backblaze or similar S3-compatible storage for backups, you can share your experience with us in the comments.

The Percona Kubernetes Operators automate the creation, alteration, or deletion of members in your Percona Distribution for MySQL, MongoDB, or PostgreSQL environment.

Learn More About Percona Kubernetes Operators

Nov 24, 2021

Querying Archived RDS Data Directly From an S3 Bucket

A recommendation we often give to our customers is along the lines of “archive old data” to reduce your database size. There is a tradeoff between keeping all of our data online and archiving part of it to cold storage.

There could also be legal requirements to keep certain data online, or you might want to query old data occasionally without having to go through the hassle of restoring an old backup.

In this post, we will explore a very useful feature of AWS RDS/Aurora that allows us to export data to an S3 bucket and run SQL queries directly against it.

Archiving Data to S3

Let’s start by describing the steps we need to take to put our data into an S3 bucket in the required format, which is called Apache Parquet.

Amazon states the Parquet format is up to 2x faster to export and consumes up to 6x less storage in S3, compared to other text formats.

1. Create a snapshot of the database (or select an existing one)

2. Create a customer-managed KMS key to encrypt the exported data


3. Create an IAM role (e.g. exportrdssnapshottos3role)

4. Create an IAM policy for the export task (allowing the s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject, and s3:GetBucketLocation actions on the bucket) and assign it to the role. The role also needs a trust policy that allows the RDS export service to assume it:

{
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
            "Service": "export.rds.amazonaws.com"
          },
         "Action": "sts:AssumeRole"
       }
     ]
   }

5. Optional: Create an S3 bucket (or use an existing one)

6. Set a bucket policy to allow the IAM role to perform the export e.g.:

{
    "Version": "2012-10-17",
    "Id": "Policy1636727509941",
    "Statement": [
        {
            "Sid": "Stmt1636727502144",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789:role/service-role/exportrdssnapshottos3role"
            },
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::test-athena",
                "arn:aws:s3:::test-athena/exports/*"
            ]
        }
    ]
}

7. Export the snapshot to Amazon S3 as an Apache Parquet file. You can choose to export specific sets of databases, schemas, or tables.


Export the snapshot to Amazon S3

IAM role
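
If you prefer the CLI over the console, the export in step 7 can also be started with aws rds start-export-task; all ARNs below are placeholders matching the example policies:

$ aws rds start-export-task \
    --export-task-identifier log-archive-export \
    --source-arn arn:aws:rds:us-east-1:123456789:snapshot:mydb-snapshot \
    --s3-bucket-name test-athena --s3-prefix exports \
    --iam-role-arn arn:aws:iam::123456789:role/service-role/exportrdssnapshottos3role \
    --kms-key-id arn:aws:kms:us-east-1:123456789:key/KEY-ID-HERE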

Querying the Archived Data

When you need to access the data, you can use Amazon Athena to query the data directly from the S3 bucket.

1. Set a query result location

Amazon Athena

2. Create an external table in the Athena Query editor. We need to map the MySQL column types to equivalent types in Parquet:

CREATE EXTERNAL TABLE log_requests (
  id DECIMAL(20,0),
  name STRING, 
  is_customer TINYINT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://aurora-training-s3/exports/2021/log/'
tblproperties ("parquet.compression"="SNAPPY");

3. Now we can query the external table from the Athena Query editor

SELECT name, COUNT(*)
FROM log_requests
WHERE created_at >= CAST('2021-10-01' AS TIMESTAMP)
  AND created_at < CAST('2021-11-01' AS TIMESTAMP)
GROUP BY name;

Removing the Archived Data from the Database

After testing that we can query the desired data from the S3 bucket, it is time to delete archived data from the database for good. We can use the pt-archiver tool for this task.
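
A sketch of such a purge with pt-archiver (connection details and the cutoff date are illustrative), deleting in small batches to limit locking and replication lag:

$ pt-archiver --source h=localhost,D=mydb,t=log_requests \
    --where "created_at < '2021-11-01'" \
    --purge --limit 1000 --commit-each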

Having a smaller database has several benefits. To name a few: your backup/restore will be faster, you will be able to keep more data in memory so response times improve, you may even be able to scale down your server specs and save some money.

Complete the 2021 Percona Open Source Data Management Software Survey

Have Your Say!

Apr 4, 2018

AWS launches a cheaper single-zone version of its S3 storage service

AWS’ S3 storage service today launched a cheaper option for keeping data in the cloud — as long as developers are willing to give up a few 9s of availability in return for savings of up to 20 percent compared to the standard S3 price for applications that need infrequent access. The name of this new S3 tier: S3 One Zone-Infrequent Access.

S3 was among the first services AWS offered. Over the years, the company added a few additional tiers to the standard storage service. There’s the S3 Standard tier with the promise of 99.999999999 percent durability and 99.99 percent availability and S3 Standard-Infrequent Access with the same durability promise and 99.9 percent availability. There’s also Glacier for cold storage.

Data stored in the Standard and Standard-Infrequent access tiers is replicated across three or more availability zones. As the name implies, the main difference between those and the One Zone-Infrequent Access tier is that with this cheaper option, all the data sits in only one availability zone. It’s still replicated across different machines, but if that zone goes down (or is destroyed), you can’t access your data.

Because of this, AWS only promises 99.5 percent availability and only offers a 99 percent SLA. In terms of features and durability, though, there’s no difference between this tier and the other S3 tiers.

As Amazon CTO Werner Vogels noted in a keynote at the AWS Summit in San Francisco today, it’s the replication across availability zones that defines the storage cost. In his view, this new service should be used for data that is infrequently accessed but can be replicated.

An availability of 99.5 percent does mean that you should expect to experience a day or two per year where you can’t access your data, though. For some applications, that’s perfectly acceptable, and Vogels noted that he expects AWS customers to use this for secondary backup copies or for storing media files that can be replicated.
