Querying Archived RDS Data Directly From an S3 Bucket

querying archived rds data from s3 bucket.png

querying archived rds data from s3 bucket.pngA recommendation we often give to our customers is along the lines of “archive old data” to reduce your database size. There is a tradeoff between keeping all our data online and archiving part of it to cold storage.

There could also be legal requirements to keep certain data online, or you might want to query old data occasionally without having to go through the hassle of restoring an old backup.

In this post, we will explore a very useful feature of AWS RDS/Aurora that allows us to export data to an S3 bucket and run SQL queries directly against it.

Archiving Data to S3

Let’s start by describing the steps we need to take to put our data into an S3 bucket in the required format, which is called Apache Parquet.

Amazon states the Parquet format is up to 2x faster to export and consumes up to 6x less storage in S3, compared to other text formats.

1. Create a snapshot of the database (or select an existing one)

2. Create a customer-managed KMS key to encrypt the exported data

Archiving Data to S3


3. Create an IAM role (e.g. exportrdssnapshottos3role)

4. Create an IAM policy for the export task and assign it to the role

    "Version": "2012-10-17",
    "Id": "Policy1636727509941",
    "Statement": [
            "Sid": "Stmt1636727502144",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789:role/service-role/exportrdssnapshottos3role"
            "Action": [
            "Resource": [

5. Optional: Create an S3 bucket (or use an existing one)

6. Set a bucket policy to allow the IAM role to perform the export e.g.:

     "Version": "2012-10-17",
     "Statement": [
         "Effect": "Allow",
         "Principal": {
            "Service": ""
         "Action": "sts:AssumeRole"

7. Export the snapshot to Amazon S3 as an Apache Parquet file. You can choose to export specific sets of databases, schemas, or tables


Export the snapshot to Amazon S3

IAM role

Querying the Archived Data

When you need to access the data, you can use Amazon Athena to query the data directly from the S3 bucket.

1. Set a query result location

Amazon Athena

2. Create an external table in Athena Query editor. We need to map the MySQL column types to equivalent types in Parquet

  id DECIMAL(20,0),
  name STRING, 
  is_customer TINYINT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
LOCATION 's3://aurora-training-s3/exports/2021/log/'
tblproperties ("parquet.compression"="SNAPPY");

3. Now we can query the external table from the Athena Query editor

FROM log_requests
WHERE created_at >= CAST('2021-10-01' AS TIMESTAMP)
  AND created_at < CAST('2021-11-01' AS TIMESTAMP)
GROUP BY name;

Removing the Archived Data from the Database

After testing that we can query the desired data from the S3 bucket, it is time to delete archived data from the database for good. We can use the pt-archiver tool for this task.

Having a smaller database has several benefits. To name a few: your backup/restore will be faster, you will be able to keep more data in memory so response times improve, you may even be able to scale down your server specs and save some money.

Complete the 2021 Percona Open Source Data Management Software Survey

Have Your Say!


AWS launches a cheaper single-zone version of its S3 storage service

AWS’ S3 storage service today launched a cheaper option for keeping data in the cloud — as long as developers are willing to give up a few 9s of availability in return for saving up to 20 percent compared to the standard S3 price for applications that need infrequent access. The name for this new S3 tier: S3 One Zone-Infrequent Access.

S3 was among the first services AWS offered. Over the years, the company added a few additional tiers to the standard storage service. There’s the S3 Standard tier with the promise of 99.999999999 percent durability and 99.99 percent availability and S3 Standard-Infrequent Access with the same durability promise and 99.9 percent availability. There’s also Glacier for cold storage.

Data stored in the Standard and Standard-Infrequent access tiers is replicated across three or more availability zones. As the name implies, the main difference between those and the One Zone-Infrequent Access tier is that with this cheaper option, all the data sits in only one availability zone. It’s still replicated across different machines, but if that zone goes down (or is destroyed), you can’t access your data.

Because of this, AWS only promises 99.5 percent availability and only offers a 99 percent SLA. In terms of features and durability, though, there’s no difference between this tier and the other S3 tiers.

As Amazon CTO Werner Vogels noted in a keynote at the AWS Summit in San Francisco today, it’s the replication across availability zones that defines the storage cost. In his view, this new service should be used for data that is infrequently accessed but can be replicated.

An availability of 99.5 percent does mean that you should expect to experience a day or two per year where you can’t access your data, though. For some applications, that’s perfectly acceptable, and Vogels noted that he expects AWS customers to use this for secondary backup copies or for storing media files that can be replicated.

Powered by WordPress | Theme: Aeros 2.0 by