Nov
24
2021
--

Querying Archived RDS Data Directly From an S3 Bucket

querying archived rds data from s3 bucket.png

querying archived rds data from s3 bucket.pngA recommendation we often give to our customers is along the lines of “archive old data” to reduce your database size. There is a tradeoff between keeping all our data online and archiving part of it to cold storage.

There could also be legal requirements to keep certain data online, or you might want to query old data occasionally without having to go through the hassle of restoring an old backup.

In this post, we will explore a very useful feature of AWS RDS/Aurora that allows us to export data to an S3 bucket and run SQL queries directly against it.

Archiving Data to S3

Let’s start by describing the steps we need to take to put our data into an S3 bucket in the required format, which is called Apache Parquet.

Amazon states the Parquet format is up to 2x faster to export and consumes up to 6x less storage in S3, compared to other text formats.

1. Create a snapshot of the database (or select an existing one)

2. Create a customer-managed KMS key to encrypt the exported data

Archiving Data to S3

 

3. Create an IAM role (e.g. exportrdssnapshottos3role)

4. Create an IAM policy for the export task and assign it to the role

{
    "Version": "2012-10-17",
    "Id": "Policy1636727509941",
    "Statement": [
        {
            "Sid": "Stmt1636727502144",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789:role/service-role/exportrdssnapshottos3role"
            },
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::test-athena",
                "arn:aws:s3:::test-athena/exports/*"
            ]
        }
    ]
}

5. Optional: Create an S3 bucket (or use an existing one)

6. Set a bucket policy to allow the IAM role to perform the export e.g.:

{
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
            "Service": "export.rds.amazonaws.com"
          },
         "Action": "sts:AssumeRole"
       }
     ] 
   }

7. Export the snapshot to Amazon S3 as an Apache Parquet file. You can choose to export specific sets of databases, schemas, or tables

 

Export the snapshot to Amazon S3

IAM role

Querying the Archived Data

When you need to access the data, you can use Amazon Athena to query the data directly from the S3 bucket.

1. Set a query result location

Amazon Athena

2. Create an external table in Athena Query editor. We need to map the MySQL column types to equivalent types in Parquet

CREATE EXTERNAL TABLE log_requests (
  id DECIMAL(20,0),
  name STRING, 
  is_customer TINYINT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://aurora-training-s3/exports/2021/log/'
tblproperties ("parquet.compression"="SNAPPY");

3. Now we can query the external table from the Athena Query editor

SELECT name, COUNT(*)
FROM log_requests
WHERE created_at >= CAST('2021-10-01' AS TIMESTAMP)
  AND created_at < CAST('2021-11-01' AS TIMESTAMP)
GROUP BY name;

Removing the Archived Data from the Database

After testing that we can query the desired data from the S3 bucket, it is time to delete archived data from the database for good. We can use the pt-archiver tool for this task.

Having a smaller database has several benefits. To name a few: your backup/restore will be faster, you will be able to keep more data in memory so response times improve, you may even be able to scale down your server specs and save some money.

Complete the 2021 Percona Open Source Data Management Software Survey

Have Your Say!

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com