In a previous article, I tested a new feature of MongoDB 5.0: resharding. Today, I take a look at another new feature: the Time Series collections.
The Time Series collection is an astonishing new feature available in MongoDB 5.0. Based on the first tests I have done, Time Series support provides performance comparable to index usage on regular collections while saving a lot of disk and memory space. Aggregation pipelines, which are common queries you can run on time series data, can benefit even further.
Let’s start the tests.
What is a Time Series Database?
Generally speaking, a Time Series database is a specialized database designed to efficiently store data generated as a continuous stream of values associated with timestamps. The typical use case is storing data from sensor equipment that transmits data points at fixed intervals, but such databases now support a much wider range of applications.
Typical use cases are:
- IoT data
- Monitoring web services, applications, and infrastructure
- Sales forecasting
- Understanding financial trends
- Processing self-driving car data or other physical devices
A specialized Time Series database uses compression algorithms to minimize the space requirement and also provides access paths to dig more efficiently into the data. This improves the performance of retrieving data with time-range filters and of aggregating data, making it more efficient than a general-purpose relational database for this workload.
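To see intuitively why time series data compresses so well, here is a simplified delta-encoding sketch in plain JavaScript. This is only an illustration of the general idea, not MongoDB's actual algorithm: consecutive readings vary little, so storing deltas yields many small, repetitive numbers that a generic compressor handles far better than the raw values.

```javascript
// Encode a series as its first value followed by successive differences.
function deltaEncode(values) {
  const deltas = [values[0]];
  for (let i = 1; i < values.length; i++) {
    deltas.push(values[i] - values[i - 1]);
  }
  return deltas;
}

// Rebuild the original series by accumulating the deltas.
function deltaDecode(deltas) {
  const values = [deltas[0]];
  for (let i = 1; i < deltas.length; i++) {
    values.push(values[i - 1] + deltas[i]);
  }
  return values;
}

const temperatures = [21, 21, 22, 22, 21, 21, 21];
const encoded = deltaEncode(temperatures); // [21, 0, 1, 0, -1, 0, 0]
```

The encoded form is dominated by zeros and small integers, which is exactly the kind of input where compressors like zstd shine.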
Usually, the values of a Time Series shouldn't change once recorded: the workload is insert-only, and the entries are known as immutable data points. Once the data is stored, updates are really uncommon.
Another characteristic of Time Series is that every item should have a single value (a single temperature, a stock price, and so on).
Popular Time Series databases include InfluxDB, Prometheus, and Graphite, among many others. VictoriaMetrics in particular, a Prometheus-compatible alternative, is used in our Percona Monitoring and Management software.
The New Time Series Collections in MongoDB 5.0
MongoDB, as well as relational databases, has been widely used for years for storing temperature data from sensors, stock prices, and any other kind of unchanging data over time. MongoDB version 5.0 promises that this can be done more efficiently, so let’s take a look at how it works.
A Time Series collection appears as a regular collection and the operations you can do are exactly the same: insert, update, find, delete, aggregate. The main difference is behind the curtain. MongoDB stores data into an optimized storage format on insert. Compared to a normal collection, a Time Series is smaller and provides more query efficiency.
MongoDB treats Time Series collections as writable non-materialized views. The data is stored more efficiently, saving disk space, and an automatically created internal index orders the data by time. By default, the data is compressed using the zstd algorithm instead of snappy. The new compression provides a higher ratio and lower CPU usage, and it is well suited for time series data, where there is little variation from one document to the next. You can change the compression algorithm if needed, but it is not really recommended.
Unlike a regular collection, a Time Series collection is not created implicitly when you insert a document. You must create it explicitly.
Let’s do some tests.
Create a Time Series Collection for Storing Stock Prices
We need to use the createCollection() method, providing some parameters.
[direct: mongos] timeseries> db.createCollection(
"stockPrice1week", {
timeseries: {
timeField: "timestamp",
metaField: "metadata",
granularity: "minutes"
},
expireAfterSeconds: 604800
}
)
{ ok: 1 }
The name of the collection is stockPrice1week and the only required parameter is timeField. The other parameters are optional.
- timeField: the name of the field where the date is stored. It is automatically indexed and used for retrieving data.
- metaField: the field containing the metadata. It can be a simple scalar value or a more complex JSON object. It's optional, and it cannot be the _id or the same field as the timeField. For example, the metadata for a temperature sensor could be the code of the sensor, its type, its location, and so on.
- granularity: possible values are seconds, minutes, and hours; it defaults to seconds. Set it to the value that most closely matches the interval between two consecutive measurements: this helps MongoDB store the data more efficiently and improves query performance.
- expireAfterSeconds: documents are automatically deleted after the specified number of seconds, the same way as with a TTL index. If not specified, the documents never expire.
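As a quick sanity check on the retention value used above (the variable name here is just illustrative), 604800 seconds is exactly one week, which matches the collection name:

```javascript
// expireAfterSeconds for a one-week retention window:
// 7 days x 24 hours x 60 minutes x 60 seconds
const oneWeekInSeconds = 7 * 24 * 60 * 60;
console.log(oneWeekInSeconds); // 604800
```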
Let’s insert some random data for three stocks: Apple, Orange, and Banana. Data is collected once per minute.
[direct: mongos] timeseries> var stockPriceDate = ISODate("2021-10-13T00:00:00.000Z")
[direct: mongos] timeseries> var priceApple = 100
[direct: mongos] timeseries> var priceOrange = 50
[direct: mongos] timeseries> var priceBanana = 80
[direct: mongos] timeseries> for (let i = 1; i < 100000; i++) {
priceApple = priceApple + Math.random();
priceOrange = priceOrange + Math.random();
priceBanana = priceBanana + Math.random();
db.stockPrice1week.insertOne({ "timestamp": stockPriceDate, "metadata": { "stockName": "Apple", "currency": "Dollar" }, "stockPrice": priceApple });
db.stockPrice1week.insertOne({ "timestamp": stockPriceDate, "metadata": { "stockName": "Orange", "currency": "Dollar" }, "stockPrice": priceOrange });
db.stockPrice1week.insertOne({ "timestamp": stockPriceDate, "metadata": { "stockName": "Banana", "currency": "Euro" }, "stockPrice": priceBanana });
stockPriceDate = new Date(stockPriceDate.getTime() + 1000 * 60);
}
We can query to check the inserted documents:
[direct: mongos] timeseries> db.stockPrice1week.find().limit(3)
[
{
_id: ObjectId("6166df318f32e5d3ed304fc5"),
timestamp: ISODate("2021-10-13T00:00:00.000Z"),
metadata: { stockName: 'Apple', currency: 'Dollar' },
stockPrice: 100.6547271930824
},
{
_id: ObjectId("6166df318f32e5d3ed304fc6"),
timestamp: ISODate("2021-10-13T00:00:00.000Z"),
metadata: { stockName: 'Orange', currency: 'Dollar' },
stockPrice: 50.51709117468818
},
{
_id: ObjectId("6166df318f32e5d3ed304fc7"),
timestamp: ISODate("2021-10-13T00:00:00.000Z"),
metadata: { stockName: 'Banana', currency: 'Euro' },
stockPrice: 80.17611551979255
}
]
Check the Collection Size
Now, let's create a regular collection containing the exact same data.
[direct: mongos] timeseries> db.stockPrice1week.find().forEach(function (doc) {
db.stockPrice1week_regular.insertOne(doc);
})
Let’s check the total size of the two collections.
[direct: mongos] timeseries> db.stockPrice1week.stats().totalSize
5357568
[direct: mongos] timeseries> db.stockPrice1week_regular.stats().totalSize
21934080
As expected, the Time Series collection is about four times smaller than the regular one. Also, consider that the regular collection doesn't have any secondary index at the moment.
Query the Collections
Let’s run a simple query to find out the stock values for a specific timestamp. We test the query on both collections.
[direct: mongos] timeseries> db.stockPrice1week.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } )
[
{
_id: ObjectId("6166dfc68f32e5d3ed3100f5"),
timestamp: ISODate("2021-10-23T12:00:00.000Z"),
metadata: { stockName: 'Apple', currency: 'Dollar' },
stockPrice: 7636.864548363888
},
{
_id: ObjectId("6166dfc68f32e5d3ed3100f6"),
timestamp: ISODate("2021-10-23T12:00:00.000Z"),
metadata: { stockName: 'Orange', currency: 'Dollar' },
stockPrice: 7607.03756525094
},
{
_id: ObjectId("6166dfc68f32e5d3ed3100f7"),
timestamp: ISODate("2021-10-23T12:00:00.000Z"),
metadata: { stockName: 'Banana', currency: 'Euro' },
stockPrice: 7614.360031277444
}
]
[direct: mongos] timeseries> db.stockPrice1week_regular.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } )
[
{
_id: ObjectId("6166dfc68f32e5d3ed3100f5"),
timestamp: ISODate("2021-10-23T12:00:00.000Z"),
metadata: { stockName: 'Apple', currency: 'Dollar' },
stockPrice: 7636.864548363888
},
{
_id: ObjectId("6166dfc68f32e5d3ed3100f6"),
timestamp: ISODate("2021-10-23T12:00:00.000Z"),
metadata: { stockName: 'Orange', currency: 'Dollar' },
stockPrice: 7607.03756525094
},
{
_id: ObjectId("6166dfc68f32e5d3ed3100f7"),
timestamp: ISODate("2021-10-23T12:00:00.000Z"),
metadata: { stockName: 'Banana', currency: 'Euro' },
stockPrice: 7614.360031277444
}
]
We’ve got the same result, but what is important here is looking at the explain() to see the execution plan. Here is the explain() of the regular collection.
[direct: mongos] timeseries> db.stockPrice1week_regular.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } ).explain("executionStats")
{
...
winningPlan: {
stage: 'COLLSCAN',
filter: {
timestamp: { '$eq': ISODate("2021-10-23T12:00:00.000Z") }
},
direction: 'forward'
...
...
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 200,
totalKeysExamined: 0,
totalDocsExamined: 299997,
...
...
We didn’t create any secondary index, so the winning plan is a COLLSCAN, all documents must be examined. The query takes 200 milliseconds.
The following is the explain() of the Time Series collection instead.
[direct: mongos] timeseries> db.stockPrice1week.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } ).explain("executionStats")
{
...
...
executionStats: {
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 2,
totalKeysExamined: 0,
totalDocsExamined: 8,
executionStages: {
stage: 'COLLSCAN',
filter: {
'$and': [
{
_id: { '$lte': ObjectId("6173f940ffffffffffffffff") }
},
{
_id: { '$gte': ObjectId("6172a7c00000000000000000") }
},
{
'control.max.timestamp': {
'$_internalExprGte': ISODate("2021-10-23T12:00:00.000Z")
}
},
{
'control.min.timestamp': {
'$_internalExprLte': ISODate("2021-10-23T12:00:00.000Z")
}
}
]
},
...
...
Surprisingly, it is still a COLLSCAN, but with very different numbers. Only eight documents are examined, and the execution time is two milliseconds.
As already mentioned, the Time Series collection is a non-materialized view working as an abstraction layer. The actual data is stored in another system collection (system.buckets.stockPrice1week), where documents are saved in a slightly different format. It's not the goal of this article to dig into the internals; just keep in mind that the different storage format permits mongod to fetch only a few buckets of data instead of reading everything, even if the plan is flagged as a COLLSCAN. That's amazing.
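To make the bucketing idea concrete, here is a toy model in plain JavaScript. The control.min/control.max names mimic the fields visible in the explain() output above, but the logic is a simplified assumption for illustration, not MongoDB's actual implementation:

```javascript
// Group measurements sharing the same metadata into fixed-width time buckets,
// each tracking the min/max timestamp of the documents it contains.
function bucketize(measurements, bucketMs) {
  const buckets = new Map();
  for (const m of measurements) {
    const key = JSON.stringify(m.metadata) + "|" + Math.floor(m.timestamp / bucketMs);
    if (!buckets.has(key)) {
      buckets.set(key, { control: { min: m.timestamp, max: m.timestamp }, data: [] });
    }
    const b = buckets.get(key);
    b.control.min = Math.min(b.control.min, m.timestamp);
    b.control.max = Math.max(b.control.max, m.timestamp);
    b.data.push(m);
  }
  return [...buckets.values()];
}

// A time-range filter only needs to open buckets whose [min, max] range
// overlaps the filter; every other bucket is skipped without unpacking it.
function findInRange(buckets, from, to) {
  return buckets
    .filter((b) => b.control.max >= from && b.control.min < to)
    .flatMap((b) => b.data)
    .filter((m) => m.timestamp >= from && m.timestamp < to);
}
```

Even though findInRange() scans every bucket document (a "COLLSCAN" over buckets), it unpacks only the few that can possibly match, which is why totalDocsExamined is so small.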
What Happens if I Create an Index on the Regular Collection?
Let’s try.
[direct: mongos] timeseries> db.stockPrice1week_regular.createIndex( { "timestamp": 1 } )
timestamp_1
[direct: mongos] timeseries> db.stockPrice1week_regular.getIndexes()
[
{ v: 2, key: { _id: 1 }, name: '_id_' },
{ v: 2, key: { timestamp: 1 }, name: 'timestamp_1' }
]
[direct: mongos] timeseries> db.stockPrice1week_regular.find({"timestamp": ISODate("2021-10-23T12:00:00.000Z")}).explain("executionStats")
{
...
...
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { timestamp: 1 },
indexName: 'timestamp_1',
...
...
executionStats: {
nReturned: 3,
executionTimeMillis: 2,
totalKeysExamined: 3,
totalDocsExamined: 3,
...
Now the winning plan is an IXSCAN, the new index is used. Only three keys examined, three docs examined, and three docs returned. The query takes two milliseconds.
So, it is as fast as the Time Series collection; there is no big difference, and the order of magnitude is the same.
Notice, however, that this performance comes at the cost of a larger collection, since we had to create a secondary index.
[direct: mongos] timeseries> db.stockPrice1week_regular.stats().totalSize
25251840
[direct: mongos] timeseries> db.stockPrice1week.stats().totalSize
5357568
To get a comparable execution time, the regular collection is now almost five times larger than the Time Series one.
A Query with a Time Range Filter
Let’s test a different query looking for a range of timestamps. The following are the explain() outputs.
[direct: mongos] timeseries> db.stockPrice1week_regular.find( { "timestamp": { $gte: ISODate("2021-10-20T00:00:00Z"), $lt: ISODate("2021-10-20T23:59:59Z") } } ).explain("executionStats")
{
...
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { timestamp: 1 },
...
executionStats: {
nReturned: 4320,
executionTimeMillis: 7,
totalKeysExamined: 4320,
totalDocsExamined: 4320,
...
[direct: mongos] timeseries> db.stockPrice1week.find( { "timestamp": { $gte: ISODate("2021-10-20T00:00:00Z"), $lt: ISODate("2021-10-20T23:59:59Z") } } ).explain("executionStats")
{
...
...
winningPlan: {
stage: 'COLLSCAN',
filter: {
'$and': [
{
_id: { '$lt': ObjectId("6170ad7f0000000000000000") }
},
{
_id: { '$gte': ObjectId("616e0a800000000000000000") }
},
{
'control.max.timestamp': {
'$_internalExprGte': ISODate("2021-10-20T00:00:00.000Z")
}
},
{
'control.min.timestamp': {
'$_internalExprLt': ISODate("2021-10-20T23:59:59.000Z")
}
}
]
},
...
...
executionStats: {
executionSuccess: true,
nReturned: 6,
executionTimeMillis: 6,
totalKeysExamined: 0,
totalDocsExamined: 11,
...
The same as before: the execution time is basically identical for both queries. The main problem remains the size of the regular collection, which is significantly larger.
Apparently only six documents are returned by the Time Series collection, but that's not the case: if you actually run the query, you get 4,320 documents. The six documents reported by explain() are the buckets that must be read from the real collection behind the non-materialized view.
Aggregation Test
On our Time Series data, we would like to do some aggregation. This is a typical task: calculate averages over a period, find min and max values, and other kinds of statistics.
Let’s suppose we need to calculate the average stock price on a daily basis. We can use the following aggregation pipeline for example:
db.stockPrice1week.aggregate([
{
$project: {
date: {
$dateToParts: { date: "$timestamp" }
},
stockPrice: 1
}
},
{
$group: {
_id: {
date: {
year: "$date.year",
month: "$date.month",
day: "$date.day"
}
},
avgPrice: { $avg: "$stockPrice" }
}
}
])
[
{
_id: { date: { year: 2021, month: 12, day: 4 } },
avgPrice: 37939.782043249594
},
{
_id: { date: { year: 2021, month: 11, day: 22 } },
avgPrice: 29289.700949196136
},
{
_id: { date: { year: 2021, month: 10, day: 27 } },
avgPrice: 10531.347070537977
},
...
...
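To clarify what the two pipeline stages compute, here is an illustrative plain-JavaScript equivalent of the $project ($dateToParts) and $group ($avg) steps. It shows the client-side logic, not how MongoDB executes the pipeline; the function name is hypothetical:

```javascript
// Compute the average stockPrice per UTC calendar day, mirroring the
// aggregation pipeline: extract (year, month, day) as the group key,
// then average the prices within each group.
function dailyAverage(docs) {
  const groups = new Map();
  for (const d of docs) {
    const t = d.timestamp;
    // $dateToParts equivalent: the group key is (year, month, day) in UTC
    const key = `${t.getUTCFullYear()}-${t.getUTCMonth() + 1}-${t.getUTCDate()}`;
    if (!groups.has(key)) groups.set(key, { sum: 0, count: 0 });
    const g = groups.get(key);
    g.sum += d.stockPrice;
    g.count += 1;
  }
  // $avg equivalent: sum / count for each group
  return new Map([...groups].map(([k, g]) => [k, g.sum / g.count]));
}
```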
As usual, let’s have a look at the explain() of the aggregate against the two collections, just focusing on execution time and documents examined.
[direct: mongos] timeseries> db.stockPrice1week.explain("executionStats").aggregate([ { $project: { date: { $dateToParts: { date: "$timestamp" } }, stockPrice: 1 } }, { $group: { _id: { date: { year: "$date.year", month: "$date.month", day: "$date.day" } }, avgPrice: { $avg: "$stockPrice" } } }])
{
...
executionStats: {
executionSuccess: true,
nReturned: 300,
executionTimeMillis: 615,
totalKeysExamined: 0,
totalDocsExamined: 300,
executionStages: {
stage: 'COLLSCAN',
...
[direct: mongos] timeseries> db.stockPrice1week_regular.explain("executionStats").aggregate([ { $project: { date: { $dateToParts: { date: "$timestamp" } }, stockPrice: 1 } }, { $group: { _id: { date: { year: "$date.year", month: "$date.month", day: "$date.day" } }, avgPrice: { $avg: "$stockPrice" } } }])
{
...
executionStats: {
executionSuccess: true,
nReturned: 299997,
executionTimeMillis: 1022,
totalKeysExamined: 0,
totalDocsExamined: 299997,
executionStages: {
stage: 'PROJECTION_DEFAULT',
...
The aggregation pipeline runs about 40 percent faster with the Time Series collection, and the difference should become even more relevant the larger the collection gets.
Conclusion
MongoDB 5.0 is an interesting new version of the most popular document-based database, and new features like Time Series collections and resharding are amazing.
However, due to the many changes introduced in WiredTiger and the core server to facilitate the new features, MongoDB 5.0.x is still considered unstable. We do not recommend using it in production environments.
Check the documentation of Percona Server for MongoDB 5.0.3-2 (Release Candidate).