Oct 20, 2021
--

MongoDB 5.0 Time Series Collections

In a previous article, I tested a new feature of MongoDB 5.0: resharding. Today, I take a look at another new feature: Time Series collections.

The Time Series collection is an astonishing new feature available in MongoDB 5.0. Based on my first tests, Time Series support provides performance comparable to index usage on regular collections while saving a lot of disk and memory space. Aggregation pipelines, which are common queries on time series data, benefit even more.

Let’s start the tests.

What is a Time Series Database?

Generally speaking, a Time Series database is a specialized database designed to efficiently store data generated as a continuous stream of values, each associated with a timestamp. The typical use case is storing data coming from sensors that transmit data points at fixed intervals, but these databases now support a much wider range of applications.

Typical use cases are:

  • IoT data
  • Monitoring web services, applications, and infrastructure
  • Sales forecasting
  • Understanding financial trends
  • Processing self-driving car data or other physical devices

A specialized Time Series database uses compression algorithms to minimize space requirements and provides access paths to dig into the data more efficiently. This improves the performance of retrieving data by time range and of aggregating it, making such databases more efficient than a general-purpose relational database for these workloads.

Usually, the values of a Time Series shouldn't change once recorded: they are INSERT-only, also known as immutable data points. Once the data is stored, updates are really uncommon.

Another characteristic of Time Series data is that every item should carry a single value (a single temperature, a stock price, and so on).

Popular Time Series databases include InfluxDB, Prometheus, and Graphite, among many others. VictoriaMetrics, a Prometheus-compatible alternative, is used in our Percona Monitoring and Management software.

The New Time Series Collections in MongoDB 5.0

MongoDB, like relational databases, has been widely used for years for storing temperature data from sensors, stock prices, and any other kind of data that doesn't change over time. MongoDB 5.0 promises to do this more efficiently, so let's take a look at how it works.

A Time Series collection appears as a regular collection, and the operations you can run are exactly the same: insert, update, find, delete, aggregate. The main difference is behind the curtain: MongoDB stores the data in an optimized format on insert. Compared to a normal collection, a Time Series collection is smaller and provides better query efficiency.

MongoDB treats Time Series collections as writable non-materialized views. The data is stored more efficiently, saving disk space, and an automatically created internal index orders the data by time. By default, the data is compressed using the zstd algorithm instead of snappy. The new compression provides a higher ratio and lower CPU requirements, and it is well suited for time series data, where there is little variation from one document to the next. You can change the compression algorithm if needed, but it is not really recommended.

Unlike a regular collection, a Time Series collection is not created implicitly when you insert a document. You must create it explicitly.

Let’s do some tests.

Create a Time Series Collection for Storing Stock Prices

We need to use the createCollection() method, providing some parameters.

[direct: mongos] timeseries> db.createCollection( 
"stockPrice1week", { 
  timeseries: { 
    timeField: "timestamp", 
    metaField: "metadata", 
    granularity: "minutes" 
  }, 
  expireAfterSeconds: 604800   
  }
)
{ ok: 1 }

The name of the collection is stockPrice1week and the only required parameter is timeField. The other parameters are optional.

timeField: the name of the field where the date is stored. This will be automatically indexed and used for retrieving data.

metaField: the field containing the metadata. It can be a simple scalar value or a more complex JSON object. It’s optional. It cannot be the _id or the same as the timeField. For example, the metadata for a temperature sensor could be the code of the sensor, the type, the location, and so on.

granularity: possible values are seconds, minutes, and hours. If not set, it defaults to seconds. Set it to the value that most closely matches the interval between two consecutive measurements; this helps MongoDB store data more efficiently and improves query performance.

expireAfterSeconds: documents are automatically deleted after the specified number of seconds, similar to a TTL index. If not specified, the documents never expire.
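As a quick sanity check (mine, not from the original post), the expireAfterSeconds value of 604800 used above is exactly one week, which matches the collection name:

```python
from datetime import timedelta

# 604800 seconds is exactly one week, which is why the collection
# is named stockPrice1week.
one_week = int(timedelta(weeks=1).total_seconds())
print(one_week)  # 604800
```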

Let’s insert some random data for three stocks: Apple, Orange, and Banana. Data is collected once per minute.

[direct: mongos] timeseries> var stockPriceDate = ISODate("2021-10-13T00:00:00.000Z")

[direct: mongos] timeseries> var priceApple = 100

[direct: mongos] timeseries> var priceOrange = 50

[direct: mongos] timeseries> var priceBanana = 80

[direct: mongos] timeseries> for (let i = 1; i < 100000; i++) {
  priceApple = priceApple + Math.random(); 
  priceOrange = priceOrange + Math.random(); 
  priceBanana = priceBanana + Math.random(); 
  db.stockPrice1week.insert({ "timestamp": stockPriceDate, "metadata": { "stockName": "Apple", "currency": "Dollar" }, "stockPrice": priceApple }); 
  db.stockPrice1week.insert({ "timestamp": stockPriceDate, "metadata": { "stockName": "Orange", "currency": "Dollar" }, "stockPrice": priceOrange }); 
  db.stockPrice1week.insert({ "timestamp": stockPriceDate, "metadata": { "stockName": "Banana", "currency": "Euro" }, "stockPrice": priceBanana }); 
  stockPriceDate = new Date(stockPriceDate.getTime() + 1000 * 60); 
}

We can query to check the inserted documents:

[direct: mongos] timeseries> db.stockPrice1week.find().limit(3)
[
  {
    _id: ObjectId("6166df318f32e5d3ed304fc5"),
    timestamp: ISODate("2021-10-13T00:00:00.000Z"),
    metadata: { stockName: 'Apple', currency: 'Dollar' },
    stockPrice: 100.6547271930824
  }, 
  {
    _id: ObjectId("6166df318f32e5d3ed304fc6"),
    timestamp: ISODate("2021-10-13T00:00:00.000Z"),
    metadata: { stockName: 'Orange', currency: 'Dollar' },
    stockPrice: 50.51709117468818
  },  
  {
    _id: ObjectId("6166df318f32e5d3ed304fc7"),
    timestamp: ISODate("2021-10-13T00:00:00.000Z"),
    metadata: { stockName: 'Banana', currency: 'Euro' },
    stockPrice: 80.17611551979255
  }
]

Check the Collection Size

Now, let’s create a regular collection having the same exact data.

[direct: mongos] timeseries> db.stockPrice1week.find().forEach(function (doc) {
  db.stockPrice1week_regular.insertOne(doc);
})


Let’s check the total size of the two collections.

[direct: mongos] timeseries> db.stockPrice1week.stats().totalSize
5357568
[direct: mongos] timeseries> db.stockPrice1week_regular.stats().totalSize
21934080

As expected, the Time Series collection is about four times smaller than the regular one. Also, consider that the regular collection doesn't have any secondary index at the moment.
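The "four times smaller" claim can be checked directly against the two totalSize values reported above:

```python
# totalSize values (bytes) reported by stats() above.
ts_size = 5_357_568        # Time Series collection
regular_size = 21_934_080  # regular collection, no secondary indexes yet

ratio = regular_size / ts_size
print(round(ratio, 2))  # 4.09, i.e. roughly four times smaller
```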

Query the Collections

Let’s run a simple query to find out the stock values for a specific timestamp. We test the query on both collections.

[direct: mongos] timeseries> db.stockPrice1week.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } )
[
  { 
    _id: ObjectId("6166dfc68f32e5d3ed3100f5"),
    timestamp: ISODate("2021-10-23T12:00:00.000Z"),
    metadata: { stockName: 'Apple', currency: 'Dollar' },
    stockPrice: 7636.864548363888
  },
  {
    _id: ObjectId("6166dfc68f32e5d3ed3100f6"),
    timestamp: ISODate("2021-10-23T12:00:00.000Z"),
    metadata: { stockName: 'Orange', currency: 'Dollar' },
    stockPrice: 7607.03756525094
  },
  {
    _id: ObjectId("6166dfc68f32e5d3ed3100f7"),
    timestamp: ISODate("2021-10-23T12:00:00.000Z"),
    metadata: { stockName: 'Banana', currency: 'Euro' },
    stockPrice: 7614.360031277444  
  }
]
[direct: mongos] timeseries> db.stockPrice1week_regular.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } )
[
  {
    _id: ObjectId("6166dfc68f32e5d3ed3100f5"),
    timestamp: ISODate("2021-10-23T12:00:00.000Z"),
    metadata: { stockName: 'Apple', currency: 'Dollar' },
    stockPrice: 7636.864548363888
  }, 
  {
    _id: ObjectId("6166dfc68f32e5d3ed3100f6"),
    timestamp: ISODate("2021-10-23T12:00:00.000Z"),
    metadata: { stockName: 'Orange', currency: 'Dollar' },
    stockPrice: 7607.03756525094
  },
  {
    _id: ObjectId("6166dfc68f32e5d3ed3100f7"),
    timestamp: ISODate("2021-10-23T12:00:00.000Z"),
    metadata: { stockName: 'Banana', currency: 'Euro' },
    stockPrice: 7614.360031277444
  }
]

We get the same result, but what matters here is the explain() output showing the execution plan. Here is the explain() of the regular collection.

[direct: mongos] timeseries> db.stockPrice1week_regular.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } ).explain("executionStats")
{
...
winningPlan: {
  stage: 'COLLSCAN',
  filter: {
    timestamp: { '$eq': ISODate("2021-10-23T12:00:00.000Z") }
  },
  direction: 'forward'
...
...
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 200,
totalKeysExamined: 0,
totalDocsExamined: 299997,
...
...

We didn't create any secondary index, so the winning plan is a COLLSCAN: all documents must be examined. The query takes 200 milliseconds.

The following is the explain() of the Time Series collection instead.

[direct: mongos] timeseries> db.stockPrice1week.find( { "timestamp": ISODate("2021-10-23T12:00:00.000Z") } ).explain("executionStats")
{
...
...
executionStats: {
  executionSuccess: true,
  nReturned: 3,
  executionTimeMillis: 2,
  totalKeysExamined: 0,
  totalDocsExamined: 8,
  executionStages: {
  stage: 'COLLSCAN',
    filter: {
      '$and': [
        {
          _id: { '$lte': ObjectId("6173f940ffffffffffffffff") }
        },
        {
          _id: { '$gte': ObjectId("6172a7c00000000000000000") }
        },
      {
      'control.max.timestamp': {
        '$_internalExprGte': ISODate("2021-10-23T12:00:00.000Z")
      }
    },
    {
      'control.min.timestamp': {
        '$_internalExprLte': ISODate("2021-10-23T12:00:00.000Z")
      }
    }
  ]
},
...
...

Surprisingly, it is also a COLLSCAN, but with very different numbers: only eight documents are examined, and the execution time is two milliseconds.

As already mentioned, a Time Series collection is a non-materialized view that works as an abstraction layer. The actual data is stored in a system collection (system.buckets.stockPrice1week) where documents are saved in a slightly different, bucketed format. Digging into the internals is beyond the scope of this article; just keep in mind that the storage format lets mongod fetch only a few buckets of data instead of reading everything, even if the plan is flagged as a COLLSCAN. That's amazing.
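To build an intuition for why a COLLSCAN can touch so few documents, here is a toy model of bucket pruning. It is an illustrative sketch only, not MongoDB's actual implementation; the bucket layout is hypothetical, but the idea mirrors the control.min.timestamp / control.max.timestamp fields visible in the explain() output:

```python
from datetime import datetime, timedelta

# Hypothetical hourly buckets over 10 days, each carrying the min/max
# timestamp of its measurements, mimicking control.min.timestamp and
# control.max.timestamp.
start = datetime(2021, 10, 13)
buckets = [
    {"min": start + timedelta(hours=h),
     "max": start + timedelta(hours=h, minutes=59)}
    for h in range(240)
]

query_ts = datetime(2021, 10, 20, 12, 0)

# A bucket is opened only if its [min, max] range can contain query_ts;
# every other bucket is skipped without reading its measurements.
candidates = [b for b in buckets if b["min"] <= query_ts <= b["max"]]
print(len(buckets), len(candidates))  # 240 buckets total, only 1 examined
```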

What Happens if I Create an Index on the Regular Collection?

Let’s try.

[direct: mongos] timeseries> db.stockPrice1week_regular.createIndex( { "timestamp": 1 } )
timestamp_1
[direct: mongos] timeseries> db.stockPrice1week_regular.getIndexes()
[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  { v: 2, key: { timestamp: 1 }, name: 'timestamp_1' }
]
[direct: mongos] timeseries> db.stockPrice1week_regular.find({"timestamp": ISODate("2021-10-23T12:00:00.000Z")}).explain("executionStats")
{
...
...
winningPlan: {
  stage: 'FETCH',
  inputStage: {
  stage: 'IXSCAN',
  keyPattern: { timestamp: 1 },
  indexName: 'timestamp_1',
...
...
executionStats: {
  nReturned: 3,
  executionTimeMillis: 2,
  totalKeysExamined: 3,
  totalDocsExamined: 3,
...

Now the winning plan is an IXSCAN; the new index is used. Only three keys examined, three documents examined, and three documents returned. The query takes two milliseconds.

So, it is as fast as the Time Series collection. There is no big difference; the order of magnitude is the same.

Also notice that the same performance comes at the cost of a larger collection, because we have created a secondary index.

[direct: mongos] timeseries> db.stockPrice1week_regular.stats().totalSize
25251840
[direct: mongos] timeseries> db.stockPrice1week.stats().totalSize
5357568

To get a comparable execution time, the regular collection is now almost five times larger than the Time Series collection.

A Query with a Time Range Filter

Let’s test a different query looking for a range of timestamps. The following are the explain() outputs.

[direct: mongos] timeseries> db.stockPrice1week_regular.find( { "timestamp": { $gte: ISODate("2021-10-20T00:00:00Z"), $lt: ISODate("2021-10-20T23:59:59Z") } } ).explain("executionStats")
{
...
winningPlan: {
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { timestamp: 1 },
...
executionStats: {
  nReturned: 4320,
  executionTimeMillis: 7,
  totalKeysExamined: 4320,
  totalDocsExamined: 4320,
...

 

[direct: mongos] timeseries> db.stockPrice1week.find( { "timestamp": { $gte: ISODate("2021-10-20T00:00:00Z"), $lt: ISODate("2021-10-20T23:59:59Z") } } ).explain("executionStats")
{
...
...
winningPlan: {
  stage: 'COLLSCAN',
  filter: {
    '$and': [
    {
      _id: { '$lt': ObjectId("6170ad7f0000000000000000") }
    },
    {
      _id: { '$gte': ObjectId("616e0a800000000000000000") }
    },
    {
      'control.max.timestamp': {
        '$_internalExprGte': ISODate("2021-10-20T00:00:00.000Z")
      }
    },
    {
      'control.min.timestamp': {
        '$_internalExprLt': ISODate("2021-10-20T23:59:59.000Z")
      }
    }
  ]
},
...
...
executionStats: {
  executionSuccess: true,
  nReturned: 6,
  executionTimeMillis: 6,
  totalKeysExamined: 0,
  totalDocsExamined: 11,
...

Same as before: the execution time is basically identical for both queries. The main remaining problem is the size of the regular collection, which is significantly larger.

Apparently only six documents are returned by the Time Series collection, but that's not really the case: if you execute the query for real, you'll get 4,320 documents. The six documents mentioned by explain() are the buckets that must be returned by the real collection underneath the non-materialized view.
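The 4,320 figure is consistent with the data set: three stocks, each sampled once per minute, over one full day:

```python
# The query range covers one day of data, sampled once per minute,
# for three stocks (Apple, Orange, Banana).
stocks = 3
minutes_per_day = 24 * 60

docs_returned = stocks * minutes_per_day
print(docs_returned)  # 4320
```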

Aggregation Test

On our Time Series data, we would like to do some aggregation. This is a typical task: calculate averages over a period, find min and max values, and other kinds of statistics.

Let’s suppose we need to calculate the average stock price on a daily basis. We can use the following aggregation pipeline for example:

db.stockPrice1week.aggregate([
{
  $project: {
    date: {
      $dateToParts: { date: "$timestamp" }
    },
    stockPrice: 1
  }
},
{
  $group: {
    _id: {
      date: {
        year: "$date.year",
        month: "$date.month",
        day: "$date.day"
      }
    },
    avgPrice: { $avg: "$stockPrice" }
  }
}
])
[
{
_id: { date: { year: 2021, month: 12, day: 4 } },
avgPrice: 37939.782043249594
},
{
_id: { date: { year: 2021, month: 11, day: 22 } },
avgPrice: 29289.700949196136
},
{
_id: { date: { year: 2021, month: 10, day: 27 } },
avgPrice: 10531.347070537977
},
...
...
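To reason about what the $dateToParts/$group pipeline computes, here is a plain-Python equivalent over hypothetical documents shaped like the ones inserted earlier. This is an illustration of the grouping logic, not how MongoDB executes it:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical documents: one price per minute starting 2021-10-13T00:00Z.
docs = [
    {"timestamp": datetime(2021, 10, 13) + timedelta(minutes=m),
     "stockPrice": 100.0 + m}
    for m in range(3000)
]

# Hand-rolled $project ($dateToParts) + $group ($avg):
sums = defaultdict(lambda: [0.0, 0])
for d in docs:
    ts = d["timestamp"]
    key = (ts.year, ts.month, ts.day)  # the (year, month, day) grouping key
    sums[key][0] += d["stockPrice"]
    sums[key][1] += 1

avg_price = {k: total / count for k, (total, count) in sums.items()}
print(len(avg_price))  # 3: the 3000 minutes span three calendar days
```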

As usual, let’s have a look at the explain() of the aggregate against the two collections, just focusing on execution time and documents examined.

[direct: mongos] timeseries> db.stockPrice1week.explain("executionStats").aggregate([ { $project: { date: { $dateToParts: { date: "$timestamp" } }, stockPrice: 1 } }, { $group: { _id: { date: { year: "$date.year", month: "$date.month", day: "$date.day" } }, avgPrice: { $avg: "$stockPrice" } } }])
{
...
executionStats: {
  executionSuccess: true,
  nReturned: 300,
  executionTimeMillis: 615,
  totalKeysExamined: 0,
  totalDocsExamined: 300,
  executionStages: {
  stage: 'COLLSCAN',
...

[direct: mongos] timeseries> db.stockPrice1week_regular.explain("executionStats").aggregate([ { $project: { date: { $dateToParts: { date: "$timestamp" } }, stockPrice: 1 } }, { $group: { _id: { date: { year: "$date.year", month: "$date.month", day: "$date.day" } }, avgPrice: { $avg: "$stockPrice" } } }])
{
...
executionStats: {
  executionSuccess: true,
  nReturned: 299997,
  executionTimeMillis: 1022,
  totalKeysExamined: 0,
  totalDocsExamined: 299997,
  executionStages: {
    stage: 'PROJECTION_DEFAULT',
...

The aggregation pipeline runs about 40 percent faster on the Time Series collection, and the advantage should grow with the size of the collection.
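The 40 percent figure follows directly from the two executionTimeMillis values above:

```python
# executionTimeMillis values from the two explain() outputs above.
ts_ms = 615        # Time Series collection
regular_ms = 1022  # regular collection

speedup = (regular_ms - ts_ms) / regular_ms
print(f"{speedup:.0%}")  # 40%
```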

Conclusion

MongoDB 5.0 is an interesting new version of the most popular document-based database, and new features like Time Series collections and resharding are amazing.

Anyway, due to many core changes to WiredTiger and the core server introduced to facilitate new features, MongoDB 5.0.x is still unstable. We do not recommend using it for production environments.

Check the documentation of Percona Server for MongoDB 5.0.3-2 (Release Candidate).


Sep 05, 2013
--

TokuDB vs InnoDB in timeseries INSERT benchmark

This post is a continuation of my research into TokuDB's storage engine to understand if it is suitable for timeseries workloads.

While loading an empty table with LOAD DATA INFILE shows great results for TokuDB, it is more interesting to see some realistic workloads.

So this time let’s take a look at the INSERT benchmark.

I am going to insert data in 16 parallel threads into the table from the previous post:

CREATE TABLE `sensordata` (
  `ts` int(10) unsigned NOT NULL DEFAULT '0',
  `sensor_id` int(10) unsigned NOT NULL,
  `data1` double NOT NULL,
  `data2` double NOT NULL,
  `data3` double NOT NULL,
  `data4` double NOT NULL,
  `data5` double NOT NULL,
  `cnt` int(10) unsigned NOT NULL,
  PRIMARY KEY (`sensor_id`,`ts`)
)

The INSERTs are bulk inserts with sequentially increasing ts and with sensor_id ranging from 1 to 1000.

While the inserts are not fully sequential (the primary key is (sensor_id, ts)), the data set is small enough to keep the workload in memory, so I do not expect performance degradation when the data exceeds memory. This plays in favor of InnoDB, as TokuDB is known to perform worse in CPU-bound benchmarks.

The benchmark executes 1 million events, each event inserting 1,000 records in bulk; when finished, we should have about 1 billion records in the table.
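The expected row count is simple arithmetic:

```python
events = 1_000_000         # benchmark events
records_per_event = 1_000  # rows per bulk insert

total_records = events * records_per_event
print(total_records)  # 1000000000, i.e. about 1 billion rows
```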

So let's see how InnoDB performs (compressed with 8K pages vs. not compressed).
Throughput (more is better):
[graph: innodb-thr]

Response time (log10 scale on the Y axis) (less is better):
[graph: innodb_resp]

So InnoDB comes in with the following numbers:

  • InnoDB no compression. Averages at 350 inserts/sec with response time 80-100ms per transaction. The final table size is 82GB
  • InnoDB 8K compression. Throughput is 130 inserts/sec, response time 250ms. Table size is 60GB

Now, the compression ratio is quite poor because I used a uniform distribution for the values of the data1-data5 columns, and uniform data may not compress well. In a real case I would expect many more repeating values, so I am going to re-test with a Pareto (Zipfian) distribution.

For TokuDB (I tested the tokudb_fast and tokudb_small row formats):

Throughput (more is better):
[graph: tokudb_thr]

Response time (log10 scale on the Y axis) (less is better):
[graph: tokudb_resp]

TokuDB observations:

  • After an initial warmup, TokuDB shows quite inconsistent performance with both the tokudb_fast and tokudb_small formats
  • For tokudb_fast, throughput tops out at ~150 inserts/sec with a 95% response time of ~160 ms. However, there are periodic stalls when throughput drops almost to 0 and response time jumps to 10 sec (!) per transaction.
  • For tokudb_small, throughput is even less stable, jumping around 100 inserts/sec, and response time starts at 300 ms per transaction, with stalls of up to 30 sec per transaction

File sizes for TokuDB: tokudb_fast: 50GB, tokudb_small: 45GB. Again, I attribute the poor compression ratio to the uniform distribution. If we switch to Pareto, the file size for tokudb_fast is 21GB, and for tokudb_small, 13GB.
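The intuition that uniformly distributed values compress poorly while skewed (Pareto-like) values compress much better can be demonstrated with a toy experiment. Here zlib stands in for the storage engines' block compression, and the numbers are illustrative only, not the benchmark itself:

```python
import random
import zlib

random.seed(42)
n = 100_000

# Uniform values over a wide range: almost every value is distinct.
uniform = [round(random.uniform(0, 1_000_000)) for _ in range(n)]
# Pareto-distributed values concentrate near the minimum, so values
# repeat a lot, which is much friendlier to compression.
pareto = [round(random.paretovariate(1.5)) for _ in range(n)]

z_uniform = zlib.compress(repr(uniform).encode())
z_pareto = zlib.compress(repr(pareto).encode())
print(len(z_pareto) < len(z_uniform))  # True: skewed data compresses better
```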

If we zoom in to a 900-sec timeframe, we can see the periodic behavior of TokuDB:
[graph: zoom]

I consider these stalls in TokuDB severe, and I cannot recommend using it in production under such workload conditions until the problem is fixed.

You can find the scripts for the timeseries benchmark for sysbench v0.5 at
https://github.com/percona/sysbench-scripts

Software versions: Percona Server 5.6-RC3 for InnoDB, and mariadb-5.5.30-tokudb-7.0.4 for TokuDB.

UPDATE (5-Sep-2013):
By popular request, I am updating the post with the following information:
TokuDB throughput (tokudb_small row format) with a Pareto distribution, for two cases:

  • 1. PRIMARY KEY (sensor_id,ts) (on graph: tokudb_small)
  • 2. PRIMARY KEY (ts,sensor_id), KEY (sensor_id,ts) (on graph: tokudb_small_key)

Throughput in this case:
[graph: tokudb_key]

We can see that the top throughput for tokudb_small_key is lower than for tokudb_small, but there is also less variance in the throughput.

The my.cnf files.
For InnoDB

[mysqld]
# gdb
log-error=error.log
innodb_file_per_table = true
innodb_data_file_path = ibdata1:100M:autoextend
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 40G
innodb_buffer_pool_instances=1
innodb_file_format = Barracuda
innodb_checksum_algorithm = crc32
innodb_log_file_size = 4G
innodb_log_files_in_group = 2
#innodb_log_block_size=4096
#####plugin options
innodb_read_io_threads = 16
innodb_write_io_threads = 4
innodb_io_capacity = 4000
#not innodb options (fixed)
port = 3306
back_log = 50
max_connections = 2000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
thread_stack = 192K
tmp_table_size = 64M
server-id = 10
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
socket=/var/lib/mysql/mysql.sock
user=root

for TokuDB (pretty much defaults)

[mysqld]
gdb
skip-innodb
#not innodb options (fixed)
port = 3306
back_log = 50
max_connections = 2000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
ft_min_word_len = 4
#default_table_type = InnoDB
thread_stack = 192K
tmp_table_size = 64M
server-id = 10
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
#myisam_max_extra_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
socket=/var/lib/mysql/mysql.sock
user=root

The post TokuDB vs InnoDB in timeseries INSERT benchmark appeared first on MySQL Performance Blog.
