Sep
15
2017
--

Percona Blog Poll Results: What Database Engine Are You Using to Store Time Series Data?

TIme Series Data

TIme Series DataIn this blog post, we talk about the results of Percona’s time series database poll “What Database Engine Are You Using to Store Time Series Data?”

Time series data is some of the most actionable data available when it comes to analyzing trends and making predictions. Simply put, time series data is data that is indexed not just by value, but by time as well – allowing you to view value changes over time as they occur. Obvious uses include the stock market, web traffic, user behavior, etc.

With the increasing number of smart devices in the Internet of Things (IoT), being able to track data over time is more and more important. With time series data, you can measure and make predictions on things like energy consumption, pH values, water consumption, data from environment-aware machines like smart cars, etc. The sensors used in IoT devices and systems generate huge amounts of time-series data.

A couple of months back, we ran a poll on what time series databases were being used by the community. We wanted to quickly report on the results from that poll.

First the results:

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

Here are some thoughts:

  • The fact that this blog started as a place exclusively for MySQL information probably explains why we skewed high with MySQL respondents – still that doesn’t mean it doesn’t reflect reality.
  • Elastic seems the most common after that, possibly to tie in with MySQL use.
  • InfluxDB as next popular. This suggests that Paul Dix’s chosen business model is “AOK” so to speak. It is unclear if people use the open source version, or outgrow it and switch to the commercial stuff.
  • We lumped together “general purpose NoSQL engine”, but in some cases examples like Cassandra are targeted at time series. Notice that KairosDB, which is built on top of Cassandra itself, is not as popular in our survey.
  • Prometheus is the canonical “not a time series database”, but still used as one. I have a feeling alongside Graphite, this is monitoring related.
  • ClickHouse time series is a new time series database and it is surprising that it gets such high rankings. It was also relatively unknown outside of its home country Russia, but now we are seeing uses at places like CloudFlare and more.

Thanks for participating in the poll. We’re still running a poll on operating systems, so don’t forget to register your responses. We’ll report on that poll soon, with a new one on the way shortly.

Feb
10
2017
--

Percona Blog Poll: What Database Engine Are You Using to Store Time Series Data?

TIme Series Data

TIme Series DataTake Percona’s blog poll on what database engine you are using to store time series data.

Time series data is some of the most actionable data available when it comes to analyzing trends and making predictions. Simply put, time series data is data that is indexed not just by value, but by time as well – allowing you to view value changes over time as they occur. Obvious uses include the stock market, web traffic, user behavior, etc.

With the increasing number of smart devices in the Internet of Things (IoT), being able to track data over time is more and more important. With time series data, you can measure and make predictions on things like energy consumption, pH values, water consumption, data from environment-aware machines like smart cars, etc. The sensors used in IoT devices and systems generate huge amounts of time-series data.

How is all of this data collected, segmented and stored? We’d like to hear from you: what database engine are you using to store time series data? Please take a few seconds and answer the following poll. Which are you using? Help the community learn what database engines help solve critical database issues. Please select from one to three database engines as they apply to your environment. Feel free to add comments below if your engine isn’t listed.

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

May
26
2015
--

Storing time-series data with MongoDB and TokuMX

Storing time-series data is a frequent pattern for databases – be it for logs or for any kind of monitoring. Such data has the following properties: records are inserted but also never updated, the insertion rate can be high and records are likely to expire after some time. MongoDB and TokuMX are both good fits because of their flexible schema feature. But how can we handle data expiration efficiently? Several options are available: capped collections, TTL collections and partitioning (TokuMX only), but they all have different features and performance profiles.

Summary

  • Capped collections: very good insert performance, but not eligible for sharding and hard to predict when documents will expire.
  • TTL collections: expiration date is easy to enforce and compatible with sharding, but purge or records is inefficient.
  • Partitioning: very good performance for inserts and purge, but not compatible with sharding and only available with TokuMX.

Capped collections

To create an app_stats capped collection with a size of 1GB, use the following command:

> db.createCollection( "app_stats", { capped: true, size: 1024*1024*1024 } )

MongoDB will write to this collections in a circular fashion: once allocated files are full, data at the beginning of the first file is being overwritten. This is very good to make sure your collection will never exceed the size you set. However predicting the size you will need to store 6 months of data can be tricky.

Another benefit of capped collection is that they keep data in insertion order. So you don’t need to add an extra index to sort data by insertion date: this is a good point for good write performance.

Can you update records in a capped collection? Yes but only as long as they don’t increase the original size of the document, but it is recommended to be light on updates as you might experience strange errors with secondaries in some rare cases.

Another limitation is that sharding is not supported.

TTL collections

TTL collections take a different approach: they are normal, but they have an index that has a special option. For instance:

> db.logs.createIndex({insertDate:1}, {expireAfterSeconds: 86400})

This index will make sure that records will be expired automatically after one day. Pretty neat!

And as a TTL collection is a regular collection, you can shard it if you need.

However the main limitation is the purge process: every minute, a background thread will look for documents that are expired and if it finds some, it will remove them. This is not really different from the application running a cron job every minute to remove old documents: it adds a constant purge workload that can be detrimental to insertion performance.

Note that TTL indexes do not work with capped collections.

Partitioning (TokuMX)

With TokuMX you can partition your data like you would with MySQL. For instance, if you want to create a collection partitioned on an insertDate field, you would use these commands:

> db.createCollection('stats_part',{primaryKey:{insertDate:1,_id:1}, partitioned: true})
> db.stats_part.addPartition({insertDate:ISODate("2015-05-31T23:59:59")})
> db.stats_part.addPartition({insertDate:ISODate("2015-06-30T23:59:59")})
[...]

The main benefit of partitioning is that expiring data is extremely simple and fast: remove the corresponding partition(s) with the dropPartition() function. Another nice property is that you can be very flexible regarding how large your partitions can grow. This could deserve a whole blog post, but let me give a quick example.

Suppose you want to expire data after 6 months. With regular MongoDB, indexes are using B-Trees so insertion is only fast if indexes fit in memory. So if you could create a partitioned collection, insertions would only be fast if the partition you are writing to (the last one) would fit in memory. Then you would probably have to create a partition for every day. With TokuMX, there is no such limitation and insertion performance will be as good if you have larger partitions like one partition per week or one partition per month.

Any drawback of partitioning? It involves some application overhead as you need to write the logic to periodically drop old partitions and create new ones. And sharding only has a limited support.

Conclusion

There is no one-size-fits-all solution when it comes to storing time-series data in MongoDB and TokuMX, but several options with their own benefits and trade-offs. Note that if your collection needs to be sharded, a TTL index is probably the only option. And if you are using TokuMX, partitioning is something to look at if you were planning a capped collection.

The post Storing time-series data with MongoDB and TokuMX appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com