May
05
2015
--

MongoDB’s flexible schema: How to fix write amplification

Being schemaless is one of the key features of MongoDB. On the bright side this allows developers to easily modify the schema of their collections without waiting for the database to be ready to accept a new schema. However schemaless is not free and one of the drawbacks is write amplification. Let’s focus on that topic.

Write amplification?

The link between schema and write amplification is not obvious at first sight. So let’s first look at a table in the relational world:

mysql> SELECT * FROM user LIMIT 2;
+----+-------+------------+-----------+-----------+----------------------------------+---------+-----------------------------------+------------+------------+
| id | login | first_name | last_name | city      | country                          | zipcode | address                           | password   | birth_year |
+----+-------+------------+-----------+-----------+----------------------------------+---------+-----------------------------------+------------+------------+
|  1 | arcu  | Vernon     | Chloe     | Paulista  | Cook Islands                     | 28529   | P.O. Box 369, 1464 Ac Rd.         | SSC44GZL5R |       1970 |
|  2 | quis  | Rogan      | Lewis     | Nashville | Saint Vincent and The Grenadines | H3T 3S6 | P.O. Box 636, 5236 Elementum, Av. | TSY29YRN6R |       1983 |
+----+-------+------------+-----------+-----------+----------------------------------+---------+-----------------------------------+------------+------------+

As all records have exactly the same fields, the field names are stored once in a separate file (.frm file). So the field names is metadata while the value of each field for each record is of course data.

Now let’s look at an equivalent collection in MongoDB:

{
        {
                "login": "arcu",
                "first_name": "Vernon",
                "last_name": "Chloe",
                "city": "Paulista",
                "country": "Cook Islands",
                "zipcode": "28529",
                "address": "P.O. Box 369, 1464 Ac Rd.",
                "password": "SSC44GZL5R",
                "birth_year": 1970
        },
        {
                "login": "quis",
                "first_name": "Rogan",
                "last_name": "Lewis",
                "city": "Nashville",
                "country": "Saint Vincent and The Grenadines",
                "zipcode": "H3T 3S6",
                "address": "P.O. Box 636, 5236 Elementum, Av.",
                "password": "TSY29YRN6R",
                "birth_year": 1983
        }
}

One difference with a table in the relational world is that MongoDB doesn’t know which fields each document will have. Therefore field names are data, not metadata and they must be stored with each document.

Then the question is: how large is the overhead in terms of disk space? To have an idea, I inserted 10M such records in an InnoDB table (adding an index on password and on birth_year to make the table look like a real table): the size on disk is around 1.4GB.

I also inserted the exact same 10M records in a MongoDB collection using the regular MMAPv1 storage engine, again adding an index on password and on birth_year, and this time the size on disk is … 2.97GB!

Of course it is not an apples-to-apples comparison as the InnoDB storage format and the MongoDB storage format are not identical. However a 100% difference is still significant.

Compression

One way to deal with write amplification is to use compression. With MongoDB 3.0, the WiredTiger storage engine is available and one of its benefits is compression (default algorithm: snappy). Percona TokuMX also has built-in compression using zlib by default.

Rebuilding the collection with 10M documents and the 2 indexes now gives the following results:
WiredTiger: 1.14GB
TokuMX: 736MB

This is a 2.5x to 4x data size reduction, pretty good!

WiredTiger also provides zlib compression and in this case the collection is only 691MB. However CPU usage is much higher compared to snappy so zlib will not be usable in all situations.

Conclusion

MongoDB schemaless design is attractive but it comes with several tradeoffs. Write amplification is one of them and using either WiredTiger with MongoDB 3.0 or Percona TokuMX is a very simple way to fix the issue.

The post MongoDB’s flexible schema: How to fix write amplification appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com