In this blog post, we’ll discuss how shorter field names impact performance and document size in MongoDB.
The MongoDB Manual Developer Notes state:
Shortening field names reduces expressiveness and does not provide considerable benefit for larger documents and where document overhead is not of significant concern. Shorter field names do not lessen the size of indexes, because indexes have a predefined structure. In general, it is not necessary to use short field names.
This is a pretty one-sided statement, and we should be careful not to fall into this trap. At first glance, you might think “Oh that makes sense due to compression!” However, compression is only one part of the story. When we consider the size of a single document, we need to consider several things:
- Size of the data in the application memory
- Size over the network
- Size in the replication log
- Size in memory in the cache
- Amount of data being sent to the compressor
- Size on disk*
- Size in the journal files*
As you can see, this is a pretty expansive list, and this is just for consideration on field naming – we haven’t even gotten to using the right data types for the value yet.
Further, only the last two items in the list (starred with “*”) touch any part of the system that has compression (to date). Put another way, compression covers only about a quarter of the discussion about field names. MongoDB Inc.’s comment is trying to sidestep roughly three-quarters of the conversation.
To ensure an even debate, I want to break size down into two major areas: Field Optimization and Value Optimization. They both touch on all of the areas listed above, with the exception of sorting, which is affected only by value optimization.
Field Optimization
When we talk about field optimization, we are considering only the use of smaller field names. This might seem obvious, but because your database field names become object properties in your application code, developers want them to be expressive (i.e., longer and more space-intensive).
Consider the following:
```javascript
locations = [];
for (i = 1; i <= 1000; i++) {
    locations.push({ longitude: 28.2211, latitude: 128.2828 });
}

devices = [];
for (i = 1; i <= 10; i++) {
    devices.push({
        name: "iphone6",
        last_ping: ISODate(),
        version: 8.1,
        security_pass: true,
        last_10_locations: locations.slice(10, 20)
    });
}

x = {
    _id: ObjectId(),
    first_name: "David",
    last_name: "Murphy",
    birthdate: "Aug 16 2080",
    address: "123 nowhere drive Nonya, TX, USA , 78701",
    phone_number1: "512-555-5555",
    phone_number2: "512-555-5556",
    known_locations: locations,
    last_checkin: ISODate(),
    devices: devices
};

> Object.bsonsize(x)
54879
```
Seems pretty standard, but wow! That’s 54.8k per document! Now let’s consider another format:
```javascript
locations2 = [];
for (i = 1; i <= 1000; i++) {
    locations2.push({ lon: 28.2211, lat: 128.2828 });
}

devices2 = [];
for (i = 1; i <= 10; i++) {
    devices2.push({
        n: "iphone6",
        lp: ISODate(),
        v: 8.1,
        sp: true,
        l10: locations2.slice(10, 20)
    });
}

y = {
    _id: ObjectId(),
    fn: "David",
    ln: "Murphy",
    bd: "Aug 16 2080",
    a: "123 nowhere drive Nonya, TX, USA , 78701",
    pn1: "512-555-5555",
    pn2: "512-555-5556",
    kl: locations2,
    lc: ISODate(),
    d: devices2
};

> Object.bsonsize(y)
41392
> Object.bsonsize(y)/Object.bsonsize(x)
0.754241148708978
```
This minor change shrinks the document by almost 25%, without changing any actual data. I know you can already see field names like kl or l10 and are wondering, “What the heck is that?” This is where some clever tricks in the application code can come in.
You can keep a mapping collection in MongoDB, or keep the mapping in your application code, so that self.l10 surfaces as self.last_10_locations in the code. Some people go so far as to use constants (for example, defining LAST_10_LOCATIONS and accessing the field via self.get_value(LAST_10_LOCATIONS)) to keep the short names out of the business logic.
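A minimal sketch of such a mapping layer might look like the following. All of the helper names here (FIELD_MAP, expand, shrink) are hypothetical, for illustration only; the idea is simply that short names go to MongoDB while long names are what the rest of the application sees.

```javascript
// Hypothetical mapping of short stored names to expressive names.
const FIELD_MAP = {
  fn: "first_name",
  ln: "last_name",
  l10: "last_10_locations",
};

// Reverse lookup, built once, for writing documents back.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([short, long]) => [long, short])
);

// Expand a document fetched from MongoDB into long, expressive names.
function expand(doc) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    out[FIELD_MAP[key] || key] = value;
  }
  return out;
}

// Shrink an application object back to short names before saving.
function shrink(doc) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    out[REVERSE_MAP[key] || key] = value;
  }
  return out;
}
```

With a layer like this, only the persistence code ever sees l10; everything else works with last_10_locations.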
Value Optimization
Using the same example, let’s assume we want to improve the values we store. We know we will always pull a user by their _id, or fetch the most recent people to check in. To help optimize this further, let us assume “x” is still our main document:
```javascript
locations = [];
for (i = 1; i <= 1000; i++) {
    locations.push({ longitude: 28.2211, latitude: 128.2828 });
}

devices = [];
for (i = 1; i <= 10; i++) {
    devices.push({
        name: "iphone6",
        last_ping: ISODate(),
        version: 8.1,
        security_pass: true,
        last_10_locations: locations.slice(10, 20)
    });
}

x = {
    _id: ObjectId(),
    first_name: "David",
    last_name: "Murphy",
    birthdate: "Aug 16 2080",
    address: "123 nowhere drive Nonya, TX, USA , 78701",
    phone_number1: "512-555-5555",
    phone_number2: "512-555-5556",
    known_locations: locations,
    last_checkin: ISODate(),
    devices: devices
};

> Object.bsonsize(x)
54879
```
But now, instead of optimizing field names, we want to optimize the values:
```javascript
locations = [];
for (i = 1; i <= 1000; i++) {
    locations.push({ longitude: 28.2211, latitude: 128.2828 });
}

devices = [];
for (i = 1; i <= 10; i++) {
    devices.push({
        name: "iphone6",
        last_ping: ISODate(),
        version: 8.1,
        security_pass: true,
        last_10_locations: locations.slice(10, 20)
    });
}

z = {
    _id: ObjectId(),
    first_name: "David",
    last_name: "Murphy",
    birthdate: ISODate("2080-08-16T00:00:00Z"),
    address: "123 nowhere drive Nonya, TX, USA , 78701",
    phone_number1: 5125555555,
    phone_number2: 5125555556,
    known_locations: locations,
    last_checkin: ISODate(),
    devices: devices
};

> Object.bsonsize(z)
54853
```
In this example, we changed the phone numbers to integers and used the BSON date type for the birthdate (as the devices documents already do for last_ping). The savings were much smaller than earlier, coming in at only 26 bytes, but this can have a significant impact when multiplied out across many fields and documents. If we had started this example quoting the floats, as many people do, we would see more of a difference. Always watch out for numbers and dates stored as strings: they almost always waste space.
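The 26 bytes can be accounted for by hand. In BSON, a string value carries a 4-byte length prefix, its UTF-8 bytes, and a trailing null, while a numeric or date value is a fixed 8 bytes. The back-of-envelope accounting below is my own (it deliberately ignores the field name and type byte, which are identical either way):

```javascript
// Approximate BSON size of a string *value*:
// 4-byte length prefix + UTF-8 bytes + trailing null.
function stringValueBytes(s) {
  return 4 + Buffer.byteLength(s, "utf8") + 1;
}

// A double, 64-bit integer, or UTC datetime value is a fixed 8 bytes.
const NUMBER_VALUE_BYTES = 8;

// Two phone numbers: "512-555-5555" is 17 bytes as a string vs 8 as a number.
const phoneSaved = 2 * (stringValueBytes("512-555-5555") - NUMBER_VALUE_BYTES); // 18

// The birthdate: "Aug 16 2080" is 16 bytes as a string vs 8 as a BSON date.
const dateSaved = stringValueBytes("Aug 16 2080") - NUMBER_VALUE_BYTES; // 8

const total = phoneSaved + dateSaved; // 26, matching the measured difference
```

That the hand count lands exactly on the 26 bytes Object.bsonsize() reported is a nice sanity check that nothing else in the document changed.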
When you combine both sets of savings you have:
(54879 - 41392) + 26 = 13513 bytes saved, leaving a combined document of 41366 bytes.
That’s right: roughly 24.6% less data in memory, on the network, and for the application to parse with its CPU! Easy wins to reduce your resource needs, and to make the COO happier.
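The combined arithmetic is easy to double-check, using the sizes measured with Object.bsonsize() in the examples above:

```javascript
// Sizes from the examples above.
const original = 54879;    // x: long field names, string values
const shortNames = 41392;  // y: short field names only
const valueSavings = 26;   // z vs. x: numeric phones + BSON date

// Apply both optimizations together.
const combined = shortNames - valueSavings;    // 41366 bytes
const totalSaved = original - combined;        // 13513 bytes
const percent = (totalSaved / original) * 100; // ~24.6%
```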