A while back Friendfeed posted a blog post explaining how they changed from storing data in MySQL columns to serializing data and just storing it inside TEXT/BLOB columns. It seems that since then, the technique has gotten more popular with Ruby gems now around to do this for you automatically.
So when is it a good idea to use this technique?
If the application really is schema-less and has a lot of optional parameters that do not appear in every record, serializing the data in one column can be a better idea than having many extra columns that are NULL. The restriction on this, would be that searching on these columns now becomes more difficult. A good example of this optional nature of data is user preferences – you only really need to store the settings that differ from the default values.
The second situation where I can see this technique making sense, is that when you update the text/blob, a large percentage of the data is actually modified. If you have to write back the full blob for 1 change, it is expensive.
Another potential pro for this technique is that ALTER TABLE commands are no longer required. Until 5.1-plugin simple operations like adding a secondary index on an InnoDB table require the whole table to be rebuilt (now just the index is built). I don’t really buy this pro, since using master-slave switching I’ve always been able to emulate online index adds. Projects like MMM and Flipper make this even easier.
In Friendfeed’s case, they also used the serialization to be able to compress the objects. From 5.1-plugin this is now available natively.
 Friendfeed solved this indexing problem by creating separate ‘index’ tables.
What are the downsides?
I would say that the first serious downside is write amplification. If you are constantly making small updates to one piece of data in a very large blob, the effort MySQL has to go to is greatly increased.
The second downside I would mention is that this pattern tends to force you to read/write larger amounts of data at once. We’ve all been trained to know that SELECT * is bad. This creates a pattern where SELECT * is not optional. In practical terms I would expect this increases the size of a working set since more data must be kept in the buffer pool.
The next obvious restriction is that there is a clear loss in functionality. You can no longer easily perform aggregation functions on the data (MIN, MAX, AVG). You are storing the data in a way that is very specific to one application, and you can’t just point a BI tool at it to process.
It can become difficult to apply even the simplest constraints on the data such as character length, if an age must be a number, and if the age must be unsigned. MySQL doesn’t go as far as having check constraints, but what it has is better than nothing.
Some smaller issues I have is that if you are using a standard serialization technique like JSON it’s difficult to store pieces of data like numbers or IP addresses in their most efficient form, and that technically this design breaks 1NF.
What would I recommend?
Josh Berkus (of PgExperts) calls this serialization E-blob. Much like EAV, he criticizes this as one of 10 ways to wreck your database (slides). I tend to side with Josh. I’m optimistic that this design has its place in a smaller component of your application, but I’m weary every time I hear someone decide to switch to it exclusively.
Before making such decisions for yourself I would recommend reading Peter’s KISS KISS KISS post.
Entry posted by Morgan Tocker |