Jul 28, 2016

On Uber’s Choice of Databases

A few days ago Uber published the article “Why Uber Engineering Switched from Postgres to MySQL”. I didn’t read the article right away because my inner nerd told me to do some home improvements instead. While doing so my mailbox was filling up with questions like “Is PostgreSQL really that lousy?”. Knowing that PostgreSQL is not generally lousy, these messages made me wonder what the heck is written in this article. This post is an attempt to make sense out of Uber’s article.

In my opinion Uber’s article basically says that they found MySQL to be a better fit for their environment than PostgreSQL. However, the article does a lousy job of conveying this message. Instead of writing “PostgreSQL has some limitations for update-heavy use-cases” the article just says “Inefficient architecture for writes,” for example. In case you don’t have an update-heavy use-case, don’t worry about the problems described in Uber’s article.

In this post I’ll explain why I think Uber’s article should not be taken as general advice about the choice of databases, why MySQL might still be a good fit for Uber, and why success might cause more problems than just scaling the data store.

On UPDATE

The first problem Uber’s article describes in great, yet incomplete detail is that PostgreSQL always needs to update all indexes on a table when updating rows in the table. MySQL with InnoDB, on the other hand, needs to update only those indexes that contain updated columns. The PostgreSQL approach causes more disk IOs for updates that change non-indexed columns (“Write Amplification” in the article). If this is such a big problem to Uber, these updates might be a big part of their overall workload.

However, there is a little bit more speculation possible based upon something that is not written in Uber’s article: The article doesn’t mention PostgreSQL Heap-Only-Tuples (HOT). From the PostgreSQL source, HOT is useful for the special case “where a tuple is repeatedly updated in ways that do not change its indexed columns.” In that case, PostgreSQL is able to do the update without touching any index if the new row-version can be stored in the same page as the previous version. The latter condition can be tuned using the fillfactor setting. Assuming Uber’s engineers are aware of HOT, the conclusion is that it cannot solve their problem because the updates they run at high frequency affect at least one indexed column.
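
A quick sketch of the two knobs involved (the table and column names are hypothetical; pg_stat_user_tables is PostgreSQL’s standard statistics view):

ALTER TABLE trips SET (fillfactor = 90);
-- leaves 10% of each heap page free so updated row versions
-- can stay on the same page (a prerequisite for HOT)

SELECT n_tup_upd, n_tup_hot_upd   -- how many updates were HOT updates
  FROM pg_stat_user_tables
 WHERE relname = 'trips';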

This assumption is also backed by the following sentence in the article: “if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row”. It explicitly says “only covered by a single index” which is the edge case—just one index—otherwise PostgreSQL’s HOT would solve the problem.

[Side note: I’m genuinely curious whether the number of indexes they have could be reduced—index redesign is my challenge here. However, it is perfectly possible that those indexes are used sparingly, yet are important when they are used.]

It seems that they are running many updates that change at least one indexed column, but still relatively few indexed columns compared to the “dozen” indexes the table has. If this is a predominant use-case, the article’s argument to use MySQL over PostgreSQL makes sense.

On SELECT

There is one more statement about their use-case that caught my attention: the article explains that MySQL/InnoDB uses clustered indexes and also admits that “This design means that InnoDB is at a slight disadvantage to Postgres when doing a secondary key lookup, since two indexes must be searched with InnoDB compared to just one for Postgres.” I’ve previously written about this problem (“the clustered index penalty”) in the context of SQL Server.

What caught my attention is that they describe the clustered index penalty as a “slight disadvantage”. In my opinion, it is a pretty big disadvantage if you run many queries that use secondary indexes. If it is only a slight disadvantage to them, it might suggest that those indexes are used rather seldom. That would mean they are mostly searching by primary key (where there is no clustered index penalty to pay). Note that I wrote “searching” rather than “selecting”. The reason is that the clustered index penalty affects any statement that has a where clause—not just select. That also implies that the high frequency updates are mostly based on the primary key.

Finally there is another omission that tells me something about their queries: they don’t mention PostgreSQL’s limited ability to do index-only scans. Especially in an update-heavy database, the PostgreSQL implementation of index-only scans is pretty much useless. I’d even say this is the single issue that affects most of my clients. I’ve already blogged about this in 2011. In 2012, PostgreSQL 9.2 got limited support for index-only scans (it works only for mostly static data). In 2014 I even raised one aspect of my concern at PgCon. However, Uber doesn’t complain about that. Select speed is not their problem. I guess query speed is generally solved by running the selects on the replicas (see below) and the impact is limited because they mostly search on the primary key anyway.
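
To see why updates hurt here: PostgreSQL can skip the table access only for pages that are flagged all-visible in the visibility map, and frequent updates keep clearing that flag. A minimal sketch (hypothetical table, output abbreviated):

CREATE INDEX tbl_a_b ON tbl (a, b);
VACUUM tbl;                      -- refreshes the visibility map

EXPLAIN (ANALYZE)
SELECT b FROM tbl WHERE a = 1;
-- Index Only Scan using tbl_a_b on tbl ...
--   Heap Fetches: 0             -- climbs again as updates dirty the pages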

By now, their use-case seems to be a better fit for a key/value store. And guess what: InnoDB is a pretty solid and popular key/value store. There are even packages that bundle InnoDB with some (very limited) SQL front-ends: MySQL and MariaDB are the most popular ones, I think. Excuse the sarcasm. But seriously: if you basically need a key/value store and occasionally want to run a simple SQL query, MySQL (or MariaDB) is a reasonable choice. I guess it is at least a better choice than any random NoSQL key/value store that just started offering an even more limited SQL-ish query language. Uber, on the other hand, just builds their own thing (“Schemaless”) on top of InnoDB and MySQL.

On Index Rebalancing

One last note about how the article describes indexing: it uses the word “rebalancing” in the context of B-tree indexes. It even links to a Wikipedia article on “Rebalancing after deletion.” Unfortunately, the Wikipedia article doesn’t generally apply to database indexes because the algorithm described on Wikipedia maintains the requirement that each node has to be at least half-full. To improve concurrency, PostgreSQL uses the Lehman and Yao variation of B-trees, which lifts this requirement and thus allows sparse indexes. As a side note, PostgreSQL still removes empty pages from the index (see slide 15 of “Indexing Internals”). However, this is really just a side issue.

What really worries me is this sentence: “An essential aspect of B-trees are that they must be periodically rebalanced, …” Here I’d like to clarify that this is not a periodic process that runs, say, once a day. The index balance is maintained with every single index change (even worse, hmm?). But the article continues “…and these rebalancing operations can completely change the structure of the tree as sub-trees are moved to new on-disk locations.” If you now think that this “rebalancing” involves a lot of data moving, you have misunderstood it.

The important operation in a B-tree is the node split. As you might guess, a node split takes place when a node cannot host a new entry that belongs in this node. To give you a ballpark figure, this might happen once for about 100 inserts. The node split allocates a new node, moves half of the entries to the new node and connects the new node to the previous, next and parent nodes. This is where Lehman and Yao save a lot of locking. In some cases, the new node cannot be added to the parent node straight away because the parent node doesn’t have enough space for the new child entry. In this case, the parent node is split and everything repeats.

In the worst case, the splitting bubbles up to the root node, which will then be split as well and a new root node will be put above it. Only in this case does a B-tree ever become deeper. Note that a root node split effectively shifts the whole tree down and therefore keeps the balance. However, this doesn’t involve a lot of data moving. In the worst case, it might touch three nodes on each level and the new root node. To be explicit: most real world indexes have no more than 5 levels. To be even more explicit: the worst case—a root node split—might happen about five times for a billion inserts. In all other cases the split does not need to bubble up the whole tree. After all, index maintenance is not “periodic”, not even very frequent, and it never completely changes the structure of the tree. At least not physically on disk.
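
If you’d like to check the depth of your own PostgreSQL indexes, the pageinspect extension exposes the B-tree metadata (a sketch; the extension usually requires superuser privileges, and tbl_idx is a placeholder):

CREATE EXTENSION pageinspect;

SELECT level                 -- tree depth; rarely more than 4 or 5
  FROM bt_metap('tbl_idx');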

On Physical Replication

That brings me to the next major concern the article raises about PostgreSQL: physical replication. The reason the article even touches the index “rebalancing” topic is that Uber once hit a PostgreSQL replication bug that caused data corruption on the downstream servers (the bug “only affected certain releases of Postgres 9.2 and has been fixed for a long time now”).

Because PostgreSQL 9.2 only offers physical replication in core, a replication bug “can cause large parts of the tree to become completely invalid.” To elaborate: if a node split is replicated incorrectly so that it doesn’t point to the right child nodes anymore, this sub-tree is invalid. This is absolutely true—like any other “if there is a bug, bad things happen” statement. You don’t need to change a lot of data to break a tree structure: a single bad pointer is enough.

The Uber article mentions other issues with physical replication: huge replication traffic—partly due to the write amplification caused by updates—and the downtime required to update to new PostgreSQL versions. While the first one makes sense to me, I really cannot comment on the second one (but there were some statements on the PostgreSQL-hackers mailing list).

Finally, the article claims that “Postgres does not have true replica MVCC support.” Luckily the article links to the PostgreSQL documentation where this problem (and its remediations) is explained. The problem is basically that the master doesn’t know what the replicas are doing and might thus delete data that is still required on a replica to complete a query.

According to the PostgreSQL documentation, there are two ways to cope with this issue: (1) delay the application of the replication stream for a configurable timeout so the read transaction gets a chance to complete—if a query doesn’t finish in time, kill the query and continue applying the replication stream; (2) configure the replicas to send feedback to the master about the queries they are running so that the master does not vacuum row versions still needed by any replica. Uber’s article rules the first option out and doesn’t mention the second one at all. Instead, the article blames the developers.
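
Both remedies are plain configuration settings on the standby. A sketch of the relevant parameters (the values are arbitrary examples; releases before 9.4 lack ALTER SYSTEM and take these settings in postgresql.conf instead):

-- option 1: let replica queries finish before applying conflicting WAL
ALTER SYSTEM SET max_standby_streaming_delay = '30s';

-- option 2: tell the master which row versions the replicas still need
ALTER SYSTEM SET hot_standby_feedback = on;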

On Developers

To quote it in all its glory: “For instance, say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.”

Unfortunately, I understand and even agree with this argument. Instead of “most engineers are not database experts” I’d even say that most developers have very little understanding of databases because every developer that touches SQL needs to know about transactions—not just database experts.

Giving SQL training to developers is my main business. I do it at companies of all sizes. If there is one thing I can say for sure, it is that the knowledge about SQL is ridiculously low. In the context of the “open transaction” problem just mentioned, I can confirm that hardly any developer even knows that read-only transactions are a real thing. Most developers just know that transactions can be used to back out writes. I’ve encountered this misunderstanding often enough that I’ve prepared slides to explain it and I just uploaded these slides for the curious reader.
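
For the record, read-only transactions are standard SQL and supported by most databases (a minimal example; the exact syntax varies slightly per product):

START TRANSACTION READ ONLY;

SELECT ...;    -- reads are fine; any write is rejected until commit

COMMIT;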

On Success

This leads me to the last problem I’d like to write about: the more people a company hires, the closer their qualification will be to the average. To exaggerate, if you hire the whole planet, you’ll have the exact average. Hiring more people really just increases the sample size.

The two ways to beat the odds are: (1) Only hire the best. The difficult part with this approach is having to wait when no above-average candidates are available; (2) Hire the average and train them on the job. This needs a pretty long warm-up period for the new staff and might also bind existing staff for the training. The problem with both approaches is that they take time. If you don’t have time—because your business is rapidly growing—you have to take the average, who don’t know a lot about databases (empirical data from 2014). In other words: for a rapidly growing company, technology is easier to change than people.

The success factor also affects the technology stack as requirements change over time. At an early stage, start-ups need out-of-the-box technology that is immediately available and flexible enough to be used for their business. SQL is a good choice here because it is actually flexible (you can query your data in any way) and it is easy to find people knowing SQL at least a little bit. Great, let’s get started! And for many—probably most—companies, the story ends here. Even if they become moderately successful and their business grows, they might still stay well within the limits of SQL databases forever. Not so for Uber.

A few lucky start-ups eventually outgrow SQL. By the time that happens, they have access to way more (virtually unlimited?) resources and then…something wonderful happens: They realize that they can solve many problems if they replace their general purpose database by a system they develop just for their very own use-case. This is the moment a new NoSQL database is born. At Uber, they call it Schemaless.

On Uber’s Choice of Databases

By now, I believe Uber did not replace PostgreSQL with MySQL as their article suggests. It seems that they actually replaced PostgreSQL with their tailor-made solution, which happens to be backed by MySQL/InnoDB (at the moment).

It seems that the article just explains why MySQL/InnoDB is a better backend for Schemaless than PostgreSQL. For those of you using Schemaless, take their advice! Unfortunately, the article doesn’t make this very clear because it doesn’t mention how their requirements changed with the introduction of Schemaless compared to 2013, when they migrated from MySQL to PostgreSQL.

Sadly, the only thing that sticks in the reader’s mind is that PostgreSQL is lousy.

If you like my way of explaining things, you’ll love my book.

Original title and author: “On Uber’s Choice of Databases” by Markus Winand.

Sep 25, 2015

Modern SQL: Inaugural Post


I've just launched my new website modern-sql.com. It's very small right now—just six content pages—but it will grow over the next months and years and might eventually become a book like Use-The-Index-Luke.com did. So I thought I'd better get you on board right now so you can grow as the site grows.

So what is modern SQL about? Yet another SQL reference? Absolutely not. There are definitely enough references out there. And they all suffer from the same issues:

  • They hardly cover more than entry-level SQL-92. Some tutorials don't even mention that they describe a proprietary dialect.

  • The examples just show the syntax but not how to solve real-world problems.

  • They don't document the availability of the features amongst different database products.

No question, the latter ones are a consequence of the first one. However, modern SQL is different:

  • modern SQL covers SQL:2011—the current release of the ISO SQL standard. But it acknowledges the fact that many databases are developed against older releases or only support a subset of the standard.

  • modern SQL shows how to use the most recent SQL features to solve real world problems. It also provides “conforming alternatives” as a way to solve the same problem using other (often older) standard methods. If necessary and possible it will also show “proprietary alternatives.”

    Note that there is a hierarchy: the recommended (most idiomatic) approach is shown first, standard conforming alternatives next, and proprietary alternatives last, but only if unavoidable.

  • modern SQL documents the availability of the features amongst six SQL databases (the example is about values).

Click on a feature (e.g., “valid where select is valid”) to see how the availability of the features has evolved.

There you can see that SQLite just recently started to accept values where select is valid. The timeline view also documents the last checked version for those databases that don't support a feature.

Inevitably, modern SQL will become both an homage to the SQL standard and a rant about its poor adoption. The availability documentation and the above mentioned alternative approaches (conforming or proprietary) are there to keep the “rant” a productive one. Look at the “select without from” use-case to get the idea.

This is just the first step of a long journey. I invite you to follow it via Twitter, Email or RSS.

Original title and author: “Modern SQL: Inaugural Post” by Markus Winand.

Feb 02, 2015

Modern SQL in PostgreSQL [and other databases]

“SQL has gone out of fashion lately—partly due to the NoSQL movement, but mostly because SQL is often still used like 20 years ago. As a matter of fact, the SQL standard continued to evolve during the past decades resulting in the current release of 2011. In this session, we will go through the most important additions since the widely known SQL-92, explain how they work and how PostgreSQL supports and extends them. We will cover common table expressions and window functions in detail and have a very short look at the temporal features of SQL:2011 and the related features of PostgreSQL.”

This is the abstract for the talk I’ve given at FOSDEM in Brussels on Saturday. The PostgreSQL community was so kind as to host this talk in their (way too small) devroom—thus the references to PostgreSQL. However, the talk is built upon standard SQL and covers features that are commonly available in DB2, Oracle, SQL Server and SQLite. MySQL does not yet support any of those features except OFFSET, which is evil.

One last thing before going on to the slides: Use The Index, Luke has a shop. Stickers, coasters, books, mugs. Have a look.

Find the slides on SlideShare.

Original title and author: “Modern SQL in PostgreSQL [and other databases]” by Markus Winand.

Aug 05, 2014

We need tool support for keyset pagination

Did you know pagination with offset is very troublesome but easy to avoid?

offset instructs the database to skip the first N results of a query. However, the database must still fetch these rows from the disk and bring them in order before it can send the following ones.

This is not an implementation problem, it’s the way offset is designed:

…the rows are first sorted according to the <order by clause> and then limited by dropping the number of rows specified in the <result offset clause> from the beginning…

In other words, big offsets impose a lot of work on the database—no matter whether SQL or NoSQL.

But the trouble with offset doesn’t stop here: think about what happens if a new row is inserted between fetching two pages.

When using offset to skip the previously fetched entries, you’ll get duplicates if new rows were inserted between fetching two pages. Other anomalies are possible too; this is just the most common one.

This is not even a database problem, it is the way frameworks implement pagination: they just say which page number to fetch or how many rows to skip. With this information alone, no database can do any better.

Life Without OFFSET

Now imagine a world without these problems. As it turns out, living without offset is quite simple: just use a where clause that selects only data you haven’t seen yet.

For that, we exploit the fact that we work on an ordered set—you do have an order by clause, don’t you? Once there is a definite sort order, we can use a simple filter to only select what follows the entry we have seen last:

SELECT ...
  FROM ...
 WHERE ...
   AND id < ?last_seen_id
 ORDER BY id DESC
 FETCH FIRST 10 ROWS ONLY

This is the basic recipe. It gets more interesting when sorting on multiple columns, but the idea is the same. This recipe is also applicable to many NoSQL systems.
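
For the curious, the multi-column variant can use a row-value comparison where the database supports and optimizes it (PostgreSQL does; elsewhere an equivalent and/or chain is needed). A sketch with hypothetical columns:

SELECT ...
  FROM ...
 WHERE (date_column, id) < (?last_seen_date, ?last_seen_id)
 ORDER BY date_column DESC, id DESC
 FETCH FIRST 10 ROWS ONLY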

This approach—called seek method or keyset pagination—solves the problem of drifting results as illustrated above and is even faster than offset. If you’d like to know what happens inside the database when using offset or keyset pagination, have a look at these slides (benchmarks, benchmarks!):

On slide 43 you can also see that keyset pagination has some limitations: most notably that you cannot directly navigate to arbitrary pages. However, this is not a problem when using infinite scrolling. Showing page number to click on is a poor navigation interface anyway—IMHO.

If you want to read more about how to properly implement keyset pagination in SQL, please read this article. Even if you are not involved with SQL, it’s worth reading that article before starting to implement anything.

But the Frameworks…

The main reason to prefer offset over keyset pagination is the lack of tool support. Most tools offer pagination based on offset, but don’t offer any convenient way to use keyset pagination.

Please note that keyset pagination affects the whole technology stack up to the JavaScript running in the browser doing AJAX for infinite scrolling: instead of passing a simple page number to the server, you must pass a full keyset (often multiple columns) down to the server.

The hall of fame of frameworks that do support keyset pagination is rather short.

This is where I need your help. If you are maintaining a framework that is somehow involved with pagination, I ask you, I urge you, I beg you, to build in native support for keyset pagination too. If you have any questions about the details, I’m happy to help (forum, contact form, Twitter)!

Even if you are just using software that should support keyset pagination such as a content management system or webshop, let the maintainers know about it. You might just file a feature request (link to this page) or, if possible, supply a patch. Again, I’m happy to help out getting the details right.

Take WordPress as an example.

Spread the Word

The problem with keyset pagination is not a technical one. The problem is just that it is hardly known in the field and has no tool support. If you like the idea of offset-less pagination, please help spreading the word. Tweet it, share it, mail it, you can even re-blog this post (CC-BY-NC-ND). Translations are also welcome, just contact me beforehand—I’ll include a link to the translation on this page too!

Oh, and if you are blogging, you could also add a banner on your blog to make your readers aware of it. I’ve prepared a NoOffset banner gallery with some common banner formats. Just pick what suits you best.

Original title and author: “We need tool support for keyset pagination” by Markus Winand.

May 25, 2014

What I learned about SQLite…at a PostgreSQL conference

So, I’ve been to PgCon 2014 in Ottawa to give a short version of my SQL performance training (hint: special offer expires soon). However, I think I ended up learning more about SQLite than about PostgreSQL there. Here is how that happened and what I actually learned.

Richard Hipp, creator of SQLite, was the keynote speaker at this year’s PgCon. In his keynote (slides, video) he put the focus on three topics: how PostgreSQL influenced SQLite development (“SQLite was originally written from PostgreSQL 6.5 documentation” and the “What Would PostgreSQL Do?” (WWPD) way of finding out what the SQL standard tries to tell us). The second main topic was that SQLite should be seen as an application file format—an alternative to inventing your own file formats or using ZIPped XMLs. The statement “SQLite is not a replacement for PostgreSQL. SQLite is a replacement for fopen()” nails that (slide 21). Finally, Richard put a lot of emphasis on the fact that SQLite takes care of your data (crash safe, ACID)—unlike many of the so-called NoSQL systems, which Richard refers to as “Postmodern Databases: absence of objective truth; Queries return opinions rather than facts”. In his keynote, Richard also showed that SQLite is pretty relaxed when it comes to data types. As a matter of fact, SQLite accepts strings like “Hello” for INT fields. Note that it still stores “Hello”—no data is lost. I think he mentioned that it is possible to enforce the types via CHECK constraints.

Here I have to mention that I had an e-mail conversation with Richard about covering SQLite on Use The Index, Luke last year. When our ways crossed in the PgCon hallway I just wanted to let him know that this has not been forgotten (despite the minimal progress on my side). Guess what—he still remembered our mail exchange and immediately suggested going through those topics once more in PgCon’s hacker lounge later. Said. Done. Here comes what I learned about SQLite.

First of all, SQLite uses the clustered index concept as known from SQL Server or MySQL/InnoDB. However, before version 3.8.2 (released 2013-12-06) you always had to use an INTEGER column as clustering key. If the primary key happened to be a non-integer, SQLite created an integer primary key implicitly and used that as the clustering key. Querying by the real primary key (e.g. TEXT) required a secondary index lookup. This limitation was recently lifted by introducing WITHOUT ROWID tables.
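
The syntax is a single clause at table creation (a minimal sketch; available from SQLite 3.8.2):

CREATE TABLE tbl (
  key  TEXT PRIMARY KEY,
  val  TEXT
) WITHOUT ROWID;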

When it comes to execution plans, SQLite has two different variants: the regular explain prefix returns the byte code that is actually executed. To get an execution plan similar to what we are used to, use explain query plan. SQLite also has some hints (unlike PostgreSQL) to affect the query plan: indexed by, unlikely, and likelihood.
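
In use, they look like this (a sketch; tbl and tbl_idx are placeholders):

EXPLAIN QUERY PLAN
SELECT * FROM tbl WHERE a = 1;

-- force the use of a particular index
SELECT * FROM tbl INDEXED BY tbl_idx WHERE a = 1;

-- hint that a condition is rarely true
SELECT * FROM tbl WHERE unlikely(b = 0);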

We also discussed whether or not SQLite can cope with search conditions like (col1, col2) < (val1, val2)—it can’t. Here Richard questioned the motivation to have that, so I gave him an elevator-pitch version of my “Pagination Done The PostgreSQL Way” talk. I think he got “excited” about this concept to avoid offset altogether and I’m curious to see if he can make it work in SQLite faster than I’m adding SQLite content to Use The Index, Luke :)

SQLite supports limit and offset (*sigh*) but the optimizer currently does not consider limit for optimization. Offering the SQL:2008 fetch first … rows only syntax is not planned and might actually turn out to be hard because the parser is close to running out of 8-bit token codes (if I understood that right).

Joins are basically always executed as nested loop joins, but SQLite might create an “automatic transient index” for the duration of the query. We also figured out that there seems to be an oddity with the way CROSS JOIN works in SQLite: it turns the optimizer’s table reordering off. Non-cross joins, on the other hand, don’t enforce the presence of ON or USING so that you can still build a Cartesian product using the optimizer’s smartness to reorder the tables.

Last but not least I’d like to mention that SQLite supports partial indexes in the way it should be: just a WHERE clause at index creation. Partial indexes are available since version 3.8.0 (released 2013-08-26).
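
That is, it looks just the way you’d expect (a minimal sketch with hypothetical names):

CREATE INDEX tbl_active_idx
    ON tbl (col)
 WHERE deleted = 0;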

On the personal side, I’d describe Richard as very pragmatic and approachable. He joined Twitter a few months ago and all he does there is help people who ask questions about SQLite. Really, look at his timeline. That says a lot.

I’m really happy that I had the opportunity to meet Richard at a PostgreSQL conference—as happy as I was to meet Joe “Smartie” Celko years ago…at another PostgreSQL conference. Their excellent keynote speaker selection is just one more reason to recommend PostgreSQL conferences in general. Here are the upcoming ones just in case you got curious.

ps.: There was also an unconference where I discussed the topic of Bitmap Index Only Scan (slides & minutes).

Apr 24, 2014

What’s left of NoSQL?

This is my own and very loose translation of an article I wrote for the Austrian newspaper derStandard.at in October 2013. As this article was very well received and the SQL vs. NoSQL discussion is currently hot again, I thought it might be a good time for a translation.

Back in 2013 The Register reported that Google sets its bets on SQL again. At first sight this might look like a surprising move because it was of all things Google’s publications about MapReduce and BigTable that gave the NoSQL movement a big boost in the first place. At second glance it turns out that there is a trend to use SQL or similar languages to access data from NoSQL systems—and that’s not even a new trend. However, it raises a question: what remains of NoSQL if we add SQL again? To answer this question, I need to start with a brief summary of the history of databases.

Top Dog SQL

It was undisputed for decades: store enterprise data in a relational database and use the programming language SQL to process the stored data.

The theoretical foundation for this approach, the relational model, was laid down as early as 1970. The first commercial representative of this new database type became available in 1979: the Oracle Database in release 2. The shape of today’s relational databases was mainly formed in the 1980s. The most important milestones were the formalization of the ACID criteria (Atomicity, Consistency, Isolation, Durability), which avoid accidental data corruption, and the standardization of the query language SQL—first by ANSI, one year later by ISO. From this time on SQL and ACID compliance were desirable goals for database vendors.

At first glance there were no major improvements during the next decades. The result: SQL and the relational model are sometimes called old or even outdated technology. At a closer look SQL databases evolved into mature products during that time. This process took that long because SQL is a declarative language and was way ahead of the technical possibilities of the 1980s. “Declarative language” means that SQL developers don’t need to specify the steps that lead to the desired result as with imperative programming, they just describe the desired result itself. The database finds the most efficient way to gather this result automatically—which is by no means trivial so that early implementations only mastered it for simple queries. Over these decades, however, the hardware got more powerful and the software more advanced so that modern databases deliver good results for complex queries too (see here if it doesn’t work for you).

It is especially important to note that the capabilities of SQL are not limited to storing and fetching data. In fact, SQL was designed to make refining and transforming data easy. Without any doubt, that was an important factor that made SQL and the relational model the first choice when it comes to databases.

And Then: NoSQL

Despite the omnipresence of SQL, a new trend emerged during the past few years: NoSQL. This term alone struck a nerve with many developers and caused a rapid spread that ultimately turned into a religious war. The opponents were SQL as a symbol for outdated, slow, and expensive technology on the one side against an inhomogeneous collection of new, highly-scalable, free software that is united by nothing more than the umbrella brand “NoSQL.”

One possible explanation for the lost appreciation of SQL among developers is the increasing popularity of object-relational mapping tools (ORM) that generally tend to reduce SQL databases to pure storage media (“persistence layer”). The possibility to refine data using SQL is not encouraged but considerably hindered by these tools. The result of their excessive use is step-by-step processing in the application. Under these circumstances SQL does indeed not deliver any additional value and it becomes understandable why so many developers sympathise with the term NoSQL.

But the problem is that the term NoSQL was not aimed against SQL in the first place. To make that clear, the term was later defined to mean “not only SQL.” Thus, NoSQL is about complementary alternatives. To be precise it is not even about alternatives to SQL but about alternatives to the relational model and ACID. In the meantime the CAP theorem revealed that the ACID criteria will inevitably reduce the availability of distributed databases. That means that traditional, ACID compliant, databases cannot benefit from the virtually unlimited resources available in cloud environments. This is what many NoSQL systems provide a solution for: instead of sticking to the very rigid ACID criteria to keep data 100% consistent all the time they accept temporary inconsistencies to increase the availability in a distributed environment. Simply put: in doubt they prefer to deliver wrong (old) data rather than no data. A more correct but less catchy term would therefore be NoACID.

Deploying such systems only makes sense for applications that don’t need strict data consistency. These are quite often applications in the social media field that can still fulfil their purpose with old data and even accept the loss of some updates in case of service interruption. These are also applications that could possibly need the unlimited scalability offered by cloud infrastructure. If a central database is sufficient, however, the CAP theorem does not imply a compelling reason to abandon the safety of ACID. However, using NoSQL systems can still make sense in domains where SQL doesn’t provide sufficient power to process the data. Interestingly, this problem is also very dominant in the social media field: although it is easy to express a social graph in a relational model it is rather cumbersome to analyse the edges using SQL.

Nevertheless: Back to SQL

Notwithstanding the above, SQL is still a very good tool to answer countless questions. This is also true for questions that could not be foreseen at the time of application design—a problem many NoSQL deployments face after the first few years. That’s probably also the cause for the huge demand for powerful and generic query languages like SQL.

Another aspect that strengthens the trend back to SQL goes to the heart of NoSQL: without ACID it is very difficult to write reliable software. Google said that very clearly in their paper about the F1 database: without ACID “we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date.” Apparently you only learn to appreciate ACID once you lost it.

What’s left of NoSQL?

If SQL and ACID become the flavor of the time again, you may wonder what’s left of the NoSQL movement. Did it really waste half a decade, as Jack Clark speculated at The Register? One thing we can say for sure is that SQL and ACID conformity are still desirable goals for database vendors. Further we know that there is a trade-off between ACID and scalability in cloud environments.

Of course, the race for the Holy Grail of the ideal balance between ACID and scalability has already begun. The NoSQL systems entered the field from the corner of unlimited scalability and started adding tools to control data consistency such as causal consistency. Obviously there are also newcomers that use the term NewSQL to jump on the bandwagon by developing completely new SQL databases that bring ACID and SQL from the beginning but use NoSQL ideas internally to improve scalability.

And what about the established database vendors? They want to make us believe they are doing NoSQL too and release products like the “Oracle NoSQL Database” or “Windows Azure Table Storage Service.” The intention to create these products for the sole purpose of riding the hype is so striking that one must wonder why they treat neither NoSQL nor NewSQL as a serious threat to their business. When looking at the Oracle Database in its latest release, 12c, we can even see the opposite trend. Although the version suffix “c” is meant to express its cloud capability, it doesn’t change the fact that the killer feature of this release serves a completely different need: the easy and safe operation of multiple databases on a single server. That’s the exact opposite of what many NoSQL systems aim for: running a giant database on a myriad of cheap commodity servers. Virtualization is a way bigger trend than scale-out.

Is it even remotely possible that the established database vendors underwent such a fundamental misjudgment? Or is it rather that NoSQL only serves a small market niche? How many companies really need to cope with data at the scale of Google, Facebook or Twitter? Incidentally, three companies that grew up on the open source database MySQL. One might believe the success of NoSQL is also based on the fact that it solves a problem that everybody would love to have. In all reality this problem is only relevant to a very small but prominent community, which managed to get a lot of attention. After all, judging also by the fact that the big database vendors show no serious engagement, NoSQL is nothing more than a storm in a teacup.

If you like my way to explain things, you’ll love SQL Performance Explained.

Original title and author: “What’s left of NoSQL?” by Markus Winand.

Apr 01, 2014

Thank You MySQL, We’ll Miss You!

Dear MySQL,

Thank you for introducing me to SQL. It must have been 1998 when we first met and I fell in love with the simplicity of SQL immediately. Before that I had been using C structs all the time; I had to do my joins programmatically and also create and maintain my indexes manually. It was even hard to combine several search conditions via and and or. But then there was the shiny new world of SQL you were showing me…

Everything was easy. Just write a where clause, no matter how complex, and you found the right rows. Joins were equally easy to write and you took all the effort of combining the data from several tables as I needed them. I also remember how easy it became to manage the schema. Instead of writing a program to copy my data from one C struct to another, I just say alter table now—in the meanwhile it even works online in many cases! It didn’t take long until I used SQL for stuff I wouldn’t have thought a database could do for me. So I was quickly embracing group by and co.

But I haven’t spent a lot of time with you lately. It’s not because I was too busy. I’m still practicing what you have shown me! And I’ve moved on. Now I’m using common table expressions to organize complex queries and I use window functions to calculate running totals or just do a ranking. I’m also using joins more efficiently because I know about hash and sort/merge joins. A while ago I was wondering why you didn’t tell me about these things. But then I realized that you don’t know them.

I know it was not always nice what I said about you recently. But now that Oracle announced your retirement to focus on Oracle NoSQL, I realized how right this comment on reddit was. Sure you are neither the most advanced nor the most widely deployed open source SQL database, but you introduced millions of people to SQL. There should be no tears when some of them move away because they want to see what’s next. Isn’t that the greatest compliment a teacher can get? You can be proud of what you have accomplished.

Thank you!

Markus Winand

April 1, 2014

ps.: Even when fooling around, I can’t resist informing my fellow readers. Besides the claim that Oracle retires MySQL, everything is true. Initially I thought Oracle NoSQL was an April Fools’ joke, but it isn’t.

Original title and author: “Thank You MySQL, We’ll Miss You!” by Markus Winand.

Feb 20, 2014

Results of the SQL Performance Quiz: 60% fail

In 2011, I launched the “The 3-Minute Test: What do you know about SQL performance.” It consists of five questions that follow a simple pattern: each question shows a query/index pair and asks if it demonstrates proper indexing or not. To date, this test has become one of the most popular features on Use The Index, Luke and has been completed more than 28,000 times.

Note

Just in case you got curious, please be aware that this article is a spoiler. You might want to do the quiz yourself before continuing.

Although the quiz was created for educational purposes, I was wondering if I could get some interesting figures out of these 28,000 results. And I think I could. However, there are several things to keep in mind when looking at these figures. First, the quiz uses the surprise factor to catch attention. That means, three questions show cases that look fine, but aren’t. One question does it the other way around and shows an example that might look dangerous, but isn’t. There is only one question where the correct answer is in line with the first impression. Another effect that might affect the significance of the results is that there was no representative selection of participants. Everybody can take the quiz. You can even do it multiple times and will probably get a better result the second time. Just keep in mind that the quiz was never intended to be used for scientific research upon the indexing knowledge in the field. Nevertheless, I think that the size of the dataset is still good enough to get an impression.

Below I’ll show two different statistics for each question. First, the average rate at which this question was correctly answered. Second, how this figure varies for users of MySQL, Oracle, PostgreSQL and SQL Server databases. In other words, it shows whether, for example, MySQL users are more knowledgeable about indexing than PostgreSQL users. Spoiler: it’s the other way around. The only reason I’m in the lucky position to have this data is that the test sometimes uses vendor specific syntax. For example, what is LIMIT in MySQL and PostgreSQL is TOP in SQL Server. Therefore, the participants have to select a database at the beginning so that the questions are shown in the native syntax of that product.

Question 1: Functions in the WHERE Clause

Is the following SQL good or bad practice from a performance perspective?

Searching for all rows of the year 2012:

CREATE INDEX tbl_idx ON tbl (date_column);

SELECT text, date_column
  FROM tbl
 WHERE TO_CHAR(date_column, 'YYYY') = '2012';

This is an example where the code uses functions specific to the Oracle and PostgreSQL databases. For MySQL, the question uses YEAR(date_column) and for SQL Server datepart(yyyy, date_column). Of course, I could have used EXTRACT(YEAR date_column), but I thought it is better to use the most common syntax.

The participants have these options:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

The correct answer is “bad practice” because the index on date_column cannot be used when searching on something derived from date_column. If you don’t believe that, please have a look at the proof scripts or the explanations shown at the end of the test. They also contain links to the Use The Index, Luke pages explaining it in more detail.

However, if you didn’t know how functions effectively “disable” indexes, you are not alone. Only about two-thirds gave the correct answer. And as for every multiple choice test, there is a certain probability of picking the correct answer by chance. In this case it’s a 50/50 chance — by no means negligible. I’ve marked this “guessing score” in the figure to emphasize that.

This is one of the most common problems I see in my everyday work. The same problem can also hit you with VARCHAR fields when using UPPER, TRIM or the like. Please keep in mind: whenever you are applying functions on columns in the where clause, an index on the column itself is no longer useful for this query.

Although this result is quite disappointing — I mean it’s not much better than guessing — it is no surprise to me. What is a surprise for me is how the result differs amongst the users of different databases.

As a matter of fact, MySQL users just score 55% — almost as low as the “guessing score”. PostgreSQL users, on the other hand, get a score of 83%.

An effect that might explain this result is that MySQL doesn’t support function-based indexes while Oracle and PostgreSQL do. Function-based indexes allow you to index expressions like TO_CHAR(date_column, 'YYYY'). Although it is not the recommended solution for this case, the pure existence of this feature might make users of the Oracle and PostgreSQL database more aware of this problem. SQL Server offers a similar feature: although it cannot index expressions directly, you can create a so-called computed column on expressions, which in turn can be indexed.
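
For completeness, such a function-based index might look like this in PostgreSQL (a sketch; the indexed expression must reappear verbatim in the where clause, and the function must be immutable—which holds for extracting the year from a plain date column):

CREATE INDEX tbl_year_idx
    ON tbl ((EXTRACT(YEAR FROM date_column)));

SELECT text, date_column
  FROM tbl
 WHERE EXTRACT(YEAR FROM date_column) = 2012;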

Although support for function-based indexes might explain why MySQL users underperformed, it is still no excuse. The shown query/index pair is bad — no matter whether the database supports function-based indexes or not. And the major improvement is also possible without function-based indexes:

SELECT text, date_column
  FROM tbl
 WHERE date_column >= TO_DATE('2012-01-01', 'YYYY-MM-DD')
   AND date_column <  TO_DATE('2013-01-01', 'YYYY-MM-DD');

The index doesn’t need to be changed. This solution is very flexible because it supports queries for different ranges too — e.g. by week or month. This is the recommended solution.

As a curious guy, I’d love to know how many of the people who correctly answered this question were thinking of the sub-optimal solution to use a function-based index. I’d rate this solution half-correct at best.

Question 2: Indexed Top-N Queries

Is the following SQL good or bad practice from a performance perspective?

To find the most recent row:

CREATE INDEX tbl_idx ON tbl (a, date_column);

SELECT id, a, date_column
  FROM tbl
 WHERE a = ? 
 ORDER BY date_column DESC
 LIMIT 1;

Note that the question mark is a placeholder, because I always encourage developers to use bind parameters.

As a participant, you have these two options again:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

This is the question that is supposed to look dangerous, but isn’t. Generally, it seems like people believe order by must always sort the data. This index, however, eliminates the need to sort the data entirely so that the query is basically as fast as a unique index lookup. Please find a detailed explanation of this trick here.

The result is very close to the “guessing score” which I interpret as “people don’t have a clue about it.”

This result is particularly sad because I’ve seen people building caching tables, regularly refilled by cron jobs, to avoid queries like this. Interestingly, the cron job tends to cause performance problems because it is running in rather short intervals to make sure the cache table has fresh data. However, the right index is often the better option in the first place.

Here I have to mention that the Oracle database needs the most special syntax for this trick. Up to version 12c, released in 2013, the Oracle database did not offer a handy shortcut such as LIMIT or TOP. Instead, you have to use the ROWNUM pseudo-column like this:

SELECT *
  FROM (
        SELECT id, date_column
          FROM tbl
         WHERE a = :a
         ORDER BY date_column DESC
       )
 WHERE rownum <= 1;

The extra complexity of this query might have pushed Oracle users more heavily towards the wrong answer — actually below guessing score!

Another argument I’m getting in response to this question is that including the ID column in the index would allow an index-only scan. Although this is correct, I don’t consider not doing so a “bad practice” because the query touches only one row anyway. An index-only scan could just avoid a single table access. Obviously there are cases where you need that improvement, but in the general case I’d consider it a premature optimization. But that’s just my opinion. However, following this argument might give us an idea why PostgreSQL users got the best score (again): PostgreSQL did not have index-only scans until version 9.2, which was released in September 2012. As a result, PostgreSQL users could not fall into this trap of thinking an index-only scan can bring major improvements in this case. Undoubtedly, the term “major” is troublesome in this context.

Question 3: Index Column Order

Is the following SQL good or bad practice from a performance perspective?

Two queries, searching by a common column:

CREATE INDEX tbl_idx ON tbl (a, b);

SELECT id, a, b
  FROM tbl
 WHERE a = ?
   AND b = ?;


SELECT id, a, b
  FROM tbl
 WHERE b = ?;

The same options:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

The answer is “bad practice,” because the second query cannot use the index properly. Changing the index column order to (b, a) would, however, allow both queries to use this index in the most efficient way. You can find a full explanation here. Adding a second index on (b) would be a poor solution due to the overhead it adds for no reason. Unfortunately, I don’t know how many would have done that.
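
The fix is a one-line change—same columns, reversed order:

CREATE INDEX tbl_idx ON tbl (b, a);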

The result is disappointing, but in line with my expectations — just 12.5% above guessing score.

This is also a problem I see almost every day. People just don’t understand how multi-column indexes work.

All the per-database results are pretty close together. Maybe because there is no syntactic difference or well-known database feature that could have a major influence on the answer. Less known features like Oracle’s SKIP SCAN could have a minor impact, of course. Generally, the index-only scan could have an influence too, but it pushes the participants to the “right” answer this time.

After all, this result might just say that users of some databases know more about indexing than others. Interestingly, PostgreSQL users get the best score for the third time.

Question 4: LIKE Searches

Is the following SQL troublesome or bulletproof from a performance perspective?

Searching within a string:

CREATE INDEX tbl_idx ON tbl (text);

SELECT id, text
  FROM tbl
 WHERE text LIKE '%TERM%';

I’ve phrased the options differently this time:

  • Bulletproof: It will always run fast.

  • Troublesome: There is high risk for performance problems.

And the correct answer is “troublesome” because the LIKE pattern uses a leading wild card. Otherwise, if it would use the pattern 'TERM%', it could use the index very efficiently. Have a look at this visual explanation for details.
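
In other words, only the variant without a leading wild card is indexable (a sketch):

SELECT id, text
  FROM tbl
 WHERE text LIKE 'TERM%';   -- scans the index range starting at 'TERM'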

The results of this question are promising. Here I feel safe to say most people know that LIKE is not for full-text search.

The segregated results are also within a very narrow corridor:

It is, however, strange that PostgreSQL users under performed at this question. A closer look at how the question is presented to PostgreSQL users might give an explanation:

CREATE INDEX tbl_idx ON tbl (text varchar_pattern_ops);

SELECT id, text
  FROM tbl
 WHERE text LIKE '%TERM%';

Note the addition to the index (varchar_pattern_ops). In PostgreSQL, this special operator class is required to make the index usable for postfix wild card searches (e.g. 'TERM%'). I added this because I aimed to find out if people know about the problem of leading wild cards in LIKE expressions. Without the operator class, there are two reasons why it doesn’t work: (1) the leading wild card; (2) the missing operator class. I thought that would be too obvious. Retrospectively, I believe some participants interpreted this operator class as “magic that makes it work” and thus gave the wrong answer.

Question 5a: Index-Only Scan

Question five is a little bit tricky, because PostgreSQL did not support index-only scans when the quiz was created. For that reason, there are two variants of question five: one about index-only scans for users of MySQL, Oracle and SQL Server databases, and another about index column order for PostgreSQL users. Both results are presented here, but the segregated data is limited for obvious reasons. We start with the question about index-only scans:

How will the change affect query performance?

Current situation, selecting about hundred rows out of a million:

CREATE INDEX tab_idx ON tbl (a, date_column);

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 123
 GROUP BY date_column;

Changed query, selecting about ten rows out of a million:

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 123
   AND b = 42
 GROUP BY date_column;

Note the added where clause in the second query.

This question is also special because it offers four options:

  • Query performance will roughly stay the same (+/- 10%)

  • Depends on the data.

  • The query will be much slower (impact >10%)

  • The query will be much faster (impact >10%)

When I created the quiz, I was well aware that 50/50 questions have a tendency to render the score meaningless. This question is a trade-off between keeping the questions easy to grasp and answer (questions 1–4) and giving a more accurate result.

To make it short, the correct answer is “the query will be much slower.” The reason is that the original query could use an index-only scan — that is, the query could be answered only using data from the index without fetching any data from the actual table. The second query, however, needs to check column B too, which is not in the index. Consequently, the database must take some extra effort to fetch the candidate rows from the table to evaluate the expression on B. That is, it must fetch at least 100 rows from the table — the number of rows returned by the first query. Due to the group by there are probably more rows to fetch. A quite considerable extra effort that will make the query much slower. A more exhaustive explanation is here.
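
If the second query matters, a straightforward remedy would be to put column B into the index as well so the index-only scan survives (a sketch; whether the wider index pays off depends on the workload):

CREATE INDEX tab_idx ON tbl (a, b, date_column);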

With that number of options, the overall score drops significantly to about 39% or 14% above guessing score.

Still I think saying that about 39% of participants knew the right answers is wrong. They gave the right answer, but there was still a probability of 25% that they gave the right answer without knowing it.

The segregation by database is quite boring. Conspicuously boring.

However, with four options, it is also interesting to see how people actually answered.

That caught me by surprise. Both options “roughly the same” and “depends on the data” got about 25% — the guessing probability. Does this mean half of the participants were guessing? As it is the last question, some participants might have picked a random option just to get through. Quite possible. However, the correct option “much slower” got 38.8% at the cost of the “much faster” option, which got just 10.9%.

My intention with this question was to trap people into the “much faster” option because fetching less data should be faster — except when breaking an index-only scan. The only hypothesis I have for this result is that people might have got the idea that the obvious answer isn’t the correct one. That, however, would mean the 39% score doesn’t prove anything about the knowledge of this phenomenon in the field.

Another effect that I expected to have an impact is that “it always depends on the data.” Of course there are edge cases where the performance might roughly stay the same — e.g., when all inspected rows are in the same table block. However, this is rather unlikely — just because there would be no point in adding the date_column for an index-only scan in the first place.

Question 5b: Index Column Order and Range Operators

This question is only shown to PostgreSQL users.

Is the following SQL good or bad practice from a performance perspective?

Searching for entries in state X, not older than 5 years.

CREATE INDEX tbl_idx ON tbl (date_column, state);

SELECT id, date_column, state
  FROM tbl
 WHERE date_column >= CURRENT_DATE - INTERVAL '5' YEAR
   AND state = 'X';

(365 rows)

The data distribution is as follows:

SELECT count(*)
  FROM tbl
 WHERE date_column >= CURRENT_DATE - INTERVAL '5' YEAR;

 count 
-------
  1826

SELECT count(*)
  FROM tbl
 WHERE state = 'X';

 count 
-------
 10000

There is an index with two columns and a query that filters on both of them. One filter uses an equals operator, the other a greater than or equal operator. When using each filter individually, the query returns many more rows than when combining both filters.

The options are:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

And the correct answer is “bad practice” because the column order in the index is the wrong way around. The general rule is that index columns can be used efficiently from the left hand side as long as they are used with equals operators. Further, one column can be used efficiently with a range operator. However, the first range operator effectively cuts off the index so that further columns cannot be used efficiently anymore. By “efficiently” I mean as an index access predicate. Please find a visualization here.

With the original index shown above, the query has to fetch 1826 entries from the index (those matching the date_column filter) and check each of them for the value of the state column. If we turn the index column order around, the database can use both filters efficiently (= as access predicate) and directly limit the index access to those 365 rows of interest.
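
Turned around, the index puts the equals column first and the range column last:

CREATE INDEX tbl_idx ON tbl (state, date_column);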

And this is how people answered:

Wait a moment, that’s below the guessing score! It’s not just that people don’t know—they believe the wrong thing. However, I must admit that the term “major” is very problematic again. When I run this example, the speed-up I get is just 70%. Not even twice as fast.

Overall Score: How Many Passed the Test?

Looking at each question individually is interesting, but doesn’t tell us how many participants managed to answer all five questions correctly, for example. The following chart has that information.

Finally, I’d like to boil this chart down to a single figure: how many participants “passed” the test?

Considering that the test has only five questions, out of which four are 50/50 questions, I think it is fair to say three correct answers isn’t enough to pass the test. Requiring five correct answers would quite obviously be asking for too much. Requiring four correct answers to “pass” the test is therefore the only sensible choice I see. Using this definition, only 38.2% passed the test. The chance to pass the test by guessing is still 12.5%.

If you like this article and want to learn about proper indexing, SQL Performance Explained is for you.

Original title and author: “Results of the SQL Performance Quiz: 60% fail” by Markus Winand.

Oct 01, 2013

MongoDB is to NoSQL like MySQL to SQL — in the most harmful way

Yesterday evening I tweeted: “MongoDB seems to be as bad for NoSQL as MySQL is for SQL.” Unfortunately, I tweeted without context. But I guess I couldn’t have given all the required context in a single tweet anyway, so I’m dedicating this post to it. I hope this answers some of the questions I’ve got in response to the tweet.

First of all, I think everybody should know that I’m not a NoSQL fanboy, yet I’m open to the polyglot persistence idea. This distinction doesn’t seem to make sense if you read NoSQL as “not only SQL” (as you are supposed to do). However, I believe there are NoSQL systems out there that greatly benefit from the idea that SQL is bad and not using SQL is good. In other words, they offer “not using SQL” as their main advantage. MongoDB seems to be one of them. Just my perception.

But if I don’t like NoSQL, then I should like MySQL? Not exactly. In my eyes, MySQL has done great harm to SQL because many of the problems people associate with SQL are in fact just MySQL problems. One of the more important examples is that MySQL is rather poor at joining because it only supports nested loop joins. Most other SQL databases implement the hash join and sort/merge join algorithms too—both deliver better performance for non-tiny data sets. Considering the wide adoption of MySQL (“The most popular open source database”) and the observation that many people move away from SQL because “joins are slow,” it isn’t far-fetched to say that an implementation limitation of MySQL pushes people towards NoSQL.

Now let’s look at MongoDB. I think the direct competition between MongoDB and MySQL became most obvious in the epic video “MongoDB is Web Scale.” In the meantime, MongoDB even claims to be “the leading NoSQL database” — does that sound like “the most popular open source database”? Nevertheless, MongoDB has disappointed many people because it couldn’t live up to its promise of “web scale” (example: global write lock up to release 2.2).

The next piece in the puzzle that eventually caused me to tweet was a funny tweet by Gwen (Chen) Shapira (she’s an Oracle DB consultant):

#mongoDB : the big data platform that is challenging to scale over 100GB. http://blog.mongodb.org/post/62899600960/scaling-advice-from-mongohq

Note that the link was broken for a while (the post originally appeared on Sep 30, then disappeared, but is online since Oct 2 again at a different URL). The article is about handling MongoDB if it grows above 100GB. It gives me the impression that scaling MongoDB to that size is a serious issue. Even though there is no exact definition of “web scale” I guess most people would assume that it should be easy to scale MongoDB to 100GB. 100GB is not big data nowadays. 100GBs can be easily managed with most SQL DBs (joining in MySQL could be a problem). It was really funny to see this post on the MongoDB blog. Chen’s tweet nailed it.

At this point, I was once more thinking about the “misspent half-decade” mentioned by Jack Clark in his article “Google goes back to the future with SQL F1 database.” But as mentioned before, I like the idea of polyglot persistence. I’m not saying NoSQL is bullshit—not just because a single implementation fails to deliver. That would be like saying SQL is bullshit because MySQL is bad at joining. On the contrary, it reminded how Alex Popescu lost his temper in his post “The premature return to SQL” last Friday. His response to the “misspent half-decade” was:

Just take a second and think what we got during this misspent half-decade: Redis, Cassandra, Riak, a multi-parallel fully programmatic way to process data, Cascading, Pig, Cypher, ReQL and many more tools, languages, and APIs for processing data.

Well, I don’t know all of these but I do realize that some of them are interesting tools to have in the tool box. Further, I’ve been following Alex Popescu long enough to know that he is rather reflective on NoSQL—the title of his post being an exception. That’s why I came back to his post to see if he mentioned “the leading NoSQL database“ in his list. He didn’t. I don’t think it’s a coincidence.

At this point it was inevitable to see MongoDB as a popular, yet poor representative of its species—just like MySQL is.

If you like my way to explain things, you’ll love SQL Performance Explained.
