In a new post to his PHP in Action blog Dagfinn asks the question “is bad code good for you?” He wonders if bad code really is a good thing and how it can be split out from the good parts of your application.
30
2009
Admin notice: comments again broken
(This note should not go to PlanetMySQL, sorry if it goes).
We again have problem with comments to blog posts. It’s technical problem, we did upgrade some components and it seems it affected captcha. We are working on fix.
We are sorry for the inconvenience and thank for reports about the problem.
Entry posted by Vadim |
No comment
29
2009
Quick comparison of MyISAM, Infobright, and MonetDB
Recently I was doing a little work for a client who has MyISAM tables with many columns (the same one Peter wrote about recently). The client’s performance is suffering in part because of the number of columns, which is over 200. The queries are generally pretty simple (sums of columns), but they’re ad-hoc (can access any columns) and it seems tailor-made for a column-oriented database.
I decided it was time to actually give Infobright a try. They have an open-source community edition, which is crippled but not enough to matter for this test. The “Knowledge Grid” architecture seems ideal for the types of queries the client runs. But hey, why not also try MonetDB, another open-source column-oriented database I’ve been meaning to take a look at?
What follows is not a realistic benchmark, it’s not scientific, it’s just some quick and dirty tinkering. I threw up an Ubuntu 9.04 small server on Amazon. (I used this version because there’s a .deb of MonetDB for it). I created a table with 200 integer columns and loaded it with random numbers between 0 and 10000. Initially I wanted to try with 4 million rows, but I had trouble with MonetDB — there was not enough memory for this. I didn’t do anything fancy with the Amazon server — I didn’t fill up the /mnt disk to claim the bits, for example. I used default tuning, out of the box, for all three databases.
The first thing I tried doing was loading the data with SQL statements. I wanted to see how fast MyISAM vs. MonetDB would interpret really large INSERT statements, the kind produced by mysqldump. But MonetDB choked and told me the number of columns mismatched. I found reference to this on the mailing list, and skipped that. I used LOAD DATA INFILE instead (MonetDB’s version of that is COPY INTO). This is the only way to get data into Infobright, anyway.
The tests
I loaded 1 million rows into the table. Here’s a graph of the times (smaller is better):

MyISAM took 88 seconds, MonetDB took 200, and Infobright took 486. Here’s the size of the resulting table on disk (smaller is better):

MyISAM is 787MB, MonetDB is 791MB, and Infobright is 317MB. Next I ran three queries:
-
SELECT sum(c19), sum(c89), sum(c129) FROM t;
-
SELECT sum(c19), sum(c89), sum(c129) FROM t WHERE c11> 5;
-
SELECT sum(c19), sum(c89), sum(c129) FROM t WHERE c11 <5;
Graphs of query performance time for all three databases are really not very helpful, because MyISAM is so much slower that you can’t see the graphs for the others. So I’ll give the numbers and then omit MyISAM from the graphs. Here are the numbers for everything I measured:
| myisam | monetdb | infobright | |
|---|---|---|---|
| size (bytes) | 826000000 | 829946723 | 332497242 |
| load time (seconds) | 88 | 200 | 486 |
| query1 time | 3.4 | 0.012 | 0.0007 |
| query2 time | 3.4 | 0.15 | 1.2 |
| query3 time | 2.5 | 0.076 | 0.15 |
And here is a graph of Infobright duking it out with MonetDB on the three queries I tested (shorter bar is better):

I ran each query a few times, discarded the first run, and averaged the next three together.
Notes on Infobright
A few miscellaneous notes: don’t forget that Infobright is not just a storage engine plugged into MySQL. It’s a complete server with a different optimizer, etc. This point was hammered home during the LOAD DATA INFILE, when I looked to see what was taking so long (I was tempted to use oprofile and see if there are sleep() statements). What did I see in ‘top’ but a program called bhloader. This bhloader program was the only thing doing anything; mysqld wasn’t doing a thing. LOAD DATA INFILE in Infobright isn’t what it seems to be. Otherwise, Infobright behaved about as I expected it to; it seemed pretty normal to a MySQL guy.
Notes on MonetDB
MonetDB was a bit different. I had to be a bit resourceful to get everything going. The documentation was for an old version, and was pretty sparse. I had to go to the mailing lists to find the correct COPY syntax — it wasn’t that listed in the online manual. And there were funny things like a “merovingian” process (think “angel”) that had to be started before the server would start, and I had to destroy the demo database and recreate it before I could start it as shown in the tutorials.
MonetDB has some unexpected properties; it is not a regular RDBMS. Still, I’m quite impressed by it in some ways. For example, it seems quite nicely put together, and it’s not at all hard to learn.
It doesn’t really “speak SQL” — it speaks relational algebra, and the SQL is just a front-end to it. You can talk XQuery to it, too. I’m not sure if you can talk dirty to it, but you can sure talk nerdy to it: you can, should you choose to, give it instructions in MonetDB Assembly Language (MAL), the underlying language. An abstracted front-end is a great idea; MySQL abstracts the storage backend, but why not do both? Last I checked, Drizzle is going this direction, hurrah!
EXPLAIN is enlightening and frightening! You get to see the intermediate code from the compiler. The goggles, they do nothing!
From what I was able to learn about MonetDB in an hour, I believe it uses memory-mapped files to hold the data in-memory. If this is true, it explains why I couldn’t load 4 million rows into it (this was a 32-bit Amazon machine).
The SQL implementation is impressive. It’s a really solid subset of SQL:2003, much more than I expected. It even has CTEs, although not recursive ones. (No, there is no REPLACE, and there is no INSERT/ON DUPLICATE KEY UPDATE.) I didn’t try the XQuery interface.
Although I didn’t try it out, there are what looks like pretty useful instrumentation interfaces for profiling, debugging and the like. The query timer is in milliseconds (why doesn’t mysql show query times in microseconds? I had to resort to Perl + Time::HiRes for timing the Infobright queries).
I think it can be quite useful. However, I’m not quite sure it’s useful for “general-purpose” database use — there are a number of limitations (concurrency, for one) and it looks like it’s still fairly experimental.
Entry posted by Baron Schwartz |
26 comments
29
2009
Launched German Symfony User Group
I launched a new Symfony User Group for the german area Stuttgart, Ludwigsburg and Heilbronn. Everyon who is interessted can join us via http://symfony-usergroup-schwaben.mixxt.de. We will try to hold regular meetings in the future as soon as we have enough members.
So come on over and join us
28
2009
Copy&Paste Detector Task for Phing
Hi,
I use the Hudson Continuous Integration system to build my Symfony projects (will write a blog series about this in the next few days). For that I created a new task for the Phing build system. Sebastian Bergamann wrote a Copy&Paste Detector for PHP some time ago (PHPCPD). This small tool scans PHP sourcecode for duplications. You can use the Task in a Phing build file as shown in the following sample:
<phpcpd haltonerror="false">
<fileset dir="${builddir}" id="filestocpd">
<include name="apps/**/*.php" />
<include name="lib/de/**/*.php" />
<include name="lib/task/**/*.php" />
<include name="lib/services/**/*.php" />
<include name="lib/form/**/*.php" />
<include name="lib/model/**/*.php" />
</fileset>
<formatter type="pmd" outfile="reports/pmd-cpd.xml"/>
</phpcpd>
The CPD task is incorporatated in the current Phing trunk. If you want to use it you need to checkout the Version2.4.0 RC at the moment. The PEAR Installer version doesn’t contain the Task at the moment.
Maybe this is useful for you too. Stay tuned for more informations on Symfony and Continuous integration
28
2009
How number of columns affects performance ?
It is pretty understood the tables which have long rows tend to be slower than tables with short rows. I was interested to check if the row length is the only thing what matters or if number of columns we have to work with also have an important role. I was interested in peak row processing speed so I looked at full table scan in case data fits in OS cache completely. I created 3 tables – First containing single tinyint column which is almost shortest type possible (CHAR(0) could be taking less space), table with 1 tinyint column and char(99) column and table with 100 tinyint columns. The former two tables have the same row length but have number of column different 50 times. Finally I have created 4th table which is also 100 columns but one of them is VARCHAR causes raw format to be dynamic.
More specially:
-
CREATE TABLE `t1` (
-
`t1` tinyint(3) UNSIGNED NOT NULL
-
) ENGINE=MyISAM DEFAULT CHARSET=latin1
-
CREATE TABLE `t1c99` (
-
`t1` tinyint(3) UNSIGNED NOT NULL,
-
`c99` char(99) NOT NULL
-
) ENGINE=MyISAM DEFAULT CHARSET=latin1
-
CREATE TABLE `t100` (
-
`t1` tinyint(3) UNSIGNED NOT NULL,
-
`t2` tinyint(3) UNSIGNED NOT NULL,
-
…
-
`t99` tinyint(3) UNSIGNED NOT NULL,
-
`t100` tinyint(3) UNSIGNED NOT NULL
-
) ENGINE=MyISAM DEFAULT CHARSET=latin1
-
CREATE TABLE `t99v1` (
-
`t1` tinyint(3) UNSIGNED NOT NULL,
-
`t2` tinyint(3) UNSIGNED NOT NULL,
-
…
-
`t99` tinyint(3) UNSIGNED NOT NULL,
-
`v1` varchar(1) NOT NULL
-
) ENGINE=MyISAM DEFAULT CHARSET=latin1
I populated each of the tables with 12M rows. Getting 7 bytes row size for first table and 101 byte for second and third.
I used simple scan query: select max(t1) from t100; for the test.
The result was as follows:
t1 – 1.00 sec (12M rows/sec ; 80MB/sec)
t1c99 – 1.71 sec (7M rows/sec ; 676MB/sec)
t100 – 1.77 sec (7M rows/sec ; 653MB/sec)
t99v1 – 12.36 sec (1M rows/sec ; 93MB/sec)
This shows there is surely the problem with dynamic row format table with many columns. But is it because of large number of columns or dynamic format on its own is slave ?
I have tested yet another table structure:
-
CREATE TABLE `t1v1` (
-
`t1` tinyint(3) UNSIGNED NOT NULL,
-
`v` varchar(1) NOT NULL
-
) ENGINE=MyISAM DEFAULT CHARSET=latin1
This table has row length of 20 (which was a bit of surprise to me) and it has:
t1v1 – 1.83 sec (6.5M rows/sec; 125M/sec)
So there is surely the penalty for dynamic rows, however it is not very significant if number of columns is small. For large number of columns dynamic rows become very expensive and you have to watch out.
I have not looked at the code and would appreciate any developers comments but I guess for dynamic rows tables certain conversion has to take place when internal data structures are populated (everything but TEXTs/BLOBs is fixed length when it is being processed). This conversion process depends on number of columns while for fixed rows the MyISAM storage format matches internal one so you can basically do memory copy which does not depends on number of columns.
Another interesting observation is access speed to different columns. the max(t1) and max(t99) were taking the same time which means there is no penalty for accessing column which is in the end of the table rather than at the start when it comes to MyISAM.
The common workaround working with such wide tables is to use covering indexes. I added one to t99v1 table and repeated the query:
-
mysql [localhost] {msandbox} (test)> SELECT max(t1+0) FROM t99v1;
-
+———–+
-
| max(t1+0) |
-
+———–+
-
| 0 |
-
+———–+
-
1 row IN SET (3.26 sec)
-
-
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT max(t1+0) FROM t99v1;
-
+—-+————-+——-+——-+—————+——+———+——+———-+————-+
-
| id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
-
+—-+————-+——-+——-+—————+——+———+——+———-+————-+
-
| 1 | SIMPLE | t99v1 | INDEX | NULL | t1 | 1 | NULL | 12000000 | USING INDEX |
-
+—-+————-+——-+——-+—————+——+———+——+———-+————-+
-
1 row IN SET (0.00 sec)
As you can see the index scan is not as fast as table scan scanning about 3.7M rows/sec but which is still pretty fast.
So this is all about MyISAM, what is about Innodb ? Here are results for Innodb with all data in buffer pool, to measure peak speed as well
The results for Innodb were:
t1 – 5.11 sec (2.3M rows/sec)
t1c99 – 5.74 sec (2.1M rows/sec)
t100 – 15.16 sec (0.8M rows/sec)
t99v1 – 14.93 sec (0.8M rows/sec)
t1v1 – 5.26 sec (2.3M rows/sec)
t99v1 (covering idx) – 5.62 sec (2.1M rows/sec)
As you can see Innodb is a lot slower and has behavior similar to Dynamic Row tables in both cases. This is because Innodb does not store data in native MyISAM format and conversion is needed in all cases. We can also see the table scan speed can be up to 5 times slower, for very short rows – some of this goes back to the fact Innodb rows have a lot of transaction control overhead attached to them.
Also note the covering index scan speed is very similar to full table scan speed – this is rather expected as table data is stored in BTREE index very similarly to how indexes are stored.
Summary: Beware of dynamic row format tables with many columns they might bite you with surprise slowdown. MyISAM is much faster than Innodb when it comes to in memory full table scan.
P.S Tests were done on MySQL 5.4.2 on Intel(R) Xeon(R) CPU E5405 @ 2.00GHz CPU.
Entry posted by peter |
9 comments
28
2009
Manipulating Images with PHP and GraphicsMagick
Like most developers, I’ve known for a while that PHP comes with GD support for basic image generation and processing tasks, and I’ve even used it on occasion to dynamically create images. However, what I didn’t know was that, hidden in PECL, is an extension to the GraphicsMagick library, which allows for all kinds of sophisticated image operations and effects. In this article, I’ll introduce you to ext/gmagick, showing you some of the cool things it can do for you (and your images).




