Jul
29
2021
--

PostgreSQL PL/Java – A How-To, Part 1

PostgreSQL PL:Java

We’ve recently received some questions regarding PL/Java and I found it hard to get clear instructions searching on the internet. It’s not that there is no good information out there, but most of it is either incomplete, outdated, or confusing and I decided to create this short “how-to” and show how to install it and how to get it running with few examples.

Installation

I will show here how to install it from sources, first because my platform doesn’t have the compiled binaries, and second because if your platform has the binaries from the package manager you can just install it from there, for example using YUM or APT. Also, note that I’m using PL/Java without the designation “TRUSTED” and a Postgres database superuser for simplicity. I would recommend reading the documentation about users and privileges here[1].

The versions of the software I’m using here are:

  • PostgreSQL 12.7
  • PL/Java 1.6.2
  • OpenJDK 11
  • Apache Maven 3.6.3

I downloaded the sources from “https://github.com/tada/pljava/releases“, unpackaged and compiled with maven:

wget https://github.com/tada/pljava/archive/refs/tags/V1_6_2.tar.gz
tar -xf V1_6_2.tar.gz
cd pljava-1_6_2
mvn clean install
java -jar pljava-packaging/target/pljava-pg12.jar

I’ll assume here that you know maven enough and won’t go through the “mvn” command. The “java -jar pljava-packaging/target/pljava-pg12.jar” will copy/install the needed files and packages into Postgres folders. Note that maven used my Postgres version and created the jar file with the version: “pljava-pg12.jar“, so pay attention to the version you have there as the jar file will change if you have a different Postgres version!

I can now create the extension into the database I will use it. I’m using the database “demo” in this blog:

$ psql demo
psql (12.7)
Type "help" for help.

demo=# CREATE EXTENSION pljava;
WARNING: Java virtual machine not yet loaded
DETAIL: libjvm: cannot open shared object file: No such file or directory
HINT: SET pljava.libjvm_location TO the correct path to the jvm library (libjvm.so or jvm.dll, etc.)
ERROR: cannot use PL/Java before successfully completing its setup
HINT: Check the log for messages closely preceding this one, detailing what step of setup failed and what will be needed, probably setting one of the "pljava." configuration variables, to complete the setup. If there is not enough help in the log, try again with different settings for "log_min_messages" or "log_error_verbosity".

Not exactly what I was expecting but I got a good hint: “HINT: SET pljava.libjvm_location TO the correct path to the jvm library (libjvm.so or jvm.dll, etc.)“. Ok, I had to find the libjvm my system is using to configure Postgres. I used the SET command to do it online:

demo=# SET pljava.libjvm_location TO '/usr/lib/jvm/java-11-openjdk-11.0.11.0.9-5.fc34.x86_64/lib/server/libjvm.so';
NOTICE: PL/Java loaded
DETAIL: versions:
PL/Java native code (1.6.2)
PL/Java common code (1.6.2)
Built for (PostgreSQL 12.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.1.1 20210531 (Red Hat 11.1.1-3), 64-bit)
Loaded in (PostgreSQL 12.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.1.1 20210531 (Red Hat 11.1.1-3), 64-bit)
OpenJDK Runtime Environment (11.0.11+9)
OpenJDK 64-Bit Server VM (11.0.11+9, mixed mode, sharing)
NOTICE: PL/Java successfully started after adjusting settings
HINT: The settings that worked should be saved (using ALTER DATABASE demo SET ... FROM CURRENT or in the "/v01/data/db/pg12/postgresql.conf" file). For a reminder of what has been set, try: SELECT name, setting FROM pg_settings WHERE name LIKE 'pljava.%' AND source = 'session'
NOTICE: PL/Java load successful after failed CREATE EXTENSION
DETAIL: PL/Java is now installed, but not as an extension.
HINT: To correct that, either COMMIT or ROLLBACK, make sure the working settings are saved, exit this session, and in a new session, either: 1. if committed, run "CREATE EXTENSION pljava FROM unpackaged", or 2. if rolled back, simply "CREATE EXTENSION pljava" again.
SET

Also used the “ALTER SYSTEM” to make it persistent across all my databases as it writes the given parameter setting to the “postgresql.auto.conf” file, which is read in addition to “postgresql.conf“:

demo=# ALTER SYSTEM SET pljava.libjvm_location TO '/usr/lib/jvm/java-11-openjdk-11.0.11.0.9-5.fc34.x86_64/lib/server/libjvm.so';
ALTER SYSTEM

Now we have it installed we can check the system catalog if it is indeed there:

demo=# SELECT * FROM pg_language WHERE lanname LIKE 'java%';
oid    | lanname | lanowner | lanispl | lanpltrusted | lanplcallfoid | laninline | lanvalidator | lanacl
-------+---------+----------+---------+--------------+---------------+-----------+--------------+-------------------
16428  | java    | 10       | t       | t            | 16424         | 0         | 16427        | {charly=U/charly}
16429  | javau   | 10       | t       | f            | 16425         | 0         | 16426        |
(2 rows)

And test if it is working:

demo=# CREATE FUNCTION getProperty(VARCHAR)
RETURNS VARCHAR
AS 'java.lang.System.getProperty'
LANGUAGE java;
CREATE FUNCTION
demo=# SELECT getProperty('java.version');
getproperty
-------------
11.0.11
(1 row)

It’s working! Time to try something useful.

Accessing Database Objects with PL/Java

The majority of the examples I found showed how to do a “hello world” from a Java class or how to calculate a Fibonacci sequence but nothing how to access database objects. Well, nothing wrong with those examples but I suppose that one who installs PL/Java in his database would like to access database objects from inside of a Java function and this is what we gonna do here.

I will use the sample database “pagila” that can be found here[2] for our tests in this post.

For this first example, I will create a simple class with a static method that will be accessed outside like any Postgres function. The function will receive an integer argument and use it to search the table “customer”, column “customer_id” and will print the customer’s id, full name, email,  and address:

package com.percona.blog.pljava;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Customers {
	private static String m_url = "jdbc:default:connection";

	public static String getCustomerInfo(Integer id) throws SQLException {
		Connection conn = DriverManager.getConnection(m_url);
		String query = "SELECT c.customer_id, c.last_name ||', '|| c.first_name as full_name, "
				+ " c.email, a.address, ci.city, a.district "
				+ " FROM customer c"
				+ "	 JOIN address a on a.address_id = c.address_id "
				+ "	JOIN city ci on ci.city_id = a.city_id "
				+ " WHERE customer_id = ?";

		PreparedStatement stmt = conn.prepareStatement(query);
		stmt.setInt(1, id);
		ResultSet rs = stmt.executeQuery();
		rs.next();
		String ret; 
		ret = "- ID: " + rs.getString("customer_id") ;
		ret += "\n- Name: " + rs.getString("full_name");
		ret += "\n- Email: " + rs.getString("email");
		ret += "\n- Address: " + rs.getString("address");
		ret += "\n- City: " + rs.getString("city");
		ret += "\n- District: " + rs.getString("district");
		ret += "\n--------------------------------------------------------------------------------";

		stmt.close();
		conn.close();

		return (ret);
	}
}

I’ve compiled and created the “jar” file manually with the below commands:

javac com/percona/blog/pljava/Customers.java
jar -c -f /app/pg12/lib/demo.jar com/percona/blog/pljava/Customers.class

Note that I’ve created the jar file inside the folder “/app/pg12/lib”, keep notes because we’ll use this information in the next step, loading the jar file inside Postgres:

demo=# SELECT sqlj.install_jar( 'file:///app/pg12/lib/demo.jar', 'demo', true );
 install_jar 
-------------
(1 row)

demo=# SELECT sqlj.set_classpath( 'public', 'demo' );
 set_classpath 
---------------
(1 row)

The install_jar function has the signature “install_jar(<jar_url>, <jar_name>, <deploy>)” and it loads a jar file from a location appointed by an URL into the SQLJ jar repository. It is an error if a jar with the given name already exists in the repository or if the jar doesn’t exist in the URL or the database isn’t able to read it:

demo=# SELECT sqlj.install_jar( 'file:///app/pg12/lib/demo2.jar', 'demo', true );
ERROR:  java.sql.SQLException: I/O exception reading jar file: /app/pg12/lib/demo2.jar (No such file or directory)
demo=# SELECT sqlj.install_jar( 'file:///app/pg12/lib/demo.jar', 'demo', true );
 install_jar 
------------- 
(1 row)

demo=# SELECT sqlj.install_jar( 'file:///app/pg12/lib/demo.jar', 'demo', true );
ERROR:  java.sql.SQLNonTransientException: A jar named 'demo' already exists

The function set_classpath defines a classpath for the given schema, in this example the schema “public”. A classpath consists of a colon-separated list of jar names or class names. It’s an error if the given schema does not exist or if one or more jar names references non-existent jars.

The next step is to create the Postgres functions:

demo=# CREATE FUNCTION getCustomerInfo( INT ) RETURNS CHAR AS 
    'com.percona.blog.pljava.Customers.getCustomerInfo( java.lang.Integer )'
LANGUAGE java;
CREATE FUNCTION

We can now use it:

demo=# SELECT getCustomerInfo(100);
                                 getcustomerinfo                                  
----------------------------------------------------------------------------------
 - ID: 100                                                                       +
 - Name: HAYES, ROBIN                                                            +
 - Email: ROBIN.HAYES@sakilacustomer.org                                         +
 - Address: 1913 Kamakura Place                                                  +
 - City: Jelets                                                                  +
 - District: Lipetsk                                                             +
 --------------------------------------------------------------------------------
(1 row)

Sweet, we have our first Java function inside our Postgres demo database.

Now, in our last example here I will add another method to this class, now to list all the payments from a given customer and calculate its total:

package com.percona.blog.pljava;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Customers {
	private static String m_url = "jdbc:default:connection";

	public static String getCustomerInfo(Integer id) throws SQLException {
		Connection conn = DriverManager.getConnection(m_url);
		String query = "SELECT c.customer_id, c.last_name ||', '|| c.first_name as full_name, "
				+ " c.email, a.address, ci.city, a.district "
				+ " FROM customer c"
				+ "	 JOIN address a on a.address_id = c.address_id "
				+ "	JOIN city ci on ci.city_id = a.city_id "
				+ " WHERE customer_id = ?";

		PreparedStatement stmt = conn.prepareStatement(query);
		stmt.setInt(1, id);
		ResultSet rs = stmt.executeQuery();
		rs.next();
		String ret; 
		ret = "- ID: " + rs.getString("customer_id") ;
		ret += "\n- Name: " + rs.getString("full_name");
		ret += "\n- Email: " + rs.getString("email");
		ret += "\n- Address: " + rs.getString("address");
		ret += "\n- City: " + rs.getString("city");
		ret += "\n- District: " + rs.getString("district");
		ret += "\n--------------------------------------------------------------------------------";

		stmt.close();
		conn.close();

		return (ret);
	}

	public static String getCustomerTotal(Integer id) throws SQLException {
		Connection conn;
		PreparedStatement stmt;
		ResultSet rs;
		String result;
		double total;

		conn = DriverManager.getConnection(m_url);
		stmt = conn.prepareStatement(
				"SELECT c.customer_id, c.first_name, c.last_name FROM customer c WHERE c.customer_id = ?");
		stmt.setInt(1, id);
		rs = stmt.executeQuery();
		if (rs.next()) {
			result = "Customer ID  : " + rs.getInt("customer_id");
			result += "\nCustomer Name: " + rs.getString("last_name") + ", " + rs.getString("first_name");
			result += "\n--------------------------------------------------------------------------------------------------------";
		} else {
			return null;
		}

		stmt = conn.prepareStatement("SELECT p.payment_date, p.amount FROM payment p WHERE p.customer_id = ? ORDER BY 1");
		stmt.setInt(1, id);
		rs = stmt.executeQuery();
		total = 0;

		while (rs.next()) {
			result += "\nPayment date: " + rs.getString("payment_date") + ",    Value: " + rs.getString("amount");
			total += rs.getFloat("amount");
		}
		result += "\n--------------------------------------------------------------------------------------------------------";
		result += "\nTotal: " +String.format("%1$,.2f",  total);
		
		stmt.close();
		conn.close();
		return (result);
	}
}

Same instructions to compile:

javac com/percona/blog/pljava/Customers.java 
jar -c -f /app/pg12/lib/demo.jar com/percona/blog/pljava/Customers.class

Then we need to replace the loaded jar file for the newly created and create the function inside Postgres:

demo=# SELECT sqlj.replace_jar( 'file:///app/pg12/lib/demo.jar', 'demo', true );
 replace_jar 
-------------
(1 row)

demo=# CREATE FUNCTION getCustomerTotal( INT ) RETURNS CHAR AS 
    'com.percona.blog.pljava.Customers.getCustomerTotal( java.lang.Integer )'
LANGUAGE java;
CREATE FUNCTION

And the result is:

demo=# SELECT getCustomerInfo(100);
                                 getcustomerinfo                                  
----------------------------------------------------------------------------------
 - ID: 100                                                                       +
 - Name: HAYES, ROBIN                                                            +
 - Email: ROBIN.HAYES@sakilacustomer.org                                         +
 - Address: 1913 Kamakura Place                                                  +
 - City: Jelets                                                                  +
 - District: Lipetsk                                                             +
 --------------------------------------------------------------------------------
(1 row)

We finish this part here and with this last example. At this point, we are able to access objects, loop through a resultset, and return the result back as a single object like a TEXT. I will discuss how to return an array/resultset, how to use PL/Java functions within triggers, and how to use external resources in part two and part three of this article, stay tuned!

[1] https://tada.github.io/pljava/use/policy.html
[2] https://www.postgresql.org/ftp/projects/pgFoundry/dbsamples/pagila/

Jul
12
2018
--

Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance

Execution map for func1()

MySQL stored procedures, functions and triggers are tempting constructs for application developers. However, as I discovered, there can be an impact on database performance when using MySQL stored routines. Not being entirely sure of what I was seeing during a customer visit, I set out to create some simple tests to measure the impact of triggers on database performance. The outcome might surprise you.

Why stored routines are not optimal performance wise: short version

Recently, I worked with a customer to profile the performance of triggers and stored routines. What I’ve learned about stored routines: “dead” code (the code in a branch which will never run) can still significantly slow down the response time of a function/procedure/trigger. We will need to be careful to clean up what we do not need.

Profiling MySQL stored functions

Let’s compare these four simple stored functions (in MySQL 5.7):

Function 1:

CREATE DEFINER=`root`@`localhost` FUNCTION `func1`() RETURNS int(11)
BEGIN
	declare r int default 0;
RETURN r;
END

This function simply declares a variable and returns it. It is a dummy function

Function 2:

CREATE DEFINER=`root`@`localhost` FUNCTION `func2`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2
    THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

This function calls another function, levenshtein_limit_n (calculates levenshtein distance). But wait: this code will never run – the condition IF 1=2 will never be true. So that is the same as function 1.

Function 3:

CREATE DEFINER=`root`@`localhost` FUNCTION `func3`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select levenshtein_limit_n('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select levenshtein_limit_n('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select levenshtein_limit_n('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

Here there are four conditions and none of these conditions will be true: there are 4 calls of “dead” code. The result of the function call for function 3 will be the same as function 2 and function 1.

Function 4:

CREATE DEFINER=`root`@`localhost` FUNCTION `func3_nope`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select does_not_exit('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select does_not_exit('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select does_not_exit('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select does_not_exit('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

This is the same as function 3 but the function we are running does not exist. Well, it does not matter as the

select does_not_exit

  will never run.

So all the functions will always return 0. We expect that the performance of these functions will be the same or very similar. Surprisingly it is not the case! To measure the performance I used the “benchmark” function to run the same function 1M times. Here are the results:

+-----------------------------+
| benchmark(1000000, func1()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (1.75 sec)
+-----------------------------+
| benchmark(1000000, func2()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (2.45 sec)
+-----------------------------+
| benchmark(1000000, func3()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (3.85 sec)
+----------------------------------+
| benchmark(1000000, func3_nope()) |
+----------------------------------+
|                                0 |
+----------------------------------+
1 row in set (3.85 sec)

As we can see func3 (with four dead code calls which will never be executed, otherwise identical to func1) runs almost 3x slower compared to func1(); func3_nope() is identical in terms of response time to func3().

Visualizing all system calls from functions

To figure out what is happening inside the function calls I used performance_schema / sys schema to create a trace with ps_trace_thread() procedure

  1. Get the thread_id for the MySQL connection:
    mysql> select THREAD_ID from performance_schema.threads where processlist_id = connection_id();
    +-----------+
    | THREAD_ID |
    +-----------+
    |        49 |
    +-----------+
    1 row in set (0.00 sec)
  2. Run ps_trace_thread in another connection passing the thread_id=49:
    mysql> CALL sys.ps_trace_thread(49, concat('/var/lib/mysql-files/stack-func1-run1.dot'), 10, 0, TRUE, TRUE, TRUE);
    +--------------------+
    | summary            |
    +--------------------+
    | Disabled 0 threads |
    +--------------------+
    1 row in set (0.00 sec)
    +---------------------------------------------+
    | Info                                        |
    +---------------------------------------------+
    | Data collection starting for THREAD_ID = 49 |
    +---------------------------------------------+
    1 row in set (0.00 sec)
  3. At that point I switched to the original connection (thread_id=49) and run:
    mysql> select func1();
    +---------+
    | func1() |
    +---------+
    |       0 |
    +---------+
    1 row in set (0.00 sec)
  4. The sys.ps_trace_thread collected the data (for 10 seconds, during which I ran the
    select func1()

     ), then it finished its collection and created the dot file:

    +-----------------------------------------------------------------------+
    | Info                                                                  |
    +-----------------------------------------------------------------------+
    | Stack trace written to /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-----------------------------------------------------------------------+
    1 row in set (9.21 sec)
    +-------------------------------------------------------------------------------+
    | Convert to PDF                                                                |
    +-------------------------------------------------------------------------------+
    | dot -Tpdf -o /tmp/stack_49.pdf /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-------------------------------------------------------------------------------+
    1 row in set (9.21 sec)
    +-------------------------------------------------------------------------------+
    | Convert to PNG                                                                |
    +-------------------------------------------------------------------------------+
    | dot -Tpng -o /tmp/stack_49.png /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-------------------------------------------------------------------------------+
    1 row in set (9.21 sec)
    Query OK, 0 rows affected (9.45 sec)

I repeated these steps for all the functions above and then created charts of the commands.

Here are the results:

Func1()

Execution map for func1()

Func2()

Execution map for func2()

Func3()

Execution map for func3()

 

As we can see there is a sp/jump_if_not call for every “if” check followed by an opening tables statement (which is quite interesting). So parsing the “IF” condition made a difference.

For MySQL 8.0 we can also see MySQL source code documentation for stored routines which documents how it is implemented. It reads:

Flow Analysis Optimizations
After code is generated, the low level sp_instr instructions are optimized. The optimization focuses on two areas:

Dead code removal,
Jump shortcut resolution.
These two optimizations are performed together, as they both are a problem involving flow analysis in the graph that represents the generated code.

The code that implements these optimizations is sp_head::optimize().

However, this does not explain why it executes “opening tables”. I have filed a bug.

When slow functions actually make a difference

Well, if we do not plan to run one million of those stored functions we will never even notice the difference. However, where it will make a difference is … inside a trigger. Let’s say that we have a trigger on a table: every time we update that table it executes a trigger to update another field. Here is an example: let’s say we have a table called “form” and we simply need to update its creation date:

mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.31 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

That is good and fast. Now we create a trigger which will call our dummy func1():

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
	select func1() into r;
END

Now repeat the update. Remember: it does not change the result of the update as we do not really do anything inside the trigger.

mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.90 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Just adding a dummy trigger will add 2x overhead: the next trigger, which does not even run a function, introduces a slowdown:

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.52 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Now, lets use func3 (which has “dead” code and is equivalent to func1):

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
	select func3() into r;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (1.06 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

However, running the code from the func3 inside the trigger (instead of calling a function) will speed up the update:

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select levenshtein_limit_n('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select levenshtein_limit_n('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select levenshtein_limit_n('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.66 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Memory allocation

Potentially, even if the code will never run, MySQL will still need to parse the stored routine—or trigger—code for every execution, which can potentially lead to a memory leak, as described in this bug.

Conclusion

Stored routines and trigger events are parsed when they are executed. Even “dead” code that will never run can significantly affect the performance of bulk operations (e.g. when running this inside the trigger). That also means that disabling a trigger by setting a “flag” (e.g.

if @trigger_disable = 0 then ...

 ) can still affect performance of bulk operations.

The post Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance appeared first on Percona Database Performance Blog.

Aug
16
2009
--

A micro-benchmark of stored routines in MySQL

Ever wondered how fast stored routines are in MySQL? I just ran a quick micro-benchmark to compare the speed of a stored function against a “roughly equivalent” subquery. The idea — and there may be shortcomings that are poisoning the results here, your comments welcome — is to see how fast the SQL procedure code is at doing basically the same thing the subquery code does natively (so to speak).

Before we go further, I want to make sure you know that the queries I’m writing here are deliberately mis-optimized to force a bad execution plan. You should never use IN() subqueries the way I do, at least not in MySQL 5.1 and earlier.

I loaded the World sample database and cooked up this query:

SQL:

  1. SELECT sql_no_cache sum(ci.Population) FROM City AS ci
  2.   WHERE CountryCode IN (
  3.     SELECT DISTINCT co.Code FROM Country AS co
  4.       INNER JOIN CountryLanguage AS cl ON cl.CountryCode = co.Code
  5.     WHERE lower(cl.LANGUAGE) = ‘English’);
  6. +——————–+
  7. | sum(ci.Population) |
  8. +——————–+
  9. |          237134840 |
  10. +——————–+
  11. 1 row IN SET (0.23 sec)

This pretty consistently runs in just about 1/4th of a second. If you look at the abridged explain plan below, you’ll see the query is doing a table scan against the first query, and then executing the subquery for each row:

SQL:

  1. mysql> EXPLAIN SELECT ….\G
  2. *************************** 1. row ***************************
  3.            id: 1
  4.   select_type: PRIMARY
  5.         TABLE: ci
  6.          type: ALL
  7. possible_keys: NULL
  8.           KEY: NULL
  9.       key_len: NULL
  10.           ref: NULL
  11.          rows: 4079
  12.         Extra: USING WHERE
  13. *************************** 2. row ***************************
  14.            id: 2
  15.   select_type: DEPENDENT SUBQUERY
  16. *************************** 3. row ***************************
  17.            id: 2
  18.   select_type: DEPENDENT SUBQUERY

Now I took the subquery and basically rewrote it as a stored function.

SQL:

  1. mysql> delimiter //
  2. mysql> CREATE FUNCTION speaks_english(c char(3)) returns integer deterministic
  3.     > begin
  4.     > declare res int;
  5.     > SELECT count(DISTINCT co.Code) INTO res FROM Country AS co INNER JOIN CountryLanguage AS cl ON cl.CountryCode = co.Code WHERE lower(cl.LANGUAGE) = ‘English’ AND co.Code = c;
  6.     > RETURN res;
  7.     > end//
  8. mysql> delimiter ;

Now the query can be rewritten as this:

SQL:

  1. mysql> SELECT sql_no_cache sum(ci.Population) FROM City AS ci WHERE speaks_english(CountryCode)> 0;
  2. +——————–+
  3. | sum(ci.Population) |
  4. +——————–+
  5. |          237134840 |
  6. +——————–+
  7. 1 row IN SET (1.00 sec)

If we explain it, we get output similar to the first table shown above, but the further two rows are not shown. The query can’t be optimized to use indexes, and the stored function is opaque to the optimizer. This is why I purposefully wrote the subquery badly in the first query! (If you think of a better way to compare apples and uhm, apples… please comment).

The poorly-optimized-subquery portion of the query essentially happens inside that function now.

And it’s four times slower, consistently, and that’s all I wanted to show here. Thanks for reading.


Entry posted by Baron Schwartz |
7 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com