Nov 19, 2018

Installing and Configuring JIT in PostgreSQL 11

Just-in-time (JIT) compilation of SQL statements is one of the highlighted features in PostgreSQL 11. There is great excitement in the community because of the many claims of up to a 30% jump in performance. But not all queries and workloads benefit from JIT compilation, so you may want to test your workload against this new feature.

However, it is important to have a general understanding of what it does and where we can expect performance gains. Installing PostgreSQL 11 with the new JIT compilation feature requires a few extra steps and packages. The time and effort needed to figure them out shouldn’t be a reason to shy away from trying this cutting-edge feature and testing a workload against it. This blog post is for those who want to try it.

What is JIT and what does it do in PostgreSQL?

Normal SQL execution in any DBMS software is similar to what an interpreted language does with source code: no machine code gets generated from your SQL statement. But we all know how dramatic the performance gains can be when machine code is JIT-compiled and executed. We saw the magic Google’s V8 engine did for JavaScript. The quest for doing something similar with SQL statements has been around for quite some time. But it is a challenging task.

It is challenging because the source code (the SQL statement) is not readily available within the PostgreSQL server. The statements that need to undergo JIT compilation come from client connections, and they may contain expressions or functions with varying numbers of arguments, operating on tables with different numbers and types of columns.

Generally, a computer program doesn’t get modified at this level while it is running, so compilation and branch prediction can happen ahead of time. The unpredictable, dynamic nature of SQL statements arriving from client connections leaves no scope for compiling anything in advance. That means the JIT compiler has to kick in every time the database receives a new SQL statement. For this reason, PostgreSQL needs compiler infrastructure like LLVM continuously available behind the scenes. Even though there were a couple of other options, the main developer of this feature (Andres Freund) had strong reasons why LLVM was the right choice.

In PostgreSQL 11, the JIT feature currently covers the following (the pg_settings query after this list shows the parameters that control each behavior):

  1. Accelerated expression evaluation: expressions in WHERE clauses, target lists, aggregates, and projections
  2. Tuple deforming: converting the on-disk tuple image to its corresponding in-memory representation
  3. In-lining: the bodies of small custom functions, operators, and user-defined data types are inlined into the expressions that use them
  4. Compiler optimizations: the optimization passes provided by LLVM are applied to produce optimized machine code
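
Each of these behaviors is controlled by its own parameter. A quick way to see all the JIT-related knobs a PostgreSQL 11 instance exposes is to query the pg_settings view from psql:

postgres=# SELECT name, setting FROM pg_settings WHERE name LIKE 'jit%' ORDER BY name;

Among these, jit switches the feature on or off, while jit_above_cost, jit_inline_above_cost, and jit_optimize_above_cost set the estimated query cost above which compilation, in-lining, and LLVM optimization kick in, respectively.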

In this blog, we are going to see how to install PostgreSQL with JIT. Just like regular PostgreSQL installations, we have two options:

  1. Get PostgreSQL from the packages in the PGDG repository
  2. Build PostgreSQL from source

Option 1. Install from PGDG repository.

Compiling from source requires us to install all the compilers and tools, which we might want to avoid for various reasons. Installing packages from the PGDG repository is straightforward. On production systems or in a container, you might want to install only the bare minimum of required packages, since additional packages you don’t really use are always a security concern. Distributions like Ubuntu provide more recent versions of libraries and toolsets in their default repos. However, distributions like CentOS / RHEL are quite conservative: their priority is stability and proven software rather than cutting-edge features. So this section of the post is mostly relevant for CentOS 7 / RHEL 7.

Here are the steps for a bare minimum installation of PostgreSQL with the JIT feature on CentOS 7.

Step 1. Install the PGDG repo and the PostgreSQL server package.

This is usually the bare minimum installation if we don’t need the JIT feature.

sudo yum install https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
sudo yum install postgresql11-server

At this stage, we can initialize the data directory and start the service if we don’t need JIT:

sudo /usr/pgsql-11/bin/postgresql-11-setup initdb
sudo systemctl start postgresql-11
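
As a quick sanity check (assuming the default PGDG paths), we can confirm the service is running and the server answers queries:

sudo systemctl status postgresql-11
sudo -u postgres /usr/pgsql-11/bin/psql -c "SELECT version();"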

Step 2. Install EPEL repository

sudo yum install epel-release

Step 3. Install the postgresql11-llvmjit package

sudo yum install postgresql11-llvmjit

Since we have already added the EPEL repository, the dependencies can now be resolved by yum, which pulls and installs the necessary packages from EPEL. The installation output lists them:

...
Installing:
postgresql11-llvmjit      x86_64     11.1-1PGDG.rhel7     pgdg11    9.0 M
Installing for dependencies:
llvm5.0                   x86_64     5.0.1-7.el7          epel      2.6 M
llvm5.0-libs              x86_64     5.0.1-7.el7          epel      13 M
...

As we can see, two packages, llvm5.0 and llvm5.0-libs, get installed as dependencies.
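
If you want to double-check the installation, the JIT provider library and the pre-generated bitcode should now be present under the PostgreSQL lib directory; the paths below assume the standard PGDG layout:

ls /usr/pgsql-11/lib/llvmjit.so
ls /usr/pgsql-11/lib/bitcode | head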

Note for Ubuntu users:

As we already mentioned, the repositories of recent Ubuntu versions contain recent versions of the LLVM libraries. For example, the Ubuntu 16.04 LTS repo provides libllvm6.0. Moreover, the PostgreSQL server package is not split into a separate package for JIT-related files, so a default installation of PostgreSQL 11 gets you the JIT feature as well:

wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt install postgresql-11
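
As a quick check that the JIT machinery shipped with the package, look for llvmjit.so under the server’s lib directory; the path below assumes the standard Debian/Ubuntu packaging layout:

ls /usr/lib/postgresql/11/lib/llvmjit.so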

Option 2. Building from Source

The primary means of distributing PostgreSQL is source code. Building a minimal PostgreSQL instance requires just a C compiler, but building with JIT support requires a few more things. One of the challenges you can run into is build errors caused by older versions of LLVM and Clang present on the system.

Step 1. Download PostgreSQL source tarball and unpack

Tarballs are available on the PostgreSQL download server. We can grab and unpack the latest:

curl -LO https://ftp.postgresql.org/pub/source/v11.0/postgresql-11.0.tar.bz2
tar -xvf postgresql-11.0.tar.bz2
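
Optionally, we can verify the integrity of the tarball; a .sha256 file is typically published alongside it on the same server:

curl -LO https://ftp.postgresql.org/pub/source/v11.0/postgresql-11.0.tar.bz2.sha256
sha256sum -c postgresql-11.0.tar.bz2.sha256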

Step 2. Get the SCL repository and install the toolset

Recent versions of LLVM, Clang, and GCC are available in SCL. We can get everything in one stretch (devtoolset-7 is installed explicitly here because the PATH settings below refer to it):

sudo yum install centos-release-scl
sudo yum install devtoolset-7 llvm-toolset-7 llvm-toolset-7-llvm-devel.x86_64

Now you can set or edit your PATH to include all the new tools. I would prefer to put that into my profile file:

PATH=/opt/rh/devtoolset-7/root/usr/bin/:/opt/rh/llvm-toolset-7/root/usr/bin/:$PATH

Alternatively, we can open a shell with SCL enabled:

scl enable devtoolset-7 llvm-toolset-7 bash

We should attempt to compile the source from a shell with all these paths set.
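
Before running configure, it is worth confirming that the shell actually picks up the SCL toolchain rather than the system compilers:

gcc --version
clang --version
llvm-config --version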

Step 3. Install Additional libraries/tools

Based on the configuration options you want, this list may change. Consider this as a sample for demonstration purposes:

sudo yum install readline-devel zlib-devel libxml2-devel openssl-devel

Step 4. Configure with the --with-llvm option and make

Now we should be able to configure and make with our preferred options. The JIT feature will be available if the --with-llvm option is specified. For this demonstration, I am using an installation directory within my home directory (/home/postgres/pg11):

./configure --prefix=/home/postgres/pg11 --with-openssl --with-libxml --with-zlib --with-llvm
make
make install
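
Once installed, we can confirm that JIT support made it into the build; the paths below match the --prefix used above:

/home/postgres/pg11/bin/pg_config --configure
ls /home/postgres/pg11/lib/llvmjit.so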

Enabling JIT

You may observe that there is a new directory named bitcode under PostgreSQL’s lib folder. It contains lots of files with a .bc extension: these are pre-generated LLVM bitcode files that facilitate features like in-lining.
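
For example, listing that directory through pg_config (using the build installed above) shows the pre-generated bitcode:

ls $(/home/postgres/pg11/bin/pg_config --pkglibdir)/bitcode | head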

By default, the JIT feature is disabled in PostgreSQL 11. If you want to test it, you have to enable the jit parameter:

postgres=# ALTER SYSTEM SET jit=on;
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# show jit;
 jit
-----
 on
(1 row)

By default, most simple queries won’t use JIT because of cost: JIT kicks in only when the estimated query cost is high enough to cross the jit_above_cost threshold (100000 by default). In case we want to test whether JIT is properly configured, we can tell PostgreSQL that this threshold is very low by adjusting the parameter value. However, we should keep in mind that for such cheap queries we may be accepting a performance loss. Let me show a quick example:

postgres=# SET jit_above_cost=5;
SET
postgres=# create table t1 (id int);
CREATE TABLE
postgres=# insert into t1 (SELECT (random()*100)::int FROM generate_series(1,800000) as g);
INSERT 0 800000
postgres=# analyze t1;
ANALYZE
postgres=# explain select sum(id) from t1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=8706.88..8706.89 rows=1 width=8)
   ->  Gather  (cost=8706.67..8706.88 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=7706.67..7706.68 rows=1 width=8)
               ->  Parallel Seq Scan on t1  (cost=0.00..6873.33 rows=333333 width=4)
 JIT:
   Functions: 6
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(8 rows)

As we can see in the above example, a separate JIT section comes up in the explain plan.

We expect JIT compilation to make a difference in complex analytical queries, because the overhead of JIT compilation is compensated only if the generated code runs long enough. Here is a simple aggregate query for demonstration. (I know this is not a complex query, and not the perfect example for demonstrating the JIT feature):

postgres=# EXPLAIN ANALYZE SELECT COMPANY_ID,
      SUM(SHARES) TOT_SHARES,
      SUM(SHARES* RATE) TOT_INVEST,
      MIN(SHARES* RATE) MIN_TRADE,
      MAX(SHARES* RATE) MAX_TRADE,
      SUM(SHARES* RATE * 0.002) BROKERAGE
FROM TRADING
GROUP BY COMPANY_ID;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=757298.72..758741.91 rows=5005 width=138) (actual time=16992.290..17011.395 rows=5000 loops=1)
   Group Key: company_id
   ->  Gather Merge  (cost=757298.72..758466.64 rows=10010 width=138) (actual time=16992.270..16996.919 rows=15000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=756298.70..756311.21 rows=5005 width=138) (actual time=16983.900..16984.356 rows=5000 loops=3)
               Sort Key: company_id
               Sort Method: quicksort  Memory: 1521kB
               Worker 0:  Sort Method: quicksort  Memory: 1521kB
               Worker 1:  Sort Method: quicksort  Memory: 1521kB
               ->  Partial HashAggregate  (cost=755916.09..755991.16 rows=5005 width=138) (actual time=16975.997..16981.354 rows=5000 loops=3)
                     Group Key: company_id
                     ->  Parallel Seq Scan on trading  (cost=0.00..287163.65 rows=12500065 width=12) (actual time=0.032..1075.833 rows=10000000 loops=3)
 Planning Time: 0.073 ms
 Execution Time: 17013.116 ms
(15 rows)

We can switch on the JIT parameter at the session level and retry the same query:

postgres=# SET JIT=ON;
SET
postgres=# EXPLAIN ANALYZE SELECT COMPANY_ID,
      SUM(SHARES) TOT_SHARES,
      SUM(SHARES* RATE) TOT_INVEST,
      MIN(SHARES* RATE) MIN_TRADE,
      MAX(SHARES* RATE) MAX_TRADE,
      SUM(SHARES* RATE * 0.002) BROKERAGE
FROM TRADING
GROUP BY COMPANY_ID;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=757298.72..758741.91 rows=5005 width=138) (actual time=15672.809..15690.901 rows=5000 loops=1)
   Group Key: company_id
   ->  Gather Merge  (cost=757298.72..758466.64 rows=10010 width=138) (actual time=15672.781..15678.736 rows=15000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=756298.70..756311.21 rows=5005 width=138) (actual time=15661.144..15661.638 rows=5000 loops=3)
               Sort Key: company_id
               Sort Method: quicksort  Memory: 1521kB
               Worker 0:  Sort Method: quicksort  Memory: 1521kB
               Worker 1:  Sort Method: quicksort  Memory: 1521kB
               ->  Partial HashAggregate  (cost=755916.09..755991.16 rows=5005 width=138) (actual time=15653.390..15658.581 rows=5000 loops=3)
                     Group Key: company_id
                     ->  Parallel Seq Scan on trading  (cost=0.00..287163.65 rows=12500065 width=12) (actual time=0.039..1084.820 rows=10000000 loops=3)
 Planning Time: 0.072 ms
 JIT:
   Functions: 28
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 5.844 ms, Inlining 137.226 ms, Optimization 201.152 ms, Emission 125.022 ms, Total 469.244 ms
 Execution Time: 15696.092 ms
(19 rows)

Here we see a 7.7% improvement in performance. I executed this several times and found that the performance gain is consistently 7-8% for this simple query (which takes 15 seconds to execute). The gains are higher for queries with more calculations/expressions.

Summary

It is fairly simple to install and configure JIT with PostgreSQL, as demonstrated above. One point we would like to highlight is that installing the JIT packages and enabling the JIT feature can be done online, while the database is up and running, because all JIT-related parameters are dynamic in nature. Parameter changes can be loaded with a SIGHUP signal or by a superuser running SELECT pg_reload_conf(). If JIT is not helping our workload, we can turn it off anytime. Nothing stops you from trying it in a non-production environment. We might not see a gain in small, simple queries that take little time to execute, because the overhead of the JIT compilation can exceed the cost of simply executing the SQL statement. But we should expect a good gain in OLAP workloads with complex queries that run for a longer duration.
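
For example, if JIT turns out to hurt a particular workload, it can be bypassed for a single session without touching the global setting:

postgres=# SET jit = off;
SET
postgres=# SHOW jit;
 jit
-----
 off
(1 row)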

Apr 05, 2016

Percona Live featured talk with Anastasia Ailamaki — RAW: Fast queries on JIT databases

Welcome to the next Percona Live featured talk with Percona Live Data Performance Conference 2016 speakers! In this series of blogs, we’ll highlight some of the speakers who will be at this year’s conference, and discuss their technologies and outlooks. Make sure to read to the end to get a special Percona Live registration bonus!

In this Percona Live featured talk, we’ll meet Anastasia Ailamaki, Professor at EPFL and CEO of RAW Labs. Her talk will be RAW: Fast queries on JIT databases. RAW is a query engine that reads data in its raw format and processes queries using adaptive, just-in-time operators. The key insight is its use of virtualization and dynamic generation of operators. I had a chance to speak with Anastasia and learn a bit more about RAW and JIT databases:

Percona: Give me a brief history of yourself: how you got into database development, where you work, what you love about it.

Anastasia: I am a computer engineer and initially trained on networks. I came across databases in the midst of the object-oriented hype, and was totally smitten by both the power of data models and the wealth of problems one had to solve to create a functioning, performant database system. In the following years, I built several systems as a student and (later) as a coder. At some point, however, I needed to learn more about the machine. I decided to do a Masters in computer architecture, which led to a Ph.D. in databases and microarchitecture. I became a professor at CMU, where for eight years I guided students as they built their ideas into real systems to assess those ideas’ potential and value. During my sabbatical at EPFL, I was fascinated by the talent and opportunities in Switzerland, so I decided to stay and, seven years later, co-founded RAW Labs.

Percona: Your talk is going to be on “RAW: Fast queries on JIT databases.” Would you say you’re an advocate of abandoning (or at least not relying on) the traditional “big structured database accessed by queries” model that has existed for most of computing? Why?

Anastasia: The classical usage paradigm for databases has been “create a database, then ask queries.” Traditionally, “creating a database” means creating a structured copy of the entire dataset. This is now passé for the simple reason that data is growing too fast, and loading overhead grows with data size. What’s more, we typically use only a small fraction of the data available, so investing in the mass of owned data is a waste of resources: people have to wait too long from the time they receive a dataset until they can ask a query. And it doesn’t stop there: users are asked to pick a database engine based on the format and intended use of the data. We associate row stores with transactions, NoSQL with JSON, and column stores with analytics, but true insight comes from combining all of the data semantically as opposed to structurally. With each engine optimizing for specific kinds of queries and data formats, analysts subconsciously factor in limitations when piecing together their infrastructure. We only know the best way to structure data when we see the queries, so loading data and developing query processing operators before knowing the queries is premature.

Percona: What are the conditions that make JIT databases in general (and RAW specifically) the optimum solution?

Anastasia: JIT databases push functionality to the last minute, and execute it right when it’s actually needed. Several systems perform JIT compilation of queries, which offers great performance benefits (an example is HyPer, a system recently acquired by Tableau). RAW is JIT on steroids: it leaves data at its source, and only reads it or asks for any system resources when they’re actually required. You may have 10,000 files, and a file will only be read when you ask a query that needs the data in it. When the user asks a query, RAW code-generates the raw source data adaptors and the entire query engine needed to run that query. It stores all useful information about the accessed data, as well as popular operators generated in the past, and uses them to accelerate future queries. It adapts to system resources on the fly and only asks for them when needed. RAW is an interface to raw data and operational databases alike. In addition, the RAW query language is incredibly rich; it is a superset of SQL which allows navigation on hierarchical data and tables at the same time, with support for variable assignments, regular expressions, and more for log processing, while staying in declarative land. Therefore, analysts only need to describe the desired result in SQL, without thinking about the data format.

Percona: What would you say in the next step for JIT and RAW? What keeps you up at night concerning the future of this approach?

Anastasia: The next step for RAW is to reach out to as many people as possible — especially users with complex operational data pipelines — and reduce cost and eliminate pipeline stages, unneeded data copies, and extensive scripting. RAW is a new approach that can work with existing infrastructures in a non-intrusive way. We are well on our way with several proof-of-concept projects that create verticals for RAW, and demonstrate its usefulness for different applications.

Percona: What are you most looking forward to at Percona Live Data Performance Conference 2016?

Anastasia: I am looking forward to meeting as many users and developers as possible, hearing their feedback on RAW and our ideas, and learning from their experiences.

You can read more about RAW and JIT databases at Anastasia’s academic group’s website: dias.epfl.ch.

Want to find out more about Anastasia and RAW? Register for Percona Live Data Performance Conference 2016, and see her talk RAW: Fast queries on JIT databases. Use the code “FeaturedTalk” and receive $100 off the current registration price!

Percona Live Data Performance Conference 2016 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 18-21 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.
