Jun
18
2019
--

Upcoming Webinar 6/19: A Deep Dive Into PostgreSQL Indexing

PostgreSQL Indexing

Please join Percona's Senior Software Engineer Ibrar Ahmed as he presents his talk "A Deep Dive Into PostgreSQL Indexing" on Wednesday, June 19th, 2019, at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

Register Now

Indexes are a basic feature of relational databases, and PostgreSQL offers a rich collection of options to developers and designers. To fully take advantage of these, users need to understand the basic concept of indexes, and to be able to compare the different index types and how they apply to different application scenarios. Only then can you make an informed decision about your database index strategy and design.

One thing is for sure: not all indexes are appropriate for all circumstances, and using a "wrong" index can have the opposite effect to the one you intend, with problems that might only surface once you're in production. Armed with more advanced knowledge, you can avoid this worst-case scenario!

We'll take a look at how to use pg_stat_statements to find opportunities for adding indexes to your database. We'll look at when to add an index, and when adding an index is unlikely to result in a good solution. So should you add an index to every column? Come and discover why this strategy is rarely recommended as we take a deep dive into PostgreSQL indexing.
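
For example, a quick way to spot candidate queries is to look at the statements consuming the most total execution time (a minimal sketch, assuming the pg_stat_statements extension has been created and loaded via shared_preload_libraries; column names are those of PostgreSQL 11):

-- Statements consuming the most total time; their WHERE and JOIN columns
-- are natural candidates for new indexes.
SELECT query, calls, total_time, rows
  FROM pg_stat_statements
 ORDER BY total_time DESC
 LIMIT 10;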

If you can’t make the webinar, you can sign up and we’ll send out the recording and slides.

Photo by Adrian RF on Unsplash.

Jun
14
2019
--

Bloom Indexes in PostgreSQL

Bloom Indexes in PostgreSQL

There is a wide variety of indexes available in PostgreSQL. While most are common to almost all databases, there are some types of indexes that are more specific to PostgreSQL. For example, GIN indexes are helpful to speed up the search for element values within documents. GIN and GiST indexes can both be used for making full-text searches faster, whereas BRIN indexes are more useful when dealing with large tables, as they only store summary information for a range of pages. We will look at these indexes in more detail in future blog posts. For now, I would like to talk about another of the special indexes, one that can speed up searches on a table with a huge number of columns and which is massive in size: the bloom index.

In order to understand the bloom index better, let's first understand the bloom filter data structure. I will try to keep the description as short as I can so that we can move on to how to create this index and when it will be useful.

Most readers will know that an array in computer science is a data structure consisting of a collection of elements, while a bit, or binary digit, is the smallest unit of data, represented as either 0 or 1. A bloom filter is a bit array of m bits that are all initially set to 0.

A bit array is simply an array that stores a certain number of bits (0s and 1s), and a bloom filter built on one is among the most space-efficient data structures for testing whether an element is in a set.

Why use bloom filters?

Let's consider some alternatives, such as a plain list and a hash table. With a list, we need to iterate through every element to search for a specific one. We could instead maintain a hash table in which each element of the list is hashed, and then check whether the hash of the element we are searching for matches a hash in the table. But a hash table can cost more than expected: on a hash collision it falls back to linear probing, which may be time-consuming, and once the hash table is spilled to disk it requires additional IO and storage. For a more space-efficient solution, we can look into bloom filters, which are similar in spirit to hash tables.

Type I and Type II errors

While using bloom filters, we may see a result that falls into a type I error, but never a type II error. A nice example of a type I error is a result saying that a person with the last name "vallarapu" exists in the relation foo.bar whereas it does not exist in reality (a false positive conclusion). An example of a type II error is a result saying that a person with the last name "vallarapu" does not exist in the relation foo.bar, but in reality, it does exist (a false negative conclusion). A bloom filter is 100% accurate when it says the element is not present. But when it says the element is present, it may be 90% accurate or less. This is why it is usually called a probabilistic data structure.

The bloom filter algorithm

Let's now understand the algorithm behind bloom filters better. As discussed earlier, it is a bit array of m bits, and we need k hash functions. In order to tell whether an element exists and to give away the item pointer of the element, the element (the data in the columns) is passed to the hash functions. Let's say that there are only two hash functions used to store the presence of the first element, "avi", in the bit array. When the word "avi" is passed to the first hash function, it may generate the output 4, and the second may give the output 5. So now the bit array could look like the following:

All the bits are initially set to 0. Once we store the existence of the element "avi" in the bloom filter, it sets the 4th and 5th bits to 1. Let's now store the existence of the word "percona". This word is again passed to both hash functions; assume that the first hash function generates the value 5 and the second generates 6. So the bit array now looks like the following – since the 5th bit was already set to 1 earlier, no modification is needed there:

Now, consider that our query is searching on a predicate where the name is "avi". The input "avi" is passed to the hash functions. The first hash function returns the value 4 and the second returns 5, as these are the same hash functions that were used earlier. When we look at positions 4 and 5 of the bloom filter (bit array), we can see that the values are set to 1. This means that the element is present.

Collision with bloom filters

Consider a query that is fetching the records of a table where the name is "don". When the word "don" is passed to both hash functions, the first hash function returns the value 6 (let's say) and the second returns 4. As the bits at positions 6 and 4 are set to 1, membership is confirmed and the result says that a record with the name "don" is present. In reality, it is not. This is an example of a collision. However, it is not a serious problem.

A point to remember: the fewer the hash functions, the greater the chance of collisions, and the more hash functions, the lower the chance of collisions. But if we have k hash functions, the time it takes to validate membership is on the order of k.
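
For reference, the standard bloom filter analysis says that with m bits, k hash functions, and n stored elements, the probability of a false positive is approximately (1 - e^(-kn/m))^k, and the number of hash functions that minimizes it for a given m and n is about k = (m/n) ln 2. These are the kinds of formulas, also found on Wikipedia, that come up later when we discuss tuning the bloom index.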

Bloom Indexes in PostgreSQL

Now that you understand bloom filters, you can see that a bloom index uses them. When you have a table with many columns, and there are queries using many different combinations of those columns as predicates, you could need many indexes. Maintaining so many indexes is not only costly for the database but is also a performance killer when dealing with larger data sets.

So, if you create a bloom index on all these columns, a hash is calculated for each of the columns and merged into a single index entry of the specified length for each row/record. When you specify the list of columns on which you need a bloom filter, you can also choose how many bits are set per column. The following is an example of the syntax, with the length of each index entry and the number of bits for specific columns.

CREATE INDEX bloom_idx_bar ON foo.bar USING bloom (id,dept_id,zipcode)
WITH (length=80, col1=4, col2=2, col3=4);

length is rounded to the nearest multiple of 16. The default is 80 and the maximum is 4096. The default number of bits per column is 2, and a maximum of 4095 bits can be specified.

Bits per each column

Here is what it means in theory when we have specified length = 80 and col1=2, col2=2, col3=4. A bit array of 80 bits is created per row or record. Data inside col1 (column 1) is passed to two hash functions because col1 was set to 2 bits. Let's say these two hash functions generate the values 20 and 40. The bits at the 20th and 40th positions are then set to 1 within the 80 bits (m), since the length is specified as 80 bits. Data in col3 is passed to four hash functions; let's say the values generated are 2, 4, 9, and 10. So four bits – 2, 4, 9, and 10 – are set to 1 within the 80 bits.

There may be many empty bits, but this allows for more randomness across the bit arrays of the individual rows. Using a signature function, a signature is stored in the index data page for each record, along with the row pointer that points to the actual row in the table. Now, when a query uses an equality operator on a column that has been indexed using bloom, the number of hash functions configured for that column is used to generate the corresponding hash values – say four for col3, giving 2, 4, 9, and 10. The index data is then scanned row by row and checked to see whether those bits (the bit positions generated by the hash functions) are set to 1.

Finally, the scan reports that a certain number of rows have all of these bits set to 1. The greater the length and the bits per column, the more the randomness and the fewer the false positives. But the greater the length, the greater the size of the index.

Bloom Extension

The bloom index is shipped as an extension in the contrib module, so you must create the bloom extension in order to take advantage of this index, using the following command:

CREATE EXTENSION bloom;

Example

Let’s start with an example. I am going to create a table with multiple columns and insert 100 million records.

percona=# CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);
CREATE TABLE
percona=# INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,
(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,
(random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999)
from generate_series(1,100*1e6) g;
INSERT 0 100000000

The size of the table is now 9647 MB as you can see below.

percona=# \dt+ foo.bar
                   List of relations
 Schema | Name | Type  |  Owner   |  Size   | Description
--------+------+-------+----------+---------+-------------
 foo    | bar  | table | postgres | 9647 MB |
(1 row)

Let's say that the columns id, dept, id2, id3, id4, id5, id6, and zipcode of the table foo.bar are used by several queries in random combinations for different reporting purposes. If we created an individual index on each column, each index would take almost 2 GB of disk space.
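
As a quick sanity check of that estimate, you could create one single-column btree index and look at its size before deciding on a strategy (a sketch; the index name is arbitrary and the size will vary with the data):

percona=# CREATE INDEX idx_bar_id ON foo.bar (id);
percona=# \di+ foo.idx_bar_id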

Testing with btree indexes

We’ll try creating a single btree index on all the columns that are most used by the queries hitting this table. As you can see in the following log, it took 91115.397 ms to create this index and the size of the index is 4743 MB.

postgres=# CREATE INDEX idx_btree_bar ON foo.bar (id, dept, id2,id3,id4,id5,id6,zipcode);
CREATE INDEX
Time: 91115.397 ms (01:31.115)
postgres=# \di+ foo.idx_btree_bar
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table |  Size   | Description
--------+---------------+-------+----------+-------+---------+-------------
 foo    | idx_btree_bar | index | postgres | bar   | 4743 MB |
(1 row)

Now, let's try some of the queries with a random selection of columns. You can see that the execution times of these queries are 2440.374 ms and 2406.498 ms for query 1 and query 2 respectively. To avoid issues with disk IO, I made sure that the execution plans were captured when the index was cached in memory.

Query 1
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id4 = 295294 and zipcode = 13266;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_bar on bar  (cost=0.57..1607120.58 rows=1 width=69) (actual time=1832.389..2440.334 rows=1 loops=1)
   Index Cond: ((id4 = 295294) AND (zipcode = 13266))
 Planning Time: 0.079 ms
 Execution Time: 2440.374 ms
(4 rows)
Query 2
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_bar on bar  (cost=0.57..1607120.58 rows=1 width=69) (actual time=1806.237..2406.475 rows=1 loops=1)
   Index Cond: ((id5 = 281326) AND (id6 = 894198))
 Planning Time: 0.096 ms
 Execution Time: 2406.498 ms
(4 rows)

Testing with Bloom Indexes

Let’s now create a bloom index on the same columns. As you can see from the following log, there is a huge size difference between the bloom (1342 MB) and the btree index (4743 MB). This is the first win. It took almost the same time to create the btree and the bloom index.

postgres=# CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);
CREATE INDEX
Time: 94833.801 ms (01:34.834)
postgres=# \di+ foo.idx_bloom_bar
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table |  Size   | Description
--------+---------------+-------+----------+-------+---------+-------------
 foo    | idx_bloom_bar | index | postgres | bar   | 1342 MB |
(1 row)

Let’s run the same queries, check the execution time, and observe the difference.

Query 1
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id4 = 295294 and zipcode = 13266;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bar  (cost=1171823.08..1171824.10 rows=1 width=69) (actual time=1265.269..1265.550 rows=1 loops=1)
   Recheck Cond: ((id4 = 295294) AND (zipcode = 13266))
   Rows Removed by Index Recheck: 2984788
   Heap Blocks: exact=59099 lossy=36090
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..1171823.08 rows=1 width=0) (actual time=653.865..653.865 rows=99046 loops=1)
         Index Cond: ((id4 = 295294) AND (zipcode = 13266))
 Planning Time: 0.073 ms
 Execution Time: 1265.576 ms
(8 rows)
Query 2
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bar  (cost=1171823.08..1171824.10 rows=1 width=69) (actual time=950.561..950.799 rows=1 loops=1)
   Recheck Cond: ((id5 = 281326) AND (id6 = 894198))
   Rows Removed by Index Recheck: 2983893
   Heap Blocks: exact=58739 lossy=36084
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..1171823.08 rows=1 width=0) (actual time=401.588..401.588 rows=98631 loops=1)
         Index Cond: ((id5 = 281326) AND (id6 = 894198))
 Planning Time: 0.072 ms
 Execution Time: 950.827 ms
(8 rows)

From the above tests, it is evident that the bloom index performed better. Query 1 took 1265.576 ms with a bloom index and 2440.374 ms with a btree index, and query 2 took 950.827 ms with bloom and 2406.498 ms with btree. However, the same test would favor the btree index if you had created a btree index on those two columns only (instead of on many columns).
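
For illustration, the two-column btree tailored to query 2 would be created like this (a sketch; the index name is arbitrary):

postgres=# CREATE INDEX idx_btree_id5_id6 ON foo.bar (id5, id6);
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;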

Reducing False Positives

If you look at the execution plans generated after creating the bloom index (consider Query 2), 98631 rows are considered to be matching rows by the index. However, the output contains only one row. So the rest of the rows – all 98630 of them – are false positives. The btree index would not return any false positives.

In order to reduce the false positives, you may have to increase the signature length and also the bits per column, using some of the formulas mentioned in this interesting blog post, through experimentation and testing. As you increase the signature length and bits, you might see the bloom index growing in size, but this may reduce the false positives. If the time spent is greater due to the number of false positives returned by the bloom index, you could increase the length. If increasing the length does not make much difference to the performance, then you can leave the length as it is.
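
For example, you could rebuild the bloom index with a longer signature and more bits per column, and then compare the number of lossy heap blocks and rechecked rows reported by EXPLAIN ANALYZE (a sketch; the values below are only a starting point for experimentation, not tuned recommendations):

postgres=# DROP INDEX foo.idx_bloom_bar;
postgres=# CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=128, col1=6, col2=6, col3=6, col4=6, col5=6, col6=6, col7=6, col8=6);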

Points to be carefully noted

  1. In the above tests, we have seen how a bloom index has performed better than a btree index. But, in reality, if we had created a btree index just on top of the two columns being used as predicates, the query would have performed much faster with a btree index than with a bloom index. This index does not replace a btree index unless we wish to replace a chunk of the indexes with a single bloom index.
  2. Just like hash indexes, a bloom index is applicable for equality operators only.
  3. Some formulas on how to calculate the appropriate length of a bloom filter and the bits per column can be read on Wikipedia or in this blog post.

Conclusion

Bloom indexes are very helpful when we have a table that stores huge amounts of data and a lot of columns, where we find it difficult to create a large number of indexes, especially in OLAP environments where data is loaded from several sources and maintained for reporting. You could consider testing a single bloom index to see if you can avoid implementing a huge number of individual or composite indexes that could take additional disk space without much performance gain.

May
24
2019
--

An Overview of Sharding in PostgreSQL and How it Relates to MongoDB’s

A couple of weeks ago I presented at Percona University São Paulo about the new features in PostgreSQL that allow the deployment of simple shards. I've tried to summarize the main points in this post, as well as providing an introductory overview of sharding itself. Please note I haven't included any third-party extensions that provide sharding for PostgreSQL in my discussion below.

Partitioning in PostgreSQL

In a nutshell, until not long ago there wasn’t a dedicated, native feature in PostgreSQL for table partitioning. Not that that prevented people from doing it anyway: the PostgreSQL community is very creative. There’s a table inheritance feature in PostgreSQL that allows the creation of child tables with the same structure as a parent table. That, combined with the employment of proper constraints in each child table along with the right set of triggers in the parent table, has provided practical “table partitioning” in PostgreSQL for years (and still works). Here’s an example:

Using table inheritance

CREATE TABLE temperature (
  id BIGSERIAL PRIMARY KEY NOT NULL,
  city_id INT NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  temp DECIMAL(5,2) NOT NULL
);

Figure 1a. Main (or parent) table

CREATE TABLE temperature_201901 (CHECK (timestamp >= DATE '2019-01-01' AND timestamp <= DATE '2019-01-31')) INHERITS (temperature);
CREATE TABLE temperature_201902 (CHECK (timestamp >= DATE '2019-02-01' AND timestamp <= DATE '2019-02-28')) INHERITS (temperature);
CREATE TABLE temperature_201903 (CHECK (timestamp >= DATE '2019-03-01' AND timestamp <= DATE '2019-03-31')) INHERITS (temperature);

Figure 1b. Child tables inherit the structure of the parent table and are limited by constraints

CREATE OR REPLACE FUNCTION temperature_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.timestamp >= DATE '2019-01-01' AND NEW.timestamp <= DATE '2019-01-31' ) THEN INSERT INTO temperature_201901 VALUES (NEW.*);
    ELSIF ( NEW.timestamp >= DATE '2019-02-01' AND NEW.timestamp <= DATE '2019-02-28' ) THEN INSERT INTO temperature_201902 VALUES (NEW.*);
    ELSIF ( NEW.timestamp >= DATE '2019-03-01' AND NEW.timestamp <= DATE '2019-03-31' ) THEN INSERT INTO temperature_201903 VALUES (NEW.*);
    ELSE RAISE EXCEPTION 'Date out of range!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

Figure 1c. A function that controls in which child table a new entry should be added according to the timestamp field

CREATE TRIGGER insert_temperature_trigger
    BEFORE INSERT ON temperature
    FOR EACH ROW EXECUTE PROCEDURE temperature_insert_trigger();

Figure 1d. A trigger is added to the parent table that calls the function above when an INSERT is performed

The biggest drawbacks of such an implementation were the amount of manual work needed to maintain such an environment (even though a certain level of automation could be achieved through the use of third-party extensions such as pg_partman) and the lack of optimization/support for "distributed" queries. The PostgreSQL optimizer wasn't advanced enough to have a good understanding of partitions at the time, though there were workarounds that could be used, such as employing constraint exclusion.

Declarative partitioning

About a year and a half ago, PostgreSQL 10 was released with a bunch of new features, among them native support for table partitioning through the new declarative partitioning feature. Here's how we could partition the same temperature table using this new method:

CREATE TABLE temperature (
  id BIGSERIAL NOT NULL,
  city_id INT NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  temp DECIMAL(5,2) NOT NULL
) PARTITION BY RANGE (timestamp);

Figure 2a. Main table structure for a partitioned table

CREATE TABLE temperature_201901 PARTITION OF temperature FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE temperature_201902 PARTITION OF temperature FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
CREATE TABLE temperature_201903 PARTITION OF temperature FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

Figure 2b. Tables defined as partitions of the main table; with declarative partitioning, there was no need for triggers anymore.

PostgreSQL 10 still lacked the optimization and flexibility needed to consider declarative partitioning a complete solution. It wasn't possible, for example, to perform an UPDATE that would result in moving a row from one partition to a different one, but the foundation had been laid. Fast forward another year and PostgreSQL 11 builds on top of this, delivering additional features like:

  • the possibility to define a default partition, to which any entry that wouldn't fit a corresponding partition is added.
  • having indexes added to the main table “replicated” to the underlying partitions, which improved declarative partitioning usability.
  • support for Foreign Keys

These are just a few of the features that led to a more mature partitioning solution.
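
For example, with PostgreSQL 11 we can complement the partitions created earlier with a default partition, and create an index on the parent table that is automatically cascaded to every partition (a short sketch based on the temperature table above):

CREATE TABLE temperature_default PARTITION OF temperature DEFAULT;
CREATE INDEX ON temperature (timestamp);  -- created on each existing and future partition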

Sharding in PostgreSQL

By now you might reasonably be questioning my premise, pointing out that partitioning is not sharding, at least not in the sense and context you would have expected this post to cover. In fact, PostgreSQL has implemented sharding on top of partitioning by allowing any given partition of a partitioned table to be hosted by a remote server. The basis for this is PostgreSQL's Foreign Data Wrapper (FDW) support, which has been part of the core of PostgreSQL for a long time. While technically possible, we just couldn't make practical use of it for sharding with the table inheritance + triggers approach. Declarative partitioning allowed for much better integration of these pieces, making sharding – partitioned tables hosted by remote servers – more of a reality in PostgreSQL.

CREATE TABLE temperature_201904 (
  id BIGSERIAL NOT NULL,
  city_id INT NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  temp DECIMAL(5,2) NOT NULL
);

Figure 3a. On the remote server we create a “partition” – nothing but a simple table

CREATE EXTENSION postgres_fdw;
GRANT USAGE ON FOREIGN DATA WRAPPER postgres_fdw to app_user;
CREATE SERVER shard02 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'postgres', host 'shard02', port
    '5432');
CREATE USER MAPPING for app_user SERVER shard02
    OPTIONS (user 'fdw_user', password 'secret');

Figure 3b. On the local server the preparatory steps involve loading the postgres_fdw extension, allowing our local application user to use that extension, creating an entry to access the remote server, and finally mapping that user with a user in the remote server (fdw_user) that has local access to the table we’ll use as a remote partition.

CREATE FOREIGN TABLE temperature_201904 PARTITION OF temperature
    FOR VALUES FROM ('2019-04-01') TO ('2019-05-01')
    SERVER shard02;

Figure 3c. Now it’s simply a matter of creating a proper partition of our main table in the local server that will be linked to the table of the same name in the remote server

You can read more about postgres_fdw in Foreign Data Wrappers in PostgreSQL and a closer look at postgres_fdw.

When does it make sense to partition a table?

There are several principal reasons to partition a table:

  1. When a table grows so big that searching it becomes impractical even with the help of indexes (which will invariably become too big as well).
  2. When data management is such that the target data is often the most recently added and/or older data is constantly being purged/archived, or even not being searched anymore (at least not as often).
  3. If you are loading data from different sources and maintaining it as a data warehouse for reporting and analytics.
  4. For a less expensive archiving or purging of massive data that avoids exclusive locks on the entire table.

When should we resort to sharding?

Here are a couple of classic cases:

  1. To scale out (horizontally), when even after partitioning a table the amount of data is too great or too complex to be processed by a single server.
  2. Use cases where the data in a big table can be divided into two or more segments that would benefit the majority of the search patterns. A common example used to describe a scenario like this is that of a company whose customers are evenly spread across the United States and where searches against a target table involve the customer ZIP code. One shard could then host the entries of customers located on the East coast and another those of customers on the West coast.

Note, though, that this is by no means an exhaustive list.

How should we shard the data?

With sharding (in this context) being “distributed” partitioning, the essence for a successful (performant) sharded environment lies in choosing the right shard key – and by “right” I mean one that will distribute your data across the shards in a way that will benefit most of your queries. In the example above, using the customer ZIP code as shard key makes sense if an application will more often be issuing queries that will hit one shard (East) or the other (West). However, if most queries would filter by, say, birth date, then all queries would need to be run through all shards to recover the full result set. This could easily backfire on performance with the shard approach, by not selecting the right shard key or simply by having such a heterogeneous workload that no shard key would be able to satisfy it.

It only ever makes sense to shard if the nature of the queries involving the target table(s) is such that distributed processing will be the norm and constitute an advantage far greater than any overhead caused by a minority of queries that rely on JOINs involving multiple shards. Due to the distributed nature of sharding, such queries will necessarily perform worse compared to having all the data hosted on the same server.

Why not simply rely on replication or clustering?

Sharding should be considered in those situations where you can't efficiently break down a big table through data normalization or an alternative approach, and maintaining it on a single server is too demanding. The table is then partitioned and the partitions distributed across different servers to spread the load. It doesn't need to be one partition per shard; often a single shard will host a number of partitions.

Note how sharding differs from traditional “share all” database replication and clustering environments: you may use, for instance, a dedicated PostgreSQL server to host a single partition from a single table and nothing else. However, these data scaling technologies may well complement each other: a PostgreSQL database may host a shard with part of a big table as well as replicate smaller tables that are often used for some sort of consultation (read-only), such as a price list, through logical replication.

How does sharding in PostgreSQL relate to sharding in MongoDB®?

MongoDB® tackles the matter of managing big collections straight through sharding: there is no concept of local partitioning of collections in MongoDB. In fact, the whole MongoDB scaling strategy is based on sharding, which takes a central place in the database architecture. As such, the sharding process has been made as transparent to the application as possible: all a DBA has to do is to define the shard key.

Instead of connecting to a reference database server, the application connects to an auxiliary router server named mongos, which processes the queries and requests the necessary information from the respective shard. The routers know which shard contains what because they maintain a copy of the metadata that maps chunks of data to shards, which they get from a config server, another important and independent component of a MongoDB sharded cluster. Together, these components also play a role in maintaining good data distribution across the shards, actively splitting and migrating chunks of data between servers as needed.

In PostgreSQL, the application connects to and queries the main database server. There isn't an intermediary router such as mongos; instead, PostgreSQL's query planner processes the query and creates an execution plan. When data requested from a partitioned table is found on a remote server, PostgreSQL requests that data from it, as shown in the EXPLAIN output below:

…
   Remote SQL: UPDATE public.emp SET sal = $2 WHERE ctid = $1
   ->  Nested Loop  (cost=100.00..300.71 rows=669 width=118)
     	Output: emp.empno, emp.ename, emp.job, emp.mgr, emp.hiredate, (emp.sal * '1.1'::double precision), emp.comm, emp.deptno, emp.ctid, salgrade.ctid
     	Join Filter: ((emp.sal > (salgrade.losal)::double precision) AND (emp.sal < (salgrade.hisal)::double precision))
     	->  Foreign Scan on public.emp  (cost=100.00..128.06 rows=602 width=112)
           	Output: emp.empno, emp.ename, emp.job, emp.mgr, emp.hiredate, emp.sal, emp.comm, emp.deptno, emp.ctid
…

Figure 4: excerpt of an EXPLAIN plan that involves processing a query in a remote server

Note the mention of "Remote SQL" in the plan above. A lot of optimizations have been made in the execution of remote queries in PostgreSQL 10 and 11, which have contributed to maturing and improving the sharding solution. Among them is support for having grouping and aggregation operations executed on the remote server itself ("push down") rather than retrieving all rows and having them processed locally.
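
A simple way to verify whether an aggregate is pushed down is to run the query with EXPLAIN (VERBOSE) and check whether the aggregation appears inside the Remote SQL line rather than in a local Aggregate node (a sketch using the sharded temperature table from the earlier example; whether the push-down happens depends on the query and the server versions involved):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT city_id, avg(temp)
  FROM temperature
 WHERE timestamp >= '2019-04-01' AND timestamp < '2019-05-01'
 GROUP BY city_id;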

What is missing in PostgreSQL implementation?

There is, however, still room for improvement. In terms of remote execution, reports from the community indicate not all queries are performing as they should. For example, in some cases the PostgreSQL planner is not performing a full push-down, resulting in shards transferring more data than required. Parallel scheduling of queries that touch multiple shards is not yet implemented: for now, the execution takes place sequentially, one shard at a time, which takes longer to complete. When it comes to the maintenance of partitioned and sharded environments, changes in the structure of partitions are still complicated and not very practical. For example, when you add a new partition to a partitioned table with an appointed default partition, you may need to detach the default partition first if it contains rows that would now fit in the new partition, manually move those rows to the new partition, and finally re-attach the default partition.
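
To make that concrete, such a maintenance sequence currently looks something like the sketch below (the table and partition names follow the earlier temperature examples, including the default partition sketched above; the bounds are assumptions):

-- 1) Detach the default partition so its rows can be redistributed
ALTER TABLE temperature DETACH PARTITION temperature_default;
-- 2) Create the new partition
CREATE TABLE temperature_201905 PARTITION OF temperature
    FOR VALUES FROM ('2019-05-01') TO ('2019-06-01');
-- 3) Move any matching rows out of the old default partition
INSERT INTO temperature_201905
    SELECT * FROM temperature_default
     WHERE timestamp >= '2019-05-01' AND timestamp < '2019-06-01';
DELETE FROM temperature_default
     WHERE timestamp >= '2019-05-01' AND timestamp < '2019-06-01';
-- 4) Re-attach the default partition
ALTER TABLE temperature ATTACH PARTITION temperature_default DEFAULT;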

But that is all part of a maturing technology. We’re looking forward to PostgreSQL 12 and what it will bring in the partitioning and sharding fronts.


Image based on photos by Leonardo Quatrocchi from Pexels

 

May
13
2019
--

PostgreSQL Minor Versions Released May 9, 2019

PostgreSQL Logo

Usually, the PostgreSQL community releases minor patches on the Thursday of the second week of the second month of each quarter. This can vary depending on the nature of the fixes: for example, when they are critical fixes made in response to security vulnerabilities or bugs that might affect data integrity. The previous minor release happened on February 14, 2019; it corrected the fsync failure issue, along with some other enhancements and fixes.

The latest minor release, published on May 9, 2019, has some security fixes and bug fixes for these PostgreSQL major versions:

PostgreSQL 11 (11.3)
PostgreSQL 10 (10.8)
PostgreSQL 9.6 (9.6.13)
PostgreSQL 9.5 (9.5.17)
PostgreSQL 9.4 (9.4.22)

Let’s take a look at some of the security fixes in this release.

Security Fixes

CVE-2019-10130 (Applicable to PostgreSQL 9.5, 9.6, 10 and 11 versions)

When you ANALYZE a table in PostgreSQL, statistics of all the database objects are stored in pg_statistic. The query planner uses this statistical data, and it may contain some sensitive information, for example, the minimum and maximum values of a column. Some of the planner's selectivity estimators apply user-defined operators to values found in pg_statistic. There was a security fix in the past, CVE-2017-7484, that restricted a leaky user-defined operator from disclosing some of the entries of a data column.

Starting from PostgreSQL 9.5, tables in PostgreSQL have not only the SQL-standard privilege system but also row security policies. To keep it short and simple, you can restrict a user so that they can only access specific rows of a table. We call this RLS (Row-Level Security).

CVE-2019-10130 addresses the case of a user who has SQL permission to read a column but is forbidden from reading some rows due to an RLS policy, and who could discover the restricted rows through a leaky operator. With this patch, leaky operators are only allowed access to the statistical data when there is no relevant RLS policy. We shall see more about RLS in future blog posts.
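
As a minimal sketch of what RLS looks like (the table, policy, and role names here are hypothetical):

CREATE TABLE accounts (id int, owner text, balance numeric);
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
-- app_user can read only the rows it owns
CREATE POLICY owner_only ON accounts
    FOR SELECT TO app_user
    USING (owner = current_user);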

CVE-2019-10129 (Applicable to PostgreSQL 11 only)

This fixes a situation where a user could access arbitrary backend memory with carefully crafted DDL. It is only applicable to PostgreSQL 11; no other major release is affected.

Fixes roundup

There have been a good number of fixes around partitioning in both PostgreSQL 10 and 11, including fixes to the behaviour of UPDATE and INSERT in certain scenarios, a failure during ALTER INDEX .. ATTACH PARTITION, and tuple routing in multi-level partitioned tables that have dropped attributes.

We should also mention a couple of interesting bugs addressed in this new release.

Bug #15631 is about catalog corruption when a temporary table with an identity column is created with ON COMMIT DROP. Due to this issue, PostgreSQL 11.1 would then not allow the creation of temporary tables, and you might have observed this error message prior to this release:

# create temporary table foo ( bar int generated by default as identity ) on commit drop;
CREATE TABLE
# \q
# psql postgres postgres
# create temporary table a (b varchar);
ERROR: could not open relation with OID 16388
LINE 1: create temporary table a (b varchar);

Another bug that got fixed is Bug #15734, where a walsender process could crash when executing SHOW ALL over the replication protocol.

Some more fixes include:

  1. Fix CREATE VIEW to allow zero-column views.
  2. Add missing support for the CREATE TABLE IF NOT EXISTS .. AS EXECUTE .. statement.
  3. Fix incompatibility of GIN-index WAL records that were introduced in 11.2, 10.7, 9.6.12, 9.5.16, and 9.4.21, which affected replica servers running these versions while reading changes to GIN indexes from primary servers of older versions.
  4. Make pg_verify_checksums verify that the data directory it's pointed at is of the right PostgreSQL version.
  5. Fixes for postgres_fdw where an UPDATE could lead to incorrect results or a crash.
  6. Several memory leak fixes as well as fixes to the management of dynamic shared memory.

The postgres community has fixed more than just the bugs and enhancements we've highlighted here. We recommend you go through the release notes, which you can access by clicking the appropriate release at the beginning of this blog post. Let us know if you have any specific questions you'd like us to address.

Applying updates

We always recommend that you keep your PostgreSQL databases updated to the latest minor versions. Be aware that applying a minor release requires a restart of the database server once the new binaries are installed.

Here is the sequence of steps you should follow to upgrade to the latest minor versions after thorough testing:

1. Shut down the PostgreSQL database server
2. Install the updated binaries
3. Restart your PostgreSQL database server

Most of the time, you can choose to update the minor versions in a rolling fashion in a master-slave (replication) setup, because it avoids downtime for both reads and writes simultaneously. For a rolling-style update, you could perform the update on one server after another. However, the method we'd almost always recommend is: shut down, update, and restart all instances at once.

If you are currently running your databases on PostgreSQL 9.3.x or earlier, we recommend that you prepare a plan to upgrade your PostgreSQL databases to a supported version ASAP. We've previously published some blog posts about the various options for upgrading your PostgreSQL databases to a supported major version, and you can read our post archive here. Please subscribe to our blog so that you can see more interesting topics around PostgreSQL.

And don't forget that at Percona Live in Austin, May 28-30 2019, we'll have two days of PostgreSQL content in a dedicated postgres track. Please see all our PostgreSQL talks here.

May
10
2019
--

pgBackRest – A Great Backup Solution and a Wonderful Year of Growth

pgBackRest postgresql backup solution

pgBackRest addresses many of the must-have features that you'll want to look for in a PostgreSQL backup solution. I have been a great fan of the pgBackRest project for quite some time, and it gets better all the time. Historically it was written in Perl, and over the last year the project has been making steady progress converting to native C code. At the time of writing, the latest version is 2.13, and there remains a dependency on a long list of Perl libraries. In case you've never tried pgBackRest, now is a great time to do it. This post should help you to set up a simple backup with a local backup repository.

Installation

The pgBackRest project packages are maintained in the PGDG repository. If you have already added the PGDG repository to your package manager, installation is a breeze.

On RHEL/CentOS/OEL:

$ sudo yum install pgbackrest

On Ubuntu/Debian

$ sudo apt install pgbackrest

This will fetch all the required Perl libraries too.

pgBackRest is now a native executable (version 2):

$ file /usr/bin/pgbackrest
/usr/bin/pgbackrest: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=5e3f6123d02e0013b53f6568f99409378d43ad89, not stripped

Some of the other changes DBAs should keep in mind are:

  1. The thread-max option is no longer valid – use process-max instead.
  2. The archive-max-mb option is no longer valid and has been replaced with the archive-push-queue-max option, which has different semantics.
  3. The default for backup-user (deprecated in favor of the new repo-host-user option) has changed from backrest to pgbackrest.
  4. The configuration file has changed from /etc/pgbackrest.conf to /etc/pgbackrest/pgbackrest.conf.

Building from source

We may want to build pgBackRest from source, depending on our environment and version. Building pgBackRest from source on Debian/Ubuntu is already covered in the official documentation. Below I've provided the steps to follow for the Red Hat family.

Get the tarball of the latest release:

curl -LO https://github.com/pgbackrest/pgbackrest/archive/release/2.13.tar.gz

In order to build from source, we need to build tools and libraries:

sudo yum install gcc make zlib-devel libxml2-devel openssl-devel

Now we can unpack the tarball and move into the unpacked directory:

tar -xvf 2.13.tar.gz
cd pgbackrest-release-2.13

Before proceeding any further, we should make sure that Perl is built with the USE_64_BIT_INT option.

$ perl -V | grep USE_64_BIT_INT
----------output-------------
USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES

pgBackRest depends on a few Perl libraries. If they are not already present in the system, we have to install them:

sudo yum install perl-devel perl-ExtUtils-Embed.noarch perl-JSON-PP.noarch perl-Compress-Raw-Zlib.x86_64 perl-DBD-Pg.x86_64

Now we can build pgBackRest by executing "make" in the src directory:

cd src
make

If the build is successful, we should see the pgbackrest executable in the src directory. We should move it to a standard path location like /usr/bin:

sudo cp ./pgbackrest /usr/bin/

Now we can create the repository location and log directory for pgBackRest, and grant ownership of those directories to the OS user that will execute pgBackRest:

sudo mkdir /var/lib/pgbackrest
sudo chown postgres:postgres /var/lib/pgbackrest
sudo mkdir -p /var/log/pgbackrest
sudo chown postgres:postgres /var/log/pgbackrest

Configuration

The very first step is to create the configuration file /etc/pgbackrest/pgbackrest.conf and define a stanza in it. A stanza defines the backup configuration for a specific PostgreSQL database cluster. Any global configuration sections can be overridden to define stanza-specific settings.

sudo mkdir /etc/pgbackrest
sudo bash -c 'cat << EOF  > /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
[pg0app]
pg1-path=/var/lib/pgsql/11/data
pg1-port=5432
EOF'

Here, we have specified the backup repository location to be /var/lib/pgbackrest.

Once we have defined the configuration file, we can proceed to create the stanza:

$ pgbackrest stanza-create --stanza=pg0app --log-level-console=info

This will produce output like this:

2019-04-23 11:47:50.640 P00   INFO: stanza-create command begin 2.13: --log-level-console=info --pg1-path=/var/lib/pgsql/11/data --pg1-port=5432 --repo1-path=/var/lib/pgbackrest --stanza=pg0app
2019-04-23 11:47:51.166 P00   INFO: stanza-create command end: completed successfully (527ms)

Now we need to set up the database parameters to use pgbackrest. In particular, we should use pgbackrest for archive_command. One of the major concerns with using cp-like utilities in archive_command is that they are lazy writers: they do not ensure that everything is written and that fsync is called. This is a potential hole in many backup configurations, and pgbackrest solves this issue. Here is a sample set of parameter changes:

ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=pg0app archive-push %p';
ALTER SYSTEM SET max_wal_senders = '10';
ALTER SYSTEM SET hot_standby = 'on';

From version 2.12 onwards, the archive-push used in the above setting is entirely coded in C, making it more lightweight and faster to execute.
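
Note that changing wal_level and archive_mode requires a restart of the PostgreSQL server. Once it is back up, a quick SQL-side sanity check is to force a WAL switch and look at the archiver statistics (a small sketch; the pgbackrest check command below performs a more thorough verification):

SHOW archive_command;
SELECT pg_switch_wal();   -- force a WAL segment switch so archive_command is invoked
SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver;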

Basic backup configuration is complete now and we can check the backup setup:

$ pgbackrest check --stanza=pg0app --log-level-console=info

This should produce output like this:

2019-04-23 13:57:33.241 P00   INFO: check command begin 2.13: --log-level-console=info --pg1-path=/var/lib/pgsql/11/data --pg1-port=5432 --repo1-path=/var/lib/pgbackrest --stanza=pg0app
2019-04-23 13:57:35.184 P00   INFO: WAL segment 00000001000000000000003D successfully stored in the archive at '/var/lib/pgbackrest/archive/pg0app/11-1/0000000100000000/00000001000000000000003D-9335cf780f0e1e468de62e0cbf22e7953a84c0c4.gz'
2019-04-23 13:57:35.185 P00   INFO: check command end: completed successfully (1946ms)

Taking a full backup

Now we are all set to take a full backup. We can push a backup without specifying its type like this:

$ pgbackrest backup --stanza=pg0app --log-level-console=info

In my case, I received this output:

2019-04-23 13:58:31.912 P00   INFO: backup command begin 2.13: --log-level-console=info --pg1-path=/var/lib/pgsql/11/data --pg1-port=5432 --repo1-path=/var/lib/pgbackrest --repo1-retention-full=2 --stanza=pg0app
WARN: no prior backup exists, incr backup has been changed to full
2019-04-23 13:58:32.775 P00   INFO: execute non-exclusive pg_start_backup() with label "pgBackRest backup started at 2019-04-23 13:58:32": backup begins after the next regular checkpoint completes
2019-04-23 13:58:32.981 P00   INFO: backup start archive = 00000001000000000000003F, lsn = 0/3F000028
2019-04-23 13:58:36.712 P01   INFO: backup file /var/lib/pgsql/11/data/base/18126/26241 (128MB, 57%) checksum 40a6c0cc69e81b9aaa93977bc5e3f809ae336b79
2019-04-23 13:58:37.958 P01   INFO: backup file /var/lib/pgsql/11/data/base/18126/26259 (21.4MB, 66%) checksum c34d484c34431dcc757e48c709c9ae68e694fb13
2019-04-23 13:58:38.223 P01   INFO: backup file /var/lib/pgsql/11/data/base/18126/26272 (15.7MB, 73%) checksum fe3155a0c6b7f6c94e846c47557f5e1eca6a89ef

There are a couple of points to note here.

  1. Since we are taking a backup for the first time, it detects that there is no prior backup and switches to full database backup.
  2. The output says pg_start_backup() is issued in non-exclusive mode. pgBackRest handles the creation of the backup label file.

At the end of the backup,  I received these details, and you should look for something similar:

2019-04-23 13:58:56.024 P01   INFO: backup file /var/lib/pgsql/11/data/base/1/13723 (0B, 100%)
2019-04-23 13:58:56.031 P01   INFO: backup file /var/lib/pgsql/11/data/base/1/13718 (0B, 100%)
2019-04-23 13:58:56.076 P00   INFO: full backup size = 224.5MB
2019-04-23 13:58:56.076 P00   INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments to archive
2019-04-23 13:58:56.200 P00   INFO: backup stop archive = 00000001000000000000003F, lsn = 0/3F000130
2019-04-23 13:58:56.675 P00   INFO: new backup label = 20190423-135832F
2019-04-23 13:58:56.727 P00   INFO: backup command end: completed successfully (24817ms)
2019-04-23 13:58:56.728 P00   INFO: expire command begin
2019-04-23 13:58:56.737 P00   INFO: full backup total < 2 - using oldest full backup for 11-1 archive retention
2019-04-23 13:58:56.739 P00   INFO: expire command end: completed successfully (11ms)

Check the backup-related information:

$ pgbackrest info
stanza: pg0app
    status: ok
    cipher: none
    db (current)
        wal archive min/max (11-1): 000000010000000000000043/000000010000000000000043
        full backup: 20190423-143526F
            timestamp start/stop: 2019-04-23 14:35:26 / 2019-04-23 14:35:39
            wal start/stop: 000000010000000000000043 / 000000010000000000000043
            database size: 224.6MB, backup size: 224.6MB
            repository size: 14.1MB, repository backup size: 14.1MB

As we can see, pgBackRest has compressed the backup to 14.1MB in size. This represents a great reduction compared to the size of the original database.

If a good full backup exists, then by default pgBackRest will attempt to perform an incremental backup.

Restoring backup

Restoring the backup is as simple as a one-liner. pgBackRest will find the correct full, differential, and incremental backups and perform the restore for you.

$ pgbackrest restore --stanza=pg0app --log-level-console=info

The restore will display details on screen. Towards the end of the restore you should see a few lines as follows:

2019-04-23 15:28:29.113 P01   INFO: restore file /var/lib/pgsql/11/data/base/1/13723 (0B, 100%)
2019-04-23 15:28:29.120 P01   INFO: restore file /var/lib/pgsql/11/data/base/1/13718 (0B, 100%)
2019-04-23 15:28:29.121 P00   INFO: write /var/lib/pgsql/11/data/recovery.conf
2019-04-23 15:28:29.123 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2019-04-23 15:28:29.127 P00   INFO: restore command end: completed successfully (22290ms)

As the log output says, pgBackRest creates a recovery.conf file automatically, inserting a restore_command that restores archived WAL segments:

restore_command = 'pgbackrest --log-level-console=info --stanza=pg0app archive-get %f "%p"'

As a DBA user, you can just start up PostgreSQL as usual. In my case I just needed this command:

$ sudo systemctl start postgresql-11

PostgreSQL executes the above-mentioned restore_command as it starts, and this fetches all the archived WAL segments needed to complete the recovery before the server finishes starting up.

Summary

pgBackRest is continuously evolving and, as it matures, it's emerging as one of the best backup solutions for PostgreSQL. The new features, especially in version 2, are impressive. It takes away the mundane tasks of backup configuration and retention policies, which is extremely helpful.

Don't forget! At Percona Live in Austin, May 28-30 2019, we'll have two days of PostgreSQL content in a dedicated postgres track. Please see all our PostgreSQL talks here.


Photo of elephant by Magda Ehlers from Pexels

May
09
2019
--

Improving OLAP Workload Performance for PostgreSQL with ClickHouse Database

Improving OLAP Workload Performance for PostgreSQL

No database management system is optimized for every workload. Database systems are designed for specific kinds of load and thereby give better performance for those workloads. Similarly, some kinds of queries work better on some database systems and worse on others. This is the era of specialization, where a product is designed for a specific requirement or a specific set of requirements. We cannot achieve everything performance-wise from a single database system. PostgreSQL is one of the finest object-relational databases and performs really well for OLTP workloads, but I have observed that its performance is not as good as some other database systems for OLAP workloads. ClickHouse is one example of a system that outperforms PostgreSQL for OLAP.

ClickHouse is an open source, column-based database management system which claims to be 100–1,000x faster than traditional approaches, capable of processing more than a billion rows in less than a second.

Foreign Data Wrapper (Clickhousedb_fdw)

If we have a kind of workload that PostgreSQL is not optimized for, what is the solution? Fortunately, PostgreSQL provides a mechanism to handle such a workload while still using it: you create a data proxy within PostgreSQL that internally calls out to and fetches data from a different database system. This is called a Foreign Data Wrapper (FDW), and it is based on SQL/MED. Percona provides a foreign data wrapper for the ClickHouse database system, which is available at Percona's GitHub project repository.
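
Setting it up follows the standard FDW pattern: install the extension, define a foreign server, map a user, and declare foreign tables. The sketch below only illustrates that pattern – the exact server and table option names accepted by clickhousedb_fdw (host, dbname, table_name, and so on) are assumptions here, so check the project README for the definitive list:

CREATE EXTENSION clickhousedb_fdw;

CREATE SERVER clickhouse_svr FOREIGN DATA WRAPPER clickhousedb_fdw
    OPTIONS (host '127.0.0.1', dbname 'test_database');      -- option names assumed

CREATE USER MAPPING FOR CURRENT_USER SERVER clickhouse_svr;

-- column list abridged; the real ontime table has 109 columns
CREATE FOREIGN TABLE fontime ("Year" int, "DepDelay" int)
    SERVER clickhouse_svr OPTIONS (table_name 'ontime');     -- option name assumed

SELECT count(*) FROM fontime WHERE "DepDelay" > 10;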

Benchmark Machine

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: Samsung  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark Workload

Ontime (On Time Reporting Carrier On-Time Performance) is an openly available dataset that I used for this benchmark. The table is 85GB in size with 109 columns of different types, and the queries designed against it closely emulate real-world reporting workloads.

https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh

Queries

Queries containing aggregates and GROUP BY:
Q1 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q3 SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4 SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
Q5 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier ) a
INNER JOIN (  SELECT   Carrier,count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier)b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q6 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier,  count(*) AS c FROM ontime  WHERE DepDelay>10 AND Year >= 2000 AND
Year <= 2008 GROUP BY Carrier) a INNER JOIN ( SELECT  Carrier, count(*) AS c2 FROM ontime  WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q7 SELECT Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
Q8 SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
Q9 select Year, count(*) as c1 from ontime group by Year;
Q10 SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
Q11 select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month) a;
Q12 SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
Q13 SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
Queries containing joins:
Q14 SELECT a.Year, c1/c2 FROM ( select Year, count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) a
INNER JOIN (select Year, count(*) as c2 from ontime GROUP BY Year ) b on a.Year=b.Year ORDER BY a.Year;
Q15 SELECT a."Year", c1/c2 FROM ( select "Year", count(*)*1000 as c1 FROM fontime WHERE "DepDelay">10 GROUP BY "Year") a INNER JOIN (select "Year", count(*) as c2
FROM fontime GROUP BY "Year" ) b on a."Year"=b."Year";

Table-1: Queries used in the benchmark

 

Benchmark Result

Improving OLAP Workload Performance for PostgreSQL

 

This graph shows the time taken by each query in PostgreSQL, ClickHouse, and clickhousedb_fdw. You can clearly see that ClickHouse performs really well on its own while PostgreSQL doesn't. The benchmark was done on a single node with multiple CPU cores and could scale across multiple nodes, but even here the difference is clear: ClickHouse is close to 1000x faster on most of the queries. If you would like to learn more, in a previous blog post I went over some of the reasons why certain queries do not perform well through clickhousedb_fdw.

Conclusion

The graph above proves my point that PostgreSQL is not fit for every type of workload and query. However, the beauty of PostgreSQL is its ability to use its foreign data wrappers to create a heterogeneous environment with other database systems, which can then optimally handle specific workloads.

The clickhousedb_fdw clearly gives users the power to use ClickHouse for much-improved performance on OLAP workloads whilst still interacting only with PostgreSQL. This simplifies client applications, which don't need to interact with multiple database systems directly.

May
01
2019
--

Benchmark ClickHouse Database and clickhousedb_fdw

postgres clickhouse fdw

In this research, I wanted to see what kind of performance improvements could be gained by using a ClickHouse data source rather than PostgreSQL. Assuming that I would see performance advantages with ClickHouse, would those advantages be retained if I accessed ClickHouse from within PostgreSQL using a foreign data wrapper (FDW)? The FDW in question is clickhousedb_fdw – an open source project from Percona!

The database environments under scrutiny are PostgreSQL v11, clickhousedb_fdw and a ClickHouse database. Ultimately, from within PostgreSQL v11, we are going to issue various SQL queries routed through our clickhousedb_fdw to the ClickHouse database. Then we’ll see how the FDW performance compares with those same queries executed in native PostgreSQL and native ClickHouse.

ClickHouse Database

ClickHouse is an open source, column-oriented database management system that can be 100 to 1000 times faster than traditional row-oriented approaches, and is capable of processing more than a billion rows in less than a second.

clickhousedb_fdw

clickhousedb_fdw, a ClickHouse database foreign data wrapper, or FDW, is an open source project from Percona. Here’s a link for the GitHub project repository:

https://github.com/Percona-Lab/clickhousedb_fdw

I wrote a blog in March which tells you more about our FDW: https://www.percona.com/blog/2019/03/29/postgresql-access-clickhouse-one-of-the-fastest-column-dbmss-with-clickhousedb_fdw/

As you’ll see, this FDW allows you to SELECT from, and INSERT INTO, a ClickHouse database from within a PostgreSQL v11 server.

The FDW supports advanced features such as aggregate pushdown and join pushdown. These significantly improve performance by using the remote server’s resources for these resource-intensive operations.
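
For readers who want to try this themselves, the general shape of the setup looks like the sketch below. The CREATE EXTENSION / CREATE SERVER / CREATE USER MAPPING / CREATE FOREIGN TABLE statements are standard PostgreSQL FDW machinery, but the specific OPTIONS keys and the column list are illustrative assumptions on my part – check the clickhousedb_fdw README for the exact options supported by your build.

-- Minimal sketch: wiring PostgreSQL to ClickHouse through clickhousedb_fdw.
-- The OPTIONS keys and the column list below are illustrative assumptions.
CREATE EXTENSION clickhousedb_fdw;

CREATE SERVER clickhouse_svr
    FOREIGN DATA WRAPPER clickhousedb_fdw
    OPTIONS (host '127.0.0.1', dbname 'default');      -- assumed option names

CREATE USER MAPPING FOR CURRENT_USER SERVER clickhouse_svr;

-- The benchmark maps a foreign table "fontime" onto ClickHouse's "ontime" table;
-- only a handful of its 109 columns are shown here.
CREATE FOREIGN TABLE fontime (
    "Year"     integer,
    "Carrier"  text,
    "DepDelay" integer
) SERVER clickhouse_svr
  OPTIONS (table_name 'ontime');                       -- assumed option name

-- With aggregate pushdown, a query like this can run entirely inside ClickHouse:
SELECT "Year", count(*) FROM fontime GROUP BY "Year";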

Benchmark environment

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: Samsung  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark tests

Rather than using some machine generated dataset for this benchmark, we used the “On Time Reporting Carrier On-Time Performance” data from 1987 to 2018. You can access the data using our shell script available here:

https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh

The database is 85GB in size, loaded into a single table of 109 columns.

Benchmark Queries

Here are the queries I used to benchmark the ClickHouse, clickhousedb_fdw, and PostgreSQL.

Q# Query Contains Aggregates and Group By
Q1 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2 SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q3 SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4 SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
Q5 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier ) a
INNER JOIN (  SELECT   Carrier,count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier)b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q6 SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier,  count(*) AS c FROM ontime  WHERE DepDelay>10 AND Year >= 2000 AND
Year <= 2008 GROUP BY Carrier) a INNER JOIN ( SELECT  Carrier, count(*) AS c2 FROM ontime  WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q7 SELECT Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
Q8 SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
Q9 select Year, count(*) as c1 from ontime group by Year;
Q10 SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
Q11 select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month) a;
Q12 SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
Q13 SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
Query Contains Joins
Q14 SELECT a.Year, c1/c2 FROM ( select Year, count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) a
INNER JOIN (select Year, count(*) as c2 from ontime GROUP BY Year ) b on a.Year=b.Year ORDER BY a.Year;
Q15 SELECT a."Year", c1/c2 FROM ( select "Year", count(*)*1000 as c1 FROM fontime WHERE "DepDelay">10 GROUP BY "Year") a INNER JOIN (select "Year", count(*) as c2
FROM fontime GROUP BY "Year" ) b on a."Year"=b."Year";

Table-1: Queries used in benchmark

Query executions

Here are the results for each of the queries when run in different database setups: PostgreSQL with and without indexes, native ClickHouse, and clickhousedb_fdw. Times are shown in milliseconds.

Q# PostgreSQL PostgreSQL (Indexed) ClickHouse clickhousedb_fdw
Q1 27920 19634 23 57
Q2 35124 17301 50 80
Q3 34046 15618 67 115
Q4 31632 7667 25 37
Q5 47220 8976 27 60
Q6 58233 24368 55 153
Q7 30566 13256 52 91
Q8 38309 60511 112 179
Q9 20674 37979 31 81
Q10 34990 20102 56 148
Q11 30489 51658 37 155
Q12 39357 33742 186 1333
Q13 29912 30709 101 384
Q14 54126 39913 124 1364212
Q15 97258 30211 245 259

Table-2: Time taken to execute the queries used in the benchmark

Reviewing the results

The graph shows the query execution time: the X-axis shows the query number from the tables above, while the Y-axis shows the execution time in milliseconds. The results for ClickHouse and for the data accessed from PostgreSQL using clickhousedb_fdw are shown. From the table, you can see there is a huge difference between PostgreSQL and ClickHouse, but only a minimal difference between ClickHouse and clickhousedb_fdw.

 

ClickHouse vs clickhousedb_fdw (shows the overhead of clickhousedb_fdw)

This graph shows the difference between ClickHouse and clickhousedb_fdw. For most of the queries, the FDW overhead is small and barely significant, apart from Q12. This query combines GROUP BY and ORDER BY clauses, and because of the ORDER BY clause the GROUP BY and ORDER BY are not pushed down to ClickHouse.

In Table-2 we can see the spike in execution time for queries Q12 and Q13. To reiterate, this is caused by the ORDER BY clause. To confirm this, I ran queries Q14 and Q15 with and without the ORDER BY clause. Without the ORDER BY clause the completion time is 259ms, and with the ORDER BY clause it is 1364212ms. To debug this, I ran EXPLAIN on both queries; the plans below show why: without the ORDER BY, the aggregation is pushed down to ClickHouse (the Remote SQL contains the GROUP BY), whereas with the ORDER BY only plain column scans are pushed down and PostgreSQL performs the GroupAggregate locally.

bm=# EXPLAIN VERBOSE SELECT a."Year", c1/c2
     FROM (SELECT "Year", count(*)*1000 AS c1 FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
     INNER JOIN(SELECT "Year", count(*) AS c2 FROM fontime GROUP BY "Year") b ON a."Year"=b."Year";

 

                                                    QUERY PLAN
 Hash Join  (cost=2250.00..128516.06 rows=50000000 width=12)
   Output: fontime."Year", (((count(*) * 1000)) / b.c2)
   Inner Unique: true
   Hash Cond: (fontime."Year" = b."Year")
   ->  Foreign Scan  (cost=1.00..-1.00 rows=100000 width=12)
         Output: fontime."Year", ((count(*) * 1000))
         Relations: Aggregate on (fontime)
         Remote SQL: SELECT "Year", (count(*) * 1000) FROM "default".ontime WHERE (("DepDelay" > 10)) GROUP BY "Year"
   ->  Hash  (cost=999.00..999.00 rows=100000 width=12)
         Output: b.c2, b."Year"
         ->  Subquery Scan on b  (cost=1.00..999.00 rows=100000 width=12)
               Output: b.c2, b."Year"
               ->  Foreign Scan  (cost=1.00..-1.00 rows=100000 width=12)
                     Output: fontime_1."Year", (count(*))
                     Relations: Aggregate on (fontime)
                     Remote SQL: SELECT "Year", count(*) FROM "default".ontime GROUP BY "Year"
(16 rows)

 

bm=# EXPLAIN VERBOSE SELECT a."Year", c1/c2 FROM(SELECT "Year", count(*)*1000 AS c1 FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
     INNER JOIN(SELECT "Year", count(*) as c2 FROM fontime GROUP BY "Year") b  ON a."Year"= b."Year"
     ORDER BY a."Year";

 

                                                          QUERY PLAN
 Merge Join  (cost=2.00..628498.02 rows=50000000 width=12)
   Output: fontime."Year", (((count(*) * 1000)) / (count(*)))
   Inner Unique: true
   Merge Cond: (fontime."Year" = fontime_1."Year")
   ->  GroupAggregate  (cost=1.00..499.01 rows=1 width=12)
         Output: fontime."Year", (count(*) * 1000)
         Group Key: fontime."Year"
         ->  Foreign Scan on public.fontime  (cost=1.00..-1.00 rows=100000 width=4)
               Remote SQL: SELECT "Year" FROM "default".ontime WHERE (("DepDelay" > 10)) ORDER BY "Year" ASC
   ->  GroupAggregate  (cost=1.00..499.01 rows=1 width=12)
         Output: fontime_1."Year", count(*)
         Group Key: fontime_1."Year"
         ->  Foreign Scan on public.fontime fontime_1  (cost=1.00..-1.00 rows=100000 width=4)
               Remote SQL: SELECT "Year" FROM "default".ontime ORDER BY "Year" ASC
(16 rows)
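
One possible workaround – a sketch only, which I did not benchmark here – is to compute the join without the ORDER BY, so that the aggregates are still pushed down to ClickHouse, and then sort the handful of resulting rows locally in PostgreSQL. The query below relies on PostgreSQL 11 behaviour, where a CTE acts as an optimization fence:

-- Sketch of a possible workaround (not benchmarked here). Assumes PostgreSQL 11,
-- where a CTE is an optimization fence, so the inner query keeps its pushdown.
WITH per_year AS (
    SELECT a."Year", c1 / c2 AS delayed_per_thousand
    FROM (SELECT "Year", count(*) * 1000 AS c1
          FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
    INNER JOIN (SELECT "Year", count(*) AS c2
                FROM fontime GROUP BY "Year") b
            ON a."Year" = b."Year"
)
SELECT * FROM per_year ORDER BY "Year";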

Conclusion

The results from these experiments show that ClickHouse offers really good performance, and clickhousedb_fdw makes that performance available from within PostgreSQL. While there is some overhead when using clickhousedb_fdw, it is negligible and comparable to the performance achieved when running natively in the ClickHouse database. This also confirms that the push-down feature of PostgreSQL foreign data wrappers delivers excellent results.

Apr
26
2019
--

Watch Webinar: Upgrading / Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions

Upgrading / Migrating your legacy PostgreSQL to newer PostgreSQL versions

Please join Percona’s PostgreSQL Support Technical Lead, Avinash Vallarapu and Senior Support Engineers, Fernando Laudares, Jobin Augustine and Nickolay Ihalainen as they demonstrate the methods to upgrade a legacy version of PostgreSQL to a newer version using built-in as well as open source tools.

Watch the Recorded Webinar

To start, this webinar opens with a list of solutions that are built into PostgreSQL to help us upgrade a legacy version of PostgreSQL with minimal downtime. Next, the advantages of choosing such methods are discussed. You will then see a list of prerequisites for each solution, reducing the scope for possible mistakes. It’s important to minimize downtime when upgrading from an older version of PostgreSQL. Therefore, we present three open source solutions that help to either minimize or completely avoid downtime.

Additionally, we will be publishing a series of 5 blog posts that will help us understand the solutions available to perform a PostgreSQL upgrade. Our presentation will show the full process of upgrading a set of PostgreSQL servers to the latest available version. Furthermore, we’ll show the pros and cons of each of the methods we employed.

Topics covered in this webinar include the following:

1. PostgreSQL upgrade using pg_dump/pg_restore (with downtime)
2. PostgreSQL upgrade using pg_dumpall (with downtime)
3. Continuous replication from a legacy PostgreSQL version to a newer version using Slony.
4. Replication between major PostgreSQL versions using Logical Replication
5. Fast upgrade of legacy PostgreSQL with minimum downtime.

We will walk you through and demonstrate methods that you may find useful in your database environment. We will witness how simple and quick it is to perform the upgrade using these methods.

Apr
23
2019
--

Adding PostgreSQL Queries Overview Dashboards to the PMM Plugin

Following on the heels of our PostgreSQL Tuples Statistics Dashboard, here’s another blog post describing how you can gain additional visibility of PostgreSQL queries using PMM. We take a look at using an extension called pg_stat_statements. This allows us to collect information about the various queries running in your PostgreSQL instance. We’ll describe how to check if you already have pg_stat_statements running, and if not how to enable the extension in PostgreSQL. Finally, we’ll see how to enable collection using a custom query file and pmm-admin option flag.

We have taken much of our inspiration for these new PG dashboards from Gregory Stark’s presentation at PGConf.eu in 2018, where he demonstrated some excellent work using Grafana and Prometheus to build a PostgreSQL monitoring solution.

PMM (Percona Monitoring and Management) is a free and open-source platform for managing and monitoring MySQL®, MongoDB®, and PostgreSQL performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL, MongoDB, and PostgreSQL servers to ensure that your data works as efficiently as possible.

Setting up query monitoring in PMM for postgres

PostgreSQL extensions, in simplified terms, are lower-level APIs that exist within PostgreSQL and allow you to change or extend its functionality. The extension pg_stat_statements is known as a contrib extension, found in the contrib directory of a PostgreSQL distribution.

So let’s check if this extension is in your database installation.

postgres=# SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements';
        name        | default_version | installed_version |                          comment
--------------------+-----------------+-------------------+-----------------------------------------------------------
 pg_stat_statements | 1.6             |                   | track execution statistics of all SQL statements executed
(1 row)

If no installed version is shown, enable the extension.
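
One note before the CREATE EXTENSION step shown below: pg_stat_statements only starts recording statistics once the module is also listed in shared_preload_libraries and the server has been restarted. A minimal sketch of one way to configure that from SQL (as superuser) follows – be aware that ALTER SYSTEM replaces any existing value, so append to the list instead if you already preload other modules.

-- Takes effect only after a PostgreSQL restart.
-- Caution: this replaces any existing shared_preload_libraries setting.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';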

postgres=# CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION
postgres=# SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements';
        name        | default_version | installed_version |                          comment
--------------------+-----------------+-------------------+-----------------------------------------------------------
 pg_stat_statements | 1.6             | 1.6               | track execution statistics of all SQL statements executed
(1 row)
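
Before wiring the extension into PMM, you can sanity-check that it is collecting data by querying the view directly. Here is a minimal example, using column names as they appear in pg_stat_statements 1.6 on PostgreSQL 11 (times are in milliseconds):

-- Top five statements by total execution time.
SELECT queryid,
       calls,
       round(total_time::numeric, 2) AS total_ms,
       round(mean_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 60)               AS query_snippet
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;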

The next step is to teach our exporter to collect information from the extension. To do this, we need to create a file with a custom query, or you can use the ready-made file in our exporter repository.

Now let’s run the exporter with this file, /home/ec2-user/queries.yaml. Use the extend.query-path flag as in the example below.

[root@ip-10-178-1-82 ec2-user]# pmm-admin rm postgresql:metrics PS_NODE-1
OK, removed PostgreSQL metrics PS_NODE-1 from monitoring.
[root@ip-10-178-1-82 ec2-user]# pmm-admin add postgresql:metrics --user postgres PS_NODE-1 -- --extend.query-path /home/ec2-user/queries.yaml
OK, now monitoring PostgreSQL metrics using DSN postgresql:***@/postgres

Now you can upload the two dashboards in order to check queries through the PMM interface.

Here are some screenshots from our test installation.

This is the first of the dashboards, “PostgreSQL Queries Overview”, which shows information about all queries and databases.


Notice that the query id is clickable and leads to the second dashboard, “PostgreSQL Query Drill-Down”, where you can examine charts related to the selected query.

Both dashboards are available in GrafanaLab and have been adapted for PMM.

Apr
17
2019
--

PostgreSQL CommitFest 2019-03 Ends

committed

PostgreSQL has a unique way of reviewing code submitted by developers. Most open source projects use GitHub pull requests to accept contributions. PostgreSQL has a GitHub page but doesn’t manage pull requests through GitHub. Many years ago, PostgreSQL introduced the CommitFest to manage its patches: the project opens a three-to-four-month “festival” during which patches are accepted and tracked. The CommitFest keeps track of each patch’s status; in practice, it is mainly an admin page used to manage patches. Volunteers and community committers review the submitted code, after which committers are able to commit those patches that have been approved. There have been 22 CommitFest events since 2014.

The process

When a patch is submitted, a volunteer can choose to review it. A review can be done by more than one person. At any point, a patch will be in one of the following statuses:

  • Committed
  • Moved to next CF (CommitFest)
  • Needs Review
  • Ready for Committer
  • Rejected
  • Returned with Feedback
  • Waiting on Author
  • Withdrawn

Needs Review indicates that the patch is waiting for a reviewer to analyze it. Once a reviewer has confirmed that the patch in question is good to be committed, a community committer picks it up and reviews it, committing it if all is well; otherwise, they reject it or return it with feedback.

Patch categories

Patches are categorized in various ways including:

  • Bug Fixes
  • Clients
  • Code Comments
  • Documentation
  • Miscellaneous
  • Monitoring & Control
  • Performance
  • Procedural Languages
  • Refactoring
  • Replication & Recovery
  • SQL Commands
  • Server Features
  • System Administration

A complete history of each patch, along with the email discussion, is maintained with each revision, which makes it very simple to track the whole process from submission of a patch to its eventual conclusion.

Here are some graphs that show various stats for the past CommitFests.

Total new patches in PostgreSQL commitfest by date

 

Total patches in PostgreSQL CommitFests, including those that are not new

 

Total patches reviewed during Commitfest versus those committed to PostgreSQL

Now it’s time to end the 22nd CommitFest, the first such festival for 2019. This event saw 207 patches, almost half of which (100 patches) were committed to the master branch, which is a good ratio. This master branch becomes the development branch for the next major PostgreSQL release.

Summary of outcomes from CommitFest 22:

  • Patches committed: 100
  • Moved to next CommitFest: 84
  • Withdrawn: 5
  • Rejected: 3
  • Returned with Feedback: 15

As a reviewer, I worked on some of these patches this time around, and I’ll devote more time to reviews in the next CommitFest. If you like the idea, then I’d encourage you to come and participate in the review process – we can always use more people. Even if you are not a developer, you can help by testing a patch and verifying the fix in an environment that the developer may not have access to.

You can read more about how a CommitFest is run at this PostgreSQL wiki page.
