Partitioning is the concept of splitting large tables logically into smaller pieces for better performance of the database.
Methods of built-in PostgreSQL partition techniques
- Range partitioning
- List partitioning
- Hash partitioning
When to use partitioning
- Bulk operations like data loads and deletes can be performed using the partition feature of ATTACH and DETACH effectively.
- The exact point at which a table will benefit from partitioning depends on the application. However, a rule of thumb is that the size of the table should exceed the physical memory of the database server.
- As data is growing, sub-partitions can be created, which enhances the performance, and also old partitions can be deleted either by making them standalone or dropping them entirely.
Benefits of partitioning
- Query performance for DDL and DML operations can be improved in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions which is explained below.
- When queries or updates access a large percentage of a single partition, performance can be improved by using a sequential scan of that partition instead of using an index, which would require random-access reads scattered across the whole table.
- Dropping the partition table or truncating the partition table can be done using DROP TABLE and TRUNCATE TABLE, respectively, reducing the load through DELETE operations.
Range partitioning
Database partition that is based on a specific range of columns with data like dates and Numeric values.
Here, as an example, I created a table with range partitioning and partition tables for each quarter on a Date column.
CREATE TABLE employees (id INT NOt NULL , fname VARCHAR(20) , lname VARCHAR (20) , dob DATE NOT NULL , joined DATE NOT NULL) PARTITION BY RANGE (joined);
CREATE TABLE employees_q1 PARTITION of employees for VALUES FROM ('2022-01-01') to ('2022-04-01'); CREATE TABLE employees_q2 PARTITION of employees for VALUES FROM ('2022-04-01') to ('2022-07-01'); CREATE TABLE employees_q3 PARTITION of employees for VALUES FROM ('2022-07-01') to ('2022-10-01'); CREATE TABLE employees_q4 PARTITION of employees for VALUES FROM ('2022-10-01') to ('2023-01-01');
Range partitions are seen below in the table structure.
d+ employees Partitioned table "public.employees" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------+-----------------------+-----------+----------+---------+----------+-------------+--------------+------------- id | integer | | not null | | plain | | | fname | character varying(20) | | | | extended | | | lname | character varying(20) | | | | extended | | | dob | date | | not null | | plain | | | joined | date | | not null | | plain | | | Partition key: RANGE (joined) Partitions: employees_q1 FOR VALUES FROM ('2022-01-01') TO ('2022-04-01'), employees_q2 FOR VALUES FROM ('2022-04-01') TO ('2022-07-01'), employees_q3 FOR VALUES FROM ('2022-07-01') TO ('2022-10-01'), employees_q4 FOR VALUES FROM ('2022-10-01') TO ('2023-01-01')
Inserted some random data for entries with 365 days a year.
INSERT INTO employees (id ,fname,lname,dob ,joined) VALUES ( generate_series(1, 365) ,(array['Oswald', 'Henry', 'Bob', 'Vennie'])[floor(random() * 4 + 1)], (array['Leo', 'Jack', 'Den', 'Daisy' ,'Woody'])[floor(random() * 5 + 1)], '1995-01-01'::date + trunc(random() * 366 * 3)::int, generate_series('1/1/2022'::date, '12/31/2022'::date, '1 day'));
Range partitioned data is seen as below distributed among its partitions.
SELECT employees_q1 , employees_q2 , employees_q3 , employees_q4 , employees_totalcnt from ( SELECT COUNT(*) FROM employees_q1 ) AS employees_q1, ( SELECT COUNT(*) FROM employees_q2 ) AS employees_q2, ( SELECT COUNT(*) FROM employees_q3 ) AS employees_q3, ( SELECT COUNT(*) FROM employees_q3 ) AS employees_q4 , ( SELECT COUNT(*) FROM employees ) AS employees_totalcnt ; employees_q1 | employees_q2 | employees_q3 | employees_q4 | employees_totalcnt --------------+--------------+--------------+--------------+-------------------- (90) | (91) | (92) | (92) | (365) (1 row)
Performance of DDL operations
Here, I created a table without a partition and inserted the same data, similar to the partitioned table.
A query plan is seen better for DDL operations when performed on data with a single partition or fewer partitions.
CREATE TABLE employees_nopartition (id INT NOt NULL , fname VARCHAR(20) , lname VARCHAR (20) , dob DATE NOT NULL , joined DATE NOT NULL) ;
INSERT INTO employees_nopartition (id ,fname,lname,dob ,joined) VALUES ( generate_series(1, 365) ,(array['Oswald', 'Henry', 'Bob', 'Vennie'])[floor(random() * 4 + 1)], (array['Leo', 'Jack', 'Den', 'Daisy' ,'Woody'])[floor(random() * 5 + 1)], '1995-01-01'::date + trunc(random() * 366 * 3)::int, generate_series('1/1/2022'::date, '12/31/2022'::date, '1 day'));
EXPLAIN select * from employees_nopartition where joined >= '2022-05-12' and joined < '2022-06-10'; QUERY PLAN ------------------------------------------------------------------------------ Seq Scan on employees_nopartition (cost=0.00..8.47 rows=29 width=22) Filter: ((joined >= '2022-05-12'::date) AND (joined < '2022-06-10'::date)) (2 rows)
Here we can see a better query plan when data is fetched from the partitioned table than data fetched from the non-partitioned table.
EXPLAIN select * from employees where joined >= '2022-05-12' and joined < '2022-06-10'; QUERY PLAN ------------------------------------------------------------------------------ Seq Scan on employees_q2 (cost=0.00..2.37 rows=29 width=22) Filter: ((joined >= '2022-05-12'::date) AND (joined < '2022-06-10'::date)) (2 rows)
List partitioning
Database partition that is based on key value(s) or discrete values and partition can also be done with the expression of the column like (RANGE BY LIST(expression)), which is explained below:
For example, I created a table with a list partition and a few list-partitioned tables and inserted some random data with 1,000 rows.
CREATE TABLE sales (id INT NOT NULL , branch VARCHAR(3),type text, Amount int ) PARTITION BY LIST (branch);
CREATE TABLE HYD_sales PARTITION of sales for VALUES IN ('HYD'); CREATE TABLE BLR_sales PARTITION of sales for VALUES IN ('BLR'); CREATE TABLE DEL_sales PARTITION of sales for VALUES IN ('DEL'); CREATE TABLE TPT_sales PARTITION of sales for VALUES IN ('TPT');
INSERT into sales (id , branch ,type , amount ) VALUES ( generate_series(1, 1000) , (array['HYD', 'BLR', 'DEL', 'TPT'])[floor(random() * 4 + 1)] , (array['Laptops', 'Printers', 'Hardisks', 'Desktops' ,'Monitors'])[floor(random() * 5 + 1)], (random()*200000)::int );
List partitions are seen in the table definition as below:
d+ sales Partitioned table "public.sales" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+----------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | branch | character varying(3) | | | | extended | | type | text | | | | extended | | amount | integer | | | | plain | | Partition key: LIST (branch) Partitions: blr_sales FOR VALUES IN ('BLR'), del_sales FOR VALUES IN ('DEL'), hyd_sales FOR VALUES IN ('HYD'), tpt_sales FOR VALUES IN ('TPT')
Partitioned data distributed among its partitions is seen below:
SELECT blr_sales , del_sales , hyd_sales,tpt_sales, total_cnt from ( SELECT COUNT(*) FROM blr_sales ) AS blr_sales, ( SELECT COUNT(*) FROM del_sales ) AS del_sales, ( SELECT COUNT(*) FROM hyd_sales ) AS hyd_sales, ( SELECT COUNT(*) FROM tpt_sales ) AS tpt_sales , ( SELECT COUNT(*) FROM sales ) AS total_cnt; blr_sales | del_sales | hyd_sales | tpt_sales | total_cnt -----------+-----------+-----------+-----------+----------- (262) | (258) | (228) | (252) | (1001) (1 row)
List partitioning using expression
For example, I created a table with list partitioning using the expression of a column.
CREATE TABLE donors (id INT NOt NULL , name VARCHAR(20) , bloodgroup VARCHAR (15) , last_donated DATE , contact_num VARCHAR(10)) PARTITION BY LIST (left(upper(bloodgroup),3));
CREATE TABLE A_positive PARTITION of donors for VALUES IN ('A+ '); CREATE TABLE A_negative PARTITION of donors for VALUES IN ('A- '); CREATE TABLE B_positive PARTITION of donors for VALUES IN ('B+ '); CREATE TABLE B_negative PARTITION of donors for VALUES IN ('B- '); CREATE TABLE AB_positive PARTITION of donors for VALUES IN ('AB+'); CREATE TABLE AB_negative PARTITION of donors for VALUES IN ('AB-'); CREATE TABLE O_positive PARTITION of donors for VALUES IN ('O+ '); CREATE TABLE O_negative PARTITION of donors for VALUES IN ('O- ');
List partitions are seen in the table definition below:
d+ donors Partitioned table "public.donors" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------------+-----------------------+-----------+----------+---------+----------+-------------+--------------+------------- id | integer | | not null | | plain | | | name | character varying(20) | | | | extended | | | bloodgroup | character varying(15) | | | | extended | | | last_donated | date | | | | plain | | | contact_num | character varying(10) | | | | extended | | | Partition key: LIST ("left"(upper((bloodgroup)::text), 3)) Partitions: a_negative FOR VALUES IN ('A- '), a_positive FOR VALUES IN ('A+ '), ab_negative FOR VALUES IN ('AB-'), ab_positive FOR VALUES IN ('AB+'), b_negative FOR VALUES IN ('B- '), b_positive FOR VALUES IN ('B+ '), o_negative FOR VALUES IN ('O- '), o_positive FOR VALUES IN ('O+ ')
Here, I inserted some random 100 rows.
INSERT INTO donors (id , name , bloodgroup , last_donated , contact_num) VALUES (generate_series(1, 100) ,'user_' || trunc(random()*100) , (array['A+ group', 'A- group', 'O- group', 'O+ group','AB+ group','AB- group','B+ group','B- group'])[floor(random() * 8 + 1)] , '2022-01-01'::date + trunc(random() * 366 * 1)::int, CAST(1000000000 + floor(random() * 9000000000) AS bigint));
List partitioned data with expression distributed among its partitions is seen below:
SELECT a_negative , a_positive , ab_negative , ab_positive , b_negative ,b_positive ,o_negative , o_positive , total_cnt from ( SELECT COUNT(*) FROM a_negative ) AS a_negative, ( SELECT COUNT(*) FROM a_positive ) AS a_positive, ( SELECT COUNT(*) FROM ab_negative ) AS ab_negative, ( SELECT COUNT(*) FROM ab_positive ) AS ab_positive , ( SELECT COUNT(*) FROM b_negative ) AS b_negative, ( SELECT COUNT(*) FROM b_positive ) AS b_positive , ( SELECT COUNT(*) FROM o_positive ) AS o_positive , ( SELECT COUNT(*) FROM o_negative ) AS o_negative, ( SELECT COUNT(*) FROM donors ) AS total_cnt; a_negative | a_positive | ab_negative | ab_positive | b_negative | b_positive | o_negative | o_positive | total_cnt ------------+------------+-------------+-------------+------------+------------+------------+------------+----------- (9) | (19) | (10) | (12) | (12) | (10) | (18) | (10) | (100) (1 row)
Performance of DML operations
Here is an example shown with the table, which is created without partitions and inserted the same data similar to that of the partitioned table.
Below I created a table without a partition and inserted some random data with 1,000 rows to show query performance.
CREATE TABLE sales_nopartition (id INT NOT NULL , branch VARCHAR(3),type text, Amount int );
INSERT into sales_nopartition (id , branch ,type , amount ) VALUES ( generate_series(1, 1000) , (array['HYD', 'BLR', 'DEL', 'TPT'])[floor(random() * 4 + 1)] , (array['Laptops', 'Printers', 'Hardisks', 'Desktops' ,'Monitors'])[floor(random() * 5 + 1)], (random()*200000)::int );
UPDATE Query Performance
EXPLAIN update sales_nopartition set type = 'Smart Watches' where branch = 'HYD'; QUERY PLAN --------------------------------------------------------------------------- Update on sales_nopartition (cost=0.00..19.50 rows=229 width=50) -> Seq Scan on sales_nopartition (cost=0.00..19.50 rows=229 width=50) Filter: ((branch)::text = 'HYD'::text) (3 rows)
EXPLAIN update sales set type = 'Smart Watches' where branch = 'HYD'; QUERY PLAN ------------------------------------------------------------------ Update on sales (cost=0.00..5.10 rows=248 width=50) Update on hyd_sales -> Seq Scan on hyd_sales (cost=0.00..5.10 rows=248 width=50) Filter: ((branch)::text = 'HYD'::text) (4 rows)
DELETE Query Performance
EXPLAIN DELETE from sales_nopartition where branch='HYD'; QUERY PLAN -------------------------------------------------------------------------- Delete on sales_nopartition (cost=0.00..19.50 rows=229 width=6) -> Seq Scan on sales_nopartition (cost=0.00..19.50 rows=229 width=6) Filter: ((branch)::text = 'HYD'::text) (3 rows)
EXPLAIN DELETE from sales where branch='HYD'; QUERY PLAN ----------------------------------------------------------------- Delete on sales (cost=0.00..5.10 rows=248 width=6) Delete on hyd_sales -> Seq Scan on hyd_sales (cost=0.00..5.10 rows=248 width=6) Filter: ((branch)::text = 'HYD'::text) (4 rows)
The above examples show the performance of DELETE and UPDATE operations with data fetched from a single partitioned table having a better query plan than the one with no partitions.
Hash partitioning
Hash partitioning table is defined as the table partitioned by specifying a modulus and a remainder for each partition.
- Each partition will hold the rows for which the hash value of the partition key divided by the specified modulus will produce the specified remainder.
- Hash partitioning is best used when each partition is on different table spaces residing on separate physical disks, so the IO is equally divided by more devices.
For example, I created a table with hash partitioning and a few partitioned tables with modulus five.
CREATE TABLE students ( id int NOT NULL, name varchar(30) NOT NULL , course varchar(100) ,joined date ) PARTITION BY hash(id);
CREATE TABLE student_0 PARTITION OF students FOR VALUES WITH (MODULUS 5,REMAINDER 0); CREATE TABLE student_1 PARTITION OF students FOR VALUES WITH (MODULUS 5,REMAINDER 1); CREATE TABLE student_2 PARTITION OF students FOR VALUES WITH (MODULUS 5,REMAINDER 2); CREATE TABLE student_3 PARTITION OF students FOR VALUES WITH (MODULUS 5,REMAINDER 3); CREATE TABLE student_4 PARTITION OF students FOR VALUES WITH (MODULUS 5,REMAINDER 4);
The table structure looks like the one below with five created partitions:
d+ students Partitioned table "public.students" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | name | character varying(30) | | not null | | extended | | course | character varying(100) | | | | extended | | joined | date | | | | plain | | Partition key: HASH (id) Partitions: student_0 FOR VALUES WITH (modulus 5, remainder 0), student_1 FOR VALUES WITH (modulus 5, remainder 1), student_2 FOR VALUES WITH (modulus 5, remainder 2), student_3 FOR VALUES WITH (modulus 5, remainder 3), student_4 FOR VALUES WITH (modulus 5, remainder 4)
Here, I Inserted some random data with 100,000 rows.
INSERT into students (id , name , course ,joined ) VALUES (generate_series(1, 100000) , 'student_' || trunc(random()*1000) , (array['Finance & Accounts', 'Business Statistics', 'Environmental Science'])[floor(random() * 3 + 1)],'2019-01-01'::date + trunc(random() * 366 * 3)::int);
We see below the hash partitioned data among its partitioned tables.
SELECT relname,reltuples as rows FROM pg_class WHERE relname IN ('student_0','student_1','student_2','student_3','student_4') ORDER BY relname; relname | rows -----------+------- student_0 | 19851 student_1 | 20223 student_2 | 19969 student_3 | 19952 student_4 | 20005 (5 rows)
Benefits of hash partitioning
- The primary benefit is to ensure an even distribution of data among a predetermined number of partitions.
- Hash keys are used effectively and efficiently in cases where ranges are not applicable, like employee number, product ID, etc.
What if the data is out of range or list?
For this purpose, we use default partitions on range and list partitioned tables.
For both range and list partitions, data can be stored temporarily, which is out-of-range, by creating a default partition and later creating an appropriate partition.
Hash-partitioned tables may not have a default partition, as the creation of a default partition for hash partitioning does not make any sense and is not needed.
We see here what happens when I try to insert data for which a partition doesn’t exist and how the default partition helps in this case.
INSERT into sales VALUES ( 1001 , 'MYS' , 'Scanners' , 190000); ERROR: no partition of relation "sales" found for row DETAIL: Partition key of the failing row contains (branch) = (MYS).
CREATE TABLE sales_default PARTITION of sales DEFAULT; CREATE TABLE INSERT into sales VALUES ( 1001 , 'MYS' , 'Scanners' , 190000); INSERT 0 1
select * from sales_default ; id | branch | type | amount ------+--------+----------+-------- 1001 | MYS | Scanners | 190000 (1 row)
So the data we inserted is sent to the default partition, and partitions can be created later based on the data in the default table and available partitions.
Conclusion
Here we discussed default partitioning techniques in PostgreSQL using single columns, and we can also create multi-column partitioning. PostgreSQL Partition Manager(pg_partman) can also be used for creating and managing partitions effectively. Further details will be explained in upcoming blogs.
Also, please find below the related blogs for reference:
PostgreSQL Sharding: An Overview and MongoDB Comparison
Performing ETL Using Inheritance in PostgreSQL
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.