Huge pages make PostgreSQL faster; can we use them in Kubernetes? Modern servers operate with terabytes of RAM, and by default, the processor translates virtual memory addresses for every 4KB page. The OS maintains a huge list of allocated and free pages to perform this slow but reliable translation from virtual to physical addresses.
Please check out the Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL blog post for more information.
Setup
I recommend starting with 2MB huge pages because it’s trivial to set up. Unfortunately, the performance in benchmarks is almost the same as for 4KB pages. Kubernetes worker nodes should be configured with GRUB_CMDLINE_LINUX or sysctl vm.nr_hugepages=N: https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
This step could be hard with managed Kubernetes services, like GCP, but easy for kubeadm, kubespray, k3d, and kind installations.
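For self-managed nodes, a minimal sketch of both approaches (the page count is illustrative, and the GRUB regeneration command and config path depend on the distribution):

# Runtime allocation; does not survive a reboot
sudo sysctl vm.nr_hugepages=512

# Persistent allocation: add "hugepagesz=2M hugepages=512" to GRUB_CMDLINE_LINUX
# in /etc/default/grub, regenerate the GRUB config, and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

# Restart the kubelet (or reboot) so the node re-advertises hugepages-2Mi capacity
sudo systemctl restart kubelet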
kubectl shows the amount of huge pages available on a node.
kubectl describe nodes NODENAME
…
  hugepages-1Gi  0 (0%)       0 (0%)
  hugepages-2Mi  1Gi (25%)    1Gi (25%)
…
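The same numbers can be pulled for all nodes at once; the custom-columns expression below is just one way to do it:

kubectl get nodes -o custom-columns='NAME:.metadata.name,HUGEPAGES-2MI:.status.allocatable.hugepages-2Mi'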
The kubectl describe output above reports only 2MB pages as available. During deployment, at the custom resource apply stage, Percona Operator for PostgreSQL 2.2.0 is not able to start the database pods on such nodes:
$ kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
NAME                        READY   STATUS             RESTARTS       AGE
cluster1-instance1-f65t-0   3/4     CrashLoopBackOff   6 (112s ago)   8m35s
cluster1-instance1-2bss-0   3/4     CrashLoopBackOff   6 (100s ago)   8m35s
cluster1-instance1-89v7-0   3/4     CrashLoopBackOff   6 (104s ago)   8m35s
The logs are confusing:
kubectl -n pgo logs cluster1-instance1-f65t-0 -c database
selecting dynamic shared memory implementation ... posix
sh: line 1:   737 Bus error               (core dumped) "/usr/pgsql-15/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
By default, PostgreSQL is configured to use huge pages, but Kubernetes has to allow it first: .spec.instances.resources.limits should be modified to request huge pages. PostgreSQL pods are not able to start without proper limits on a node with huge pages enabled.
  instances:
    - name: instance1
      replicas: 3
      resources:
        limits:
          hugepages-2Mi: 1024Mi
          memory: 1Gi
          cpu: 500m
hugepages-2Mi works only in combination with the memory parameter; you can't specify a huge pages limit alone.
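Once the pods are recreated with the new limits, a quick sanity check confirms that they reached the database container (the pod name is taken from the examples here and will differ in your cluster):

kubectl -n pgo get pod cluster1-instance1-hgrp-0 \
  -o jsonpath='{.spec.containers[?(@.name=="database")].resources.limits}{"\n"}'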
Finally, let's verify huge pages usage in the postmaster memory map:
$ kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- bash
ps -eFH   # check process tree and find "first" postgres process
pmap -X -p 107 | grep huge
         Address Perm   Offset Device     Inode   Size Rss Pss Pss_Dirty Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f35c5c00000 rw-s 00000000  00:0f 145421787 432128   0   0         0          0         0        0              0             0          18432          264192    0       0      0           0 /anon_hugepage (deleted)
Both the Shared_Hugetlb and Private_Hugetlb columns are non-zero (18432 and 264192), which confirms that PostgreSQL is able to use huge pages.
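Another quick cross-check from inside the same container is the postmaster's /proc status file, which sums all hugetlb mappings (the field is available since kernel 4.4; PID 107 is from the example above):

grep HugetlbPages /proc/107/status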
Don’t set huge pages to the exact value of shared_buffers, as shared memory could also be consumed by extensions and many internal structures.
postgres=# SELECT sum(allocated_size)/1024/1024 FROM pg_shmem_allocations;
       ?column?
----------------------
 422.0000000000000000
(1 row)

postgres=# select * from pg_shmem_allocations order by allocated_size desc LIMIT 10;
         name         |    off    |   size    | allocated_size
----------------------+-----------+-----------+----------------
 <anonymous>          |           | 275369344 |      275369344
 Buffer Blocks        |   6843520 | 134217728 |      134217728
 pg_stat_monitor      | 147603584 |  20971584 |       20971648
 XLOG Ctl             |     54144 |   4208200 |        4208256
                      | 439219200 |   3279872 |        3279872
 Buffer Descriptors   |   5794944 |   1048576 |        1048576
 CommitTs              |   4792192 |    533920 |         534016
 Xact                  |   4263040 |    529152 |         529152
 Checkpointer Data     | 146862208 |    393280 |         393344
 Checkpoint BufferIds  | 141323392 |    327680 |         327680
(10 rows)
pg_stat_statements and pg_stat_monitor can add a significant amount of shared memory on top of a small shared_buffers value. Thus, you need "hugepages-2Mi: 512Mi" for "shared_buffers: 128MB".
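On PostgreSQL 15, which the operator version above ships, the server can report how many huge pages its shared memory segment needs, so the estimate can be cross-checked directly (a sketch reusing the pod and container names from the earlier examples):

kubectl -n pgo exec cluster1-instance1-hgrp-0 -c database -- \
  psql -c 'SHOW shared_memory_size_in_huge_pages;'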
Now you know all the caveats and may want to repeat the configuration.
It’s easy with anydbver and k3d. Allocate 2MB huge pages:
sysctl vm.nr_hugepages=2048
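To keep the allocation across host reboots, a sysctl drop-in file is the usual approach (the file name is arbitrary):

echo 'vm.nr_hugepages=2048' | sudo tee /etc/sysctl.d/90-hugepages.conf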
Verify huge pages availability:
egrep 'Huge|Direct' /proc/meminfo
AnonHugePages:    380928 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         4194304 kB
DirectMap4k:     1542008 kB
DirectMap2M:    19326976 kB
DirectMap1G:           0 kB
- Install and configure anydbver.
git clone https://github.com/ihanick/anydbver.git
cd anydbver
ansible-galaxy collection install theredgreek.sqlite
echo PROVIDER=docker > .anydbver
(cd images-build; ./build.sh)
- Start k3d cluster and install Percona Operator for PostgreSQL 2.2.0:
./anydbver deploy k8s-pg:2.2.0
- The command hangs at the cluster deployment stage, and a second terminal shows the CrashLoopBackOff state:
kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
- Change data/k8s/percona-postgresql-operator/deploy/cr.yaml: uncomment .spec.instances[0].resources.limits and set memory: 1Gi, hugepages-2Mi: 1024Mi
- Apply CR again:
kubectl -n pgo apply -f data/k8s/percona-postgresql-operator/deploy/cr.yaml
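After the apply, the instance pods should be recreated and reach the 4/4 Ready state; you can watch them with the same label selector used earlier:

kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres -w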
In summary:
- Huge pages are not supported out of the box in public clouds
- The database can crash with a bus error if huge page allocation fails
- Fresh containerd 1.1.10+ is required.
- Reserve a huge pages amount bigger than shared_buffers, and verify the estimate with pg_shmem_allocations.
- Huge pages are not a silver bullet.
- Without frequent CPU context switches and massively random access to a large shared buffer pool, default 4K pages show comparable results.
- Workloads with fewer than 4-5k transactions per second are fine even without huge pages