Dec 20, 2023

Using Huge Pages with PostgreSQL Running Inside Kubernetes

Huge pages make PostgreSQL faster; can we use them in Kubernetes? Modern servers operate with terabytes of RAM, and by default, the processor translates virtual addresses for every 4KB page. The OS maintains large page tables to track allocated and free pages and to perform the slow but reliable translation from virtual to physical addresses; with 4KB pages, mapping 1TB of RAM takes more than 250 million page table entries.

Please check out the blog post "Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL" for more information.

Setup

I recommend starting with 2MB huge pages because they are trivial to set up, even though benchmark performance is almost the same as with 4KB pages. Kubernetes worker nodes should be configured either via GRUB_CMDLINE_LINUX or with sysctl vm.nr_hugepages=N: https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
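
For example, on a worker node you can reserve the pages at runtime and persist the setting across reboots (a minimal sketch; 2048 pages of 2MB is just an illustration, pick a value that matches your shared memory needs):

sysctl vm.nr_hugepages=2048
echo 'vm.nr_hugepages=2048' > /etc/sysctl.d/90-hugepages.conf
# Alternatively, reserve the pages at boot with GRUB_CMDLINE_LINUX="... hugepagesz=2M hugepages=2048"
# and regenerate the GRUB configuration. Restart the kubelet (or reboot) so the node reports
# the new hugepages-2Mi capacity.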

This step can be hard with managed Kubernetes services, like GCP, but it is easy for kubeadm, kubespray, k3d, and kind installations.

kubectl helps to check the number of huge pages available on a node:

kubectl describe nodes NODENAME
…
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      1Gi (25%)  1Gi (25%)
…

In the output above, only 2MB pages are available. During deployment, at the custom resource apply stage, the database pods created by Percona Operator for PostgreSQL 2.2.0 are not able to start on such nodes:

$ kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
NAME                        READY   STATUS             RESTARTS       AGE
cluster1-instance1-f65t-0   3/4     CrashLoopBackOff   6 (112s ago)   8m35s
cluster1-instance1-2bss-0   3/4     CrashLoopBackOff   6 (100s ago)   8m35s
cluster1-instance1-89v7-0   3/4     CrashLoopBackOff   6 (104s ago)   8m35s

The logs are confusing:

kubectl -n pgo logs cluster1-instance1-f65t-0 -c database
selecting dynamic shared memory implementation ... posix
sh: line 1:   737 Bus error               (core dumped) "/usr/pgsql-15/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1

By default, PostgreSQL tries to use huge pages, but Kubernetes needs to allow it first: .spec.instances.resources.limits should be modified to mention huge pages, because PostgreSQL pods are not able to start without proper limits on a node with huge pages enabled.

instances:
  - name: instance1
    replicas: 3
    resources:
      limits:
        hugepages-2Mi: 1024Mi
        memory: 1Gi
        cpu: 500m
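
If you prefer not to edit the whole manifest, the same limits can be added with a patch (a sketch, assuming the custom resource kind PerconaPGCluster and the cluster name cluster1 used throughout this post):

kubectl -n pgo patch perconapgcluster cluster1 --type=json \
  -p='[{"op":"add","path":"/spec/instances/0/resources","value":{"limits":{"cpu":"500m","memory":"1Gi","hugepages-2Mi":"1024Mi"}}}]'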

hugepages-2Mi works only in combination with the memory parameter; you can't specify a huge pages limit alone.
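
The reason is cgroup enforcement: when a container does not request huge pages, its hugetlb cgroup limit is typically set to zero, so any huge page allocation faults with a bus error. You can inspect the limit from inside the database container (a sketch; your pod name will differ, and the file is hugetlb.2MB.max on cgroup v2 nodes or hugetlb/hugetlb.2MB.limit_in_bytes on cgroup v1):

kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- cat /sys/fs/cgroup/hugetlb.2MB.max
# 1073741824 with hugepages-2Mi: 1024Mi; 0 without the limit, which explains the bus error above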

Finally, let’s verify huge pages usage in postmaster memory map:

$ kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- bash

ps -eFH # check process tree and find “first” postgres process

pmap -X -p 107|grep huge

         Address Perm   Offset Device     Inode   Size   Rss  Pss Pss_Dirty Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping

    7f35c5c00000 rw-s 00000000  00:0f 145421787 432128     0    0         0          0         0        0              0             0          18432          264192    0       0      0           0 /anon_hugepage (deleted)

Both the Shared_Hugetlb and Private_Hugetlb columns are non-zero (18432 and 264192 kB), which confirms that PostgreSQL can use huge pages.
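
As an additional cross-check (a sketch; it assumes psql works inside the database container and that you have shell access to the worker node), confirm the setting from SQL and watch the node's free huge pages counter:

kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- psql -c 'SHOW huge_pages;'
# On the worker node, HugePages_Free should drop below HugePages_Total while PostgreSQL is running:
grep HugePages /proc/meminfo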

Don’t set the huge pages limit to the exact value of shared_buffers, as shared memory is also consumed by extensions and many internal structures.

postgres=# SELECT sum(allocated_size)/1024/1024 FROM pg_shmem_allocations ;
       ?column?       
----------------------
 422.0000000000000000
(1 row)
postgres=# select * from pg_shmem_allocations order by allocated_size desc LIMIT 10;
         name         |    off    |   size    | allocated_size 
----------------------+-----------+-----------+----------------
 <anonymous>          |           | 275369344 |      275369344
 Buffer Blocks        |   6843520 | 134217728 |      134217728
 pg_stat_monitor      | 147603584 |  20971584 |       20971648
 XLOG Ctl             |     54144 |   4208200 |        4208256
                      | 439219200 |   3279872 |        3279872
 Buffer Descriptors   |   5794944 |   1048576 |        1048576
 CommitTs             |   4792192 |    533920 |         534016
 Xact                 |   4263040 |    529152 |         529152
 Checkpointer Data    | 146862208 |    393280 |         393344
 Checkpoint BufferIds | 141323392 |    327680 |         327680
(10 rows)

pg_stat_statements and pg_stat_monitor can add a significant amount of shared memory on top of a small shared_buffers value. Thus, you need “hugepages-2Mi: 512Mi” for “shared_buffers: 128MB”.
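
To pick a reasonable limit, you can measure how much shared memory is actually allocated and round up with some headroom (a sketch; the pod name is the one used earlier in this post):

kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- \
  psql -tAc "SELECT ceil(sum(allocated_size)/1024.0/1024) || ' MB' FROM pg_shmem_allocations;"
# round the result up to whole 2MB pages and leave spare capacity in hugepages-2Mi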

Now you know all the caveats and may want to reproduce the configuration.

It’s easy with anydbver and k3d. Allocate 2MB huge pages:

sysctl vm.nr_hugepages=2048

Verify huge pages availability:

egrep 'Huge|Direct' /proc/meminfo
AnonHugePages:    380928 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         4194304 kB
DirectMap4k:     1542008 kB
DirectMap2M:    19326976 kB
DirectMap1G:           0 kB

  1. Install and configure anydbver.

    git clone https://github.com/ihanick/anydbver.git
    cd anydbver
    ansible-galaxy collection install theredgreek.sqlite
    echo PROVIDER=docker > .anydbver
    (cd images-build;./build.sh)
  2. Start k3d cluster and install Percona Operator for PostgreSQL 2.2.0:

    ./anydbver deploy k8s-pg:2.2.0
  3. The command hangs at the cluster deployment stage, and a second terminal shows the pods in the CrashLoopBackOff state:

    kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
  4. Change data/k8s/percona-postgresql-operator/deploy/cr.yaml
    Uncomment .spec.instances[0].resources.limits and set memory: 1Gi, hugepages-2Mi: 1024Mi
  5. Apply the CR again (a verification check follows this list):

    kubectl -n pgo apply -f data/k8s/percona-postgresql-operator/deploy/cr.yaml
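
After applying the updated CR, the pods should leave the CrashLoopBackOff state, and the node should report the huge pages as allocated (a quick check, reusing the label selector and describe command from above):

kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
kubectl describe nodes | grep -i hugepages-2Mi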

In summary:

  • Huge pages are not supported out of the box in public clouds
  • Database crashes can occur if huge pages allocation fails with a bus error
  • Huge pages are not a silver bullet.
    • Without frequent CPU context switches and massively random access to a large shared buffer pool, the default 4KB pages show comparable results.
    • Workloads with fewer than 4-5k transactions per second are fine even without huge pages (see the measurement sketch below).
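
If you want to measure the impact on your own workload, a simple pgbench run before and after enabling huge pages is usually enough (a sketch; it assumes pgbench is shipped in the database container image and that the local postgres database can hold the test tables):

kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- pgbench -i -s 50 postgres
kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- pgbench -c 16 -j 4 -T 120 postgres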

 

Learn more about Percona Operator for PostgreSQL
