Dec
07
2017
--

Hands-On Look at ZFS with MySQL

This post is a hands-on look at ZFS with MySQL.

In my previous post, I highlighted the similarities between MySQL and ZFS. Before going any further, I'd like you to be able to play and experiment with ZFS. This post shows you how to configure ZFS with MySQL in a minimalistic way on either Ubuntu 16.04 or CentOS 7.

Installation

In order to be able to use ZFS, you need some available storage space. For storage – since the goal here is just to have a hands-on experience – we'll use a simple file as a storage device. Although simplistic, a similar setup is what I have been using on my laptop for nearly three years (I just can't get rid of it, it is too useful). For simplicity, I suggest you use a small CentOS 7 or Ubuntu 16.04 VM with one core, 8GB of disk and 1GB of RAM.

First, you need to install ZFS as it is not installed by default. On Ubuntu 16.04, you simply need to run:

root@Ubuntu1604:~# apt-get install zfs-dkms zfsutils-linux

On RedHat or CentOS 7.4, the procedure is a bit more complex. First, we need to install the EPEL ZFS repository:

[root@Centos7 ~]# yum install http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
[root@Centos7 ~]# gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
[root@Centos7 ~]# gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

Apparently, there have been issues with the ZFS kernel modules on RedHat/CentOS after kernel updates. I never had any issues with Ubuntu (and who knows how often the kernel is updated). In any case, it is recommended that you enable the kABI-tracking kmods, which survive kernel updates without a rebuild. Edit the file /etc/yum.repos.d/zfs.repo, disable the ZFS repo and enable the zfs-kmod repo. The beginning of the file should look like:

[zfs]
name=ZFS on Linux for EL7 - dkms
baseurl=http://download.zfsonlinux.org/epel/7.4/$basearch/
enabled=0
metadata_expire=7d
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
[zfs-kmod]
name=ZFS on Linux for EL7 - kmod
baseurl=http://download.zfsonlinux.org/epel/7.4/kmod/$basearch/
enabled=1
metadata_expire=7d
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
...

Now, we can proceed and install ZFS:

[root@Centos7 ~]# yum install zfs

After the installation, I have ZFS version 0.6.5.6 on Ubuntu and version 0.7.3.0 on CentOS 7. The version difference doesn't matter for what follows.

Setup

So, we need a container for the data. You can use any of the following options for storage:

  • A free disk device
  • A free partition
  • An empty LVM logical volume
  • A file

The easiest solution is to use a file, and so that's what I'll use here. A file is not the fastest or most efficient storage, but it is fine for our hands-on. In production, please use real devices. A more realistic server configuration will be discussed in a future post. The following steps are identical on Ubuntu and CentOS. The first step is to create the storage file. I'll use a file of 1GB in /mnt. Adjust the size and path to whatever suits the resources you have:

[root@Centos7 ~]# dd if=/dev/zero of=/mnt/zfs.img bs=1024 count=1048576

The result is a 1GB file in /mnt:

[root@Centos7 ~]# ls -lh /mnt
total 1,0G
-rw-r--r--.  1 root root 1,0G 16 nov 16:50 zfs.img

Now, we will create our ZFS pool, mysqldata, using the file we just created:

[root@Centos7 ~]# modprobe zfs
[root@Centos7 ~]# zpool create mysqldata /mnt/zfs.img
[root@Centos7 ~]# zpool status
  pool: mysqldata
 state: ONLINE
  scan: none requested
config:
        NAME            STATE     READ WRITE CKSUM
        mysqldata       ONLINE       0     0     0
          /mnt/zfs.img  ONLINE       0     0     0
errors: No known data errors
[root@Centos7 ~]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
mysqldata  79,5K   880M    24K  /mysqldata

If you have a result similar to the above, congratulations, you have a ZFS pool. If you put files in /mysqldata, they are in ZFS.
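
If you want a quick sanity check, copy any file into the pool and watch the USED column move. A minimal example (the file choice is arbitrary):

[root@Centos7 ~]# cp /etc/services /mysqldata/
[root@Centos7 ~]# zfs list mysqldata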

MySQL installation

Now, let’s install MySQL and play around a bit. We’ll begin by installing the Percona repository:

root@Ubuntu1604:~# cd /tmp
root@Ubuntu1604:/tmp# wget https://repo.percona.com/apt/percona-release_0.1-4.$(lsb_release -sc)_all.deb
root@Ubuntu1604:/tmp# dpkg -i percona-release_*.deb
root@Ubuntu1604:/tmp# apt-get update
[root@Centos7 ~]# yum install http://www.percona.com/downloads/percona-release/redhat/0.1-4/percona-release-0.1-4.noarch.rpm

Next, we install Percona Server for MySQL 5.7:

root@Ubuntu1604:~# apt-get install percona-server-server-5.7
root@Ubuntu1604:~# systemctl start mysql
[root@Centos7 ~]# yum install Percona-Server-server-57
[root@Centos7 ~]# systemctl start mysql

The installation command pulls in all the dependencies and sets up the MySQL root password. On Ubuntu, the install script asks for the password, but on CentOS 7 a random password is set. To retrieve the random password:

[root@Centos7 ~]# grep password /var/log/mysqld.log
2017-11-21T18:37:52.435067Z 1 [Note] A temporary password is generated for root@localhost: XayhVloV+9g+

The next step is to reset the root password:

[root@Centos7 ~]# mysql -p -e "ALTER USER 'root'@'localhost' IDENTIFIED BY 'Mysql57OnZfs_';"
Enter password:

Since 5.7.15, the password validation plugin by default requires a length greater than 8, mixed case, at least one digit and at least one special character. On either Linux distribution, I suggest you set the credentials in the /root/.my.cnf file like this:

# cat /root/.my.cnf
[client]
user=root
password=Mysql57OnZfs_

MySQL configuration for ZFS

Now that we have both ZFS and MySQL, we need some configuration to make them play together. From here, the steps are the same on Ubuntu and Centos. First, we stop MySQL:

# systemctl stop mysql

Then, we’ll configure ZFS. We will create three ZFS filesystems in our pool:

  • mysql will be the top level filesystem for the MySQL related data. This filesystem will not directly contain data; data will be stored in the other filesystems that we create. The utility of the mysql filesystem will become obvious when we talk about snapshots. Something to keep in mind for the next steps: the properties of a filesystem are by default inherited from the level above.
  • mysql/data will be the actual datadir. The files in the datadir are mostly accessed through random IO operations, so we’ll set the ZFS recordsize to match the InnoDB page size.
  • mysql/log will be where the log files will be stored. By log files, I primarily mean the InnoDB log files. But the binary log file, the slow query log and the error log will all be stored in that directory. The log files are accessed through sequential IO operations. We’ll thus use a bigger ZFS recordsize in order to maximize the compression efficiency.

Let's begin with the top-level MySQL container. I could have used mysqldata directly, but that would somewhat limit us. The following steps create the filesystem and set some properties:

# zfs create mysqldata/mysql
# zfs set compression=gzip mysqldata/mysql
# zfs set recordsize=128k mysqldata/mysql
# zfs set atime=off mysqldata/mysql

I just set compression to ‘gzip’ (the equivalent of gzip level 6), recordsize to 128KB and atime (the file’s access time) to off. Once we are done with the mysql filesystem, we can proceed with the data and log filesystems:

# zfs create mysqldata/mysql/log
# zfs create mysqldata/mysql/data
# zfs set recordsize=16k mysqldata/mysql/data
# zfs set primarycache=metadata mysqldata/mysql/data
# zfs get compression,recordsize,atime mysqldata/mysql/data
NAME                  PROPERTY     VALUE     SOURCE
mysqldata/mysql/data  compression  gzip      inherited from mysqldata/mysql
mysqldata/mysql/data  recordsize   16K       local
mysqldata/mysql/data  atime        off       inherited from mysqldata/mysql

Of course, there are other properties that could be set, but let’s keep things simple. Now that the filesystems are ready, let’s move the files to ZFS (make sure you stopped MySQL):

# mv /var/lib/mysql/ib_logfile* /mysqldata/mysql/log/
# mv /var/lib/mysql/* /mysqldata/mysql/data/

and then set the real mount points:

# zfs set mountpoint=/var/lib/mysql mysqldata/mysql/data
# zfs set mountpoint=/var/lib/mysql-log mysqldata/mysql/log
# chown mysql.mysql /var/lib/mysql /var/lib/mysql-log

Now we have:

# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
mysqldata             1,66M   878M  25,5K  /mysqldata
mysqldata/mysql       1,54M   878M    25K  /mysqldata/mysql
mysqldata/mysql/data   890K   878M   890K  /var/lib/mysql
mysqldata/mysql/log    662K   878M   662K  /var/lib/mysql-log

We must adjust the MySQL configuration accordingly. Here's what I put in my /etc/my.cnf file (/etc/mysql/my.cnf on Ubuntu). Note that the InnoDB doublewrite buffer and checksums can be disabled here because ZFS already checksums every block, and its copy-on-write design prevents torn pages:

[mysqld]
datadir=/var/lib/mysql
innodb_log_group_home_dir = /var/lib/mysql-log
innodb_doublewrite = 0
innodb_checksum_algorithm = none
slow_query_log = /var/lib/mysql-log/slow.log
log-error = /var/lib/mysql-log/error.log
server_id = 12345
log_bin = /var/lib/mysql-log/binlog
relay_log=/var/lib/mysql-log/relay-bin
expire_logs_days=7
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
pid-file=/var/run/mysqld/mysqld.pid

On CentOS 7, SELinux prevented MySQL from accessing files in /var/lib/mysql-log. I had to perform the following steps:

[root@Centos7 ~]# yum install policycoreutils-python
[root@Centos7 ~]# semanage fcontext -a -t mysqld_db_t "/var/lib/mysql-log(/.*)?"
[root@Centos7 ~]# chcon -Rv --type=mysqld_db_t /var/lib/mysql-log/

I could have just disabled SELinux since it is a test server, but if I don't get my hands dirty with semanage and chcon once in a while, I won't remember how to do it. SELinux is an important security tool on Linux (but that's another story).

At this point, feel free to start using your test MySQL database on ZFS.

Monitoring ZFS

To monitor ZFS, you can use the zpool command like this:

[root@Centos7 ~]# zpool iostat 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mysqldata   19,6M   988M      0      0      0    290
mysqldata   19,3M   989M      0     44      0  1,66M
mysqldata   23,4M   985M      0     49      0  1,33M
mysqldata   23,4M   985M      0     40      0   694K
mysqldata   26,7M   981M      0     39      0   561K
mysqldata   26,7M   981M      0     37      0   776K
mysqldata   23,8M   984M      0     43      0   634K

This shows the ZFS activity while I was loading some data. Also, the following command gives you an estimate of the compression ratio:

[root@Centos7 ~]# zfs get compressratio,used,logicalused mysqldata/mysql
NAME             PROPERTY       VALUE  SOURCE
mysqldata/mysql  compressratio  4.10x  -
mysqldata/mysql  used           116M   -
mysqldata/mysql  logicalused    469M   -
[root@Centos7 ~]# zfs get compressratio,used,logicalused mysqldata/mysql/data
NAME                  PROPERTY       VALUE  SOURCE
mysqldata/mysql/data  compressratio  4.03x  -
mysqldata/mysql/data  used           67,9M  -
mysqldata/mysql/data  logicalused    268M   -
[root@Centos7 ~]# zfs get compressratio,used,logicalused mysqldata/mysql/log
NAME                 PROPERTY       VALUE  SOURCE
mysqldata/mysql/log  compressratio  4.21x  -
mysqldata/mysql/log  used           47,8M  -
mysqldata/mysql/log  logicalused    201M   -

In my case, the dataset compresses very well (4x). Another way to see how files are compressed is to use ls and du. ls returns the apparent, uncompressed size of the file, while du returns the actual compressed size on disk. Here's an example:

[root@Centos7 mysql]# ls -lah ibdata1
-rw-rw---- 1 mysql mysql 90M Nov 24 16:09 ibdata1
[root@Centos7 mysql]# du -hs ibdata1
14M     ibdata1

I really invite you to further experiment and get a feeling of how ZFS and MySQL behave together.

Snapshots and backups

A great feature of ZFS that works really well with MySQL is snapshots. A snapshot is a consistent view of the filesystem at a given point in time. Normally, it is best to perform a snapshot while a FLUSH TABLES WITH READ LOCK is held. That allows you to record the master position, and also flushes the MyISAM tables. It is quite easy to do. Here's how I create a snapshot with MySQL:

[root@Centos7 ~]# mysql -e 'flush tables with read lock;show master status;\! zfs snapshot -r mysqldata/mysql@my_first_snapshot'
+---------------+-----------+--------------+------------------+-------------------+
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------+-----------+--------------+------------------+-------------------+
| binlog.000002 | 110295083 |              |                  |                   |
+---------------+-----------+--------------+------------------+-------------------+
[root@Centos7 ~]# zfs list -t snapshot
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
mysqldata/mysql@my_first_snapshot          0B      -    24K  -
mysqldata/mysql/data@my_first_snapshot     0B      -  67,9M  -
mysqldata/mysql/log@my_first_snapshot      0B      -  47,8M  -

The command took about 1s. The only time such a command would take longer is when there are MyISAM tables with a lot of pending updates to the indices, or when there are long-running transactions. You probably wonder why the "USED" column reports 0B. That's simply because there were no changes to the filesystem since the snapshot was created. It is a measure of the amount of data that hasn't been freed because the snapshot still requires it. In other words, it is how far the filesystem has diverged from the snapshot. You can access the snapshot through a clone or through ZFS as a filesystem. To access the snapshot through ZFS, you have to set the snapdir parameter to "visible", and then you can see the files. Here's how:

[root@Centos7 ~]# zfs set snapdir=visible mysqldata/mysql/data
[root@Centos7 ~]# zfs set snapdir=visible mysqldata/mysql/log
[root@Centos7 ~]# ls /var/lib/mysql-log/.zfs/snapshot/my_first_snapshot/
binlog.000001  binlog.000002  binlog.index  error.log  ib_logfile0  ib_logfile1

The files in the snapshot directory are read-only. If you want to be able to write to the files, you first need to clone the snapshots:

[root@Centos7 ~]# zfs create mysqldata/mysqlslave
[root@Centos7 ~]# zfs clone mysqldata/mysql/data@my_first_snapshot mysqldata/mysqlslave/data
[root@Centos7 ~]# zfs clone mysqldata/mysql/log@my_first_snapshot mysqldata/mysqlslave/log
[root@Centos7 ~]# zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
mysqldata                   116M   764M    26K  /mysqldata
mysqldata/mysql             116M   764M    24K  /mysqldata/mysql
mysqldata/mysql/data       67,9M   764M  67,9M  /var/lib/mysql
mysqldata/mysql/log        47,8M   764M  47,8M  /var/lib/mysql-log
mysqldata/mysqlslave         28K   764M    26K  /mysqldata/mysqlslave
mysqldata/mysqlslave/data     1K   764M  67,9M  /mysqldata/mysqlslave/data
mysqldata/mysqlslave/log      1K   764M  47,8M  /mysqldata/mysqlslave/log

At this point, it is up to you to use the clones to spin up a local slave. As with snapshots, a clone only grows in size when actual data is written to it. ZFS records that haven't changed since the snapshot was taken are shared. That's a huge space saving. For a customer, I once wrote a script to automatically create five MySQL slaves for their developers. The developers would run tests, and often replication broke. Rerunning the script would recreate fresh slaves in a matter of minutes. My ZFS snapshot script and the script I wrote to create the clone-based slaves are available here: https://github.com/y-trudeau/Yves-zfs-tools
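
As a rough sketch of that idea, here is how you could point a second mysqld instance at the clones. The mount points, port and server_id below are hypothetical, and a real slave would still need its replication coordinates set from the snapshot's master status:

[root@Centos7 ~]# zfs set mountpoint=/var/lib/mysql-slave mysqldata/mysqlslave/data
[root@Centos7 ~]# zfs set mountpoint=/var/lib/mysql-slave-log mysqldata/mysqlslave/log
[root@Centos7 ~]# chown -R mysql.mysql /var/lib/mysql-slave /var/lib/mysql-slave-log
[root@Centos7 ~]# rm -f /var/lib/mysql-slave/auto.cnf   # force a new server UUID
[root@Centos7 ~]# mysqld_safe --no-defaults --user=mysql --datadir=/var/lib/mysql-slave \
    --innodb_log_group_home_dir=/var/lib/mysql-slave-log \
    --socket=/tmp/mysql-slave.sock --port=3307 --server_id=12346 &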

Optional features

In the previous post, I talked about a SLOG device for the ZIL and the L2ARC, a disk extension of the ARC cache. If you promise to never use the following trick in production, here's how to drastically speed up MySQL on ZFS:

[root@Centos7 ~]# dd if=/dev/zero of=/dev/shm/zil_slog.img bs=1024 count=131072
131072+0 records in
131072+0 records out
134217728 bytes (134 MB) copied, 0.373809 s, 359 MB/s
[root@Centos7 ~]# zpool add mysqldata log /dev/shm/zil_slog.img
[root@Centos7 ~]# zpool status
  pool: mysqldata
 state: ONLINE
  scan: none requested
config:
        NAME                     STATE     READ WRITE CKSUM
        mysqldata                ONLINE       0     0     0
          /mnt/zfs.img           ONLINE       0     0     0
        logs
          /dev/shm/zil_slog.img  ONLINE       0     0     0
errors: No known data errors

The data in the SLOG is critical for ZFS recovery. I performed some tests with virtual machines, and if you crash the server and lose the SLOG you may lose all the data stored in the ZFS pool. Normally, the SLOG is on a mirror in order to lower the risk of losing it. The SLOG can be added and removed online.

I know I asked you to promise to never use an shm file as SLOG in production. Actually, there are exceptions. I would not hesitate to temporarily use such a trick to speed up a lagging slave. Another situation where such a trick could be used is with Percona XtraDB Cluster. With a cluster, there are multiple copies of the dataset. Even if one node crashed and lost its ZFS filesystems, it could easily be reconfigured and reprovisioned from the surviving nodes.

The other optional feature I want to cover is a cache device. The cache device is what is used for the L2ARC. The content of the L2ARC is compressed the same way the original data is. To add a cache device (again a shm file), do:

[root@Centos7 ~]# dd if=/dev/zero of=/dev/shm/l2arc.img bs=1024 count=131072
131072+0 records in
131072+0 records out
134217728 bytes (134 MB) copied, 0.272323 s, 493 MB/s
[root@Centos7 ~]# zpool add mysqldata cache /dev/shm/l2arc.img
[root@Centos7 ~]# zpool status
  pool: mysqldata
 state: ONLINE
  scan: none requested
config:
    NAME                     STATE     READ WRITE CKSUM
    mysqldata                ONLINE       0     0     0
      /mnt/zfs.img           ONLINE       0     0     0
    logs
      /dev/shm/zil_slog.img  ONLINE       0     0     0
    cache
      /dev/shm/l2arc.img     ONLINE       0     0     0
errors: No known data errors

To monitor the L2ARC (and also the ARC), look at the file /proc/spl/kstat/zfs/arcstats. As the ZFS filesystems are configured right now, very little will go to the L2ARC. This can be frustrating. The reason is that the L2ARC is filled by the elements evicted from the ARC. If you recall, we set primarycache=metadata for the filesystem containing the actual data. Hence, in order to get some data into our L2ARC, I suggest the following steps:

[root@Centos7 ~]# zfs set primarycache=all mysqldata/mysql/data
[root@Centos7 ~]# echo 67108864 > /sys/module/zfs/parameters/zfs_arc_max
[root@Centos7 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@Centos7 ~]# grep '^size' /proc/spl/kstat/zfs/arcstats
size                            4    65097584

The echo to drop_caches forces a re-initialization of the ARC. Now, InnoDB data starts to be cached in the L2ARC. The way data is sent to the L2ARC has many tunables, which I won't discuss here. I chose 64MB for the ARC size mainly because I am using a low-memory VM. A size of 64MB is aggressively small and will slow down ZFS if the metadata doesn't fit in the ARC; normally you should use a larger value. The right size depends on many parameters, like the filesystem size, the number of files and the presence of an L2ARC. You can monitor the ARC and L2ARC using the arcstat tool that comes with ZFS on Linux (when you use CentOS 7). With Ubuntu, download the tool from here.
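
If you just want a quick look at the L2ARC without any extra tool, the same kstat file exposes the relevant counters directly, for example:

[root@Centos7 ~]# grep -E '^l2_(hits|misses|size)' /proc/spl/kstat/zfs/arcstats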

Removal

So the ZFS party is over? We need to clean up the mess! Let’s begin:

[root@Centos7 ~]# systemctl stop mysql
[root@Centos7 ~]# zpool remove mysqldata /dev/shm/l2arc.img
[root@Centos7 ~]# zpool remove mysqldata /dev/shm/zil_slog.img
[root@Centos7 ~]# rm -f /dev/shm/*.img
[root@Centos7 ~]# zpool destroy mysqldata
[root@Centos7 ~]# rm -f /mnt/zfs.img
[root@Centos7 ~]# yum erase spl kmod-spl libzpool2 libzfs2 kmod-zfs zfs

The last step is different on Ubuntu:

root@Ubuntu1604:~# apt-get remove spl-dkms zfs-dkms libzpool2linux libzfs2linux spl zfsutils-linux zfs-zed

Conclusion

With this guide, I hope I provided a positive first experience in using ZFS with MySQL. The configuration is simple, and not optimized for performance. However, we’ll look at more realistic configurations in future posts.

Nov
09
2017
--

MySQL and Linux Context Switches

In this blog post, I'll look at MySQL and Linux context switches, and at what the normal number per second is for a database environment.

You might have heard many times that looking at the number of context switches can indicate whether MySQL is suffering from internal contention issues. I often get asked what a "normal" or "acceptable" number is, and at what point you should worry about the number of context switches per second.

First, let’s talk about what context switches are in Linux. This StackOverflow Thread provides a good discussion, with a lot of details, but basically it works like this:  

The process (or thread, in MySQL's case) is running its computations. Sooner or later, it has to do some blocking operation: disk IO, network IO, blocking on a mutex or yielding. The execution then switches to another process; this is called a voluntary context switch. On the other hand, the process/thread may be preempted by the scheduler because it used up its allotted amount of CPU time (and other tasks need to run), or because a higher-priority task needs to run. These are called involuntary context switches. When the context switches of all the processes in the system are added together, you get the system-wide number of context switches reported (using, for example, vmstat):

root@nuc2:~# vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
17  0      0 12935036 326152 2387388    0    0     0     5     0      1  9  0 91  0  0
20  0      0 12933936 326152 2387384    0    0     0     3 32228 124791 77 22  1  0  0
17  0      0 12933348 326152 2387364    0    0     0    11 33212 124575 78 22  1  0  0
16  0      0 12933380 326152 2387364    0    0     0    78 32470 126100 78 22  1  0  0

This is a global number. In many cases, however, it is better to look at it as context switches per CPU logical core. This is because cores execute tasks independently. As such, they have mostly independent causes for context switches. If you have a large number of cores, there can be quite a difference:

MySQL Context Switches

The number of context switches per second on this system looks high (at more than 1,000,000). Considering it has 56 logical cores, however, it is only about 30,000 per second per logical core (which is not too bad).
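
It is also worth looking at a single process rather than the whole system. The kernel tracks the voluntary/involuntary split per process, and pidstat (from the sysstat package) reports per-second rates; a quick sketch for mysqld:

root@nuc2:~# grep ctxt_switches /proc/$(pidof mysqld)/status
root@nuc2:~# pidstat -w -p $(pidof mysqld) 5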

So how do we judge if the number of context switches is too high in your system? One answer is that it is too high if you’re wasting too much CPU on context switches. This brings up the question: how many context switches can the system handle if it is only doing context switches?

It is easy to find this out!  

Sysbench has a “threads” test designed specifically to measure this. For example:

sysbench --thread-locks=128 --time=7200 --threads=1024 threads run

Check the vmstat output or the Context Switches PMM graph:

MySQL Context Switches 1

We can see this system can handle up to 35 million context switches per second in total (or some 500K per logical CPU core on average).

I don’t recommend using more than 10% of CPU resources on context switching, so I would try to keep the number of the context switches at no more than 50K per logical CPU core.

Now let’s think about context switches from the other side: how many context switches do we expect to have at the very minimum for given load? Even if all the stars align and your query to MySQL doesn’t need any disk IO or context switches due to waiting for mutexes, you should expect at least two context switches: one to the client thread which processes the query and one for the query response sent to the client.    

Using this logic, if we have 100,000 queries/sec we should expect 200,000 context switches at the very minimum.

In the real world, though, I would not worry about contention being a big issue if you have fewer than ten context switches per query.

It is worth noting that in MySQL not every contention results in a context switch. InnoDB implements its own mutexes and RW-locks, which often try to “spin” to wait for a resource to become available. This wastes CPU time directly rather than doing a context switch.
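
To get a rough feel for how much spinning versus OS waiting InnoDB is doing, look at the SEMAPHORES section of SHOW ENGINE INNODB STATUS, which reports spin waits, spin rounds and OS waits (the OS waits are the ones that turn into context switches):

root@nuc2:~# mysql -e 'SHOW ENGINE INNODB STATUS\G' | grep -A 4 SEMAPHORES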

Summary:

  • Look at the number of context switches per logical core rather than the total for easier-to-compare numbers
  • Find out how many context switches your system can handle per second, and don’t get too concerned if your context switches are no more than 10% of that number
  • Think about the number of context switches per query: the minimum possible is two, and values less than 10 make contention an unlikely issue
  • Not every MySQL contention results in a high number of context switches
Oct
13
2017
--

Red Hat continues steady march toward $5 billion revenue goal

 The last time I spoke to Red Hat CEO Jim Whitehurst, he had set a pretty audacious goal for his company to achieve $5 billion in revenue. At the time, that seemed a bit far-fetched. But the company has continued to thrive and is on track to pass $3 billion in revenue some time in the next couple of quarters. Read More

Sep
13
2016
--

Microsoft’s Azure Service Fabric for running and managing microservices is coming to Linux

Microsoft’s CTO for Azure (and occasional novelist) Mark Russinovich is extremely bullish about microservices. In his view, the vast majority of apps — including enterprise apps — will soon be built using microservices. Microsoft, with its variety of cloud services and developer tools, obviously wants a piece of that market. With Service Fabric, the company offers a service… Read More

Aug
12
2016
--

Tuning Linux for MongoDB

In this post, we'll discuss tuning Linux for MongoDB deployments.

By far the most common operating system you’ll see MongoDB running on is Linux 2.6 and 3.x. Linux flavors such as CentOS and Debian do a fantastic job of being a stable, general-purpose operating system. Linux runs software on hardware ranging from tiny computers like the Raspberry Pi up to massive data center servers. To make this flexibility work, however, Linux defaults to some “lowest common denominator” tunings so that the OS will boot on anything.

Working with databases, we often focus on the queries, patterns and tunings that happen inside the database process itself. This means we sometimes forget that the operating system below it is the life-support of the database, the air that it breathes so to speak. Of course, a highly-scalable database such as MongoDB runs fine on these general-purpose defaults without complaints, but the efficiency can be equivalent to running in regular shoes instead of sleek runners. At small scale, you might not notice the lost efficiency, but at large scale (especially when data exceeds RAM) improved tunings equate to fewer servers and lower operational costs. For all use cases and scales, good OS tunings also provide some improvement in response times and remove extra "what if…?" questions when troubleshooting.

Overall, memory, network and disk are the system resources important to MongoDB. This article covers how to optimize each of these areas. Of course, while we have successfully deployed these tunings to many live systems, it’s always best to test before applying changes to your servers.

If you plan on applying these changes, I suggest performing them with one full reboot of the host. Some of these changes don’t require a reboot, but test that they get re-applied if you reboot in the future. MongoDB’s clustered nature should make this relatively painless, plus it might be a good time to do that dreaded “yum upgrade” / “aptitude upgrade“, too.

Linux Ulimit

To prevent a single user from impacting the entire system, Linux has a facility to implement some system resource constraints on processes, file handles and other system resources on a per-user basis. For medium-to-high-usage MongoDB deployments, the default limits are almost always too low. Considering MongoDB generally uses dedicated hardware, it makes sense to allow the Linux user running MongoDB (e.g., "mongod") to use a majority of the available resources.

Now you might be thinking: “Why not disable the limit (or set it to unlimited)?” This is a common recommendation for database servers. I think you should avoid this for two reasons:

  • If you hit a problem, a lack of a limit on system resources can allow a relatively smaller problem to spiral out of control, often bringing down other services (such as SSH) crucial to solving the original problem.
  • All systems DO have an upper-limit, and understanding those limitations instead of masking them is an important exercise.

In most cases, a limit of 64,000 “max user processes” and 64,000 “open files” (both have defaults of 1024) will suffice. To be more exact you need to do some math on the number of applications/clients, the maximum size of their connection pools and some case-by-case tuning for the number of inter-node connections between replica set members and sharding processes. (We might address this in a future blog post.)

You can deploy these limits by adding a file in “/etc/security/limits.d” (or appending to “/etc/security/limits.conf” if there is no “limits.d”). Below is an example file for the Linux user “mongod”, raising open-file and max-user-process limits to 64,000:

mongod       soft        nproc        64000
mongod       hard        nproc        64000
mongod       soft        nofile       64000
mongod       hard        nofile       64000

Note: this change only applies to new shells, meaning you must restart “mongod” or “mongos” to apply this change!
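
To confirm that a running process actually picked up the new limits, you can read its limits file (this assumes a single mongod process on the host):

$ grep -E 'processes|files' /proc/$(pidof mongod)/limits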

Virtual Memory
Dirty Ratio

The “dirty_ratio” is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit the dirty pages are committed to disk, creating a small pause. To avoid this hard pause there is a second ratio: “dirty_background_ratio” (default 10-15%) which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for "dirty_ratio", but on large-memory database servers this can be a lot of memory! For example, on a 128GB-memory host this can allow up to 38.4GB of dirty pages. The background ratio won't kick in until 12.8GB! We recommend that you lower this setting and monitor the impact on query performance and disk IO. The goal is to reduce memory usage without negatively impacting query performance. Reducing cache sizes also guarantees data gets written to disk in smaller batches more frequently, which increases disk throughput (compared with flushing huge bulk writes less often).

A recommended setting for dirty ratios on large-memory (64GB+, perhaps) database servers is "vm.dirty_ratio = 15" and "vm.dirty_background_ratio = 5", or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set this by adding the following lines to "/etc/sysctl.conf":

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

To check these current running values:

$ sysctl -a | egrep "vm.dirty.*_ratio"
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Swappiness

“Swappiness” is a Linux kernel setting that influences the behavior of the Virtual Memory manager when it needs to allocate a swap, ranging from 0-100. A setting of 0 tells the kernel to swap only to avoid out-of-memory problems. A setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a setting of "0" (or sometimes "10") on database servers, telling the kernel to avoid swapping as much as possible for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of 0 in this blog post: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Due to this bug, we recommend using a setting of "1" (or "10" if you prefer some disk swapping) by adding the following to your "/etc/sysctl.conf":

vm.swappiness = 1

To check the current swappiness:

$ sysctl vm.swappiness
vm.swappiness = 1

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply a dirty_ratio or swappiness change!

Transparent HugePages

*Does not apply to Debian/Ubuntu or CentOS/RedHat 5 and lower*

Transparent HugePages is an optimization introduced in CentOS/RedHat 6.0, with the goal of reducing overhead on systems with large amounts of memory. However, due to the way MongoDB uses memory, this feature actually does more harm than good, as memory accesses are rarely contiguous.

Disable THP entirely by adding the following flag to your Linux kernel boot options:

transparent_hugepage=never

Usually this requires changes to the GRUB boot-loader config in the directory "/boot/grub" or "/etc/grub.d" on newer systems. Red Hat covers this in more detail in this article (same method on CentOS): https://access.redhat.com/solutions/46111.

Note: We recommend rebooting the system to clear out any previous huge pages and to validate that the setting persists across reboots.
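
After the reboot, you can confirm THP is off; the active value is shown in square brackets (on CentOS/RedHat 6 the path is "/sys/kernel/mm/redhat_transparent_hugepage" instead):

$ cat /sys/kernel/mm/transparent_hugepage/enabled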

NUMA (Non-Uniform Memory Access) Architecture

Non-Uniform Memory Access is a recent memory architecture that takes into account the locality of caches and CPUs for lower latency. Unfortunately, MongoDB is not "NUMA-aware", and leaving NUMA in its default configuration can cause severe memory imbalance.

There are two ways to disable NUMA: one is via an on/off switch in the system BIOS config, the second is using the "numactl" command to set NUMA interleave mode (similar effect to disabling NUMA) when starting MongoDB. Both methods achieve the same result. I lean towards using the "numactl" command, as it future-proofs you for the mostly inevitable addition of NUMA awareness. On CentOS 7+ you may need to install the "numactl" yum/rpm package.

To make mongod start in interleaved mode, add "numactl --interleave=all" before your regular "mongod" command:

$ numactl --interleave=all mongod <options here>

To check mongod’s NUMA setting:

$ sudo numastat -p $(pidof mongod)
Per-node process memory usage (in MBs) for PID 7516 (mongod)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                        28.53           28.53
Stack                        0.20            0.20
Private                      7.55            7.55
----------------  --------------- ---------------
Total                       36.29           36.29

If you see only one NUMA-node column ("Node 0"), NUMA is disabled. If you see more than one NUMA-node column, make sure the metric numbers ("Heap", etc.) are balanced between nodes. Otherwise, NUMA is NOT in "interleave" mode.

Note: some MongoDB packages already ship logic to disable NUMA in the init/startup script. Check for this using “grep” first. Your hardware or BIOS manual should cover disabling NUMA via the system BIOS.

Block Device IO Scheduler and Read-Ahead

For tuning flexibility, we recommend that MongoDB data sit on its own disk volume, preferably with its own dedicated disks/RAID array. While it may complicate backups, for the best performance you can also dedicate a separate volume for the MongoDB journal to separate its disk activity noise from the main data set. The journal does not yet have its own config/command-line setting, so you'll need to mount a volume to the "journal" directory inside the dbPath. For example, "/var/lib/mongo/journal" would be the journal mount-path if the dbPath was set to "/var/lib/mongo".

Aside from good hardware, the block device MongoDB stores its data on can benefit from two major adjustments:

IO Scheduler

The IO scheduler is an algorithm the kernel uses to commit reads and writes to disk. By default, most Linux installs use the CFQ (Completely-Fair Queue) scheduler. This is designed to work well for many general use cases, but it offers few latency guarantees. Two other popular schedulers are "deadline" and "noop". Deadline excels at latency-sensitive use cases (like databases), and noop is closer to no scheduling at all.

We generally suggest using the "deadline" IO scheduler for cases where you have real, non-virtualised disks under MongoDB. (For example, a "bare metal" server.) In some cases I've seen "noop" perform better with certain hardware RAID controllers, however. The difference between "deadline" and "cfq" can be massive for disk-bound deployments.

If you are running MongoDB inside a VM (which has its own IO scheduler beneath it), it is best to use "noop" and let the virtualization layer take care of the IO scheduling itself.

Read-Ahead

Read-ahead is a per-block-device performance tuning in Linux that causes data ahead of a requested block on disk to be read and then cached into the filesystem cache. Read-ahead assumes that there is a sequential read pattern and that something will benefit from those extra blocks being cached. MongoDB tends to have very random disk patterns and often does not benefit from the default read-ahead setting, wasting memory that could be used for more hot data. Most Linux systems have a default setting of 128KB/256 sectors (128KB = 256 x 512-byte sectors). This means that if MongoDB fetches a 64KB document from disk, 128KB of filesystem cache is used, and maybe the extra 64KB is never accessed later, wasting memory.

For this setting, we suggest a starting-point of 32 sectors (=16KB) for most MongoDB workloads. From there you can test increasing/reducing this setting and then monitor a combination of query performance, cached memory usage and disk read activity to find a better balance. You should aim to use as little cached memory as possible without dropping the query performance or causing significant disk activity.

Both the IO scheduler and read-ahead can be changed by adding a file to the udev configuration at "/etc/udev/rules.d". In this example I am assuming the block device serving mongo data is named "/dev/sda" and I am setting "deadline" as the IO scheduler and 16KB/32 sectors as read-ahead:

# set deadline scheduler and 16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

To check the IO scheduler was applied ([square-brackets] = enabled scheduler):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

To check the current read-ahead setting:

$ sudo blockdev --getra /dev/sda
32

Note: this change should be applied and tested with a full system reboot!

Filesystem and Options

It is recommended that MongoDB use only the ext4 or XFS filesystems for on-disk database data. ext3 should be avoided due to its poor pre-allocation performance. If you're using WiredTiger (MongoDB 3.0+) as a storage engine, it is strongly advised that you ONLY use XFS due to serious stability issues on ext4.

Each time you read a file, the filesystem performs an access-time metadata update by default. However, MongoDB (and most applications) does not use this access-time information. This means you can disable access-time updates on MongoDB's data volume, which eliminates the small amount of disk IO activity these updates cause.

You can disable access-time updates by adding the flag "noatime" to the filesystem options field in the file "/etc/fstab" (4th field) for the disk serving MongoDB data:

/dev/mapper/data-mongodb /var/lib/mongo        ext4        defaults,noatime    0 0

To verify the volume is currently mounted with "noatime":

$ grep "/var/lib/mongo" /proc/mounts
/dev/mapper/data-mongodb /var/lib/mongo ext4 rw,seclabel,noatime,data=ordered 0 0

Note: to apply a filesystem-options change, you must remount (umount + mount) the volume again after stopping MongoDB, or reboot.

Network Stack

Several defaults of the Linux kernel network tunings are either not optimal for MongoDB, limit a typical host with 1000Mbps (or better) network interfaces, or cause unpredictable behavior with routers and load balancers. We suggest some increases to the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and a decrease in keepalive settings, seen below.

Make these changes permanent by adding the following to "/etc/sysctl.conf" (or a new file "/etc/sysctl.d/mongodb-sysctl.conf" if /etc/sysctl.d exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

To check the current values of any of these settings:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply this change!

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB is clustered, it relies on a consistent time across nodes. Thus the NTP daemon should run permanently on all MongoDB hosts, mongos and arbiters included. Be sure to check that the time syncing won't fight with any guest-based virtualization tools like "VMWare tools" and "VirtualBox Guest Additions".

This is installed on RedHat/CentOS with:

$ sudo yum install ntp

And on Debian/Ubuntu:

$ sudo apt-get install ntp

Note: Start and enable the NTP Daemon (for starting on reboots) after installation. The commands to do this vary by OS and OS version, so please consult your documentation.
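
As a rough example, on systemd-based systems this is usually something like:

$ sudo systemctl enable ntpd && sudo systemctl start ntpd    # RedHat/CentOS (service "ntpd")
$ sudo systemctl enable ntp && sudo systemctl start ntp      # Debian/Ubuntu (service "ntp")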

Security-Enhanced Linux (SELinux)

Security-Enhanced Linux is a kernel-level security access control module that has an unfortunate tendency to be disabled or set to warn-only on Linux deployments. As SELinux is a strict access control system, it can sometimes cause unexpected errors (permission denied, etc.) with applications that were not configured properly for it. Often people disable SELinux to resolve the issue and forget about it entirely. While implementing SELinux is not an end-all solution, it massively reduces the local attack surface of the server. We recommend deploying MongoDB with SELinux "Enforcing" mode on.

The modes of SELinux are:

  1. Enforcing – Block and log policy violations.
  2. Permissive – Log policy violations only.
  3. Disabled – Completely disabled.

As database servers are usually dedicated to one purpose, such as running MongoDB, the work of setting up SELinux is a lot simpler than on a multi-use server with many processes and users (such as an application/web server). The OS access pattern of a database server should be extremely predictable. Introducing "Enforcing" mode at the very beginning of your testing/installation, instead of after the fact, avoids a lot of gotchas with SELinux. Logging for SELinux is directed to "/var/log/audit/audit.log" and the configuration is at "/etc/selinux".

Luckily, Percona Server for MongoDB RPM packages (CentOS/RedHat) are SELinux “Enforcing” mode compatible as they install/enable an SELinux policy at RPM install time! Debian/Ubuntu SELinux support is still in planning.

Here you can see the SELinux policy shipped in the Percona Server for MongoDB version 3.2 server package:

$ rpm -ql Percona-Server-MongoDB-32-server | grep selinux
/etc/selinux/targeted/modules/active/modules/mongod.pp

To change the SELinux mode to "Enforcing":

$ sudo setenforce Enforcing

To check the running SELinux mode:

$ sudo getenforce
Enforcing

Linux Kernel and Glibc Version

The versions of the Linux kernel and of Glibc itself may be more important than you think. Some community benchmarks show a significant improvement on OLTP throughput benchmarks with the recent Linux 3.x kernels versus the 2.6 kernels still widely deployed. To avoid serious bugs, MongoDB should at minimum use Linux 2.6.36 and Glibc 2.13 or newer.
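
Checking where a host stands takes two commands:

$ uname -r
$ ldd --version | head -n1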

I hope to create a follow-up post on the specific differences seen under MongoDB workloads on Linux 3.2+ versus 2.6. Until then, I recommend you test the difference using your own workloads and any results/feedback are appreciated.

What’s Next?

What’s the next thing to tune? At this point, tuning becomes case-by-case and open-ended. I appreciate any comments on use-case/tunings pairings that worked for you. Also, look out for follow-ups to this article for a few tunings I excluded due to lack of testing.

Not knowing the next step might mean you're done tuning, or that you need more visibility into your stack to find the next bottleneck. Good monitoring and data visibility are invaluable for this type of investigation. Look out for future posts regarding monitoring your MongoDB (or MySQL) deployment and consider using Percona Monitoring and Management as an all-in-one monitoring solution. You could also try using Percona-Lab/prometheus_mongodb_exporter, prometheus/node_exporter and Percona-Lab/grafana_mongodb_dashboards for monitoring MongoDB/Linux with Prometheus and Grafana.

The road to an efficient database stack requires patience, analysis and iteration. Tomorrow a new hardware architecture or change in kernel behavior could come, be the first to spot the next bottleneck! Happy hunting.

Aug
09
2016
--

Mirantis and SUSE team up to give OpenStack users new support options

Mirantis, which specializes in offering software, support and training for running OpenStack, today announced that it is partnering with Germany-based SUSE, best known for its Linux distribution, to offer its customers support for SUSE’s enterprise Linux offering. The two companies also said that they will work on making SUSE Linux Enterprise Server a development platform for use… Read More

Jul
06
2016
--

Canonical-Pivotal partnership makes Ubuntu preferred Linux distro for Cloud Foundry

Pivotal, developers of the Cloud Foundry open source cloud development platform, and Canonical, the company behind the popular Ubuntu Linux distribution, announced a partnership today where Ubuntu becomes the preferred operating system for Cloud Foundry.
In fact, the two companies have been BFFs since the earliest days of Cloud Foundry when it was an open source project developed at VMware. Read More

Jul
05
2016
--

LzLabs launches product to move mainframe COBOL code to Linux cloud

Somewhere in a world full of advanced technology that we write about regularly here on TechCrunch, there exists an ancient realm where mainframe computers are still running programs written in COBOL.
This is a programming language, mind you, that was developed in the late 1950s, and used widely in the ’60s and ’70s and even into the ’80s, but it’s never really gone away. Read More

Jun
21
2016
--

As Red Hat aims for $5 billion in revenue, Linux won’t be only driver

Last year Red Hat, which has been mostly known for selling Linux in the enterprise, became the first $2 billion open source company. Now it wants to be the first to $5 billion, but it might not be just Linux that gets it there. A couple of years ago Red Hat CEO Jim Whitehurst recognized, even in the face of rising revenue, that the company couldn’t continue growing forever… Read More

Jun
20
2016
--

Running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host

This post is a continuation of my Docker series, and examines running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host.

In this blog I want to look into a lower-level building block: Linux Network Namespace.

The same as with cgroups, Docker uses Linux Network Namespace for resource isolation. I was looking into cgroup a year ago, and now I want to understand more about Network Namespace.

The goal is both to understand a bit more about Docker internals, and to see how we can provide network isolation for different processes within the same host. You might need to isolate processes when running several MySQL or MongoDB instances on the same server (which might come in handy during testing). In this case, I needed to test ProxySQL without Docker.

We can always use different ports for different MySQL instances (such as 3306, 3307, 3308), but it quickly gets complicated.

We could also use IP address aliases for an existing network interface, and use bind=<IP.ADD.RE.SS> for each instance. But since Percona XtraDB Cluster can use three different ports and network channels for communications, this also quickly gets complicated.

Linux Network Namespace provides greater network isolation for resources, so it can be a better fit for Percona XtraDB Cluster nodes. Now, setting up network namespaces in and of itself can be confusing; my recommendation is: if you can use Docker, use Docker instead. It provides isolation on process IDs and mount points, and takes care of all the script plumbing to create and destroy networks. As you will see in our scripts, we need to handle the datadir locations ourselves.

Let’s create a network for Percona XtraDB Cluster with Network Namespaces.

I will try to do the following:

  • Start four nodes of Percona XtraDB Cluster
  • For each node, create a separate network namespace so the nodes will be able to allocate network ports 3306, 4567, 4568 without conflicts
  • Assign the nodes IP addresses: 10.200.10.2-10.200.10.5
  • Create a “bridge interface” for the nodes to communicate, using IP address 10.200.10.1.

For reference, I took ideas from this post: Linux Switching – Interconnecting Namespaces

First, we must create the bridge interface on the host:

BRIDGE=br-pxc
brctl addbr $BRIDGE
brctl stp $BRIDGE off
ip addr add 10.200.10.1/24 dev $BRIDGE
ip link set dev $BRIDGE up
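
Before adding the namespaces, you can sanity-check the bridge:

brctl show $BRIDGE
ip addr show dev $BRIDGE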

Next, we create four namespaces (one per Percona XtraDB Cluster node) using the following logic:

for i in 1 2 3 4
do
 ip netns add pxc_ns$i
 ip link add pxc-veth$i type veth peer name br-pxc-veth$i
 brctl addif $BRIDGE br-pxc-veth$i
 ip link set pxc-veth$i netns pxc_ns$i
 ip netns exec pxc_ns$i ip addr add 10.200.10.$((i+1))/24 dev pxc-veth$i
 ip netns exec pxc_ns$i ip link set dev pxc-veth$i up
 ip link set dev br-pxc-veth$i up
 ip netns exec pxc_ns$i ip link set lo up
 ip netns exec pxc_ns$i ip route add default via 10.200.10.1
done

We see the following interfaces on the host:

1153: br-pxc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
 link/ether 32:32:4c:36:22:87 brd ff:ff:ff:ff:ff:ff
 inet 10.200.10.1/24 scope global br-pxc
 valid_lft forever preferred_lft forever
 inet6 fe80::2ccd:6ff:fe04:c7d5/64 scope link
 valid_lft forever preferred_lft forever
1154: br-pxc-veth1@if1155: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether c6:28:2d:23:3b:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 8
 inet6 fe80::c428:2dff:fe23:3ba4/64 scope link
 valid_lft forever preferred_lft forever
1156: br-pxc-veth2@if1157: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether 32:32:4c:36:22:87 brd ff:ff:ff:ff:ff:ff link-netnsid 12
 inet6 fe80::3032:4cff:fe36:2287/64 scope link
 valid_lft forever preferred_lft forever
1158: br-pxc-veth3@if1159: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether 8a:3a:c1:e0:8a:67 brd ff:ff:ff:ff:ff:ff link-netnsid 13
 inet6 fe80::883a:c1ff:fee0:8a67/64 scope link
 valid_lft forever preferred_lft forever
1160: br-pxc-veth4@if1161: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether aa:56:7f:41:1d:c3 brd ff:ff:ff:ff:ff:ff link-netnsid 11
 inet6 fe80::a856:7fff:fe41:1dc3/64 scope link
 valid_lft forever preferred_lft forever

We also see the following network namespaces:

# ip netns
pxc_ns4 (id: 11)
pxc_ns3 (id: 13)
pxc_ns2 (id: 12)
pxc_ns1 (id: 8)

After that, we can check the namespace IP address:

# ip netns exec pxc_ns3 bash
# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 inet 127.0.0.1/8 scope host lo
 valid_lft forever preferred_lft forever
 inet6 ::1/128 scope host
 valid_lft forever preferred_lft forever
1159: pxc-veth3@if1158: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
 link/ether 4a:ad:be:6a:aa:c6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
 inet 10.200.10.4/24 scope global pxc-veth3
 valid_lft forever preferred_lft forever
 inet6 fe80::48ad:beff:fe6a:aac6/64 scope link
 valid_lft forever preferred_lft forever

To enable communication from inside the network namespace to the external world, we should add some iptables rules, e.g.:

iptables -t nat -A POSTROUTING -s 10.200.10.0/255.255.255.0 -o enp2s0f0 -j MASQUERADE
iptables -A FORWARD -i enp2s0f0 -o $BRIDGE -j ACCEPT
iptables -A FORWARD -o enp2s0f0 -i $BRIDGE -j ACCEPT

where enp2s0f0 is an interface that has an external IP address (for some reason, modern Linux distros decided to use names like enp2s0f0 for network interfaces instead of the good old "eth0").
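
Note that for the FORWARD rules to have any effect, kernel IP forwarding must also be enabled; depending on your distro defaults you may need:

sysctl -w net.ipv4.ip_forward=1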

To start a node (or mysqld instance) inside a network namespace, we prefix the command with ip netns exec <namespace>.

For example to start Percona XtraDB Cluster first node, in the namespace pxc_ns1, with IP address 10.200.10.2, we use:

ip netns exec pxc_ns1 mysqld --defaults-file=node.cnf --datadir=/data/datadir/node1 --socket=/tmp/node1_mysql.sock --user=root --wsrep_cluster_name=cluster1

To start the remaining nodes:

NODE=2 ip netns exec pxc_ns${NODE} mysqld --defaults-file=node${NODE}.cnf --datadir=/data/datadir/node${NODE} --socket=/tmp/node${NODE}_mysql.sock --user=root --wsrep_cluster_address="gcomm://10.200.10.2" --wsrep_cluster_name=cluster1  
NODE=3 ip netns exec pxc_ns${NODE} mysqld --defaults-file=node${NODE}.cnf --datadir=/data/datadir/node${NODE} --socket=/tmp/node${NODE}_mysql.sock --user=root --wsrep_cluster_address="gcomm://10.200.10.2" --wsrep_cluster_name=cluster1  
etc

As a result of this procedure, we have four Percona XtraDB Cluster nodes running in individual network namespaces, without worrying about IP address and port conflicts. We also allocated a dedicated IP range for our cluster.

This procedure isn't trivial, but it is easy to script. I also think it provides a good understanding of what Docker, LXC and other containerization technologies do behind the scenes with networking.
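
When the experiment is over, a teardown that mirrors the setup above looks like this (deleting a namespace also destroys its veth pair):

for i in 1 2 3 4
do
 ip netns del pxc_ns$i
done
ip link set dev $BRIDGE down
brctl delbr $BRIDGE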

 
