Oct
13
2017
--

Red Hat continues steady march toward $5 billion revenue goal

 The last time I spoke to Red Hat CEO Jim Whitehurst, he had set a pretty audacious goal for his company to achieve $5 billion in revenue. At the time, that seemed a bit far-fetched. But the company has continued to thrive and is on track to pass $3 billion in revenue some time in the next couple of quarters. Read More

Sep
13
2016
--

Microsoft’s Azure Service Fabric for running and managing microservices is coming to Linux

A Microsoft logo sits on a flag flying in the grounds of the Nokia Oyj mobile handset factory, operated by Microsoft Corp., in Komarom, Hungary, on Monday, July 21, 2014. Microsoft said it will eliminate as many as 18,000 jobs, the largest round of cuts in its history, as Chief Executive Officer Satya Nadella integrates Nokia Oyj's handset unit and slims down the software maker. Photographer: Akos Stiller/Bloomberg via Getty Images Microsoft’s CTO for Azure (and occasional novelist) Mark Russinovich is extremely bullish about microservices. In his view, the vast majority of apps — including enterprise apps — will soon be built using microservices. Microsoft, with its variety of cloud services and developer tools, obviously wants a piece of that market. With Service Fabric, the company offers a service… Read More

Aug
12
2016
--

Tuning Linux for MongoDB

tuning Linux for MongoDB

tuning Linux for MongoDBIn this post, we’ll discuss tuning Linux for MongoDB deployments.

By far the most common operating system you’ll see MongoDB running on is Linux 2.6 and 3.x. Linux flavors such as CentOS and Debian do a fantastic job of being a stable, general-purpose operating system. Linux runs software on hardware ranging from tiny computers like the Raspberry Pi up to massive data center servers. To make this flexibility work, however, Linux defaults to some “lowest common denominator” tunings so that the OS will boot on anything.

Working with databases, we often focus on the queries, patterns and tunings that happen inside the database process itself. This means we sometimes forget that the operating system below it is the life-support of database, the air that it breathes so-to-speak. Of course, a highly-scalable database such as MongoDB runs fine on these general-purpose defaults without complaints, but the efficiency can be equivalent to running in regular shoes instead of sleek runners. At small scale, you might not notice the lost efficiency, but at large scale (especially when data exceeds RAM) improved tunings equate to fewer servers and less operational costs. For all use cases and scale, good OS tunings also provide some improvement in response times and removes extra “what if…?” questions when troubleshooting.

Overall, memory, network and disk are the system resources important to MongoDB. This article covers how to optimize each of these areas. Of course, while we have successfully deployed these tunings to many live systems, it’s always best to test before applying changes to your servers.

If you plan on applying these changes, I suggest performing them with one full reboot of the host. Some of these changes don’t require a reboot, but test that they get re-applied if you reboot in the future. MongoDB’s clustered nature should make this relatively painless, plus it might be a good time to do that dreaded “yum upgrade” / “aptitude upgrade“, too.

Linux Ulimit

To prevent a single user from impacting the entire system, Linux has a facility to implement some system resource constraints on processes, file handles and other system resources on a per-user-basis. For medium-high-usage MongoDB deployments, the default limits are almost always too low. Considering MongoDB generally uses dedicated hardware, it makes sense to allow the Linux user running MongoDB (e.g., “mongod”) to use a majority of the available resources.

Now you might be thinking: “Why not disable the limit (or set it to unlimited)?” This is a common recommendation for database servers. I think you should avoid this for two reasons:

  • If you hit a problem, a lack of a limit on system resources can allow a relatively smaller problem to spiral out of control, often bringing down other services (such as SSH) crucial to solving the original problem.
  • All systems DO have an upper-limit, and understanding those limitations instead of masking them is an important exercise.

In most cases, a limit of 64,000 “max user processes” and 64,000 “open files” (both have defaults of 1024) will suffice. To be more exact you need to do some math on the number of applications/clients, the maximum size of their connection pools and some case-by-case tuning for the number of inter-node connections between replica set members and sharding processes. (We might address this in a future blog post.)

You can deploy these limits by adding a file in “/etc/security/limits.d” (or appending to “/etc/security/limits.conf” if there is no “limits.d”). Below is an example file for the Linux user “mongod”, raising open-file and max-user-process limits to 64,000:

mongod       soft        nproc        64000
mongod       hard        nproc        64000
mongod       soft        nofile       64000
mongod       hard        nofile       64000

Note: this change only applies to new shells, meaning you must restart “mongod” or “mongos” to apply this change!

Virtual Memory
Dirty Ratio

The “dirty_ratio” is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit the dirty pages are committed to disk, creating a small pause. To avoid this hard pause there is a second ratio: “dirty_background_ratio” (default 10-15%) which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for “dirty_ratio”, but on large-memory database servers this can be a lot of memory! For example, on a 128GB-memory host this can allow up to 38.4GB of dirty pages. The background ratio won’t kick in until 12.8GB! We recommend that you lower this setting and monitor the impact to query performance and disk IO. The goal is reducing memory usage without impacting query performance negatively. Reducing caches sizes also guarantees data gets written to disk in smaller batches more frequently, which increases disk throughput (than huge bulk writes less often).

A recommended setting for dirty ratios on large-memory (64GB+ perhaps) database servers is: “vm.dirty_ratio = 15″ and vm.dirty_background_ratio = 5″, or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set this by adding the following lines to /etc/sysctl.conf”:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

To check these current running values:

$ sysctl -a | egrep "vm.dirty.*_ratio"
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Swappiness

“Swappiness” is a Linux kernel setting that influences the behavior of the Virtual Memory manager when it needs to allocate a swap, ranging from 0-100. A setting of 0 tells the kernel to swap only to avoid out-of-memory problems. A setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a setting of 0″ (or sometimes “10”) on database servers, telling the kernel to prefer to swap to memory for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of 0 in this blog post: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Due to this bug, we recommended using a setting of 1″ (or “10” if you  prefer some disk swapping) by adding the following to your /etc/sysctl.conf”:

vm.swappiness = 1

To check the current swappiness:

$ sysctl vm.swappiness
vm.swappiness = 1

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply a dirty_ratio or swappiness change!

Transparent HugePages

*Does not apply to Debian/Ubuntu or CentOS/RedHat 5 and lower*

Transparent HugePages is an optimization introduced in CentOS/RedHat 6.0, with the goal of reducing overhead on systems with large amounts of memory. However, due to the way MongoDB uses memory, this feature actually does more harm than good as memory access are rarely contiguous.

Disabled THP entirely by adding the following flag below to your Linux kernel boot options:

transparent_hugepage=never

Usually this requires changes to the GRUB boot-loader config in the directory /boot/grub” or /etc/grub.d” on newer systems. Red Hat covers this in more detail in this article (same method on CentOS): https://access.redhat.com/solutions/46111.

Note: We recommended rebooting the system to clear out any previous huge pages and validate that the setting will persist on reboot.

NUMA (Non-Uniform Memory Access) Architecture

Non-Uniform Memory Access is a recent memory architecture that takes into account the locality of caches and CPUs for lower latency. Unfortunately, MongoDB is not “NUMA-aware” and leaving NUMA setup in the default behavior can cause severe memory in-balance.

There are two ways to disable NUMA: one is via an on/off switch in the system BIOS config, the 2nd is using the numactl” command to set NUMA-interleaved-mode (similar effect to disabling NUMA) when starting MongoDB. Both methods achieve the same result. I lean towards using the numactl” command due to future-proofing yourself for the mostly inevitable addition of NUMA awareness. On CentOS 7+ you may need to install the numactl” yum/rpm package.

To make mongod start using interleaved-mode, add numactl –interleave=all” before your regular mongod” command:

$ numactl --interleave=all mongod <options here>

To check mongod’s NUMA setting:

$ sudo numastat -p $(pidof mongod)
Per-node process memory usage (in MBs) for PID 7516 (mongod)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                        28.53           28.53
Stack                        0.20            0.20
Private                      7.55            7.55
----------------  --------------- ---------------
Total                       36.29           36.29

If you see only 1 x NUMA-node column (“Node0”) NUMA is disabled. If you see more than 1 x NUMA-node, make sure the metric numbers (Heap”, etc.) are balanced between nodes. Otherwise, NUMA is NOT in “interleave” mode.

Note: some MongoDB packages already ship logic to disable NUMA in the init/startup script. Check for this using “grep” first. Your hardware or BIOS manual should cover disabling NUMA via the system BIOS.

Block Device IO Scheduler and Read-Ahead

For tuning flexibility, we recommended that MongoDB data sits on its own disk volume, preferably with its own dedicated disks/RAID array. While it may complicate backups, for the best performance you can also dedicate a separate volume for the MongoDB journal to separate it’s disk activity noise from the main data set. The journal does not yet have it’s own config/command-line setting, so you’ll need to mount a volume to the journal” directory inside the dbPath. For example, /var/lib/mongo/journal” would be the journal mount-path if the dbPath was set to /var/lib/mongo”.

Aside from good hardware, the block device MongoDB stores its data on can benefit from 2 x major adjustments:

IO Scheduler

The IO scheduler is an algorithm the kernel will use to commit reads and writes to disk. By default most Linux installs use the CFQ (Completely-Fair Queue) scheduler. This is designed to work well for many general use cases, but with little latency guarantees. Two other popular schedulers are deadline” and noop”. Deadline excels at latency-sensitive use cases (like databases) and noop is closer to no scheduling at all.

We generally suggest using the deadline” IO scheduler for cases where you have real, non-virtualised disks under MongoDB. (For example, a “bare metal” server.) In some cases I’ve seen noop” perform better with certain hardware RAID controllers, however. The difference between deadline” and cfq” can be massive for disk-bound deployments.

If you are running MongoDB inside a VM (which has it’s own IO scheduler beneath it) it is best to use noop” and let the virtualization layer take care of the IO scheduling itself.

Read-Ahead

Read-ahead is a per-block device performance tuning in Linux that causes data ahead of a requested block on disk to be read and then cached into the filesystem cache. Read-ahead assumes that there is a sequential read pattern and something will benefit from those extra blocks being cached. MongoDB tends to have very random disk patterns and often does not benefit from the default read-ahead setting, wasting memory that could be used for more hot data. Most Linux systems have a default setting of 128KB/256 sectors (128KB = 256 x 512-byte sectors). This means if MongoDB fetches a 64kb document from disk, 128kb of filesystem cache is used and maybe the extra 64kb is never accessed later, wasting memory.

For this setting, we suggest a starting-point of 32 sectors (=16KB) for most MongoDB workloads. From there you can test increasing/reducing this setting and then monitor a combination of query performance, cached memory usage and disk read activity to find a better balance. You should aim to use as little cached memory as possible without dropping the query performance or causing significant disk activity.

Both the IO scheduler and read-ahead can be changed by adding a file to the udev configuration at /etc/udev/rules.d”. In this example I am assuming the block device serving mongo data is named /dev/sda” and I am setting “deadline” as the IO scheduler and 16kb/32-sectors as read-ahead:

# set deadline scheduler and 16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

To check the IO scheduler was applied ([square-brackets] = enabled scheduler):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

To check the current read-ahead setting:

$ sudo blockdev --getra /dev/sda
32

Note: this change should be applied and tested with a full system reboot!

Filesystem and Options

It is recommended that MongoDB uses only the ext4 or XFS filesystems for on-disk database data. ext3 should be avoided due to its poor pre-allocation performance. If you’re using WiredTiger (MongoDB 3.0+) as a storage engine, it is strongly advised that you ONLY use XFS due to serious stability issues on ext4.

Each time you read a file, the filesystems perform an access-time metadata update by default. However, MongoDB (and most applications) does not use this access-time information. This means you can disable access-time updates on MongoDB’s data volume. A small amount of disk IO activity that the access-time updates cause stops in this case.

You can disable access-time updates by adding the flag noatime” to the filesystem options field in the file /etc/fstab” (4th field) for the disk serving MongoDB data:

/dev/mapper/data-mongodb /var/lib/mongo        ext4        defaults,noatime    0 0

Use noatime” to verify the volume is currently mounted:

$ grep "/var/lib/mongo" /proc/mounts
/dev/mapper/data-mongodb /var/lib/mongo ext4 rw,seclabel,noatime,data=ordered 0 0

Note: to apply a filesystem-options change, you must remount (umount + mount) the volume again after stopping MongoDB, or reboot.

Network Stack

Several defaults of the Linux kernel network tunings are either not optimal for MongoDB, limit a typical host with 1000mbps network interfaces (or better) or cause unpredictable behavior with routers and load balancers. We suggest some increases to the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and a decrease in keepalive settings, seen below.

Make these changes permanent by adding the following to /etc/sysctl.conf” (or a new file /etc/sysctl.d/mongodb-sysctl.conf – if /etc/sysctl.d exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

To check the current values of any of these settings:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply this change!

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB is a cluster, it relies on a consistent time across nodes. Thus the NTP Daemon should run permanently on all MongoDB hosts, mongos and arbiters included. Be sure to check the time syncing won’t fight with any guest-based virtualization tools like “VMWare tools” and “VirtualBox Guest Additions”.

This is installed on RedHat/CentOS with:

$ sudo yum install ntp

And on Debian/Ubuntu:

$ sudo apt-get install ntp

Note: Start and enable the NTP Daemon (for starting on reboots) after installation. The commands to do this vary by OS and OS version, so please consult your documentation.

Security-Enhanced Linux (SELinux)

Security-Enhanced Linux is a kernel-level security access control module that has an unfortunate tendency to be disabled or set to warn-only on Linux deployments. As SELinux is a strict access control system, sometimes it can cause unexpected errors (permission denied, etc.) with applications that were not configured properly for SELinux. Often people disable SELinux to resolve the issue and forget about it entirely. While implementing SELinux is not an end-all solution, it massively reduces the local attack surface of the server. We recommend deploying MongoDB with SELinus Enforcing” mode on.

The modes of SELinux are:

  1. Enforcing – Block and log policy violations.
  2. Permissive – Log policy violations only.
  3. Disabled – Completely disabled.

As database servers are usually dedicated to one purpose, such as running MongoDB, the work of setting up SELinux is a lot simpler than a multi-use server with many processes and users (such as an application/web server, etc.). The OS access pattern of a database server should be extremely predictable. Introducing Enforcing” mode at the very beginning of your testing/installation instead of after-the-fact avoids a lot of gotchas with SELinux. Logging for SELinux is directed to /var/log/audit/audit.log” and the configuration is at /etc/selinux”.

Luckily, Percona Server for MongoDB RPM packages (CentOS/RedHat) are SELinux “Enforcing” mode compatible as they install/enable an SELinux policy at RPM install time! Debian/Ubuntu SELinux support is still in planning.

Here you can see the SELinux policy shipped in the Percona Server for MongoDB version 3.2 server package:

$ rpm -ql Percona-Server-MongoDB-32-server | grep selinux
/etc/selinux/targeted/modules/active/modules/mongod.pp

To change the SELinux mode to Enforcing”:

$ sudo setenforce Enforcing

To check the running SELinux mode:

$ sudo getenforce
Enforcing

Linux Kernel and Glibc Version

The version of the Linux kernel and Glibc itself may be more important than you think. Some community benchmarks show a significant improvement on OLTP throughput benchmarks with the recent Linux 3.x kernels versus the 2.6 still widely deployed. To avoid serious bugs, MongoDB should at minimum use Linux 2.6.36 and Glibc 2.13 or newer.

I hope to create a follow-up post on the specific differences seen under MongoDB workloads on Linux 3.2+ versus 2.6. Until then, I recommend you test the difference using your own workloads and any results/feedback are appreciated.

What’s Next?

What’s the next thing to tune? At this point, tuning becomes case-by-case and open-ended. I appreciate any comments on use-case/tunings pairings that worked for you. Also, look out for follow-ups to this article for a few tunings I excluded due to lack of testing.

Not knowing the next step might mean you’re done tuning, or that you need more visibility into your stack to find the next bottleneck. Good monitoring and data visibility are invaluable for this type of investigation. Look out for future posts regarding monitoring your MongoDB (or MySQL) deployment and consider using Percona Monitoring and Management as an all-in-one monitoring solution. You could also try using Percona-Lab/prometheus_mongodb_exporterprometheus/node_exporter and Percona-Lab/grafana_mongodb_dashboards for monitoring MongoDB/Linux with Prometheus and Grafana.

The road to an efficient database stack requires patience, analysis and iteration. Tomorrow a new hardware architecture or change in kernel behavior could come, be the first to spot the next bottleneck! Happy hunting.

Aug
09
2016
--

Mirantis and SUSE team up to give OpenStack users new support options

IMG_20160809_103941 Mirantis, which specializes in offering software, support and training for running OpenStack, today announced that it is partnering with Germany-based SUSE, best known for its Linux distribution, to offer its customers support for SUSE’s enterprise Linux offering. The two companies also said that they will work on making SUSE Linux Enterprise Server a development platform for use… Read More

Jul
06
2016
--

Canonical-Pivotal partnership makes Ubuntu preferred Linux distro for Cloud Foundry

Pivotal offices. Pivotal, developers of the Cloud Foundry open source cloud development platform and Canonical, the company behind the popular Ubuntu Linux distribution, announced a partnership today where Ubuntu becomes the preferred operating system for Cloud Foundry.
In fact, the two companies have been BFFs since the earliest days of Cloud Foundry when it was an open source project developed at VMware. Read More

Jul
05
2016
--

LzLabs launches product to move mainframe COBOL code to Linux cloud

Black white photo of man sitting in front of a mainframe computer in the 1960s Somewhere in a world full of advanced technology that we write about regularly here on TechCrunch, there exists an ancient realm where mainframe computers are still running programs written in COBOL.
This is a programming language, mind you, that was developed in the late 1950s, and used widely in the ’60s and ’70s and even into the ’80s, but it’s never really gone away. Read More

Jun
21
2016
--

As Red Hat aims for $5 billion in revenue, Linux won’t be only driver

Red Hat race car with #1 painted on top. Last year Red Hat, which has been mostly known for selling Linux in the enterprise became the first $2 billion open source company. Now it wants to be the first to $5 billion, but it might not be just Linux that gets it there. A couple of years ago Red Hat CEO Jim Whitehurst recognized, even in the face of rising revenue, that the company couldn’t continue growing forever… Read More

Jun
20
2016
--

Running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host

Percona XtraDB Cluster nodes with Linux Network namespaces

Percona XtraDB Cluster nodes with Linux Network namespacesThis post is a continuance of my Docker series, and examines Running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host.

In this blog I want to look into a lower-level building block: Linux Network Namespace.

The same as with cgroups, Docker uses Linux Network Namespace for resource isolation. I was looking into cgroup a year ago, and now I want to understand more about Network Namespace.

The goal is to both understand a bit more about Docker internals, and to see how we can provide network isolation for different processes within the same host. You might need to isolate process when running several MySQL or MongoDB instances on the same server (which might come in handy during testing). In this case, I needed to test ProxySQL without Docker.

We can always use different ports for different MySQL instances (such as 3306, 3307, 3308), but it quickly gets complicated.

We could also use IP address aliases for an existing network interface, and use bind=<IP.ADD.RE.SS> for each instance. But since Percona XtraDB Cluster can use three different IP ports and network channels for communications, this also quickly gets complicated.

Linux Network Namespace provides greater network isolation for resources so that it can be a better fit for Percona XtraDB Cluster nodes. Now, setting up Network namespaces in and of itself can be confusing; my recommendation is if you can use Docker, use Docker instead. It provides isolation on process ID and mount points, and takes care of all the script plumbing to create and destroy networks. As you will see in our scripts, we need to talk about directory location for datadirs.

Let’s create a network for Percona XtraDB Cluster with Network Namespaces.

I will try to do the following:

  • Start four nodes of Percona XtraDB Cluster
  • For each node, create separate network namespace so the nodes will be able to allocate network ports 3306, 4567, 4568 without conflicts
  • Assign the nodes IP addresses: 10.200.10.2-10.200.10.5
  • Create a “bridge interface” for the nodes to communicate, using IP address 10.200.10.1.

For reference, I took ideas from this post: Linux Switching – Interconnecting Namespaces

First, we must create the bridge interface on the host:

BRIDGE=br-pxc
brctl addbr $BRIDGE
brctl stp $BRIDGE off
ip addr add 10.200.10.1/24 dev $BRIDGE
ip link set dev $BRIDGE up

Next, we create four namespaces (one per Percona XtraDB Cluster node) using the following logic:

for i in 1 2 3 4
do
 ip netns add pxc_ns$i
 ip link add pxc-veth$i type veth peer name br-pxc-veth$i
 brctl addif $BRIDGE br-pxc-veth$i
 ip link set pxc-veth$i netns pxc_ns$i
 ip netns exec pxc_ns$i ip addr add 10.200.10.$((i+1))/24 dev pxc-veth$i
 ip netns exec pxc_ns$i ip link set dev pxc-veth$i up
 ip link set dev br-pxc-veth$i up
 ip netns exec pxc_ns$i ip link set lo up
 ip netns exec pxc_ns$i ip route add default via 10.200.10.1
done

We see the following interfaces on the host:

1153: br-pxc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
 link/ether 32:32:4c:36:22:87 brd ff:ff:ff:ff:ff:ff
 inet 10.200.10.1/24 scope global br-pxc
 valid_lft forever preferred_lft forever
 inet6 fe80::2ccd:6ff:fe04:c7d5/64 scope link
 valid_lft forever preferred_lft forever
1154: br-pxc-veth1@if1155: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether c6:28:2d:23:3b:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 8
 inet6 fe80::c428:2dff:fe23:3ba4/64 scope link
 valid_lft forever preferred_lft forever
1156: br-pxc-veth2@if1157: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether 32:32:4c:36:22:87 brd ff:ff:ff:ff:ff:ff link-netnsid 12
 inet6 fe80::3032:4cff:fe36:2287/64 scope link
 valid_lft forever preferred_lft forever
1158: br-pxc-veth3@if1159: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether 8a:3a:c1:e0:8a:67 brd ff:ff:ff:ff:ff:ff link-netnsid 13
 inet6 fe80::883a:c1ff:fee0:8a67/64 scope link
 valid_lft forever preferred_lft forever
1160: br-pxc-veth4@if1161: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-pxc state UP qlen 1000
 link/ether aa:56:7f:41:1d:c3 brd ff:ff:ff:ff:ff:ff link-netnsid 11
 inet6 fe80::a856:7fff:fe41:1dc3/64 scope link
 valid_lft forever preferred_lft forever

We also see the following network namespaces:

# ip netns
pxc_ns4 (id: 11)
pxc_ns3 (id: 13)
pxc_ns2 (id: 12)
pxc_ns1 (id: 8)

After that, we can check the namespace IP address:

# ip netns exec pxc_ns3 bash
# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 inet 127.0.0.1/8 scope host lo
 valid_lft forever preferred_lft forever
 inet6 ::1/128 scope host
 valid_lft forever preferred_lft forever
1159: pxc-veth3@if1158: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
 link/ether 4a:ad:be:6a:aa:c6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
 inet 10.200.10.4/24 scope global pxc-veth3
 valid_lft forever preferred_lft forever
 inet6 fe80::48ad:beff:fe6a:aac6/64 scope link
 valid_lft forever preferred_lft forever

To enable communication from inside the network namespace to the external world, we should add some iptables rules, e.g.:

iptables -t nat -A POSTROUTING -s 10.200.10.0/255.255.255.0 -o enp2s0f0 -j MASQUERADE
iptables -A FORWARD -i enp2s0f0 -o $BRIDGE -j ACCEPT
iptables -A FORWARD -o enp2s0f0 -i $BRIDGE -j ACCEPT

where enp2s0f0 is an interface that has an external IP address (by some reason modern Linux distros decided to use names like enp2s0f0 for network interfaces, instead old good "eth0").

To start a node (or mysqld instance) inside a network namespace, we should use

ip netns exec prefix

 for commands.

For example to start Percona XtraDB Cluster first node, in the namespace pxc_ns1, with IP address 10.200.10.2, we use:

ip netns exec pxc_ns1 mysqld --defaults-file=node.cnf --datadir=/data/datadir/node1 --socket=/tmp/node1_mysql.sock --user=root --wsrep_cluster_name=cluster1

To start following nodes:

NODE=2 ip netns exec pxc_ns${NODE} mysqld --defaults-file=node${NODE}.cnf --datadir=/data/datadir/node${NODE} --socket=/tmp/node${NODE}_mysql.sock --user=root --wsrep_cluster_address="gcomm://10.200.10.2" --wsrep_cluster_name=cluster1  
NODE=3 ip netns exec pxc_ns${NODE} mysqld --defaults-file=node${NODE}.cnf --datadir=/data/datadir/node${NODE} --socket=/tmp/node${NODE}_mysql.sock --user=root --wsrep_cluster_address="gcomm://10.200.10.2" --wsrep_cluster_name=cluster1  
etc

As the result of this procedure, we have four Percona XtraDB Cluster nodes running in an individual network namespace, not worrying about IP address and ports conflicts. We also allocated a dedicated IP range for our cluster.

This procedure isn’t trivial, but it is easy to script. I also think provides a good understanding what Docker, LXC or other containerization technologies do behind the scenes with networks.

 

Mar
30
2016
--

Docker MySQL Replication 101

Docker

Precona Server DockerIn this blog post, we’ll discuss some of the basics regarding Docker MySQL replication. Docker has gained widespread popularity in recent years as a lightweight alternative to virtualization. It is ideal for building virtual development and testing environments. The solution is flexible and seamlessly integrates with popular CI tools.

 

This post walks through the setup of MySQL replication with Docker using Percona Server 5.6 images. To keep things simple we’ll configure a pair of instances and override only the most important variables for replication. You can add whatever other variables you want to override in the configuration files for each instance.

Note: the configuration described here is suitable for development or testing. We’ve also used the operating system repository packages; for the latest version use the official Docker images. The steps described can be used to setup more slaves if required, as long as each slave has a different server-id.

First, install Docker and pull the Percona images (this will take some time and is only executed once):

# Docker install for Debian / Ubuntu
apt-get install docker.io
# Docker install for Red Hat / CentOS (requires EPEL repo)
yum install epel-release # If not installed already
yum install docker-io
# Pull docker repos
docker pull percona

Now create locally persisted directories for the:

  1. Instance configuration
  2. Data files
# Create local data directories
mkdir -p /opt/Docker/masterdb/data /opt/Docker/slavedb/data
# Create local my.cnf directories
mkdir -p /opt/Docker/masterdb/cnf /opt/Docker/slavedb/cnf
### Create configuration files for master and slave
vi /opt/Docker/masterdb/cnf/config-file.cnf
# Config Settings:
[mysqld]
server-id=1
binlog_format=ROW
log-bin
vi /opt/Docker/slavedb/cnf/config-file.cnf
# Config Settings:
[mysqld]
server-id=2

Great, now we’re ready start our instances and configure replication. Launch the master node, configure the replication user and get the initial replication co-ordinates:

# Launch master instance
docker run --name masterdb -v /opt/Docker/masterdb/cnf:/etc/mysql/conf.d -v /opt/Docker/masterdb/data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=mysecretpass -d percona:5.6
00a0231fb689d27afad2753e4350192bebc19ab4ff733c07da9c20ca4169759e
# Create replication user
docker exec -ti masterdb 'mysql' -uroot -pmysecretpass -vvv -e"GRANT REPLICATION SLAVE ON *.* TO repl@'%' IDENTIFIED BY 'slavepass'G"
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
GRANT REPLICATION SLAVE ON *.* TO repl@"%"
--------------
Query OK, 0 rows affected (0.02 sec)
Bye
### Get master status
docker exec -ti masterdb 'mysql' -uroot -pmysecretpass -e"SHOW MASTER STATUSG"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
             File: mysqld-bin.000004
         Position: 310
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set:

If you look carefully at the “docker run” command for masterdb, you’ll notice we’ve defined two paths to share from local storage:

/opt/Docker/masterdb/data:/var/lib/mysql

  • This maps the local “/opt/Docker/masterdb/data” to the masterdb’s container’s “/var/lib/mysql path”
  • All files within the datadir “/var/lib/mysql” persist locally on the host running docker rather than in the container
/opt/Docker/masterdb/cnf:/etc/mysql/conf.d

  • This maps the local “/opt/Docker/masterdb/cnf” directory to the container’s “/etc/mysql/conf.d” path
  • The configuration files for the masterdb instance persist locally as well
  • Remember these files augment or override the file in “/etc/mysql/my.cnf” within the container (i.e., defaults will be used for all other variables)

We’re done setting up the master, so let’s continue with the slave instance. For this instance the “docker run” command also includes the “–link masterdb:mysql” command, which links the slave instance to the master instance for replication.

After starting the instance, set the replication co-ordinates captured in the previous step:

docker run --name slavedb -d -v /opt/Docker/slavedb/cnf:/etc/mysql/conf.d -v /opt/Docker/slavedb/data:/var/lib/mysql --link masterdb:mysql -e MYSQL_ROOT_PASSWORD=mysecretpass -d percona:5.6
eb7141121300c104ccee0b2df018e33d4f7f10bf5d98445ed4a54e1316f41891
docker exec -ti slavedb 'mysql' -uroot -pmysecretpass -e'change master to master_host="mysql",master_user="repl",master_password="slavepass",master_log_file="mysqld-bin.000004",master_log_pos=310;"' -vvv
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
change master to master_host="mysql",master_user="repl",master_password="slavepass",master_log_file="mysqld-bin.000004",master_log_pos=310
--------------
Query OK, 0 rows affected, 2 warnings (0.23 sec)
Bye

Almost ready to go! The last step is to start replication and verify that replication running:

# Start replication
docker exec -ti slavedb 'mysql' -uroot -pmysecretpass -e"START SLAVE;" -vvv
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
START SLAVE
--------------
Query OK, 0 rows affected, 1 warning (0.00 sec)
Bye
# Verify replication is running OK
docker exec -ti slavedb 'mysql' -uroot -pmysecretpass -e"SHOW SLAVE STATUSG" -vvv
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
SHOW SLAVE STATUS
--------------
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mysql
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000004
          Read_Master_Log_Pos: 310
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 284
        Relay_Master_Log_File: mysqld-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 310
              Relay_Log_Space: 458
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 230d005a-f1a6-11e5-b546-0242ac110004
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)
Bye

Finally, we have a pair of dockerized Percona Server 5.6 master-slave servers replicating!

As mentioned before, this is suitable for a development or testing environment. Before going into production with this configuration, think carefully about the tuning of the “my.cnf” variables and the choice of disks used for the data/binlog directories. It is important to remember that newer versions of Docker recommend using “networks” rather than “linking” for communication between containers.

Feb
17
2016
--

Microsoft Brings Red Hat Enterprise Linux To Azure

MSB10_ServIT_001 Microsoft is now selling Red Hat Enterprise Linux licenses. Starting today, you will be able to deploy Red Hat Linux Enterprise (RHLE) from the Azure Marketplace and get support for your deployments from both Microsoft and Red Hat. In addition, Microsoft today announced that it is now offering certified Bitnami images in the Azure Marketplace and it now supports Walmart‘s (yes —… Read More

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com