OpenStack adds the StarlingX edge computing stack to its top-level projects

The OpenStack Foundation today announced that StarlingX, a container-based system for running edge deployments, is now a top-level project. With this, it joins the main OpenStack private and public cloud infrastructure project, the Airship lifecycle management system, Kata Containers and the Zuul CI/CD platform.

What makes StarlingX a bit different from some of these other projects is that it is a full stack for edge deployments — and in that respect, it’s maybe more akin to OpenStack than the other projects in the foundation’s stable. It uses open-source components from the Ceph storage platform, the KVM virtualization solution, Kubernetes and, of course, OpenStack and Linux. The promise here is that StarlingX can provide users with an easy way to deploy container and VM workloads to the edge, all while being scalable, lightweight and providing low-latency access to the services hosted on the platform.

Early StarlingX adopters include China UnionPay, China Unicom and T-Systems. The original codebase was contributed to the foundation by Intel and Wind River System in 2018. Since then, the project has seen 7,108 commits from 211 authors.

“The StarlingX community has made great progress in the last two years, not only in building great open source software but also in building a productive and diverse community of contributors,” said Ildiko Vancsa, ecosystem technical lead at the OpenStack Foundation. “The core platform for low-latency and high-performance applications has been enhanced with a container-based, distributed cloud architecture, secure booting, TPM device enablement, certificate management and container isolation. StarlingX 4.0, slated for release later this year, will feature enhancements such as support for Kata Containers as a container runtime, integration of the Ussuri version of OpenStack, and containerization of the remaining platform services.”

It’s worth remembering that the OpenStack Foundation has gone through a few changes in recent years. The most important of these is that it is now taking on other open-source infrastructure projects that are not part of the core OpenStack project but are strategically aligned with the organization’s mission. The first of these to graduate out of the pilot project phase and become top-level projects were Kata Containers and Zuul in April 2019, with Airship joining them in October.

Currently, the only pilot project for the OpenStack Foundation is its OpenInfra Labs project, a community of commercial vendors and academic institutions, including the likes of Boston University, Harvard, MIT, Intel and Red Hat, that are looking at how to better test open-source code in production-like environments.



Red Hat acquires hybrid cloud data management service NooBaa

Red Hat is in the process of being acquired by IBM for a massive $34 billion, but that deal hasn’t closed yet and, in the meantime, Red Hat is still running independently and making its own acquisitions, too. As the company today announced, it has acquired Tel Aviv-based NooBaa, an early-stage startup that helps enterprises manage their data more easily and access their various data providers through a single API.

NooBaa’s technology makes it a good fit for Red Hat, which has recently emphasized its ability to help enterprise more effectively manage their hybrid and multicloud deployments. At its core, NooBaa is all about bringing together various data silos, which should make it a good fit in Red Hat’s portfolio. With OpenShift and the OpenShift Container Platform, as well as its Ceph Storage service, Red Hat already offers a range of hybrid cloud tools, after all.

“NooBaa’s technologies will augment our portfolio and strengthen our ability to meet the needs of developers in today’s hybrid and multicloud world,” writes Ranga Rangachari, the VP and general manager for storage and hyperconverged infrastructure at Red Hat, in today’s announcement. “We are thrilled to welcome a technical team of nine to the Red Hat family as we work together to further solidify Red Hat as a leading provider of open hybrid cloud technologies.”

While virtually all of Red Hat’s technology is open source, NooBaa’s code is not. The company says that it plans to open source NooBaa’s technology in due time, though the exact timeline has yet to be determined.

NooBaa was founded in 2013. The company has raised some venture funding from the likes of Jerusalem Venture Partners and OurCrowd, with a strategic investment from Akamai Capital thrown in for good measure. The company never disclosed the size of that round, though, and neither Red Hat nor NooBaa are disclosing the financial terms of the acquisition.


The Ceph storage project gets a dedicated open-source foundation

Ceph is an open source technology for distributed storage that gets very little public attention but that provides the underlying storage services for many of the world’s largest container and OpenStack deployments. It’s used by financial institutions like Bloomberg and Fidelity, cloud service providers like Rackspace and Linode, telcos like Deutsche Telekom, car manufacturers like BMW and software firms like SAP and Salesforce.

These days, you can’t have a successful open source project without setting up a foundation that manages the many diverging interests of the community and so it’s maybe no surprise that Ceph is now getting its own foundation. Like so many other projects, the Ceph Foundation will be hosted by the Linux Foundation.

“While early public cloud providers popularized self-service storage infrastructure, Ceph brings the same set of capabilities to service providers, enterprises, and individuals alike, with the power of a robust development and user community to drive future innovation in the storage space,” writes Sage Weil, Ceph co-creator, project leader, and chief architect at Red Hat for Ceph. “Today’s launch of the Ceph Foundation is a testament to the strength of a diverse open source community coming together to address the explosive growth in data storage and services.”

Given its broad adoption, it’s also no surprise that there’s a wide-ranging list of founding members. These include Amihan Global, Canonical, CERN, China Mobile, Digital Ocean, Intel, ProphetStor Data Service, OVH Hosting Red Hat, SoftIron, SUSE, Western Digital, XSKY Data Technology and ZTE. It’s worth noting that many of these founding members were already part of the slightly less formal Ceph Community Advisory Board.

“Ceph has a long track record of success what it comes to helping organizations with effectively managing high growth and expand data storage demands,” said Jim Zemlin, the executive director of the Linux Foundation. “Under the Linux Foundation, the Ceph Foundation will be able to harness investments from a much broader group to help support the infrastructure needed to continue the success and stability of the Ceph ecosystem.”

cepha and linux foundation logo

Ceph is an important building block for vendors who build both OpenStack- and container-based platforms. Indeed, two-thirds of OpenStack users rely on Ceph and it’s a core part of Rook, a Cloud Native Computing Foundation project that makes it easier to build storage services for Kubernetes-based applications. As such, Ceph straddles many different worlds and it makes sense for the project to gets its own neutral foundation now, though I can’t help but think that the OpenStack Foundation would’ve also liked to host the project.

Today’s announcement comes only days after the Linux Foundation also announced that it is now hosting the GraphQL Foundation.


Percona XtraDB Cluster on Ceph


CephThis post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main motivations beyond the reluctance of using Percona XtraDB cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

  1. New node starts (joiner) and unmounts its current MySQL datadir
  2. The joiner and asks for an SST
  3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position
  4. The donor sends to the joiner the name of the snapshot to use
  5. The joiner creates a clone of the snapshot name provided by the donor
  6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership
  7. The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but hide some challenges for an SST method base on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges – and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept opened in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges as we will see below.

SST script

So, let’s start with the SST script. The script is available in my Github at:


You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool
What mount point to use. It defaults to the MySQL datadir as provided to the SST script.
The options used to mount the filesystem. The default value is: rw,noatime
The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring
Whether or not the script should cleanup the snapshots and clones that are no longer is used. Enable = 1, Disable = 0. The default value is: 0
Root privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the follow line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function


 , which sets the current working directory to the datadir (an empty directory at that point).  After the SST, a block device is mounted on the datadir. The issue is that MySQL tries to find the files in the empty mount point directory. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 90760ba..bd9fa38 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
+  /*
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
   if (opt_bin_log)

With this patch, I added a new


 call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the


 Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:


Of course, if you have other provider options, don’t forget to add them there.


So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main'
apt-get update
apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host =,,
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=,odroid2=,serveur-famille=}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

Which gives the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool.  We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC

The commands return the local RBD block device that corresponds to the RBD image. The other steps are not specific to RBD images, we need to create a filesystem and prepare the mount points.

The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql.mysql /var/lib/mysql
mysql_install_db --datadir=/var/lib/mysql --user=mysql
mkdir /var/lib/galera
chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the


 tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific


 settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster node:


At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST take about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the exact same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC2-1463776424 -    /dev/rbd1
root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots that are mapped to the clones mounted by other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB


Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is the inherent data duplication over the network removes the need for local data replication. Thus, instead of using raid-10 or raid-5 with an array of disks, we could use a simple raid-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following commands:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
PXC2-1463776424      10240M 164M

Also, nothing prevents you using a different configuration of Ceph pools in the same XtraDB cluster. Therefore a Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones that are stored data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross node network latency.

Using Ceph for XtraDB Cluster SST operation demonstrates one of the array of possibilities offered to MySQL by Ceph. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!



 script isn’t officially supported by Percona.


Upcoming Webinar August 2 10:00 am PDT: MySQL and Ceph

MySQL and Ceph

MySQL and CephJoin Brent Compton, Kyle Bader and Yves Trudeau on August 2, 2016 at 10 am PDT (UTC-7) for a MySQL and Ceph webinar.

Many operators select OpenStack as their control plane of choice for providing both internal and external IT services. The OpenStack user survey repeatedly shows Ceph as the dominant backend for providing persistent storage volumes through OpenStack Cinder. When building applications and repatriating old workloads, developers are discovering the need to provide OpenStack infrastructure database services. Given MySQL’s ubiquity, and it’s reliance on persistent storage, it is of utmost importance to understand how to achieve the performance demanded by today’s applications. Databases like MySQL can be incredibly IO intensive, and Ceph offers a great opportunity to go beyond the limitations presented by a single scale-up system. Since Ceph provides a mutable object store with atomic operations, could MySQL store InnoDB pages directly in Ceph?

This talk reviews the general architecture of Ceph, and then discusses benchmark results from small to mid-size Ceph clusters. These benchmarks lead to the development of prescriptive guidance around tuning Ceph storage nodes (OSDs), the impact the amount of physical memory, and the presence of SSDs, high-speed networks or RAID controllers.

MySQL and Ceph
Brent Compton
Director Storage Solution Architectures, Red Hat
Brent Compton is Director Storage Solution Architectures at Red Hat. He leads the team responsible for building Ceph and Gluster storage reference architectures with Red Hat Storage partners. Before Red Hat, Brent was responsible for emerging non-volatile memory software technologies at Fusion-io. Previous enterprise software leadership roles include VP Product Management at Micromuse (now IBM Tivoli Netcool) and Product Marketing Director within HP’s OpenView software division. Brent also served as Director Middleware Development Platforms at the LDS Church and as CIO at Joint Commission International. Brent has a tight-knit family, and can be found on skis or a mountain bike whenever possible.
MySQL and Ceph
Kyle Bader
Sr Solution Architect, Red Hat
Kyle Bader, a Red Hat senior architect, provides expertise in the design and operation of petabyte-scale storage systems using Ceph. He joined Red Hat as part of the 2014 Inktank acquisition. As a senior systems engineer at DreamHost, he helped implement, operate, and design Ceph and OpenStack-based systems for DreamCompute and DreamObjects cloud products.
MySQL and Ceph
Yves Trudeau
Principal Architect
Yves is a Principal Consultant at Percona, specializing in MySQL High-Availability and scaling solutions. Before joining Percona in 2009, he worked as a senior consultant for MySQL AB and Sun Microsystems, assisting customers across North America with NDB Cluster and Heartbeat/DRBD technologies. Yves holds a Ph.D. in Experimental Physics from Université de Sherbrooke. He lives in Québec, Canada with his wife and three daughters.

Using Ceph with MySQL


CephOver the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare disks host configuration to LVM-managed disks configuration.

Most of the work I’ve done with Ceph was in collaboration with folks from RedHat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the RedHat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before I starting with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.

Introduction to Ceph

Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. RedHat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.

A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement but only up to a certain point.  There are minimally two types of nodes with Ceph, monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.

Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the “crushmap”) in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.

With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.

Ceph stores data in a logical container call a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster, if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor (“size” in Ceph terminology).

The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes.  As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.

So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different and better (in my opinion) than other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly like if it is a regular file. That update capability allows for much fancier usage of the object store – for things like the support of block devices, rbd devices, and even a network file systems, cephfs.

When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.

The benefits of using Ceph

In a world striving for virtualization and containers, Ceph gives easily moves database resources between hosts.

IO scalability
On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.

High availability
Ceph replicates data at the storage level, and provides resiliency to storage node crash.  A kind of DRBD on steroids…

Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.

Thin provisioning
You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.

The caveats of using Ceph

Of course, nothing is free. Ceph use comes with some caveats.

Ceph reaction to a missing OSD
If an OSD goes down, the Ceph cluster starts copying data with fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph with a nearly full storage, you must have enough disk space to handle the loss of one node.

The “no out” OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the “no out” attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.

Every day and every week (deep), Ceph scrubs operations that, although they are throttled, can still impact performance. You can modify the interval and the hours that control the scrub action. Once per day and once per week are likely fine. But you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off hours. Also, scrubbing throttles itself to not put too much load on the nodes. The osd_scrub_load_threshold variable sets the threshold.

Ceph has many parameters so that tuning Ceph can be complex and confusing. Since distributed systems push hardware, properly tuning Ceph might require things like distributing interrupt load among cores and thread core pinning, handling of Numa zones – especially if you use high-speed NVMe devices.


Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.

Finally, a note for the ones who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a two TB portable USB-3 hard disk, a 16 GB EMMC flash disk and a gigabit Ethernet port.

I won’t claim record breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!





Data in the Cloud track at Percona Live with Brent Compton and Ross Turk: The Data Performance Cloud

12 Days Until Percona Live

Data in the Cloud Track at Percona LiveIn this blog, we’ll discuss the Data in the Cloud track at Percona Live with Red Hat’s Brent Compton and Ross Turk.

Welcome to another interview with the Percona Live Data Performance Conference speakers and presenters. This series of blog posts will highlight some of the talks and presentations available at Percona Live Data Performance Conference April 18-21 in Santa Clara. Read through to the end for a discounts for Percona Live registration.

(A webinar sneak preview of their “MySQL on Ceph” cloud storage talk is happening on Wednesday, April 6th at 2 pm EDT. You can register for it here – all attendees will receive a special $200 discount code for Percona Live registration after the webinar! See the end of this blog post for more details!)

First, we need to establish some context. Data storage has traditionally, and for most of its existence, pretty much followed a consistent model: stable and fairly static big box devices that were purpose-built to house data. Needing more storage space meant obtaining more (or bigger) boxes. Classic scale-up storage. Need more, go to the data storage vendor and order a bigger box.

The problem is that data is exploding, and has been exponentially for the last decade. Some estimates put the amount of data being generated worldwide increasing at a rate of 40%-60% per year. That kind of increase, and at that speed, doesn’t leave a lot of ramp up time to make long term big box hardware investments. Things are changing too fast.

The immediate trend – evident by declining revenues of class storage boxes – is placing data in a cloud of scale-out storage. What is the cloud? Since that question has whole books devoted to it, let’s try to simplify it a bit.

Cloud computing benefits include scalability, instantaneous configuration, virtualized consumables and the ability to quickly expand base specifications. Moving workloads to the cloud brings with it numerous business benefits, including agility, focus and cost:

  • Agility. The cloud enables businesses to react to changing needs. As the workload grows or spikes, just add compute cycles, storage, and bandwidth with the click of a mouse.
  • Focus. Deploying workloads to the cloud enables companies to focus more resources on business-critical activities, rather than system administration.
  • Cost. Businesses can pay as they go for the services level they need. Planning and sinking money into long-term plans that may or may not pan out is not as big a problem.

When it comes to moving workloads into the cloud, the low throughput applications were the obvious first choice: email, non-critical business functions, team collaboration assistance. These generally are neither mission critical, nor require high levels of security. As applications driven services became more and more prevalent (think Netflix, Facebook, Instagram), more throughput intensive services were moved to the cloud – mainly for flexibility during service spikes and to accommodate increased users. But tried and true high-performance workloads like databases and other corporate kingdoms that have perceived higher security requirements have traditionally remained stuck in the old infrastructures that have served well – until now.

So what is this all leading to? Well, according to Brent and Ross, ALL data will eventually be going to the cloud, and the old models of storage infrastructure are falling by the wayside. Between the lack of elasticity and scalability of purpose-built hardware, and the oncoming storage crisis, database storage is headed for cloud services solutions.

I had some time to talk with Brent and Ross about data in the cloud, and what we can expect regarding a new data performance cloud model.

Percona: There is always a lot of talk about public versus private paradigms when it comes to cloud discussions. To you, this is fairly inconsequential. How do see “the cloud?” How would you define it terms of infrastructure for workloads?

RHT: Red Hat has long provided software for hybrid clouds, with the understanding that most companies will use a mix of public cloud and private cloud infrastructure for their workloads. This means that Red Hat software is supported both on popular public cloud platforms (such as AWS, Azure, and GCE) as well as on-premise platforms (such as OpenStack private clouds). Our work with Percona in providing a reference architecture for MySQL running on Ceph is all about giving app developers a comparable, deterministic experience when running their MySQL-based apps on a Ceph private storage cloud v. running them in the public cloud.

Percona: So, your contention is that ALL data is headed to the cloud. What are the factors that are going ramp up this trend? What level of information storage will cement this as inevitable?

RHT:  We’d probably restate this to “most data is headed to A cloud.” Two distinctions being made in this statement. The first is “most” versus “all” data.  For years to come, there will be late adopters with on-premise data NOT being served through a private cloud infrastructure. The second distinction is “a” cloud versus “the” cloud.  “A” cloud means either a public cloud or a private cloud (or some hybrid of the two). Private clouds are being constructed by the world’s most advanced companies within their own data centers to provide a similar type of elastic infrastructure with dynamic provisioning and lower CAPEX/OPEX costs (as is found in public clouds).

Percona: What are the concerns you see with moving all workloads to the cloud, and how would you address those concerns?

RHT:  The distinctions laid out in the previous answer address this. For myriad reasons, some data and workloads will reside on-premise within private clouds for a very long time. In fact, as the technology matures for building private clouds (as we’re seeing with OpenStack and Ceph), and can offer many of the same benefits as public clouds, we see the market reaching an equilibrium of sorts. In this equilibrium many of the agility, flexibility, and cost benefits once available only through public cloud services will be matched by private cloud installations. This will re-base the public versus private cloud discussion to fewer, simpler trade-offs – such as which data must reside on-premises to meet an enterprise’s data governance and control requirements.

Percona: So you mentioned the “Data Performance Cloud”? How would you describe that that is, and how it affects enterprises?

RHT:  For many enterprises, data performance workloads have been the last category of workloads to move a cloud, whether public or private. Public cloud services, such as AWS Relational Database Service with Provisioned-IOPS storage, have illustrated improved data performance for many workloads once relegated to the cloud sidelines. Now, with guidelines in the reference architecture being produced by Percona and the Red Hat Ceph team, customers can achieve comparable data performance on their private Ceph storage clouds as they do with high-performance public cloud services.

Percona: What can people expect to get out of the Data in the Cloud track at Percona Live this year?

RHT: Architecture guidelines for building and optimizing MySQL databases on a Ceph private storage cloud.   These architectures will include public cloud benefits along with private cloud control and governance.

Want to find out more about MySQL, Ceph, and Data in the Cloud? Register for Percona Live Data Performance Conference 2016, and see Red Hat’s sponsored Data in the Cloud Keynote Panel: Cloudy with a chance of running out of disk space? Or Sunny times ahead? Use the code “FeaturedTalk” and receive $100 off the current registration price!

The Percona Live Data Performance Conference is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 18-21 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

MySQL and Ceph: Database-as-a-Service sneak preview

Businesses are familiar with running a Database-as-a-Service (DBaaS) in the public cloud. They enjoy the benefits of on-demand infrastructure for spinning-up lots of MySQL instances with predictable performance, without the headaches of managing them on specific, bare-metal highly available clusters.

This webinar lays the foundation for building a DBaaS on your own private cloud, enabled by Red Hat® Ceph Storage. Join senior architects from Red Hat and Percona for reference architecture tips and head-to-head performance results of MySQL on Ceph versus MySQL on AWS.

This is a sneak preview of the labs and talks to be given in April 2016 at the Percona Live Data
Performance Conference
. Attendees received a discount code for $200 off Percona Live registration!


  • Brent Compton, director, Storage Solution Architectures, Red Hat
  • Kyle Bader, senior solutions architect, Red Hat
  • Yves Trudeau, principal consultant, Percona

Join the live event:

Wednesday, April 6, 2016 | 2 p.m. ET | 11 a.m. PT

Time zone converter


Canonical Launches Ubuntu Advantage Storage Support Service For Ceph And OpenStack Swift

ubuntu_logo_wood_cropped Canonical may still be mostly known for its Ubuntu Linux distribution, but the company now also offers a number of (paid) services for enterprises, often with a focus on the OpenStack platform. At the OpenStack Summit in Vancouver, Canada, Canonical founder Mark Shuttleworth today introduced his company’s latest offering: Ubuntu Advantage Storage. Read More


Red Hat Buys Inktank For $175M In Cash To Beef Up Its Cloud Storage Offerings

11847079564_738048a775_b Red Hat, the open source software provider, is squaring up to Amazon in the storage market. It has just announced that it is buying Inktank, a developer of open-source storage systems, for $175 million in cash. Red Hat says it will combine Inktank’s primary product, Inktank Ceph Enterprise, with its own GlusterFS-based storage offering. Red Hat says the deal will make it into the largest… Read More

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com