Mirantis, one of the earliest players in the OpenStack ecosystem, today announced that it will end-of-life Mirantis OpenStack support in September 2019. The Mirantis Cloud Platform, which combines OpenStack with the Kubernetes container platform (or which could even be used to run Kubernetes separately), is going to take its place. While Mirantis is obviously not getting out of the OpenStack… Read More
HPE is in an interesting position. Now that it has shut down its Helion public cloud business, its main focus is on its private and managed cloud services, which center around the open source OpenStack cloud platform that’s pretty much the de facto standard for building private clouds now. HPE launched version 4.0 of its Helion OpenStack platform for enterprises this week. With… Read More
OpenStack, the open source cloud computing platform that allows enterprises to essentially run their own version of AWS in their data centers, was founded by NASA and Rackspace in 2010. Today, it’s being used by the likes of Comcast, PayPal, Volkswagen, CERN, AT&T, China Mobile and Her Majesty’s Revenue and Customs in the U.K. The OpenStack Foundation’s bi-annual… Read More
OpenStack, the massive open source project that helps enterprises run the equivalent of AWS in their own data centers, is launching the 14th major version of its software today. Newton, as this new version is called, shows how OpenStack has matured over the last few years. The focus this time is on making some of the core OpenStack services more scalable and resilient. In addition, though,… Read More
AppFormix helps enterprises, including the likes of Rackspace and its customers, monitor and optimize their OpenStack- and container-based clouds. The company today announced that it has also now added support for virtualized network functions (VNF) to its service. Traditionally, networking was the domain of highly specialized hardware, but increasingly, it’s commodity hardware and… Read More
Mirantis, which specializes in offering software, support and training for running OpenStack, today announced that it is partnering with Germany-based SUSE, best known for its Linux distribution, to offer its customers support for SUSE’s enterprise Linux offering. The two companies also said that they will work on making SUSE Linux Enterprise Server a development platform for use… Read More
OpenStack, the open source project that allows enterprises to run an AWS-like cloud computing service in their own data centers, added support for containers over the course of its last few releases. Running OpenStack itself on top of containers is a different problem, though. Even though CoreOS has done some work on running OpenStack in containers thanks to its oddly named… Read More
Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare-disk host configurations to LVM-managed disk configurations.
Most of the work I’ve done with Ceph was in collaboration with folks from RedHat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the RedHat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before I start with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.
Introduction to Ceph
Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. RedHat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.
A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement but only up to a certain point. Ceph has at least two types of nodes: monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless, so redundancy and quorum at the monitor level are important.
Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the “crushmap”) in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.
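To illustrate why three monitors is the practical minimum, here is a minimal sketch of majority-quorum arithmetic. The function names are mine and purely illustrative; the only assumption is that a strict majority of monitors must be reachable for the cluster map to stay available.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("need at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """How many monitors can be lost while a majority still survives."""
    return monitor_count - quorum_size(monitor_count)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} monitors: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With two monitors, losing either one breaks the majority, so two is no more resilient than one. Three is the smallest count that survives a single monitor failure.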
With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.
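The idea of proxy-free, hash-based placement can be sketched in a few lines. This is not the CRUSH algorithm: it is a deliberately simplified stand-in (a hash to pick a placement group, then rendezvous-style ranking to pick OSDs), and every name in it is hypothetical. It only shows that any client holding the same map computes the same location on its own.

```python
import hashlib


def _stable_hash(text: str) -> int:
    """Deterministic hash, identical on every client (unlike built-in hash())."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name to a placement group number."""
    return _stable_hash(object_name) % pg_num


def pg_to_osds(pg: int, osds: list[str], size: int) -> list[str]:
    """Pick `size` distinct OSDs for a placement group.

    Simplified stand-in for CRUSH: rank OSDs by a hash of (pg, osd)
    and keep the top `size` entries.
    """
    if size > len(osds):
        raise ValueError("replication size exceeds number of OSDs")
    ranked = sorted(osds, key=lambda osd: _stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:size]


def locate(object_name: str, pg_num: int, osds: list[str], size: int) -> list[str]:
    """Compute where an object lives using only the shared cluster map."""
    return pg_to_osds(object_to_pg(object_name, pg_num), osds, size)


if __name__ == "__main__":
    cluster_map = ["osd.0", "osd.1", "osd.2", "osd.3"]  # hypothetical map
    print(locate("mysql-ibdata1", 256, cluster_map, 3))
```

Because the computation depends only on the object name and the map, no lookup service sits in the data path. The real crushmap additionally accounts for the hierarchy (data centers, racks, servers) and device weights.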
Ceph stores data in a logical container called a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster with one OSD per node, if a pool is defined with 256 placement groups (pgs), then each OSD will hold about 64 pgs for that pool, before counting replicas. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor (“size” in Ceph terminology).
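The placement group arithmetic can be written out as a quick sketch. The function name is made up for illustration, and it assumes pgs are spread evenly across OSDs, which real clusters only approximate.

```python
def pg_copies_per_osd(pg_num: int, size: int, osd_count: int) -> float:
    """Average number of pg copies each OSD holds for one pool.

    Assumes a perfectly even distribution across OSDs.
    """
    if osd_count < 1 or size < 1 or pg_num < 1:
        raise ValueError("pg_num, size and osd_count must be positive")
    return pg_num * size / osd_count


if __name__ == "__main__":
    # Four OSDs, 256 pgs, a single copy of the data.
    print(pg_copies_per_osd(pg_num=256, size=1, osd_count=4))
    # Same pool with a replication factor of three.
    print(pg_copies_per_osd(pg_num=256, size=3, osd_count=4))
```

The 64 pgs per OSD figure corresponds to a size of one. With a replication factor of three, each OSD carries three times as many pg copies.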
The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes. As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.
So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different from, and in my opinion better than, other object stores. The underlying object access protocol, rados, can update an arbitrary number of bytes in an object at an arbitrary offset, exactly as if it were a regular file. That update capability allows for much fancier uses of the object store, such as block devices (rbd devices) and even a network file system (cephfs).
When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.
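To make the "concatenation of objects" idea concrete, here is a small sketch of how a byte offset on the block device maps onto fixed-size objects. It assumes the 4MB default object size and a plain linear layout; the function names are illustrative, and the real object naming scheme is not shown.

```python
DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # 4MB objects, the default mentioned above


def locate_byte(offset: int, object_size: int = DEFAULT_OBJECT_SIZE) -> tuple[int, int]:
    """Return (object index, offset inside that object) for a device offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // object_size, offset % object_size


def split_range(offset: int, length: int,
                object_size: int = DEFAULT_OBJECT_SIZE) -> list[tuple[int, int, int]]:
    """Split one IO into (object index, inner offset, byte count) pieces."""
    pieces = []
    position, remaining = offset, length
    while remaining > 0:
        index, inner = locate_byte(position, object_size)
        chunk = min(remaining, object_size - inner)  # stop at the object boundary
        pieces.append((index, inner, chunk))
        position += chunk
        remaining -= chunk
    return pieces


if __name__ == "__main__":
    # A 16 KB InnoDB page written at the 10 MB mark of the device.
    print(split_range(10 * 1024 * 1024, 16 * 1024))
```

Most small database IOs land inside a single object. Only IOs that straddle an object boundary touch two objects, which may live on different OSDs.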
The benefits of using Ceph
In a world striving for virtualization and containers, Ceph makes it easy to move database resources between hosts.
On a single host, you have access only to the IO capabilities of that host. With Ceph, you effectively combine the IO capabilities of all the hosts in parallel. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.
Ceph replicates data at the storage level, and provides resiliency to storage node crashes. A kind of DRBD on steroids…
Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.
You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.
The caveats of using Ceph
Of course, nothing is free. Ceph use comes with some caveats.
Ceph reaction to a missing OSD
If an OSD goes down, the Ceph cluster starts copying data with fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. It also means that you cannot run a Ceph cluster with nearly full storage: you must have enough free disk space to handle the loss of one node.
The “no out” OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the “no out” attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.
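The free-space requirement for surviving the loss of one node can be estimated with a back-of-the-envelope sketch. It assumes equally sized nodes and evenly spread data, and it ignores Ceph's own full-ratio safety margins, so treat the result as an upper bound rather than a target. The function names are hypothetical.

```python
def max_safe_fill(node_count: int) -> float:
    """Highest fill ratio at which the surviving nodes can absorb one lost node.

    Assumes identical nodes and evenly distributed data.
    """
    if node_count < 2:
        raise ValueError("need at least two nodes to survive a node loss")
    return (node_count - 1) / node_count


def survives_node_loss(node_count: int, fill_ratio: float) -> bool:
    """True if the data still fits after one node disappears."""
    return fill_ratio <= max_safe_fill(node_count)


if __name__ == "__main__":
    for nodes in (3, 4, 10):
        print(f"{nodes} nodes: keep usage below {max_safe_fill(nodes):.0%}")
```

Small clusters pay the highest price: the fewer nodes there are, the larger the share of raw capacity that has to stay free.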
Every day, and every week for deep scrubs, Ceph runs scrub operations that, although throttled, can still impact performance. You can modify the intervals and the hours that control the scrub action. Once per day and once per week are likely fine, but you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict scrubbing to off hours. Scrubbing also throttles itself so that it does not put too much load on the nodes; the osd_scrub_load_threshold variable sets the threshold.
Ceph has many parameters, so tuning it can be complex and confusing. Since distributed systems push hardware, properly tuning Ceph might require things like distributing interrupt load among cores, pinning threads to cores and handling NUMA zones, especially if you use high-speed NVMe devices.
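An off-hours window usually wraps around midnight, which is easy to get wrong. The sketch below shows one way to reason about a begin/end hour pair. The semantics are my assumption, not taken from the Ceph source: the window is taken as starting at the begin hour and stopping just before the end hour, wrapping past midnight when begin is greater than end, and covering the whole day when the two are equal. Check the documentation for your Ceph version before relying on it.

```python
def in_scrub_window(hour: int, begin_hour: int, end_hour: int) -> bool:
    """Return True if `hour` (0-23) falls inside the scrub window.

    Assumed semantics: [begin_hour, end_hour), wrapping past midnight
    when begin_hour >= end_hour.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if begin_hour < end_hour:
        return begin_hour <= hour < end_hour
    # Window wraps around midnight (or spans the whole day when equal).
    return hour >= begin_hour or hour < end_hour


if __name__ == "__main__":
    # Hypothetical off-hours window: 22:00 to 06:00.
    allowed = [h for h in range(24) if in_scrub_window(h, 22, 6)]
    print(allowed)
```

Listing the allowed hours for a candidate window is a quick sanity check before putting the values into the cluster configuration.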
Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.
Finally, a note for those who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a 2 TB portable USB-3 hard disk, a 16 GB eMMC flash disk and a gigabit Ethernet port.
I won’t claim record-breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!
Cray has always been associated with speed and power and its latest computing beast called the Cray Urika-GX system has been designed specifically for big data workloads. What’s more, it runs on OpenStack, the open source cloud platform and supports open source big data processing tools like Hadoop and Spark. Cray recognizes that the computing world had evolved since Seymour Cray… Read More
CoreOS today announced Stackanetes at the OpenStack Summit in Austin. Stackanetes (and yes, that name probably isn’t ideal) brings together OpenStack, the open source cloud computing platform that allows enterprises to run their own AWS-style cloud computing services in their private and public clouds, and Kubernetes, Google’s open source container management service. The… Read More