Nov
02
2018
--

Maintenance Windows in the Cloud

maintenance windows cloud

maintenance windows cloudRecently, I’ve been working with a customer to evaluate the different cloud solutions for MySQL. In this post I am going to focus on maintenance windows and requirements, and what the different cloud platforms offer.

Why is this important at all?

Maintenance windows are required so that the cloud provider can do the necessary updates, patches, and changes to our setup. But there are many questions like:

  • Is this going to impact our production traffic?
  • Is this going to cause any downtime?
  • How long does it take?
  • Any way to avoid it?

Let’s discuss the three most popular cloud provider: AWS, Google, Microsoft. These three each have a MySQL based database service where we can compare the maintenance settings.

AWS

When you create an instance you can define your maintenance window. It’s a 30 minutes block when AWS can update and restart your instances, but it might take more time, AWS does not guarantee the update will be done in 30 minutes. The are two different type of updates, Required and Available. 

If you defer a required update, you receive a notice from Amazon RDS indicating when the update will be performed. Other updates are marked as available, and these you can defer indefinitely.

It is even possible to disable auto upgrade for minor versions, and in that case you can decide when do you want to do the maintenance.

AWS separate OS updates and database engine updates.

OS Updates

It requires some downtime, but you can minimise it by using Multi-AZ deployments. First, the secondary instance will be updated. Then AWS do a failover and update the Primary instance as well. This means some small outage during the failover.

DB Engine Updates

For DB maintenance, the updates are applied to both instances (primary and secondary) at the same time. That will cause some downtime.

More information: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.Maintenance.html#USER_UpgradeDBInstance.Maintenance.Multi-AZ

Google CloudSQL

With CloudSQL you have to define an hour for a maintenance window, for example 01:00–02:00, and in that hour, they can restart the instances at any time. It is not guaranteed the update will be done in that hour. The primary and the secondary have the same maintenance window. The read replicas do not have any maintenance window, they can be stopped at any time.

CloudSQL does not differentiate between OS or DB engine, or between required and available upgrades. Because the failover replica has the same maintenance window, any upgrade might cause database outage in that time frame.

More information: https://cloud.google.com/sql/docs/mysql/instance-settings

Microsoft Azure

Azure provides a service called Azure Database for MySQL servers. I was reading the documentation and doing some research trying to find anything regarding the maintenance window, but I did not find anything.

I span up an instance in Azure to see if there is any available settings, but I did not find anything so at this point I do not know how Azure does OS or DB maintenance or how that impacts production traffic.

If someone knows where can I find this information in the documentation, please let me know.

Conclusion

AWS CloudSQL Azure
Maintenance Window 30m 1h Unknown
Maintenance Window for Read Replicas No No Unknown
Separate OS and DB updates Yes No Unknown
Outage during update Possible Possible Unknown
Postpone an update Possible No Unknown
Different priority for updates Yes No Unknown

 

While I do not intend  to prefer or promote any of the providers, for this specific question, AWS offers the most options and controls for how we want to deal with maintenance.


Photo by Caitlin Oriel on Unsplash

Oct
23
2018
--

Reclaiming space on your Docker PMM server deployment

reclaiming space Docker PMM

reclaiming space Docker PMMRecently we had a customer that had issues with a filled disk on the server hosting their Docker pmm-server environment. They were not able to access the web UI, or even stop the pmm-server container because they had filled the /var/ mount point.

Setting correct expectations

The best way to avoid these kinds of issues in the first place is to plan ahead, and to know exactly with what you are dealing with in terms of disk space requirements. Michael Coburn has written a great blogpost on this matter:

https://www.percona.com/blog/2017/05/04/how-much-disk-space-should-i-allocate-for-percona-monitoring-and-management/

We are now using Prometheus version 2 inside PMM server, so you should take it with a pinch of salt. On the other hand, it will show how you should plan ahead, and think about the “steady state” disk usage, so it’s a good read.

That’s the first step to make sure you won’t get into trouble down the line. But, what happens if you are already in trouble? We’ll see two quick ways that may help reclaiming space.

Before anything else, you should stop any and all PMM clients running, so that you don’t have a race condition after recovering some space, in which metrics coming from the running clients will fill up whatever disk you had freed.

If

pmm-admin stop --all

  won’t work, you can stop the services manually, or even manually kill the running processes as a last resort:

shell> systemctl list-unit-files | grep enabled | grep pmm | awk '{print $1}' | xargs -n 1 systemctl stop
shell> ps ax | egrep "exporter|qan-agent|pmm" | grep -v "ssh" | awk '{print $1}' | xargs kill

Removing unused containers

In order for the next steps to be as effective as possible, make sure there are no unused containers running, or stopped:

shell> docker ps -a

If you see any container that you know you don’t need anymore:

shell> docker stop <container_name>
shell> docker rm -v <container_name>

WARNING! Do not remove the pmm-data container!

Reclaiming space from unused Docker images

After you are done cleaning unused containers, we can move forward with removing unused images. Unless you are manually building your own Docker images, it’s really easy to get them again if needed, so you shouldn’t be afraid of deleting the ones that are not being used. In fact, you don’t need to explicitly download the images. By simply running

docker run … image_name

  Docker will automatically do it for you if it’s not found locally.

shell> docker image prune -a
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
...
Total reclaimed space: 3.97GB

Not too bad, we just reclaimed 4Gb of disk space. This alone should be enough to restart the Docker service and have the pmm-server container back up. But we want more, just because we can ?

Reclaiming space from orphaned Docker volumes

By default, when removing a container (with

docker rm

 ) Docker will not delete the associated volumes, unless you use the -v switch as we did above. This will mean that, unless you were aware of this fact, you will probably have some other gigabytes worth of data occupying disk space. We can easily do this with the volume prune command:

shell> docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
...
Total reclaimed space: 115GB

Yeah… that’s some significant amount of disk space we just reclaimed back! Again, make sure you don’t care about any of the volumes from your past containers to be able to do this safely, since there is no turning back from this, obviously.

For earlier versions of Docker where this command is not available, you can check this link.

Planning ahead

As mentioned before, you should now revisit Michael’s blogpost, and set the metrics retention and queries retention variables to whatever makes sense for your environment. Even if you plan ahead, you may not be counting on the additional variable overhead of images and orphaned volumes, so you may want to (warning: shameless plug for my own blogpost ahead) use different mount points for your PMM deployment, and avoid using the shared /var/lib/docker/ mount point for it.

PMM also includes a Disk Space usage dashboard, that you can use to monitor this.

Don’t forget to start back up your PMM clients, and continue to monitor them 24×7!

Photo by Andrew Wulf on Unsplash

May
31
2017
--

ProxySQL-Assisted Percona XtraDB Cluster Maintenance Mode

Percona XtraDB Cluster Maintenance Mode

Percona XtraDB Cluster Maintenance ModeIn this blog post, we’ll look at how Percona XtraDB Cluster maintenance mode uses ProxySQL to take cluster nodes offline without impacting workloads.

Percona XtraDB Cluster Maintenance Mode

Since Percona XtraDB Cluster offers a high availability solution, it must consider a data flow where a cluster node gets taken down for maintenance (through isolation from a cluster or complete shutdown).

Percona XtraDB Cluster facilitated this by introducing a maintenance mode. Percona XtraDB Cluster maintenance mode reduces the number of abrupt workload failures if a node is taken down using ProxySQL (as a load balancer).

The central idea is delaying the core node action and allowing ProxySQL to divert the workload.

How ProxySQL Manages Percona XtraDB Cluster Maintenance Mode

With Percona XtraDB Cluster maintenance mode, ProxySQL marks the node as OFFLINE when a user triggers a shutdown signal (or wants to put a specific node into maintenance mode):

  • When a user triggers a shutdown, Percona XtraDB Cluster node sets
    pxc_maint_mode

     to SHUTDOWN (from the DISABLED default) and sleep for x seconds (dictated by

    pxc_maint_transition_period

      — 10 secs by default). ProxySQLauto detects this change and marks the node as OFFLINE. With this change, ProxySQL avoids opening new connections for any DML transactions, but continues to service existing queries until

    pxc_maint_transition_period

    . Once the sleep period is complete, Percona XtraDB Cluster delivers a real shutdown signal — thereby giving ProxySQL enough time to transition the workload.

  • If the user needs to take a node into maintenance mode, the user can simply set
    pxc_maint_mode

     to MAINTENANCE. With that, 

    pxc_maint_mode

     is updated and the client connection updating it goes into sleep for x seconds (as dictated by

    pxc_maint_transition_period

    ) before giving back control to the user. ProxySQL auto-detects this change and marks the node as OFFLINE. With this change ProxySQL avoids opening new connections for any DML transactions but continues to service existing queries.

  • ProxySQL auto-detects this change in maintenance state and then automatically re-routes traffic, thereby reducing abrupt workload failures.

Technical Details:

  • The ProxySQL Galera checker script continuously monitors the state of individual nodes by checking the
    pxc_maint_mode

     variable status (in addition to the existing

    wsrep_local_state

    ) using the ProxySQL scheduler feature

  • Scheduler is a Cron-like implementation integrated inside ProxySQL, with millisecond granularity.
  • If
    proxysql_galera_checker

     detects

    pxc_maint_mode = SHUTDOWN | MAINTENANCE

    , then it marks the node as OFFLINE_SOFT.  This avoids the creation of new connections (or workloads) on the node.

Sample

proxysql_galera_checker

 log:

Thu Dec  8 11:21:11 GMT 2016 Enabling config
Thu Dec  8 11:21:17 GMT 2016 Check server 10:127.0.0.1:25000 , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:17 GMT 2016 Check server 10:127.0.0.1:25100 , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:17 GMT 2016 Check server 10:127.0.0.1:25200 , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:17 GMT 2016 Changing server 10:127.0.0.1:25200 to status OFFLINE_SOFT due to SHUTDOWN
Thu Dec  8 11:21:17 GMT 2016 Number of writers online: 2 : hostgroup: 10
Thu Dec  8 11:21:17 GMT 2016 Enabling config
Thu Dec  8 11:21:22 GMT 2016 Check server 10:127.0.0.1:25000 , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:22 GMT 2016 Check server 10:127.0.0.1:25100 , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:22 GMT 2016 Check server 10:127.0.0.1:25200 , status OFFLINE_SOFT , wsrep_local_state 4

Ping us below with any questions or comments.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com