Nov
22
2022
--

Troubleshooting Percona Operators With troubleshoot.sh

Percona loves and embraces Kubernetes. Percona Operators have found their way into the hearts and minds of developers and operations teams. And with growing adoption, we get a lot of valuable feedback from the community and our support organization. Two of the most common questions that we hear are:

  1. What are the best practices to run Operators?
  2. Where do I look if there is a problem? In other words, how do I troubleshoot the issues?

In this blog post, we will experiment with troubleshoot.sh, an open source tool developed by Replicated, and see if it can help with our problems. 

Prepare

Installation

Troubleshoot.sh is a collection of kubectl plugins; this post uses two of them, preflight and support-bundle. Installation is simple and described in the documentation.

Concepts

There are two basic use cases:

  1. Preflight – checks if your cluster is ready and compliant to run an application. This covers the best practices question.
  2. Support bundle – checks if the application is running and behaving appropriately. This helps with troubleshooting.

There are three functional concepts in the tool that you should know about:

  1. Collectors – where you define what information you want to collect. A lot of basic information about the cluster is collected by default.
  2. Redact – if there is sensitive information, you can redact it from the bundle. For example, AWS credentials might end up somewhere in the logs of your application.
  3. Analyze – act on the collected information and provide actionable insights.
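
To make the Redact concept more concrete, here is a minimal sketch of a Redactor spec. The redactor names, the literal password, and the regular expression are hypothetical and chosen only for illustration; they are not taken from the examples used in this post.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: pxc-redactors              # hypothetical name
spec:
  redactors:
    # Remove a literal value wherever it appears in the bundle.
    - name: literal password       # names are for record keeping only
      removals:
        values:
          - my-secret-password     # placeholder value
    # Mask whatever follows a well-known variable name in any file.
    - name: aws secret key in logs
      removals:
        regex:
          # The named group "mask" marks the part that gets hidden.
          - redactor: '(AWS_SECRET_ACCESS_KEY=)(?P<mask>\S+)'
```

Because neither redactor has a file selector, both would apply to every file in the bundle.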

 

Preflight

Checks are defined through YAML files. I have created an example to check if the Kubernetes cluster is ready to run Percona Operator for MySQL (based on Percona XtraDB Cluster). This checks the following:

  1. Kubernetes version and flavor – as we only test our releases on specific versions and providers
  2. Default storage class – it is possible to use local storage, but by default, our Operator uses Persistent Volume Claims
  3. Node size – we check if nodes have at least 1 GB of RAM, otherwise, Percona XtraDB Cluster might fail to start
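
As a rough illustration, the checks listed above could be expressed as follows. This is a sketch and not the content of the linked file: the spec name, the version threshold, and the storage class name `standard` are placeholders. Note that this `storageClass` check looks for a named class rather than the default one, and that the Kubernetes flavor could be checked in a similar way with the `distribution` analyzer.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: pxc-operator-preflight     # hypothetical name
spec:
  analyzers:
    # 1. Kubernetes version (the threshold is a placeholder)
    - clusterVersion:
        checkName: Kubernetes version
        outcomes:
          - fail:
              when: "< 1.21.0"
              message: This Kubernetes version is not tested with the Operator
          - pass:
              message: Kubernetes version is supported
    # 2. Storage class ("standard" is a placeholder name)
    - storageClass:
        checkName: Storage class
        storageClassName: standard
        outcomes:
          - fail:
              message: Storage class "standard" was not found
          - pass:
              message: Storage class is present
    # 3. Node size: every node needs at least 1 GB of RAM
    - nodeResources:
        checkName: Node memory
        outcomes:
          - fail:
              when: "min(memoryCapacity) < 1Gi"
              message: At least one node has less than 1 GB of RAM
          - pass:
              message: All nodes have at least 1 GB of RAM
```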

Run the check with the following command:

kubectl preflight https://raw.githubusercontent.com/spron-in/blog-data/master/percona-troubleshoot/pxc-op-1.11-preflight.yaml

This is just an example. Preflight checks are very flexible, and you can read more about them in the documentation.

Support bundle

Support bundles can be used by support organizations or by users to troubleshoot issues with applications on Kubernetes. Once you run the command, it also creates a tar.gz archive with various data points that can be analyzed later or shared with experts.

Similarly to Preflight, support bundle checks are defined in YAML files. See the example that I created here and run it with the following command:

kubectl support-bundle https://raw.githubusercontent.com/spron-in/blog-data/master/percona-troubleshoot/pxc-op-1.11-support.yaml

Check cluster state

First, we are going to check the presence of the Custom Resource Definition and the health of the cluster. In my YAML, the name of the cluster is hard-coded to minimal-cluster (it can be deployed from our repo).

We use Custom Resource fields to verify the statuses of Percona XtraDB Cluster, HAProxy, and ProxySQL. This is the check for HAProxy:

    - yamlCompare:
        checkName: Percona Operator for MySQL - HAProxy
        fileName: cluster-resources/custom-resources/perconaxtradbclusters.pxc.percona.com/default.yaml
        path: "[0].status.haproxy.status"
        value: ready
        outcomes:
          - fail:
              when: "false"
              message: HAProxy is not ready
          - pass:
              when: "true"
              message: HAProxy is ready

The file perconaxtradbclusters.pxc.percona.com/default.yaml has all pxc Custom Resources in the default namespace. I know I have only one there, so I just parse the whole YAML with the yamlCompare method.
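
The Custom Resource Definition check mentioned at the start of this section, and the same comparison for Percona XtraDB Cluster itself, might look roughly like the sketch below. The check names and messages are made up, and the path `[0].status.pxc.status` is an assumption that mirrors the HAProxy path above, so verify it against your own Custom Resource before relying on it.

```yaml
analyzers:
  # Fails if the CRD is missing, which usually means the Operator is not installed.
  - customResourceDefinition:
      checkName: Percona Operator for MySQL - CRD
      customResourceDefinitionName: perconaxtradbclusters.pxc.percona.com
      outcomes:
        - fail:
            message: The PerconaXtraDBCluster CRD was not found
        - pass:
            message: The PerconaXtraDBCluster CRD is installed
  # Same pattern as the HAProxy check; the status path is an assumption.
  - yamlCompare:
      checkName: Percona Operator for MySQL - PXC
      fileName: cluster-resources/custom-resources/perconaxtradbclusters.pxc.percona.com/default.yaml
      path: "[0].status.pxc.status"
      value: ready
      outcomes:
        - fail:
            when: "false"
            message: Percona XtraDB Cluster is not ready
        - pass:
            when: "true"
            message: Percona XtraDB Cluster is ready
```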

Verify Pod statuses

If you see that previous checks failed, it is time to check the Pods. In the current implementation, you can get the statuses of the Pods in various namespaces. 

    - clusterPodStatuses:
        name: unhealthy
        namespaces:
          - default
        outcomes:
…
          - fail:
              when: "== Pending"
              message: Pod {{ .Namespace }}/{{ .Name }} is in a Pending state with status of {{ .Status.Reason }}.
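
The outcomes list in the excerpt above is abbreviated. The following sketch does not reproduce the omitted part; it only illustrates what other conditions for the same analyzer can look like, with statuses and messages picked for illustration.

```yaml
- clusterPodStatuses:
    name: unhealthy
    namespaces:
      - default
    outcomes:
      # One outcome per Pod status you want to call out explicitly.
      - fail:
          when: "== CrashLoopBackOff"
          message: Pod {{ .Namespace }}/{{ .Name }} is in a CrashLoopBackOff state.
      - fail:
          when: "== ImagePullBackOff"
          message: Pod {{ .Namespace }}/{{ .Name }} is in an ImagePullBackOff state.
      # Catch-all for anything else that is not healthy.
      - fail:
          when: "!= Healthy"
          message: Pod {{ .Namespace }}/{{ .Name }} is unhealthy with a status of {{ .Status.Reason }}.
```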

Parse logs for known issues

Parsing application logs can be tedious. But what if you could predefine all known errors and quickly learn about the issue without going into your central logging system? Troubleshoot.sh can do that as well.

First, we define the collectors. We will get the logs from the Operator and Percona XtraDB Cluster itself:

collectors:
    - logs:
        selector:
          - app.kubernetes.io/instance=minimal-cluster
          - app.kubernetes.io/component=pxc
        namespace: default
        name: pxc/container/logs
        limits:
          maxAge: 24h
    - logs:
        selector:
          - app.kubernetes.io/instance=percona-xtradb-cluster-operator
          - app.kubernetes.io/component=operator
        namespace: default
        name: pxc-operator/container/logs
        limits:
          maxAge: 24h

Now the logs are in our support bundle. We can analyze them manually or catch some known messages with a regular expression analyzer:

    - textAnalyze:
        checkName: Failed to update the password
        fileName: pxc-operator/container/logs/percona-xtradb-cluster-operator-.*.log
        ignoreIfNoFiles: true
        regex: 'Error 1396: Operation ALTER USER failed for'
        outcomes:
          - pass:
              when: "false"
              message: "No failures on password change"
          - fail:
              when: "true"
              message: "There was a failure to change the system user password. For more details: https://docs.percona.com/percona-operator-for-mysql/pxc/users.html"

In this example, we check the Operator log for a message indicating that the system user password change failed.
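
To show how the pieces fit together, here is a sketch of a complete support bundle spec that combines the Operator log collector and the analyzer from this section. The collector and analyzer are copied from the snippets above; only the `metadata.name` is made up.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: pxc-operator-support       # hypothetical name
spec:
  collectors:
    # Collect the last 24 hours of Operator logs into the bundle.
    - logs:
        selector:
          - app.kubernetes.io/instance=percona-xtradb-cluster-operator
          - app.kubernetes.io/component=operator
        namespace: default
        name: pxc-operator/container/logs
        limits:
          maxAge: 24h
  analyzers:
    # Scan the collected logs for a known error message.
    - textAnalyze:
        checkName: Failed to update the password
        fileName: pxc-operator/container/logs/percona-xtradb-cluster-operator-.*.log
        ignoreIfNoFiles: true
        regex: 'Error 1396: Operation ALTER USER failed for'
        outcomes:
          - pass:
              when: "false"
              message: "No failures on password change"
          - fail:
              when: "true"
              message: "There was a failure to change the system user password."
```

The `name` of the collector determines where the logs land in the bundle, so the analyzer's `fileName` has to point at the same directory.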

Check for security best practices

Besides dealing with operational issues, troubleshoot.sh can help with compliance checks. You should have some security guardrails in the code before you deploy to Kubernetes, but if something slips through, you will have a second line of defense.

In our example, we check for weak passwords and whether the StatefulSet follows high availability best practices.

Collecting and sharing passwords is not recommended, but such a check can still be a valuable addition to your security policies. To capture secrets, you need to explicitly define a collector:

spec:
  collectors:
    - secret:
        namespace: default
        name: internal-minimal-cluster
        includeValue: true
        key: root
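
For the high availability part, one possible approach (not necessarily the one used in the linked example) is the `statefulsetStatus` analyzer, which looks at the number of ready replicas. The StatefulSet name `minimal-cluster-pxc` and the threshold of three replicas are assumptions made for this sketch.

```yaml
analyzers:
  - statefulsetStatus:
      checkName: Percona XtraDB Cluster - high availability
      name: minimal-cluster-pxc    # assumed StatefulSet name
      namespace: default
      outcomes:
        # Fewer than three ready Pods means the cluster is not highly available.
        - fail:
            when: "< 3"
            message: Fewer than three Percona XtraDB Cluster Pods are ready
        - pass:
            message: At least three Percona XtraDB Cluster Pods are ready
```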

Replay support bundle with sbctl

sbctl is a great addition to troubleshoot.sh. It allows users and support engineers to examine a support bundle as if they were interacting with a live Kubernetes cluster.

What’s next

The tool is new and there are certain limitations. To list a few:

  1. For now, all checks are hard-coded. This calls for a templating mechanism or a simple generator of YAML files.
  2. It is not possible to add sophisticated logic into analyzers. For example, to link one analyzer to another. 
  3. It would be great to apply targeted filtering. For example, do not gather the information for all Custom Resources, but only one. It can be useful for targeted troubleshooting. 

We are going to work with the community and see how Percona can contribute to this tool, as it can help our users to standardize the troubleshooting of our products. Please let us know about your use cases or stories around debugging Percona products on Kubernetes in the comments.

The Percona Kubernetes Operators automate the creation, alteration, or deletion of members in your Percona Distribution for MySQL, MongoDB, or PostgreSQL environment.
