A Guide to Disaster Recovery in the Kubernetes Cluster

Oct 23, 2022

So far, we have covered various Kubernetes topics. In this article, let’s look at another important one: disaster recovery in the Kubernetes cluster. As Kubernetes usage keeps growing, it is important to make industry-standard processes part of your cluster implementation and configuration. Backup is one of those processes; it is what lets you recover your Kubernetes cluster from a major failure.

Why Do We Need Backup and Recovery?

There are three reasons why we need a backup and recovery mechanism in place for our Kubernetes cluster.

To recover the cluster from disasters: for example, someone accidentally deleted the namespace where your deployments reside.
To replicate the environment: you want to replicate your production environment to a staging environment before a major upgrade.
To migrate a Kubernetes cluster: you want to move your Kubernetes cluster from one environment to another.

What to Back Up?

We now know why we need backups, but what should we back up? There are two things:

The state of your Kubernetes cluster is stored in etcd, so you need to back up the etcd state to capture all the Kubernetes resources.
If you have stateful containers (which you will have in the real world), you need a backup of the persistent volumes as well.

Best practices for Kubernetes disaster recovery

Kubernetes workloads should not be backed up using a traditional approach. To make sure that backup and recovery are seamless, organizations should keep the following things in mind.

Understand the backup requirement

It is important to understand what needs to be backed up and how critical it is. For example, if you run your Kubernetes cluster in a cloud environment with GitOps, backing up resource definitions is less of a concern, since all your changes are already in Git; you can focus on backing up volumes, if you use any. In the same way, understand your backup requirements and plan for them. This article assumes you are running the cluster on bare metal and shows one possible way to take backups. If you wish to adopt the GitOps way, you can follow our ArgoCD series.

Have a restore plan

You should have a detailed, step-by-step plan for restoring the backup in case anything happens. Always test it at least twice in different environments, so you are more confident when a real incident occurs. Document each step with a detailed explanation so that anyone can perform the restore quickly.
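
One way to rehearse part of a restore is to restore a snapshot into a throwaway directory and check that the command succeeds. The sketch below is only an illustration: the function name `restore_drill` is hypothetical, it assumes `etcdctl` with the v3 API is installed on the machine, and it only proves that the snapshot file can be restored, not that a whole cluster comes back. Newer etcd releases provide the restore subcommand through `etcdutl` instead, so adjust the command to your version.

```shell
#!/bin/sh
# Hypothetical helper: restore a snapshot into a scratch directory to prove
# that the file is usable. Nothing is written to the live etcd data dir.
# Usage: restore_drill /backup/<snapshot file>
restore_drill() {
    snapshot="$1"
    # Refuse missing or empty files before touching etcdctl.
    [ -s "$snapshot" ] || { echo "missing or empty snapshot: $snapshot" >&2; return 1; }
    command -v etcdctl >/dev/null 2>&1 || { echo "etcdctl not found" >&2; return 127; }
    scratch=$(mktemp -d) || return 1
    # The target data dir must not exist yet, so use a fresh subdirectory.
    ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir "$scratch/restore"
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

A zero exit status means the snapshot could be restored; any other status is a signal to investigate the backup before you need it for real.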

Application-aware backups

Kubernetes’ portability is a double-edged sword. While it makes it easy to build new applications using existing services and eases migration to different environments, it also makes backups harder to reason about. Many workloads running on the k8s platform are stateless, so it’s important to have application-aware backups that provide context about the backup and the different components involved in it. This can be done with the help of a Kubernetes backup solution. Organizations can automate the entire backup and recovery process to avoid failures. These solutions also provide options to store the backups in various locations and help make restoring to a brand-new environment a breeze.

Security is key

We need to protect our backups from attackers. Organizations often make the mistake of slacking on backup security, yet your application is only as secure as your backup. To prevent unauthorized access to backups, organizations should employ identity and access management (IAM) or role-based access control (RBAC). Only the members assigned to monitor or verify backups should be given access rights. Another important measure against attacks is data encryption. Organizations can also invest in a disaster recovery solution that takes care of backup security for them.
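
At the file-system level, a small first step is to make sure only the owner of the backup directory can read the snapshots. The sketch below illustrates that idea; the function name `lock_down_backups` is hypothetical, and it complements IAM/RBAC and encryption rather than replacing them.

```shell
#!/bin/sh
# Hypothetical helper: restrict a backup directory to its owner.
# Usage: lock_down_backups /data/backup
lock_down_backups() {
    dir="$1"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # Owner-only access on the directory itself.
    chmod 700 "$dir" || return 1
    # Owner-only read/write on every regular file below it.
    find "$dir" -type f -exec chmod 600 {} +
}
```

Run it as the user that owns the backups, for example right after each snapshot is written.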

Requirements

Make sure both environments are using the same versions of kubeadm, kubectl, and kubelet.
You can use https://foxutech.com/setup-a-multi-master-kubernetes-cluster-with-kubeadm/ to set up the cluster in your environment.
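
A quick way to check the version requirement is to record the tool versions on each environment and compare the two reports. This is a minimal sketch, not a definitive check: the function names `collect_versions` and `same_versions` and the report file names are made up for illustration, and the exact text each tool prints differs between releases, which is why the reports are compared as a whole rather than parsed.

```shell
#!/bin/sh
# Hypothetical helper: print one "tool<TAB>version output" line per tool.
collect_versions() {
    for tool in kubeadm kubectl kubelet; do
        if command -v "$tool" >/dev/null 2>&1; then
            case "$tool" in
                kubeadm) v=$(kubeadm version -o short 2>/dev/null) ;;
                kubectl) v=$(kubectl version --client 2>/dev/null | head -n 1) ;;
                kubelet) v=$(kubelet --version 2>/dev/null) ;;
            esac
        else
            v="not installed"
        fi
        printf '%s\t%s\n' "$tool" "$v"
    done
}

# Hypothetical helper: succeed only if two saved reports are identical.
same_versions() {
    cmp -s "$1" "$2"
}
```

For example, run `collect_versions > source.txt` on one environment and `collect_versions > target.txt` on the other, copy both files to one machine, and run `same_versions source.txt target.txt && echo "versions match"`.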

ETCD Backup

How to take an etcd backup:

The mechanism for taking an etcd backup depends on how you set up your etcd cluster in the Kubernetes environment. There are two ways to set up an etcd cluster:

Internal etcd cluster: You run your etcd cluster as containers/pods inside the Kubernetes cluster, and Kubernetes is responsible for managing those pods.
External etcd cluster: You run the etcd cluster outside of the Kubernetes cluster, mostly as Linux services, and provide its endpoints to the Kubernetes cluster to write to.

Backup Strategy for Internal Etcd Cluster:

To take a backup from a pod running the etcd image, we will use the Kubernetes CronJob functionality, which does not require an etcdctl client to be installed on the host. Alternatively, you can run etcdctl directly on the node, as follows.

Note: The backup location should be external or otherwise secure, and should itself be backed up properly or be highly available, like cloud volumes.

Command:

For this, etcdctl must be installed on the node(s).

# ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/client.crt --key=/etc/kubernetes/pki/etcd/client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db

If you do not know the etcd endpoint and certificate paths for your cluster, you can find them with the command below. The etcd pod is named etcd-<node name>; here the node is k8s-master.

# kubectl get pods etcd-k8s-master -n kube-system -o=jsonpath='{.spec.containers[0].command}' | jq

CronJob:

The following is the definition of a Kubernetes CronJob that takes an etcd backup every minute:

apiVersion: batch/v1   # use batch/v1beta1 on clusters older than v1.21
kind: CronJob
metadata:
  name: backup
  namespace: kube-system
spec:
  # activeDeadlineSeconds: 100
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          # Run on a control-plane node, where etcd listens on 127.0.0.1 and
          # the certificates exist. The label and taint names vary by
          # Kubernetes version; adjust them to match your nodes.
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - effect: NoSchedule
            operator: Exists
          containers:
          - name: backup
            # Same image as in /etc/kubernetes/manifests/etcd.yaml
            image: k8s.gcr.io/etcd:3.2.24
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/client.crt --key=/etc/kubernetes/pki/etcd/client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /data/backup
              type: DirectoryOrCreate
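
Because this job writes a new timestamped file on every run, the backup directory keeps growing. One possible way to cap it, which is not part of the setup above, is to delete all but the newest few snapshots. The sketch below assumes the `etcd-snapshot-<timestamp>.db` naming used by the backup command and a fixed time zone, so that sorting by name is the same as sorting by time; the function name `prune_snapshots` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only the newest $2 snapshots in directory $1.
# Usage: prune_snapshots /data/backup 60
prune_snapshots() {
    dir="$1"
    keep="$2"
    # Timestamped names (YYYY-MM-DD_HH:MM:SS) sort chronologically, so a
    # reverse name sort lists the newest snapshot first; everything after
    # the first $keep entries is removed.
    ls -1 "$dir"/etcd-snapshot-*.db 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}
```

With a backup every minute, `prune_snapshots /data/backup 60` would keep roughly the last hour of snapshots on the node.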

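We can check the status of a snapshot. Replace the file name with that of an actual snapshot, since the backup commands above write timestamped names.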

# ETCDCTL_API=3 etcdctl --write-out=table snapshot status /backup/etcd-snapshot.db
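
Since the snapshot files carry a timestamp in their names, it can be handy to look up the newest one before checking its status. The helper below is a small sketch under the same naming assumption (`etcd-snapshot-<timestamp>.db`, fixed time zone); the function name `latest_snapshot` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the newest snapshot in directory $1,
# or nothing if the directory holds no snapshots.
latest_snapshot() {
    # Names embed YYYY-MM-DD_HH:MM:SS, so the last name in sorted order
    # is the most recent snapshot.
    ls -1 "$1"/etcd-snapshot-*.db 2>/dev/null | sort | tail -n 1
}

# Example (requires etcdctl on the node):
# ETCDCTL_API=3 etcdctl --write-out=table snapshot status "$(latest_snapshot /backup)"
```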

Continue reading on: https://foxutech.com/disaster-recovery-in-the-kubernetes-cluster/