add blog post on handling data duplication in data-heavy environments
---
layout: blog
title: "How to Handle Data Duplication in Data-Heavy Kubernetes Environments"
date: 2021-09-07
slug: how-to-handle-data-duplication-in-data-heavy-kubernetes-environments
---

**Authors:**
Augustinas Stirbis (CAST AI)

## Why Duplicate Data?

It’s convenient to create a copy of your application with a copy of its state for each team.
For example, you might want a separate database copy to test some significant schema changes
or to develop other disruptive operations like bulk insert/delete/update...

**Duplicating data takes a lot of time.** That’s because you first need to download
all the data from a source block storage provider to compute resources, and then send
it back to a storage provider again. A lot of network traffic and CPU/RAM is used in this process.
Offloading certain expensive operations to dedicated hardware is
**always a huge performance boost**. It reduces the time required to complete an operation by orders
of magnitude.

## Volume Snapshots to the rescue

Kubernetes introduced [VolumeSnapshots](/docs/concepts/storage/volume-snapshots/) as alpha in 1.12,
beta in 1.17, and the Generally Available version in 1.20.
VolumeSnapshots use specialized APIs from storage providers to duplicate a volume of data.
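
Before relying on them, it’s worth confirming that the snapshot API is actually served by your cluster; a quick check (assuming you have `kubectl` access) could look like this:

```terminal
# List the resources served by the snapshot API group; empty output suggests the
# VolumeSnapshot CRDs and snapshot controller are not installed yet.
kubectl api-resources --api-group=snapshot.storage.k8s.io
```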

Since data is already on the same storage device (or array of devices), duplicating data is usually
a metadata operation for storage providers with local snapshots (the majority of on-premises storage providers).
All you need to do is point a new disk at an immutable snapshot and only
save deltas (or let it do a full-disk copy). As an operation inside the storage back-end,
it’s much quicker and usually doesn’t involve sending traffic over the network.
Public cloud storage providers work a bit differently under the hood. They save snapshots
to object storage and then copy them back from object storage to block storage when "duplicating" a disk.
Technically, a lot of compute and network resources are spent on the cloud provider’s side,
but from the Kubernetes user’s perspective VolumeSnapshots work the same way whether the snapshot
storage provider is local or remote, and no compute or network resources from your cluster are involved in the operation.

## Sounds like we have our solution, right?

Actually, VolumeSnapshots are namespaced, and Kubernetes protects namespaced data from
being shared between tenants (Namespaces). This Kubernetes limitation is a conscious design
decision so that a Pod running in a different namespace can’t mount another application’s
[PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) (PVC).

One way around it would be to create multiple volumes with duplicate data in one namespace.
However, you could easily reference the wrong copy.

So the idea is to separate teams/initiatives by namespaces to avoid that and generally
limit access to the production namespace.

## Solution? Creating a Golden Snapshot externally

Another way around this design limitation is to create a snapshot externally (not through Kubernetes).
This is also called pre-provisioning a snapshot manually. Next, I will import it
as a multi-tenant golden snapshot that can be used for many namespaces. The illustration below uses
the AWS EBS (Elastic Block Store) and GCE PD (Persistent Disk) services.

### High-level plan for preparing the Golden Snapshot

1. Identify the disk (EBS/Persistent Disk) with the data that you want to clone in the cloud provider
2. Make a disk snapshot (in the cloud provider console)
3. Get the disk snapshot ID

### High-level plan for cloning data for each team

1. Create the Namespace “sandbox01”
2. Import the disk snapshot (ID) as a VolumeSnapshotContent into Kubernetes
3. Create a VolumeSnapshot in the Namespace “sandbox01” mapped to the VolumeSnapshotContent
4. Create the PersistentVolumeClaim from the VolumeSnapshot
5. Install the Deployment or StatefulSet with the PVC

## Step 1: Identify Disk

First, you need to identify your golden source. In my case, it’s a PostgreSQL database
on PersistentVolumeClaim “postgres-pv-claim” in the “production” namespace.

```terminal
kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'
```

The output will look similar to:
```
pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9
```

## Step 2: Prepare your golden source

You need to do this once, or every time you want to refresh your golden data.

### Make a Disk Snapshot

Go to the AWS EC2 or GCP Compute Engine console and search for an EBS volume
(on AWS) or Persistent Disk (on GCP) that has a label matching the last output.
In this case I saw: `pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9`.
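
If you prefer the terminal over the console, on GCP a name filter along these lines should locate the same disk (the project ID placeholder and the `pvc-...` name come from the example above):

```terminal
# List disks whose name matches the PersistentVolume name from step 1.
gcloud compute disks list --project=<gcp-project-id> --filter="name=pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9"
```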

Click on Create snapshot and give it a name. You can do it in the console manually,
in AWS CloudShell / Google Cloud Shell, or in the terminal. To create a snapshot in the
terminal you must have the AWS CLI tool (`aws`) or Google's CLI (`gcloud`)
installed and configured.

Here’s the command to create a snapshot on GCP:

```terminal
gcloud compute disks snapshot <cloud-disk-id> --project=<gcp-project-id> --snapshot-names=<set-new-snapshot-name> --zone=<availability-zone> --storage-location=<region>
```
{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/create-volume-snapshot-gcp.png" alt="Screenshot of a terminal showing volume snapshot creation on GCP" title="GCP snapshot creation" >}}

GCP identifies the disk by its PVC name, so it’s a direct mapping. In AWS, you first need to
find the volume by its `CSIVolumeName` AWS tag (whose value is the PV name from step 1); that volume ID is then used for snapshot creation.

{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/identify-volume-aws.png" alt="Screenshot of AWS web console, showing EBS volume identification" title="Identify disk ID on AWS" >}}
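
For the CLI route on AWS, a tag-based lookup roughly like the following should return the volume ID (assuming the EBS CSI driver tagged the volume with `CSIVolumeName` set to the PV name, as above):

```terminal
# Look up the EBS volume whose CSIVolumeName tag matches the PV name from step 1.
aws ec2 describe-volumes \
  --filters "Name=tag:CSIVolumeName,Values=pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9" \
  --query 'Volumes[*].VolumeId' --output text
```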

Mark down the volume ID (`vol-00c7ecd873c6fb3ec`) and either create the EBS snapshot in the AWS Console, or use the `aws` CLI.

```terminal
aws ec2 create-snapshot --volume-id '<volume-id>' --description '<set-new-snapshot-name>' --tag-specifications 'ResourceType=snapshot'
```

## Step 3: Get your Disk Snapshot ID

In AWS, the command above will output something similar to:
```terminal
"SnapshotId": "snap-09ed24a70bc19bbe4"
```
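
If you created the snapshot through the console instead, you can look its ID up afterwards; one way (filtering on the description you gave it, which is an assumption about how you named it) is:

```terminal
# Find the snapshot ID by the description set at creation time.
aws ec2 describe-snapshots --owner-ids self \
  --filters "Name=description,Values=<set-new-snapshot-name>" \
  --query 'Snapshots[*].SnapshotId' --output text
```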

If you’re using the GCP cloud, you can get the snapshot ID from the gcloud command by querying for the snapshot’s given name:

```terminal
gcloud compute snapshots --project=<gcp-project-id> describe <new-snapshot-name> | grep id:
```
You should get similar output to:
```
id: 6645363163809389170
```

## Step 4: Create a development environment for each team

Now I have my Golden Snapshot, which is immutable data. Each team will get a copy
of this data, and team members can modify it as they see fit, given that a new EBS/persistent
disk will be created for each team.

Below I will define a manifest for each namespace. To save time, you can replace
the namespace name (such as changing “sandbox01” → “sandbox42”) using tools
such as `sed` or `yq`, with Kubernetes-aware templating tools like
[Kustomize](/docs/tasks/manage-kubernetes-objects/kustomization/),
or using variable substitution in a CI/CD pipeline.

Here's an example manifest:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: postgresql-orders-db-sandbox01
spec:
  deletionPolicy: Retain
  driver: pd.csi.storage.gke.io
  source:
    snapshotHandle: 'gcp/projects/staging-eu-castai-vt5hy2/global/snapshots/6645363163809389170'
  volumeSnapshotRef:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    namespace: sandbox01
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgresql-orders-db-snap
  namespace: sandbox01
spec:
  source:
    volumeSnapshotContentName: postgresql-orders-db-sandbox01
```

In Kubernetes, VolumeSnapshotContent (VSC) objects are not namespaced.
However, I need a separate VSC for each namespace that will use it, so the
`metadata.name` of each VSC must also be different. To make that straightforward,
I used the target namespace as part of the name.
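
As a minimal sketch of the per-namespace templating mentioned earlier (assuming the manifest above is saved in a file I'm calling `golden-snapshot.yaml`), stamping out and applying a copy for another team could look like:

```terminal
# Create the target namespace, then render the sandbox01 manifest for it and apply.
kubectl create namespace sandbox42
sed 's/sandbox01/sandbox42/g' golden-snapshot.yaml | kubectl apply -f -
```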

Now it’s time to replace the driver field with the CSI (Container Storage Interface) driver
installed in your K8s cluster. Major cloud providers have CSI drivers for block storage that
support VolumeSnapshots, but quite often CSI drivers are not installed by default; consult
your Kubernetes provider.
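
One quick way to see which CSI drivers are registered in your cluster (and therefore what value the `driver` field should have) is:

```terminal
# List the CSI drivers registered with the cluster.
kubectl get csidrivers
```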

The manifest above defines a VSC that works on GCP.
On AWS, the driver and snapshotHandle values might look like:

```yaml
driver: ebs.csi.aws.com
source:
  snapshotHandle: "snap-07ff83d328c981c98"
```

At this point, I need to use the *Retain* policy, so that the CSI driver doesn’t try to
delete my manually created EBS disk snapshot.

For GCP, you will have to build this string by hand: add the full project ID and snapshot ID.
For AWS, it’s just the plain snapshot ID.

The VSC also requires specifying which VolumeSnapshot (VS) will use it, so the VSC and VS
reference each other.

Now I can create a PersistentVolumeClaim from the VS above. It’s important to set the
`dataSource` to reference the VolumeSnapshot:

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pv-claim
  namespace: sandbox01
spec:
  dataSource:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 21Gi
```

If the default StorageClass has the [WaitForFirstConsumer](/docs/concepts/storage/storage-classes/#volume-binding-mode) policy,
then the actual cloud disk will be created from the Golden Snapshot only when some Pod binds that PVC.
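
To check which binding mode your StorageClasses use, something along these lines works:

```terminal
# Show each StorageClass together with its volume binding mode.
kubectl get storageclass -o custom-columns='NAME:.metadata.name,BINDINGMODE:.volumeBindingMode'
```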

Now I assign that PVC to my Pod (in my case, it’s PostgreSQL) as I would with any other PVC.

```terminal
kubectl -n <namespace> get volumesnapshotcontent,volumesnapshot,pvc,pod
```

Both the VS and VSC should show *READYTOUSE* true, the PVC should be bound, and the Pod (from the Deployment or StatefulSet) should be running.

**To keep using data from my Golden Snapshot, I just need to repeat this for the
next namespace and voilà! No need to waste time and compute resources on the duplication process.**