---
layout: blog
title: "How to Handle Data Duplication in Data-Heavy Kubernetes Environments"
date: 2021-09-07
slug: how-to-handle-data-duplication-in-data-heavy-kubernetes-environments
---

**Authors:**
Augustinas Stirbis (CAST AI)

## Why Duplicate Data?

It’s convenient to create a copy of your application with a copy of its state for each team.
For example, you might want a separate database copy to test some significant schema changes
or to develop other disruptive operations like bulk insert/delete/update...

**Duplicating data takes a lot of time.** That’s because you first need to download
all the data from a source block storage provider to compute, and then send
it back to a storage provider again. A lot of network traffic and CPU/RAM is used in this process.
Hardware acceleration by offloading certain expensive operations to dedicated hardware is
**always a huge performance boost**. It reduces the time required to complete an operation by orders
of magnitude.

## Volume Snapshots to the rescue

Kubernetes introduced [VolumeSnapshots](/docs/concepts/storage/volume-snapshots/) as alpha in 1.12,
beta in 1.17, and the Generally Available version in 1.20.
VolumeSnapshots use specialized APIs from storage providers to duplicate a volume of data.

Since the data is already in the same storage device (or array of devices), duplicating data is usually
a metadata operation for storage providers with local snapshots (the majority of on-premises storage providers).
All you need to do is point a new disk to an immutable snapshot and only
save deltas (or let it do a full-disk copy). As an operation that happens inside the storage back-end,
it’s much quicker and usually doesn’t involve sending traffic over the network.
Public cloud storage providers work a bit differently under the hood. They save snapshots
to Object Storage and then copy back from Object Storage to Block Storage when "duplicating" a disk.
Technically, a lot of compute and network resources are spent on the cloud provider’s side,
but from the Kubernetes user’s perspective VolumeSnapshots work the same way whether the
snapshot storage provider is local or remote, and none of your cluster’s compute and network resources are involved in this operation.
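
For reference, this is roughly what the ordinary, dynamically provisioned path looks like: a namespaced
VolumeSnapshot that points at an existing PVC. This is only a sketch; the VolumeSnapshotClass name below
is an assumption and depends on which CSI snapshot classes are installed in your cluster.

```yaml
# Minimal sketch of a regular in-namespace snapshot request.
# "csi-snapclass" is a hypothetical VolumeSnapshotClass name.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-demo
  namespace: production
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: postgres-pv-claim
```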

## Sounds like we have our solution, right?

Actually, VolumeSnapshots are namespaced, and Kubernetes protects namespaced data from
being shared between tenants (Namespaces). This Kubernetes limitation is a conscious design
decision so that a Pod running in a different namespace can’t mount another application’s
[PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) (PVC).

One way around it would be to create multiple volumes with duplicate data in one namespace.
However, you could easily reference the wrong copy.

So the idea is to separate teams/initiatives by namespaces to avoid that and generally
limit access to the production namespace.

## Solution? Creating a Golden Snapshot externally

Another way around this design limitation is to create a snapshot externally (not through Kubernetes).
This is also called pre-provisioning a snapshot manually. Next, I will import it
as a multi-tenant golden snapshot that can be used for many namespaces. The illustration below uses the
AWS EBS (Elastic Block Store) and GCE PD (Persistent Disk) services.

### High-level plan for preparing the Golden Snapshot

1. Identify the disk (EBS/Persistent Disk) with the data that you want to clone in the cloud provider
2. Make a disk snapshot (in the cloud provider console)
3. Get the disk snapshot ID

### High-level plan for cloning data for each team

1. Create Namespace “sandbox01”
2. Import the disk snapshot (ID) as a VolumeSnapshotContent to Kubernetes
3. Create a VolumeSnapshot in the Namespace “sandbox01”, mapped to the VolumeSnapshotContent
4. Create a PersistentVolumeClaim from the VolumeSnapshot
5. Install a Deployment or StatefulSet with the PVC

## Step 1: Identify Disk

First, you need to identify your golden source. In my case, it’s a PostgreSQL database
on PersistentVolumeClaim “postgres-pv-claim” in the “production” namespace.

```terminal
kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'
```

The output will look similar to:
```
pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9
```
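
If you want to jump straight to the cloud disk ID, the PersistentVolume that backs the claim usually
carries it in its CSI volume handle. A sketch, assuming the PV was provisioned by a CSI driver:

```terminal
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'
```

With the AWS EBS CSI driver this is the `vol-...` ID; with the GCE PD CSI driver it’s the full
`projects/.../disks/...` path of the Persistent Disk.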

## Step 2: Prepare your golden source

You need to do this once, or every time you want to refresh your golden data.

### Make a Disk Snapshot

Go to the AWS EC2 or GCP Compute Engine console and search for the EBS volume
(on AWS) or Persistent Disk (on GCP) that has a label matching the last output.
In this case I saw: `pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9`.

Click on Create snapshot and give it a name. You can do it in the Console manually,
in AWS CloudShell / Google Cloud Shell, or in the terminal. To create a snapshot in the
terminal you must have the AWS CLI tool (`aws`) or Google's CLI (`gcloud`)
installed and configured.

Here’s the command to create a snapshot on GCP:

```terminal
gcloud compute disks snapshot <cloud-disk-id> --project=<gcp-project-id> --snapshot-names=<set-new-snapshot-name> --zone=<availability-zone> --storage-location=<region>
```

{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/create-volume-snapshot-gcp.png" alt="Screenshot of a terminal showing volume snapshot creation on GCP" title="GCP snapshot creation" >}}

GCP identifies the disk by its PVC name, so it’s a direct mapping. On AWS, you first need to
find the volume by the `CSIVolumeName` AWS tag (its value is the PV name); that volume ID is then used for snapshot creation.

{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/identify-volume-aws.png" alt="Screenshot of AWS web console, showing EBS volume identification" title="Identify disk ID on AWS" >}}
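
If you prefer the CLI to the console, a tag filter should surface the same volume ID. This is a sketch
and assumes the `aws` CLI is configured for the right account and region:

```terminal
# Look up the EBS volume whose CSIVolumeName tag matches the PV name from Step 1
aws ec2 describe-volumes --filters "Name=tag:CSIVolumeName,Values=<pvc-volume-name>" --query 'Volumes[*].VolumeId' --output text
```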

Make a note of the volume ID (`vol-00c7ecd873c6fb3ec` here) and either create the EBS snapshot in the AWS Console, or use the `aws` CLI:

```terminal
aws ec2 create-snapshot --volume-id '<volume-id>' --description '<set-new-snapshot-name>' --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=<set-new-snapshot-name>}]'
```

## Step 3: Get your Disk Snapshot ID

In AWS, the command above will output something similar to:
```terminal
"SnapshotId": "snap-09ed24a70bc19bbe4"
```
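
If you’ve lost that output, the snapshot can be looked up again by the description you gave it. A sketch,
reusing the same placeholder:

```terminal
aws ec2 describe-snapshots --owner-ids self --filters "Name=description,Values=<set-new-snapshot-name>" --query 'Snapshots[*].SnapshotId' --output text
```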

If you’re using the GCP cloud, you can get the snapshot ID from the gcloud command by querying for the snapshot’s given name:

```terminal
gcloud compute snapshots --project=<gcp-project-id> describe <new-snapshot-name> | grep id:
```

You should get output similar to:
```
id: 6645363163809389170
```
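
If you’d rather not pipe through `grep`, `gcloud` can print just that field. A sketch with the same names:

```terminal
gcloud compute snapshots describe <new-snapshot-name> --project=<gcp-project-id> --format='value(id)'
```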

## Step 4: Create a development environment for each team

Now I have my Golden Snapshot, which is immutable data. Each team will get a copy
of this data, and team members can modify it as they see fit, since a new EBS volume /
persistent disk will be created for each team.

Below I will define a manifest for each namespace. To save time, you can replace
the namespace name (such as changing “sandbox01” → “sandbox42”) using tools
such as `sed` or `yq`, with Kubernetes-aware templating tools like
[Kustomize](/docs/tasks/manage-kubernetes-objects/kustomization/),
or using variable substitution in a CI/CD pipeline.

Here's an example manifest:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: postgresql-orders-db-sandbox01
  namespace: sandbox01
spec:
  deletionPolicy: Retain
  driver: pd.csi.storage.gke.io
  source:
    snapshotHandle: 'gcp/projects/staging-eu-castai-vt5hy2/global/snapshots/6645363163809389170'
  volumeSnapshotRef:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    namespace: sandbox01
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgresql-orders-db-snap
  namespace: sandbox01
spec:
  source:
    volumeSnapshotContentName: postgresql-orders-db-sandbox01
```

In Kubernetes, VolumeSnapshotContent (VSC) objects are not namespaced.
However, I need a separate VSC for each different namespace to use, so the
`metadata.name` of each VSC must also be different. To make that straightforward,
I used the target namespace as part of the name.
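
As an example of the substitution mentioned earlier, stamping the same manifest out for another team could
be as simple as the following sketch, where `golden-snapshot.yaml` is a hypothetical file holding the manifest above:

```terminal
sed 's/sandbox01/sandbox42/g' golden-snapshot.yaml | kubectl apply -f -
```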

Now it’s time to replace the driver field with the CSI (Container Storage Interface) driver
installed in your K8s cluster. Major cloud providers have CSI drivers for block storage that
support VolumeSnapshots, but quite often CSI drivers are not installed by default; consult
your Kubernetes provider.
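
One way to see what’s already installed is to list the registered CSI drivers and, assuming the
external-snapshotter CRDs are present, the snapshot classes they expose:

```terminal
kubectl get csidrivers
kubectl get volumesnapshotclasses
```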

The manifest above defines a VSC that works on GCP.
On AWS, the driver and snapshotHandle values might look like:

```yaml
driver: ebs.csi.aws.com
source:
  snapshotHandle: "snap-07ff83d328c981c98"
```

At this point, I need to use the *Retain* policy, so that the CSI driver doesn’t try to
delete my manually created EBS disk snapshot.

For GCP, you will have to build this string by hand by adding the full project ID and snapshot ID.
For AWS, it’s just the plain snapshot ID.

The VSC also requires specifying which VolumeSnapshot (VS) will use it, so the VSC and the VS
reference each other.

Now I can create a PersistentVolumeClaim from the VS above. It’s important to set this first:

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pv-claim
  namespace: sandbox01
spec:
  dataSource:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 21Gi
```

If the default StorageClass has the [WaitForFirstConsumer](/docs/concepts/storage/storage-classes/#volume-binding-mode) policy,
then the actual cloud disk will be created from the Golden Snapshot only when a Pod that uses that PVC is scheduled.
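
You can check the binding mode of a StorageClass with something like this (substitute the name of your default class):

```terminal
kubectl get storageclass <storageclass-name> -o jsonpath='{.volumeBindingMode}'
```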

Now I assign that PVC to my Pod (in my case, it’s PostgreSQL) as I would with any other PVC.
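
A minimal sketch of how that might look as a Deployment; the image, labels, and mount path are assumptions,
so wire in your real workload instead:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: sandbox01
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:13   # hypothetical image tag
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data   # assumed data directory
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-pv-claim   # the PVC created from the Golden Snapshot
```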

You can then check that the whole chain was created and is healthy:

```terminal
kubectl -n <namespace> get volumesnapshotcontent,volumesnapshot,pvc,pod
```

Both the VS and VSC should be *READYTOUSE* true, the PVC bound, and the Pod (from a Deployment or StatefulSet) running.

**To keep on using data from my Golden Snapshot, I just need to repeat this for the
next namespace and voilà! No need to waste time and compute resources on the duplication process.**