add blog post on handling data duplication in data-heavy environments
---
layout: blog
title: "How to Handle Data Duplication in Data-Heavy Kubernetes Environments"
date: 2021-09-07
slug: how-to-handle-data-duplication-in-data-heavy-kubernetes-environments
---

**Authors:**
Augustinas Stirbis (CAST AI)

## Why Duplicate Data?

It’s convenient to create a copy of your application with a copy of its state for each team.
For example, you might want a separate database copy to test some significant schema changes
or to develop other disruptive operations like bulk insert/delete/update...

**Duplicating data takes a lot of time.** That’s because you first need to download
all the data from a source block storage provider to compute resources, and then send
it back to a storage provider again. A lot of network traffic and CPU/RAM is used in this process.
Offloading certain expensive operations to dedicated hardware is
**always a huge performance boost**. It reduces the time required to complete an operation by orders
of magnitude.

## Volume Snapshots to the rescue

Kubernetes introduced [VolumeSnapshots](/docs/concepts/storage/volume-snapshots/) as alpha in 1.12,
beta in 1.17, and the Generally Available version in 1.20.
VolumeSnapshots use specialized APIs from storage providers to duplicate a volume of data.
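
Before relying on them, it’s worth confirming that the snapshot API is actually served by your cluster; a quick check (assuming you have `kubectl` access) could look like this:

```terminal
# List the resources served by the snapshot API group; empty output suggests the
# VolumeSnapshot CRDs and snapshot controller are not installed yet.
kubectl api-resources --api-group=snapshot.storage.k8s.io
```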

Since data is already on the same storage device (or array of devices), duplicating data is usually
a metadata operation for storage providers with local snapshots (the majority of on-premises storage providers).
All you need to do is point a new disk at an immutable snapshot and only
save deltas (or let it do a full-disk copy). As an operation inside the storage back-end,
it’s much quicker and usually doesn’t involve sending traffic over the network.
Public cloud storage providers work a bit differently under the hood. They save snapshots
to object storage and then copy them back from object storage to block storage when "duplicating" a disk.
Technically, a lot of compute and network resources are spent on the cloud provider’s side,
but from the Kubernetes user’s perspective VolumeSnapshots work the same way whether the snapshot
storage provider is local or remote, and no compute or network resources from your cluster are involved in the operation.

## Sounds like we have our solution, right?

Actually, VolumeSnapshots are namespaced, and Kubernetes protects namespaced data from
being shared between tenants (Namespaces). This Kubernetes limitation is a conscious design
decision so that a Pod running in a different namespace can’t mount another application’s
[PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) (PVC).

One way around it would be to create multiple volumes with duplicate data in one namespace.
However, you could easily reference the wrong copy.

So the idea is to separate teams/initiatives by namespaces to avoid that and generally
limit access to the production namespace.

## Solution? Creating a Golden Snapshot externally

Another way around this design limitation is to create a snapshot externally (not through Kubernetes).
This is also called pre-provisioning a snapshot manually. Next, I will import it
as a multi-tenant golden snapshot that can be used for many namespaces. The illustration below uses
the AWS EBS (Elastic Block Store) and GCE PD (Persistent Disk) services.

### High-level plan for preparing the Golden Snapshot

1. Identify the disk (EBS/Persistent Disk) with the data that you want to clone in the cloud provider
2. Make a disk snapshot (in the cloud provider console)
3. Get the disk snapshot ID

### High-level plan for cloning data for each team

1. Create the Namespace “sandbox01”
2. Import the disk snapshot (ID) as a VolumeSnapshotContent into Kubernetes
3. Create a VolumeSnapshot in the Namespace “sandbox01” mapped to the VolumeSnapshotContent
4. Create the PersistentVolumeClaim from the VolumeSnapshot
5. Install the Deployment or StatefulSet with the PVC

## Step 1: Identify Disk

First, you need to identify your golden source. In my case, it’s a PostgreSQL database
on PersistentVolumeClaim “postgres-pv-claim” in the “production” namespace.

```terminal
kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'
```

The output will look similar to:
```
pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9
```

## Step 2: Prepare your golden source

You need to do this once, or every time you want to refresh your golden data.

### Make a Disk Snapshot

Go to the AWS EC2 or GCP Compute Engine console and search for an EBS volume
(on AWS) or Persistent Disk (on GCP) that has a label matching the last output.
In this case I saw: `pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9`.
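
If you prefer the terminal over the console, on GCP a name filter along these lines should locate the same disk (the project ID placeholder and the `pvc-...` name come from the example above):

```terminal
# List disks whose name matches the PersistentVolume name from step 1.
gcloud compute disks list --project=<gcp-project-id> --filter="name=pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9"
```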

Click on Create snapshot and give it a name. You can do it in the console manually,
in AWS CloudShell / Google Cloud Shell, or in the terminal. To create a snapshot in the
terminal you must have the AWS CLI tool (`aws`) or Google's CLI (`gcloud`)
installed and configured.

Here’s the command to create a snapshot on GCP:

```terminal
gcloud compute disks snapshot <cloud-disk-id> --project=<gcp-project-id> --snapshot-names=<set-new-snapshot-name> --zone=<availability-zone> --storage-location=<region>
```
{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/create-volume-snapshot-gcp.png" alt="Screenshot of a terminal showing volume snapshot creation on GCP" title="GCP snapshot creation" >}}

GCP identifies the disk by its PVC name, so it’s a direct mapping. In AWS, you first need to
find the volume by its `CSIVolumeName` AWS tag (whose value is the PV name from step 1); that volume ID is then used for snapshot creation.

{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/identify-volume-aws.png" alt="Screenshot of AWS web console, showing EBS volume identification" title="Identify disk ID on AWS" >}}
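
For the CLI route on AWS, a tag-based lookup roughly like the following should return the volume ID (assuming the EBS CSI driver tagged the volume with `CSIVolumeName` set to the PV name, as above):

```terminal
# Look up the EBS volume whose CSIVolumeName tag matches the PV name from step 1.
aws ec2 describe-volumes \
  --filters "Name=tag:CSIVolumeName,Values=pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9" \
  --query 'Volumes[*].VolumeId' --output text
```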

Mark down the volume ID (`vol-00c7ecd873c6fb3ec`) and either create the EBS snapshot in the AWS Console, or use the `aws` CLI.

```terminal
aws ec2 create-snapshot --volume-id '<volume-id>' --description '<set-new-snapshot-name>' --tag-specifications 'ResourceType=snapshot'
```

## Step 3: Get your Disk Snapshot ID

In AWS, the command above will output something similar to:
```terminal
"SnapshotId": "snap-09ed24a70bc19bbe4"
```
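
If you created the snapshot through the console instead, you can look its ID up afterwards; one way (filtering on the description you gave it, which is an assumption about how you named it) is:

```terminal
# Find the snapshot ID by the description set at creation time.
aws ec2 describe-snapshots --owner-ids self \
  --filters "Name=description,Values=<set-new-snapshot-name>" \
  --query 'Snapshots[*].SnapshotId' --output text
```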

If you’re using the GCP cloud, you can get the snapshot ID from the gcloud command by querying for the snapshot’s given name:

```terminal
gcloud compute snapshots --project=<gcp-project-id> describe <new-snapshot-name> | grep id:
```
You should get similar output to:
```
id: 6645363163809389170
```

## Step 4: Create a development environment for each team

Now I have my Golden Snapshot, which is immutable data. Each team will get a copy
of this data, and team members can modify it as they see fit, given that a new EBS/persistent
disk will be created for each team.

Below I will define a manifest for each namespace. To save time, you can replace
the namespace name (such as changing “sandbox01” → “sandbox42”) using tools
such as `sed` or `yq`, with Kubernetes-aware templating tools like
[Kustomize](/docs/tasks/manage-kubernetes-objects/kustomization/),
or using variable substitution in a CI/CD pipeline.

Here's an example manifest:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: postgresql-orders-db-sandbox01
spec:
  deletionPolicy: Retain
  driver: pd.csi.storage.gke.io
  source:
    snapshotHandle: 'gcp/projects/staging-eu-castai-vt5hy2/global/snapshots/6645363163809389170'
  volumeSnapshotRef:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    namespace: sandbox01
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgresql-orders-db-snap
  namespace: sandbox01
spec:
  source:
    volumeSnapshotContentName: postgresql-orders-db-sandbox01
```

In Kubernetes, VolumeSnapshotContent (VSC) objects are not namespaced.
However, I need a separate VSC for each namespace that will use it, so the
`metadata.name` of each VSC must also be different. To make that straightforward,
I used the target namespace as part of the name.
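
As a minimal sketch of the per-namespace templating mentioned earlier (assuming the manifest above is saved in a file I'm calling `golden-snapshot.yaml`), stamping out and applying a copy for another team could look like:

```terminal
# Create the target namespace, then render the sandbox01 manifest for it and apply.
kubectl create namespace sandbox42
sed 's/sandbox01/sandbox42/g' golden-snapshot.yaml | kubectl apply -f -
```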

Now it’s time to replace the driver field with the CSI (Container Storage Interface) driver
installed in your K8s cluster. Major cloud providers have CSI drivers for block storage that
support VolumeSnapshots, but quite often CSI drivers are not installed by default; consult
your Kubernetes provider.
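
One quick way to see which CSI drivers are registered in your cluster (and therefore what value the `driver` field should have) is:

```terminal
# List the CSI drivers registered with the cluster.
kubectl get csidrivers
```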

The manifest above defines a VSC that works on GCP.
On AWS, the driver and snapshotHandle values might look like:

```yaml
driver: ebs.csi.aws.com
source:
  snapshotHandle: "snap-07ff83d328c981c98"
```

At this point, I need to use the *Retain* policy, so that the CSI driver doesn’t try to
delete my manually created EBS disk snapshot.

For GCP, you will have to build this string by hand: add the full project ID and snapshot ID.
For AWS, it’s just the plain snapshot ID.

The VSC also requires specifying which VolumeSnapshot (VS) will use it, so the VSC and VS
reference each other.

Now I can create a PersistentVolumeClaim from the VS above. It’s important to set the
`dataSource` to reference the VolumeSnapshot:

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pv-claim
  namespace: sandbox01
spec:
  dataSource:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 21Gi
```

If the default StorageClass has the [WaitForFirstConsumer](/docs/concepts/storage/storage-classes/#volume-binding-mode) policy,
then the actual cloud disk will be created from the Golden Snapshot only when some Pod binds that PVC.
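
To check which binding mode your StorageClasses use, something along these lines works:

```terminal
# Show each StorageClass together with its volume binding mode.
kubectl get storageclass -o custom-columns='NAME:.metadata.name,BINDINGMODE:.volumeBindingMode'
```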

Now I assign that PVC to my Pod (in my case, it’s PostgreSQL) as I would with any other PVC.

```terminal
kubectl -n <namespace> get volumesnapshotcontent,volumesnapshot,pvc,pod
```

Both the VS and VSC should show *READYTOUSE* true, the PVC should be bound, and the Pod (from the Deployment or StatefulSet) should be running.

**To keep using data from my Golden Snapshot, I just need to repeat this for the
next namespace and voilà! No need to waste time and compute resources on the duplication process.**