Autoscale large images faster using Longhorn (distributed storage)

Understanding the problem

Downloading model data from a GCS bucket

Using NFS in Kubernetes

Architecture Diagram

  1. Trigger the pipeline to download model data from Google Cloud Storage
  2. The pipeline connects to a bastion host, which execs into a pod in the cluster dedicated to updating the persistent volume claim
  3. The pod downloads data from Google Cloud Storage
  4. The pod then syncs the data to the persistent volume claim
  5. Once the data has been updated in the volume, a fresh deployment is rolled out to create new pods that use the new data. Because the model file is so large, we kept the switch to the new model behind a manual trigger to make sure the download is clean. A sketch of the download-and-sync step follows this list.
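As a rough sketch of steps 3 and 4, the updater pod can pull the model from the bucket and sync it into the mounted volume with gsutil. The pod name, bucket name, and mount path below are placeholders, not values from our setup:

# From the bastion host: exec into the updater pod (name is a placeholder,
# and the pod image is assumed to ship gsutil) and sync the bucket into the
# PVC mount path. -m runs the rsync with parallel transfers.
kubectl exec -it model-updater-0 -- \
  gsutil -m rsync -r gs://your-model-bucket/models /data/models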

Longhorn Installation

GKE Prerequisite

kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=<name@example.com>
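If you would rather not type the email by hand, the same binding can be created from your active gcloud account (assuming the gcloud CLI is authenticated as a cluster admin):

kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user="$(gcloud config get-value account)"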

Creation of node pool
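A dedicated node pool for the Longhorn storage nodes can be created with gcloud; the cluster name, zone, machine type, node count, and disk size below are placeholders to adjust for your workload:

# Create a node pool reserved for Longhorn storage nodes (all values are examples).
gcloud container node-pools create longhorn-pool \
  --cluster=your-cluster \
  --zone=us-central1-a \
  --machine-type=e2-standard-4 \
  --num-nodes=3 \
  --disk-size=200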

Installation

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.3.1/deploy/longhorn.yaml

To keep the Longhorn components on the dedicated node pool, a node affinity along these lines can be added to the Longhorn workloads:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-nodepool
              operator: In
              values:
                - <YOUR_NODE_POOL_NAME>
                - <YOUR_NODE_POOL_NAME2>
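With Longhorn installed, its longhorn StorageClass can back the model volume from the architecture above as a regular PersistentVolumeClaim. The claim name and size below are placeholders; a ReadWriteMany claim is one way to let the updater pod and the serving pods mount the same volume:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-data              # assumed claim name
spec:
  accessModes:
    - ReadWriteMany             # shared between the updater pod and serving pods
  storageClassName: longhorn    # StorageClass created by the Longhorn install
  resources:
    requests:
      storage: 50Gi             # placeholder; size this for your model files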

Other hacks to improve performance by over 50%

  1. If you’re on GKE, container image streaming is a must. Even for large images like ours, GKE is able to spin up pods blazingly fast.
  2. Use the gcloud alpha storage command to take advantage of multithreading while downloading. This easily improves download performance by 79%–94%. On AWS you can use s5cmd. A sketch of both tips follows this list.
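A sketch of both tips, with the cluster, bucket, and destination names as placeholders:

# Enable container image streaming on an existing GKE cluster
# (images must be hosted in Artifact Registry).
gcloud container clusters update your-cluster --enable-image-streaming

# Multithreaded download from GCS with the alpha storage command.
gcloud alpha storage cp -r gs://your-model-bucket/models /data/models

# Equivalent parallel download on AWS using s5cmd.
s5cmd cp "s3://your-model-bucket/models/*" /data/models/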

Reading Materials

  1. GitLab agent for GKE
  2. Longhorn documentation
