Model Caching (PV/PVC)

Model caching is a method for pre-loading models in MLIS to reduce inference service startup times and improve performance. This approach uses a ReadWriteMany (RWX) PersistentVolumeClaim (PVC) that can be accessed across multiple namespaces where you deploy your inference services.

During installation, when model caching is enabled, the MLIS controller automatically sets up and manages this PVC. When you deploy an inference service, the system creates copies of the underlying PersistentVolume (PV) and PVC in the target namespace, ensuring efficient access to cached models.

Benefits

  • Improved startup time performance
  • Automatic management by the MLIS controller
  • Efficient access to models across multiple namespaces

Key features

  • Utilizes a ReadWriteMany (RWX) PersistentVolumeClaim (PVC)
  • Creates efficient PV and PVC copies in target namespaces upon deployment
  • Ensures frequently used models are readily available

Model caching is particularly useful for optimizing resource usage and enhancing overall system efficiency in large-scale deployments.

Before You Start

  • You must be logged into MLIS as a user with an Admin role to enable model caching.
  • You should be familiar with PersistentVolumeClaims, PersistentVolumes, and StorageClasses.
  • You must have a ReadWriteMany StorageClass configured on your Kubernetes cluster (for example, nfs-client if you use the default configuration of nfs-subdir-external-provisioner; otherwise, the name is arbitrary and will vary).

Model Caching Options

The full set of options for configuring model caching is provided in the default values.yaml and can be reviewed in the following section.


Check Available StorageClasses

Verify that your Kubernetes cluster has a StorageClass that supports ReadWriteMany (RWX) access mode.

  1. Run the following command to list available StorageClasses:

    kubectl get storageclasses.storage.k8s.io
    NAME                     PROVISIONER                                     RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
    nfs-client               cluster.local/nfs-subdir-external-provisioner   Delete          Immediate              true                   58m
    premium-rwo              pd.csi.storage.gke.io                           Delete          WaitForFirstConsumer   true                   27h
    standard                 kubernetes.io/gce-pd                            Delete          Immediate              true                   27h
    standard-rwo (default)   pd.csi.storage.gke.io                           Delete          WaitForFirstConsumer   true                   
  2. Copy the StorageClass name you want to use when enabling model caching.
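
The StorageClass listing does not show which access modes a provisioner supports. One way to check is to create a small throwaway PVC that requests ReadWriteMany and see whether it binds. The following is a minimal sketch; the function name rwx_probe_manifest, the PVC name rwx-probe, and the 1Gi size are illustrative choices, not MLIS defaults.

```shell
# Print a throwaway PVC manifest that requests ReadWriteMany from the
# given StorageClass. "rwx-probe" and 1Gi are illustrative values.
rwx_probe_manifest() {
  local storage_class="$1"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-probe
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${storage_class}
  resources:
    requests:
      storage: 1Gi
EOF
}

# Usage (requires cluster access):
#   rwx_probe_manifest nfs-client | kubectl apply -f -
#   kubectl get pvc rwx-probe        # check STATUS and ACCESS MODES
#   kubectl describe pvc rwx-probe   # provisioning errors appear under Events
#   kubectl delete pvc rwx-probe
```

A claim that stays Pending with a provisioning error suggests the class cannot serve RWX volumes. Classes that use the WaitForFirstConsumer binding mode stay Pending until a pod mounts the claim, so this check is most informative for classes with Immediate binding.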

Manually Provisioned PV for RWX Access (Advanced)

If you don’t have an appropriate ReadWriteMany StorageClass configured on your Kubernetes cluster, you can provide a manually configured PV that matches all of the following:

  • Access mode: ReadWriteMany (RWX)
  • Size: modelsCacheStorage.storageSize’s value in your Helm values
  • Storage class: modelsCacheStorage.storageClassName’s value in your Helm values
  • NFS: Your cluster’s NFS server IP address and the path to the NFS server’s filestore directory

Example

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: <STORAGE_SIZE>
  accessModes:
    - ReadWriteMany
  nfs:
    path: <NFS_SERVER_PATH>
    server: <NFS_SERVER_IP>
  persistentVolumeReclaimPolicy: Retain
  storageClassName: <STORAGE_CLASS_NAME>

Enable Model Caching

If you have a ReadWriteMany StorageClass that maps to NFS or a cluster filesystem using a hostPath volume, you can specify the StorageClass name in the Helm values during the install.

Example

modelsCacheStorage:
  enabled: true
  storageSize: 300Gi
  storageClassName: <NFS_STORAGE_CLASS_NAME> # nfs-client if using nfs-subdir-external-provisioner's default configuration.

When MLIS is installed with model caching enabled, a PVC will be created with a name following the format aioli-models-cache-pvc-<release-name>. This PVC will be bound to an appropriate PV based on the storageClassName you specified.

kubectl get pvc -A | grep -E "^NAMESPACE|aioli-models-cache-pvc-aioli"
NAMESPACE   NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           VOLUMEATTRIBUTESCLASS   AGE
default     aioli-models-cache-pvc-aioli   Bound    pvc-faf71fe8-a2d3-4fb4-9ce2-506c76af08df   300Gi      RWX            <STORAGE_CLASS_NAME>   <unset>                 163m
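
For scripted installs, the same check can be automated. The sketch below builds the PVC name from the release name using the format described above and waits for the claim to reach the Bound phase. The function names and the 120-second timeout are assumptions made for illustration, the jsonpath form of kubectl wait needs a reasonably recent kubectl, and the sketch does not account for a custom pvcNameSuffix.

```shell
# Build the cache PVC name from the Helm release name, following the
# aioli-models-cache-pvc-<release-name> format.
models_cache_pvc_name() {
  printf 'aioli-models-cache-pvc-%s\n' "$1"
}

# Block until the cache PVC reports phase Bound, or fail after the timeout.
# Arguments: <release-name> <namespace>
wait_for_models_cache_pvc() {
  kubectl wait --for=jsonpath='{.status.phase}'=Bound \
    "pvc/$(models_cache_pvc_name "$1")" \
    --namespace "$2" --timeout=120s
}

# Usage (requires cluster access):
#   wait_for_models_cache_pvc aioli default
```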

NFS Storage Considerations

Many Kubernetes clusters use NFS-based storage solutions, such as the nfs-subdir-external-provisioner, which provides automatic provisioning of PVCs to an NFS endpoint through the nfs-client StorageClass. This setup offers several advantages for model caching:

  • Direct filesystem access: On Kubernetes systems with nfs-client storage, you can directly access the filesystem, making cleanup and management easier.
  • Automatic provisioning: PVCs are automatically provisioned to an NFS endpoint, simplifying storage management.
  • Flexible naming: Persistent volumes are typically provisioned with names following a defined format.

However, it’s important to be aware of some limitations when using NFS storage for model caching:

  • Storage guarantees: The provisioned storage is not guaranteed. You may allocate more than the NFS share’s total size, and the share may not have enough storage space to accommodate all requests.
  • Storage limits: The provisioned storage limit is not enforced. Applications can expand to use all available storage regardless of the provisioned size.
  • Resizing limitations: Storage resize or expansion operations are not currently supported. Attempting to resize may result in an error state.

Before enabling model caching with NFS storage, ensure that:

  • Your Kubernetes nodes can communicate with the NFS server at the host level.
  • The NFS server is properly configured before using the nfs-client and provisioner.
  • You monitor storage usage to prevent overallocation or running out of space.
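
Because NFS does not enforce the provisioned size, usage has to be checked on the share itself. As a rough sketch, the helpers below parse POSIX-format df output and report whether usage has reached a threshold. The function names and the 80% threshold are illustrative assumptions; the df output can come from the NFS server or from any pod that mounts the cache PVC, such as the temporary pod described in Clean Up Cached Models.

```shell
# Read `df -P <path>` output on stdin and print the usage percentage
# (without the % sign) of the first filesystem listed.
df_use_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold percentage.
# Argument: <threshold-percent>; expects `df -P` output on stdin.
cache_nearly_full() {
  [ "$(df_use_percent)" -ge "$1" ]
}

# Usage (requires a pod that mounts the cache PVC at /mnt/models):
#   kubectl exec pvc-cleanup -- df -P /mnt/models | cache_nearly_full 80 \
#     && echo "model cache volume is above 80% full"
```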

Configure Caching Behavior

Adjust the caching behavior using the following parameters:

  • checkUnusedCachedModelsEvery: How frequently the controller checks for and deletes unused cached models.
  • purgeUnusedCachedModelsAfter: How long the controller retains unused cached models.

The time unit can be any of the following: s (seconds), m (minutes), h (hours), d (days), or w (weeks).

modelsCacheStorage:
  enabled: true
  checkUnusedCachedModelsEvery: 1d
  purgeUnusedCachedModelsAfter: 1w
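
To compare the two settings, it can help to convert them to a common unit. The helper below is a small sketch that converts a value such as 1d or 1w to seconds. It assumes a whole number followed by exactly one of the units listed above; the controller's own parser may accept other forms, and the function name is hypothetical.

```shell
# Convert a duration such as 30m, 1d, or 1w to seconds.
# Assumes a whole number followed by exactly one unit (s, m, h, d, w).
duration_to_seconds() {
  local value unit
  value="${1%?}"          # everything except the last character
  unit="${1#"$value"}"    # the last character
  case "$value" in
    ''|*[!0-9]*) echo "invalid duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    w) echo $((value * 604800)) ;;
    *) echo "invalid unit in: $1" >&2; return 1 ;;
  esac
}

# Usage: warn if the check interval is longer than the purge period.
#   check="$(duration_to_seconds 1d)"
#   purge="$(duration_to_seconds 1w)"
#   [ "$check" -le "$purge" ] || echo "check interval exceeds purge period" >&2
```

With the values shown above, a daily check means an unused model is removed at most about one day after its one-week retention period ends.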

Enable Caching for Specific Models

Caching must be explicitly enabled for each model. Users with an Admin role can use the CLI or UI to enable caching for individual models. Once enabled, new versions of the model continue to have caching enabled, even if the new version is created by a user with the Maintainer role.

aioli model update <model-name> --enable-caching/--disable-caching

Disable Caching for Specific Deployments

Model caching can be temporarily disabled for a particular deployment by setting the environment variable AIOLI_DISABLE_MODEL_CACHE=true in the deployment. This may be useful if the disk used for the shared network storage becomes full, causing deployments to fail. With caching disabled, the deployment reverts to using the local disk on the pod, at the expense of downloading the model for each deployment replica.

Manage Cached Models

  • Models are automatically removed when deleted from the database
  • Unused models are purged based on the configured time period
  • The MLIS controller manages the cache content

Switching StorageClasses

If you need to switch to a different StorageClass (e.g., to increase storage size or for performance reasons), you can use the pvcNameSuffix option to create a new PVC. Models will be re-cached on the new PVC.

  1. Upgrade MLIS with the new StorageClass and a unique PVC name suffix by adding the following to your upgrade command:
    # All of the original values from your Helm install should also be included (here, via my-values.yaml).
    helm upgrade aioli aioli-1.3.0.tgz \
      -f my-values.yaml \
      --set modelsCacheStorage.enabled=true \
      --set modelsCacheStorage.pvcNameSuffix="<UNIQUE_SUFFIX>" \
      --set modelsCacheStorage.storageClassName="<NFS_STORAGE_CLASS_NAME>" \
      --set modelsCacheStorage.storageSize=1000Gi
  2. After the upgrade, perform a canary rollout of the model for a seamless transition. This rollout must include at least one update to your deployment or model settings to trigger the canary rollout.
    aioli deployment update <DEPLOYMENT_NAME> \
    --model <PACKAGED_MODEL_NAME> \
    --canary-percentage 100

At this point, you can optionally perform the cleanup process to remove any leftover cached models from the old PV/PVC.


Disable PV/PVC Caching

  1. To disable model caching, perform a Helm upgrade with modelsCacheStorage.enabled set to false.
    # All of the original values from your Helm install should also be included (here, via my-values.yaml).
    helm upgrade aioli aioli-1.3.0.tgz \
      -f my-values.yaml \
      --set "global.imagePullSecrets[0].name=regcred" \
      --set modelsCacheStorage.enabled=false
  2. Models continue to use space on the PV/PVC until a canary rollout is performed. This rollout must include at least one update to your deployment or model settings to trigger the canary rollout.
    aioli deployment update <DEPLOYMENT_NAME> \
    --model <PACKAGED_MODEL_NAME> \
    --canary-percentage 100

At this point, you can optionally perform the cleanup process to remove the cached models from the PV/PVC.


Clean Up Cached Models

After extending the PV/PVC or before uninstalling MLIS, it’s important to clean up old models to free up storage space on the PV. This process should be done before deleting the old PVC, as PVC deletion doesn’t automatically clean up the storage used on the PV.

If you don’t have direct access to the PV server, you can create a temporary pod to access and clean up the models:

  1. Create a cleanup pod:
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: pvc-cleanup
      namespace: default
    spec:
      containers:
      - image: busybox
        name: pvc-inspector
        command: ["tail"]
        args: ["-f", "/dev/null"]
        volumeMounts:
        - mountPath: /mnt/models
          name: pvc-mount
      volumes:
      - name: pvc-mount
        persistentVolumeClaim:
          claimName: aioli-models-cache-pvc-aioli
    EOF
  2. Access the pod and remove the models:
    kubectl exec -it pvc-cleanup -- sh
    # The remaining commands run in the shell inside the pod.
    cd /mnt/models/
    ls -l  # List contents to identify model directories
    rm -rf <model-directory-to-remove>/
  3. Verify the space has been freed:
    df -h /mnt/models
  4. Remove the cleanup pod when finished:
    kubectl delete pod pvc-cleanup
  5. Delete the old PVC:
    kubectl delete pvc aioli-models-cache-pvc-aioli
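
Before removing anything in step 2, it can help to see which directories use the most space. The helper below is a small sketch meant to be pasted into the shell inside the cleanup pod; the function name is hypothetical, and it assumes each cached model lives in its own top-level directory under the mount path.

```shell
# List the immediate subdirectories of a path, largest first,
# with sizes in KiB in the first column.
largest_dirs() {
  du -sk "$1"/*/ 2>/dev/null | sort -rn
}

# Usage (inside the cleanup pod):
#   largest_dirs /mnt/models
```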

Troubleshooting

  • If the PVC remains in Pending state, check StorageClass availability
  • Ensure the PV supports cloning across namespaces (NFS or hostPath)
  • For custom storage types, use modelsCacheStorage.bypassStorageCheck: true

StorageClass Compatibility

If you specify a storageClassName that results in a persistent volume type other than NFS or hostPath, it will be ignored and an ERROR will be logged in the MLIS controller log. If you are certain that your CSI driver or other storage type results in a PV that can be cloned between namespaces, you can bypass the type check by setting modelsCacheStorage.bypassStorageCheck: true in your Helm values.
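
To see which volume type a StorageClass actually produced, you can inspect the bound PV. The sketch below classifies a PV from two fields of its spec, which kubectl prints as empty strings when they are absent. The function name and the "|" separator are illustrative choices; a result of "other" corresponds to the case where the type check described above would reject the volume unless bypassStorageCheck is set.

```shell
# Classify a PV from the string "<nfs-server>|<hostPath-path>".
# Prints nfs, hostPath, or other.
classify_pv_source() {
  local nfs_server rest host_path
  nfs_server="${1%%|*}"   # text before the first "|"
  rest="${1#*|}"          # text after the first "|"
  host_path="${rest%%|*}"
  if [ -n "$nfs_server" ]; then
    echo nfs
  elif [ -n "$host_path" ]; then
    echo hostPath
  else
    echo other
  fi
}

# Usage (requires cluster access; <PV_NAME> is the VOLUME column of the PVC):
#   classify_pv_source "$(kubectl get pv <PV_NAME> \
#     -o jsonpath='{.spec.nfs.server}|{.spec.hostPath.path}')"
```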