Manual Model Pre-loading (PVC)

Manual pre-loading is an advanced method that stores models for MLIS on PersistentVolumeClaims (PVCs). It offers more control for experienced users with specific storage requirements, making it ideal for scenarios involving OpenLLM or bento-archive models. However, this approach requires a deep understanding of Kubernetes and adds complexity, particularly for tasks like canary deployments.

Best Practice
Each model, including different versions of the same model, should be pre-loaded into its own separate directory. Do not store different versions of a model in the same directory. To avoid confusion, it’s recommended to include the model version in the directory name.
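
As a minimal sketch of this naming convention, the hypothetical shell helper below (`model_dir` is not part of MLIS) builds one directory path per model version. The `/mnt/models` base path matches the mount point used in the example later on this page; the model name and version arguments are illustrative.

```shell
# Hypothetical helper (not part of MLIS): build one directory path per model version.
# Usage: model_dir <base-path> <model-name> <version>
model_dir() {
  printf '%s/%s-%s\n' "$1" "$2" "$3"
}

# Two versions of the same model resolve to two separate directories.
model_dir /mnt/models meta-llama 3-8b-instruct
model_dir /mnt/models meta-llama 3.1-8b-instruct
```

The two calls print `/mnt/models/meta-llama-3-8b-instruct` and `/mnt/models/meta-llama-3.1-8b-instruct`, so neither version can overwrite the other.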

Benefits

  • Greater control over model storage and access
  • Fine-grained control over model management
  • Ideal for models not compatible with automated caching

Key Features

  • Uses PersistentVolumeClaims (PVCs) for model storage
  • Supports custom storage mapping
  • Provides manual control over model placement and lifecycle

Manual pre-loading is particularly useful for OpenLLM or bento-archive models and scenarios requiring specific storage configurations. However, it requires more in-depth knowledge of Kubernetes and may add complexity to operations like canary deployments.

When to Use Manual Pre-loading

For most users, the simpler Model Caching method is recommended. Consider manual pre-loading if you:

  • Need to pre-load models incompatible with standard caching
  • Have specific storage needs not supported by model caching
  • Possess advanced Kubernetes knowledge and can manage PVCs manually
  • Require precise control over model lifecycles

Before You Start

To use Manual Pre-loading with PVCs, administrators need to ensure the following:

  • Create a PVC: You must create a PVC in the same Kubernetes namespace where you intend to deploy your inference service.
  • Sufficient Storage: Ensure the PVC has enough storage capacity to hold each model in full, based on expected model sizes.
  • Model Requirements: Review any requirements specific to the model you want to pre-load, such as the HuggingFace token described in the note below.
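
A minimal sketch of the first two requirements: the shell snippet below writes a PVC manifest to a file. The claim name `models-pvc` matches the example later on this page. The file name, the `ReadWriteOnce` access mode, the `50Gi` request, and the reliance on the cluster's default storage class are all assumptions to adjust for your environment.

```shell
# Write a PVC manifest to a file. File name, access mode, and size are assumptions.
cat > models-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc          # referenced by claimName in the download Job
spec:
  accessModes:
    - ReadWriteOnce         # assumption: use ReadWriteMany if pods on several nodes must mount it
  resources:
    requests:
      storage: 50Gi         # assumption: size this for the models you plan to store
EOF
```

Apply the file in the namespace of your inference service with `kubectl apply -f models-pvc.yaml -n <namespace>`, where `<namespace>` is a placeholder.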

Note on HuggingFace Tokens

If the model you wish to download from HuggingFace requires authentication (i.e., a token), you must set the HF_TOKEN environment variable with your HuggingFace token. The following HuggingFace example assumes that this token is stored in the environment variable HF_TOKEN.
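
To avoid writing the token into a manifest in plain text, one option is to keep it in a Kubernetes Secret and expose it to the container through `secretKeyRef`. This is a sketch rather than the documented MLIS procedure: the Secret name `hf-token` and the file name `hf-token-env.yaml` are hypothetical.

```shell
# Step 1 (shown as a comment because it needs cluster access):
#   kubectl create secret generic hf-token --from-literal=HF_TOKEN="$HF_TOKEN"

# Step 2: env fragment that reads HF_TOKEN from the Secret instead of a literal value.
cat > hf-token-env.yaml <<'EOF'
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token   # hypothetical Secret name from step 1
      key: HF_TOKEN
EOF
```

The fragment takes the place of a literal `value:` entry under `env:` in a container spec. The Secret must exist in the same namespace as the workload that reads it.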


Examples

OpenLLM from HuggingFace

The following example demonstrates how to pre-load the meta-llama/Meta-Llama-3.1-8B-Instruct model from HuggingFace onto a PVC named models-pvc. This model requires a HuggingFace token and access permissions, which can be requested at huggingface.co. The example uses a Kubernetes Job that runs the huggingface-cli command to download the model to the /mnt/models/meta-llama-3.1-8b-instruct directory on the PVC. You can customize the Job to use any pre-loading method that suits your needs.

apiVersion: batch/v1
kind: Job
metadata:
  name: download-llama-3-1-8b-instruct-model
spec:
  template:
    spec:
      containers:
      - name: model-installer
        image: kserve/huggingfaceserver:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            MODEL_DIR="/mnt/models/meta-llama-3.1-8b-instruct"

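            # Exit non-zero on failure so the Job is marked failed and retried (restartPolicy: OnFailure).
            if huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --token "$HF_TOKEN" --local-dir "$MODEL_DIR"; then
              echo "Model download complete"
            else
              echo "Model download failed"
              exit 1
            fi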
        env:
        - name: HF_TOKEN
          value: hf_XXXXXXXXXXXXXXXXXXXXXX
        volumeMounts:
        - name: models-cache
          mountPath: /mnt/models
      restartPolicy: OnFailure
      volumes:
      - name: models-cache
        persistentVolumeClaim:
          claimName: models-pvc

Applying the Kubernetes Job

After creating the YAML file (e.g., download-llama-3-1-8b-instruct-model.yaml) with the content provided in the example above, you can apply it to your Kubernetes cluster using the following command:

kubectl apply -f download-llama-3-1-8b-instruct-model.yaml

Tip

You can monitor the logs to view the progress of the model being downloaded onto the PVC or to check for any errors that may have occurred.

kubectl logs job/download-llama-3-1-8b-instruct-model

When the model has been successfully downloaded, the last few lines of the log should look similar to the following output:

Download complete. Moving file to /mnt/models/meta-llama-3.1-8b-instruct/model-00001-of-00004.safetensors
Fetching 17 files:  41%|████      | 7/17 [01:04<01:36,  9.67s/it]Download complete. Moving file to /mnt/models/meta-llama-3.1-8b-instruct/original/consolidated.00.pth
Fetching 17 files: 100%|██████████| 17/17 [02:04<00:00,  7.30s/it]
/mnt/models/meta-llama-3.1-8b-instruct
Model download complete
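
In scripts, where reading logs is impractical, `kubectl wait` can block until the Job finishes. A minimal sketch, in which the helper name `wait_for_download` and the default 30-minute timeout are assumptions:

```shell
# Hypothetical helper: block until the named Job reports the Complete condition.
# Usage: wait_for_download <job-name> [timeout]
wait_for_download() {
  job_name="$1"
  timeout="${2:-30m}"   # assumption: adjust for model size and network speed
  kubectl wait --for=condition=complete --timeout="$timeout" "job/$job_name"
}
```

For example, `wait_for_download download-llama-3-1-8b-instruct-model 1h`. The command returns a non-zero status if the timeout expires first. The Complete condition only reflects the container's exit status, so the logs remain the place to confirm what was actually downloaded.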

You can now reference the model stored on the PVC in your inference service deployment.
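
Before pointing a deployment at the directory, it can be worth confirming that the expected files are present. The sketch below is a hypothetical helper (`check_model_dir` is not part of MLIS) meant to run in any pod that mounts the PVC. It assumes a HuggingFace-style layout with a `config.json` and at least one `.safetensors` file; other model formats would need different checks.

```shell
# Hypothetical helper: sanity-check a pre-loaded model directory.
# Assumes a HuggingFace-style layout (config.json plus *.safetensors weights).
check_model_dir() {
  dir="$1"
  [ -d "$dir" ] || { echo "missing directory: $dir"; return 1; }
  [ -f "$dir/config.json" ] || { echo "missing config.json in $dir"; return 1; }
  # ls exits non-zero when the glob matches no files.
  ls "$dir"/*.safetensors >/dev/null 2>&1 || { echo "no .safetensors files in $dir"; return 1; }
  echo "Model directory looks complete: $dir"
}
```

For the example on this page, the call would be `check_model_dir /mnt/models/meta-llama-3.1-8b-instruct`. A non-zero return status indicates that something is missing.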