Manual Model Pre-loading (PVC)
Manual pre-loading is an advanced method for pre-loading models in MLIS that uses PersistentVolumeClaims (PVCs) to store models. It offers greater control for experienced users with specific storage requirements and is well suited to OpenLLM or bento-archive models. However, it requires a solid understanding of Kubernetes and adds operational complexity, particularly for tasks such as canary deployments.
Benefits #
- Fine-grained control over model storage, access, and management
- Ideal for models not compatible with automated caching
Key Features #
- Uses PersistentVolumeClaims (PVCs) for model storage
- Supports custom storage mapping
- Provides manual control over model placement and lifecycle
When to Use Manual Pre-loading #
For most users, the simpler Model Caching method is recommended. Consider manual pre-loading if you:
- Need to pre-load models incompatible with standard caching
- Have specific storage needs not supported by model caching
- Possess advanced Kubernetes knowledge and can manage PVCs manually
- Require precise control over model lifecycles
Before You Start #
To use manual pre-loading with PVCs, administrators need to ensure the following:
- Create a PVC: You must create a PVC in the same Kubernetes namespace where you intend to deploy your inference service (a minimal manifest sketch follows this list).
- Sufficient storage: Ensure the PVC requests enough storage to hold the entire model, based on expected model sizes.
- Model-specific requirements: Review any resources specific to the model you want to pre-load, such as the HuggingFace token requirement described below.
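Here is a minimal sketch of such a PVC. The name models-pvc matches the examples below, but the 100Gi request and the access mode are assumptions: size the claim to hold the full model download, and use ReadWriteMany if the download Job and the serving pods must mount the volume at the same time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
spec:
  accessModes:
    - ReadWriteOnce  # assumption; use ReadWriteMany for simultaneous mounts
  resources:
    requests:
      storage: 100Gi  # assumption; size to fit the entire model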
Note on HuggingFace Tokens #
If the model you wish to download from HuggingFace requires authentication, you must provide your HuggingFace token to the download Job through the HF_TOKEN environment variable. The following HuggingFace example sets HF_TOKEN directly in the Job manifest; replace the placeholder value with your own token.
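If you prefer not to place the token in a manifest, one common pattern (a sketch; the Secret name hf-token is an assumption, not something MLIS requires) is to store it in a Kubernetes Secret and reference it from the Job:
kubectl create secret generic hf-token --from-literal=HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXX
The env section of the Job would then become:
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token  # the Secret created above
        key: HF_TOKEN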
Examples #
OpenLLM (HuggingFace) #
The following example demonstrates how to pre-load the meta-llama/Meta-Llama-3.1-8B-Instruct model from HuggingFace onto a PVC named models-pvc. This model requires a HuggingFace token and access permissions, which can be requested at huggingface.co. The example uses a Kubernetes Job that runs the huggingface-cli command to download the model to the /mnt/models/meta-llama-3.1-8b-instruct directory on the PVC. However, you can customize the Job to use any method that suits your needs for pre-loading models.
apiVersion: batch/v1
kind: Job
metadata:
  name: download-llama-3-1-8b-instruct-model
spec:
  template:
    spec:
      containers:
        - name: model-installer
          image: kserve/huggingfaceserver:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Download the model into the directory mounted from the PVC.
              MODEL_DIR="/mnt/models/meta-llama-3.1-8b-instruct"
              # Exit non-zero on failure so the Job is marked failed and can retry.
              huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --token $HF_TOKEN --local-dir $MODEL_DIR \
                && echo "Model download complete" \
                || { echo "Model download failed"; exit 1; }
          env:
            - name: HF_TOKEN
              value: hf_XXXXXXXXXXXXXXXXXXXXXX # replace with your HuggingFace token
          volumeMounts:
            - name: models-cache
              mountPath: /mnt/models
      restartPolicy: OnFailure
      volumes:
        - name: models-cache
          persistentVolumeClaim:
            claimName: models-pvc
Applying the Kubernetes Job #
After creating the YAML file (e.g., download-llama-3-1-8b-instruct-model.yaml) with the content provided in the example above, apply it to your Kubernetes cluster using the following command:
kubectl apply -f download-llama-3-1-8b-instruct-model.yaml
You can monitor the logs to view the progress of the model being downloaded onto the PVC or to check for any errors that may have occurred.
kubectl logs job/download-llama-3-1-8b-instruct-model
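Alternatively, you can block until the Job finishes instead of polling the logs. This uses kubectl's standard wait support for the Job complete condition; the 60m timeout is an assumption, so adjust it to your model size and network speed.
kubectl wait --for=condition=complete job/download-llama-3-1-8b-instruct-model --timeout=60m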
When the model has been successfully downloaded, the last few lines of the log should look similar to the following output:
Download complete. Moving file to /mnt/models/meta-llama-3.1-8b-instruct/model-00001-of-00004.safetensors
Fetching 17 files: 41%|████ | 7/17 [01:04<01:36, 9.67s/it]Download complete. Moving file to /mnt/models/meta-llama-3.1-8b-instruct/original/consolidated.00.pth
Fetching 17 files: 100%|██████████| 17/17 [02:04<00:00, 7.30s/it]
/mnt/models/meta-llama-3.1-8b-instruct
Model download complete
You can now reference the model stored on the PVC in your inference service deployment.
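The exact reference depends on how the service is deployed. As one illustration, a KServe-style InferenceService can point at the downloaded directory using a pvc:// storage URI. This is a sketch only: the resource name and model format below are assumptions, not values MLIS mandates.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      # storageUri format: pvc://<claim-name>/<path-on-volume>
      storageUri: pvc://models-pvc/meta-llama-3.1-8b-instruct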