Pre-loading Models

When deploying an inference service, reaching the Ready state can take a long time if the service must download and load a large AI model during startup. Pre-loading models in MLIS significantly reduces inference service startup times, especially for large models.

This section covers two approaches to pre-loading models: Model Caching and Manual Pre-loading.

| Feature | Model Caching | Manual Pre-loading |
| --- | --- | --- |
| Description | Automatic cache-on-first-use when enabled by an Administrator | Manual creation of a PersistentVolumeClaim (PVC) for each model version and referencing the PVC URL |
| Scope | Multiple deployments & namespaces | Individual deployments |
| Model Types | Bento-archive, OpenLLM, and NIM (custom models not supported); ideal for models with existing URLs (pfs://, openllm://, s3://) | Any; ideal for OpenLLM or bento-archive models |
| Storage | Shared PVC across all deployments | PVC can be shared or dedicated to a specific model |
| User Experience | Transparent; unused models are removed automatically | Must create and manage a PVC; must use the PVC syntax & URL to add the packaged model; the PVC must be in the same namespace as the model being deployed; unused models must be removed manually |
| Flexibility | Allows defining caching behavior | Allows mapping of arbitrary storage |

Both methods ensure models are readily available at service startup, improving MLIS deployment responsiveness and efficiency. Choose the method that best fits your specific use case and requirements.
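
For manual pre-loading, the PVC itself is a standard Kubernetes object. The following is a minimal sketch of such a claim; the name, namespace, size, and storage class are illustrative assumptions that must be adapted to your cluster.

```yaml
# Hypothetical PVC used to hold a pre-loaded model (all values are placeholders).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama2-7b-model-pvc          # referenced later when adding the packaged model
  namespace: my-inference-namespace  # must be the same namespace as the deployed model
spec:
  accessModes:
    - ReadWriteMany                  # lets several inference replicas mount the same model
  resources:
    requests:
      storage: 50Gi                  # size the claim for the model artifacts
  storageClassName: standard         # replace with a storage class available in your cluster
```

Once the claim holds the model files, the packaged model references it with a PVC URL, for example something of the form pvc://llama2-7b-model-pvc/models/llama2-7b; this form is illustrative, so consult the packaged-model reference for the exact URL syntax MLIS expects.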

Additional PVC Use Cases

Note that there are other uses for PVCs, such as:

  • Cache-on-first-use PVC: Using a PVC that automatically caches the model on first use when referenced by its PVC URL; this is only usable with Custom and NIM models, where the URL is not already in use.
  • Arbitrary PVC mounts: Mounting a PVC to add arbitrary storage to your inference service container; the PVC can contain a model or any other data (see the sketch below for one way to populate such a PVC).

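Whichever PVC-based approach you use, the claim must contain the model files before the inference service starts. One common way to populate it is a short-lived Pod that mounts the claim and downloads the model. The sketch below assumes the PVC from the earlier example; the image, download command, model name, and paths are placeholders rather than MLIS requirements.

```yaml
# Illustrative one-off Pod that fills the PVC with model files.
apiVersion: v1
kind: Pod
metadata:
  name: model-loader
  namespace: my-inference-namespace    # same namespace as the PVC and the deployment
spec:
  restartPolicy: Never
  containers:
    - name: loader
      image: python:3.11-slim
      command: ["/bin/sh", "-c"]
      args:
        - |
          pip install --quiet huggingface_hub
          python -c "from huggingface_hub import snapshot_download; snapshot_download('org/model-name', local_dir='/models/my-model')"
      volumeMounts:
        - name: model-store
          mountPath: /models           # model files land on the PVC under /models
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: llama2-7b-model-pvc # the claim created earlier
```

After the Pod completes it can be deleted; the files remain on the claim and are available to any deployment that mounts or references it.
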
Options