Advanced Configuration Options

While adding or editing a packaged model, you can customize the default runtime configuration MLIS uses to start your model server. This is useful for models that require modifications to MLIS’s default settings or when using a custom image that needs specific configuration parameters.

Before You Start

  • You should review your model’s framework documentation (e.g., OpenLLM, BentoML, NIM) and its CLI options.
  • You should already know the arguments and environment variables you’d like to set for your model based on previous testing.

Runtime Configuration Options

Environment Variables

Variable | Description
AIOLI_LOGGER_PORT | The port that the logger service listens on; default is 49160.
AIOLI_PROGRESS_DEADLINE | The deadline for downloading the model; default is 1500s.
AIOLI_READINESS_FAILURE_THRESHOLD | The number of readiness probe failures before the deployment is considered unhealthy; default is 100.
AIOLI_COMMAND_OVERRIDE | The customized deployment command that enables you to override the default deployment command within a predefined runtime.
AIOLI_SERVICE_PORT | The inference service container port used for communication; default is 8080 except for NIMs, which is 8000.
AIOLI_DISABLE_MODEL_CACHE | Disables automatic model caching for a deployment, even if it is enabled for the model; default is false.
AIOLI_DISABLE_LOGGER | A workaround for the KServe defect concerning streamed responses.
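
To make the defaults above concrete, the following Python sketch resolves the effective value of each setting from a dictionary of environment variables. It is illustrative only and is not how MLIS reads its configuration: the function name effective_settings, the is_nim flag, and the assumption that booleans are written as "true" or "false" are all hypothetical.

```python
def effective_settings(env, is_nim=False):
    """Resolve settings from env, falling back to the documented defaults.

    `env` is a plain dict of environment variables (hypothetical input shape).
    `is_nim` marks an NVIDIA NIM model, whose default service port differs.
    """
    default_service_port = 8000 if is_nim else 8080
    return {
        "logger_port": int(env.get("AIOLI_LOGGER_PORT", 49160)),
        # Kept as a string because the documented default is written as "1500s".
        "progress_deadline": env.get("AIOLI_PROGRESS_DEADLINE", "1500s"),
        "readiness_failure_threshold": int(
            env.get("AIOLI_READINESS_FAILURE_THRESHOLD", 100)
        ),
        "service_port": int(env.get("AIOLI_SERVICE_PORT", default_service_port)),
        # Assumption: the flag is spelled "true"/"false" (case-insensitive).
        "disable_model_cache": env.get("AIOLI_DISABLE_MODEL_CACHE", "false").lower()
        == "true",
    }


# An explicit value wins over the default; unset values fall back.
settings = effective_settings({"AIOLI_SERVICE_PORT": "9000"})
print(settings["service_port"], settings["logger_port"])
```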

Command Override Arguments

MLIS executes a default command for your container runtime based on the type of packaged model you have selected. You can replace this command by setting the AIOLI_COMMAND_OVERRIDE environment variable. Any arguments from the packaged model are appended to the end of the override command, followed by any arguments from the deployment, so the command that runs has the form [AIOLI_COMMAND_OVERRIDE] [MODEL_ARGS] [DEPLOYMENT_ARGS].
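
The ordering described above can be sketched in Python as follows. This is a minimal illustration of the append order only, not MLIS's actual implementation; the function name compose_command and the split of flags between model and deployment arguments are hypothetical.

```python
import shlex


def compose_command(override, model_args=(), deployment_args=()):
    """Return the argument list: override command, model args, deployment args."""
    argv = shlex.split(override)   # split the override string the way a shell would
    argv.extend(model_args)        # arguments defined on the packaged model
    argv.extend(deployment_args)   # arguments defined on the deployment
    return argv


final = compose_command(
    "openllm start --port 8080 /mnt/models",
    model_args=["--max-total-tokens", "4096"],
    deployment_args=["--gpu-memory-utilization", "0.9"],
)
print(shlex.join(final))
```

Because deployment arguments come last, they are the natural place for per-deployment tuning, while the packaged model carries arguments shared by every deployment of that model.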

Default Commands

The table below lists the default command MLIS runs for each framework. Commands can reference MLIS template variables such as {{.containerPort}}:

Framework | Command | Description
OpenLLM | openllm start --port {{.containerPort}} {{.modelDir}} | You can add any options from OpenLLM version 0.4.44 to your command (see openllm start -h).
Bento Archive | bentoml serve ... | You can add any options from BentoML version 1.1.11 to your command (see bentoml serve -h).
Custom | none | For custom models, the default entrypoint for the container is executed.
NVIDIA NIM | none | For NIM models, the default entrypoint for the container is executed. You must use environment variables; NIM containers do not honor CLI arguments.
vLLM | --served-model-name {{.modelName}} --model {{.modelName}} --port {{.containerPort}} --download-dir {{.modelDir}} | Arguments vary for S3/PVC/PFS URLs.

Template Variables

You can also use the following variables to modify the command’s arguments:

Named Argument | Description
{{.numGpus}} | The number of GPUs the model is requesting.
{{.modelName}} | The MLIS model name being deployed.
{{.modelDir}} | The directory into which the model will be downloaded. This is typically /mnt/models. This applies to NIM, OpenLLM, and S3 models.
{{.containerPort}} | The HTTP port that the container must listen on for inference requests and readiness checks.
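
To show how these placeholders expand, here is a minimal Python sketch that substitutes {{.name}} placeholders in a command string. It only mimics the substitution for illustration; how MLIS actually renders templates may differ, and the render function and the sample values (for example the model name my-model) are hypothetical.

```python
import re

# Matches placeholders of the form {{.name}}, allowing optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(command, values):
    """Replace each {{.name}} in `command` with values[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unknown template variable: {name}")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, command)


# Sample values, chosen for illustration only.
values = {
    "numGpus": 1,
    "modelName": "my-model",
    "modelDir": "/mnt/models",
    "containerPort": 8080,
}
rendered = render("openllm start --port {{.containerPort}} {{.modelDir}}", values)
print(rendered)
```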

Examples

AIOLI_COMMAND_OVERRIDE="openllm start {{.modelName}} --port {{.containerPort}} --gpu-memory-utilization 0.9 --max-total-tokens 4096"
AIOLI_COMMAND_OVERRIDE="bentoml serve {{.modelDir}}/bentofile.yaml --production --port {{.containerPort}} --host 0.0.0.0"
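
Before setting an override like the ones above, it can help to check that it only references the documented template variables, since a misspelled placeholder is easy to miss. The following Python sketch is a hypothetical local check (the helper unknown_variables is not part of MLIS) that lists any placeholder names outside the documented set.

```python
import re

# The four template variables documented in this section.
KNOWN_VARIABLES = {"numGpus", "modelName", "modelDir", "containerPort"}


def unknown_variables(command):
    """Return a sorted list of {{.name}} placeholders that are not documented."""
    used = set(re.findall(r"\{\{\s*\.(\w+)\s*\}\}", command))
    return sorted(used - KNOWN_VARIABLES)


override = "bentoml serve {{.modelDir}}/bentofile.yaml --production --port {{.containerPort}} --host 0.0.0.0"
problems = unknown_variables(override)
typo_problems = unknown_variables("openllm start {{.modelname}} --port {{.containerPort}}")
print(problems, typo_problems)
```

Template names are treated as case-sensitive here, so {{.modelname}} is reported as unknown.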

Multi-Node Deployments and Large Models

Multi-Node Deployments

MLIS does not provide built-in support for multi-node deployments. If your model requires more resources than a single node in your cluster can provide, your model runtime must itself support distributing the model across multiple nodes.

For example, the Llama 3.1 405B Instruct NIM can run on a single node with 8 H100 GPUs, but requires 16 GPUs when running on A100s, which generally means more than one node. For multi-node NIM deployments, see the NVIDIA documentation.
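
Whether a GPU requirement forces a multi-node deployment depends on how many GPUs each node provides. The arithmetic can be sketched in Python as below; the helper nodes_needed and the figure of 8 GPUs per node are assumptions for illustration, so substitute your own cluster's node size.

```python
import math


def nodes_needed(required_gpus, gpus_per_node):
    """Return the minimum number of nodes that supply `required_gpus` GPUs."""
    if required_gpus < 0 or gpus_per_node <= 0:
        raise ValueError("required_gpus must be >= 0 and gpus_per_node must be > 0")
    return math.ceil(required_gpus / gpus_per_node)


# Assuming 8 GPUs per node (an illustrative node size):
h100_nodes = nodes_needed(8, 8)    # the 8-GPU H100 case from the text
a100_nodes = nodes_needed(16, 8)   # the 16-GPU A100 case from the text
print(h100_nodes, a100_nodes)
```

Any result greater than 1 means the runtime itself has to handle multi-node distribution.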

Large Models

Large models, such as Llama 3.1, have high and very specific resource requirements for running on a single node. When working with these models, ensure that your cluster has the necessary resources and that your model runtime supports the required distribution method.

For more information, see the official Llama Model Requirements page.