Custom Image Requirements #
HPE Machine Learning Inferencing Software provides direct support for BentoML and limited support for OpenLLM (version 0.4.44 only). For newer models, including Llama 3.1/3.2, we recommend using custom model images with KServe HuggingFace Server or vLLM. You can also build custom model images with any container build tooling and any model runtime.
Before You Start #
All custom images must meet the following requirements:
- A custom containerized application must listen on port `8080`.
- Any custom metrics must be provided via a `/metrics` endpoint.
- Readiness and liveness checks are done by attempting to connect to port `8080/tcp`. (If successful, the inference service is deemed ready.)
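For illustration, the sketch below is a minimal, hypothetical Python application (standard library only) that satisfies these requirements: it listens on port 8080 and exposes a `/metrics` endpoint, which also lets it pass the TCP readiness/liveness checks. The POST inference route and its JSON payload shape are illustrative assumptions, not something the platform mandates.

```python
# Hypothetical minimal container application (standard library only).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Custom metrics go here, served as plain text (Prometheus format).
            body = b"inference_requests_total 0\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Illustrative inference route; the path and payload format are
        # entirely up to your application.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = {"echo": payload}  # placeholder for real model inference
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listening on 8080 also satisfies the TCP-connect readiness and
    # liveness checks, so no explicit /health endpoint is needed.
    HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```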
Configure Listening Port #
To configure the port for your packaged model, you have a few options:
- Configure your container to listen on port `8080` by default.
- Configure your container to read and use the `PORT` environment variable, which is provided by Knative. This allows the container to listen on the proper port for the inference service. You do not need to set `PORT` yourself; Knative provides it, but your container must honor this variable when it is available (see the sketch after this list).
- If your container can accept the desired port via command-line arguments, you can configure it through the packaged model's arguments or environment variables. For example:

      aioli model update <MODEL_NAME> --env="PORT=8080"
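As a minimal sketch of the second option, the snippet below honors Knative's `PORT` variable and falls back to `8080`; the surrounding server code is hypothetical, and the placeholder handler stands in for your application's own request handler.

```python
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Use the PORT variable when Knative provides it; otherwise default to 8080.
# SimpleHTTPRequestHandler is only a stand-in for your application's handler.
port = int(os.environ.get("PORT", "8080"))
HTTPServer(("0.0.0.0", port), SimpleHTTPRequestHandler).serve_forever()
```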
Customizing Metrics #
The default KServe-provided Prometheus metrics are automatically available for any container. See the Monitoring section for more information.
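If you want to expose custom metrics in addition to the defaults, one possible approach is to serve Prometheus-format text from your container's own `/metrics` endpoint. The snippet below is a hypothetical sketch that assumes the third-party `prometheus_client` package and an example counter named `inference_requests_total`; neither is provided by the product.

```python
# Hypothetical custom-metrics helper using the prometheus_client package
# (an assumption, not something the platform installs for you).
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

# Example custom metric: total inference requests served by this container.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Number of inference requests handled"
)

def handle_metrics():
    """Return the response body and content type for GET /metrics."""
    return generate_latest(), CONTENT_TYPE_LATEST
```

Wire `handle_metrics()` into whatever HTTP framework your container uses for its `/metrics` route, and call `INFERENCE_REQUESTS.inc()` wherever your application serves an inference request.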
Readiness & Health Checks #
By default, KServe performs readiness/health checks by attempting to connect to port `8080/tcp`. As long as the container is listening on `8080`, an explicit `/health` endpoint is not needed.
Customizing the Base Container Image #
You can also customize the base container image used by HPE Machine Learning Inferencing Software to service your models.
Example Container Images #
KServe offers an image called `kserve/huggingfaceserver` that can fetch and run vLLM-compatible models from Hugging Face. Before using this image, consider the following:
- The image accepts various parameters, but only `--model_id` is mandatory. This specifies the Hugging Face model (e.g., `facebook/opt-125m` or `meta-llama/Llama-2-7b-chat-hf`).
- If not specified, `--model_name` defaults to "model".
- For models requiring authentication, provide an `openllm` registry or set the `HF_TOKEN` environment variable in your model definition.
- By default, the image uses vLLM as the backend, which needs a GPU. For GPU-free testing of small models, you can use `--backend=huggingface` to switch to the Hugging Face backend. An example invocation follows this list.
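As a quick, hypothetical way to try the image locally, you might run something like the command below. It assumes Docker is available, that the image's entrypoint accepts these flags directly, and that the server listens on its default HTTP port 8080; adjust the flags for your model.

```
# HF_TOKEN is only needed for gated models (e.g., meta-llama/Llama-2-7b-chat-hf).
docker run --rm -p 8080:8080 \
  -e HF_TOKEN=<your-hugging-face-token> \
  kserve/huggingfaceserver \
  --model_id=facebook/opt-125m \
  --backend=huggingface
```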