Custom Image Requirements #
HPE Machine Learning Inferencing Software provides direct support for BentoML and limited support for OpenLLM (version 0.4.44 only). For newer models, including Llama 3.1/3.2, we recommend using custom model images with KServe HuggingFace Server or vLLM. You can also build custom model images with any container build tooling and any model runtime.
Before You Start #
All custom images must meet the following requirements:
- A custom containerized application must listen on port `8080`.
- Any custom metrics must be provided via a `/metrics` endpoint.
- Readiness and liveness checks are done by attempting to connect to port `8080/tcp`. (If successful, the inference service is deemed ready.)
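For illustration, the sketch below is a minimal, hypothetical Python application (standard library only) that satisfies these requirements: it listens on port 8080 and exposes a `/metrics` endpoint, which also lets it pass the TCP readiness/liveness checks. The POST inference route and its JSON payload shape are illustrative assumptions, not something the platform mandates.

```python
# Hypothetical minimal container application (standard library only).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Custom metrics go here, served as plain text (Prometheus format).
            body = b"inference_requests_total 0\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Illustrative inference route; the path and payload format are
        # entirely up to your application.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = {"echo": payload}  # placeholder for real model inference
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listening on 8080 also satisfies the TCP-connect readiness and
    # liveness checks, so no explicit /health endpoint is needed.
    HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```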
Configure Listening Port #
To configure the port for your packaged model, you have a few options:
- Configure your container to listen on port `8080` by default.
- Configure your container to read and use the `PORT` environment variable, which is provided by Knative. This allows the container to listen on the proper port for the inference service. You do not need to set `PORT` yourself; Knative provides it, but your container must honor this variable when it is available (see the sketch after this list).
- If your container can accept the desired port via command-line arguments, you can configure it through the packaged model's arguments or environment variables. For example:

      aioli model update <MODEL_NAME> --env="PORT=8080"
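As a minimal sketch of the second option, the snippet below honors Knative's `PORT` variable and falls back to `8080`; the surrounding server code is hypothetical, and the placeholder handler stands in for your application's own request handler.

```python
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Use the PORT variable when Knative provides it; otherwise default to 8080.
# SimpleHTTPRequestHandler is only a stand-in for your application's handler.
port = int(os.environ.get("PORT", "8080"))
HTTPServer(("0.0.0.0", port), SimpleHTTPRequestHandler).serve_forever()
```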
Customizing Metrics #
The default KServe-provided Prometheus metrics are automatically available for any container. See the Monitoring section for more information.
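If you want to expose custom metrics in addition to the defaults, one possible approach is to serve Prometheus-format text from your container's own `/metrics` endpoint. The snippet below is a hypothetical sketch that assumes the third-party `prometheus_client` package and an example counter named `inference_requests_total`; neither is provided by the product.

```python
# Hypothetical custom-metrics helper using the prometheus_client package
# (an assumption, not something the platform installs for you).
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

# Example custom metric: total inference requests served by this container.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Number of inference requests handled"
)

def handle_metrics():
    """Return the response body and content type for GET /metrics."""
    return generate_latest(), CONTENT_TYPE_LATEST
```

Wire `handle_metrics()` into whatever HTTP framework your container uses for its `/metrics` route, and call `INFERENCE_REQUESTS.inc()` wherever your application serves an inference request.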
Readiness & Health Checks #
By default, KServe performs readiness/health checks by attempting to connect to port `8080/tcp`. As long as the container is listening on `8080`, an explicit `/health` endpoint is not needed.
Customizing the Base Container Image #
You can also customize the base container image used by HPE Machine Learning Inferencing Software to service your models.
Example Container Images #
KServe offers an image called `kserve/huggingfaceserver` that can fetch and run vLLM-compatible models from Hugging Face. Before using this image, consider the following:
- The image accepts various parameters, but only `--model_id` is mandatory. This specifies the Hugging Face model (e.g., `facebook/opt-125m` or `meta-llama/Llama-2-7b-chat-hf`).
- If not specified, `--model_name` defaults to "model".
- For models requiring authentication, provide an `openllm` registry or set the `HF_TOKEN` environment variable in your model definition.
- By default, the image uses vLLM as the backend, which needs a GPU. For GPU-free testing of small models, you can use `--backend=huggingface` to switch to the Hugging Face backend. An example invocation follows this list.
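As a quick, hypothetical way to try the image locally, you might run something like the command below. It assumes Docker is available, that the image's entrypoint accepts these flags directly, and that the server listens on its default HTTP port 8080; adjust the flags for your model.

```
# HF_TOKEN is only needed for gated models (e.g., meta-llama/Llama-2-7b-chat-hf).
docker run --rm -p 8080:8080 \
  -e HF_TOKEN=<your-hugging-face-token> \
  kserve/huggingfaceserver \
  --model_id=facebook/opt-125m \
  --backend=huggingface
```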