Create Image (OpenLLM)
The following steps guide you through creating a containerized image for an inference service using the OpenLLM CLI. The result is a publicly accessible container image published at <user>/<model-name>, which can be referenced in an HPE Machine Learning Inferencing Software Model definition and deployed as a service.
Before You Start
- Ensure you have completed Developer System Setup.
- Ensure you have Docker installed and running.
- Ensure you have deployed the HPE Machine Learning Inferencing Software controller and have the openllm CLI installed.
- Ensure you have an available GPU on the system.
- HPE Machine Learning Inferencing Software (MLIS) supports OpenLLM 0.4.44, which does not support Llama 3.1 or 3.2 models. For more information, see the Considerations section.
How to Create a Containerized Image
- Build the container using the openllm build command.
openllm build --backend vllm --containerize <user/model-name>
- Get the IMAGE ID of the resulting container.
docker image ls
REPOSITORY                  TAG                                        IMAGE ID       CREATED         SIZE
tiiuae--falcon-7b-service   898df1396f35e447d5fe44e0a3ccaaaa69f30d36   4ac4bb1f2dce   3 minutes ago   24.9GB
- Tag the resulting container.
docker tag <image-id> <user>/<model-name>
- Push the container to a publicly accessible Docker repository.
docker push <user>/<model-name>
- Verify the container is accessible.
docker pull <user>/<model-name>
You are now ready to upload this image to a registry and create a packaged model in HPE Machine Learning Inferencing Software.
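The lookup, tag, and push steps above can be sketched as two small shell helpers. This is a minimal sketch, not part of the OpenLLM or Docker tooling: the function names image_id_for and publish_image are hypothetical, and the lookup assumes the column layout of the docker image ls output shown above (repository name in the first column, image ID in the third).

```shell
# Print the IMAGE ID for a repository name, reading `docker image ls`
# output on stdin. Assumes the default column layout (hypothetical helper).
image_id_for() {
  awk -v repo="$1" 'NR > 1 && $1 == repo { print $3; exit }'
}

# Tag an image ID ($1) as <user>/<model-name> ($2) and push it.
# Both arguments are values that you supply (hypothetical helper).
publish_image() {
  docker tag "$1" "$2" && docker push "$2"
}

# Example usage, with the placeholders from the steps above:
#   id="$(docker image ls | image_id_for tiiuae--falcon-7b-service)"
#   publish_image "$id" "<user>/<model-name>"
```

Keeping the lookup separate from the push makes it possible to check the image ID before anything is published.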
Model Testing
You can serve a model for interactive testing before building the container by specifying the desired LLM name in the following command:
openllm start --backend vllm facebook/opt-125m
Once started, the LLM service listens on http://localhost:3000. You can interact with it through the Swagger UI web interface by pointing a browser to that URL. The Swagger UI shows the available REST API methods of the service.
You can also use the openllm query command:
openllm query "What is an LLM?"
What is an LLM?
A degree in engineering
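Because the service may need some time to load the model after openllm start is launched, it can help to wait until the URL responds before opening the Swagger UI or running openllm query. The following is a minimal sketch, assuming curl is installed. wait_for_service is a hypothetical helper name, and the retry count and delay are arbitrary defaults rather than recommended values.

```shell
# Poll a URL until it answers successfully or the attempts run out.
# Usage: wait_for_service [url] [attempts] [delay_seconds]
wait_for_service() {
  url="${1:-http://localhost:3000}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -o /dev/null: discard the body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_service && openllm query "What is an LLM?" runs the query only once the service at the default URL has answered.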
Considerations
Llama 3.1 and 3.2 Models
OpenLLM 0.4.44 (the version supported by HPE Machine Learning Inferencing Software) does not support Llama 3.1/3.2 models. For these newer models, we recommend using one of the following alternatives:
- Use the new direct vLLM model support added in HPE Machine Learning Inferencing Software 1.3.0.
  - Refer to Huggingface Registry and Add Packaged Model.
- Custom model with KServe HuggingFace Server:
  - Provides flexibility and direct integration with the Hugging Face model hub.
  - Suitable for running Llama 3.1/3.2 models.
  - For setup instructions, see our Custom Image Requirements guide.
  - For available options, run:
    docker run kserve/huggingfaceserver -h
- vLLM container:
  - Offers high-performance inference for large language models, including Llama 3.1/3.2.
  - Use the vllm/vllm-openai image with specific CLI arguments.
  - For deployment instructions and available options, refer to the vLLM documentation.
  - To see available options, run:
    docker run vllm/vllm-openai:latest -h
Example for deploying a Llama 3.1 model with vLLM:
... --model meta-llama/Meta-Llama-3.1-70B-Instruct -tp 8 --port 8080
Example for creating a model with KServe HuggingFace Server:
aioli model create fb125m --image kserve/huggingfaceserver --arg=--model_id=facebook/opt-125m --requests-gpu=1 --limits-gpu=1 --limits-memory=10Gi --registry Huggingface.co
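To reuse the KServe example with a different model, the command can be parameterized. This sketch only rearranges the flags shown in the example above into a function that prints the command for review before you run it. build_model_cmd is a hypothetical name, and the resource values are copied from the example rather than recommended for any particular model.

```shell
# Print an `aioli model create` command for a Hugging Face model ID.
# Uses only the flags from the example above (hypothetical helper).
# Usage: build_model_cmd <model-name> <hugging-face-model-id>
build_model_cmd() {
  printf 'aioli model create %s --image kserve/huggingfaceserver --arg=--model_id=%s --requests-gpu=1 --limits-gpu=1 --limits-memory=10Gi --registry Huggingface.co\n' "$1" "$2"
}
```

Running build_model_cmd fb125m facebook/opt-125m prints the same command as the example, which can then be adjusted for GPU and memory limits as needed.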