Create Image (OpenLLM)

The following steps guide you through creating a containerized image for an inference service using the OpenLLM CLI. The result is a publicly accessible container image published as <user>/<model-name>, which can be referenced in an HPE Machine Learning Inferencing Software Model definition and deployed as a service.

Before You Start

  • Ensure you have completed Developer System Setup.
  • Ensure you have Docker installed and running.
  • Ensure you have deployed the HPE Machine Learning Inferencing Software controller and have the openllm CLI installed.
  • Ensure you have an available GPU on the system (see the quick verification commands after this list).
  • MLIS supports OpenLLM 0.4.44, which means it does not currently support Llama 3.1 or 3.2 models. For more information, see the Considerations section.
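
To verify the Docker and GPU prerequisites before building, the following optional checks may be useful (nvidia-smi assumes NVIDIA drivers are installed; adjust for your environment):

docker info         # confirm the Docker daemon is running
nvidia-smi          # confirm a GPU is available on the system
openllm --version   # confirm the OpenLLM CLI is installed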

How to Create a Containerized Image

  1. Build the container using the openllm build command, where <user/model-name> is the Hugging Face model ID (for example, tiiuae/falcon-7b).
    openllm build --backend vllm --containerize <user/model-name>
  2. Get the IMAGE ID of the resulting container.
    docker image ls
    REPOSITORY                                      TAG                                        IMAGE ID       CREATED         SIZE
    tiiuae--falcon-7b-service                       898df1396f35e447d5fe44e0a3ccaaaa69f30d36   4ac4bb1f2dce   3 minutes ago   24.9GB
  3. Tag the resulting container.
     docker tag <image-id> <user>/<model-name>
  4. Push the container to a publicly accessible Docker repository.
    docker push <user>/<model-name>
  5. Verify the container is accessible.
    docker pull <user>/<model-name>
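
As an illustration, the full sequence might look like the following for the tiiuae/falcon-7b model pushed to a hypothetical Docker Hub account named jdoe (substitute your own account and model name):

openllm build --backend vllm --containerize tiiuae/falcon-7b
docker image ls                          # note the IMAGE ID of tiiuae--falcon-7b-service
docker tag 4ac4bb1f2dce jdoe/falcon-7b-service
docker push jdoe/falcon-7b-service
docker pull jdoe/falcon-7b-service       # verify the image is publicly accessible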

You are now ready to upload this image to a registry and create a packaged model in HPE Machine Learning Inferencing Software.

Model Testing

You can serve a model for interactive testing before building the container by specifying the desired LLM name in the following command:

openllm start --backend vllm facebook/opt-125m

Once started, the LLM service listens on http://localhost:3000. You can interact with it via the Swagger UI web interface by pointing a browser at that URL. The Swagger UI lists the available REST API methods of the service.
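
You can also send a request to the service directly from the command line. The endpoint and request body below are assumptions based on OpenLLM 0.4.x (a /v1/generate route that accepts a prompt field); confirm the exact schema in the Swagger UI for your version:

curl -X POST http://localhost:3000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is an LLM?"}'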

You can also use the openllm query command:

openllm query "What is an LLM?"
What is an LLM?
A degree in engineering

Considerations

Llama 3.1 and 3.2 Models

OpenLLM 0.4.44 (the version supported by HPE Machine Learning Inferencing Software) does not support Llama 3.1/3.2 models. For these newer models, we recommend using one of the following alternatives:

  1. Custom model with KServe HuggingFace Server:

    • Provides flexibility and direct integration with the Hugging Face model hub.
    • Suitable for running Llama 3.1/3.2 models.
    • For setup instructions, see our Custom Image Requirements guide.
    • For available options, run: docker run kserve/huggingfaceserver -h
  2. vLLM container:

    • Offers high-performance inference for large language models, including Llama 3.1/3.2.
    • Use the vllm/vllm-openai image with specific CLI arguments.
    • For deployment instructions and available options, refer to the vLLM documentation.
    • To see available options, run: docker run vllm/vllm-openai:latest -h

Example arguments for deploying a Llama 3.1 model with the vllm/vllm-openai container:

... --model meta-llama/Meta-Llama-3.1-70B-Instruct -tp 8 --port 8080
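
To try these arguments locally before creating an MLIS model, one way to run the container directly is sketched below; the GPU flags, cache mount, port mapping, and Hugging Face token are illustrative assumptions based on typical vLLM usage:

docker run --gpus all --ipc=host -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct -tp 8 --port 8080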

Example for creating a model with KServe HuggingFace Server:

aioli model create fb125m --image kserve/huggingfaceserver --arg=--model_id=facebook/opt-125m --requests-gpu=1 --limits-gpu=1 --limits-memory=10Gi --registry Huggingface.co