Create Image (OpenLLM)

The following steps guide you through creating a containerized image for an inference service using the OpenLLM CLI. The result is a publicly accessible container image published at <user>/<model-name>, which you can reference in an HPE Machine Learning Inferencing Software model definition and deploy as a service.

Before You Start

  • Ensure you have completed Developer System Setup.
  • Ensure you have Docker installed and running.
  • Ensure you have deployed the HPE Machine Learning Inferencing Software controller and have the openllm CLI installed.
  • Ensure you have an available GPU on the system.
  • HPE Machine Learning Inferencing Software (MLIS) supports OpenLLM 0.4.44, which does not support Llama 3.1 or 3.2 models. For more information, see the Considerations section.
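
The prerequisites above can be checked from a shell before you start. The following is a minimal sketch, not an official HPE script. It assumes an NVIDIA GPU (so nvidia-smi is used as the GPU check) and that OpenLLM was installed with pip; neither assumption is stated in this guide. The helper names require_cmd and preflight are illustrative.

```shell
#!/bin/sh
# Preflight sketch for the prerequisites listed above (illustrative only).

# Return 0 if the named command is on PATH; otherwise print a message and return 1.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing: $1" >&2
    return 1
}

# Run all checks; returns non-zero if any check fails.
preflight() {
    status=0
    require_cmd docker || status=1
    require_cmd openllm || status=1
    # Assumption: NVIDIA GPU, so nvidia-smi is the tool that lists it.
    require_cmd nvidia-smi || status=1
    # `docker info` fails when the Docker daemon is not running.
    if ! docker info >/dev/null 2>&1; then
        echo "Docker daemon is not reachable" >&2
        status=1
    fi
    # Assumption: OpenLLM was installed with pip; this guide expects 0.4.44.
    version=$(python3 -m pip show openllm 2>/dev/null | awk '/^Version:/ { print $2 }')
    if [ "$version" != "0.4.44" ]; then
        echo "expected OpenLLM 0.4.44, found: ${version:-none}" >&2
        status=1
    fi
    return "$status"
}
```

Calling preflight prints one line per failed check and returns a non-zero status if anything is missing, so it can gate the build step in a script.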

How to Create a Containerized Image

  1. Build the container image using the openllm build command.
    openllm build --backend vllm --containerize <user/model-name>
  2. Get the IMAGE ID of the resulting image.
    docker image ls
    REPOSITORY                                      TAG                                        IMAGE ID       CREATED         SIZE
    tiiuae--falcon-7b-service                       898df1396f35e447d5fe44e0a3ccaaaa69f30d36   4ac4bb1f2dce   3 minutes ago   24.9GB
  3. Tag the resulting image.
    docker tag <image-id> <user>/<model-name>
  4. Push the image to a publicly accessible Docker repository.
    docker push <user>/<model-name>
  5. Verify the image is accessible.
    docker pull <user>/<model-name>
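
Steps 2 through 4 can be scripted so that the image ID does not have to be copied by hand. The sketch below shows one way to do this under stated assumptions: image_id_for_repo and publish_image are hypothetical helper names, and the repository name tiiuae--falcon-7b-service comes from the sample listing above and will differ for other models.

```shell
#!/bin/sh
# Sketch: look up the IMAGE ID for a repository, then tag and push the image.

# Read `docker image ls` output on stdin and print the IMAGE ID (third column)
# of the first row whose REPOSITORY column equals $1.
image_id_for_repo() {
    awk -v repo="$1" '$1 == repo { print $3; exit }'
}

# Tag the image built by `openllm build` and push it.
#   $1: repository name as shown by `docker image ls`
#   $2: target reference in the form user/model-name
publish_image() {
    image_id=$(docker image ls | image_id_for_repo "$1")
    if [ -z "$image_id" ]; then
        echo "no image found for repository: $1" >&2
        return 1
    fi
    docker tag "$image_id" "$2" && docker push "$2"
}
```

For example, publish_image tiiuae--falcon-7b-service myuser/falcon-7b tags and pushes the image from the sample listing, where myuser/falcon-7b stands in for <user>/<model-name>.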

The image is now published and ready to be referenced when you create a packaged model in HPE Machine Learning Inferencing Software.

Model Testing

You can serve a model for interactive testing before building the container by specifying the desired LLM name in the following command:

openllm start --backend vllm facebook/opt-125m

Once started, the LLM service listens on http://localhost:3000. Open that URL in a browser to interact with the service through the Swagger UI web interface, which lists the REST API methods the service exposes.

You can also use the openllm query command:

openllm query "What is an LLM?"
What is an LLM?
A degree in engineering
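
When the test is scripted, a query can fail if it is sent before the model has finished loading. A minimal sketch of a readiness wait follows. It assumes only that the service answers HTTP requests at http://localhost:3000 once it is up, as described above. The function name wait_for_service and the retry defaults are illustrative.

```shell
#!/bin/sh
# Sketch: poll a URL until it answers.

# wait_for_service URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the URL answers without an HTTP error,
# or 1 if every attempt fails.
wait_for_service() {
    url=$1
    attempts=${2:-30}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failures; -s: quiet; -o /dev/null: discard the body.
        if curl -fs -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A typical use is wait_for_service http://localhost:3000 && openllm query "What is an LLM?", which sends the prompt only after the service responds.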

Considerations

Llama 3.1 and 3.2 Models

OpenLLM 0.4.44 (the version supported by HPE Machine Learning Inferencing Software) does not support Llama 3.1/3.2 models. For these newer models, we recommend using one of the following alternatives:

  1. Use the new direct vLLM model support added in HPE Machine Learning Inferencing Software 1.3.0.
  2. Custom model with KServe HuggingFace Server:
    • Provides flexibility and direct integration with the Hugging Face model hub.
    • Suitable for running Llama 3.1/3.2 models.
    • For setup instructions, see our Custom Image Requirements guide.
    • For available options, run: docker run kserve/huggingfaceserver -h
  3. vLLM container:
    • Offers high-performance inference for large language models, including Llama 3.1/3.2.
    • Use the vllm/vllm-openai image with specific CLI arguments.
    • For deployment instructions and available options, refer to the vLLM documentation.
    • To see available options, run: docker run vllm/vllm-openai:latest -h

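Example arguments for deploying a Llama 3.1 model with vLLM (the leading ... stands for the rest of the command; -tp 8 sets the tensor-parallel size to 8 GPUs):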

... --model meta-llama/Meta-Llama-3.1-70B-Instruct -tp 8 --port 8080

Example for creating a model with KServe HuggingFace Server:

aioli model create fb125m --image kserve/huggingfaceserver --arg=--model_id=facebook/opt-125m --requests-gpu=1 --limits-gpu=1 --limits-memory=10Gi --registry Huggingface.co
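
To reuse the KServe example for other models, the command can be generated from a model name and a Hugging Face model ID and reviewed before it is run. This is a sketch only: kserve_model_cmd is a hypothetical helper, and the flags and resource values are copied unchanged from the example above, so adjust them for larger models.

```shell
#!/bin/sh
# Sketch: print the `aioli model create` command from the example above
# for a given model name ($1) and Hugging Face model ID ($2).
# Flags and resource values are copied from the example, not tuned per model.
kserve_model_cmd() {
    name=$1
    model_id=$2
    printf 'aioli model create %s --image kserve/huggingfaceserver --arg=--model_id=%s --requests-gpu=1 --limits-gpu=1 --limits-memory=10Gi --registry Huggingface.co\n' "$name" "$model_id"
}

# Print the command for the model used in the example.
kserve_model_cmd fb125m facebook/opt-125m
```

The function only prints the command; copy the printed line into your shell once you have confirmed the values.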