Manage Auto Scaling Templates

HPE Machine Learning Inferencing Software (MLIS) includes default auto scaling templates that users can select when adding or editing a deployment. You can manage these templates and create new ones using the MLIS UI, CLI, or API.

Before You Start

  • You must have the Admin or Maintainer user role to manage auto scaling templates.

Default Auto Scaling Templates

The following table shows the default auto scaling templates available in MLIS.

name | description | autoscaling_min_replicas | autoscaling_max_replicas | autoscaling_metric | autoscaling_target
fixed-1 | One inference service replica, always available. | 1 | 1 | rps | 0
fixed-2 | Two inference service replicas, always available. | 2 | 2 | rps | 0
scale-0-to-1-concurrency-3 | Scale from 0 to 1 replica with metric concurrency 3. | 0 | 1 | concurrency | 3
scale-0-to-4-rps-10 | Scale from 0 to 4 replicas with metric requests-per-second 10. | 0 | 4 | rps | 10
scale-0-to-8-rps-20 | Scale from 0 to 8 replicas with metric requests-per-second 20. | 0 | 8 | rps | 20
scale-1-to-4-rps-10 | Scale from 1 to 4 replicas with metric requests-per-second 10. | 1 | 4 | rps | 10
scale-1-to-8-concurrency-3 | Scale from 1 to 8 replicas with metric concurrency 3. | 1 | 8 | concurrency | 3
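
For reference, this is how one of these defaults, scale-0-to-4-rps-10, maps onto the template fields when expressed as the JSON body accepted by the templates API shown later on this page (the defaults shipped with your MLIS version may differ slightly):

    {
      "name": "scale-0-to-4-rps-10",
      "description": "Scale from 0 to 4 replicas with metric requests-per-second 10.",
      "autoScaling": {
        "minReplicas": 0,
        "maxReplicas": 4,
        "metric": "rps",
        "target": 10
      }
    }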

How to Add Auto Scaling Templates

Via the UI

  1. In the MLIS UI, navigate to Settings > Auto scaling templates.

  2. Select Add new auto scaling template.

  3. Input the name, description, and autoscaling requirements for the template.

    • Template name: A unique identifier for the auto scaling template.
    • Description: A brief explanation of the template's purpose or characteristics.
    • Minimum instances: The lowest number of instances that will run, even during periods of low activity.
    • Maximum instances: The highest number of instances allowed to run during peak demand.
    • Auto scaling target: The metric and target value used to trigger scaling actions.
      • Metric: The type of metric to monitor (e.g., concurrency, CPU utilization, memory usage).
      • Target: The threshold value for the chosen metric that triggers scaling actions (see the worked example after these steps).
    Note: Available Metrics

    The available metric types depend on the Autoscaler implementation:

    • KPA Autoscaler (default): Supports concurrency and rps metrics
    • HPA Autoscaler: Supports the cpu metric
  4. Select Create template.

The new template is now available from the Auto scaling targets template dropdown on the Scaling tab when adding or editing a deployment. To update this template, select the ellipsis icon next to the template name and choose Edit.
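
As a rough guide, and assuming the Knative KPA-style behavior implied by the default KPA Autoscaler, the number of running instances tends toward the observed metric value divided by the target, rounded up and clamped between the minimum and maximum instances. For example, with metric rps and target 10 (as in scale-0-to-4-rps-10), a sustained load of about 35 requests per second would scale the deployment up to its maximum of 4 replicas, and the deployment would eventually scale back to 0 replicas once traffic stops.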

Via the CLI

  1. Sign in via the CLI.
    aioli user login admin
  2. Add a new auto scaling template with the following command:
    aioli templates autoscaling create <TEMPLATE_NAME> \
    --autoscaling-min-replicas <MIN_REPLICAS> \
    --autoscaling-max-replicas <MAX_REPLICAS> \
    --autoscaling-metric <METRIC> \
    --autoscaling-target <TARGET>
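
For example, the following command creates a template comparable to the scale-1-to-4-rps-10 default; the template name and values here are only illustrative, and only the flags documented above are used:

    aioli templates autoscaling create my-rps-template \
    --autoscaling-min-replicas 1 \
    --autoscaling-max-replicas 4 \
    --autoscaling-metric rps \
    --autoscaling-target 10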

Via the API

  1. Sign in to MLIS.
    curl -X 'POST' \
      'https://<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "username": "<YOUR_USERNAME>",
      "password": "<YOUR_PASSWORD>"
    }'
  2. Obtain the Bearer token from the response.
  3. Use the following cURL command to add a new auto scaling template.
    curl -X 'POST' \
      'https://<YOUR_EXT_CLUSTER_IP>/api/v1/templates/autoscaling' \
      -H 'Accept: application/json' \
      -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
      -H 'Content-Type: application/json' \
      -d '{
        "autoScaling": {
          "maxReplicas": 1,
          "metric": "rps",
          "minReplicas": 0,
          "target": 1
        },
        "description": "An autoscaling template",
        "name": "my-template"
    }'
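
Putting the login and create steps together, the following is a minimal scripted sketch. It assumes the login response returns the access token in a token field and that jq is installed; adjust the field name to whatever your MLIS version actually returns.

    # Log in and capture the access token (assumes a "token" field in the response).
    TOKEN=$(curl -s -X 'POST' \
      'https://<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'Content-Type: application/json' \
      -d '{"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"}' | jq -r '.token')

    # Create the auto scaling template with the captured token.
    curl -X 'POST' \
      'https://<YOUR_EXT_CLUSTER_IP>/api/v1/templates/autoscaling' \
      -H 'Accept: application/json' \
      -H "Authorization: Bearer ${TOKEN}" \
      -H 'Content-Type: application/json' \
      -d '{
        "autoScaling": { "minReplicas": 0, "maxReplicas": 4, "metric": "rps", "target": 10 },
        "description": "Scale from 0 to 4 replicas with metric requests-per-second 10.",
        "name": "my-rps-template"
      }'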