Helm Chart Values (HCVs) #
Before You Start #
- Review the following dependencies:

Component | Minimum Version | Latest Version Validated | Dependency |
---|---|---|---|
Kubernetes | 1.20 | 1.30 | Core |
Docker | 2.6.0 | 2.6.0 | Core |
Helm | 3.0 | 3.13.2 | Core |
KServe | 0.11 | 0.14 | Core |
Istio | 1.18 | 1.20.4 | KServe |
Istio Client | | 1.20.1 | KServe |
Istio Control Plane | | 1.20.4 | KServe |
Istio Data Plane | | 1.20.4 | KServe |
Knative | 1.10 | 1.14.5 | KServe |
Knative Operator | | 1.14.5 | KServe |
Knative Serving | | 1.13.1 | KServe |
Cert Manager | 1.9.0 | 1.15.1 | KServe |
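One way to compare an installed component against the minimum versions listed above is a small version comparison helper. The sketch below is only an illustration: `version_ge` is a hypothetical helper name, it relies on `sort -V` (available in GNU coreutils), and the version strings are passed in by hand rather than queried from the cluster.

```shell
#!/bin/sh
# version_ge INSTALLED MINIMUM
# Succeeds when INSTALLED is greater than or equal to MINIMUM.
# Hypothetical helper; relies on `sort -V` for version ordering.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example: compare a Helm version string against the 3.0 minimum.
if version_ge "3.13.2" "3.0"; then
    echo "Helm version OK"
else
    echo "Helm version too old"
fi
```

The installed version strings could come from commands such as `helm version` or `kubectl version`, trimmed to the bare number before being passed in.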
Quick Configuration Values #
The following frequently used configuration values are available from the Helm chart.
Parameter | Description | Default |
---|---|---|
loadBalancerIP | Static host/IP for accessing the controller via the aioli-proxy. | Assigned by Kubernetes |
loadBalancerProxyPort | Port for accessing the controller via the aioli-proxy. | 80 |
logLevel | For debugging, specify debug or trace. | info |
image.master | Latest published Master. | Latest Master |
imageRegistry | The HPE MSC registry for the MLIS SKU you have purchased (e.g., hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>). | |
global.imagePullSecrets | List of k8s secrets with docker credentials to enable installation from a non-public repository. | |
global.env | List of environment variables to set in the aioli-master pod and deployment pods. | |
tlsSecret | k8s secret providing TLS configuration for HTTPS. | |
defaultPassword | Admin account password. | Auto-generated if not set |
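As a rough illustration of how these parameters fit together, the sketch below writes a small override file and notes how it could be passed to Helm. The file name `quick-values.yaml` and every value in it are placeholders chosen for this example, not recommended settings; `<SKU>`, `<release_name>`, and `<chart_name>` must be replaced with your own values.

```shell
# Write a small override file. All values are illustrative placeholders.
cat > quick-values.yaml <<'EOF'
logLevel: debug
loadBalancerProxyPort: 8080
imageRegistry: hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>
global:
  imagePullSecrets:
    - name: hpe-mlis-registry
EOF

# Pass the file at install time, for example:
#   helm install <release_name> <chart_name> -f quick-values.yaml
```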
To view all available values, run the helm show values aioli-1.3.0.tgz command. In addition, each provided sub-chart (grafana, loki, prometheus, promtail, dex) offers additional configurable values.
Global Environment Variables #
Specifying environment variables during the Helm install allows you to inject them directly into the aioli-master pod and deployment pods. These environment variables can be used to configure various aspects of your deployment, such as setting a proxy server.
For example, to set the http_proxy environment variable, you can update the values.yaml file as follows:
global:
  env:
    - name: http_proxy
      value: "http://your-proxy-server:port"
You can also set these values directly from the Helm command line:
helm install <release_name> <chart_name> --set "global.env[0].name=http_proxy" --set "global.env[0].value=http://your-proxy-server:port"
The environment variables specified during the Helm installation will be injected into the aioli-master pod and the inference service deployment pods. These environment variables can be overridden later in the packaged model or in specific deployments, which also allows additional environment variables to be specified.
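A proxy setup often needs more than one variable. The sketch below is an assumption-based example: `https_proxy` and `no_proxy` are conventional variable names that are not mentioned above, the addresses are placeholders, and `global-env-values.yaml` is a hypothetical file name. The last entry uses the `valueFrom`/`secretKeyRef` form that appears in the chart's commented `env` example, with a placeholder secret.

```shell
# Write the environment variables to a values file (placeholder values).
cat > global-env-values.yaml <<'EOF'
global:
  env:
    - name: http_proxy
      value: "http://your-proxy-server:port"
    - name: https_proxy
      value: "http://your-proxy-server:port"
    - name: no_proxy
      value: "localhost,127.0.0.1,.svc,.cluster.local"
    - name: MY_SECRET_ENV
      valueFrom:
        secretKeyRef:
          name: my-secret
          key: my-secret-key
EOF

# Pass the file at install time, for example:
#   helm install <release_name> <chart_name> -f global-env-values.yaml
```

A values file avoids the index bookkeeping (`global.env[0]`, `global.env[1]`, ...) that the `--set` form requires.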
Helm Chart #
# © Copyright 2023-2024 Hewlett Packard Enterprise Development LP
# HPE Machine Learning Inference Software (MLIS) Default Values
global:
# imagePullSecrets allow you to pull images from private repositories.
# This is required to access the licensed-MLIS containers, and is useful
# to avoid potential throttling when accessing public Docker Hub repositories.
# https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
# Example:
# imagePullSecrets:
# - name: hpe-mlis-registry
# - name: regcred
#
# To provide secrets as helm command line arguments, use:
# --set "global.imagePullSecrets[0].name=hpe-mlis-registry" --set "global.imagePullSecrets[1].name=regcred"
imagePullSecrets: []
# Environment variables to set in the "aioli-master" pod and deployment pods.
env: []
#env:
#- name: MY_VARIABLE
# value: "my_value"
#- name: ANOTHER_VARIABLE
# value: "another_value"
#- name: MY_SECRET_ENV
# valueFrom:
# secretKeyRef:
# name: my-secret
# key: my-secret-key
# imageRegistry specifies the source image repository for MLIS-provided images.
imageRegistry: determinedai
# HPE Machine Learning Inference Software (MLIS) uses the HPE MSC as the image registry
#imageRegistry: hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>
# Replace <SKU> in the URL shown above with the product SKU that was assigned to you,
# and configure `imagePullSecrets` to include the HPE MSC credentials Kubernetes Secret (e.g. `hpe-mlis-registry`)
#
# To get HPE MSC credentials, go to https://myenterpriselicense.hpe.com and, using the information provided with your order,
# create the HPE MSC credentials as a Kubernetes Secret (e.g. hpe-mlis-registry) with the following command:
# kubectl create secret docker-registry hpe-mlis-registry \
# --docker-server=hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU> \
# --docker-username=<HPE MSC user name> \
# --docker-password=<HPE MSC MLIS license key> \
# --docker-email=<HPE MSC user email> \
# -n <MLIS deployment K8s namespace, if any>
#
# The image from imageRegistry to be used to pull the MLIS controller master image.
image:
master: aioli-master:1.3.0
master:
nodeSelector: {}
# Configures the size of the PersistentVolumeClaim for the audit log.
# Should be adjusted for scale.
auditLogStorageSize: 1Gi
# storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
# audit log. This can be left blank if a default storage class is specified in
# the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
# create a PersistentVolume that will match the PersistentVolumeClaim.
# storageClassName:
#
# To improve product design, the controller and WebUI both collect anonymous information
# about how MLIS is being used. This information includes various metrics and events such
# as the number of registries, trained models, deployments, and more. Refer to the product
# documentation for more information (search for telemetry). You can disable this data
# collection at any time by setting telemetry-enabled to false.
telemetry-enabled: true
# Image pull policy for the master image.
# Valid values are: 'Never', 'IfNotPresent' and 'Always'
imagePullPolicy: IfNotPresent
# modelsCacheStorage enables model caching on shared network storage.
# Once a model is downloaded to the cache the first time it's used by a deployment,
# subsequent deployments will use the cached model instead of downloading it again.
# Cached models that have not been used by a deployment for the time period
# specified by "purgeUnusedCachedModelsAfter" will be automatically removed from
# the cache. Models are automatically removed from the cache when they are
# removed from the database.
#
# Enable and specify a storageClassName to use model caching. Refer to the product
# documentation for detailed requirements and limitations.
modelsCacheStorage:
enabled: false
checkUnusedCachedModelsEvery: 1d
purgeUnusedCachedModelsAfter: 1w
storageSize: 100Gi
# storageClassName:
# pvcNameSuffix is text appended to the end of the PVC name when creating the PVC
# for the storage class. When changing the PV used for model caching, supply a new
# suffix so that the MLIS PVC gets a unique name.
# pvcNameSuffix: appended-text
# bypassStorageCheck enables use of storage that has not been validated by MLIS
# for use with model caching. The resulting PV must support access to the same
# files when cloned between namespaces.
# bypassStorageCheck: false
# Default images used during the deployment
defaultImages:
# PostgreSQL image
postgreSQL: "postgres:16"
proxy: envoyproxy/envoy:v1.29-latest
# logger is a side car that supports logging inference request/response data from imageRegistry.
logger: aioli-logger
# openllm container with GPU + vllm support from imageRegistry.
openllm: aioli-runtimes:v3-openllm-0.4.44-py-3.11-cuda-12
# openllm-cpu container --backend pt for CPU support from imageRegistry.
openllmCpu: aioli-runtimes:v3-openllm-0.4.44-py-3.11-cpu
# utils support container to enable model download and certificate management.
utils: aioli-runtimes:v3-utils
# bentoml provides the default image that is used to run a bento from
# S3 storage.
bentoml: aioli-runtimes:v3-bentoml-py-3.9
# Kserve storage-initializer container
storageInitializer: kserve/storage-initializer:v0.11.2
# PFS support storage-initializer container
pfs: aioli-runtimes:v1-pfs-2.11.3
# Logger Level in master.yaml - Four severity levels: debug, info, warn, error
logLevel: info
# masterPort configures the port on which the controller listens for connections.
masterPort: 8080
# Request/Limits for Cpu/Memory for the master deployment
masterCpuRequest: 250m
masterCpuLimit: 2
masterMemRequest: 50Mi
masterMemLimit: 2Gi
# Configure a static IP address for the envoy proxy that provides the inbound load balancer.
loadBalancerIP: ""
# Configure the external port for the envoy proxy that provides the inbound load balancer.
# The default value is 80. When tlsSecret is set, the default value is 443. An explicit
# value is always honored.
# loadBalancerProxyPort: 80
# gpuSelector defines the configuration used when deploying a packaged model that
# specifies a gpuType value.
gpuSelector:
# gke enables the Kserve GKE accelerator annotation when a gpuType is requested.
# Specify it to override automatic detection of GKE.
# gke: false
# tolerationKey is the key name when generating a toleration to match the gpuType
# value. Set the value to "" to disable. The default configuration generates:
# tolerations:
# - effect: NoSchedule
# key: accelerator
# operator: Equal
# value: {gpuType}
tolerationKey: "accelerator"
# resourceName allows the mapping from MLIS GPU number to GPU vendors.
# Default to "nvidia.com/gpu". Specify "amd.com/gpu" to allocate AMD GPUs.
# resourceName: "nvidia.com/gpu"
# Enables the creation of non-namespaced objects - Default: true
# Non-namespaced objects are cluster-wide resources, such as PriorityClasses.
# With multiple installations on a single cluster (using different namespaces),
# setting this flag to false avoids recreating non-namespaced objects. In some cases (e.g., GitOps w/ArgoCD)
# creating existing cluster-wide resources could stop/hang automatic deployments.
#
# WARNING
# The first installation must run with the createNonNamespacedObjects flag set to true to ensure
# the non-namespaced objects are created.
createNonNamespacedObjects: true
# External ca.crt injection certificate/s secret name
# Command to create the ca cert secret:
# kubectl create secret generic <external ca cert secret name, e.g., ext-ca-cert> --from-file=<ca.crt or ca bundle filename> -n <namespace>
#
# externalCaCertSecretName: <external ca cert secret name, e.g., ext-ca-cert>
# When useNodePortForMaster is set to false, a LoadBalancer service is deployed to make
# the controller reachable from outside the cluster. When useNodePortForMaster is set to
# true, the master will instead be exposed behind a NodePort service. When using a NodePort service
# users will typically have to configure an Ingress to make the controller reachable from
# outside the cluster. NodePort service is recommended when configuring TLS termination in a
# load-balancer.
useNodePortForMaster: true
# When useNodePortForMaster is set to true, nodePortForMaster can be set to a value between
# 30000-32767 that sets the NodePort's port number used to receive HTTP traffic.
#
# nodePortForMaster: 30080
# loggerPort provides a port that is reserved for use by Aioli within the inference
# service pod to enable request/response body logs.
loggerPort: "49160"
# loggerResources sets the resource requests and limits for the aioli-logger sidecar container
# that supports logging inference request/response data.
loggerResources:
limits:
cpu: 1
memory: 1Gi
requests:
cpu: 10m
memory: 20Mi
# Enable route support for OpenShift by setting enabled to true. Configure TLS termination (e.g., edge) if needed.
# openshiftRoute:
# enabled:
# host:
# termination:
# tlsSecret enables TLS encryption for all communication made to the controller (TLS
# termination is performed in the controller). The specified Secret of type tls must already exist in
# the same namespace used for the helm install.
# tlsSecret:
security:
authz:
# type: rbac
jwt_keys_directory: "/etc/aioli/jwt-signing/"
integrations:
# Integration with HPE MLDM/Pachyderm.
pachyderm:
# The full address/protocol of the pachd service. For example:
# grpc://pachd.<releaseNamespace>.svc.cluster.local:30650
# When specified, and the OIDC configuration is specified to refer to the
# pachd auth service, MLIS and MLDM will provide unified authentication tokens.
address: ""
namespaces:
# namespaces.exclude is a list of regex expressions used to filter out namespaces
# that should not be used for deployment. The default value excludes
# KServe, Istio, Knative, GKE, Kubernetes, Cert Manager, and KinD namespaces.
exclude:
- "kube-.*"
- "gke-.*"
- "gmp-.*"
- "cert-manager"
- "istio-system"
- "knative-serving"
- "kserve"
- "local-path-storage"
# namespaces.include is a list of regex expressions used to allow only a limited
# set of namespaces for deployment. The default include expression allows any
# namespace that is not prohibited by the exclude list.
include:
- ".*"
priorityClasses:
# priorityClasses.exclude is a list of regex expressions used to filter out priority classes
# that should not be used for deployment. The default value excludes certain
# reserved priority classes and some generated for use by MLDE.
exclude:
- "system-.*"
- "aioli-system-.*"
- "gmp-critical"
- "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-priorityclass"
# priorityClasses.include is a list of regex expressions used to allow only a limited
# set of priority classes for deployment. The default include expression allows any
# priority class that is not prohibited by the exclude list.
include:
- ".*"
# db sets the configurations for the database.
db:
nodeSelector: {}
# To deploy your own Postgres DB, provide a hostAddress. If hostAddress is provided, no
# Postgres DB will be deployed.
# hostAddress:
# Required parameters, whether you are using your own DB or a provided DB.
#
# If password is left blank, a random password will be generated. The helm
# install will give instructions on how to retrieve the generated password.
name: aioli
user: postgres
password:
port: 5432
# Image pull policy for the Postgres image
# Valid values are: 'Never', 'IfNotPresent' and 'Always'
imagePullPolicy: IfNotPresent
# Only used for DB deployment. Configures the size of the PersistentVolumeClaim for the
# deployed database, as well as the CPU and memory requirements. Should be adjusted for
# scale.
storageSize: 1Gi
# Setting a request, breaks GKE deployment
# cpuRequest: 1
memRequest: 1Gi
# useNodePortForDB configures whether ClusterIP or NodePort service type is used for the
# deployed DB. By default ClusterIP is used.
useNodePortForDB: false
# storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
# deployed database. This can be left blank if a default storage class is specified in
# the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
# create a PersistentVolume that will match the PersistentVolumeClaim.
# storageClassName:
# sslMode and sslRootCert configure the TLS connection to the database. Users must first
# create a kubernetes secret or configMap containing their certificate and specify its name in
# certResourceName. For sslRootCert, specify the name of the file only (not path).
# sslMode: verify-ca
# sslRootCert: <cert_name>
# resourceType: <secret/configMap>
# certResourceName: <secret/configMap name>
# Configuration for the envoy proxy
proxy:
# The type of service to use for the proxy (NodePort, LoadBalancer, ClusterIP)
# When LoadBalancer (the default), a LoadBalancer service is deployed to make
# the controller & grafana reachable from outside the cluster via the proxy.
# When NodePort, the proxy will instead be exposed behind a NodePort service.
# When using a NodePort service users will typically have to configure an Ingress
# to make the proxy reachable from outside the cluster.
# NodePort service is recommended when configuring TLS termination in a load-balancer.
# When ClusterIP, the proxy will be exposed behind a ClusterIP service.
type: LoadBalancer
annotations: {}
labels: {}
nodeSelector: {}
# Proxy gateway timeout
# timeout: 15s
# Set of services made available via the proxy
services:
envoyAdmin: false
grafana: true
loki: false
prometheus: false
# Proxy resource requests/limits
cpuRequest: "1"
memRequest: "500Mi"
#cpulimit:
#memLimit:
################################################################################
# This chart provides subcharts for Promtail, Loki, Prometheus and Grafana that
# are installed by default. The subcharts can be disabled at deployment time.
#
# Configurable values for promtail can be found at
# "https://github.com/grafana/helm-charts/tree/main/charts/promtail"
# Configurable values for loki can be found at
# "https://github.com/grafana/helm-charts/tree/main/charts/loki-stack"
# Configurable values for prometheus can be found at
# "https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus"
# Configurable values for grafana can be found at
# "https://grafana.com/docs/loki/latest/setup/install/helm/"
#
# If installing on a Rancher Kubernetes Engine with the default DNS Provider set
# to use CoreDNS, then Loki's global DNS Service must be set:
# --set loki.global.dnsService=rke2-coredns-rke2-coredns
# The values shown below are used to configure Promtail, Loki, Prometheus and
# Grafana.
################################################################################
promtail:
enabled: true
tolerations:
- key: node-role.kubernetes.io/control-plane
effect: NoSchedule
- effect: NoSchedule
operator: Exists
config:
snippets:
extraRelabelConfigs:
# Keep all kubernetes labels containing "inference" to preserve Kserve
# and aioli pod labels on the logs.
- action: labelmap
regex: __meta_kubernetes_pod_label_(.*inference.+)
loki:
enabled: true
singleBinary:
replicas: 1
loki:
commonConfig:
replication_factor: 1
storage:
type: 'filesystem'
auth_enabled: false
compactor:
# Enable the compactor to cleanup old data
retention_enabled: true
#limits_config:
# Default loki retention period is 30 days.
# retention_period: 744h
prometheus:
enabled: true
server:
# Enable wal compression to reduce disk usage
extraFlags:
- storage.tsdb.wal-compression
# retentionSize should be less than persistentVolume.size (which is 8Gi by default)
retentionSize: 7GB
global:
# Scrape interval to enable quicker dashboard updates (Prometheus default is 1m)
# Values below 30s will prevent scale-to-zero of inference services.
scrape_interval: 45s
# Configuration for grafana defaults
grafana:
enabled: true
# deployment_dashboard_baseurl is a reference to the provided deployment dashboard that is pre-configured when this chart is installed.
deployment_dashboard_baseurl: /grafana/d/b6943e8f-4162-4c88-8912-1f8dbd67e0eb/7f12f4ce-2b5c-5fad-a418-60d3d04968e0?orgId=1
# deployment_dashboard_user is the Grafana user account which the deployment observability UI uses for cross launch into Grafana
# for users with the Admin role. Users without the Admin role are dynamically provisioned as Grafana viewers.
deployment_dashboard_user: admin
grafana.ini:
server:
root_url: /grafana/
serve_from_sub_path: true
auth.jwt:
enabled: true
header_name: X-JWT-Assertion
username_claim: sub
url_login: true
key_file: /etc/aioli-public-key/jwt.pem
# users without the Admin role are auto-created as Grafana viewers if they are not already matched.
auto_sign_up: true
#log:
# level: debug
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
uid: EC961B58-1731-40BB-B7F1-28B5CD6FD6D5
url: http://{{ .Release.Name }}-prometheus-server.{{ .Release.Namespace }}.svc.cluster.local
# Set "readOnly" to false and "editable" to true to avoid this message in the UI:
#
# This data source was added by config and cannot be modified using the UI. Please contact your server admin to update this data source.
readOnly: false
editable: true
- name: Loki
type: loki
uid: 0459878A-F358-4AC2-AB59-96BE56D0D65E
url: http://loki-gateway.{{ .Release.Namespace }}.svc.cluster.local
# Set "readOnly" to false and "editable" to true to avoid this message in the UI:
#
# This data source was added by config and cannot be modified using the UI. Please contact your server admin to update this data source.
readOnly: false
editable: true
extraVolumes:
- name: aioli-jwt-public-key
emptyDir: {}
extraVolumeMounts: # Mounted into grafana container
- name: aioli-jwt-public-key
mountPath: /etc/aioli-public-key/
readOnly: true
extraContainerVolumes:
- name: aioli-jwt-secrets
secret:
secretName: aioli-jwt-signing
extraInitContainers:
- name: aioli-jwt-secret-public-key-create
# NOTE: This cannot be a template because it is in a values file, so we cannot reference defaultImages.utils defined above.
image: determinedai/aioli-runtimes:v3-utils
# Valid values are: 'Never', 'IfNotPresent' and 'Always'
imagePullPolicy: IfNotPresent
args:
- -c
- openssl x509 -inform pem -in /mount/aioli/secrets/tls.crt -pubkey -noout > /etc/aioli-public-key/jwt.pem
volumeMounts:
- name: aioli-jwt-public-key
mountPath: /etc/aioli-public-key/
- name: aioli-jwt-secrets
mountPath: /mount/aioli/secrets/
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/default
dashboardsConfigMaps:
default: grafana-dashboard-config-{{ .Release.Name }}
################################################################################
# Dex is an OpenID Connect identity hub. Dex can be used to expose a consistent
# OpenID Connect interface to your applications while allowing your users to
# authenticate using their existing credentials from various back-ends,
# including LDAP, SAML, and other OIDC providers.
#
# The sample connector configuration shown below is specific to authenticating
# with the "Auth0" identity provider. Uncomment the "config -> connectors"
# section below and replace the sample connector with the connector that is
# appropriate for your identity provider.
#
# For information on how to configure connectors, such as Google, GitHub, LDAP,
# etc., see the Dex documentation at "https://dexidp.io/docs/connectors".
#
# If your connector requires a "redirectURI", you do not need to enter it, as
# it will be automatically generated for you.
#
# You cannot modify the configuration of the dex configSecret, or the
# DEX_STATIC_CLIENT_SECRET environment variable.
#
# Configurable values for dex can be found at
# "https://github.com/dexidp/helm-charts/tree/master/charts/dex"
################################################################################
dex:
logger:
level: info
configSecret:
create: false
name: aioli-dex-config
envVars:
- name: DEX_STATIC_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: aioli-oidc-secret-name
key: aioli-oidc-secret-key
config:
logger:
level: info
# connectors:
# - type: oidc
# name: Auth0
# id: auth0
# config:
# issuer: <Issuer URL>
# clientID: <Client ID>
# clientSecret: <Client Secret>
################################################################################
# oidc enables OpenID Connect Integration with Dex. The values shown below
# are used to configure the controller as a Dex client.
################################################################################
oidc:
enabled: false
# autoProvisionUsers specifies if users are automatically added to the database
# upon successful authentication by the identity provider and, therefore, there is
# no need to manually add users to the database with the CLI or REST API.
# Valid values are "true" or "false". Default value is true. If set to true,
# users are automatically added to the MLIS database upon successful authentication.
# If set to false, the administrator must explicitly create users in the MLIS database
# and assign their roles.
# autoProvisionUsers: true
# When autoProvisionUsers is set to true, authenticationClaim specifies the user's
# Username when added to the MLIS database. The default value is "email": MLIS sets
# the username of the user to the email address that is used to sign in with
# the identity provider. Valid values are "email", "name", or "preferred_username".
# authenticationClaim: email
# When autoProvisionUsers is set to true, displayNameAttributeName specifies the user's
# Display Name when added to the MLIS database. If not specified, the user's display name
# is empty. Valid values are "email", "name", or "preferred_username".
# displayNameAttributeName: name
# allowInsecureIssuerURLContext allows discovery to work when the issuer_url
# reported by upstream is mismatched with the discovery URL. This is meant
# for integration with off-spec providers. Valid values are "true" or "false".
# allowInsecureIssuerURLContext: false
# Default password for the admin account for the controller. If defaultPassword is
# left blank, a random password will be generated. The helm install will give
# instructions on how to retrieve the generated password. Once manually changed,
# this value is no longer relevant.
defaultPassword:
# Specify by name a ConfigMap that contains the trusted CAs to be injected into the
# environment of a deployment. See https://cert-manager.io/docs/trust/trust-manager/
# for details on how to create a ConfigMap with trusted CAs.
trustedCAsConfigMap: ""
# Integration of MLIS with AI Essentials.
ezua:
enabled: false
Additional Configuration Options #
Default Password #
The admin password is auto-generated if it is not set during installation (--set defaultPassword). You can retrieve the generated password by using the following command:
kubectl get secrets aioli-master-config-<RELEASE_NAME> \
--template='{{ index .data "aioli-master.yaml" | base64decode }}' | grep defaultPassword
Replace <RELEASE_NAME> with the name of the Helm release (e.g., mlis).
Using a Remote PostgreSQL Server #
This section describes how to use an existing remote PostgreSQL server instead of the default in-cluster PostgreSQL instance provided by the Helm install.
To configure MLIS to use a remote PostgreSQL server:
- Ensure you have a PostgreSQL instance that supports SSL connections.
- Obtain the SSL certificate authority (CA) certificate file for the server.
- Create a Kubernetes secret or configMap with the certificate:
  # Using a secret
  kubectl create secret generic <secret-name> --from-file=server.pem
  # Or using a configMap
  kubectl create configmap <configmap-name> --from-file=server.pem
- When installing MLIS with Helm, specify the following additional values:

Value | Description |
---|---|
db.hostname | The hostname of the database server |
db.port | The port number the database is listening on |
db.sslMode | The SSL connection mode (e.g., disable, require, verify-ca) |
db.sslRootCert | The name of the CA certificate file (e.g., server.pem) |
db.resourceType | Either 'secret' or 'configMap', depending on how you created the resource |
db.certResourceName | The name of the secret or configMap you created |
db.password | The database password |

For more information on PostgreSQL connection strings and SSL modes, refer to the PostgreSQL documentation on Connection Strings.
Example Helm install command with remote database configuration using a secret:
helm install mlis determined-ai/mlis \
  --set db.hostname=your-db-host.example.com \
  --set db.port=5432 \
  --set db.sslMode=verify-ca \
  --set db.sslRootCert=server.pem \
  --set db.resourceType=secret \
  --set db.certResourceName=your-secret-name \
  --set db.password=your-db-password
Example Helm install command with remote database configuration using a configMap:
helm install mlis determined-ai/mlis \
  --set db.hostname=your-db-host.example.com \
  --set db.port=5432 \
  --set db.sslMode=verify-ca \
  --set db.sslRootCert=server.pem \
  --set db.resourceType=configMap \
  --set db.certResourceName=your-certificate-configmap-name \
  --set db.password=your-db-password
Either example configuration allows MLIS to securely connect to your existing PostgreSQL server.
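The same settings can also be kept in a values file instead of a long list of `--set` flags. The sketch below assumes the `db.*` value names listed above; the file name `remote-db-values.yaml` and the `DB_PASSWORD` environment variable are hypothetical names used only for this example.

```shell
# Write the remote database settings to a values file (placeholder values).
cat > remote-db-values.yaml <<'EOF'
db:
  hostname: your-db-host.example.com
  port: 5432
  sslMode: verify-ca
  sslRootCert: server.pem
  resourceType: secret
  certResourceName: your-secret-name
EOF

# Supply the password separately so it is not stored in the file, for example:
#   helm install mlis determined-ai/mlis -f remote-db-values.yaml \
#     --set db.password="$DB_PASSWORD"
```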
Observability Components #
Configuring Observability Components #
You can configure the observability components using the following Helm subcharts:
- Grafana
- Loki
- Prometheus
- Promtail
Defaults are generally used for the observability subcharts, with the following items added via the default MLIS values.yaml file. All of these values may need to be tuned for your particular deployment:
- Loki: Configured for a single replica with the following settings:
  - Log retention period set to 30 days.
  - Log compaction is enabled.
  - Storage defaults set to 10Gi.
- Promtail: Configured to collect the pod labels that contain the word inference, to enable identification of model and deployment versions.
- Grafana: Configured to include the MLIS dashboard, and to enable SSO using JWT from the MLIS UI. It also automatically adds the Prometheus & Loki datasources.
Prometheus #
MLIS has the following default configuration for Prometheus to enable more rapid reporting of metrics:
prometheus:
  server:
    extraFlags:
      - storage.tsdb.wal-compression
    retentionSize: 7GB
    global:
      scrape_interval: 45s
Some significant Prometheus helm chart defaults that you may want to configure are:
prometheus:
  server:
    retention: 15d
    persistentVolume:
      size: 8Gi
By default, Prometheus allocates only 8Gi of storage for metric history and retains metrics for 15 days (15d). If the disk requirements of those 15 days of metrics exceed 8Gi, the Prometheus server will fail.
Ensure retentionSize is less than persistentVolume.size (default is 8Gi). If you increase prometheus.server.persistentVolume.size, adjust retentionSize accordingly. Note the different units (GB vs Gi).
See the Prometheus troubleshooting guide for more details.
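When resizing the volume, a quick calculation can help keep retentionSize below the volume size. The sketch below is a hypothetical helper, not part of MLIS: `suggest_retention_size` and its 10% default headroom are assumptions made for illustration. It also assumes that Prometheus treats its GB unit as a power-of-two unit (its storage documentation describes size units as based on powers of 2), so the number can be compared directly with a Gi volume size; verify this against the Prometheus version you run.

```shell
# suggest_retention_size PV_SIZE_GI [HEADROOM_PERCENT]
# Prints a retentionSize value that leaves HEADROOM_PERCENT (default 10, an
# arbitrary assumption) of the persistent volume free. Uses integer arithmetic,
# so the result is rounded down.
suggest_retention_size() {
    pv_gi=$1
    headroom=${2:-10}
    echo "$(( pv_gi * (100 - headroom) / 100 ))GB"
}

suggest_retention_size 8     # the default 8Gi volume
suggest_retention_size 20    # a hypothetical 20Gi volume
```

For the default 8Gi volume this yields the same 7GB value that the MLIS defaults use.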
Disabling Observability Components #
To disable the observability components, add the following flags to your helm install command:
--set grafana.enabled=false \
--set promtail.enabled=false \
--set prometheus.enabled=false \
--set loki.enabled=false
Disabling observability components renders the MLIS Deployment Dashboard link non-functional. You can substitute your own Grafana URI using grafana.deployment_dashboard_baseurl.
# Configuration for grafana defaults
grafana:
  enabled: true
  # deployment_dashboard_baseurl is a reference to the provided deployment dashboard that is pre-configured when this chart is installed.
  deployment_dashboard_baseurl: /grafana/d/b6943e8f-4162-4c88-8912-1f8dbd67e0eb/7f12f4ce-2b5c-5fad-a418-60d3d04968e0?orgId=1
If you do not have SSO enabled for Grafana, you can replicate the initialization provided in the MLIS default values.yaml to enable JWT access from MLIS.
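Putting the two pieces together, the sketch below disables the bundled subcharts and points the dashboard link at a Grafana instance you run yourself. The URL is a placeholder, `no-observability-values.yaml` is a hypothetical file name, and whether the link works also depends on how authentication is configured on your Grafana.

```shell
# Disable the bundled observability subcharts and set an external dashboard URL.
cat > no-observability-values.yaml <<'EOF'
grafana:
  enabled: false
  # Placeholder URL for a Grafana instance you operate yourself.
  deployment_dashboard_baseurl: "https://grafana.example.com/d/<dashboard-uid>/<dashboard-name>?orgId=1"
promtail:
  enabled: false
prometheus:
  enabled: false
loki:
  enabled: false
EOF

# Pass the file at install time, for example:
#   helm install <release_name> <chart_name> -f no-observability-values.yaml
```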
Rancher Kubernetes Engine #
If you are installing on Rancher Kubernetes Engine with the default DNS Provider set to use CoreDNS, then Loki’s global DNS Service must be set (see RKE DNS Provider):
--set loki.global.dnsService=rke2-coredns-rke2-coredns
Node Selectors #
You can use node labels to control which nodes the pods of this installation will run on by specifying a node selector during the install. The following example uses the node label kubernetes.io/arch with a value of amd64 as the node selector to run the pods:
--set master.nodeSelector."kubernetes\.io/arch"=amd64 \
--set db.nodeSelector."kubernetes\.io/arch"=amd64 \
--set proxy.nodeSelector."kubernetes\.io/arch"=amd64
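Because the label key contains dots, which Helm's `--set` syntax treats as path separators, a values file can be a simpler way to express the same selectors. The sketch below writes such a file; `node-selector-values.yaml` is a hypothetical file name.

```shell
# Node selectors for the master, database, and proxy pods.
cat > node-selector-values.yaml <<'EOF'
master:
  nodeSelector:
    kubernetes.io/arch: amd64
db:
  nodeSelector:
    kubernetes.io/arch: amd64
proxy:
  nodeSelector:
    kubernetes.io/arch: amd64
EOF

# Pass the file at install time, for example:
#   helm install <release_name> <chart_name> -f node-selector-values.yaml
```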
Model Cache Storage #
MLIS supports model caching on shared network storage. This feature can be enabled and configured using the following parameters:
modelsCacheStorage:
  enabled: false
  checkUnusedCachedModelsEvery: 1d
  purgeUnusedCachedModelsAfter: 1w
  storageSize: 100Gi
  # storageClassName:
  # pvcNameSuffix:
  # bypassStorageCheck: false
When enabled, this feature allows models to be cached on shared storage, reducing download times for subsequent deployments. See the Model Cache guide for more details.
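As a minimal sketch of turning the feature on, the override below enables caching and names a storage class. The class name `my-shared-storage-class` and the file name `model-cache-values.yaml` are placeholders; the storage class must exist in your cluster and meet the requirements described in the Model Cache guide.

```shell
# Enable model caching on a shared storage class (placeholder class name).
cat > model-cache-values.yaml <<'EOF'
modelsCacheStorage:
  enabled: true
  storageClassName: my-shared-storage-class
  storageSize: 100Gi
  checkUnusedCachedModelsEvery: 1d
  purgeUnusedCachedModelsAfter: 1w
EOF

# Pass the file at install time, for example:
#   helm install <release_name> <chart_name> -f model-cache-values.yaml
```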
GPU Selector #
MLIS provides configuration options for GPU selection when deploying models that specify a gpuType:
gpuSelector:
  # tolerationKey: "accelerator"
  # resourceName: "nvidia.com/gpu"
This allows you to control how GPUs are allocated and which types of GPUs are used for model deployments. See the GPU support guide for more details.
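The default values file describes the generated toleration as key accelerator, operator Equal, effect NoSchedule, with the gpuType as its value. The sketch below shows one way this could line up with node taints, assuming your GPU nodes are tainted to match; `taint_gpu_node` is a hypothetical helper, and the node name and GPU type are whatever your cluster uses. The values file selects AMD GPUs through `resourceName`, as the chart comments suggest.

```shell
# taint_gpu_node NODE GPU_TYPE
# Hypothetical helper: adds a taint that the default generated toleration
# (key "accelerator", operator Equal, effect NoSchedule) would match.
taint_gpu_node() {
    kubectl taint nodes "$1" "accelerator=$2:NoSchedule"
}

# Values for a cluster with AMD GPUs; tolerationKey keeps its default.
cat > gpu-values.yaml <<'EOF'
gpuSelector:
  tolerationKey: "accelerator"
  resourceName: "amd.com/gpu"
EOF

# Example usage (not run here):
#   taint_gpu_node <node-name> <gpuType>
#   helm install <release_name> <chart_name> -f gpu-values.yaml
```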
Non-Namespaced Objects #
By default, MLIS creates non-namespaced (cluster-wide) objects such as PriorityClasses. This behavior can be controlled with the createNonNamespacedObjects parameter:
createNonNamespacedObjects: true
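For an additional installation in a different namespace of the same cluster, the flag could be turned off so that the cluster-wide objects created by the first installation are not recreated. The sketch below assumes a first installation already ran with the default value; the file name and the release and namespace placeholders are hypothetical.

```shell
# Override for a second installation on a cluster that already has MLIS.
cat > second-install-values.yaml <<'EOF'
# Cluster-wide objects were already created by the first installation.
createNonNamespacedObjects: false
EOF

# Pass the file when installing the additional release, for example:
#   helm install <second_release_name> <chart_name> \
#     -n <second_namespace> -f second-install-values.yaml
```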
External CA Certificates #
You can inject external CA certificates into MLIS by creating a secret and specifying its name:
# externalCaCertSecretName: <external ca cert secret name, e.g., ext-ca-cert>
This can be useful when MLIS needs to trust additional certificate authorities. See the Configure HTTPS/TLS for External Repositories guide for more details.
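The sketch below combines the two steps: creating the secret and referencing it in a values file. The `kubectl create secret generic` form follows the comment in the default values file; `create_ca_secret` is a hypothetical wrapper, and `ext-ca-cert` and `external-ca-values.yaml` are example names.

```shell
# create_ca_secret SECRET_NAME CA_FILE NAMESPACE
# Hypothetical wrapper around the secret creation command.
create_ca_secret() {
    kubectl create secret generic "$1" --from-file="$2" -n "$3"
}

# Reference the secret by name in the values file.
cat > external-ca-values.yaml <<'EOF'
externalCaCertSecretName: ext-ca-cert
EOF

# Example usage (not run here):
#   create_ca_secret ext-ca-cert ca.crt <namespace>
#   helm install <release_name> <chart_name> -n <namespace> -f external-ca-values.yaml
```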
OpenShift Route #
For OpenShift deployments, MLIS supports configuring routes:
# openshiftRoute:
#   enabled:
#   host:
#   termination:
This allows you to expose MLIS services using OpenShift’s routing layer.
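A filled-in version might look like the sketch below. The host name is a placeholder, `edge` is the termination type mentioned in the default values file comment, and `openshift-route-values.yaml` is a hypothetical file name; other termination types are not covered here.

```shell
# Enable an OpenShift route with edge TLS termination (placeholder host).
cat > openshift-route-values.yaml <<'EOF'
openshiftRoute:
  enabled: true
  host: mlis.apps.example.com
  termination: edge
EOF

# Pass the file at install time, for example:
#   helm install <release_name> <chart_name> -f openshift-route-values.yaml
```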