On-Prem Kubernetes Cluster
GenAI Studio can be installed on an on-premises Kubernetes cluster. To understand how Machine Learning Development Environment runs on Kubernetes, visit Deploy on Kubernetes.
Prerequisites #
- Install Docker, kubectl, and Helm
- Deploy Kubernetes on-premises and enable GPU support for NVIDIA GPUs by following this quick start guide.
- Download the values.yaml file for Machine Learning Development Environment.
- Set up a network filesystem to store GenAI models and datasets. To do this, choose one of the following options:
  - Create a PVC with the Helm chart: Ensure your cluster provides a StorageClass that supports `ReadWriteMany`.
  - Use an existing PVC: If you have a pre-existing PVC, specify it by its `sharedPVCName` (see the example manifest after this list).
  - Mount an existing network filesystem as a host path: If a network filesystem is already mounted on your nodes, use `sharedFSMountPath` to designate it.
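For the pre-existing PVC option, the claim might look like the following minimal sketch. The `genai-shared-fs` name, the `nfs-client` StorageClass, and the 1Ti size are illustrative assumptions; substitute values that exist in your cluster and support `ReadWriteMany`.

```yaml
# Sketch of a ReadWriteMany PVC to serve as the GenAI shared filesystem.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: genai-shared-fs        # illustrative; pass this as sharedPVCName
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client # illustrative; must support ReadWriteMany
  resources:
    requests:
      storage: 1Ti
```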
Separating Resource Pools by Kubernetes Namespace #
- You can run resource pools in separate namespaces by configuring `kubernetes_namespace`.
- Manual host paths should be accessible on all nodes.
- For PVC use, create manual (`storageClassName: manual`) copies of the PVC in each namespace where your resource pools will operate (see the sketch after this list).
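As a rough illustration of that last point, you can apply the same manual PVC manifest to each resource pool namespace. The namespace names and the `genai-shared-pvc.yaml` file name below are hypothetical:

```bash
# Apply the same manual PVC definition to every namespace that hosts
# a resource pool (namespace names are placeholders).
for ns in team-a team-b; do
  kubectl apply -n "$ns" -f genai-shared-pvc.yaml
done
```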
User/Agent ID Permissions #
- To operate GenAI Studio with non-root users, MLDE can be set up with specific UNIX user IDs and group IDs. This requires configuring the shared filesystem appropriately. For more details, see the guide on enforcing shared User Agent Group IDs.
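In values.yaml terms, this maps to the `agentGroupID` and `shouldInitializeSharedFSGroupPermissions` settings shown in the full configuration below. A minimal sketch, assuming the example group ID 1100 used elsewhere in this guide:

```yaml
genai:
  ## Example group ID; every GenAI user must have this set as their
  ## Agent Group ID in the User Admin settings.
  agentGroupID: 1100
  ## Runs a Helm hook to give this group ownership of the shared
  ## filesystem; disable on clusters that forbid root-capable pods.
  shouldInitializeSharedFSGroupPermissions: true
```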
How to Install On-Prem #
Configure Helm values.yaml #
- Update the default values.yaml with your preferred settings. See the configuration reference guide for an exhaustive list of options.
- Add the `genai` section to the values.yaml file.

```yaml
## Configure GenAI Deployment
genai:
  ## Version of GenAI to use. If unset, GenAI will not be deployed
  version: "0.2.4"
  ## Port for GenAI backend to use
  port: 9011
  ## Port for GenAI message queue
  messageQueuePort: 9013
  ## Secret to pull the GenAI image
  # imagePullSecretName:
  ## GenAI pod memory request
  memRequest: 1Gi
  ## GenAI pod cpu request
  cpuRequest: 100m
  ## GenAI pod memory limit
  # memLimit: 1Gi
  ## GenAI pod cpu limit
  # cpuLimit: 2
  ## Host Path for the Shared File System for GenAI.
  ## If you are providing your own shared file system to use
  ## in GenAI, specify its host path here.
  ## Note: This takes precedence over creating a PVC with
  ## `generatedPVC`.
  # sharedFSHostPath:
  ## Internal path to mount the shared_fs drive to.
  ## If you are using multiple shared file systems, it can help
  ## to be able to configure where the file systems mount.
  ## When unset, defaults to `/run/determined/workdir/shared_fs`
  # sharedFSMountPath: /run/determined/workdir/shared_fs
  ## PVC Name for the shared file system for GenAI.
  ## Note: Either `sharedPVCName` or `generatedPVC.storageSize` (to
  ## generate a new PVC) is required for GenAI deployment
  # sharedPVCName:
  ## Spec for the generated PVC for GenAI
  ## Note: In order to generate a shared PVC, you will need access to a
  ## StorageClass that can provide a ReadWriteMany volume
  generatedPVC:
    ## Storage class name for the generated PVC
    storageClassName: standard-rwx
    ## Size of the generated PVC
    storageSize: 1Ti
  ## Unix Agent Group ID for the Shared Filesystem.
  ## This setting is required to run your cluster with unprivileged users.
  ## This setting is not required if users will be running their experiments as root.
  ## Note: All users that work with GenAI need to have this assigned as their
  ## Agent Group ID in the User Admin settings.
  ## More info here:
  ## https://hpe-ai-solutions-documentation.netlify.app/products/gen-ai/latest/admin/set-up/deployment-guides/kubernetes/enforce-shared-user-agent-ids/
  # agentGroupID: 1100
  ## Whether or not we should attempt to run a Helm hook to initialize
  ## the shared filesystem to use the agentGroupID as its group.
  ## This must be turned off on clusters that disable pods that can run as root.
  ## More info here:
  ## https://hpe-ai-solutions-documentation.netlify.app/products/gen-ai/latest/admin/set-up/deployment-guides/kubernetes/enforce-shared-user-agent-ids/
  shouldInitializeSharedFSGroupPermissions: false
  ## Extra Resource Pool Metadata is hardcoded information about the
  ## GPUs available to the resource pools. This information
  ## is not provided in k8s so we provide it directly.
  ## Note: All resource pools defined here need to also be reflected in
  ## the .Values.resourcePools.
  # extraResourcePoolMetadata:
  #   A100:
  #     gpu_type: A100
  #     max_agents: 3
  #   V100:
  #     gpu_type: V100
  #     max_agents: 2
```
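Most of the keys above are optional. As a minimal sketch, assuming a pre-existing `ReadWriteMany` PVC named `genai-shared-fs` (an illustrative name), the section could be as small as:

```yaml
genai:
  version: "0.2.4"
  sharedPVCName: genai-shared-fs  # illustrative PVC name
```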
- Version of GenAI to use: Under `version`, set the version to `0.2.4`.
- sharedPVCName: Under `sharedPVCName`, specify the name of the shared PVC in your cluster that is designated as the shared network drive for GenAI Studio. Otherwise, ensure that the `storageClassName` reflects a StorageClass with `ReadWriteMany` enabled.
- Extra Resource Pool Metadata: For every GPU type present in the cluster, add an entry under `extraResourcePoolMetadata`. More specifically, you must manually specify the GPU type and max agents (physical nodes) for each GPU-based resource pool you use (a sketch follows the resourcePools example below).
- Update the values.yaml `resourcePools` section to include the resource pools you want to use, along with any appropriate taints and tolerations. The types of GPUs available depend on the hardware you have access to.

```yaml
resourcePools:
  - pool_name: A100
    task_container_defaults:
      kubernetes:
        max_slots_per_pod: 8
      gpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          tolerations:
            - key: "accelerator"
              operator: "Equal"
              value: "NVIDIA-A100-ABCD-80GB"
              effect: "NoSchedule"
  - pool_name: T4
    task_container_defaults:
      kubernetes:
        max_slots_per_pod: 6
      gpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          tolerations:
            - key: "accelerator"
              operator: "Equal"
              value: "Tesla-T4"
              effect: "NoSchedule"
```
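As noted above, each pool here should also appear under `extraResourcePoolMetadata`. A sketch matching these two pools follows; the `max_agents` counts are illustrative placeholders for your actual node counts:

```yaml
genai:
  extraResourcePoolMetadata:
    A100:
      gpu_type: A100
      max_agents: 2   # illustrative; set to your A100 node count
    T4:
      gpu_type: T4
      max_agents: 4   # illustrative; set to your T4 node count
```

The tolerations in the pool specs only matter if matching taints exist on your GPU nodes. A hypothetical example of applying one such taint (the node name and value must match your cluster):

```bash
kubectl taint nodes gpu-node-1 accelerator=NVIDIA-A100-ABCD-80GB:NoSchedule
```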
Install #
- Enable the repository:

  ```bash
  helm repo add determined-ai https://helm.determined.ai/
  ```

- List repos:

  ```bash
  helm repo list
  ```

- Update the repo (Helm doesn’t automatically update):

  ```bash
  helm repo update
  ```

- Show the current version of determined-ai in the repo:

  ```bash
  helm search repo determined
  ```

- Install version 0.34.0 of the Determined Helm chart with your modified values.yaml:

  ```bash
  helm install -f values.yaml \
    --generate-name determined-ai/determined \
    --version "0.34.0" \
    --set maxSlotsPerPod=4
  ```
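After the install completes, you can verify the release and watch the pods come up with standard Helm and kubectl commands, for example:

```bash
# Confirm the Helm release exists and check pod status in the
# namespace you installed into.
helm list
kubectl get pods
```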
Configure Shared Filesystem and User Permissions #
After setting up your Kubernetes cluster, you should next configure the shared filesystem and user permissions to ensure effective management of datasets. This configuration step prevents permission-related problems when accessing datasets. For detailed guidance and a sample script, see Enforcing Shared User Agent Group IDs.
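The linked guide contains the supported sample script. As a rough, hypothetical illustration, the initialization boils down to giving the shared agent group ownership of the shared filesystem:

```bash
# Illustrative only: /mnt/shared_fs and group 1100 are placeholders for
# your mount point and agentGroupID. Run on a host where the network
# filesystem is mounted.
sudo chgrp -R 1100 /mnt/shared_fs
sudo chmod -R g+rwxs /mnt/shared_fs
```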