GKE for Large Model Training and Serving
This Terraform module provisions a GKE-based environment for training and serving large and very large deep learning models, including recent generative AI models.
The central element of this environment is a VPC-native GKE Standard cluster. Users of the module can decide whether to deploy the cluster within an existing VPC or create a new VPC specifically for the cluster. The cluster can be configured with multiple CPU, GPU or TPU node pools. The node pools use a custom service account. This service account can be an existing one or a newly created account.
Beyond the cluster, users have the option to create additional services such as Artifact Registry or Cloud Storage buckets.
The module carries out the following tasks:
- If no reference to an existing VPC is provided, creates a network, a subnet, and IP ranges for GKE pods and services.
- Optionally provisions Cloud NAT.
- If no reference to an existing service account is provided, creates a new service account and grants it a user-defined set of security roles.
- Deploys a Standard, VPC-native GKE cluster that is configured to use Workload Identity.
- Creates a user-defined number of CPU node pools.
- Creates a user-defined number of TPU node pools.
- Creates a user-defined number of GPU node pools.
- Configures the node pools to use a custom service account.
- Optionally creates an Artifact Registry.
- Creates the specified number of user-defined Cloud Storage buckets.
Examples
GKE TPU training environment
This example configures an environment optimized for large-scale training workloads on TPUs. It creates a new VPC, a new service account, a new Artifact Registry, one CPU node pool, two TPU v4-16 node pools, and a Cloud Storage bucket. Most settings are left at their default values.
```hcl
module "tpu-training-cluster" {
  source     = "github.com/GoogleCloudPlatform/applied-ai-engineering-samples//ai-infrastructure/terraform-modules/gke-aiml"
  project_id = "project_id"
  region     = "us-central2"

  # New VPC and subnet created for the cluster
  vpc_config = {
    network_name = "gke-cluster-network"
    subnet_name  = "gke-cluster-subnetwork"
  }

  # New service account used by all node pools
  node_pool_sa = {
    name = "gke-node-pool-sa"
  }

  cluster_config = {
    name = "gke-tpu-training-cluster"
  }

  cpu_node_pools = {
    default-cpu-node-pool = {
      zones = ["us-central2-a"]
      labels = {
        default-node-pool = true
      }
    }
  }

  # Two TPU v4-16 slices, one node pool per slice
  tpu_node_pools = {
    tpu-v4-16-podslice-1 = {
      zones    = ["us-central2-b"]
      tpu_type = "v4-16"
    }
    tpu-v4-16-podslice-2 = {
      zones    = ["us-central2-b"]
      tpu_type = "v4-16"
    }
  }

  # One Cloud Storage bucket with default settings
  gcs_configs = {
    training-artifacts-bucket = {}
  }

  registry_config = {
    name     = "training-images"
    location = "us"
  }
}
```
GKE GPU training environment
This example configures an environment optimized for large-scale training workloads on GPUs. It creates a new VPC, a new service account, a new Artifact Registry, one L4 GPU node pool, and a Cloud Storage bucket. Most settings are left at their default values. You can use any GPU machine type and accelerator type available to you; see the GPU doc for the supported ones.
```hcl
module "gpu-training-cluster" {
  source     = "github.com/GoogleCloudPlatform/applied-ai-engineering-samples//ai-infrastructure/terraform-modules/gke-aiml"
  project_id = "project_id"
  region     = "us-central1"

  # New VPC and subnet created for the cluster
  vpc_config = {
    network_name = "gke-cluster-network"
    subnet_name  = "gke-cluster-subnetwork"
  }

  # New service account used by all node pools
  node_pool_sa = {
    name = "gke-node-pool-sa"
  }

  cluster_config = {
    name = "gke-gpu-training-cluster"
  }

  # One autoscaled node pool with a single NVIDIA L4 GPU per node
  gpu_node_pools = {
    l4-gpu-node-pool = {
      zones             = ["us-central1-a"]
      min_node_count    = 1
      max_node_count    = 2
      machine_type      = "g2-standard-4"
      accelerator_type  = "nvidia-l4"
      accelerator_count = 1
      disk_size_gb      = 200
      taints            = {}
      labels            = {}
    }
  }

  # One Cloud Storage bucket with default settings
  gcs_configs = {
    training-artifacts-bucket = {}
  }

  registry_config = {
    name     = "training-images"
    location = "us"
  }
}
```
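To illustrate swapping in other GPU hardware, the sketch below defines a `gpu_node_pools` value with two pools. It reuses only the attributes that appear in the example above. The pool name `a100-gpu-node-pool`, the `a2-highgpu-1g` and `nvidia-tesla-a100` pairing, the node counts, and the zone are assumptions chosen for illustration, so confirm availability and quota in your project before relying on them.

```hcl
locals {
  gpu_node_pools = {
    # Same L4 pool as in the example above
    l4-gpu-node-pool = {
      zones             = ["us-central1-a"]
      min_node_count    = 1
      max_node_count    = 2
      machine_type      = "g2-standard-4"
      accelerator_type  = "nvidia-l4"
      accelerator_count = 1
      disk_size_gb      = 200
      taints            = {}
      labels            = {}
    }

    # Hypothetical second pool: one A100 GPU per node.
    # Machine type, accelerator type, and zone are assumptions.
    a100-gpu-node-pool = {
      zones             = ["us-central1-a"]
      min_node_count    = 1
      max_node_count    = 2
      machine_type      = "a2-highgpu-1g"
      accelerator_type  = "nvidia-tesla-a100"
      accelerator_count = 1
      disk_size_gb      = 200
      taints            = {}
      labels            = {}
    }
  }
}
```

The value can then be passed to the module with `gpu_node_pools = local.gpu_node_pools`. Each key in the map becomes a separate node pool.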
Variables
Name | Description | Type | Required | Default |
---|---|---|---|---|
project_id | Environment project ID | string | ✓ | |
region | Environment region | string | ✓ | |
deletion_protection | Prevent Terraform from destroying data storage resources (storage buckets, GKE clusters). When this field is set, a terraform destroy or terraform apply that would delete data storage resources will fail. | string | | true |
cluster_config | Cluster-level configuration | object({...}) | | {...} |
vpc_config | Network configuration of a VPC to create. Must be specified if vpc_ref is null | object({...}) | | {...} |
vpc_ref | Settings for the existing VPC to use for the environment. If null, a new VPC based on vpc_config will be created | object({...}) | | {...} |
node_pool_sa | Settings for a node pool service account | object({...}) | | {...} |
cpu_node_pools | Settings for CPU node pools | map(object({...})) | | {...} |
tpu_node_pools | Settings for TPU node pools. See below for more information about TPU slice types | map(object({...})) | | {...} |
gpu_node_pools | Settings for GPU node pools | map(object({...})) | | {...} |
gcs_configs | Settings for Cloud Storage buckets | map(object({...})) | | {...} |
registry_config | Settings for Artifact Registry | object({...}) | | {...} |
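As a sketch of how these variables combine, the block below sets only the two required variables plus a few optional ones, and turns off `deletion_protection` for a short-lived environment. The module name `dev-cluster` and all resource names are hypothetical. The table lists `deletion_protection` as a string, and Terraform converts the boolean `false` to the string "false" automatically. Whether the remaining defaults are sufficient for your use case is an assumption to verify with `terraform plan`.

```hcl
module "dev-cluster" {
  source = "github.com/GoogleCloudPlatform/applied-ai-engineering-samples//ai-infrastructure/terraform-modules/gke-aiml"

  # Required variables
  project_id = "project_id"
  region     = "us-central1"

  # Allow terraform destroy to remove buckets and the cluster.
  # Suitable only for throwaway environments.
  deletion_protection = false

  # vpc_ref is left unset, so a new VPC is created from vpc_config
  vpc_config = {
    network_name = "dev-cluster-network"
    subnet_name  = "dev-cluster-subnetwork"
  }

  node_pool_sa = {
    name = "dev-node-pool-sa"
  }

  cluster_config = {
    name = "dev-cluster"
  }
}
```

Leaving `deletion_protection` at its default of `true` is the safer choice for any environment that holds training data or model artifacts.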
Specifying TPU type
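When configuring TPU node pools, set tpu_type to one of the values in the TPU type name column below. The remaining columns show the slice type, topology, VM type, and node count that each value implies: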
TPU type name | Slice type | Slice topology | TPU VM type | Number of VMs in a slice | Number of chips in a VM |
---|---|---|---|---|---|
v5litepod-1 | tpu-v5-lite-podslice | 1x1 | ct5lp-hightpu-1t | 1 | 1 |
v5litepod-4 | tpu-v5-lite-podslice | 2x2 | ct5lp-hightpu-4t | 1 | 4 |
v5litepod-8 | tpu-v5-lite-podslice | 2x4 | ct5lp-hightpu-8t | 1 | 8 |
v5litepod-16 | tpu-v5-lite-podslice | 4x4 | ct5lp-hightpu-4t | 4 | 4 |
v5litepod-32 | tpu-v5-lite-podslice | 4x8 | ct5lp-hightpu-4t | 8 | 4 |
v5litepod-64 | tpu-v5-lite-podslice | 8x8 | ct5lp-hightpu-4t | 16 | 4 |
v5litepod-128 | tpu-v5-lite-podslice | 8x16 | ct5lp-hightpu-4t | 32 | 4 |
v5litepod-256 | tpu-v5-lite-podslice | 16x16 | ct5lp-hightpu-4t | 64 | 4 |
v4-8 | tpu-v4-podslice | 2x2x1 | ct4p-hightpu-4t | 1 | 4 |
v4-16 | tpu-v4-podslice | 2x2x2 | ct4p-hightpu-4t | 2 | 4 |
v4-32 | tpu-v4-podslice | 2x2x4 | ct4p-hightpu-4t | 4 | 4 |
v4-64 | tpu-v4-podslice | 2x4x4 | ct4p-hightpu-4t | 8 | 4 |
v4-128 | tpu-v4-podslice | 4x4x4 | ct4p-hightpu-4t | 16 | 4 |
v4-256 | tpu-v4-podslice | 4x4x8 | ct4p-hightpu-4t | 32 | 4 |
v4-512 | tpu-v4-podslice | 4x8x8 | ct4p-hightpu-4t | 64 | 4 |
v4-1024 | tpu-v4-podslice | 8x8x8 | ct4p-hightpu-4t | 128 | 4 |
v4-1536 | tpu-v4-podslice | 8x8x12 | ct4p-hightpu-4t | 192 | 4 |
v4-2048 | tpu-v4-podslice | 8x8x16 | ct4p-hightpu-4t | 256 | 4 |
v4-4096 | tpu-v4-podslice | 8x16x16 | ct4p-hightpu-4t | 512 | 4 |
v5p-8 | tpu-v5p-slice | 2x2x1 | ct5p-hightpu-4t | 1 | 4 |
v5p-16 | tpu-v5p-slice | 2x2x2 | ct5p-hightpu-4t | 2 | 4 |
v5p-32 | tpu-v5p-slice | 2x2x4 | ct5p-hightpu-4t | 4 | 4 |
v5p-64 | tpu-v5p-slice | 2x4x4 | ct5p-hightpu-4t | 8 | 4 |
v5p-128 | tpu-v5p-slice | 4x4x4 | ct5p-hightpu-4t | 16 | 4 |
v5p-256 | tpu-v5p-slice | 4x4x8 | ct5p-hightpu-4t | 32 | 4 |
v5p-384 | tpu-v5p-slice | 4x4x12 | ct5p-hightpu-4t | 48 | 4 |
v5p-512 | tpu-v5p-slice | 4x8x8 | ct5p-hightpu-4t | 64 | 4 |
v5p-640 | tpu-v5p-slice | 4x4x20 | ct5p-hightpu-4t | 80 | 4 |
v5p-768 | tpu-v5p-slice | 4x8x12 | ct5p-hightpu-4t | 96 | 4 |
v5p-896 | tpu-v5p-slice | 4x4x28 | ct5p-hightpu-4t | 112 | 4 |
v5p-1024 | tpu-v5p-slice | 8x8x8 | ct5p-hightpu-4t | 128 | 4 |
v5p-1152 | tpu-v5p-slice | 4x12x12 | ct5p-hightpu-4t | 144 | 4 |
v5p-1280 | tpu-v5p-slice | 4x8x20 | ct5p-hightpu-4t | 160 | 4 |
v5p-1408 | tpu-v5p-slice | 4x4x44 | ct5p-hightpu-4t | 176 | 4 |
v5p-1536 | tpu-v5p-slice | 8x8x12 | ct5p-hightpu-4t | 192 | 4 |
v5p-1664 | tpu-v5p-slice | 4x4x52 | ct5p-hightpu-4t | 208 | 4 |
v5p-1792 | tpu-v5p-slice | 4x8x28 | ct5p-hightpu-4t | 224 | 4 |
v5p-1920 | tpu-v5p-slice | 4x12x20 | ct5p-hightpu-4t | 240 | 4 |
v5p-2048 | tpu-v5p-slice | 8x8x16 | ct5p-hightpu-4t | 256 | 4 |
v5p-2176 | tpu-v5p-slice | 4x4x68 | ct5p-hightpu-4t | 272 | 4 |
v5p-2304 | tpu-v5p-slice | 8x12x12 | ct5p-hightpu-4t | 288 | 4 |
v5p-2432 | tpu-v5p-slice | 4x4x76 | ct5p-hightpu-4t | 304 | 4 |
v5p-2560 | tpu-v5p-slice | 8x8x20 | ct5p-hightpu-4t | 320 | 4 |
v5p-2688 | tpu-v5p-slice | 4x12x28 | ct5p-hightpu-4t | 336 | 4 |
v5p-2816 | tpu-v5p-slice | 4x8x44 | ct5p-hightpu-4t | 352 | 4 |
v5p-2944 | tpu-v5p-slice | 4x4x92 | ct5p-hightpu-4t | 368 | 4 |
v5p-3072 | tpu-v5p-slice | 8x12x16 | ct5p-hightpu-4t | 384 | 4 |
v5p-3200 | tpu-v5p-slice | 4x20x20 | ct5p-hightpu-4t | 400 | 4 |
v5p-3328 | tpu-v5p-slice | 4x8x52 | ct5p-hightpu-4t | 416 | 4 |
v5p-3456 | tpu-v5p-slice | 12x12x12 | ct5p-hightpu-4t | 432 | 4 |
v5p-3584 | tpu-v5p-slice | 8x8x28 | ct5p-hightpu-4t | 448 | 4 |
v5p-3712 | tpu-v5p-slice | 4x4x116 | ct5p-hightpu-4t | 464 | 4 |
v5p-3840 | tpu-v5p-slice | 8x12x20 | ct5p-hightpu-4t | 480 | 4 |
v5p-3968 | tpu-v5p-slice | 4x4x124 | ct5p-hightpu-4t | 496 | 4 |
v5p-4096 | tpu-v5p-slice | 8x16x16 | ct5p-hightpu-4t | 512 | 4 |
v5p-4224 | tpu-v5p-slice | 4x12x44 | ct5p-hightpu-4t | 528 | 4 |
v5p-4352 | tpu-v5p-slice | 4x8x68 | ct5p-hightpu-4t | 544 | 4 |
v5p-4480 | tpu-v5p-slice | 4x20x28 | ct5p-hightpu-4t | 560 | 4 |
v5p-4608 | tpu-v5p-slice | 12x12x16 | ct5p-hightpu-4t | 576 | 4 |
v5p-4736 | tpu-v5p-slice | 4x4x148 | ct5p-hightpu-4t | 592 | 4 |
v5p-4864 | tpu-v5p-slice | 4x8x76 | ct5p-hightpu-4t | 608 | 4 |
v5p-4992 | tpu-v5p-slice | 4x12x52 | ct5p-hightpu-4t | 624 | 4 |
v5p-5120 | tpu-v5p-slice | 8x16x20 | ct5p-hightpu-4t | 640 | 4 |
v5p-5248 | tpu-v5p-slice | 4x4x164 | ct5p-hightpu-4t | 656 | 4 |
v5p-5376 | tpu-v5p-slice | 8x12x28 | ct5p-hightpu-4t | 672 | 4 |
v5p-5504 | tpu-v5p-slice | 4x4x172 | ct5p-hightpu-4t | 688 | 4 |
v5p-5632 | tpu-v5p-slice | 8x8x44 | ct5p-hightpu-4t | 704 | 4 |
v5p-5760 | tpu-v5p-slice | 12x12x20 | ct5p-hightpu-4t | 720 | 4 |
v5p-5888 | tpu-v5p-slice | 4x8x92 | ct5p-hightpu-4t | 736 | 4 |
v5p-6016 | tpu-v5p-slice | 4x4x188 | ct5p-hightpu-4t | 752 | 4 |
v5p-6144 | tpu-v5p-slice | 12x16x16 | ct5p-hightpu-4t | 768 | 4 |
v5p-6272 | tpu-v5p-slice | 4x28x28 | ct5p-hightpu-4t | 784 | 4 |
v5p-6400 | tpu-v5p-slice | 8x20x20 | ct5p-hightpu-4t | 800 | 4 |
v5p-6528 | tpu-v5p-slice | 4x12x68 | ct5p-hightpu-4t | 816 | 4 |
v5p-6656 | tpu-v5p-slice | 8x8x52 | ct5p-hightpu-4t | 832 | 4 |
v5p-6784 | tpu-v5p-slice | 4x4x212 | ct5p-hightpu-4t | 848 | 4 |
v5p-6912 | tpu-v5p-slice | 12x12x24 | ct5p-hightpu-4t | 864 | 4 |
v5p-7040 | tpu-v5p-slice | 4x20x44 | ct5p-hightpu-4t | 880 | 4 |
v5p-7168 | tpu-v5p-slice | 8x16x28 | ct5p-hightpu-4t | 896 | 4 |
v5p-7296 | tpu-v5p-slice | 4x12x76 | ct5p-hightpu-4t | 912 | 4 |
v5p-7424 | tpu-v5p-slice | 4x8x116 | ct5p-hightpu-4t | 928 | 4 |
v5p-7552 | tpu-v5p-slice | 4x4x236 | ct5p-hightpu-4t | 944 | 4 |
v5p-7680 | tpu-v5p-slice | 12x16x20 | ct5p-hightpu-4t | 960 | 4 |
v5p-7808 | tpu-v5p-slice | 4x4x244 | ct5p-hightpu-4t | 976 | 4 |
v5p-7936 | tpu-v5p-slice | 4x8x124 | ct5p-hightpu-4t | 992 | 4 |
v5p-8064 | tpu-v5p-slice | 12x12x28 | ct5p-hightpu-4t | 1008 | 4 |
v5p-8192 | tpu-v5p-slice | 16x16x16 | ct5p-hightpu-4t | 1024 | 4 |
v5p-8320 | tpu-v5p-slice | 4x20x52 | ct5p-hightpu-4t | 1040 | 4 |
v5p-8448 | tpu-v5p-slice | 8x12x44 | ct5p-hightpu-4t | 1056 | 4 |
v5p-8704 | tpu-v5p-slice | 8x8x68 | ct5p-hightpu-4t | 1088 | 4 |
v5p-8832 | tpu-v5p-slice | 4x12x92 | ct5p-hightpu-4t | 1104 | 4 |
v5p-8960 | tpu-v5p-slice | 8x20x28 | ct5p-hightpu-4t | 1120 | 4 |
v5p-9216 | tpu-v5p-slice | 12x16x24 | ct5p-hightpu-4t | 1152 | 4 |
v5p-9472 | tpu-v5p-slice | 4x8x148 | ct5p-hightpu-4t | 1184 | 4 |
v5p-9600 | tpu-v5p-slice | 12x20x20 | ct5p-hightpu-4t | 1200 | 4 |
v5p-9728 | tpu-v5p-slice | 8x8x76 | ct5p-hightpu-4t | 1216 | 4 |
v5p-9856 | tpu-v5p-slice | 4x28x44 | ct5p-hightpu-4t | 1232 | 4 |
v5p-9984 | tpu-v5p-slice | 8x12x52 | ct5p-hightpu-4t | 1248 | 4 |
v5p-10240 | tpu-v5p-slice | 16x16x20 | ct5p-hightpu-4t | 1280 | 4 |
v5p-10368 | tpu-v5p-slice | 12x12x36 | ct5p-hightpu-4t | 1296 | 4 |
v5p-10496 | tpu-v5p-slice | 4x8x164 | ct5p-hightpu-4t | 1312 | 4 |
v5p-10752 | tpu-v5p-slice | 12x16x28 | ct5p-hightpu-4t | 1344 | 4 |
v5p-10880 | tpu-v5p-slice | 4x20x68 | ct5p-hightpu-4t | 1360 | 4 |
v5p-11008 | tpu-v5p-slice | 4x8x172 | ct5p-hightpu-4t | 1376 | 4 |
v5p-11136 | tpu-v5p-slice | 4x12x116 | ct5p-hightpu-4t | 1392 | 4 |
v5p-11264 | tpu-v5p-slice | 8x16x44 | ct5p-hightpu-4t | 1408 | 4 |
v5p-11520 | tpu-v5p-slice | 12x20x24 | ct5p-hightpu-4t | 1440 | 4 |
v5p-11648 | tpu-v5p-slice | 4x28x52 | ct5p-hightpu-4t | 1456 | 4 |
v5p-11776 | tpu-v5p-slice | 8x8x92 | ct5p-hightpu-4t | 1472 | 4 |
v5p-11904 | tpu-v5p-slice | 4x12x124 | ct5p-hightpu-4t | 1488 | 4 |
v5p-12032 | tpu-v5p-slice | 4x8x188 | ct5p-hightpu-4t | 1504 | 4 |
v5p-12160 | tpu-v5p-slice | 4x20x76 | ct5p-hightpu-4t | 1520 | 4 |
v5p-12288 | tpu-v5p-slice | 16x16x24 | ct5p-hightpu-4t | 1536 | 4 |
v5p-13824 | tpu-v5p-slice | 12x24x24 | ct5p-hightpu-4t | 1728 | 4 |
v5p-17920 | tpu-v5p-slice | 16x20x28 | ct5p-hightpu-4t | 2240 | 4 |
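To show how a row of the table maps to configuration, the sketch below defines a `tpu_node_pools` value for a single v4-32 slice. It uses only the `zones` and `tpu_type` attributes from the TPU example earlier in this document. The pool name is hypothetical, and the zone is copied from that example rather than verified for your project.

```hcl
locals {
  tpu_node_pools = {
    # v4-32 row of the table:
    #   slice type tpu-v4-podslice, topology 2x2x4 (16 chips),
    #   4 VMs of type ct4p-hightpu-4t with 4 chips each.
    tpu-v4-32-podslice = {
      zones    = ["us-central2-b"]
      tpu_type = "v4-32"
    }
  }
}
```

Pass the value to the module with `tpu_node_pools = local.tpu_node_pools`. For any row, the number of VMs equals the product of the topology dimensions divided by the number of chips in a VM, which is a quick way to estimate the quota a slice needs.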
Outputs
Name | Description |
---|---|
node_pool_sa_email | The email address of the node pool service account |
cluster_name | The name of the GKE cluster |
cluster_region | The region of the GKE cluster |
gcs_buckets | The names and locations of the created GCS buckets |
artifact_registry_id | The full ID of the created Artifact Registry |
artifact_registry_image_path | Artifact Registry path |
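Module outputs are read from the calling configuration as `module.<name>.<output>`. The sketch below re-exports three of the outputs listed above, assuming the module was instantiated as `gpu-training-cluster` as in the GPU example. The root-level output names are hypothetical.

```hcl
# Re-export selected module outputs from the root configuration.
output "gke_cluster_name" {
  description = "Name of the GKE cluster created by the module"
  value       = module.gpu-training-cluster.cluster_name
}

output "gke_cluster_region" {
  description = "Region of the GKE cluster created by the module"
  value       = module.gpu-training-cluster.cluster_region
}

output "node_pool_service_account" {
  description = "Email address of the node pool service account"
  value       = module.gpu-training-cluster.node_pool_sa_email
}
```

After `terraform apply`, a command such as `terraform output -raw gke_cluster_name` prints a single value, which is convenient for scripting follow-up steps like fetching cluster credentials.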