Kubernetes in GCP
Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.
Architecture
SIE runs as a router-worker architecture on Kubernetes:
```
┌────────────────────────────────────────────────────────┐
│                      GKE Cluster                        │
│                                                         │
│   Request                                               │
│   (X-SIE-MACHINE-PROFILE: l4)                           │
│      │                                                  │
│      ▼                                                  │
│   ┌──────────┐          ┌──────────────────┐            │
│   │  Router  │─────────►│  L4 Worker Pool  │            │
│   │ (2 pods) │          │  (StatefulSet)   │            │
│   └──────────┘          └──────────────────┘            │
│      │                                                  │
│      │                  ┌──────────────────┐            │
│      └─────────────────►│ A100 Worker Pool │            │
│                         │  (StatefulSet)   │            │
│                         └──────────────────┘            │
│                                                         │
│   KEDA ◄──── Prometheus (queue depth)                   │
└─────────────────────────────────────────────────────────┘
```

Components:
- Router - Stateless proxy that routes requests to GPU-specific worker pools
- Worker Pools - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
- KEDA - Scales worker pools from zero based on queue depth metrics
- Prometheus - Provides metrics for autoscaling decisions
Router
The router is a stateless FastAPI application that handles GPU-aware routing:
| Feature | Description |
|---|---|
| GPU Routing | Routes requests to appropriate GPU pool via X-SIE-MACHINE-PROFILE header |
| Pool Routing | Supports tenant isolation via X-SIE-Pool header |
| Model Affinity | Prefers workers with the requested model already loaded |
| Load Balancing | Distributes requests across healthy workers |
| 202 Responses | Returns Retry-After when GPU capacity is provisioning |
The router runs as a Deployment with 2+ replicas for high availability.
```yaml
router:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
```

Worker Pools
Each GPU type runs as a separate StatefulSet with persistent storage for model caching.
| Pool | GPU | VRAM | Use Case |
|---|---|---|---|
| l4 | NVIDIA L4 | 24GB | Standard inference, best price/performance |
| a100-40gb | NVIDIA A100 | 40GB | Large models, high throughput |
| a100-80gb | NVIDIA A100 | 80GB | Very large models (7B+ parameters) |
Worker configuration:
```yaml
workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0   # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
```

Workers use a 50Gi persistent volume for model cache. Models load on first request.
GPU Selection
Specify the target GPU type using the X-SIE-MACHINE-PROFILE header or SDK parameter.
HTTP Header
```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'
```

SDK Parameter
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4"
)

# Route to A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb"
)
```

Available GPU Types
| GPU Type | Header Value | Machine Type |
|---|---|---|
| NVIDIA L4 | l4 | g2-standard-8 |
| NVIDIA T4 | t4 | n1-standard-8 |
| NVIDIA A100 40GB | a100-40gb | a2-highgpu-1g |
| NVIDIA A100 80GB | a100-80gb | a2-ultragpu-1g |
| NVIDIA H100 80GB | h100-80gb | a3-highgpu-1g |
Resource Pools
Resource pools provide tenant isolation by reserving dedicated workers.
Create a Pool via SDK
Create a pool explicitly with the SDK (it can also be created lazily on the first request):
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4"  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")
```

Route to Pool via HTTP
Use the X-SIE-Pool header:
```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'
```

The SDK handles lease renewal automatically. Pools are garbage collected after inactivity.
KEDA Autoscaling
KEDA scales worker pools based on queue depth metrics from Prometheus.
Scale-from-Zero
When no workers are running and a request arrives:
- Router returns 202 Accepted with a Retry-After: 120 header
- Router records pending demand metric
- KEDA detects queue depth > activation threshold
- GKE provisions GPU node (60-120 seconds)
- Worker pod starts and registers with router
- Client retries and the request succeeds (see the retry sketch below)
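If you call the HTTP API directly rather than through the SDK, your client has to honor the 202 response itself. Below is a minimal sketch using the requests library; the encode_with_retry helper and its max_attempts argument are illustrative, not part of SIE, and the endpoint and headers follow the curl examples above.

```python
import time

import requests


def encode_with_retry(url, payload, headers, max_attempts=5):
    """Retry while the router reports that GPU capacity is provisioning."""
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload, headers=headers, timeout=30)
        if resp.status_code == 202:
            # Scale-from-zero in progress: wait as long as the router suggests.
            time.sleep(int(resp.headers.get("Retry-After", "120")))
            continue
        resp.raise_for_status()
        return resp.json()
    raise TimeoutError("GPU capacity did not become available in time")


result = encode_with_retry(
    "http://sie.example.com/v1/encode/BAAI/bge-m3",
    {"items": [{"text": "Hello world"}]},
    {"X-SIE-MACHINE-PROFILE": "l4"},
)
print(result)
```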
Configuration
```yaml
autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 600          # Wait 10 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable
```

Cost Optimization
GPU nodes scale to zero during idle periods. Configure cooldown based on your traffic patterns:
- Consistent traffic: Lower cooldown (300s) for responsive scaling
- Bursty traffic: Higher cooldown (900s) to avoid thrashing
- Dev/test: Use spot instances for 60-70% cost savings
Terraform Setup
The Terraform module provisions a complete GKE cluster with GPU node pools.
Prerequisites
- GCP project with billing enabled
- GPU quota for your region (check with gcloud compute regions describe REGION)
- Required APIs enabled:
  - container.googleapis.com
  - compute.googleapis.com
Initialize
```bash
cd deploy/terraform/gcp/examples/dev-l4-spot

# Set project ID
export TF_VAR_project_id="your-project-id"

# Initialize Terraform
terraform init
```

Plan and Apply
```bash
# Review changes
terraform plan

# Deploy cluster (15-20 minutes)
terraform apply
```

Configure kubectl
```bash
# Get credentials
$(terraform output -raw kubectl_command)

# Verify cluster
kubectl get nodes
```

Variables
Key configuration options:
| Variable | Default | Description |
|---|---|---|
| project_id | (required) | GCP project ID |
| region | (required) | GKE cluster region |
| cluster_name | sie-cluster | Name of the GKE cluster |
| gpu_node_pools | L4 pool | List of GPU node pool configurations |
| install_keda | true | Enable KEDA for autoscaling |
| install_sie | true | Deploy SIE application |
| sie_bundle | default | SIE bundle (default, legacy, gte-qwen2) |
Example: Production Multi-GPU
```hcl
module "sie_gke" {
  source = "../../"

  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1   # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]

  enable_workload_identity = true
  gcs_bucket_name          = "sie-model-cache"
}
```

Helm Installation
Deploy SIE to an existing GKE cluster using Helm.
Prerequisites
- GKE cluster with GPU node pools
- KEDA installed
- Prometheus for metrics
Install
```bash
# Add SIE repository (if published)
helm repo add sie https://charts.superlinked.com
helm repo update

# Or install from local chart
helm install sie ./deploy/helm/sie-cluster \
  --namespace sie \
  --create-namespace \
  --values custom-values.yaml
```

Custom Values
```yaml
router:
  replicas: 3

workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models
  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20

autoscaling:
  enabled: true
  cooldownPeriod: 300

ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls

auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie

serviceMonitor:
  enabled: true
```

Upgrade
```bash
helm upgrade sie ./deploy/helm/sie-cluster \
  --namespace sie \
  --values custom-values.yaml
```

Verify
```bash
# Check pods
kubectl get pods -n sie

# Check router logs
kubectl logs -n sie -l app.kubernetes.io/component=router

# Test endpoint (SDK)
python - <<'PY'
from sie_sdk import SIEClient
client = SIEClient("https://sie.example.com", api_key="YOUR_TOKEN")
print(client.list_models())
client.close()
PY
```

Access + Auth
- Ingress controller: use ingress-nginx for public or private access.
- Public vs private: set ingress-nginx service annotations for internal LBs on GKE.
- Auth options:
  - OIDC (oauth2-proxy) with external IdP or Dex.
  - Static token (router-level) for OSS/self-hosted without IdP.
  - No auth + private ingress (internal LB).
module "sie_gke" { # Turnkey ingress controller install_ingress_nginx = true
# Private LB example ingress_nginx_service_annotations = { "cloud.google.com/load-balancer-type" = "Internal" }
# Router ingress + auth sie_ingress_enabled = true sie_ingress_host = "sie.example.com" sie_ingress_tls_enabled = true sie_ingress_tls_secret_name = "sie-tls"
sie_auth_enabled = true sie_auth_oidc_issuer_url = "https://auth.example.com/realms/sie" sie_auth_secret_name = "oauth2-proxy"
# Static token mode (alternative to OIDC) sie_router_auth_mode = "static" sie_router_auth_secret_name = "sie-router-auth"}Debug-only access via port-forward is still possible, but production paths should use ingress.
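For clients, static token mode follows the same pattern as the Verify step above. A minimal sketch, assuming the static token is passed via the SDK's api_key parameter and that its value matches what you stored in the router's auth secret:

```python
from sie_sdk import SIEClient

# Authenticate against a router running in static token mode.
# Replace YOUR_STATIC_TOKEN with the token configured for the router.
client = SIEClient("https://sie.example.com", api_key="YOUR_STATIC_TOKEN")
print(client.list_models())
client.close()
```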
What’s Next
- Monitoring & Observability - metrics, logging, and dashboards