Kubernetes in GCP

Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.

SIE runs as a router-worker architecture on Kubernetes:

                               ┌────────────────────────────────────────────┐
                               │                GKE Cluster                 │
                               │                                            │
  Request ────────────────────►│  ┌──────────┐      ┌──────────────────┐    │
  (X-SIE-MACHINE-PROFILE: l4)  │  │  Router  │─────►│  L4 Worker Pool  │    │
                               │  │ (2 pods) │      │  (StatefulSet)   │    │
                               │  └────┬─────┘      └──────────────────┘    │
                               │       │                                    │
                               │       │            ┌──────────────────┐    │
                               │       └───────────►│ A100 Worker Pool │    │
                               │                    │  (StatefulSet)   │    │
                               │                    └──────────────────┘    │
                               │                                            │
                               │   KEDA ◄──── Prometheus (queue depth)      │
                               └────────────────────────────────────────────┘

Components:

  • Router - Stateless proxy that routes requests to GPU-specific worker pools
  • Worker Pools - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
  • KEDA - Scales worker pools from zero based on queue depth metrics
  • Prometheus - Provides metrics for autoscaling decisions

The router is a stateless FastAPI application that handles GPU-aware routing:

Feature           Description
GPU Routing       Routes requests to the appropriate GPU pool via the X-SIE-MACHINE-PROFILE header
Pool Routing      Supports tenant isolation via the X-SIE-Pool header
Model Affinity    Prefers workers with the requested model already loaded
Load Balancing    Distributes requests across healthy workers
202 Responses     Returns Retry-After when GPU capacity is provisioning

The router runs as a Deployment with 2+ replicas for high availability.

router:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"

Each GPU type runs as a separate StatefulSet with persistent storage for model caching.

Pool         GPU            VRAM    Use Case
l4           NVIDIA L4      24GB    Standard inference, best price/performance
a100-40gb    NVIDIA A100    40GB    Large models, high throughput
a100-80gb    NVIDIA A100    80GB    Very large models (7B+ parameters)

Worker configuration:

workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0   # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"

Workers use a 50Gi persistent volume for model cache. Models load on first request.
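Because models load on first request, a freshly scaled worker answers its first call slowly while it downloads and loads the model into the cache volume. Below is a minimal warm-up sketch using the SDK calls shown later on this page; running a warm-up pass at deploy time is a suggestion, not a built-in feature.

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Send one tiny request per (model, GPU pool) pair so workers populate
# their persistent model cache before production traffic arrives.
client = SIEClient("http://sie.example.com")
for model, gpu in [("BAAI/bge-m3", "l4")]:
    client.encode(model, Item(text="warm-up"), gpu=gpu)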


Specify the target GPU type using the X-SIE-MACHINE-PROFILE header or the SDK gpu parameter.

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'
The SDK equivalent:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to the L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4",
)

# Route to the A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb",
)
GPU Type            Header Value    Machine Type
NVIDIA L4           l4              g2-standard-8
NVIDIA T4           t4              n1-standard-8
NVIDIA A100 40GB    a100-40gb       a2-highgpu-1g
NVIDIA A100 80GB    a100-80gb       a2-ultragpu-1g
NVIDIA H100 80GB    h100-80gb       a3-highgpu-1g

Resource pools provide tenant isolation by reserving dedicated workers.

Create a pool explicitly with the SDK (workers are provisioned lazily, on the first request that targets the pool):

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")

Use the X-SIE-Pool header:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'

The SDK handles lease renewal automatically. Pools are garbage collected after inactivity.
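For short-lived batch workloads, reserved workers can be released as soon as the job completes rather than waiting for garbage collection. A minimal sketch using only the create_pool/encode/delete_pool calls shown above; the pool name batch-job is illustrative.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")
client.create_pool("batch-job", {"l4": 2})
try:
    result = client.encode(
        "BAAI/bge-m3",
        Item(text="Hello world"),
        gpu="batch-job/l4",
    )
finally:
    # Release the reserved workers immediately instead of waiting for GC.
    client.delete_pool("batch-job")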


KEDA scales worker pools based on queue depth metrics from Prometheus.

When no workers are running and a request arrives:

  1. Router returns 202 Accepted with Retry-After: 120 header
  2. Router records pending demand metric
  3. KEDA detects queue depth > activation threshold
  4. GKE provisions GPU node (60-120 seconds)
  5. Worker pod starts and registers with router
  6. Client retries and the request succeeds (see the retry sketch below)
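The 202 contract matters for raw HTTP callers. Below is a minimal client-side retry sketch against the encode endpoint shown earlier, using the third-party requests library; the helper name and retry budget are illustrative.

import time
import requests

def encode_with_retry(url, payload, headers, max_attempts=5):
    """POST to the encode endpoint, waiting out 202 responses while GPU capacity provisions."""
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload, headers=headers, timeout=30)
        if resp.status_code != 202:
            resp.raise_for_status()
            return resp.json()
        # Router is still provisioning capacity; honor its suggested delay.
        time.sleep(int(resp.headers.get("Retry-After", "120")))
    raise TimeoutError("GPU capacity did not become available in time")

result = encode_with_retry(
    "http://sie.example.com/v1/encode/BAAI/bge-m3",
    {"items": [{"text": "Hello world"}]},
    {"X-SIE-MACHINE-PROFILE": "l4"},
)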
The corresponding Helm autoscaling values:

autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 600          # Wait 10 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable

GPU nodes scale to zero during idle periods. Configure cooldown based on your traffic patterns:

  • Consistent traffic: Lower cooldown (300s) for responsive scaling
  • Bursty traffic: Higher cooldown (900s) to avoid thrashing
  • Dev/test: Use spot instances for 60-70% cost savings

The Terraform module provisions a complete GKE cluster with GPU node pools.

Prerequisites:

  1. GCP project with billing enabled
  2. GPU quota for your region (check with gcloud compute regions describe REGION)
  3. Required APIs enabled:
    • container.googleapis.com
    • compute.googleapis.com
cd deploy/terraform/gcp/examples/dev-l4-spot
# Set project ID
export TF_VAR_project_id="your-project-id"
# Initialize Terraform
terraform init
# Review changes
terraform plan
# Deploy cluster (15-20 minutes)
terraform apply
# Get credentials
$(terraform output -raw kubectl_command)
# Verify cluster
kubectl get nodes

Key configuration options:

Variable          Default        Description
project_id        (required)     GCP project ID
region            (required)     GKE cluster region
cluster_name      sie-cluster    Name of the GKE cluster
gpu_node_pools    L4 pool        List of GPU node pool configurations
install_keda      true           Enable KEDA for autoscaling
install_sie       true           Deploy SIE application
sie_bundle        default        SIE bundle (default, legacy, gte-qwen2)
Example module configuration:

module "sie_gke" {
  source       = "../../"
  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1 # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]

  enable_workload_identity = true
  gcs_bucket_name          = "sie-model-cache"
}

Deploy SIE to an existing GKE cluster using Helm.

Prerequisites:

  • GKE cluster with GPU node pools
  • KEDA installed
  • Prometheus for metrics
# Add SIE repository (if published)
helm repo add sie https://charts.superlinked.com
helm repo update
# Or install from local chart
helm install sie ./deploy/helm/sie-cluster \
  --namespace sie \
  --create-namespace \
  --values custom-values.yaml
custom-values.yaml
router:
  replicas: 3

workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models
  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20

autoscaling:
  enabled: true
  cooldownPeriod: 300

ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls

auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie

serviceMonitor:
  enabled: true
Apply configuration changes with helm upgrade:

helm upgrade sie ./deploy/helm/sie-cluster \
  --namespace sie \
  --values custom-values.yaml
Verify the deployment:
# Check pods
kubectl get pods -n sie
# Check router logs
kubectl logs -n sie -l app.kubernetes.io/component=router
# Test endpoint (SDK)
python - <<'PY'
from sie_sdk import SIEClient
client = SIEClient("https://sie.example.com", api_key="YOUR_TOKEN")
print(client.list_models())
client.close()
PY
  • Ingress controller: use ingress-nginx for public or private access.
  • Public vs private: set ingress-nginx service annotations for internal LBs on GKE.
  • Auth options:
    • OIDC (oauth2-proxy) with external IdP or Dex.
    • Static token (router-level) for OSS/self-hosted without IdP.
    • No auth + private ingress (internal LB).
module "sie_gke" {
  # Turnkey ingress controller
  install_ingress_nginx = true

  # Private LB example
  ingress_nginx_service_annotations = {
    "cloud.google.com/load-balancer-type" = "Internal"
  }

  # Router ingress + auth
  sie_ingress_enabled         = true
  sie_ingress_host            = "sie.example.com"
  sie_ingress_tls_enabled     = true
  sie_ingress_tls_secret_name = "sie-tls"
  sie_auth_enabled            = true
  sie_auth_oidc_issuer_url    = "https://auth.example.com/realms/sie"
  sie_auth_secret_name        = "oauth2-proxy"

  # Static token mode (alternative to OIDC)
  sie_router_auth_mode        = "static"
  sie_router_auth_secret_name = "sie-router-auth"
}

Debug-only access via port-forward is still possible, but production paths should use ingress.
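For ad-hoc debugging, a minimal sketch assuming a local kubectl port-forward to the router Service; the Service name sie-router and port 80 are assumptions, so check kubectl get svc -n sie first.

# Run first (Service name and port are assumptions):
#   kubectl port-forward -n sie svc/sie-router 8080:80
from sie_sdk import SIEClient

client = SIEClient("http://localhost:8080")
print(client.list_models())
client.close()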