Kubernetes in GCP

Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.

SIE runs as a router-worker architecture on Kubernetes:

                               ┌────────────────────────────────────────────┐
                               │                GKE Cluster                 │
                               │                                            │
  Request ────────────────────►│  ┌──────────┐      ┌──────────────────┐    │
  (X-SIE-MACHINE-PROFILE: l4)  │  │  Router  │─────►│  L4 Worker Pool  │    │
                               │  │ (2 pods) │      │  (StatefulSet)   │    │
                               │  └────┬─────┘      └──────────────────┘    │
                               │       │                                    │
                               │       │            ┌──────────────────┐    │
                               │       └───────────►│ A100 Worker Pool │    │
                               │                    │  (StatefulSet)   │    │
                               │                    └──────────────────┘    │
                               │                                            │
                               │   KEDA ◄──── Prometheus (queue depth)      │
                               └────────────────────────────────────────────┘

Components:

  • Router - Stateless proxy that routes requests to GPU-specific worker pools
  • Worker Pools - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
  • KEDA - Scales worker pools from zero based on queue depth metrics
  • Prometheus - Provides metrics for autoscaling decisions

The router is a stateless FastAPI application that handles GPU-aware routing:

Feature           Description
GPU Routing       Routes requests to the appropriate GPU pool via the X-SIE-MACHINE-PROFILE header
Pool Routing      Supports tenant isolation via the X-SIE-Pool header
Model Affinity    Prefers workers with the requested model already loaded
Load Balancing    Distributes requests across healthy workers
202 Responses     Returns Retry-After when GPU capacity is provisioning

The router runs as a Deployment with 2+ replicas for high availability.

router:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"

Each GPU type runs as a separate StatefulSet with persistent storage for model caching.

Pool         GPU            VRAM    Use Case
l4           NVIDIA L4      24GB    Standard inference, best price/performance
a100-40gb    NVIDIA A100    40GB    Large models, high throughput
a100-80gb    NVIDIA A100    80GB    Very large models (7B+ parameters)

Worker configuration:

workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0   # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"

Workers use a 50Gi persistent volume for model cache. Models load on first request.
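Because models load on first request, a freshly scaled worker answers its first call slowly while it downloads and loads the model into the cache volume. Below is a minimal warm-up sketch using the SDK calls shown later on this page; running a warm-up pass at deploy time is a suggestion, not a built-in feature.

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Send one tiny request per (model, GPU pool) pair so workers populate
# their persistent model cache before production traffic arrives.
client = SIEClient("http://sie.example.com")
for model, gpu in [("BAAI/bge-m3", "l4")]:
    client.encode(model, Item(text="warm-up"), gpu=gpu)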


Specify the target GPU type using the X-SIE-MACHINE-PROFILE header or the SDK gpu parameter.

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'
The SDK equivalent:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to the L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4",
)

# Route to the A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb",
)
GPU Type            Header Value    Machine Type
NVIDIA L4           l4              g2-standard-8
NVIDIA T4           t4              n1-standard-8
NVIDIA A100 40GB    a100-40gb       a2-highgpu-1g
NVIDIA A100 80GB    a100-80gb       a2-ultragpu-1g
NVIDIA H100 80GB    h100-80gb       a3-highgpu-1g

Resource pools provide tenant isolation by reserving dedicated workers.

Create a pool explicitly with the SDK (workers are provisioned lazily, on the first request that targets the pool):

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")

Use the X-SIE-Pool header:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'

The SDK handles lease renewal automatically. Pools are garbage collected after inactivity.
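For short-lived batch workloads, reserved workers can be released as soon as the job completes rather than waiting for garbage collection. A minimal sketch using only the create_pool/encode/delete_pool calls shown above; the pool name batch-job is illustrative.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")
client.create_pool("batch-job", {"l4": 2})
try:
    result = client.encode(
        "BAAI/bge-m3",
        Item(text="Hello world"),
        gpu="batch-job/l4",
    )
finally:
    # Release the reserved workers immediately instead of waiting for GC.
    client.delete_pool("batch-job")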


KEDA scales worker pools based on queue depth metrics from Prometheus.

When no workers are running and a request arrives:

  1. Router returns 202 Accepted with Retry-After: 120 header
  2. Router records pending demand metric
  3. KEDA detects queue depth > activation threshold
  4. GKE provisions GPU node (60-120 seconds)
  5. Worker pod starts and registers with router
  6. Client retries and the request succeeds (see the retry sketch below)
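The 202 contract matters for raw HTTP callers. Below is a minimal client-side retry sketch against the encode endpoint shown earlier, using the third-party requests library; the helper name and retry budget are illustrative.

import time
import requests

def encode_with_retry(url, payload, headers, max_attempts=5):
    """POST to the encode endpoint, waiting out 202 responses while GPU capacity provisions."""
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload, headers=headers, timeout=30)
        if resp.status_code != 202:
            resp.raise_for_status()
            return resp.json()
        # Router is still provisioning capacity; honor its suggested delay.
        time.sleep(int(resp.headers.get("Retry-After", "120")))
    raise TimeoutError("GPU capacity did not become available in time")

result = encode_with_retry(
    "http://sie.example.com/v1/encode/BAAI/bge-m3",
    {"items": [{"text": "Hello world"}]},
    {"X-SIE-MACHINE-PROFILE": "l4"},
)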
The corresponding Helm autoscaling values:

autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 600          # Wait 10 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable

GPU nodes scale to zero during idle periods. Configure cooldown based on your traffic patterns:

  • Consistent traffic: Lower cooldown (300s) for responsive scaling
  • Bursty traffic: Higher cooldown (900s) to avoid thrashing
  • Dev/test: Use spot instances for 60-70% cost savings

The Terraform module provisions a complete GKE cluster with GPU node pools.

Prerequisites:

  1. GCP project with billing enabled
  2. GPU quota for your region (check with gcloud compute regions describe REGION)
  3. Required APIs enabled:
    • container.googleapis.com
    • compute.googleapis.com
cd deploy/terraform/gcp/examples/dev-l4-spot
# Set project ID
export TF_VAR_project_id="your-project-id"
# Initialize Terraform
terraform init
# Review changes
terraform plan
# Deploy cluster (15-20 minutes)
terraform apply
# Get credentials
$(terraform output -raw kubectl_command)
# Verify cluster
kubectl get nodes

Key configuration options:

Variable          Default        Description
project_id        (required)     GCP project ID
region            (required)     GKE cluster region
cluster_name      sie-cluster    Name of the GKE cluster
gpu_node_pools    L4 pool        List of GPU node pool configurations
install_keda      true           Enable KEDA for autoscaling
install_sie       true           Deploy SIE application
sie_bundle        default        SIE bundle (default, legacy, gte-qwen2)
Example module configuration:

module "sie_gke" {
  source       = "../../"
  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1 # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]

  enable_workload_identity = true
  gcs_bucket_name          = "sie-model-cache"
}

Deploy SIE to an existing GKE cluster using Helm.

Prerequisites:

  • GKE cluster with GPU node pools
  • KEDA installed
  • Prometheus for metrics
# Add SIE repository (if published)
helm repo add sie https://charts.superlinked.com
helm repo update
# Or install from local chart
helm install sie ./deploy/helm/sie-cluster \
  --namespace sie \
  --create-namespace \
  --values custom-values.yaml
custom-values.yaml
router:
  replicas: 3

workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models
  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20

autoscaling:
  enabled: true
  cooldownPeriod: 300

ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls

auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie

serviceMonitor:
  enabled: true
Apply configuration changes with helm upgrade:

helm upgrade sie ./deploy/helm/sie-cluster \
  --namespace sie \
  --values custom-values.yaml
Verify the deployment:
# Check pods
kubectl get pods -n sie
# Check router logs
kubectl logs -n sie -l app.kubernetes.io/component=router
# Test endpoint (SDK)
python - <<'PY'
from sie_sdk import SIEClient
client = SIEClient("https://sie.example.com", api_key="YOUR_TOKEN")
print(client.list_models())
client.close()
PY
  • Ingress controller: use ingress-nginx for public or private access.
  • Public vs private: set ingress-nginx service annotations for internal LBs on GKE.
  • Auth options:
    • OIDC (oauth2-proxy) with external IdP or Dex.
    • Static token (router-level) for OSS/self-hosted without IdP.
    • No auth + private ingress (internal LB).
module "sie_gke" {
  # Turnkey ingress controller
  install_ingress_nginx = true

  # Private LB example
  ingress_nginx_service_annotations = {
    "cloud.google.com/load-balancer-type" = "Internal"
  }

  # Router ingress + auth
  sie_ingress_enabled         = true
  sie_ingress_host            = "sie.example.com"
  sie_ingress_tls_enabled     = true
  sie_ingress_tls_secret_name = "sie-tls"
  sie_auth_enabled            = true
  sie_auth_oidc_issuer_url    = "https://auth.example.com/realms/sie"
  sie_auth_secret_name        = "oauth2-proxy"

  # Static token mode (alternative to OIDC)
  sie_router_auth_mode        = "static"
  sie_router_auth_secret_name = "sie-router-auth"
}

Debug-only access via port-forward is still possible, but production paths should use ingress.
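For ad-hoc debugging, a minimal sketch assuming a local kubectl port-forward to the router Service; the Service name sie-router and port 80 are assumptions, so check kubectl get svc -n sie first.

# Run first (Service name and port are assumptions):
#   kubectl port-forward -n sie svc/sie-router 8080:80
from sie_sdk import SIEClient

client = SIEClient("http://localhost:8080")
print(client.list_models())
client.close()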