Hardware & Capacity

Choosing the right hardware impacts cost, latency, and throughput. This guide covers GPU selection, memory planning, and capacity estimation.

SIE supports NVIDIA GPUs via CUDA. Choose based on your model size and throughput requirements.

| GPU | VRAM | Best For | GCP Machine Type |
| --- | --- | --- | --- |
| NVIDIA L4 | 24 GB | Most embedding models, cost-effective inference | g2-standard-8 (1x), g2-standard-24 (2x) |
| NVIDIA A100 40GB | 40 GB | Large models, high throughput | a2-highgpu-1g |
| NVIDIA A100 80GB | 80 GB | Very large models (7B+), multi-model serving | a2-ultragpu-1g |
| NVIDIA H100 | 80 GB | Highest throughput, latest generation | a3-highgpu-1g |
| NVIDIA T4 | 16 GB | Small models, development, testing | n1-standard-8 + T4 |

L4 is recommended for most production workloads. It offers the best price-performance ratio for embedding models under 4B parameters.
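
The same table can be applied programmatically when planning deployments. The helper below is illustrative only (pick_gpu is not part of SIE); it returns the smallest production GPU from the table whose VRAM covers an estimated footprint.

# Illustrative helper: pick the smallest production GPU from the table above.
# T4 is omitted because this guide recommends it for development and testing only.

GPUS = [  # (name, VRAM in GB), smallest first
    ("NVIDIA L4", 24),
    ("NVIDIA A100 40GB", 40),
    ("NVIDIA A100 80GB", 80),
    ("NVIDIA H100", 80),
]

def pick_gpu(required_vram_gb: float) -> str:
    for name, vram in GPUS:
        if vram >= required_vram_gb:
            return name
    raise ValueError("No single GPU is large enough; consider multi-GPU machine types")

print(pick_gpu(8))   # a ~4B embedding model -> "NVIDIA L4"
print(pick_gpu(36))  # larger footprint      -> "NVIDIA A100 40GB"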

GPU memory usage depends on model size, batch size, and sequence length.

| Category | Parameters | Approximate VRAM | Example Models |
| --- | --- | --- | --- |
| Small | < 100M | 0.5-1 GB | all-MiniLM-L6-v2 |
| Medium | 100M-500M | 1-3 GB | bge-m3, e5-large-v2, multilingual-e5-large |
| Large | 500M-2B | 3-8 GB | gte-Qwen2-1.5B-instruct, stella_en_1.5B_v5 |
| XLarge | 2B-8B | 8-20 GB | Qwen3-Embedding-4B, e5-mistral-7b-instruct, NV-Embed-v2 |

Beyond model weights, inference requires memory for:

  • Activations: Proportional to batch size and sequence length
  • KV cache: For transformer attention (significant for long sequences)
  • CUDA context: ~500MB-1GB fixed overhead per GPU

Rule of thumb: Reserve 2-3x the model weight size for safe operation with batching.
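
As a rough sketch of that rule of thumb (assuming fp16 weights at 2 bytes per parameter; the function and its defaults are illustrative, not part of SIE):

# Illustrative estimate: fp16 weights plus the 2-3x batching headroom and
# the ~1 GB fixed CUDA context overhead noted above.

def plan_vram_gb(params_billions: float, bytes_per_param: int = 2):
    weights = params_billions * bytes_per_param  # fp16 weights
    reserve = weights * 2.5 + 1.0                # 2-3x rule + ~1 GB CUDA context
    return weights, reserve

w, r = plan_vram_gb(4.0)  # e.g. a ~4B model such as Qwen3-Embedding-4B
print(f"weights ~{w:.0f} GB, reserve ~{r:.0f} GB")  # weights ~8 GB, reserve ~21 GB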

SIE loads models on-demand and uses LRU eviction when memory pressure exceeds 85%:

# From memory.py - default eviction threshold
pressure_threshold: float = 0.85 # Evict LRU model above 85%
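
The snippet below is a simplified sketch of that behaviour, not SIE's actual loader: models are tracked in least-recently-used order and the coldest one is evicted once estimated pressure crosses the threshold.

# Simplified LRU-eviction sketch; SIE's real loader in memory.py is more involved.
from collections import OrderedDict

class ModelCache:
    def __init__(self, capacity_gb: float, pressure_threshold: float = 0.85):
        self.capacity_gb = capacity_gb
        self.pressure_threshold = pressure_threshold
        self.models = OrderedDict()  # name -> size_gb, least recently used first

    def touch(self, name: str, size_gb: float) -> None:
        """Load a model on demand, evicting LRU entries above the threshold."""
        self.models[name] = size_gb
        self.models.move_to_end(name)  # mark as most recently used
        while self._pressure() > self.pressure_threshold and len(self.models) > 1:
            evicted, _ = self.models.popitem(last=False)  # drop least recently used
            print(f"evicting {evicted}")

    def _pressure(self) -> float:
        return sum(self.models.values()) / self.capacity_gb

cache = ModelCache(capacity_gb=24)    # e.g. a single L4
cache.touch("e5-mistral-7b", 15)
cache.touch("bge-m3", 2)
cache.touch("Qwen3-Embedding-4B", 9)  # pushes pressure past 85% -> evicts e5-mistral-7b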

For multi-model deployments, provision VRAM for:

  • Your largest model (always loaded)
  • 1-2 additional frequently-used models
  • Headroom for batch processing

Example: Serving bge-m3 (~2GB) and e5-mistral-7b (~15GB) together requires at least 24GB VRAM.
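
A quick way to sanity-check a multi-model deployment against that checklist (the helper and its 4 GB batching headroom are illustrative assumptions, using the rough model sizes quoted in this guide):

# Illustrative check: resident models + CUDA context + batching headroom
# must fit within the GPU's VRAM.

def fits(model_sizes_gb, vram_gb, cuda_overhead_gb=1.0, batch_headroom_gb=4.0):
    needed = sum(model_sizes_gb) + cuda_overhead_gb + batch_headroom_gb
    return needed, needed <= vram_gb

print(fits([2, 15], vram_gb=24))     # bge-m3 + e5-mistral-7b -> (22.0, True): fits on an L4
print(fits([2, 15, 9], vram_gb=24))  # add a ~4B model        -> (31.0, False): needs a larger GPU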

Throughput varies by model architecture, sequence length, and hardware. Use these estimates as starting points.

Based on actual measurements with 16 concurrent requests:

| Model Type | Example | Corpus Throughput | Query Throughput |
| --- | --- | --- | --- |
| Small encoder | all-MiniLM-L6-v2 | ~50,000 tokens/sec | ~5,000 tokens/sec |
| Medium encoder | bge-m3 | ~30,000 tokens/sec | ~3,000 tokens/sec |
| Large LLM-based | Qwen3-Embedding-4B | ~5,000 tokens/sec | ~700 tokens/sec |
| XLarge LLM-based | e5-mistral-7b | ~3,000 tokens/sec | ~400 tokens/sec |

Corpus vs Query: Corpus encoding processes long document sequences, which batch efficiently and keep the GPU busy, so per-token throughput is high. Query encoding processes short search queries, where per-request overhead dominates and tokens/sec is correspondingly lower.
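
To turn the table's tokens/sec figures into wall-clock estimates, multiply corpus size by average document length and divide by throughput. The inputs below are example assumptions, not measurements:

# Rough corpus-encoding time from the throughput table above.
# Document count and average length are example inputs.

def corpus_hours(num_docs: int, avg_tokens_per_doc: int, tokens_per_sec: float) -> float:
    return num_docs * avg_tokens_per_doc / tokens_per_sec / 3600

# 10M documents at ~400 tokens each on a single L4 running bge-m3 (~30k tokens/sec)
print(round(corpus_hours(10_000_000, 400, 30_000), 1))  # ~37.0 hours on one GPU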

For horizontal scaling, estimate required replicas:

replicas = (target_throughput / single_gpu_throughput) * safety_factor

Use a safety factor of 1.3-1.5 to account for traffic spikes and variance.

Example: To achieve 100,000 tokens/sec with bge-m3:

  • Single L4 throughput: ~30,000 tokens/sec
  • Replicas needed: (100,000 / 30,000) * 1.4 ≈ 4.7, so round up to 5 replicas
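
The same calculation in code, reproducing the example above (rounding up to whole replicas):

# Replica estimate from the formula above; the throughput figure is the
# single-L4 number quoted in this guide.
import math

def replicas_needed(target_tps: float, single_gpu_tps: float, safety_factor: float = 1.4) -> int:
    return math.ceil(target_tps / single_gpu_tps * safety_factor)

print(replicas_needed(100_000, 30_000))  # bge-m3 on L4 -> 5 replicas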

The Terraform configuration supports spot instances for GPU node pools:

# From node_pools.tf
spot = each.value.spot # Enable for 60-90% cost savings

Recommended for:

  • Batch processing workloads
  • Non-latency-critical embedding jobs
  • Development and testing

Not recommended for:

  • Low-latency serving with strict SLAs
  • Single-replica deployments

For variable traffic, configure autoscaling to scale down to zero replicas when idle (plain Kubernetes HPA requires the HPAScaleToZero feature gate for a minimum of 0; KEDA supports scale-to-zero natively). Combine with:

  • Keda for event-driven scaling
  • GKE Autopilot for automatic node provisioning
  • Preemptible node pools for cost savings during scale-up

Cold start latency: Model loading adds 10-60 seconds depending on model size. Consider keeping at least one warm replica for latency-sensitive workloads.
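
One way to avoid paying that cold-start cost on the first real request is to fire a small warm-up call whenever a replica starts. The endpoint URL and payload below are placeholders, not a documented SIE API; substitute whatever your deployment actually exposes.

# Hypothetical warm-up script: send one small request at startup so the model
# is loaded before real traffic arrives. Endpoint and payload are placeholders.
import json
import time
import urllib.request

def warm_up(url: str, payload: dict, timeout_s: float = 120) -> None:
    body = json.dumps(payload).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        resp.read()
    print(f"warm-up finished in {time.monotonic() - start:.1f}s (model load included)")

# Example (placeholder URL and request schema):
# warm_up("http://localhost:8000/embed", {"texts": ["warm-up"], "model": "bge-m3"})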

  1. Start with L4 - Upgrade to A100 only if models exceed 24GB VRAM
  2. Use spot instances - Enable for batch workloads and non-critical paths
  3. Measure actual throughput - Run performance evals before capacity planning
  4. Monitor memory pressure - High eviction rates indicate undersized VRAM

Before deploying, request sufficient GPU quota in your target region.

# List GPU quotas in a region
gcloud compute regions describe us-central1 \
--format="table(quotas.filter(metric ~ GPU))"

| Quota Name | GPU Type | Notes |
| --- | --- | --- |
| NVIDIA_L4_GPUS | L4 | Most available, recommended |
| NVIDIA_A100_GPUS | A100 40GB | Limited availability |
| NVIDIA_A100_80GB_GPUS | A100 80GB | Very limited |
| NVIDIA_H100_GPUS | H100 | Newest, limited availability |
| NVIDIA_T4_GPUS | T4 | Widely available |

To request a quota increase in the Google Cloud console:

  1. Go to IAM & Admin > Quotas
  2. Filter by service “Compute Engine API” and metric containing “GPU”
  3. Select the quota and click “Edit Quotas”
  4. Provide justification and submit

Tip: Request quota in multiple regions. GPU availability varies significantly by zone.

GPU availability varies by zone. Check before provisioning:

# List zones with L4 GPUs
gcloud compute accelerator-types list --filter="name=nvidia-l4"

See also:

  • Request Lifecycle - How SIE processes requests through batching and inference
  • CLI Reference - Server configuration options including device selection