# Hardware & Capacity
Choosing the right hardware impacts cost, latency, and throughput. This guide covers GPU selection, memory planning, and capacity estimation.
## GPU Selection Guide

SIE supports NVIDIA GPUs via CUDA. Choose based on your model size and throughput requirements.
### Recommended GPUs

| GPU | VRAM | Best For | GCP Machine Type |
|---|---|---|---|
| NVIDIA L4 | 24 GB | Most embedding models, cost-effective inference | g2-standard-8 (1x), g2-standard-24 (2x) |
| NVIDIA A100 40GB | 40 GB | Large models, high throughput | a2-highgpu-1g |
| NVIDIA A100 80GB | 80 GB | Very large models (7B+), multi-model serving | a2-ultragpu-1g |
| NVIDIA H100 | 80 GB | Highest throughput, latest generation | a3-highgpu-1g |
### Budget Options

| GPU | VRAM | Best For | GCP Machine Type |
|---|---|---|---|
| NVIDIA T4 | 16 GB | Small models, development, testing | n1-standard-8 + T4 |
L4 is recommended for most production workloads. It offers the best price-performance ratio for embedding models under 4B parameters.
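As a rough way to apply the tables above in code, the sketch below maps a model's parameter count to a GPU tier. The function name and thresholds are illustrative only, based on the "under 4B parameters on L4" guidance; they are not part of SIE.

```python
def recommend_gpu(params_millions: float) -> str:
    """Map a parameter count to a GPU tier from the tables above (illustrative thresholds)."""
    if params_millions < 500:
        return "NVIDIA T4 (16 GB) or L4 (24 GB)"   # small/medium encoders
    if params_millions < 4_000:
        return "NVIDIA L4 (24 GB)"                 # large models, best price-performance
    if params_millions < 8_000:
        return "NVIDIA A100 (40 GB or 80 GB)"      # 7B-class models need more headroom
    return "NVIDIA A100 80GB or H100 (80 GB)"      # very large models or multi-model serving

print(recommend_gpu(300))    # medium encoder -> T4 or L4
print(recommend_gpu(7_000))  # 7B LLM-based embedder -> A100 class
```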
## Memory Planning

GPU memory usage depends on model size, batch size, and sequence length.
### Model Size Categories

| Category | Parameters | Approximate VRAM | Example Models |
|---|---|---|---|
| Small | < 100M | 0.5-1 GB | all-MiniLM-L6-v2 |
| Medium | 100M-500M | 1-3 GB | bge-m3, e5-large-v2, multilingual-e5-large |
| Large | 500M-2B | 3-8 GB | gte-Qwen2-1.5B-instruct, stella_en_1.5B_v5 |
| XLarge | 2B-8B | 8-20 GB | Qwen3-Embedding-4B, e5-mistral-7b-instruct, NV-Embed-v2 |
### Batch Memory Overhead

Beyond model weights, inference requires memory for:
- Activations: Proportional to batch size and sequence length
- KV cache: For transformer attention (significant for long sequences)
- CUDA context: ~500MB-1GB fixed overhead per GPU
Rule of thumb: Reserve 2-3x the model weight size for safe operation with batching.
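This rule of thumb can be turned into a quick back-of-the-envelope calculation. The sketch below assumes fp16/bf16 weights (2 bytes per parameter); the function name and default values are illustrative, not taken from SIE.

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 2.0,   # assumes fp16/bf16 weights
                     overhead_factor: float = 2.5,   # the 2-3x rule of thumb above
                     cuda_context_gb: float = 0.75) -> tuple[float, float]:
    """Return (weights_gb, recommended_total_gb) as a rough estimate, not a measurement."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb, weights_gb * overhead_factor + cuda_context_gb

# ~560M-parameter encoder: ~1.1 GB of weights, plan for ~3-4 GB with batching
print(estimate_vram_gb(0.56))
# 4B-parameter LLM-based embedder: ~8 GB of weights, plan for ~20 GB with batching
print(estimate_vram_gb(4.0))
```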
## Multi-Model Serving

SIE loads models on-demand and uses LRU eviction when memory pressure exceeds 85%:

```python
# From memory.py - default eviction threshold
pressure_threshold: float = 0.85  # Evict LRU model above 85%
```

For multi-model deployments, provision VRAM for:
- Your largest model (always loaded)
- 1-2 additional frequently-used models
- Headroom for batch processing
Example: Serving bge-m3 (~2GB) and e5-mistral-7b (~15GB) together requires at least 24GB VRAM.
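The eviction behaviour described above can be illustrated with a minimal sketch. This is not SIE's actual memory.py; the class and method names are hypothetical, and only the 85% pressure threshold comes from the source.

```python
from collections import OrderedDict

class ModelCache:
    """Minimal sketch of an LRU model cache driven by a memory-pressure threshold."""

    def __init__(self, total_vram_gb: float, pressure_threshold: float = 0.85):
        self.total_vram_gb = total_vram_gb
        self.pressure_threshold = pressure_threshold
        self._models: OrderedDict[str, float] = OrderedDict()  # model name -> size in GB

    @property
    def pressure(self) -> float:
        return sum(self._models.values()) / self.total_vram_gb

    def load(self, name: str, size_gb: float) -> None:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return
        self._models[name] = size_gb
        # Evict least-recently-used models until pressure is back under the threshold
        while self.pressure > self.pressure_threshold and len(self._models) > 1:
            evicted, _ = self._models.popitem(last=False)
            print(f"evicting {evicted} (pressure now {self.pressure:.0%})")

cache = ModelCache(total_vram_gb=24)
cache.load("e5-mistral-7b", 15)
cache.load("bge-m3", 2)        # (15 + 2) / 24 ~= 71% pressure, both stay loaded
cache.load("NV-Embed-v2", 14)  # pushes pressure past 85%, evicts the LRU model
```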
## Capacity Planning

Throughput varies by model architecture, sequence length, and hardware. Use these estimates as starting points.
### Throughput by Model Type (L4 GPU)

Based on actual measurements with 16 concurrent requests:
| Model Type | Example | Corpus Throughput | Query Throughput |
|---|---|---|---|
| Small encoder | all-MiniLM-L6-v2 | ~50,000 tokens/sec | ~5,000 tokens/sec |
| Medium encoder | bge-m3 | ~30,000 tokens/sec | ~3,000 tokens/sec |
| Large LLM-based | Qwen3-Embedding-4B | ~5,000 tokens/sec | ~700 tokens/sec |
| XLarge LLM-based | e5-mistral-7b | ~3,000 tokens/sec | ~400 tokens/sec |
Corpus vs Query: Corpus encoding uses longer sequences (documents). Query encoding uses shorter sequences (search queries).
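To turn these token rates into request rates, divide by your average sequence length. The sketch below uses the bge-m3 numbers from the table; the document and query lengths are assumptions for illustration, not measurements.

```python
# Rough conversion from token throughput to request throughput on an L4
corpus_tokens_per_sec = 30_000  # bge-m3 corpus encoding (table above)
query_tokens_per_sec = 3_000    # bge-m3 query encoding (table above)

avg_document_tokens = 512       # assumed length of a chunked document
avg_query_tokens = 16           # assumed length of a search query

print(f"~{corpus_tokens_per_sec / avg_document_tokens:.0f} documents/sec")  # ~59
print(f"~{query_tokens_per_sec / avg_query_tokens:.0f} queries/sec")        # ~188
```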
### Scaling Estimates

For horizontal scaling, estimate required replicas:

```
replicas = (target_throughput / single_gpu_throughput) * safety_factor
```

Use a safety factor of 1.3-1.5 to account for traffic spikes and variance.
Example: To achieve 100,000 tokens/sec with bge-m3:
- Single L4 throughput: ~30,000 tokens/sec
- Replicas needed: (100,000 / 30,000) * 1.4 ≈ 4.7, rounded up to 5 replicas
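The same estimate as a small helper; rounding up is deliberate, since fractional replicas cannot be deployed. The function name is illustrative.

```python
import math

def required_replicas(target_tokens_per_sec: float,
                      single_gpu_tokens_per_sec: float,
                      safety_factor: float = 1.4) -> int:
    """Replica count for horizontal scaling, rounded up to a whole GPU."""
    return math.ceil(target_tokens_per_sec / single_gpu_tokens_per_sec * safety_factor)

# bge-m3 example above: 100,000 tokens/sec target, ~30,000 tokens/sec per L4
print(required_replicas(100_000, 30_000))  # -> 5
```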
## Cost Optimization

### Spot/Preemptible Instances

The Terraform configuration supports spot instances for GPU node pools:

```hcl
# From node_pools.tf
spot = each.value.spot # Enable for 60-90% cost savings
```

Recommended for:
- Batch processing workloads
- Non-latency-critical embedding jobs
- Development and testing
Not recommended for:
- Low-latency serving with strict SLAs
- Single-replica deployments
### Scale-to-Zero

For variable traffic, configure Kubernetes HPA with minimum replicas of 0. Combine with:
- KEDA for event-driven scaling
- GKE Autopilot for automatic node provisioning
- Preemptible node pools for cost savings during scale-up
Cold start latency: Model loading adds 10-60 seconds depending on model size. Consider keeping at least one warm replica for latency-sensitive workloads.
## Right-Sizing Checklist

- Start with L4 - Upgrade to A100 only if models exceed 24 GB VRAM
- Use spot instances - Enable for batch workloads and non-critical paths
- Measure actual throughput - Run performance evals before capacity planning
- Monitor memory pressure - High eviction rates indicate undersized VRAM
## GCP GPU Quotas

Before deploying, request sufficient GPU quota in your target region.
### Checking Quotas

```sh
# List GPU quotas in a region
gcloud compute regions describe us-central1 \
  --format="table(quotas.filter(metric ~ GPU))"
```

### Common Quota Types

| Quota Name | GPU Type | Notes |
|---|---|---|
| NVIDIA_L4_GPUS | L4 | Most available, recommended |
| NVIDIA_A100_GPUS | A100 40GB | Limited availability |
| NVIDIA_A100_80GB_GPUS | A100 80GB | Very limited |
| NVIDIA_H100_GPUS | H100 | Newest, limited availability |
| NVIDIA_T4_GPUS | T4 | Widely available |
### Requesting Quota Increases

1. Go to IAM & Admin > Quotas
2. Filter by service “Compute Engine API” and metric containing “GPU”
3. Select the quota and click “Edit Quotas”
4. Provide justification and submit
Tip: Request quota in multiple regions. GPU availability varies significantly by zone.
### Zone Availability

GPU availability varies by zone. Check before provisioning:

```sh
# List zones with L4 GPUs
gcloud compute accelerator-types list --filter="name=nvidia-l4"
```

## What’s Next
- Request Lifecycle - How SIE processes requests through batching and inference
- CLI Reference - Server configuration options including device selection