# Hardware & Capacity
Choosing the right hardware impacts cost, latency, and throughput. This guide covers GPU selection, memory planning, and capacity estimation.
## GPU Selection Guide

SIE supports NVIDIA GPUs via CUDA. Choose based on your model size and throughput requirements.
### Recommended GPUs

| GPU | VRAM | Best For | GCP Machine Type |
|---|---|---|---|
| NVIDIA L4 | 24 GB | Most embedding models, cost-effective inference | g2-standard-8 (1x), g2-standard-24 (2x) |
| NVIDIA A100 40GB | 40 GB | Large models, high throughput | a2-highgpu-1g |
| NVIDIA A100 80GB | 80 GB | Very large models (7B+), multi-model serving | a2-ultragpu-1g |
| NVIDIA H100 | 80 GB | Highest throughput, latest generation | a3-highgpu-1g |
### Budget Options

| GPU | VRAM | Best For | GCP Machine Type |
|---|---|---|---|
| NVIDIA T4 | 16 GB | Small models, development, testing | n1-standard-8 + T4 |
L4 is recommended for most production workloads. It offers the best price-performance ratio for embedding models under 4B parameters.
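As a rough way to apply the tables above in code, the sketch below maps a model's parameter count to a GPU tier. The function name and thresholds are illustrative only, based on the "under 4B parameters on L4" guidance; they are not part of SIE.

```python
def recommend_gpu(params_millions: float) -> str:
    """Map a parameter count to a GPU tier from the tables above (illustrative thresholds)."""
    if params_millions < 500:
        return "NVIDIA T4 (16 GB) or L4 (24 GB)"   # small/medium encoders
    if params_millions < 4_000:
        return "NVIDIA L4 (24 GB)"                 # large models, best price-performance
    if params_millions < 8_000:
        return "NVIDIA A100 (40 GB or 80 GB)"      # 7B-class models need more headroom
    return "NVIDIA A100 80GB or H100 (80 GB)"      # very large models or multi-model serving

print(recommend_gpu(300))    # medium encoder -> T4 or L4
print(recommend_gpu(7_000))  # 7B LLM-based embedder -> A100 class
```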
## Memory Planning

GPU memory usage depends on model size, batch size, and sequence length.
### Model Size Categories

| Category | Parameters | Approximate VRAM | Example Models |
|---|---|---|---|
| Small | < 100M | 0.5-1 GB | all-MiniLM-L6-v2 |
| Medium | 100M-500M | 1-3 GB | bge-m3, e5-large-v2, multilingual-e5-large |
| Large | 500M-2B | 3-8 GB | gte-Qwen2-1.5B-instruct, stella_en_1.5B_v5 |
| XLarge | 2B-8B | 8-20 GB | Qwen3-Embedding-4B, e5-mistral-7b-instruct, NV-Embed-v2 |
### Batch Memory Overhead

Beyond model weights, inference requires memory for:
- Activations: Proportional to batch size and sequence length
- KV cache: For transformer attention (significant for long sequences)
- CUDA context: ~500MB-1GB fixed overhead per GPU
Rule of thumb: Reserve 2-3x the model weight size for safe operation with batching.
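This rule of thumb can be turned into a quick back-of-the-envelope calculation. The sketch below assumes fp16/bf16 weights (2 bytes per parameter); the function name and default values are illustrative, not taken from SIE.

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 2.0,   # assumes fp16/bf16 weights
                     overhead_factor: float = 2.5,   # the 2-3x rule of thumb above
                     cuda_context_gb: float = 0.75) -> tuple[float, float]:
    """Return (weights_gb, recommended_total_gb) as a rough estimate, not a measurement."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb, weights_gb * overhead_factor + cuda_context_gb

# ~560M-parameter encoder: ~1.1 GB of weights, plan for ~3-4 GB with batching
print(estimate_vram_gb(0.56))
# 4B-parameter LLM-based embedder: ~8 GB of weights, plan for ~20 GB with batching
print(estimate_vram_gb(4.0))
```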
## Multi-Model Serving

SIE loads models on-demand and uses LRU eviction when memory pressure exceeds 85%:

```python
# From memory.py - default eviction threshold
pressure_threshold: float = 0.85  # Evict LRU model above 85%
```

For multi-model deployments, provision VRAM for:
- Your largest model (always loaded)
- 1-2 additional frequently-used models
- Headroom for batch processing
Example: Serving bge-m3 (~2GB) and e5-mistral-7b (~15GB) together requires at least 24GB VRAM.
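The eviction behaviour described above can be illustrated with a minimal sketch. This is not SIE's actual memory.py; the class and method names are hypothetical, and only the 85% pressure threshold comes from the source.

```python
from collections import OrderedDict

class ModelCache:
    """Minimal sketch of an LRU model cache driven by a memory-pressure threshold."""

    def __init__(self, total_vram_gb: float, pressure_threshold: float = 0.85):
        self.total_vram_gb = total_vram_gb
        self.pressure_threshold = pressure_threshold
        self._models: OrderedDict[str, float] = OrderedDict()  # model name -> size in GB

    @property
    def pressure(self) -> float:
        return sum(self._models.values()) / self.total_vram_gb

    def load(self, name: str, size_gb: float) -> None:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return
        self._models[name] = size_gb
        # Evict least-recently-used models until pressure is back under the threshold
        while self.pressure > self.pressure_threshold and len(self._models) > 1:
            evicted, _ = self._models.popitem(last=False)
            print(f"evicting {evicted} (pressure now {self.pressure:.0%})")

cache = ModelCache(total_vram_gb=24)
cache.load("e5-mistral-7b", 15)
cache.load("bge-m3", 2)        # (15 + 2) / 24 ~= 71% pressure, both stay loaded
cache.load("NV-Embed-v2", 14)  # pushes pressure past 85%, evicts the LRU model
```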
## Capacity Planning

Throughput varies by model architecture, sequence length, and hardware. Use these estimates as starting points.
### Throughput by Model Type (L4 GPU)

Based on actual measurements with 16 concurrent requests:
| Model Type | Example | Corpus Throughput | Query Throughput |
|---|---|---|---|
| Small encoder | all-MiniLM-L6-v2 | ~50,000 tokens/sec | ~5,000 tokens/sec |
| Medium encoder | bge-m3 | ~30,000 tokens/sec | ~3,000 tokens/sec |
| Large LLM-based | Qwen3-Embedding-4B | ~5,000 tokens/sec | ~700 tokens/sec |
| XLarge LLM-based | e5-mistral-7b | ~3,000 tokens/sec | ~400 tokens/sec |
Corpus vs Query: Corpus encoding uses longer sequences (documents). Query encoding uses shorter sequences (search queries).
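To turn these token rates into request rates, divide by your average sequence length. The sketch below uses the bge-m3 numbers from the table; the document and query lengths are assumptions for illustration, not measurements.

```python
# Rough conversion from token throughput to request throughput on an L4
corpus_tokens_per_sec = 30_000  # bge-m3 corpus encoding (table above)
query_tokens_per_sec = 3_000    # bge-m3 query encoding (table above)

avg_document_tokens = 512       # assumed length of a chunked document
avg_query_tokens = 16           # assumed length of a search query

print(f"~{corpus_tokens_per_sec / avg_document_tokens:.0f} documents/sec")  # ~59
print(f"~{query_tokens_per_sec / avg_query_tokens:.0f} queries/sec")        # ~188
```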
### Scaling Estimates

For horizontal scaling, estimate required replicas:

```
replicas = (target_throughput / single_gpu_throughput) * safety_factor
```

Use a safety factor of 1.3-1.5 to account for traffic spikes and variance.
Example: To achieve 100,000 tokens/sec with bge-m3:
- Single L4 throughput: ~30,000 tokens/sec
- Replicas needed: (100,000 / 30,000) * 1.4 ≈ 4.7, rounded up to 5 replicas
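The same estimate as a small helper; rounding up is deliberate, since fractional replicas cannot be deployed. The function name is illustrative.

```python
import math

def required_replicas(target_tokens_per_sec: float,
                      single_gpu_tokens_per_sec: float,
                      safety_factor: float = 1.4) -> int:
    """Replica count for horizontal scaling, rounded up to a whole GPU."""
    return math.ceil(target_tokens_per_sec / single_gpu_tokens_per_sec * safety_factor)

# bge-m3 example above: 100,000 tokens/sec target, ~30,000 tokens/sec per L4
print(required_replicas(100_000, 30_000))  # -> 5
```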
## Cost Optimization

### Spot/Preemptible Instances

The Terraform configuration supports spot instances for GPU node pools:

```hcl
# From node_pools.tf
spot = each.value.spot # Enable for 60-90% cost savings
```

Recommended for:
- Batch processing workloads
- Non-latency-critical embedding jobs
- Development and testing
Not recommended for:
- Low-latency serving with strict SLAs
- Single-replica deployments
### Scale-to-Zero

For variable traffic, configure Kubernetes HPA with minimum replicas of 0. Combine with:
- KEDA for event-driven scaling
- GKE Autopilot for automatic node provisioning
- Preemptible node pools for cost savings during scale-up
Cold start latency: Model loading adds 10-60 seconds depending on model size. Consider keeping at least one warm replica for latency-sensitive workloads.
## Right-Sizing Checklist

- Start with L4 - Upgrade to A100 only if models exceed 24 GB VRAM
- Use spot instances - Enable for batch workloads and non-critical paths
- Measure actual throughput - Run performance evals before capacity planning
- Monitor memory pressure - High eviction rates indicate undersized VRAM
## GCP GPU Quotas

Before deploying, request sufficient GPU quota in your target region.
### Checking Quotas

```sh
# List GPU quotas in a region
gcloud compute regions describe us-central1 \
  --format="table(quotas.filter(metric ~ GPU))"
```

### Common Quota Types

| Quota Name | GPU Type | Notes |
|---|---|---|
| NVIDIA_L4_GPUS | L4 | Most available, recommended |
| NVIDIA_A100_GPUS | A100 40GB | Limited availability |
| NVIDIA_A100_80GB_GPUS | A100 80GB | Very limited |
| NVIDIA_H100_GPUS | H100 | Newest, limited availability |
| NVIDIA_T4_GPUS | T4 | Widely available |
### Requesting Quota Increases

1. Go to IAM & Admin > Quotas
2. Filter by service “Compute Engine API” and metric containing “GPU”
3. Select the quota and click “Edit Quotas”
4. Provide justification and submit
Tip: Request quota in multiple regions. GPU availability varies significantly by zone.
### Zone Availability

GPU availability varies by zone. Check before provisioning:

```sh
# List zones with L4 GPUs
gcloud compute accelerator-types list --filter="name=nvidia-l4"
```

## What’s Next
- Request Lifecycle - How SIE processes requests through batching and inference
- CLI Reference - Server configuration options including device selection