Monitoring & Observability

SIE provides several monitoring interfaces: health endpoints for container orchestration, Prometheus metrics for alerting, a real-time status stream over WebSocket, and an interactive TUI.

SIE exposes Kubernetes-compatible health probes for liveness and readiness checks.

curl http://localhost:8080/healthz
# Returns: ok

The /healthz endpoint returns 200 OK if the server process is alive. Use this for Kubernetes liveness probes. A failed check triggers container restart.

curl http://localhost:8080/readyz
# Returns: ok

The /readyz endpoint returns 200 OK if the server is ready to accept traffic. Use this for Kubernetes readiness probes. A failed check removes the pod from service endpoints.
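
Outside Kubernetes (for example, in a deploy script) you can poll the same probes yourself. A minimal sketch using only the Python standard library, assuming the server listens on localhost:8080:

import urllib.error
import urllib.request

def probe(path: str, base: str = "http://localhost:8080") -> bool:
    """Return True if the endpoint answers 200 OK."""
    try:
        with urllib.request.urlopen(f"{base}{path}", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    print("alive:", probe("/healthz"))
    print("ready:", probe("/readyz"))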

Kubernetes configuration:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

SIE exposes Prometheus-format metrics at /metrics. All metrics use the sie_ prefix.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sie_requests_total | Counter | model, endpoint, status | Total requests processed |
| sie_request_duration_seconds | Histogram | model, endpoint, phase | Request duration breakdown |
| sie_batch_size | Histogram | model | Items per batch |
| sie_tokens_processed_total | Counter | model | Total tokens processed |
| sie_queue_depth | Gauge | model | Current pending items in queue |
| sie_model_loaded | Gauge | model, device | Model load state (1=loaded, 0=not) |
| sie_model_memory_bytes | Gauge | model, device | GPU memory usage per model |

The sie_request_duration_seconds histogram tracks latency by phase:

| Phase | Description |
| --- | --- |
| total | End-to-end request latency |
| queue | Time spent waiting in the request queue |
| tokenize | Tokenization and preprocessing time |
| inference | GPU inference time |

Duration buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0

Batch size buckets: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
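
To spot-check metrics without a Prometheus server, you can fetch /metrics directly and filter for the sie_ prefix. A minimal sketch using only the standard library, assuming the default localhost:8080 address:

import urllib.request

def sie_metrics(base: str = "http://localhost:8080") -> list[str]:
    """Fetch the Prometheus text exposition and keep sie_ samples."""
    with urllib.request.urlopen(f"{base}/metrics", timeout=5) as resp:
        text = resp.read().decode("utf-8")
    # Drop HELP/TYPE comment lines; keep only sie_-prefixed samples.
    return [line for line in text.splitlines() if line.startswith("sie_")]

for line in sie_metrics():
    print(line)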

prometheus.yml
scrape_configs:
  - job_name: 'sie'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s

The sie-top command provides a real-time terminal interface for monitoring SIE servers.

Install the TUI extra:

pip install 'sie-admin[top]'
# Monitor local server (auto-detects mode)
sie-top
# Monitor specific server
sie-top localhost:8080
# Force worker mode (connect to individual worker)
sie-top --worker worker-0.sie.svc:8080
# Force cluster mode (connect to router)
sie-top --cluster router.example.com:8080

Mode is auto-detected by probing the router /health endpoint. Use --worker or --cluster to force a specific mode.

The TUI displays:

  • Server info: Version, uptime, user, PID
  • GPU table: Device name, memory usage, compute utilization, trend sparkline
  • Model table: Name, state, device, memory, queue depth, QPS sparkline
  • Detail panel: Selected GPU or model with 60-second history charts

Keyboard shortcuts:

| Key | Action |
| --- | --- |
| j / Down | Move selection down |
| k / Up | Move selection up |
| ? | Show help |
| q | Quit |

SIE streams real-time status over WebSocket at /ws/status. Updates are pushed every 200ms.

import asyncio
import json

import websockets

async def monitor():
    # Connect to the status stream; the server pushes an update every 200ms.
    async with websockets.connect("ws://localhost:8080/ws/status") as ws:
        async for message in ws:
            status = json.loads(message)
            print(f"Loaded models: {status['loaded_models']}")
            print(f"GPU type: {status['gpu']}")

asyncio.run(monitor())

Each message is a JSON status snapshot:
{
  "timestamp": 1703001234.567,
  "gpu": "l4",
  "loaded_models": ["bge-m3", "e5-base-v2"],
  "server": {
    "version": "0.1.0",
    "uptime_seconds": 3600,
    "user": "sie",
    "working_dir": "/app",
    "pid": 1
  },
  "gpus": [
    {
      "device": "cuda:0",
      "name": "NVIDIA L4",
      "gpu_type": "l4",
      "utilization_pct": 45,
      "memory_used_bytes": 8589934592,
      "memory_total_bytes": 23622320128,
      "memory_threshold_pct": 85
    }
  ],
  "models": [
    {
      "name": "bge-m3",
      "state": "loaded",
      "device": "cuda:0",
      "memory_bytes": 2147483648,
      "queue_depth": 0,
      "queue_pending_items": 0,
      "config": {
        "hf_id": "BAAI/bge-m3",
        "adapter": "bge_m3",
        "inputs": ["text"],
        "outputs": ["dense", "sparse"]
      }
    }
  ],
  "counters": {},
  "histograms": {}
}
Model states:

| State | Description |
| --- | --- |
| available | Config loaded, weights not in memory |
| loading | Weights currently loading to GPU |
| loaded | Ready for inference |
| unloading | Weights being evicted from GPU |
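
Building on the payload above, a small consumer can watch for unhealthy conditions. A sketch that flags deep queues or non-loaded models; the threshold of 100 is an arbitrary example, not an SIE default:

import asyncio
import json

import websockets

QUEUE_ALERT_THRESHOLD = 100  # arbitrary example value, not an SIE default

async def watch_queues():
    async with websockets.connect("ws://localhost:8080/ws/status") as ws:
        async for message in ws:
            status = json.loads(message)
            for model in status["models"]:
                if model["queue_depth"] > QUEUE_ALERT_THRESHOLD:
                    print(f"ALERT: {model['name']} queue depth "
                          f"{model['queue_depth']}")
                if model["state"] != "loaded":
                    print(f"NOTE: {model['name']} is {model['state']}")

asyncio.run(watch_queues())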

SIE includes pre-built Grafana dashboards in the Helm chart at deploy/helm/sie-cluster/files/dashboards/. These are automatically provisioned when deploying with Grafana’s sidecar.

Example queries for common panels:

Request rate by model:

sum(rate(sie_requests_total{status="success"}[5m])) by (model)

p99 end-to-end latency:

histogram_quantile(0.99,
  sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
)

Model memory per device:

sum(sie_model_memory_bytes) by (device)

Queue depth per model:

sum(sie_queue_depth) by (model)

Median batch size:

histogram_quantile(0.5,
  sum(rate(sie_batch_size_bucket[5m])) by (le, model)
)
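
The same expressions can be run outside Grafana through the Prometheus HTTP API. A minimal sketch, assuming Prometheus scrapes SIE and listens on localhost:9090 (both assumptions, not SIE defaults):

import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"  # assumed Prometheus address

def instant_query(expr: str) -> list[dict]:
    """Run a PromQL instant query via the Prometheus HTTP API."""
    url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    return body["data"]["result"]

# Per-model request rate, same expression as the panel above.
for series in instant_query(
        'sum(rate(sie_requests_total{status="success"}[5m])) by (model)'):
    print(series["metric"].get("model"), series["value"][1])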

SIE supports both human-readable and structured JSON logging.

Enable verbose logging with --verbose or -v:

sie-server serve --verbose

Enable JSON format for Loki and log aggregation systems:

sie-server serve --json-logs

Or via environment variable:

export SIE_LOG_JSON=true
sie-server serve
Example log record:

{
  "timestamp": "2025-12-18T10:30:00.123Z",
  "level": "INFO",
  "logger": "sie_server.api.encode",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2,
  "batch_size": 16,
  "gpu_type": "l4"
}

JSON logs include optional fields when available:

| Field | Description |
| --- | --- |
| model | Model name for the request |
| request_id | Unique request identifier |
| trace_id | OpenTelemetry trace ID |
| latency_ms | Request latency in milliseconds |
| batch_size | Number of items in the batch |
| gpu_type | Detected GPU type |
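
Because each record is one JSON object per line, offline analysis is straightforward. A minimal sketch, assuming logs have been captured to a file such as sie.log (a hypothetical path), that aggregates latency by model:

import json
from collections import defaultdict

latencies = defaultdict(list)

# One JSON object per line; skip records missing the optional fields.
with open("sie.log") as f:
    for line in f:
        record = json.loads(line)
        if "model" in record and "latency_ms" in record:
            latencies[record["model"]].append(record["latency_ms"])

for model, values in sorted(latencies.items()):
    print(f"{model}: n={len(values)} mean={sum(values) / len(values):.1f}ms")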