# Quantization
Quantization reduces vector storage and bandwidth. A 1024-dim float32 vector (4KB) becomes 1KB with int8 or 128 bytes with binary. Quality loss is typically 1-3% for int8, more for binary.
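The size arithmetic is simply dimensions × bytes per element, with binary packing 8 dimensions per byte:

```python
dims = 1024

float32_bytes = dims * 4   # 4096 bytes = 4 KB
int8_bytes = dims * 1      # 1024 bytes = 1 KB
binary_bytes = dims // 8   # 128 bytes (1 bit per dimension)

print(float32_bytes, int8_bytes, binary_bytes)  # 4096 1024 128
```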
## Quick Example

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Int8 quantization
result = client.encode(
    "BAAI/bge-m3",
    Item(text="text to encode"),
    output_dtype="int8",
)

# Result is an int8 array, 4x smaller than float32
print(f"Dtype: {result['dense'].dtype}")  # int8
print(f"Range: [{result['dense'].min()}, {result['dense'].max()}]")  # [-127, 127]
```

```typescript
import { SIEClient } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

// Int8 quantization
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "text to encode" },
  { outputDtype: "int8" }
);

// Result is still a Float32Array but contains quantized values;
// the server handles quantization, the client receives the appropriate format
console.log(`Dimensions: ${result.dense?.length}`);

await client.close();
```

## Quantization Types
| Type | Size Reduction | Quality Loss | Best For |
|---|---|---|---|
| `float32` | 1x (baseline) | 0% | Quality-critical workloads |
| `float16` | 2x | ~0% | Balanced size/quality |
| `int8` | 4x | 1-2% | General storage |
| `uint8` | 4x | 1-2% | Qdrant compatibility |
| `binary` | 32x | 5-10% | Massive scale |
## Int8 Quantization

Symmetric per-vector quantization mapping values to [-127, 127]:

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="int8",
)

# Each vector is independently scaled:
# value_int8 = round(value_float32 / max_abs * 127)
```

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "int8" }
);

// Each vector is independently scaled:
// value_int8 = round(value_float32 / max_abs * 127)
```

Use with vector databases that support int8:
- Qdrant (scalar quantization)
- Milvus (int8 index)
- Pinecone (using product quantization)
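As a rough client-side reference (a sketch, not the SDK's implementation), the symmetric scheme above can be written with NumPy:

```python
import numpy as np

def int8_quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector quantization: round(value / max_abs * 127)."""
    scale = float(np.abs(x).max())
    return np.round(x / scale * 127).astype(np.int8), scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate inverse: int8 values back to float32."""
    return q.astype(np.float32) / 127 * scale

v = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, scale = int8_quantize(v)
print(q)  # [  64 -127   32]
```

Keeping the per-vector `scale` alongside the int8 payload lets you dequantize for exact scoring later.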
## Uint8 Quantization

Linear mapping to the [0, 255] range:

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="uint8",
)

# Maps [min, max] → [0, 255] per vector
```

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "uint8" }
);

// Maps [min, max] → [0, 255] per vector
```

Qdrant’s scalar quantization uses the uint8 format.
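A NumPy sketch of this min-max mapping (illustrative; the server performs the actual quantization):

```python
import numpy as np

def uint8_quantize(x: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Per-vector linear map from [min, max] to [0, 255]."""
    lo, hi = float(x.min()), float(x.max())
    scaled = (x - lo) / (hi - lo) * 255
    return np.round(scaled).astype(np.uint8), lo, hi

def uint8_dequantize(q: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Approximate inverse using the stored [lo, hi] range."""
    return q.astype(np.float32) / 255 * (hi - lo) + lo

v = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, lo, hi = uint8_quantize(v)
print(q)  # [  0 128 191 255]
```

Unlike the symmetric int8 scheme, this asymmetric mapping needs both `lo` and `hi` per vector to dequantize.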
## Binary Quantization

Bit-packed vectors, 32x smaller:

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="binary",
)

# 1024-dim float32 (4KB) → 128 bytes
# Each dimension becomes 1 bit: positive → 1, negative → 0
print(f"Shape: {result['dense'].shape}")  # (128,) uint8
```

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "binary" }
);

// 1024-dim float32 (4KB) → 128 bytes
// Each dimension becomes 1 bit: positive → 1, negative → 0
console.log(`Shape: ${result.dense?.length}`); // 128
```

Binary vectors are compared with Hamming distance instead of cosine similarity:

```python
import numpy as np

# Hamming distance = XOR + popcount (unpack the XOR'd bytes and count set bits)
hamming = np.sum(np.unpackbits(np.bitwise_xor(a_binary, b_binary)))
```

Binary is useful for:
- First-stage candidate filtering
- Memory-constrained environments
- Re-ranking with full-precision vectors
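The bit-packing and Hamming scoring above can be reproduced client-side with NumPy (a sketch, not the SDK's internals; the server does the packing when `output_dtype="binary"` is set):

```python
import numpy as np

def binarize(x: np.ndarray) -> np.ndarray:
    """Pack signs into bits: positive → 1, otherwise 0 (8 dims per byte)."""
    return np.packbits(x > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance = popcount of the bytewise XOR."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

a = binarize(np.array([0.5, -1.2, 0.3, 0.9, -0.1, 0.2, -0.4, 0.8], dtype=np.float32))
b = binarize(np.array([0.5, 1.2, 0.3, -0.9, -0.1, 0.2, -0.4, 0.8], dtype=np.float32))
print(hamming(a, b))  # 2 (two sign flips between the vectors)
```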
## Float16 Precision

Half precision with minimal quality loss:

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="float16",
)

print(f"Dtype: {result['dense'].dtype}")  # float16
```

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "float16" }
);

// Note: JavaScript doesn't have a native float16 type, so values may be returned as float32
console.log(`Dimensions: ${result.dense?.length}`);
```

Float16 is effectively lossless for vector search in practice. Use it when your database supports it.
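A quick NumPy round-trip (illustrative, independent of the SDK) shows why: the cosine similarity between a vector and its float16 copy stays essentially at 1.0:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1024).astype(np.float32)
v /= np.linalg.norm(v)

# Round-trip through half precision
v16 = v.astype(np.float16).astype(np.float32)

cos = float(v @ v16 / (np.linalg.norm(v) * np.linalg.norm(v16)))
print(cos)  # ~1.0
```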
## Quality Impact

Approximate NDCG retention on standard benchmarks:
| Quantization | NDCG@10 Retention |
|---|---|
| `float32` | 100% (baseline) |
| `float16` | ~99.9% |
| `int8` | ~98-99% |
| `uint8` | ~98-99% |
| `binary` | ~90-95% |
Actual impact varies by model and task. Run evals on your data.
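A minimal eval sketch using synthetic data and top-k overlap between float32 and int8 rankings (your real corpus, queries, and metric will differ; all names here are illustrative):

```python
import numpy as np

def int8_quantize(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-vector int8 quantization: round(value / max_abs * 127)."""
    scale = np.abs(x).max(axis=-1, keepdims=True)
    return np.round(x / scale * 127).astype(np.int8), scale

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 256)).astype(np.float32)
query = rng.standard_normal(256).astype(np.float32)

# Float32 baseline top-10
base_top = set(np.argsort(docs @ query)[::-1][:10])

# Int8 top-10 (dequantize for scoring)
q_docs, scales = int8_quantize(docs)
scores = (q_docs.astype(np.float32) * scales / 127) @ query
int8_top = set(np.argsort(scores)[::-1][:10])

overlap = len(base_top & int8_top) / 10
print(f"Top-10 overlap: {overlap:.0%}")
```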
## Two-Stage Pattern

Use binary vectors for fast candidate retrieval, then full precision for reranking:

```python
# Stage 1: Binary search over millions of vectors
binary_result = client.encode(model, query, output_dtype="binary")
candidates = binary_index.search(binary_result["dense"], top_k=1000)

# Stage 2: Full-precision rerank of the top candidates
full_result = client.encode(model, query)  # float32
reranked = rerank_with_full_precision(full_result["dense"], candidates)
```

```typescript
// Stage 1: Binary search over millions of vectors
const binaryResult = await client.encode(model, query, { outputDtype: "binary" });
const candidates = await binaryIndex.search(binaryResult.dense!, 1000);

// Stage 2: Full-precision rerank of the top candidates
const fullResult = await client.encode(model, query); // float32
const reranked = rerankWithFullPrecision(fullResult.dense!, candidates);
```

## Sparse Vector Quantization
Sparse vectors are not quantized; `output_dtype` applies only to dense and multivector outputs:

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    output_types=["dense", "sparse"],
    output_dtype="int8",
)

# Dense is int8
print(result["dense"].dtype)  # int8

# Sparse stays float32 (indices + values don't benefit from quantization)
print(result["sparse"]["values"].dtype)  # float32
```

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello" },
  { outputTypes: ["dense", "sparse"], outputDtype: "int8" }
);

// Dense is quantized
console.log(`Dense length: ${result.dense?.length}`);

// Sparse stays float32 (indices + values don't benefit from quantization)
console.log(`Sparse values: Float32Array`);
```

## HTTP API
The server defaults to msgpack for efficient binary transport. For JSON responses:

```shell
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "quantized text"}],
    "params": {"output_dtype": "int8"}
  }'
```

The response includes the int8 values:

```json
{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {"dims": 1024, "dtype": "int8", "values": [23, -89, 12, ...]}
    }
  ]
}
```

Note: JSON represents int8 values as plain integers; with msgpack, values are packed as int8.
## What’s Next

- Model Catalog - all supported models