Quantization

Quantization reduces vector storage and bandwidth. A 1024-dim float32 vector (4KB) becomes 1KB with int8 or 128 bytes with binary. Quality loss is typically 1-3% for int8, more for binary.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Int8 quantization
result = client.encode(
    "BAAI/bge-m3",
    Item(text="text to encode"),
    output_dtype="int8",
)

# Result is an int8 array, 4x smaller than float32
print(f"Dtype: {result['dense'].dtype}")  # int8
print(f"Range: [{result['dense'].min()}, {result['dense'].max()}]")  # [-127, 127]

| Type    | Size Reduction | Quality Loss | Best For              |
|---------|----------------|--------------|-----------------------|
| float32 | 1x (baseline)  | 0%           | Quality-critical      |
| float16 | 2x             | ~0%          | Balance               |
| int8    | 4x             | 1-2%         | General storage       |
| uint8   | 4x             | 1-2%         | Qdrant compatibility  |
| binary  | 32x            | 5-10%        | Massive scale         |

Int8 uses symmetric per-vector quantization, mapping each value to [-127, 127]:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="int8",
)
# Each vector is independently scaled:
#   value_int8 = round(value_float32 / max_abs * 127)
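
For reference, a minimal NumPy sketch of the same symmetric per-vector scheme (a client-side reimplementation for illustration, not the server's code):

import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector int8: scale by the max absolute value (sketch)."""
    max_abs = float(np.abs(vec).max()) or 1.0
    q = np.round(vec / max_abs * 127).astype(np.int8)
    return q, max_abs  # keep the scale if you need to dequantize later

def dequantize_int8(q: np.ndarray, max_abs: float) -> np.ndarray:
    return q.astype(np.float32) / 127 * max_abs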

Use with vector databases that support int8:

  • Qdrant (scalar quantization)
  • Milvus (int8 index)
  • Pinecone (using product quantization)
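
As a hedged example, enabling scalar quantization when creating a Qdrant collection with the qdrant-client package (assumes a local Qdrant instance; Qdrant quantizes the float vectors it receives server-side, and the collection name here is arbitrary):

from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")
qdrant.create_collection(
    collection_name="docs",  # arbitrary example name
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep quantized vectors in RAM for faster search
        )
    ),
)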

Uint8 uses a linear per-vector mapping to the [0, 255] range:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="uint8",
)
# Maps [min, max] → [0, 255] per vector

Qdrant’s scalar quantization uses uint8 format.
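
A minimal NumPy sketch of that per-vector min-max mapping (illustrative only, not the server's implementation):

import numpy as np

def quantize_uint8(vec: np.ndarray) -> np.ndarray:
    """Linearly map each vector's [min, max] onto [0, 255] (sketch)."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) or 1.0
    return np.round((vec - lo) / scale * 255).astype(np.uint8)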

Binary quantization bit-packs each vector to 32x smaller:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="binary",
)
# 1024-dim float32 (4KB) → 128 bytes
# Each dimension becomes 1 bit: positive → 1, negative → 0
print(f"Shape: {result['dense'].shape}")  # (128,) uint8

Binary uses Hamming distance instead of cosine:

import numpy as np

# Hamming distance = XOR + popcount (count the differing bits)
hamming = np.unpackbits(np.bitwise_xor(a_binary, b_binary)).sum()

Binary is useful for:

  • First-stage candidate filtering
  • Memory-constrained environments
  • Two-stage retrieval: coarse binary search, then re-ranking with full-precision vectors (see the example below)

Float16 is half precision with minimal quality loss:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="float16",
)
print(f"Dtype: {result['dense'].dtype}")  # float16

Float16 is effectively lossless for vector search in practice. Use it when your database supports it.

Approximate NDCG retention on standard benchmarks:

| Quantization | NDCG@10 Retention |
|--------------|-------------------|
| float32      | 100% (baseline)   |
| float16      | ~99.9%            |
| int8         | ~98-99%           |
| uint8        | ~98-99%           |
| binary       | ~90-95%           |

Actual impact varies by model and task. Run evals on your data.

Use binary for fast candidate retrieval, full precision for reranking:

# Stage 1: binary search over millions of vectors
# (binary_index stands in for your own ANN index over binary vectors)
binary_result = client.encode(model, query, output_dtype="binary")
candidates = binary_index.search(binary_result["dense"], top_k=1000)

# Stage 2: full-precision rerank of the top candidates
full_result = client.encode(model, query)  # float32 by default
reranked = rerank_with_full_precision(full_result["dense"], candidates)
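
rerank_with_full_precision above is a placeholder. A hedged sketch of that step using NumPy cosine similarity (assumes each candidate object carries its stored float32 vector):

import numpy as np

def rerank_with_full_precision(query_vec: np.ndarray, candidates: list) -> list:
    """Sort candidates by cosine similarity against the float32 query vector (sketch)."""
    q_norm = np.linalg.norm(query_vec)
    scored = []
    for cand in candidates:
        vec = cand.vector  # assumption: the index returns candidates with their float32 vectors
        score = float(np.dot(query_vec, vec) / (q_norm * np.linalg.norm(vec)))
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]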

Sparse vectors are NOT quantized; only dense and multivector outputs are affected:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    output_types=["dense", "sparse"],
    output_dtype="int8",
)
# Dense is int8
print(result["dense"].dtype)  # int8
# Sparse stays float32 (indices + values don't benefit from quantization)
print(result["sparse"]["values"].dtype)  # float32

The server defaults to msgpack for efficient binary transport. For JSON responses:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "quantized text"}],
    "params": {"output_dtype": "int8"}
  }'

Response includes int8 values:

{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {"dims": 1024, "dtype": "int8", "values": [23, -89, 12, ...]}
    }
  ]
}

Note: JSON represents int8 as integers. For msgpack, values are packed as int8.
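
To turn the JSON payload back into a NumPy array on the client, a minimal sketch (assumes the response shape shown above and the requests package):

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    headers={"Accept": "application/json"},
    json={"items": [{"text": "quantized text"}], "params": {"output_dtype": "int8"}},
).json()

vec = np.array(resp["items"][0]["dense"]["values"], dtype=np.int8)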