Quantization

Quantization reduces vector storage and bandwidth. A 1024-dim float32 vector (4KB) becomes 1KB with int8 or 128 bytes with binary. Quality loss is typically 1-3% for int8, more for binary.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Int8 quantization
result = client.encode(
    "BAAI/bge-m3",
    Item(text="text to encode"),
    output_dtype="int8",
)

# Result is an int8 array, 4x smaller than float32
print(f"Dtype: {result['dense'].dtype}")  # int8
print(f"Range: [{result['dense'].min()}, {result['dense'].max()}]")  # [-127, 127]

| Type    | Size Reduction | Quality Loss | Best For              |
|---------|----------------|--------------|-----------------------|
| float32 | 1x (baseline)  | 0%           | Quality-critical      |
| float16 | 2x             | ~0%          | Balance               |
| int8    | 4x             | 1-2%         | General storage       |
| uint8   | 4x             | 1-2%         | Qdrant compatibility  |
| binary  | 32x            | 5-10%        | Massive scale         |

Int8 uses symmetric per-vector quantization, mapping each value to [-127, 127]:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="int8",
)
# Each vector is independently scaled:
#   value_int8 = round(value_float32 / max_abs * 127)
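
For reference, a minimal NumPy sketch of the same symmetric per-vector scheme (a client-side reimplementation for illustration, not the server's code):

import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector int8: scale by the max absolute value (sketch)."""
    max_abs = float(np.abs(vec).max()) or 1.0
    q = np.round(vec / max_abs * 127).astype(np.int8)
    return q, max_abs  # keep the scale if you need to dequantize later

def dequantize_int8(q: np.ndarray, max_abs: float) -> np.ndarray:
    return q.astype(np.float32) / 127 * max_abs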

Use with vector databases that support int8:

  • Qdrant (scalar quantization)
  • Milvus (int8 index)
  • Pinecone (using product quantization)
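
As a hedged example, enabling scalar quantization when creating a Qdrant collection with the qdrant-client package (assumes a local Qdrant instance; Qdrant quantizes the float vectors it receives server-side, and the collection name here is arbitrary):

from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")
qdrant.create_collection(
    collection_name="docs",  # arbitrary example name
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep quantized vectors in RAM for faster search
        )
    ),
)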

Uint8 uses a linear per-vector mapping to the [0, 255] range:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="uint8",
)
# Maps [min, max] → [0, 255] per vector

Qdrant’s scalar quantization uses uint8 format.
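
A minimal NumPy sketch of that per-vector min-max mapping (illustrative only, not the server's implementation):

import numpy as np

def quantize_uint8(vec: np.ndarray) -> np.ndarray:
    """Linearly map each vector's [min, max] onto [0, 255] (sketch)."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) or 1.0
    return np.round((vec - lo) / scale * 255).astype(np.uint8)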

Binary quantization bit-packs each vector to 32x smaller:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="binary",
)
# 1024-dim float32 (4KB) → 128 bytes
# Each dimension becomes 1 bit: positive → 1, negative → 0
print(f"Shape: {result['dense'].shape}")  # (128,) uint8

Binary uses Hamming distance instead of cosine:

import numpy as np

# Hamming distance = XOR + popcount (count the differing bits)
hamming = np.unpackbits(np.bitwise_xor(a_binary, b_binary)).sum()

Binary is useful for:

  • First-stage candidate filtering
  • Memory-constrained environments
  • Two-stage retrieval: coarse binary search, then re-ranking with full-precision vectors (see the example below)

Float16 is half precision with minimal quality loss:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="float16",
)
print(f"Dtype: {result['dense'].dtype}")  # float16

Float16 is effectively lossless for vector search in practice. Use it when your database supports it.

Approximate NDCG retention on standard benchmarks:

| Quantization | NDCG@10 Retention |
|--------------|-------------------|
| float32      | 100% (baseline)   |
| float16      | ~99.9%            |
| int8         | ~98-99%           |
| uint8        | ~98-99%           |
| binary       | ~90-95%           |

Actual impact varies by model and task. Run evals on your data.

Use binary for fast candidate retrieval, full precision for reranking:

# Stage 1: binary search over millions of vectors
# (binary_index stands in for your own ANN index over binary vectors)
binary_result = client.encode(model, query, output_dtype="binary")
candidates = binary_index.search(binary_result["dense"], top_k=1000)

# Stage 2: full-precision rerank of the top candidates
full_result = client.encode(model, query)  # float32 by default
reranked = rerank_with_full_precision(full_result["dense"], candidates)
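
rerank_with_full_precision above is a placeholder. A hedged sketch of that step using NumPy cosine similarity (assumes each candidate object carries its stored float32 vector):

import numpy as np

def rerank_with_full_precision(query_vec: np.ndarray, candidates: list) -> list:
    """Sort candidates by cosine similarity against the float32 query vector (sketch)."""
    q_norm = np.linalg.norm(query_vec)
    scored = []
    for cand in candidates:
        vec = cand.vector  # assumption: the index returns candidates with their float32 vectors
        score = float(np.dot(query_vec, vec) / (q_norm * np.linalg.norm(vec)))
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]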

Sparse vectors are NOT quantized; only dense and multivector outputs are affected:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    output_types=["dense", "sparse"],
    output_dtype="int8",
)
# Dense is int8
print(result["dense"].dtype)  # int8
# Sparse stays float32 (indices + values don't benefit from quantization)
print(result["sparse"]["values"].dtype)  # float32

The server defaults to msgpack for efficient binary transport. For JSON responses:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "quantized text"}],
    "params": {"output_dtype": "int8"}
  }'

Response includes int8 values:

{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {"dims": 1024, "dtype": "int8", "values": [23, -89, 12, ...]}
    }
  ]
}

Note: JSON represents int8 as integers. For msgpack, values are packed as int8.
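
To turn the JSON payload back into a NumPy array on the client, a minimal sketch (assumes the response shape shown above and the requests package):

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    headers={"Accept": "application/json"},
    json={"items": [{"text": "quantized text"}], "params": {"output_dtype": "int8"}},
).json()

vec = np.array(resp["items"][0]["dense"]["values"], dtype=np.int8)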