Benchmarks
Measured scale, real embeddings, methodology scope, and limitations.
Benchmark claims are separated by evidence class: public receipt preview, gated proof pack, scale-envelope capacity run, real-embedding calibration, Atlas classification, and roadmap certification.
100% · 20.4 ms p50
100% R@10 scale envelope
H100 fp16, 1B entries · 38.51 ms p50 · 96–100% R@10 envelope
B200 fp16, 2B entries · 60.89 ms p50 · 98% R@10 envelope
What this proves
Benchmark rows are not interchangeable.
The strongest credibility comes from keeping each number attached to its evidence boundary.
| Evidence | Setup | Artifact status | Claim boundary |
|---|---|---|---|
| 100M H100 receipt | H100 80GB, D=384, rank=32 | Receipt preview public; full bundle gated | Exact-tier proof path, not universal corpus claim |
| H100/B200 scale envelope | Generated low-rank data, latent tier | Public table; methodology packet gated | Capacity measurement, not blanket real-embedding claim |
| Real embeddings | 13 production models, public datasets | Public summary; row artifacts gated | Two-tier fp32/SQ8 behavior by model/corpus |
| Atlas | 71 classified rows (text, vision, physics, scientific) | Public summaries; per-row artifacts gated | fp32 calibration; 5 physics rows compressed-certified |
| Soak / durability | 24h soak and snapshot tests | Summary public; detailed logs gated | Operational signal, not production SLA |
Scale envelope
Single-GPU ceilings by precision tier.
These rows are measured scale-envelope runs on generated low-rank data. They are valid capacity measurements, not a claim that every real embedding corpus will reproduce the same recall profile.
| GPU / tier | max entries | query p50 | R@10 | query VRAM | compression |
|---|---|---|---|---|---|
| H100 fp64 | 200M | 40.40 ms | 100% | 50.4 GB | 11.6× vs fp64 |
| H100 fp32 | 500M | 76.53 ms | 100% | 73.0 GB | 23.3× vs fp64 |
| H100 fp16 | 1B | 38.51 ms | 96–100% | 72.9 GB | 46.5× vs fp64 |
| B200 fp16 | 2B | 60.89 ms | 98% | 142.0 GB | 46.5× vs fp64 |
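The table can also be read as an average query-time VRAM cost per entry. The sketch below derives that figure from the max-entries and query-VRAM columns only. It assumes "GB" means 10^9 bytes (the source does not say GB or GiB), and the names `ROWS`, `bytes_per_entry`, and `PER_ENTRY` are illustrative, not part of any HX-SDP tooling.

```python
# Rows copied from the scale-envelope table.
# Assumption: 1 GB = 10**9 bytes; the source does not specify GB vs GiB.
ROWS = [
    # (GPU / tier, max entries, query VRAM in GB)
    ("H100 fp64", 200_000_000, 50.4),
    ("H100 fp32", 500_000_000, 73.0),
    ("H100 fp16", 1_000_000_000, 72.9),
    ("B200 fp16", 2_000_000_000, 142.0),
]


def bytes_per_entry(max_entries: int, vram_gb: float) -> float:
    """Average query-time VRAM per stored entry, in bytes."""
    return vram_gb * 1e9 / max_entries


PER_ENTRY = {label: bytes_per_entry(n, gb) for label, n, gb in ROWS}

for label, value in PER_ENTRY.items():
    print(f"{label}: {value:.1f} bytes/entry")
```

These are averages that fold in any fixed overhead, so they are a planning aid rather than a marginal per-entry cost, and they inherit the same generated low-rank data boundary as the rows they come from.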
Real embeddings
Two-tier recall on measured model/corpus pairs.
The hardened rows show the public behavior of the production query architecture: a Tier-1 latent scan followed by a Tier-2 SQ8 rerank. The Atlas tracks rows that recover, flatten, or become rerank-harmful.
| Model | rank | ρ | RR R@10 | RR p50 | classification |
|---|---|---|---|---|---|
| Gemini-001 3072d | 667 | 0.22 | 0.998 | 1.42 ms | A_ELITE / hardened |
| Cohere v3 1024d | 418 | 0.41 | 0.994 | 2.10 ms | A_ELITE / hardened |
| OpenCLIP 1024d | 401 | 0.39 | 0.995 | 1.31 ms | A_ELITE / hardened |
| E5-Mistral 4096d | 1,146 | 0.28 | 0.934 | 1.63 ms | D_SENSITIVE / hardened |
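For readers unfamiliar with the scan-then-rerank pattern and the R@10 metric, here is a minimal toy sketch. It is not the HX-SDP implementation: the Tier-1 "latent" representation is stood in for by a simple dimension prefix, the SQ8 scheme is a plain per-vector min/scale quantizer, and every name and parameter (`two_tier_search`, `latent_dim`, `n_candidates`, and so on) is hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq8_encode(vec):
    """Scalar-quantize one vector to 8-bit codes with a per-vector offset and scale."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    return [round((x - lo) / scale) for x in vec], lo, scale


def sq8_decode(codes, lo, scale):
    return [lo + c * scale for c in codes]


def exact_search(query, vectors, k):
    """Brute-force ground truth by full-precision inner product."""
    scores = {i: dot(query, v) for i, v in enumerate(vectors)}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def two_tier_search(query, vectors, codes, latent_dim, n_candidates, k):
    # Tier 1: cheap scan over a reduced representation.
    # Assumption: a dimension prefix stands in for the real latent codes.
    coarse = {i: dot(query[:latent_dim], v[:latent_dim]) for i, v in enumerate(vectors)}
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:n_candidates]
    # Tier 2: rerank only the shortlisted candidates with SQ8-decoded vectors.
    fine = {i: dot(query, sq8_decode(*codes[i])) for i in candidates}
    return sorted(fine, key=fine.get, reverse=True)[:k]


def recall_at_k(found, truth):
    """Fraction of the true top-k ids that the search returned."""
    return len(set(found) & set(truth)) / len(truth)


rng = random.Random(0)
DIM, N, K = 32, 500, 10
vectors = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
query = [rng.gauss(0, 1) for _ in range(DIM)]
codes = [sq8_encode(v) for v in vectors]  # precomputed once, as an index would do

truth = exact_search(query, vectors, K)
found = two_tier_search(query, vectors, codes, latent_dim=8, n_candidates=100, k=K)
RECALL = recall_at_k(found, truth)
print(f"R@{K} = {RECALL:.2f}")
```

The toy data is i.i.d. Gaussian, which is not low-rank, so the recall it prints says nothing about the measured rows above. It only illustrates why recall depends on how well the Tier-1 shortlist covers the true neighbors before the rerank runs.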
Claim boundaries
Every number carries its scope.
HX-SDP benchmark claims are separated by evidence class: production hot path, signed receipts, scale envelope, real-embedding calibration, and roadmap certification. Exact-recall statements stay tied to the tiers and artifacts that support them.
not described as generic QTT-native serving
FIPS 204 Category 3 receipt chain
fp32 calibration, not TQ4 certification
concurrency, WAL, fp16, cold start, host RAM
Limitations
Tradeoffs are part of the benchmark.
| Limitation | Operating interpretation |
|---|---|
| fp16 is not exact at every scale | Use fp32/fp64 for exact-recall requirements; fp16 is the scale tier. |
| Concurrency has a single-GPU ceiling | At S-100M, production planning should account for roughly 10–20 error-free clients per GPU before sharding. |
| WAL is not the current recovery mechanism | Crash recovery is snapshot reload; WAL is listed as a roadmap item in the canonical limitation list. |
| Atlas rows are calibration, not certification | A_ELITE in the current Atlas means native fp32 rerank utility, not compressed deployment certification. |
| Scale-envelope data is generated low-rank | The 1B/2B rows are measured capacity rows, not direct real-corpus generalization. |
| Host RAM matters at build | Large-scale builds may require substantial host (CPU) RAM at build time, before the index reaches its compact serving footprint. |
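The concurrency and precision rows translate into simple planning arithmetic. The sketch below is a hedged illustration: it assumes client load spreads evenly and that capacity scales linearly with GPU count, neither of which the source states. The function names `gpus_needed` and `pick_tier` are hypothetical.

```python
import math

# Planning figure from the limitations table: roughly 10-20 error-free
# clients per GPU at S-100M before sharding.
CLIENTS_PER_GPU_LOW = 10
CLIENTS_PER_GPU_HIGH = 20


def gpus_needed(concurrent_clients: int) -> tuple:
    """Return (optimistic, conservative) GPU counts for a client load.

    Assumption: load spreads evenly and capacity scales linearly per GPU.
    """
    if concurrent_clients <= 0:
        return (0, 0)
    optimistic = math.ceil(concurrent_clients / CLIENTS_PER_GPU_HIGH)
    conservative = math.ceil(concurrent_clients / CLIENTS_PER_GPU_LOW)
    return (optimistic, conservative)


def pick_tier(needs_exact_recall: bool) -> str:
    """fp32/fp64 for exact-recall requirements; fp16 is the scale tier."""
    return "fp32 or fp64" if needs_exact_recall else "fp16"


print(gpus_needed(50))   # (3, 5)
print(pick_tier(True))   # fp32 or fp64
```

Planning against the conservative figure leaves headroom; the 10–20 range is an operational signal at S-100M, not a production SLA.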
Diligence