TEI on Spark: reranker for RAG

tei-spark is a HuggingFace text-embeddings-inference (TEI) deployment on the Spark node, serving BAAI/bge-reranker-v2-m3 as an external cross-encoder for Open WebUI's RAG pipeline. It replaces Open WebUI's in-process reranker, so the chat pod no longer carries a 6 Gi memory footprint for a reranker that runs without a GPU, and reranking runs on Spark next to the bge-m3 embedder.


| Property | Value |
| --- | --- |
| Upstream image | `ghcr.io/huggingface/text-embeddings-inference:121-sha-<sha>` (digest-pinned) |
| Model | `BAAI/bge-reranker-v2-m3` |
| Manifests | `kubernetes/apps/ai/tei-spark/` |
| Cluster service | `tei-spark.ai.svc.cluster.local:3000` (HTTP `/rerank`, `/metrics`) |
| Node | `spark.${SECRET_DOMAIN}` (arm64, GB10, sm_121) |
| GPU | `nvidia.com/gpu: 1`, `ai-gpu-critical` priority |
| Model cache | 5 Gi `ceph-block` PVC (`tei-spark-cache-pvc`) at `/data` |

Why this shape

Family-matched reranker. bge-reranker-v2-m3 is BAAI's v2 reranker, designed to pair with the bge-m3 embedder (Phase A 2026-05-20). On BAAI's published MIRACL/BEIR/CMTEB benchmarks, which rerank bge-m3 top-100 retrievals, v2-m3 beats bge-reranker-large in every multilingual setting. Don't substitute a v1-family reranker for "TEI tier-1 supported" reasons (see Quirks).

Spark, not P40. The P40 is Pascal (sm_61). bge-reranker-v2-m3 wants FP16, and the P40 runs FP16 arithmetic at a small fraction of its FP32 rate, so FP16 inference is slow there. Spark's Blackwell GPU (sm_121) is the right home, and co-locating with the bge-m3 embedder eliminates a network hop.

External, not in-process. Open WebUI's in-process reranker forced the chat pod to reserve 6 Gi of memory-limit headroom even though the model wasn't always loaded. Pulling it out cleans up the architecture for the day Open WebUI's KB external-Qdrant gap closes and rerank actually becomes hot.

Architecture

                  ┌────────────────────────────┐
   Open WebUI ───▶│ tei-spark (Service:3000)   │
                  │ - /rerank                  │
                  │ - /health                  │
                  │ - /metrics                 │
                  └──────────┬─────────────────┘
                             │ runs on
                             ▼
                  ┌────────────────────────────┐
                  │ Spark node (arm64, GB10)   │
                  │ TEI HTTP variant           │
                  │ Candle CUDA backend        │
                  │ FlashBert on sm_121        │
                  │ dtype=float16              │
                  └──────────┬─────────────────┘
                             │ HF Hub download (first start only,
                             │ cached on ceph-block PVC)
                             ▼
                  ┌────────────────────────────┐
                  │ BAAI/bge-reranker-v2-m3    │
                  │ XLM-RoBERTa-large (568M)   │
                  └────────────────────────────┘

Quirks

1. The 121- image is built from `main`, not a tagged release

Upstream PRs #840 (multi-arch CUDA, sm_121) and #852 (restrict 12.1 to linux/arm64) merged 2026-03-31 but no TEI release tag has been cut since v1.9.3 (pre-#840). The HelmRelease pins to 121-sha-<sha>@sha256:... under a # workaround: annotation. Remove the pin when TEI cuts a release containing #840.
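A pin like this is easy to lose in a later edit, so a small check can help. The following is a minimal sketch, assuming the reference follows the usual `name:tag@sha256:<64 hex>` form; the function name `is_digest_pinned` and the all-zero digest are hypothetical stand-ins, and the `<sha>` tag suffix is left elided exactly as in the summary table.

```python
import re

# A pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference carries a sha256 digest suffix."""
    return _DIGEST_SUFFIX.search(image_ref) is not None


# Placeholder digest for illustration only; it is not the real image digest.
PLACEHOLDER_DIGEST = "0" * 64
REPO = "ghcr.io/huggingface/text-embeddings-inference"

pinned = f"{REPO}:121-sha-<sha>@sha256:{PLACEHOLDER_DIGEST}"
floating = f"{REPO}:121-sha-<sha>"

print(is_digest_pinned(pinned), is_digest_pinned(floating))
```

The check only looks at the digest suffix, so it says nothing about whether the tag and the digest refer to the same build.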

2. `bge-reranker-v2-m3` is not on TEI's documented supported-rerankers list

TEI's docs/source/en/supported_models.md lists bge-reranker-base, bge-reranker-large, and the GTE rerankers, but not v2-m3. It still loads cleanly because TEI's router/src/lib.rs::get_backend_model_type dispatches XLM-RoBERTa sequence-classification models generically via arch.ends_with("Classification"), and the Candle CUDA backend implements the forward pass. Issue #713 reports it being served successfully on TEI 1.8.
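The suffix check described above can be illustrated in a few lines. This is a Python sketch of the rule only, not TEI's Rust source; the function name `looks_like_classifier` is hypothetical, and the `architectures` values are assumptions about what each model's `config.json` declares.

```python
def looks_like_classifier(config: dict) -> bool:
    """Return True if any declared architecture name ends with 'Classification'."""
    return any(
        arch.endswith("Classification")
        for arch in config.get("architectures", [])
    )


# Assumed config.json excerpts (illustrative, not copied from the Hub).
reranker_config = {"architectures": ["XLMRobertaForSequenceClassification"]}
embedder_config = {"architectures": ["XLMRobertaModel"]}

print(looks_like_classifier(reranker_config))
print(looks_like_classifier(embedder_config))
```

Because the rule keys on the architecture name rather than the model ID, any model declaring a `...Classification` architecture would take the same path, which is why an unlisted reranker can still load.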

If TEI ever tightens the architecture allowlist, v2-m3 could break, so keep an eye on release notes. Fallback options, in priority order:

1. Same model, different server: michaelfeil/infinity lists v2-m3 among its tested models, and Blackwell support landed 2025-07.
2. Different model, same server: bge-reranker-large (TEI tier-1 listed), but it's the older v1 family and worse on multilingual.

3. `/metrics` is on the HTTP port, not `--prometheus-port`

TEI's HTTP variant serves /metrics on the main API port (3000 in this deploy). The --prometheus-port / PROMETHEUS_PORT setting does not open a separate listener; it's a no-op for the HTTP variant (it likely applies only to the gRPC build). The HelmRelease originally set PROMETHEUS_PORT: 9000 and the ServiceMonitor scraped :9000, so the scrape silently failed with "connection refused" until it was corrected in PR #11898. The result is a single port and a single ServiceMonitor endpoint (port: http), with the CNP allowing both Open WebUI and Prometheus on 3000.
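To confirm by hand that metrics really are served on the API port, something like the following could be run from a pod inside the cluster. It is a minimal sketch using only the standard library: `fetch_metrics` and `parse_metrics` are hypothetical helper names, the URL is the Service address from the summary table with `/metrics` appended, and the metric names in the sample are illustrative rather than confirmed against this deployment.

```python
import urllib.request

# Assumed in-cluster URL: the Service address plus the /metrics path.
METRICS_URL = "http://tei-spark.ai.svc.cluster.local:3000/metrics"


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Fetch the raw exposition text (only reachable from inside the cluster)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample (name plus any label set) to its value.

    Assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical sample; the metric names are illustrative, not confirmed.
SAMPLE = """\
# TYPE te_request_count counter
te_request_count 120
# TYPE te_request_success counter
te_request_success 118
"""

print(parse_metrics(SAMPLE))
```

A connection refused on any port other than 3000 is the expected behaviour for the HTTP variant, per the paragraph above.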

4. Cold start is ~65 seconds, dominated by HF Hub model download

First-boot timing on Spark:

Image pull: ~30 s (the TEI 121- image is large; CUDA libs + Rust binary).
HF Hub artifact download (configs, tokenizer): ~1 s.
Model weights (safetensors, ~570 MB) → PVC: ~48 s.
CUDA backend load + warmup: ~14 s.
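HTTP server up + Ready: total ~65 s from container start (the phases after the image pull sum to ~63 s; the pull adds ~30 s on top when the image is not already on the node).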

Subsequent pod restarts skip the HF Hub steps (PVC cache is warm), leaving mainly the ~14 s backend load and warmup, so they complete in ~15 s. The startupProbe budget is 5 min (60 × 5 s), a comfortable margin that can be tightened to ~2 min if/when we want faster pod-failure detection.
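The timing budget can be checked with simple arithmetic. This is a sketch using the phase durations listed above; it assumes the phases run sequentially, and it treats a warm restart as the backend-load phase alone, which is a lower bound since real restarts also include scheduling and probe overhead.

```python
# Observed first-boot phases on Spark, in seconds (from the list above).
IMAGE_PULL = 30
HUB_ARTIFACTS = 1
WEIGHTS_DOWNLOAD = 48
BACKEND_LOAD = 14

# Container start to Ready on a cold cache, image already on the node.
cold_start = HUB_ARTIFACTS + WEIGHTS_DOWNLOAD + BACKEND_LOAD
# Same, but including the image pull on a node that has never run the image.
cold_start_with_pull = IMAGE_PULL + cold_start
# Warm restart lower bound: the PVC already holds the model.
warm_restart_floor = BACKEND_LOAD
# startupProbe budget: 60 attempts at a 5 s period.
probe_budget = 60 * 5

print(cold_start, cold_start_with_pull, warm_restart_floor, probe_budget)
print(f"cold-start headroom: {probe_budget / cold_start:.1f}x")
```

Even the slowest path here stays well inside the probe budget, which is what makes the suggested tightening to ~2 min plausible.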

5. DCGM utilization counters are broken on GB10

Most DCGM utilization counters are broken on GB10: GPU_UTIL, MEM_COPY_UTIL, GR_ENGINE_ACTIVE, FB_USED are stuck at 0 or empty on Spark. Only POWER_USAGE and SM_CLOCK report. Don't write GPU-utilization-based alerts; use request-success metrics instead.

Configuration knobs

In kubernetes/apps/ai/tei-spark/app/helmrelease.yaml:

| Env | Value | Notes |
| --- | --- | --- |
| `MODEL_ID` | `BAAI/bge-reranker-v2-m3` | Family-matched to bge-m3 embedder |
| `PORT` | `3000` | HTTP + metrics share this port |
| `HUGGINGFACE_HUB_CACHE` | `/data` | ceph-block PVC mount |
| `DTYPE` | `float16` | Halves VRAM use, sm_121 native |
| `AUTO_TRUNCATE` | `true` | Truncate over-long inputs rather than returning HTTP 413 |
| `JSON_OUTPUT` | `true` | Structured logs for Loki ingestion |

Verification recipes

Service is live and reranking

kubectl -n ai run --rm -i tei-smoke --image=curlimages/curl:8.10.1 --restart=Never \
  -- curl -sS -X POST http://tei-spark.ai.svc.cluster.local:3000/rerank \
  -H 'content-type: application/json' \
  -d '{"query":"What is RAG?",
       "texts":["RAG combines retrieval and generation.",
                "Pizza is delicious.",
                "Embeddings are vector representations."]}'

Expected: a JSON array of {index, score} objects, with index 0 (the RAG sentence) scored highest by an order of magnitude.
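The same smoke test can be scripted so the "order of magnitude" expectation is checked rather than eyeballed. This is a minimal sketch, assuming the response is a list of objects with `index` and `score` fields; `rerank` and `top_hit_is_clear` are hypothetical helper names, and the 10x ratio is simply one way to encode the expectation stated above.

```python
import json
import urllib.request

RERANK_URL = "http://tei-spark.ai.svc.cluster.local:3000/rerank"


def rerank(query, texts, url=RERANK_URL, timeout=10.0):
    """POST a rerank request (only reachable from inside the cluster)."""
    body = json.dumps({"query": query, "texts": texts}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def top_hit_is_clear(results, expected_index=0, min_ratio=10.0):
    """True if expected_index ranks first and beats the runner-up by min_ratio."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    if ranked[0]["index"] != expected_index:
        return False
    if len(ranked) == 1:
        return True
    runner_up = ranked[1]["score"]
    # A non-positive runner-up score counts as a clear win.
    return runner_up <= 0 or ranked[0]["score"] / runner_up >= min_ratio


# Made-up scores for illustration; real values will differ.
sample = [
    {"index": 0, "score": 0.98},
    {"index": 2, "score": 0.02},
    {"index": 1, "score": 0.0001},
]
print(top_hit_is_clear(sample))
```

From a pod in the `ai` namespace, `top_hit_is_clear(rerank("What is RAG?", [...]))` would run the live version of the check.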

Prometheus scrape is healthy

kubectl -n observability port-forward sts/prometheus-kube-prometheus-stack 19090:9090 &
curl -sS 'http://localhost:19090/api/v1/targets' \
  | jq '.data.activeTargets[] | select(.scrapeUrl | contains("tei-spark"))'

Expected: "health": "up", scrapeUrl on port 3000.
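The same check can be done on the parsed JSON instead of through jq. This is a sketch under a few assumptions: the payload follows the shape of Prometheus's `/api/v1/targets` response, the helper names are hypothetical, and the sample data is invented. Targets discovered through a ServiceMonitor are often addressed by pod IP, so the sketch matches on `scrapePool` as well as `scrapeUrl`.

```python
from urllib.parse import urlparse


def tei_targets(payload: dict, needle: str = "tei-spark") -> list[dict]:
    """Return active targets whose scrape URL or scrape pool mentions needle."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        t for t in targets
        if needle in t.get("scrapeUrl", "") or needle in t.get("scrapePool", "")
    ]


def scrape_is_healthy(target: dict, expected_port: int = 3000) -> bool:
    """True if the target reports health 'up' and is scraped on expected_port."""
    port = urlparse(target.get("scrapeUrl", "")).port
    return target.get("health") == "up" and port == expected_port


# Invented sample in the shape of a /api/v1/targets response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapePool": "serviceMonitor/ai/tei-spark/0",
                "scrapeUrl": "http://10.0.0.12:3000/metrics",
                "health": "up",
            },
            {
                "scrapePool": "serviceMonitor/observability/other/0",
                "scrapeUrl": "http://10.0.0.13:9100/metrics",
                "health": "up",
            },
        ]
    },
}

matches = tei_targets(SAMPLE)
print(len(matches), all(scrape_is_healthy(t) for t in matches))
```

A target on port 9000 with `health` other than `up` would reproduce the pre-#11898 symptom described in Quirk 3.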

Alerts are loaded

SparkTeiRerankPodDown (critical, 5 min) and SparkTeiRerankErrorRate (warning, 10 min, >5%) ship in the same kustomization. Verify via the Prometheus /api/v1/rules endpoint or the Alerts UI.
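For intuition about the error-rate threshold, the ratio can be computed from two counter snapshots. This is a sketch of the idea only: the actual PrometheusRule expression is not reproduced here, and the `count` and `failure` keys are hypothetical stand-ins for whatever request counters the rule uses.

```python
THRESHOLD = 0.05  # the ">5%" from the alert description


def error_rate(prev: dict, curr: dict) -> float:
    """Failure fraction between two snapshots of monotonically increasing counters."""
    total = curr["count"] - prev["count"]
    failed = curr["failure"] - prev["failure"]
    if total <= 0:
        # No traffic in the window (or a counter reset): nothing to evaluate.
        return 0.0
    return failed / total


# Invented snapshots taken at the start and end of a window.
start = {"count": 100, "failure": 1}
end = {"count": 200, "failure": 9}

rate = error_rate(start, end)
print(rate, rate > THRESHOLD)
```

With no traffic the rate is defined as zero here, so a ratio-style rule has nothing to fire on while the service has no consumers.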

Cutover history

| PR | Description |
| --- | --- |
| #11886 | PR-1: scaffold HelmRelease (suspended) |
| #11893 | PR-2: unsuspend; first pod up, smoke test against v2-m3 passes |
| #11898 | PR-2.1: fix `/metrics` port wiring (was scraping wrong port) |
| #11906 | PR-4: PrometheusRule (pod-down + error-rate alerts) |
| pending | PR-3: Open WebUI cutover to `RAG_RERANKING_ENGINE: external`; window-gated to Tue 2026-05-26 02:00 ET |

Before PR-3 lands, TEI is dark capacity — no consumer routes to it, alerts are inert. Soak metrics on the Spark side (POWER_USAGE baseline, pod memory) can still be observed during this window.

Common failure modes (anticipated; populate as encountered)

HF Hub 429 during first start — model weights download fails. Fix: delete pod, retry. Hub rate limits are per-IP and brief. Long-term mitigation: pre-stage weights into the PVC via a one-shot Job.
CUDA OOM — bge-reranker-v2-m3 at FP16 needs ~1.5 Gi VRAM with batched requests; should never OOM on GB10's 128 Gi unified memory unless something else on the GPU is leaking. Check nvidia-smi on the Spark node.
Tokenizer panic on input — TEI logs the offending input; not seen yet in this deploy. AUTO_TRUNCATE: true should prevent this for length issues.
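The VRAM figure in the CUDA OOM entry can be sanity-checked from the parameter count. This is a back-of-the-envelope sketch: it assumes every parameter is stored as float16, and it treats whatever remains of the quoted ~1.5 Gi as room for activations and batching, which is an interpretation rather than a measurement.

```python
PARAMS = 568_000_000       # parameter count quoted in the architecture diagram
BYTES_PER_PARAM_FP16 = 2   # float16 stores each parameter in 2 bytes
GIB = 2**30

weights_gib = PARAMS * BYTES_PER_PARAM_FP16 / GIB
quoted_footprint_gib = 1.5  # figure from the CUDA OOM entry above
headroom_gib = quoted_footprint_gib - weights_gib

print(f"weights: {weights_gib:.2f} GiB, headroom: {headroom_gib:.2f} GiB")
```

Weights alone account for roughly two thirds of the quoted footprint, so the ~1.5 Gi figure is consistent with FP16 weights plus a modest working set.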