CodeRankEmbed-f16

An f16 (half-precision) cast of nomic-ai/CodeRankEmbed — the 137M NomicBert bi-encoder for code retrieval — in safetensors, for GPU inference (e.g. candle on Apple-Silicon Metal) at roughly half the memory of the f32 base.

This repo contains weights only; the architecture is identical. Every tensor is the base model's tensor cast f32 → f16, with tensor names and shapes unchanged. Use it exactly like the base model (same config.json, tokenizer.json, CLS pooling, and the required query instruction prefix).
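As a rough illustration of what a pure dtype cast does, the sketch below rounds each weight to the nearest half-precision value while leaving tensor names and lengths alone. It is not the conversion script used for this repo (that was done with candle); the checkpoint dict, its tensor names, and the helper `cast_f32_to_f16` are hypothetical stand-ins, and plain Python lists take the place of real tensors.

```python
import struct

def cast_f32_to_f16(values):
    """Round each float to the nearest IEEE 754 half-precision (binary16) value."""
    # struct's "e" format is binary16; packing then unpacking performs the rounding.
    # Values beyond the f16 range would raise OverflowError here.
    return [struct.unpack("<e", struct.pack("<e", v))[0] for v in values]

# Hypothetical stand-in for a checkpoint: tensor name -> flat list of weights.
checkpoint_f32 = {
    "layer.weight": [0.1, -0.25, 0.5, 1.0],
    "layer.bias": [0.0, 0.333],
}

# Cast every tensor; keys (names) and lengths (shapes) are carried over as-is.
checkpoint_f16 = {name: cast_f32_to_f16(t) for name, t in checkpoint_f32.items()}

# Names and shapes are preserved; only the stored precision changes.
assert checkpoint_f16.keys() == checkpoint_f32.keys()
assert all(len(checkpoint_f16[n]) == len(checkpoint_f32[n]) for n in checkpoint_f32)

# Largest absolute rounding error introduced by the cast in this toy example.
max_err = max(
    abs(a - b)
    for n in checkpoint_f32
    for a, b in zip(checkpoint_f32[n], checkpoint_f16[n])
)
print(f"max abs rounding error: {max_err:.2e}")
```

Values that are exactly representable in f16 (such as -0.25 or 0.5) survive unchanged; others pick up a small rounding error.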

Why

The base repo ships f32 safetensors (~547 MB). On the Metal GPU, f16 weights halve the weight working set and the memory traffic per matmul, with no measured change in retrieval quality (see Validation below), so f16 is the form used by embedding-search on Apple Silicon.
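The file-size figure follows from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate that assumes exactly 137,000,000 parameters (the card only says "137M") and ignores the safetensors header, so it lands close to, not exactly on, the ~547 MB quoted above.

```python
# Approximate parameter count; "137M" is taken from this card, the exact
# count is an assumption made for the sake of the estimate.
PARAMS = 137_000_000

def weight_bytes(n_params, bytes_per_param):
    """Raw storage needed for the weights alone (no file header, no activations)."""
    return n_params * bytes_per_param

f32_mb = weight_bytes(PARAMS, 4) / 1e6  # f32 stores 4 bytes per parameter
f16_mb = weight_bytes(PARAMS, 2) / 1e6  # f16 stores 2 bytes per parameter

print(f"f32: ~{f32_mb:.0f} MB, f16: ~{f16_mb:.0f} MB")
# prints "f32: ~548 MB, f16: ~274 MB"
```

This covers weights only. Process-level figures such as peak RSS will be higher, presumably because they also include activations, tokenizer state, and runtime overhead.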

Validation (f16 vs f32, CodeSearchNet Python, N=300)

Same code/corpus, dtype the only difference:

dtype	peak RSS	MRR@10	Recall@1
f32 (base)	1116 MB	0.9573	0.9367
f16 (this)	570 MB	0.9573	0.9367

cosine(f16, f32) per-document: mean 0.999998, min 0.999996
top-1 retrieval agreement f16 vs f32: 1.0000
MRR@10 / Recall@1 deltas: 0.0000

For retrieval, f16 is effectively a no-op: rankings and metrics are identical on this set, at about half the RAM. (The absolute MRR is high because the eval uses a small 300-doc distractor pool; it is an f16-vs-f32 parity check, not a full-CodeSearchNet reproduction of the base model's published score.)
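For readers unfamiliar with the metrics, here is a minimal sketch of how MRR@10 and Recall@1 can be computed from cosine-ranked results. It is not the evaluation harness behind the table above; the function names and the tiny 2-dimensional toy vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_of_gold(query, docs, gold_index):
    """1-based rank of the gold document when docs are sorted by cosine similarity."""
    scores = [cosine(query, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order.index(gold_index) + 1

def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting a query as 0 if its gold doc is below rank k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def recall_at_1(ranks):
    """Fraction of queries whose gold document is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

# Toy embeddings (hypothetical): three documents, three queries,
# where query i should retrieve document gold[i].
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.2]]
gold = [0, 1, 2]

ranks = [rank_of_gold(q, docs, g) for q, g in zip(queries, gold)]
print(ranks, mrr_at_k(ranks), recall_at_1(ranks))
```

A parity check in the spirit of the one above would run this once with f32 embeddings and once with f16 embeddings of the same corpus, then compare the two sets of numbers.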

Usage

The query must use the task instruction prefix (same as the base model); code/documents get no prefix:

Represent this query for searching relevant code: <your query>

CLS-pool the last hidden state and L2-normalize the result; rank by cosine similarity.
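The post-processing steps can be sketched in plain Python as follows. This is only an outline of the pooling and ranking logic, assuming the model's last hidden state is available as a list of per-token vectors with the CLS token at position 0; the helper names and the toy 2-dimensional hidden states are hypothetical, and a real pipeline would get them from the model runtime.

```python
import math

# Instruction prefix required for queries (documents get no prefix).
QUERY_PREFIX = "Represent this query for searching relevant code: "

def format_query(text):
    """Prepend the task instruction to a raw query string."""
    return QUERY_PREFIX + text

def cls_pool(last_hidden_state):
    """Take the vector of the first (CLS) token from a list of token vectors."""
    return last_hidden_state[0]

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def embed(last_hidden_state):
    """CLS-pool, then L2-normalize."""
    return l2_normalize(cls_pool(last_hidden_state))

def rank(query_emb, doc_embs):
    """Document indices sorted best-first; for unit vectors, dot product == cosine."""
    scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)

# Toy hidden states (hypothetical): each is [CLS vector, other token vector].
query_text = format_query("parse a JSON file")
query_emb = embed([[3.0, 4.0], [9.0, 9.0]])
doc_embs = [
    embed([[0.0, 2.0], [1.0, 1.0]]),
    embed([[5.0, 0.0], [1.0, 1.0]]),
]

print(query_text)
print(rank(query_emb, doc_embs))
```

Because both sides are L2-normalized, ranking by dot product and ranking by cosine similarity give the same order.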

Provenance & license

Produced by a pure dtype cast (CPU, candle) of nomic-ai/CodeRankEmbed model.safetensors; config.json and tokenizer.json copied unchanged. Inherits the base model's MIT license. Credit and citation belong to the original authors — see the base model card and the CoRNStack paper (arXiv:2412.01007).