The use case
Most of our customers run knowledge bases, product catalogs, FAQs, and internal docs. The agent retrieves relevant chunks at query time, so retrieval quality directly drives answer quality.
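For a concrete picture of query-time retrieval, here is a minimal sketch. It assumes the query and every chunk have already been embedded as plain lists of floats. The names `cosine` and `top_k_chunks` and the brute-force scan are illustrative assumptions, not a description of any particular system; at scale this step would typically go through an approximate nearest-neighbor index.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) for the k most similar chunks.

    Hypothetical helper: scores every chunk against the query, then sorts
    by similarity, highest first.
    """
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The indices returned by `top_k_chunks` would then be mapped back to chunk text and passed to the agent as context.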
We started with OpenAI text-embedding-3-large. Then Cohere. Then a popular open-source model. Each was decent. None was great for our specific data.
Training a domain-specific model
We fine-tuned a 350M-param base model on 8M query-document pairs from our customer data (with consent and proper privacy controls). Training took 4 days on 8 H100s.
Our model is smaller than text-embedding-3-large but performs measurably better on our retrieval benchmark. Domain specificity beats raw scale when the domain is concentrated enough.
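This post does not say which training objective was used on the query-document pairs. One common choice for this kind of data is a contrastive loss with in-batch negatives, where each query's paired document is the positive and every other document in the batch serves as a negative. The sketch below assumes that objective purely for illustration. The function name, the temperature value, and the pure-Python vectors are all hypothetical; real training would run on a deep learning framework.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(
    query_vecs: Sequence[Sequence[float]],
    doc_vecs: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the post
) -> float:
    """Mean cross-entropy over a batch where doc_vecs[i] is the positive
    for query_vecs[i] and all other documents act as negatives."""
    queries = [_normalize(v) for v in query_vecs]
    docs = [_normalize(v) for v in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [_dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss falls as each query moves closer to its own document and further from the other documents in the batch, which is the behavior a retrieval model needs.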
Results
Retrieval recall@5 on our benchmark went from 71% (best off-the-shelf) to 88%. End-to-end answer quality (judged by humans) improved by 14 percentage points.
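For readers unfamiliar with the metric, here is a minimal sketch of one common definition of recall@k: for each query, the fraction of its labeled relevant documents that appear in the top k results, averaged over queries. The exact definition behind the numbers above is not spelled out in this post, so treat this variant, the function name, and the input shapes as assumptions.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Mean per-query recall@k.

    ranked:   query id -> document ids ordered best first (hypothetical shape)
    relevant: query id -> set of document ids labeled relevant
    """
    scores = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries with no labeled relevant documents
        top_k = set(ranked.get(query_id, [])[:k])
        scores.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0
```

When every query has exactly one relevant document, this reduces to the hit rate: the share of queries whose relevant document shows up in the top k.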
We're publishing the eval methodology in our research GitHub repo. We're not releasing the weights, which are competitively sensitive.