Why we trained our own embedding model, MyChatBot Blog

The use case

Most of our customers run knowledge bases, product catalogs, FAQs, and internal docs. The agent retrieves relevant chunks at query time, so retrieval quality directly drives answer quality.
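To make the retrieval step concrete, here is a minimal sketch of embedding-based top-k lookup using cosine similarity. The chunk IDs and the 3-dimensional vectors are made up for illustration, and a production system would typically use a vector index instead of a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return the IDs of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]

# Hypothetical chunk embeddings; real ones come from the embedding model.
chunks = {
    "returns-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-auth": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of a question about returns

print(top_k(query, chunks, k=2))
```

Whatever lands in the top k is all the agent sees when it writes an answer, which is why the embedding model matters so much.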

We started with OpenAI text-embedding-3-large. Then Cohere. Then a popular open-source model. Each was decent. None was great for our specific data.

Training a domain-specific model

We fine-tuned a 350M-param base model on 8M query-document pairs from our customer data (with consent and proper privacy controls). Training took 4 days on 8 H100s.
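For readers unfamiliar with training on query-document pairs, a common objective is a contrastive loss with in-batch negatives (InfoNCE). The sketch below assumes that objective purely for illustration; this post does not specify which loss we used, and the temperature value and function names are hypothetical.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean contrastive loss over a batch of (query, document) pairs.

    Row i of query_vecs is paired with row i of doc_vecs; every other
    document in the batch is treated as a negative for query i.
    """
    queries = [_normalize(q) for q in query_vecs]
    docs = [_normalize(d) for d in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

# Toy batch: correctly aligned pairs give a much lower loss than shuffled ones.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing a loss like this pulls each query toward its paired document and pushes it away from the other documents in the batch.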

Our model is smaller than text-embedding-3-large but performs measurably better on our retrieval benchmark. Domain specificity beats raw scale when the domain is concentrated enough.

Results

Retrieval recall@5 on our benchmark went from 71% (best off-the-shelf) to 88%, a gain of 17 percentage points. End-to-end answer quality (judged by humans) improved by 14 percentage points.
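As a reference point, recall@k can be computed as in the sketch below. It assumes one common definition: for each query, the fraction of its relevant chunks that appear in the top k results, averaged over queries. The exact definition behind our numbers is in the eval methodology, and the query and chunk IDs here are made up.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Average per-query recall@k.

    retrieved: dict of query ID -> ranked list of chunk IDs (best first)
    relevant:  dict of query ID -> set of chunk IDs judged relevant
    Queries with no relevant chunks are skipped.
    """
    scores = []
    for qid, rel in relevant.items():
        if not rel:
            continue
        top = set(retrieved.get(qid, [])[:k])
        scores.append(len(top & rel) / len(rel))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical judgments and rankings for two queries.
relevant = {"q1": {"a"}, "q2": {"b", "c"}}
retrieved = {
    "q1": ["a", "x", "y", "z", "w"],       # the relevant chunk is ranked first
    "q2": ["b", "x", "y", "z", "w", "c"],  # "c" falls just outside the top 5
}

print(recall_at_k(retrieved, relevant, k=5))
```

When every query has exactly one relevant chunk, this reduces to the share of queries whose answer chunk appears in the top 5.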

We're publishing the eval methodology in our research GitHub repo. We are not releasing the weights, which are a competitive asset.

#ml #research

Anna Roman

Lead Researcher

Leads our applied ML research. Published widely on multi-agent systems. Believes good evals are 80% of good AI.

