The Embedding Model Decision Nobody Talks About
Most teams building RAG systems spend weeks debating which LLM to use as their generator, then spend about 20 minutes picking an embedding model. That's backwards.
Your embedding model determines whether your retrieval system can actually find the relevant chunks when a user asks a question. A mediocre generator with excellent retrieval will outperform an excellent generator with mediocre retrieval almost every time. Yet the embedding model comparison gets skipped in favour of arguing about GPT-4 versus Claude.
This article gives you a practical framework for choosing an embedding model that fits your documents, your infrastructure, and your budget - without the vendor marketing noise.
What Embedding Models Actually Do in a RAG Pipeline
Before comparing options, it's worth being precise about the job. An embedding model converts text - a document chunk, a query, a sentence - into a dense vector, typically between 384 and 3,072 dimensions depending on the model. Your retrieval system then uses cosine similarity or dot product to find stored vectors that are close to the query vector.
The quality of that vector representation determines retrieval quality. If your embedding model encodes "invoice payment terms" and "net 30 days" as vectors that are far apart in the embedding space, your system will miss relevant documents even when the answer is sitting right there.
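To make the distance computation concrete, here is a minimal pure-Python sketch of cosine similarity over toy 4-dimensional vectors (real models produce 384-3,072 dimensions; the values here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings.
query      = [0.1, 0.8, 0.3, 0.0]
near_chunk = [0.2, 0.7, 0.4, 0.1]   # semantically close -> high similarity
far_chunk  = [0.9, 0.0, 0.1, 0.8]   # unrelated -> low similarity

print(cosine_similarity(query, near_chunk) > cosine_similarity(query, far_chunk))  # prints True
```

If the model maps "invoice payment terms" and "net 30 days" to vectors like `query` and `far_chunk`, no amount of clever retrieval code will recover the match.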
Two properties matter most:
- Semantic accuracy - does the model understand that "staff turnover" and "employee attrition" mean the same thing?
- Domain alignment - was the model trained on text similar to your documents?
A model trained heavily on web content will perform differently on legal contracts, medical records, or engineering specifications. This is the core issue that most embedding model comparisons gloss over.
The Main Contenders: A Practical Overview
Here's a grounded look at the models you'll actually encounter when building production RAG systems.
OpenAI text-embedding-3-small and text-embedding-3-large
OpenAI's current embedding models are the default choice for many teams, largely because they're already using the OpenAI API. text-embedding-3-small produces 1,536-dimensional vectors (configurable down to 512) and costs $0.02 per million tokens. text-embedding-3-large uses 3,072 dimensions and costs $0.13 per million tokens.
Both perform well on general English text. The large model shows meaningful improvements on technical and specialised content. The practical limitation is API dependency - your retrieval latency includes a network round trip, and you're subject to rate limits and pricing changes.
Cohere Embed v3
Cohere's embed-english-v3.0 and embed-multilingual-v3.0 models are worth serious consideration, particularly for organisations with multilingual document sets. The English model produces 1,024-dimensional vectors and consistently performs well on domain-specific content in benchmarks like BEIR.
Cohere also offers an input_type parameter that lets you specify whether you're embedding a document or a query. This matters because the optimal representation differs between the two - a document should encode its content, while a query should encode its intent. Most other API providers don't expose this distinction.
Sentence Transformers (Open Source)
The sentence-transformers library gives you access to hundreds of models you can run locally. The most commonly used include:
- all-MiniLM-L6-v2 - 384 dimensions, extremely fast, reasonable quality for general text
- all-mpnet-base-v2 - 768 dimensions, better quality, still fast
- bge-large-en-v1.5 - 1,024 dimensions, strong benchmark performance, from BAAI
Running these locally eliminates API costs and latency, which matters significantly at scale. A team processing 10 million document chunks pays nothing in embedding costs beyond compute. The trade-off is that you're managing the infrastructure yourself.
Voyage AI
Voyage AI has gained traction specifically because of their domain-specific models. voyage-law-2, voyage-code-2, and voyage-finance-2 are trained on domain-specific corpora. If you're building a RAG system over legal documents or a codebase, these are worth benchmarking seriously. The general voyage-large-2 also performs well on MTEB benchmarks.
How to Run a Meaningful Embedding Models Comparison
Generic benchmarks like MTEB give you a starting point, but your documents are not the MTEB test set. Here's how to run a comparison that's actually relevant to your use case.
Step 1: Build a small evaluation set. Take 50-100 representative documents from your corpus and write 3-5 questions per document that a real user might ask. Record the correct source chunk for each question. This takes a few hours, but it's the only way to get reliable results.
Step 2: Embed and index with each candidate model. Use the same chunking strategy, the same vector database, and the same retrieval parameters for every model. The only variable should be the embedding model.
Step 3: Measure retrieval metrics. For each question, check whether the correct chunk appears in the top-k results (typically top-3 or top-5). Calculate recall@3 and recall@5 for each model. If you have the time, also measure Mean Reciprocal Rank (MRR), which rewards models that put the right answer at position 1 rather than position 3.
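The metrics themselves are a few lines of code. This sketch assumes each model's retriever returns a ranked list of chunk ids per question, with a recorded gold chunk id per question (the ids and data below are hypothetical):

```python
def recall_at_k(results, gold_ids, k):
    # Fraction of queries whose gold chunk appears in the top-k results.
    hits = sum(1 for ranked, gold in zip(results, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

def mean_reciprocal_rank(results, gold_ids):
    # 1/rank of the gold chunk, averaged over queries; 0 when it is missing.
    total = 0.0
    for ranked, gold in zip(results, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Hypothetical retrieval runs: top-3 chunk ids per question, best first.
retrieved = [["c7", "c2", "c9"], ["c4", "c1", "c3"], ["c5", "c8", "c6"]]
gold = ["c2", "c1", "c0"]

print(recall_at_k(retrieved, gold, 3))        # 2 of 3 queries hit -> 0.666...
print(mean_reciprocal_rank(retrieved, gold))  # (1/2 + 1/2 + 0) / 3 -> 0.333...
```

Run the same functions over each model's results and the comparison reduces to a handful of numbers per model.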
Step 4: Analyse failure cases. When a model misses, look at what it retrieved instead. Is it consistently confused by specific terminology in your domain? Is it retrieving chunks that are topically adjacent but wrong? This tells you whether the problem is fixable with better chunking or whether the model genuinely lacks domain knowledge.
A concrete example: a legal services firm processing Australian contract law documents ran this process across four models. OpenAI text-embedding-3-large achieved recall@5 of 0.71. Cohere embed-english-v3.0 reached 0.74. Voyage's voyage-law-2 hit 0.83. The difference between 0.71 and 0.83 on recall@5 is significant - it means roughly 12 in every 100 queries that would have returned a wrong or incomplete answer now return the correct one. For a legal application, that's not a marginal improvement.
Dimensions, Costs, and the Latency Trade-off
Higher-dimensional embeddings generally capture more semantic nuance, but they come with real costs:
- Storage - a 3,072-dimensional float32 vector takes 12KB (3,072 dimensions × 4 bytes). One million chunks takes roughly 12GB. A 384-dimensional vector takes 1.5KB, so the same million chunks fit in about 1.5GB.
- Query latency - similarity search over higher-dimensional vectors is slower, particularly at scale. This compounds if you're using a managed vector database that charges by query.
- Indexing time - re-embedding a large document corpus when you switch models is expensive. Budget for this before committing.
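The storage arithmetic above is easy to reproduce before you commit to a dimension count - float32 means 4 bytes per dimension:

```python
def vector_storage_bytes(dims, n_vectors, bytes_per_dim=4):
    # float32 stores each dimension in 4 bytes.
    return dims * bytes_per_dim * n_vectors

per_large = vector_storage_bytes(3072, 1)             # 12,288 bytes, ~12KB per vector
per_small = vector_storage_bytes(384, 1)              # 1,536 bytes, ~1.5KB per vector
corpus_large = vector_storage_bytes(3072, 1_000_000)  # ~12.3 billion bytes, ~12GB
```

Multiply by your expected chunk count and any index overhead your vector database adds before deciding whether the larger model is worth it.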
For most production systems under a few million chunks, dimension count won't be your primary bottleneck. But if you're building at scale or need sub-100ms retrieval, it's worth profiling. Many vector databases support approximate nearest neighbour (ANN) indexing like HNSW, which mitigates the latency issue considerably.
The Matryoshka Representation Learning (MRL) approach used in OpenAI's v3 models lets you truncate vectors to smaller dimensions without retraining. You can embed at 1,536 dimensions and store at 512, trading some quality for storage and speed. This is worth testing if you're cost-sensitive.
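The truncation itself is mechanical: keep the leading dimensions and re-normalise to unit length so cosine similarity still behaves. A minimal sketch - the toy vector below stands in for a real 1,536-dimensional embedding, and the graceful quality degradation only holds for MRL-trained models:

```python
import math

def truncate_embedding(vec, dims):
    # Keep the first `dims` components, then re-normalise to unit length
    # so cosine similarity remains meaningful after truncation.
    # NOTE: only sensible for MRL-trained models, which front-load information
    # into the leading dimensions; arbitrary models degrade badly.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]   # toy stand-in for a 1,536-dim vector
short = truncate_embedding(full, 4)      # 4 dimensions, unit length
```

Benchmark the truncated vectors with the same recall@k process before committing - the storage savings are guaranteed, the quality cost is not.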
When to Use a Local Model vs an API
This decision comes down to three factors: data sensitivity, volume, and operational complexity tolerance.
Use a local model if:
- Your documents contain sensitive or regulated information (health records, legal files, financial data) and you can't send them to an external API
- You're processing high volumes where API costs become material
- You need consistent latency without network dependency
- Your organisation has the infrastructure capability to run and maintain the model
Use an API model if:
- You're in early stages and want to move fast without infrastructure overhead
- Your document volume is moderate and API costs are manageable
- You need multilingual support and the API model handles it well out of the box
- You don't have the team to manage model hosting
A practical middle path: use an API model during development and evaluation, then migrate to a local model for production if costs or data sensitivity require it. The migration is straightforward if you've abstracted your embedding layer properly - change one function, re-embed your corpus, done.
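One way to keep that abstraction honest is to type the embedding layer as a plain function from texts to vectors and inject it wherever embeddings are needed. The names below are illustrative, and the hash-based backend is a deterministic toy for exercising the pipeline, not a real model:

```python
from typing import Callable, List

# Any callable mapping a batch of texts to a batch of vectors can be the backend.
EmbedFn = Callable[[List[str]], List[List[float]]]

def make_toy_embedder(dims: int = 8) -> EmbedFn:
    # Deterministic stand-in: token counts hashed into `dims` buckets.
    # A real backend would wrap an API client or a local model instead.
    def embed(texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * dims
            for token in text.lower().split():
                vec[hash(token) % dims] += 1.0
            vectors.append(vec)
        return vectors
    return embed

def build_index(chunks: List[str], embed: EmbedFn):
    # Swapping models means passing a different `embed` and re-running this.
    return list(zip(chunks, embed(chunks)))

index = build_index(["net 30 payment terms", "employee attrition policy"],
                    make_toy_embedder())
```

With this shape, the API-to-local migration really is one function swap plus a re-embedding run - nothing downstream of `build_index` changes.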
What to Do Next
If you're currently running a RAG system without having benchmarked your embedding model against alternatives, that's the first thing to fix. Here's a concrete starting point:
- Pull 50 representative documents from your corpus today and write questions a real user would ask about them.
- Run the evaluation process described above against at least three models - one API model, one open source model from sentence-transformers, and one domain-specific model if your content warrants it.
- Check your chunking strategy alongside the model comparison - a 512-token chunk with overlap often outperforms a 1,024-token chunk, and this interacts with embedding model performance.
- Document your baseline before changing anything. You can't know whether a new model is better if you don't know how the current one performs.
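The chunking interaction mentioned in the list above is cheap to experiment with. Here is a minimal overlapping chunker using whitespace tokens as a rough proxy for model tokens (real token counts require the embedding model's own tokeniser, so treat the sizes as approximate):

```python
def chunk_tokens(tokens, size, overlap):
    # Slide a window of `size` tokens across the document,
    # stepping by `size - overlap` so consecutive chunks share context.
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

words = [f"w{i}" for i in range(1200)]     # stand-in for a tokenised document
small = chunk_tokens(words, 512, 64)       # more, finer-grained chunks
large = chunk_tokens(words, 1024, 64)      # fewer, coarser chunks
```

Re-run your recall@k evaluation per chunking configuration as well as per model - the best chunk size for one embedding model is not necessarily the best for another.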
If you're starting fresh, default to OpenAI text-embedding-3-large or Cohere embed-english-v3.0 for general English documents, then run your own evaluation before committing to production infrastructure.
The embedding model comparison is not glamorous work. It involves spreadsheets, manual question writing, and a lot of recall calculations. But it's the work that actually determines whether your RAG system is useful or frustrating - and that matters more than which LLM you put at the end of the pipeline.
Exponential Tech works with Australian organisations building production RAG systems across legal, financial, and enterprise knowledge management contexts. If you want help structuring your evaluation process or interpreting your results, get in touch.