How to Build a RAG App with Cohere Embeddings (2026)
RAG (Retrieval-Augmented Generation) lets LLMs answer questions about your data without fine-tuning. Embed your documents, store vectors, search for relevant chunks, and pass them as context to the model. Cohere's Embed v4 and Command R+ make this straightforward — and more cost-effective than most alternatives.
Before choosing RAG, it's worth being clear on the tradeoffs against other approaches. Prompt stuffing (putting all your data directly in the context window) works for small corpora — up to a few hundred pages — but becomes expensive and slow at scale, and recent research suggests models struggle to reason reliably over very long contexts. Fine-tuning teaches the model new behaviors but is a poor fit for grounding answers in specific documents — it doesn't update well, and re-tuning every time your data changes is expensive and slow. RAG is the right choice when you have a large or frequently-updated document corpus and need the model to cite specific sources accurately.
Cohere's position in the RAG landscape is genuinely differentiated. Embed v4 is multilingual across 100+ languages and costs $0.10 per million tokens, undercutting OpenAI's text-embedding-3-large at $0.13/1M tokens. Command R+ was trained specifically for RAG workloads: it handles long context well, produces accurate citations, and says so rather than hallucinating when the context doesn't contain the answer. Rerank v3.5 is a cross-encoder model that dramatically improves retrieval precision, and it isn't a commodity feature you can trivially replicate with another provider. For teams building production RAG, this combination is hard to beat on the price-to-quality curve.
TL;DR
RAG with Cohere + pgvector is production-ready in a weekend. The full stack: Cohere Embed v4 for embeddings, pgvector for storage, Cohere Rerank v3.5 for precision, and Command R+ for generation. Total cost for a 1,000-document corpus with 100 queries/day is roughly $25/month with Command R+ for generation, or under $10/month with Command R.
Two Cohere-specific things to get right from the start: use search_document input type when embedding documents and search_query when embedding queries — getting this wrong silently degrades retrieval quality. And always add the reranking step: it costs $2 per thousand searches but the precision improvement over raw vector similarity is substantial enough to be worth it in almost every production use case.
What You'll Build
- Document chunking and embedding pipeline
- Vector storage and similarity search
- RAG query endpoint with citations
- Conversational RAG with chat history
- Reranking for better relevance
Prerequisites: Node.js 18+, Cohere API key (free tier: 100 API calls/min).
1. Setup
npm install cohere-ai
// lib/cohere.ts
import { CohereClient } from 'cohere-ai';
export const cohere = new CohereClient({
token: process.env.COHERE_API_KEY!,
});
2. Document Chunking
Chunking strategy is the most critical parameter in any RAG system — more impactful than model choice, more impactful than vector database selection. Get it wrong and retrieval quality degrades regardless of everything else you do correctly.
The core tension is between chunk size and retrieval precision. Large chunks (1,000+ tokens) carry more context, which helps the generation step, but they retrieve too broadly — a chunk about "database performance" that also contains a section about "connection pooling" will surface for connection pooling queries even if the most relevant information is elsewhere in your corpus. Small chunks (under 100 tokens) are precise but lose context — a sentence-level chunk about "index scan" won't answer questions that require understanding the surrounding paragraph.
The 200-500 token sweet spot works because it roughly maps to one coherent idea or procedure — a paragraph, a step-by-step process, a code example with explanation. Overlap between chunks (carrying the last 50 words of one chunk into the start of the next) prevents retrieval from missing answers that span a boundary. Character-based chunking is simple to implement but ignores document structure; paragraph-based chunking respects natural content boundaries and generally produces better retrieval quality. For structured documents (markdown, HTML), splitting on headers and then chunking within sections is worth the extra implementation complexity.
// lib/chunker.ts
interface Chunk {
id: string;
text: string;
metadata: {
source: string;
chunkIndex: number;
totalChunks: number;
};
}
export function chunkDocument(
text: string,
source: string,
options: {
chunkSize?: number;
overlap?: number;
} = {}
): Chunk[] {
const { chunkSize = 500, overlap = 50 } = options;
// Note: chunkSize is measured in characters (~4 characters per token for
// English text); overlap is measured in words
// Split by paragraphs first, then combine
const paragraphs = text.split(/\n\n+/).filter(p => p.trim().length > 0);
const chunks: Chunk[] = [];
let currentChunk = '';
let chunkIndex = 0;
for (const paragraph of paragraphs) {
if (currentChunk.length + paragraph.length > chunkSize && currentChunk.length > 0) {
chunks.push({
id: `${source}_chunk_${chunkIndex}`,
text: currentChunk.trim(),
metadata: { source, chunkIndex, totalChunks: 0 },
});
// Keep overlap from end of previous chunk
const words = currentChunk.split(' ');
currentChunk = words.slice(-overlap).join(' ') + '\n\n' + paragraph;
chunkIndex++;
} else {
currentChunk += (currentChunk ? '\n\n' : '') + paragraph;
}
}
// Add final chunk
if (currentChunk.trim()) {
chunks.push({
id: `${source}_chunk_${chunkIndex}`,
text: currentChunk.trim(),
metadata: { source, chunkIndex, totalChunks: 0 },
});
}
// Update total chunks count
return chunks.map(c => ({
...c,
metadata: { ...c.metadata, totalChunks: chunks.length },
}));
}
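The prose above talks in tokens while chunkDocument measures characters. A rough conversion bridges the two; the 4-characters-per-token ratio is a rule of thumb for English prose, not an exact tokenizer:

```typescript
// Rule-of-thumb conversion between the token guidance (200-500 tokens) and
// the character-based chunkSize used by chunkDocument.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Pick a character-based chunkSize that targets a token budget.
export function chunkSizeForTokens(targetTokens: number): number {
  return targetTokens * 4;
}
```

By this estimate, a 400-token target maps to chunkSize: 1600, and the default chunkSize: 500 is only about 125 tokens, below the 200-token floor of the recommended range, so consider raising it for prose-heavy corpora.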
3. Generate Embeddings
Cohere requires different inputType values for document embeddings vs query embeddings, and this distinction is more important than it might appear. Document embeddings are optimized for storage and retrieval — the model encodes them in a way that maximizes their usefulness as retrieval targets. Query embeddings are optimized to match against those document embeddings — they're encoded to find, not to be found. Using search_document for both (the easy mistake to make) degrades retrieval quality silently: your similarity scores will be lower, your top-K results will be less relevant, and you'll struggle to understand why since everything technically "works."
This input type distinction isn't universal among embedding providers: OpenAI, for instance, uses a single embedding space for both documents and queries, while Voyage offers a similar query/document input type option. Cohere's asymmetric embeddings contribute to Embed v4's strong retrieval benchmarks, but only if you use the API correctly. For a deeper comparison of how Cohere's approach compares to alternatives, see our embedding models comparison.
// lib/embeddings.ts
import { cohere } from './cohere';
export async function embedDocuments(texts: string[]): Promise<number[][]> {
// Batch in groups of 96 (Cohere limit)
const batchSize = 96;
const allEmbeddings: number[][] = [];
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize);
const response = await cohere.v2.embed({
texts: batch,
model: 'embed-v4.0',
inputType: 'search_document',
embeddingTypes: ['float'],
});
allEmbeddings.push(...(response.embeddings.float ?? []));
}
return allEmbeddings;
}
export async function embedQuery(query: string): Promise<number[]> {
const response = await cohere.v2.embed({
texts: [query],
model: 'embed-v4.0',
inputType: 'search_query', // Different input type for queries
embeddingTypes: ['float'],
});
return response.embeddings.float![0];
}
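One operational note, offered as a suggestion rather than part of Cohere's SDK: the free tier's 100 calls/min limit means large ingestion runs can hit rate-limit errors, and a small exponential-backoff wrapper around the embed calls keeps them recoverable:

```typescript
// Generic exponential-backoff retry. Wrap embed calls with it, e.g.
// await withRetry(() => cohere.v2.embed({ ... })).
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off 1s, 2s, 4s, ... between attempts.
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

A production version would inspect the error and retry only on 429s and 5xx responses rather than on every failure.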
4. Vector Store
For prototyping, an in-memory vector store gets you running in minutes — no database setup, no infrastructure. It's entirely adequate for development, demos, and datasets under a few thousand documents. The limitation is obvious: data doesn't survive process restarts, and performance degrades linearly with corpus size since cosine similarity is computed sequentially against every document.
For production, pgvector is the right default if you're already using PostgreSQL. The case for pgvector over dedicated vector databases like Pinecone or Weaviate is transactional consistency — your application data and your vector data live in the same database, behind the same transaction guarantees. When you ingest a document and store its embedding, you can do both in one transaction. When you delete a record, you can delete its embedding atomically. Dedicated vector databases require you to manage consistency across two separate systems. For teams that are already operating Postgres (which is most teams), adding pgvector is a single CREATE EXTENSION rather than a new vendor relationship. For a detailed analysis of when dedicated vector databases make sense, see our vector database comparison.
Simple In-Memory Store
// lib/vector-store.ts
import { embedDocuments, embedQuery } from './embeddings';
interface StoredDocument {
id: string;
text: string;
embedding: number[];
metadata: Record<string, any>;
}
class VectorStore {
private documents: StoredDocument[] = [];
async add(docs: { id: string; text: string; metadata: Record<string, any> }[]) {
const texts = docs.map(d => d.text);
const embeddings = await embedDocuments(texts);
for (let i = 0; i < docs.length; i++) {
this.documents.push({
...docs[i],
embedding: embeddings[i],
});
}
}
async search(query: string, topK: number = 5): Promise<StoredDocument[]> {
const queryEmbedding = await embedQuery(query);
// Calculate cosine similarity
const scored = this.documents.map(doc => ({
...doc,
score: cosineSimilarity(queryEmbedding, doc.embedding),
}));
return scored
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}
}
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
export const vectorStore = new VectorStore();
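A quick property check clarifies why cosine similarity suits embedding search: it scores direction, not magnitude. (The function is repeated here so the snippet runs on its own.)

```typescript
// Same cosine similarity as the in-memory store above, repeated so this
// snippet is self-contained.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1: identical direction
console.log(cosineSimilarity([1, 0], [0, 1])); // 0: orthogonal, unrelated
console.log(cosineSimilarity([1, 2], [2, 4])); // 1: same direction, magnitude ignored
```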
With PostgreSQL + pgvector (Production)
// lib/vector-store-pg.ts
import { Pool } from 'pg';
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
// Setup (run once)
export async function initVectorStore() {
await pool.query(`
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
id TEXT PRIMARY KEY,
text TEXT NOT NULL,
embedding vector(1024), -- must match the dimension of the embeddings you store
metadata JSONB DEFAULT '{}'
);
-- ivfflat clusters are learned from existing rows; for best recall, build or
-- rebuild this index after bulk-loading your documents
CREATE INDEX IF NOT EXISTS documents_embedding_idx
ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
`);
}
export async function addDocuments(
docs: { id: string; text: string; embedding: number[]; metadata: Record<string, any> }[]
) {
const query = `
INSERT INTO documents (id, text, embedding, metadata)
VALUES ($1, $2, $3::vector, $4)
ON CONFLICT (id) DO UPDATE SET
text = EXCLUDED.text,
embedding = EXCLUDED.embedding,
metadata = EXCLUDED.metadata
`;
for (const doc of docs) {
await pool.query(query, [
doc.id,
doc.text,
`[${doc.embedding.join(',')}]`,
JSON.stringify(doc.metadata),
]);
}
}
export async function searchDocuments(queryEmbedding: number[], topK: number = 5) {
const result = await pool.query(
`SELECT id, text, metadata, 1 - (embedding <=> $1::vector) as score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2`,
[`[${queryEmbedding.join(',')}]`, topK]
);
return result.rows;
}
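The transactional-consistency argument from the pgvector discussion can be sketched concretely. This is an illustration, not part of the schema above: the notes table is hypothetical, and the minimal SqlClient type stands in for pg's PoolClient so the snippet is self-contained:

```typescript
// Minimal structural type so this sketch runs without installing pg; a real
// implementation would use PoolClient from 'pg' with pool.connect()/release().
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// pgvector expects the '[0.1,0.2,...]' literal format for vector parameters.
export function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// Store an application row and its embedding atomically: both writes commit
// together, or neither does. The hypothetical `notes` table illustrates app
// data living beside the `documents` table from the schema above.
export async function saveNoteWithEmbedding(
  client: SqlClient,
  id: string,
  text: string,
  embedding: number[]
): Promise<void> {
  try {
    await client.query('BEGIN');
    await client.query('INSERT INTO notes (id, body) VALUES ($1, $2)', [id, text]);
    await client.query(
      'INSERT INTO documents (id, text, embedding) VALUES ($1, $2, $3::vector)',
      [id, text, toVectorLiteral(embedding)]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

With a dedicated vector database, the equivalent guarantee requires application-level reconciliation across two systems.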
5. RAG Query
The retrieve-rerank-generate pipeline is the core of a production-quality RAG system. Vector search (retrieve) finds the top candidates using approximate nearest neighbor search — fast but imprecise, because cosine similarity in high-dimensional space doesn't perfectly capture semantic relevance. Reranking applies a cross-encoder model that reads both the query and each candidate document simultaneously and scores their relevance directly — this is slower and more expensive, but dramatically more accurate than vector similarity alone. Generation then takes the top 3-5 reranked results and synthesizes an answer.
The reranking step is where many RAG implementations fall short. Vector search frequently surfaces documents that are lexically similar to the query without being semantically relevant — a document that contains the same keywords as your query but in a completely different context. A cross-encoder reranker catches these false positives because it evaluates the relationship between query and document holistically. Cohere Rerank v3.5 at $2/1K searches is one of the most cost-effective ways to close the gap between prototype retrieval quality and production retrieval quality. For detailed benchmarks across different retrieval approaches, see our vector database comparison.
// lib/rag.ts
import { cohere } from './cohere';
import { vectorStore } from './vector-store';
export async function ragQuery(question: string): Promise<{
answer: string;
sources: { text: string; source: string }[];
}> {
// 1. Search for relevant documents
const relevantDocs = await vectorStore.search(question, 5);
// 2. Rerank for better relevance
const reranked = await cohere.v2.rerank({
model: 'rerank-v3.5',
query: question,
documents: relevantDocs.map(d => ({ text: d.text })),
topN: 3,
});
const topDocs = reranked.results.map(r => relevantDocs[r.index]);
// 3. Generate answer with context
const context = topDocs
.map((doc, i) => `[Source ${i + 1}]: ${doc.text}`)
.join('\n\n');
const response = await cohere.v2.chat({
model: 'command-r-plus',
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based on the provided context.
If the context doesn't contain the answer, say so. Always cite your sources using [Source N] format.`,
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`,
},
],
});
return {
answer: response.message?.content?.[0]?.text ?? 'No answer generated.',
sources: topDocs.map(d => ({
text: d.text.substring(0, 200) + '...',
source: d.metadata.source,
})),
};
}
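One line in ragQuery deserves emphasis: reranked.results.map(r => relevantDocs[r.index]). Rerank returns indices into the candidate list you sent, not document IDs, so mapping back to the originals is your job. Isolated here with made-up scores:

```typescript
// Rerank results carry an `index` into the candidates array you sent, plus a
// relevance score; this maps them back to the original documents in rank order.
interface RerankResult {
  index: number;
  relevanceScore: number;
}

export function applyRerank<T>(candidates: T[], results: RerankResult[]): T[] {
  return results.map(r => candidates[r.index]);
}

const candidates = ['docA', 'docB', 'docC', 'docD'];
// Hypothetical rerank output with topN = 2: docC scored highest, then docA.
const results: RerankResult[] = [
  { index: 2, relevanceScore: 0.98 },
  { index: 0, relevanceScore: 0.71 },
];
console.log(applyRerank(candidates, results)); // ['docC', 'docA']
```

Getting this mapping wrong (for example, slicing the original array instead of indexing through the results) silently discards the reranker's work.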
6. Cohere Chat with RAG (Built-In)
Cohere's Chat API has a native document grounding feature that handles citation generation automatically, producing structured citations in the response that map specific answer spans to source documents. This is worth using when you want clean, structured citations without writing your own citation extraction logic.
The manual pipeline from the previous section gives you more control: you can customize the prompt, adjust how context is formatted, and integrate with any chat model. The built-in RAG feature trades that flexibility for convenience — Cohere handles the context formatting and citation mapping internally. Use the built-in feature for straightforward Q&A applications where structured citations are a first-class requirement. Use the manual pipeline when you need precise prompt control, are using a model other than Command R/R+, or have custom reranking or filtering logic that doesn't fit the built-in document format.
// lib/cohere-rag.ts
import { cohere } from './cohere';
import { vectorStore } from './vector-store';
export async function cohereRAG(question: string) {
// Get relevant documents
const relevantDocs = await vectorStore.search(question, 5);
const response = await cohere.v2.chat({
model: 'command-r-plus',
messages: [
{
role: 'user',
content: question,
},
],
documents: relevantDocs.map(doc => ({
id: doc.id,
data: {
text: doc.text,
source: doc.metadata.source,
},
})),
});
return {
answer: response.message?.content?.[0]?.text,
citations: response.message?.citations ?? [],
};
}
7. Conversational RAG
Single-turn RAG is straightforward. Multi-turn RAG introduces a subtle but critical problem: follow-up questions almost always contain pronouns and references that only make sense in the context of previous exchanges. "What about the pricing?" after a question about Cohere embeddings is a perfectly natural follow-up, but it's not a useful query for vector search — the retrieval system doesn't know that "pricing" refers to Cohere embedding pricing specifically.
The query rewriting step solves this by using the LLM to reformulate follow-up questions into self-contained queries before passing them to the retrieval system. "What about the pricing?" becomes "What is the pricing for Cohere Embed v4?" — a query that retrieves relevant documents accurately. This rewriting step adds one LLM call per turn, but the improvement in retrieval quality for multi-turn conversations makes it worth the cost. Skipping it means follow-up questions frequently return irrelevant documents and the conversation quality degrades quickly after the first turn.
// lib/conversational-rag.ts
import { cohere } from './cohere';
import { vectorStore } from './vector-store';
interface Message {
role: 'user' | 'assistant';
content: string;
}
export async function conversationalRAG(
question: string,
chatHistory: Message[]
): Promise<{ answer: string; sources: any[] }> {
// 1. Rewrite question with context from chat history
let searchQuery = question;
if (chatHistory.length > 0) {
const rewrite = await cohere.v2.chat({
model: 'command-r-plus',
messages: [
{
role: 'system',
content: 'Rewrite the user question to be self-contained, incorporating context from chat history. Return only the rewritten question.',
},
...chatHistory.map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
})),
{ role: 'user' as const, content: question },
],
});
searchQuery = rewrite.message?.content?.[0]?.text ?? question;
}
// 2. Search with rewritten query
const relevantDocs = await vectorStore.search(searchQuery, 5);
// 3. Rerank
const reranked = await cohere.v2.rerank({
model: 'rerank-v3.5',
query: searchQuery,
documents: relevantDocs.map(d => ({ text: d.text })),
topN: 3,
});
const topDocs = reranked.results.map(r => relevantDocs[r.index]);
// 4. Generate with full chat history
const response = await cohere.v2.chat({
model: 'command-r-plus',
messages: [
{
role: 'system',
content: 'Answer based on the provided documents. Cite sources.',
},
...chatHistory.map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
})),
{
role: 'user',
content: question,
},
],
documents: topDocs.map(doc => ({
id: doc.id,
data: {
text: doc.text,
source: doc.metadata.source,
},
})),
});
return {
answer: response.message?.content?.[0]?.text ?? '',
sources: topDocs.map(d => ({ text: d.text.substring(0, 150), source: d.metadata.source })),
};
}
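One addition worth considering (our suggestion, not part of the pipeline above): cap how much history feeds the rewrite and generation calls. Every turn otherwise grows the token bill linearly, and the last few turns usually carry the referents a follow-up question needs:

```typescript
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// Keep only the most recent messages; 6 (three exchanges) is a judgment call,
// not a Cohere recommendation.
export function recentHistory(history: Message[], maxMessages = 6): Message[] {
  return history.slice(-maxMessages);
}
```

Usage is a one-line change: conversationalRAG(question, recentHistory(chatHistory)).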
8. API Route
// app/api/rag/route.ts
import { NextResponse } from 'next/server';
import { ragQuery } from '@/lib/rag';
export async function POST(req: Request) {
try {
const { question } = await req.json();
if (!question || typeof question !== 'string') {
return NextResponse.json({ error: 'Question required' }, { status: 400 });
}
const result = await ragQuery(question);
return NextResponse.json(result);
} catch (err) {
console.error('RAG query failed:', err);
return NextResponse.json({ error: 'Internal server error' }, { status: 500 });
}
}
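A small validation helper can fail fast before any paid embed/rerank/generate call; the 2,000-character cap here is an arbitrary example, not a Cohere limit:

```typescript
// Validate a question before spending money on API calls. The length cap is
// an arbitrary illustration; tune it to your use case.
export function validateQuestion(
  input: unknown
): { ok: true; question: string } | { ok: false; error: string } {
  if (typeof input !== 'string' || input.trim().length === 0) {
    return { ok: false, error: 'Question required' };
  }
  if (input.length > 2000) {
    return { ok: false, error: 'Question too long' };
  }
  return { ok: true, question: input.trim() };
}
```

The discriminated-union return type lets the route branch on ok and get a correctly typed question without casts.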
9. Ingestion Script
A one-shot ingestion script works for getting started, but production RAG systems need incremental ingestion — the ability to add new documents, update changed documents, and remove deleted documents without re-embedding the entire corpus. The ON CONFLICT (id) DO UPDATE pattern in the pgvector schema handles upserts correctly. For deletion, maintain a set of known document IDs and remove embeddings for documents that no longer exist in your source.
For large corpora, consider batching ingestion with checkpointing so you can resume interrupted jobs. Cohere's 96-document batch limit means a 10,000-document corpus requires about 105 API calls — well within limits, but worth running as a background job rather than a synchronous request. Tracking ingestion state (which documents have been embedded at which version) also lets you skip re-embedding for unchanged documents on subsequent runs, keeping incremental ingestion costs near zero.
// scripts/ingest.ts
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';
import { chunkDocument } from '../lib/chunker';
import { vectorStore } from '../lib/vector-store';
async function ingest(docsDir: string) {
const files = readdirSync(docsDir).filter(f => f.endsWith('.md') || f.endsWith('.txt'));
console.log(`Found ${files.length} documents`);
for (const file of files) {
const content = readFileSync(join(docsDir, file), 'utf-8');
const chunks = chunkDocument(content, file);
console.log(` ${file}: ${chunks.length} chunks`);
await vectorStore.add(chunks);
}
console.log('Ingestion complete!');
}
ingest('./docs').catch(console.error);
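The "skip unchanged documents" idea from the incremental-ingestion notes can be sketched with content hashing. Persisting seenHashes between runs (a JSON file or a database table) is left out here:

```typescript
import { createHash } from 'node:crypto';

// Hash file content so unchanged documents can be skipped on later runs.
export function contentHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// Returns true when `source` is new or its content changed since the last
// run, updating the hash map as a side effect.
export function needsReembedding(
  source: string,
  text: string,
  seenHashes: Map<string, string>
): boolean {
  const hash = contentHash(text);
  if (seenHashes.get(source) === hash) return false; // unchanged: skip
  seenHashes.set(source, hash);
  return true;
}
```

In the ingest loop, guarding vectorStore.add with needsReembedding(file, content, seenHashes) keeps re-run costs near zero for a mostly static corpus.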
Cohere Pricing
| Model | Price |
|---|---|
| Embed v4 | $0.10 / 1M tokens |
| Command R+ | $2.50 / 1M input, $10 / 1M output |
| Command R | $0.15 / 1M input, $0.60 / 1M output |
| Rerank v3.5 | $2.00 / 1K searches |
Example cost: 1,000 documents (500 tokens each) embedded, plus 100 RAG queries/day:
- Embedding: $0.05 one-time (500K tokens at $0.10/1M)
- Queries: ~$0.80/day with Command R+ (a fixed $0.20 for 100 reranks, plus roughly $0.60 of generation assuming ~1,500 tokens of context and a short answer per query; query embedding is negligible)
- Total: ~$25/month with Command R+, or under $10/month if you swap in Command R for generation
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Chunks too large | Poor retrieval precision | Keep chunks to 200-500 tokens |
| Chunks too small | Missing context | Use overlap between chunks |
| Same input type for docs and queries | Poor search quality | Use search_document and search_query |
| No reranking | Irrelevant results ranked high | Add Cohere Rerank step |
| Stuffing all results into prompt | Noise overwhelms signal | Rerank and use top 3-5 results |
| No chat history rewriting | Follow-up questions fail | Rewrite questions with context |
Building AI-powered search? Compare Cohere vs OpenAI Embeddings vs Voyage AI on APIScout — embedding quality, pricing, and RAG performance.