Building a RAG Pipeline (2026)
TL;DR
For most teams: pgvector. It runs in your existing Postgres database, costs nothing extra, and handles millions of vectors comfortably. Pinecone wins when you need a fully managed, scale-to-zero vector store with zero operational overhead. Weaviate wins when you want AI-native features (built-in vectorization, hybrid BM25+vector search, graph traversal) without running your own model. The "best" vector database in 2026 depends almost entirely on your existing stack — don't introduce a new service if Postgres already works.
Key Takeaways
- pgvector: free, Postgres-native, handles 1M+ vectors easily, no new infra
- Pinecone: best managed vector DB, auto-scaling serverless, free tier (roughly 100K vectors), consistently low-latency similarity search
- Weaviate: best AI-native features (HNSW + BM25 hybrid), multimodal, self-host or cloud
- Embedding models: text-embedding-3-small (OpenAI, cheap) or nomic-embed-text (open source, free)
- RAG latency: pgvector ~20-50ms, Pinecone ~15-25ms, Weaviate ~10-20ms
- Rule of thumb: <500K vectors → pgvector, 500K-10M → Pinecone, complex queries → Weaviate
The RAG Architecture
Before comparing databases, understand the full pipeline:
Document Ingestion:
Raw docs → Chunk → Embed → Store in vector DB
(Run once, or incrementally as content changes)
Query Time:
User query → Embed query → Similarity search → Get top-K chunks
↓
Inject into LLM prompt
↓
LLM generates answer
All three vector databases handle the "Store" and "Similarity search" steps. The embeddings step is the same regardless of which you use.
Setting Up Embeddings (Same for All Three)
// embeddings.ts — Generate embeddings with OpenAI:
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function embed(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small', // 1536 dims, $0.02/1M tokens
// model: 'text-embedding-3-large', // 3072 dims, $0.13/1M tokens
input: text.replace(/\n/g, ' '),
});
return response.data[0].embedding;
}
// Batch embeddings (more efficient):
export async function embedBatch(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts.map((t) => t.replace(/\n/g, ' ')),
});
return response.data.map((d) => d.embedding);
}
// Chunking strategy (critical for RAG quality):
export function chunkDocument(content: string, chunkSize = 500, overlap = 50): string[] {
const words = content.split(/\s+/);
const chunks: string[] = [];
for (let i = 0; i < words.length; i += chunkSize - overlap) {
const chunk = words.slice(i, i + chunkSize).join(' ');
if (chunk.length > 100) { // Skip tiny chunks
chunks.push(chunk);
}
}
return chunks;
}
// For code documentation — split by function/class, not words
// For PDFs — split by page or paragraph
// For markdown — split by heading sections
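The markdown case from the comments above can be sketched as a heading-aware chunker (a hypothetical helper, not part of the embeddings module above) that keeps each heading together with its body:

```typescript
// chunkMarkdown: split a markdown document into one chunk per heading
// section, keeping the heading line attached to the text beneath it.
export function chunkMarkdown(content: string): string[] {
  const lines = content.split('\n');
  const chunks: string[] = [];
  let current: string[] = [];
  for (const line of lines) {
    // A new heading (#, ##, ... up to ######) closes the previous section
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join('\n').trim());
  return chunks.filter((c) => c.length > 0);
}
```

Each chunk then goes through `embedBatch` exactly like the word-based chunks; the only difference is that section boundaries, not word counts, decide where one chunk ends and the next begins.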
pgvector: RAG in Postgres
Best choice when: you're already using Postgres (Supabase, Neon, Railway, self-hosted)
Setup
-- Enable the extension (Supabase: already enabled by default):
CREATE EXTENSION IF NOT EXISTS vector;
-- Create documents table:
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding VECTOR(1536), -- Match your model's dimensions
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create HNSW index (best for most cases):
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- OR IVFFlat (faster to build and lighter on memory, but lower recall than HNSW):
-- CREATE INDEX ON documents
-- USING ivfflat (embedding vector_cosine_ops)
-- WITH (lists = 100); -- lists ≈ sqrt(num_rows)
// pgvector with Drizzle ORM:
import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';
import { sql } from 'drizzle-orm';
import { customType, jsonb, pgTable, serial, text, timestamp } from 'drizzle-orm/pg-core';
// Custom vector type for Drizzle:
const vector = (name: string, dimensions: number) =>
customType<{ data: number[]; driverData: string }>({
dataType() {
return `vector(${dimensions})`;
},
toDriver(value: number[]) {
return `[${value.join(',')}]`;
},
fromDriver(value: string) {
return value.slice(1, -1).split(',').map(Number);
},
})(name);
export const documents = pgTable('documents', {
id: serial('id').primaryKey(),
content: text('content').notNull(),
metadata: jsonb('metadata').default({}),
embedding: vector('embedding', 1536),
createdAt: timestamp('created_at').defaultNow(),
});
const db = drizzle(postgres(process.env.DATABASE_URL!));
// Insert documents:
export async function insertDocument(
content: string,
metadata: Record<string, unknown> = {}
) {
const embedding = await embed(content);
await db.insert(documents).values({
content,
metadata,
embedding,
});
}
// Semantic search:
export async function searchDocuments(query: string, limit = 5) {
const queryEmbedding = await embed(query);
// cosine similarity search using pgvector <=> operator:
const results = await db.execute(sql`
SELECT
id,
content,
metadata,
1 - (embedding <=> ${`[${queryEmbedding.join(',')}]`}::vector) AS similarity
FROM documents
ORDER BY embedding <=> ${`[${queryEmbedding.join(',')}]`}::vector
LIMIT ${limit}
`);
// Note: the postgres-js driver returns rows directly (no .rows property)
return results as unknown as Array<{
id: number;
content: string;
metadata: Record<string, unknown>;
similarity: number;
}>;
}
// Full RAG function with pgvector:
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';
export async function ragAnswer(userQuestion: string): Promise<string> {
// 1. Search for relevant context:
const relevant = await searchDocuments(userQuestion, 5);
if (relevant.length === 0) {
return "I don't have information about that in my knowledge base.";
}
// 2. Build context string:
const context = relevant
.map((doc, i) => `[${i + 1}] ${doc.content}`)
.join('\n\n');
// 3. Generate answer:
const { text } = await generateText({
model: openai('gpt-4o'),
system: `You are a helpful assistant. Answer questions based ONLY on the provided context.
If the context doesn't contain enough information, say so.
Context:
${context}`,
prompt: userQuestion,
});
return text;
}
Pinecone: Managed Vector Database
Best choice when: you need a fully managed solution, high query volume, or don't want to manage Postgres.
Setup
// npm install @pinecone-database/pinecone
import { Pinecone } from '@pinecone-database/pinecone';
const pinecone = new Pinecone({
apiKey: process.env.PINECONE_API_KEY!,
});
// Create an index (serverless — scales to zero):
await pinecone.createIndex({
name: 'knowledge-base',
dimension: 1536, // Match embedding model
metric: 'cosine',
spec: {
serverless: {
cloud: 'aws',
region: 'us-east-1',
},
},
});
// Insert vectors:
const index = pinecone.index('knowledge-base');
export async function upsertDocuments(
documents: Array<{ id: string; content: string; metadata?: Record<string, string | number> }>
) {
// Embed in batches:
const batchSize = 100;
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize);
const embeddings = await embedBatch(batch.map((d) => d.content));
await index.upsert(
batch.map((doc, j) => ({
id: doc.id,
values: embeddings[j],
metadata: {
content: doc.content, // Store content in metadata for retrieval
...doc.metadata,
},
}))
);
}
}
// Query vectors:
export async function searchPinecone(
query: string,
filter?: Record<string, string | number>,
limit = 5
) {
const queryEmbedding = await embed(query);
const results = await index.query({
vector: queryEmbedding,
topK: limit,
includeMetadata: true,
filter, // Optional: filter by metadata fields
});
return results.matches.map((match) => ({
id: match.id,
content: match.metadata?.content as string,
score: match.score,
metadata: match.metadata,
}));
}
// Filter example — only search certain document types:
const results = await searchPinecone('pricing questions', {
document_type: 'pricing',
language: 'en',
});
Pinecone Namespaces for Multi-Tenancy
// Namespace isolation per customer (no extra cost):
const customerIndex = pinecone.index('knowledge-base').namespace(`customer-${customerId}`);
// Insert to customer namespace:
await customerIndex.upsert([{ id: 'doc-1', values: embedding, metadata: { content } }]);
// Query only this customer's data:
const results = await customerIndex.query({ vector: queryEmbedding, topK: 5 });
// Clean up when customer leaves:
await customerIndex.deleteAll();
Weaviate: AI-Native Vector Search
Best choice when: you want hybrid search (semantic + keyword), built-in vectorization, or graph relationships between documents.
Setup with Weaviate Cloud
// npm install weaviate-client
import weaviate, { WeaviateClient, dataType } from 'weaviate-client';
const client: WeaviateClient = await weaviate.connectToWeaviateCloud(
process.env.WEAVIATE_URL!,
{
authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY!),
headers: {
'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY!, // For auto-vectorization
},
}
);
// Create collection (Weaviate's equivalent of a table):
await client.collections.create({
name: 'Document',
vectorizers: [
weaviate.configure.vectorizer.text2VecOpenAI({
model: 'text-embedding-3-small',
}),
],
generative: weaviate.configure.generative.openAI({ model: 'gpt-4o' }),
properties: [
{ name: 'content', dataType: dataType.TEXT },
{ name: 'source', dataType: dataType.TEXT },
{ name: 'category', dataType: dataType.TEXT },
],
});
// Insert objects — Weaviate auto-vectorizes:
const collection = client.collections.get('Document');
await collection.data.insertMany([
{ content: 'PostgreSQL is an object-relational database...', source: 'docs/postgres.md', category: 'database' },
{ content: 'MongoDB is a document database...', source: 'docs/mongo.md', category: 'database' },
]);
// No need to call embed() — Weaviate does it automatically
// Hybrid search (vector + BM25 keyword):
const results = await collection.query.hybrid('how do I connect to postgres', {
limit: 5,
alpha: 0.75, // 0 = pure keyword, 1 = pure vector
returnMetadata: ['score', 'explainScore'],
filters: collection.filter.byProperty('category').equal('database'),
});
for (const result of results.objects) {
console.log(`Score: ${result.metadata?.score}, Content: ${result.properties.content}`);
}
// Weaviate Generative Search (built-in RAG):
const results = await collection.generate.nearText(
'how does postgres handle transactions',
{
groupedTask: 'Summarize how the following documents describe database transactions.',
limit: 5,
}
);
// results.generated contains the LLM-generated answer
console.log(results.generated);
// Each result also has the source chunks
Benchmark: Similarity Search Performance
For 1M documents with 1024-dimensional embeddings:
| Database | Query latency (p99) | Throughput | ANN accuracy |
|---|---|---|---|
| Pinecone (serverless) | 15-25ms | High | 99%+ |
| Weaviate (cloud) | 10-20ms | High | 99%+ |
| pgvector (HNSW) | 20-50ms | Medium | 98%+ |
| pgvector (IVFFlat) | 50-150ms | Medium | 95-99% |
For most applications, all three are fast enough. The difference matters at >10M vectors or >100 QPS.
Cost Comparison at Scale
10M vectors, 1536 dimensions, 1000 queries/day:
| Solution | Monthly Cost | Notes |
|---|---|---|
| pgvector on Neon | ~$50 | 8GB storage, compute |
| pgvector on Supabase Pro | $25 | Included in Pro plan |
| Pinecone Serverless | ~$35 | Usage-based estimate |
| Weaviate Cloud | ~$100 | Enterprise features |
| Weaviate Self-hosted | ~$20-50 | Just VPS cost |
pgvector wins on cost for teams already paying for Postgres.
Full Production RAG Checklist
Ingestion pipeline:
[ ] Chunk documents intelligently (by section, not word count)
[ ] Add metadata (source, date, document_type) for filtering
[ ] Deduplicate before upserting (hash content)
[ ] Store original content for retrieval (not just vectors)
[ ] Batch embed for cost efficiency
Query time:
[ ] Embed query with same model used for documents
[ ] Retrieve 5-10 chunks (more = better context, higher cost)
[ ] Hybrid search if keyword matching matters (Weaviate or pgvector + tsvector)
[ ] Filter by metadata when query implies scope (dates, categories)
[ ] Rerank results (Cohere Rerank or cross-encoder) for better accuracy
LLM generation:
[ ] Set clear system prompt: "Answer ONLY from the context provided"
[ ] Include source citations in the prompt
[ ] Handle "not found in context" gracefully
[ ] Use streaming for better UX
[ ] Log what context was used (debugging + audit)
Evaluation:
[ ] Track retrieval recall (were relevant docs in the top-K?)
[ ] Track answer faithfulness (did LLM hallucinate beyond context?)
[ ] Use Ragas or ARES for automated RAG evaluation
Hybrid Search: Combining Vector and Keyword
Pure vector search can miss exact keyword matches that matter for precision. A query like "show me the Stripe API key configuration" should surface documents that literally contain the words "Stripe API key" — not just semantically similar documents about payment configuration. When users search for specific names, product codes, error codes, or exact phrases, vector similarity alone often fails them.
Hybrid search addresses this by combining vector similarity scores with BM25 keyword relevance scores. BM25 is the same algorithm that powers Elasticsearch and most traditional search engines — it rewards exact term matches and penalizes documents that lack the query terms entirely.
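To make that scoring behavior concrete, here is a minimal single-term BM25 score using the textbook formula with the conventional k1 and b defaults (an illustration of the algorithm, not code from any of the three databases):

```typescript
// Minimal BM25 score for one query term against one document.
// tf: term frequency in the doc; docLen/avgDocLen: lengths in tokens;
// totalDocs/docsWithTerm: corpus statistics for the IDF component.
export function bm25Term(
  tf: number,
  docLen: number,
  avgDocLen: number,
  totalDocs: number,
  docsWithTerm: number,
  k1 = 1.2,
  b = 0.75
): number {
  // Rare terms get a larger IDF weight than common ones
  const idf = Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
  // Term frequency saturates (k1) and is normalized by document length (b)
  const norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * docLen) / avgDocLen));
  return idf * norm;
}
```

A document that never contains the term has tf = 0 and therefore scores zero — exactly the hard exact-match signal that pure vector similarity lacks.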
Weaviate has the most mature native hybrid search in 2026. The alpha parameter in collection.query.hybrid() lets you tune the blend continuously: alpha=0 gives pure keyword search (BM25 only), alpha=1 gives pure vector search, and alpha=0.5 weights them equally. In practice, alpha=0.75 — weighted toward vector — works well for most RAG use cases, with keyword scoring acting as a tiebreaker for exact-match queries.
pgvector doesn't have built-in hybrid search, but you can implement it in plain Postgres by combining the <=> vector operator with Postgres tsvector full-text search. The practical approach: run both searches separately to get two ranked lists, then merge them using Reciprocal Rank Fusion (RRF). RRF is a simple, effective rank fusion algorithm that doesn't require score normalization — it works purely from rank positions. A minimal RRF implementation is about 10 lines of SQL. The result is a re-ranked list that scores higher for documents that appear in both the vector and keyword results.
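The same fusion logic, expressed in TypeScript for illustration (the ranked-ID-list shape and k = 60 constant are the conventional RRF setup, not a pgvector API):

```typescript
// Reciprocal Rank Fusion: merge two ranked lists of document IDs.
// Each ID scores the sum of 1 / (k + rank) over the lists it appears in,
// so documents ranked highly in BOTH lists rise to the top.
export function rrfMerge(vectorIds: number[], keywordIds: number[], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const list of [vectorIds, keywordIds]) {
    list.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 is the 1-based position
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Note that only rank positions are used — no score normalization across the two result lists is needed, which is the main reason RRF is so easy to drop in.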
Pinecone supports sparse-dense hybrid search using SPLADE sparse vectors alongside dense embeddings. This requires embedding each document twice — once for dense and once for sparse — and querying with both. More setup than Weaviate's native approach, but effective and well-supported in the Pinecone client.
For most RAG systems, pure vector search is sufficient. Add hybrid search when users consistently report missing obvious results, or when your knowledge base contains a lot of proper nouns, product names, or structured identifiers that semantic search treats as semantically equivalent to similar-sounding terms.
Improving RAG Quality Beyond the Database
The vector database is one piece of a larger pipeline. Teams often spend weeks tuning their vector database configuration when the real quality gains come from elsewhere.
Chunking strategy matters more than database choice. Fixed-size word chunking (split every 500 words) is easy to implement but poor for retrieval quality. Semantic chunking — splitting at natural boundaries like paragraph breaks, section headings, or sentence endings while preserving surrounding context — significantly outperforms fixed-size chunking on most RAG benchmarks. A chunk that cuts a sentence in half or separates a conclusion from its premise forces the retriever to return incomplete context to the LLM. For markdown content, chunk by heading sections. For PDFs, chunk by page or paragraph. For code documentation, chunk by function or class definition.
Reranking is the highest-impact improvement for most RAG systems. After vector search returns your top-20 candidates, a reranker model evaluates each candidate against the original query using a cross-encoder architecture — it sees both the query and the candidate simultaneously, unlike the bi-encoder models used for embedding. Cohere Rerank is the most widely used managed reranker; open-source cross-encoder models (from Hugging Face) are available for self-hosting. Reranking the top-20 down to the top-5 chunks typically yields a significant accuracy improvement at roughly 2x retrieval latency — usually worth it for non-latency-critical applications.
Metadata filtering narrows the search space before vector similarity runs. Store document type, date, author, category, and other structured attributes as metadata on each vector. When a query implies scope — "find recent pricing docs", "show me Python examples" — filter by those metadata fields first. Filtering before similarity search is both faster and more accurate than relying on the embedding model to encode temporal or categorical relevance.
Evaluation frameworks make the difference between guessing and knowing. Ragas and ARES provide automated metrics: context recall (were the relevant documents in the top-K results?), answer faithfulness (did the LLM stay grounded in the retrieved context or hallucinate?), and answer relevancy (did the final answer address the question?). Build a small test set of 50–100 representative queries with known answers before deploying, run your eval suite against it, and use the scores as a baseline. Any change to chunking, embedding model, retrieval count, or prompt should be measured against that baseline.
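The context-recall metric reduces to a small function once you have labeled relevant-document IDs per test query (a sketch of the metric itself; Ragas and ARES compute it for you):

```typescript
// Retrieval recall@K: fraction of known-relevant doc IDs that appear
// in the top-K retrieved IDs. Average this across the test set to get
// a single baseline number per pipeline configuration.
export function recallAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 1; // No relevant docs: vacuously satisfied
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}
```

Run this over your 50–100 query test set after every chunking or embedding change; a drop in average recall@K tells you retrieval regressed before any user notices.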
Production Architecture: Ingestion and Serving Pipelines
A RAG system has two distinct operational modes that require different architectural thinking: ingestion (loading new documents) and serving (answering user queries). Conflating them creates operational problems.
The ingestion pipeline runs asynchronously and can be slow. Documents arrive as files, webhooks, or database changes. Each document must be chunked, embedded, and stored. Embedding 10,000 documents at 500 words each takes real time and costs money — batching efficiently matters. A production ingestion pipeline:
- Receive document via webhook or scheduled job
- Check if document already exists (by hash) — skip if unchanged
- Chunk the document using your strategy
- Embed all chunks in batches of 100–200 (OpenAI's embedding API accepts up to 2048 inputs per request)
- Upsert to vector store — use the document hash as part of the vector ID to enable clean updates when content changes
- Update ingestion metadata (document ID, chunk count, last updated timestamp) in Postgres
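Steps 2 and 5 above can be sketched with Node's crypto module (the chunk-ID scheme here is illustrative, not a Pinecone or pgvector convention):

```typescript
import { createHash } from 'node:crypto';

// Content hash used both for change detection (step 2: skip unchanged
// documents) and as a stable component of chunk IDs (step 5), so
// re-ingesting a changed document replaces its chunks cleanly instead
// of accumulating stale duplicates.
export function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex').slice(0, 16);
}

// Chunk IDs encode the document ID, so deleting all chunks for a
// document is a prefix match on `${docId}:`.
export function chunkId(docId: string, hash: string, chunkIndex: number): string {
  return `${docId}:${hash}:${chunkIndex}`;
}
```

On ingestion, compare the incoming document's hash with the stored one; if they match, skip embedding entirely — that check alone often saves the bulk of embedding spend on re-runs.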
The serving path must be fast. Target under 500ms total for the retrieval step (embedding the query + vector search). Serving latency breaks down as: query embedding (~50ms for OpenAI, ~20ms for a self-hosted model), vector search (~10–50ms depending on database and corpus size), optional reranking (+100ms), and prompt assembly. Keep each step optimized and monitor P95 latency separately — embedding API latency spikes during OpenAI capacity events and should be the first thing you instrument.
Cache common queries. If your knowledge base is relatively static and many users ask similar questions, caching the embedding + retrieval result for frequently repeated queries significantly reduces cost and latency. Use semantic caching (not exact string matching) — an embedding similarity threshold above 0.95 indicates a query that's semantically equivalent to a previously answered one. Upstash Semantic Cache or GPTCache implement this pattern with minimal setup.
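The 0.95 threshold check is plain cosine similarity between the new query's embedding and a cached one (a sketch of the comparison; a real semantic cache like Upstash's or GPTCache does this lookup in the vector store itself):

```typescript
// Cosine similarity between two embedding vectors of equal length.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Serve the cached answer only when a prior query is near-identical.
export function cacheHit(queryEmb: number[], cachedEmb: number[], threshold = 0.95): boolean {
  return cosineSimilarity(queryEmb, cachedEmb) >= threshold;
}
```

Tune the threshold on real traffic: too low and users get answers to a different question; too high and the cache never hits.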
Incremental updates are trickier than initial loads. When a document changes, you need to delete all old chunks (identified by document ID prefix in the vector store) and insert the new chunks. For pgvector, a simple DELETE FROM documents WHERE metadata->>'doc_id' = $1 followed by fresh upserts works. For Pinecone, delete by namespace or ID prefix. Design your chunk IDs to encode the document ID from the start so updates are always clean replacements, not accumulations.
Cost Optimization at Scale
Vector database costs become significant at 10M+ vectors. Two levers reduce costs without sacrificing quality.
Matryoshka embeddings are the most important recent development for cost reduction. OpenAI's text-embedding-3-* models support dimensionality reduction via the dimensions parameter — you can truncate from 1536 to 256 or 512 dimensions with minimal quality loss (the models are trained to pack the most important information in the first dimensions). Fewer dimensions mean smaller index size, lower storage cost, and faster similarity computation:
// Use 512 dimensions instead of 1536:
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
dimensions: 512, // 1/3 the storage and compute of full 1536d
});
At 10M vectors: 1536-dim at float32 = 61GB of vector data. 512-dim = 20GB. The index overhead is proportionally smaller too. Benchmark the quality trade-off on your specific domain before reducing dimensions — technical domains with specialized vocabulary retain more quality at higher dimensions than general-purpose corpora.
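The storage figures above follow directly from vectors × dimensions × bytes per float, which is easy to check:

```typescript
// Raw vector data size in gigabytes (index overhead not included).
// bytesPerDim: 4 for float32, 2 for float16 (halfvec).
export function vectorStorageGB(numVectors: number, dims: number, bytesPerDim = 4): number {
  return (numVectors * dims * bytesPerDim) / 1e9;
}
```

Plugging in 10M vectors gives 61.44 GB at 1536 dimensions and 20.48 GB at 512, matching the numbers quoted above.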
Quantization compresses vectors at the cost of some recall accuracy. pgvector offers scalar quantization via the halfvec type (float16 instead of float32, halving memory); Weaviate supports product quantization (PQ) natively, and Pinecone handles compression internally. For most RAG applications where top-1 precision matters less than recall (you're retrieving 5–10 chunks to give the LLM context, not just the single best match), the accuracy trade-off of quantization is acceptable.
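A halfvec sketch in pgvector terms (assumes pgvector 0.7.0 or later, which introduced the type):

```sql
-- Store embeddings as float16 to halve vector storage:
CREATE TABLE documents_half (
  id BIGSERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding HALFVEC(1536)
);

-- HNSW index over the halfvec column uses a matching operator class:
CREATE INDEX ON documents_half
  USING hnsw (embedding halfvec_cosine_ops);
```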
Methodology
Vector database benchmarks sourced from published ANN benchmarks (ann-benchmarks.com) using HNSW parameters and 1M-vector datasets; actual query latency in production varies by query complexity and concurrency. pgvector version: 0.7.x (HNSW index added in 0.5.0; IVFFlat was the only index type in earlier versions). Pinecone Serverless pricing: $0.033/1M read units, $0.08/1M write units as of March 2026. Weaviate Cloud Starter: $25/month. Embedding costs: text-embedding-3-small at $0.02/1M tokens, text-embedding-3-large at $0.13/1M tokens (OpenAI pricing as of March 2026). Cohere Rerank: $1/1K reranking calls. pgvector HNSW default parameters (m=16, ef_construction=64) are starting points; workload-specific tuning improves recall. Matryoshka embedding quality trade-off data from OpenAI's model card for text-embedding-3-small. Ragas and ARES evaluation framework descriptions based on their published documentation and benchmark papers.
Compare vector databases and AI APIs at APIScout.
Related: Pinecone vs Qdrant vs Weaviate, Vector Database APIs Compared (2026), How to Build a RAG App with Cohere Embeddings