
Building an AI-Powered App: API Stack Guide 2026

APIScout Team

Building an AI-Powered App: Choosing Your API Stack

Building an AI-powered app isn't just picking an LLM. It's choosing providers for inference, embeddings, vector storage, guardrails, monitoring, and more. Here's the complete stack and how to choose the right API for each layer.

Most AI apps fail not because the LLM is bad but because the surrounding stack is poorly chosen. Teams start with a working proof-of-concept using a single LLM and a hardcoded prompt, then ship to production and discover: retrieval quality is poor (chunking strategy, embedding model, vector search configuration all matter), cost is 10x higher than expected (no model routing, no caching, no monitoring), quality degrades over time (no evals, no tracking of which inputs cause problems), and the app is fragile to the LLM provider's outages (no fallback). The stack in this guide addresses each of these.

The architecture diagram in the next section shows the full production stack. Not every app needs every layer — a simple summarization tool needs just an LLM and monitoring, while a full RAG system for a knowledge base needs all of them. The guidance in each section helps you decide whether to add that layer for your use case.

The AI API Stack

┌─────────────────────────────────────┐
│  Frontend / User Interface          │  Chat UI, streaming display
├─────────────────────────────────────┤
│  AI Gateway / Router                │  LiteLLM, Portkey, or custom
│  (model routing, fallback, caching) │
├─────────────────────────────────────┤
│  LLM Provider                       │  OpenAI, Anthropic, Google, OSS
│  (chat, completion, reasoning)      │
├─────────────────────────────────────┤
│  Embeddings                         │  OpenAI, Cohere, Voyage AI
│  (text → vectors for search/RAG)    │
├─────────────────────────────────────┤
│  Vector Database                    │  Pinecone, Weaviate, Qdrant
│  (similarity search, retrieval)     │
├─────────────────────────────────────┤
│  Document Processing                │  Unstructured, LlamaParse
│  (PDF, HTML → chunks)               │
├─────────────────────────────────────┤
│  Guardrails / Safety                │  Guardrails AI, NeMo
│  (content filtering, validation)    │
├─────────────────────────────────────┤
│  Monitoring / Observability         │  Helicone, Langfuse, Braintrust
│  (cost tracking, quality, latency)  │
└─────────────────────────────────────┘

Layer 1: LLM Provider

Provider      Best Model        Strength                     Pricing (1M tokens)
Anthropic     Claude Sonnet     Coding, analysis, safety     $3 in / $15 out
OpenAI        GPT-4o            Multimodal, ecosystem        $5 in / $15 out
Google        Gemini 2.0 Pro    Long context, multimodal     $1.25 in / $5 out
Groq          Llama 3.3 70B     Ultra-fast inference         $0.59 in / $0.79 out
Together AI   Open models       Variety, competitive price   $0.20-3.00

Choosing Your Primary LLM

Need best reasoning/coding? → Claude Sonnet or GPT-4o
Need cheapest good model? → Gemini 2.0 Flash or Llama 3.3
Need fastest inference? → Groq
Need multimodal (vision)? → GPT-4o or Claude Sonnet
Need open-source/self-host? → Llama 3.3 via Together/vLLM
Need 1M+ token context? → Gemini (2M context window)

Basic Integration

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function chat(userMessage: string, systemPrompt?: string) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: systemPrompt || 'You are a helpful assistant.',
    messages: [{ role: 'user', content: userMessage }],
  });

  return response.content[0].type === 'text'
    ? response.content[0].text
    : '';
}

Layer 2: Embeddings

Provider    Model                    Dimensions   Pricing (1M tokens)
OpenAI      text-embedding-3-small   1536         $0.02
OpenAI      text-embedding-3-large   3072         $0.13
Cohere      embed-v4                 1024         $0.10
Voyage AI   voyage-3                 1024         $0.06
Google      text-embedding-004       768          Free (low volume)

Embedding Pipeline

import OpenAI from 'openai';

const openai = new OpenAI();

async function embedText(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });

  return response.data.map(item => item.embedding);
}

// Chunk documents before embedding
function chunkText(text: string, chunkSize: number = 500, overlap: number = 50): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    start += chunkSize - overlap;
  }

  return chunks;
}

Layer 3: Vector Database

Database   Type                    Free Tier               Best For
Pinecone   Managed                 1 index, 100K vectors   Simplest setup
Weaviate   Managed / Self-hosted   14-day trial            Hybrid search
Qdrant     Managed / Self-hosted   1GB free                Performance, self-hosting
Chroma     Self-hosted             Free (OSS)              Local development
pgvector   PostgreSQL extension    Free (with Postgres)    Already using Postgres

RAG Pipeline

// Complete RAG (Retrieval-Augmented Generation) pipeline.
// `vectorDB` stands in for your vector database client (e.g. a Pinecone index);
// `chunkText` and `embedText` are defined in the previous sections.

// 1. Index documents
async function indexDocuments(documents: { id: string; text: string }[]) {
  for (const doc of documents) {
    const chunks = chunkText(doc.text);
    const embeddings = await embedText(chunks);

    // Store in vector database
    await vectorDB.upsert(
      chunks.map((chunk, i) => ({
        id: `${doc.id}_${i}`,
        values: embeddings[i],
        metadata: { text: chunk, documentId: doc.id },
      }))
    );
  }
}

// 2. Query with RAG
async function ragQuery(question: string): Promise<string> {
  // Embed the question
  const [questionEmbedding] = await embedText([question]);

  // Find relevant chunks
  const results = await vectorDB.query({
    vector: questionEmbedding,
    topK: 5,
  });

  // Build context from retrieved chunks
  const context = results.matches
    .map(match => match.metadata.text)
    .join('\n\n');

  // Generate answer with context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: `Answer based on this context:\n\n${context}`,
    messages: [{ role: 'user', content: question }],
  });

  return response.content[0].type === 'text' ? response.content[0].text : '';
}

Embedding Model Selection

The choice of embedding model affects retrieval quality more than any other single parameter. For English-only applications, OpenAI text-embedding-3-small is a strong default — it's cheap ($0.02/1M tokens), has excellent quality for semantic similarity tasks, and integrates seamlessly with the OpenAI ecosystem. If you're already paying for an OpenAI API subscription, there's almost no reason to use a different provider for embeddings.

For multilingual applications, Cohere's embed-v4 and Voyage AI's voyage-multilingual-2 have stronger cross-language retrieval performance than OpenAI's multilingual embeddings. If users will search in their native language and your content is in English (or vice versa), test multilingual embedding models before committing.

Dimension count (768, 1024, 1536, 3072) directly affects vector storage cost and retrieval speed. Higher dimensions capture more semantic nuance but cost more to store and query. For most knowledge base RAG applications, 1024 dimensions is the sweet spot — enough quality for complex queries, not so many that storage costs compound. OpenAI's text-embedding-3-small at 1536 dimensions can be truncated to 512 or 1024 via the dimensions parameter, giving you cost control without switching models.
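If you truncate client-side rather than via the dimensions parameter, re-normalize the truncated vector so cosine similarity stays meaningful. A minimal sketch (the helper name is ours, not part of any SDK):

```typescript
// Truncate a Matryoshka-style embedding (e.g. text-embedding-3-*) to fewer
// dimensions, then re-normalize to unit length so cosine similarity still works.
function truncateEmbedding(vector: number[], dims: number): number[] {
  const truncated = vector.slice(0, dims);
  const norm = Math.sqrt(truncated.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? truncated : truncated.map(x => x / norm);
}
```

Server-side truncation via the dimensions parameter is preferable when available, since it also reduces the payload you transfer and store.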

Hybrid search: Pure vector search misses exact keyword matches that humans expect. Searching for "GPT-4o" should match documents containing that exact string, not just semantically similar content. Weaviate and Qdrant both support hybrid search (vector + BM25 keyword) natively. Pinecone requires building keyword search separately. For knowledge base and documentation search, hybrid search improves retrieval recall by 15-30% over pure vector search on typical enterprise queries.
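Weaviate and Qdrant fuse the two result lists for you; if your store only does vector search, you can merge a vector ranking with a separate BM25 ranking yourself. A sketch using Reciprocal Rank Fusion, a common provider-agnostic merge (function name and the k default are illustrative):

```typescript
// Merge ranked result lists (e.g. one from vector search, one from BM25) with
// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank).
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Documents that appear near the top of both lists win, which is exactly the behavior you want for queries like "GPT-4o" that need both keyword and semantic matches.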

Layer 4: AI Gateway

Route requests across multiple providers with fallback, caching, and cost tracking:

// Using LiteLLM as the AI gateway. Note: LiteLLM is primarily a Python library
// and proxy server; the TypeScript-style call below is illustrative — check
// the current SDK/proxy docs for the exact interface.
import { completion } from 'litellm';

// Same interface, any provider
const response = await completion({
  model: 'anthropic/claude-sonnet-4-20250514', // or 'gpt-4o', 'groq/llama-3.3-70b'
  messages: [{ role: 'user', content: 'Hello' }],
  // Automatic fallback
  fallbacks: ['gpt-4o', 'groq/llama-3.3-70b-versatile'],
});
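If you'd rather not take on a gateway dependency yet, a thin wrapper gives you the same ordered-fallback behavior. A sketch, assuming each provider SDK is wrapped in a simple prompt-to-string function (all names here are illustrative):

```typescript
// Provider-agnostic chat call with ordered fallback: try each provider in
// turn and return the first success. Providers are injected, so any SDK fits.
type ChatFn = (prompt: string) => Promise<string>;

async function chatWithFallback(providers: ChatFn[], prompt: string): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider(prompt);
    } catch (err) {
      lastError = err; // log, then try the next provider in the list
    }
  }
  throw lastError ?? new Error('no providers configured');
}
```

In production you would also distinguish retryable errors (429, 5xx) from permanent ones (invalid request) before falling through to the next provider.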

Layer 5: Monitoring

Tool         What It Tracks                 Pricing
Helicone     Requests, cost, latency        Free tier available
Langfuse     Traces, evaluations, prompts   Free tier, open-source
Braintrust   Evals, experiments, logging    Free tier available
LangSmith    Traces, testing, monitoring    Free tier (LangChain)

Recommended Stacks by Use Case

Chatbot / Customer Support

LLM:        Anthropic Claude Sonnet (safety, instruction-following)
Embeddings: OpenAI text-embedding-3-small (cheap, good quality)
Vector DB:  Pinecone (managed, zero ops)
Monitor:    Helicone (cost tracking)

RAG / Knowledge Base

LLM:        Anthropic Claude Sonnet (long context, citations)
Embeddings: Cohere embed-v4 (best retrieval quality)
Vector DB:  Weaviate (hybrid search: vector + keyword)
Processing: LlamaParse (PDF extraction)
Monitor:    Langfuse (trace retrieval quality)

Code Generation

LLM:        Anthropic Claude Sonnet (best coding benchmarks)
Gateway:    LiteLLM (fallback to GPT-4o if needed)
Monitor:    Braintrust (eval code quality)

Cost-Optimized

LLM:        Groq / Together AI (Llama 3.3 70B)
Embeddings: Google text-embedding-004 (free tier)
Vector DB:  pgvector (free with existing Postgres)
Gateway:    LiteLLM (route to cheapest available)

Common Mistakes

Mistake                      Impact                              Fix
Using GPT-4 for everything   10x overspending                    Route simple tasks to cheaper/smaller models
No cost monitoring           Surprise bills                      Add Helicone or Langfuse from day 1
Embedding entire documents   Poor retrieval quality              Chunk documents (300-500 tokens per chunk)
No fallback provider         Outage = app down                   AI gateway with automatic failover
Skipping guardrails          Harmful outputs, prompt injection   Add input/output validation
Not evaluating quality       Can't tell if changes help          Set up automated evals

Layer 6: Guardrails and Safety

Guardrails are the most neglected layer in AI app stacks. Without them, you're one clever prompt away from your app generating harmful content, leaking confidential system prompt instructions, or being used as a proxy for things you never intended.

Input validation: Screen user inputs before sending to the LLM. Block prompt injection attempts ("ignore previous instructions"), detect attempts to exfiltrate your system prompt, and filter inputs that contain personal data you shouldn't send to a third-party API (credit card numbers, SSNs). Guardrails AI and NeMo Guardrails provide pre-built validators; for most apps, a lightweight custom check on input length and content is sufficient.
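A lightweight custom check along those lines might look like this (the patterns are illustrative starting points, not a complete defense):

```typescript
// Screen user input before it reaches the LLM. Real apps should layer a
// moderation API on top of simple pattern checks like these.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /reveal (your )?system prompt/i,
];
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/;

function validateInput(input: string): { ok: boolean; reason?: string } {
  if (input.length > 8000) return { ok: false, reason: 'input too long' };
  if (INJECTION_PATTERNS.some(p => p.test(input))) {
    return { ok: false, reason: 'possible prompt injection' };
  }
  if (SSN_PATTERN.test(input)) return { ok: false, reason: 'contains PII' };
  return { ok: true };
}
```

Pattern lists like this are easy to evade, which is why they work best as a cheap first filter in front of a proper moderation model rather than as the only line of defense.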

Output validation: Validate that the LLM response matches your expected format and content policies. For structured outputs (JSON schemas, code), parse and validate before returning to the user. For text outputs, run through a content moderation check if your app serves untrusted users. Anthropic's Claude has strong built-in safety, but the responsibility for preventing harmful use ultimately rests with your app's design.

System prompt protection: Never assume your system prompt is secret. Users can often extract system prompts through carefully crafted inputs. Design your app to work even if the system prompt is known — use system prompts for behavior shaping, not for storing secrets or enforcing security boundaries. Security-critical logic belongs in your application code, not in a system prompt.

Rate limiting per user: AI apps are expensive to operate. Without per-user rate limits, a single bad actor can exhaust your API budget. Track usage by user ID and implement token-level rate limits (e.g., 100K tokens/day per free user). Track cost attribution to each user so you can identify abusive patterns before they become billing surprises.
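An in-memory sketch of a per-user daily token budget (a production system would back this with Redis or a database; all names here are ours):

```typescript
// Track token usage per user per day and reject requests over the daily limit.
class TokenBudget {
  private usage = new Map<string, { day: string; tokens: number }>();

  constructor(private dailyLimit: number) {}

  // Returns true if the spend was accepted, false if it would exceed the limit.
  record(userId: string, tokens: number, day: string): boolean {
    const entry = this.usage.get(userId);
    const current = entry && entry.day === day ? entry.tokens : 0; // resets daily
    if (current + tokens > this.dailyLimit) return false;
    this.usage.set(userId, { day, tokens: current + tokens });
    return true;
  }
}
```

Passing the day in explicitly (rather than reading the clock inside) keeps the class trivially testable and avoids timezone surprises.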

Building an Evaluation Pipeline

AI app quality is hard to measure with traditional testing — the "correctness" of a language model response is often subjective. But without measurement, you can't know if a prompt change improved or degraded quality.

Automated evals: Define a test set of (input, expected_behavior) pairs. Expected behavior doesn't have to be exact match — it can be "response contains the correct answer", "response doesn't mention competitor", "response is in valid JSON format". Run these evals on every prompt change. Braintrust and Langfuse both have eval frameworks; for simpler setups, a custom eval script that runs 50 test cases and reports pass rates is sufficient.
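A custom eval script of that shape can be very small. A sketch (the interfaces and names are ours):

```typescript
// Minimal eval harness: run each case through the app, grade the output with
// a predicate ("contains the right answer", "valid JSON", ...), report pass rate.
interface EvalCase {
  input: string;
  check: (output: string) => boolean;
}

async function runEvals(
  app: (input: string) => Promise<string>,
  cases: EvalCase[],
): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const output = await app(c.input);
    if (c.check(output)) passed++;
  }
  return passed / cases.length; // pass rate, e.g. 0.92
}
```

Run this on every prompt change and fail the build if the pass rate drops below a threshold; that alone catches most prompt regressions before users do.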

LLM-as-judge: Use a capable LLM (Claude Opus) to grade the output of your production LLM (Claude Sonnet). Define a rubric: accuracy (0-5), helpfulness (0-5), format compliance (pass/fail). This scales to evaluating thousands of responses automatically. The cost of judge evals (a few cents per graded response) is small compared to the cost of shipping quality regressions.
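The judge needs a rubric prompt and a parseable output format. A sketch of a prompt builder (the rubric fields mirror the ones above; the exact wording is illustrative):

```typescript
// Build a judge prompt from a fixed rubric; the judge model is asked to
// return machine-parseable JSON scores.
function buildJudgePrompt(question: string, answer: string): string {
  return [
    'Grade the ANSWER to the QUESTION on this rubric.',
    'Return only JSON: {"accuracy": 0-5, "helpfulness": 0-5, "format_ok": true|false}',
    `QUESTION: ${question}`,
    `ANSWER: ${answer}`,
  ].join('\n\n');
}
```

Parse the judge's JSON defensively (models occasionally wrap it in prose) and store the scores alongside the trace so regressions are visible over time.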

A/B testing prompts: For significant prompt changes, route a small percentage of traffic (5-10%) to the new prompt and measure user satisfaction signals (explicit feedback, retry rate, session length). This is more expensive than offline evals but catches the cases where evals don't reflect real user behavior.

Version control for prompts: Store prompts in your database or a prompt management tool (Langfuse, Braintrust both have prompt registries), not hardcoded in source code. This enables: A/B testing without a deployment, rollback when a prompt change degrades quality, audit trail of what prompt each response was generated with, and collaboration between engineers and non-technical prompt engineers on your team.

Latency budgeting across the stack: A complete RAG pipeline has cumulative latency. A typical breakdown: embedding the query (50ms), vector search (20-100ms), LLM inference with context (500-2000ms), response streaming to client (begins before inference completes). The LLM is the dominant latency factor. To improve perceived latency, start streaming the response token-by-token as soon as the LLM begins generating, rather than waiting for the full response. All major LLMs support streaming; implement it from day one rather than retrofitting later. For async use cases (batch summaries, background analysis), latency is less important than throughput — use the Batches API or parallel requests to maximize utilization.
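The client-side half of streaming is just accumulating deltas while rendering each one immediately; most SDKs (OpenAI, Anthropic) expose the stream as an async iterable of text deltas. A provider-agnostic sketch:

```typescript
// Consume a stream of token deltas: invoke a render callback per chunk for
// perceived latency, and return the fully accumulated response at the end.
async function streamToClient(
  deltas: AsyncIterable<string>,
  onChunk: (text: string) => void,
): Promise<string> {
  let full = '';
  for await (const delta of deltas) {
    full += delta;
    onChunk(delta); // render immediately; don't wait for the full response
  }
  return full;
}
```

The same function works for logging and monitoring: the accumulated string is what you send to your observability tool, while the per-chunk callback feeds the UI.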

Vendor lock-in mitigation: The AI API landscape changes fast. Build an abstraction layer between your application and LLM provider calls early. LiteLLM provides this at the SDK level. Your own thin wrapper (an interface with chat(messages) and embed(text)) takes 2 hours to build and makes future provider switches a configuration change instead of a codebase refactor.
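The wrapper can be as small as one interface plus a registry; a sketch (all names are illustrative):

```typescript
// Thin provider abstraction: app code depends only on this interface, and
// provider wiring becomes a configuration choice rather than a code change.
interface LLMClient {
  chat(messages: { role: string; content: string }[]): Promise<string>;
  embed(texts: string[]): Promise<number[][]>;
}

function makeClient(providers: Record<string, LLMClient>, name: string): LLMClient {
  const client = providers[name];
  if (!client) throw new Error(`unknown provider: ${name}`);
  return client;
}
```

Each concrete implementation (Anthropic, OpenAI, a local vLLM endpoint) lives behind the same interface, so switching providers means changing the `name` in config, not refactoring call sites.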

Methodology

Pricing data for LLMs, embeddings, and vector databases is sourced from each provider's public pricing pages as of early 2026. LLM pricing is particularly volatile — Gemini 2.0 Flash and Groq pricing have changed multiple times in the past year as competition intensifies; always verify current pricing before architecture decisions.

Model performance rankings (coding, reasoning) are based on public benchmarks (HumanEval, MMLU, SWE-bench) and the Artificial Analysis leaderboard (artificialanalysis.ai), which provides independent third-party testing.

The chunking parameters (300-500 tokens, 50 token overlap) are commonly cited starting points; optimal chunk size depends on your document structure and retrieval use case — markdown documents with natural section breaks benefit from section-level chunking, while dense PDFs work better with fixed-size chunks with overlap. Experiment with chunk sizes of 256, 512, and 1024 tokens against your actual queries and measure retrieval recall before settling on a value.

The 15-30% hybrid search improvement figure is drawn from RAG evaluation benchmarks published by Weaviate and Cohere; results vary significantly by domain and query type.


Compare AI APIs across every layer of the stack on APIScout — LLMs, embeddings, vector databases, and monitoring tools side by side.

Related: Building an AI Agent in 2026, How to Build a Multi-Provider AI App, Building an API Marketplace
