
Anthropic Claude API: Developer Guide 2026

APIScout Team

TL;DR

Claude Sonnet 4.6 is the best model for most production use cases: top-tier coding, excellent instruction following, and strong value at $3/$15 per 1M tokens. Use Haiku 4.5 for high-volume simple tasks ($1/$5 per 1M tokens), Sonnet 4.6 for nearly everything else ($3/$15), and Opus 4.6 for the most demanding reasoning, agentic workflows, and complex coding ($5/$25). The API is roughly OpenAI-compatible, but key differences in tool use, content blocks, and prompt caching make the switch non-trivial. Here's everything you need.

Key Takeaways

  • Model lineup: Haiku 4.5 (fast/cheap) → Sonnet 4.6 (best value) → Opus 4.6 (most capable)
  • Prompt caching: up to 90% cost reduction on repeated context — killer feature for RAG and chatbots
  • Adaptive thinking: Claude dynamically decides when and how deeply to reason, dramatically improves complex tasks
  • Tool use: stop_reason: "tool_use" pattern, mixed text+tool content blocks in same response
  • Vision: images in image content blocks, supports base64 and URLs
  • Context window: 200K tokens on all models (1M-token beta on Sonnet and Opus), among the longest available in production

Models and Pricing (2026)

Model | Input $/1M | Output $/1M | Context | Best For
claude-haiku-4-5 | $1.00 | $5.00 | 200K | High-volume, fast tasks
claude-sonnet-4-6 | $3.00 | $15.00 | 200K (1M beta) | Most production use cases
claude-opus-4-6 | $5.00 | $25.00 | 200K (1M beta) | Agentic workflows, complex reasoning

Recommendation: Default to claude-sonnet-4-6 for most production use cases. Use claude-opus-4-6 for agentic coding and complex reasoning tasks.
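That tiering can be encoded as a simple router in application code. A minimal sketch; the `TaskTier` names are an application-level convention, not an Anthropic API concept:

```typescript
// Map a rough task tier to a model ID. The tiers are this app's own
// convention for routing requests, not part of the Anthropic API.
type TaskTier = 'simple' | 'standard' | 'complex';

const MODEL_BY_TIER: Record<TaskTier, string> = {
  simple: 'claude-haiku-4-5',     // classification, routing, extraction
  standard: 'claude-sonnet-4-6',  // default for most production traffic
  complex: 'claude-opus-4-6',     // agentic workflows, hard reasoning
};

function pickModel(tier: TaskTier): string {
  return MODEL_BY_TIER[tier];
}
```

Defaulting the router to `'standard'` when a tier is unknown keeps costs predictable while you tune the classification.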


Basic Setup

// npm install @anthropic-ai/sdk
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Basic completion:
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: 'Explain async/await in 3 sentences.' }
  ],
});

const [firstBlock] = message.content;
if (firstBlock.type === 'text') console.log(firstBlock.text);

// With system prompt:
const systemMessage = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 2048,
  system: `You are an expert TypeScript engineer. Be concise and precise.
Always include types in code examples. Use const over let.`,
  messages: [
    { role: 'user', content: 'Write a retry wrapper for async functions.' }
  ],
});

Streaming

// Streaming with async iterator:
const stream = await anthropic.messages.stream({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Write a haiku about TypeScript.' }],
});

// Stream text chunks:
for await (const text of stream.textStream) {
  process.stdout.write(text);
}

// Or get the final message after streaming:
const finalMessage = await stream.getFinalMessage();
console.log(finalMessage.usage);  // e.g. { input_tokens: 12, output_tokens: 42 }

// Server-Sent Events for Next.js App Router:
export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await anthropic.messages.stream({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages,
  });

  // Convert to ReadableStream for Response:
  return new Response(
    new ReadableStream({
      async start(controller) {
        const encoder = new TextEncoder();
        for await (const text of stream.textStream) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      },
    }),
    { headers: { 'Content-Type': 'text/event-stream' } }
  );
}
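On the client side, the route above can be consumed with fetch plus a small parser for the `data:` frames. A sketch; the `/api/chat` path is an assumption to match a typical App Router layout:

```typescript
// Parse one SSE line emitted by the route handler. Returns the text chunk,
// null for non-data lines, or 'DONE' for the terminator frame.
function parseSseLine(line: string): string | null {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return 'DONE';
  return (JSON.parse(payload) as { text: string }).text;
}

// Consume the stream in the browser (the endpoint path is hypothetical):
async function readChatStream(messages: unknown[]): Promise<string> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let full = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';  // keep any partial line for the next chunk
    for (const line of lines) {
      const text = parseSseLine(line);
      if (text && text !== 'DONE') full += text;
    }
  }
  return full;
}
```

Buffering the partial trailing line matters: network chunks do not align with SSE frame boundaries.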

Prompt Caching: 90% Cost Reduction

Prompt caching is Anthropic's biggest cost-optimization feature. If you're sending the same long system prompt or context repeatedly, mark it for caching and pay 90% less on subsequent requests.

// Without caching: pay full price for system prompt on every request
// With caching: pay once (write), then ~10% on subsequent reads

const systemPrompt = `You are an expert software architect with 20 years of experience.
[...imagine 10,000 tokens of detailed instructions, examples, and context...]
`;

// Mark the system prompt for caching:
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' },  // ← Enable caching
    },
  ],
  messages: [
    { role: 'user', content: 'Review this PR description...' }
  ],
});

// First request: full price (cache write)
// Subsequent requests within 5 minutes: 90% cheaper (cache read)
// Cache write: 25% premium over base input price
// Cache read: ~90% discount vs uncached input price

// Cache a large context (like documentation or a codebase):
import fs from 'fs';

const docsContext = fs.readFileSync('docs/api-reference.md', 'utf-8');

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: 'You are a helpful developer support assistant.',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: `Here is our API documentation:\n\n${docsContext}`,
          cache_control: { type: 'ephemeral' },  // Cache the long context
        },
        {
          type: 'text',
          text: userQuestion,  // Not cached — changes each request
        },
      ],
    },
  ],
});

// Check cache status:
console.log(response.usage);
// { cache_creation_input_tokens: 15000, cache_read_input_tokens: 0, input_tokens: 50 }
// → First request: writing cache
// On second request: cache_read_input_tokens: 15000 (90% cheaper)

Rules for caching to work:

  • Minimum 1024 tokens to cache
  • Content must be identical across requests (any change = new cache write)
  • Cache TTL: 5 minutes for ephemeral type
  • Cache breakpoint: cache_control marks the end of the cached prefix; place it on the last stable block, before any varying content
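Whether caching pays off depends on how many times the prefix is reused within the TTL, because of the 25% write premium. A back-of-envelope estimator under the rates quoted above (Sonnet 4.6 input assumed at $3/1M):

```typescript
// Compare cached vs uncached input cost (in dollars) for a stable prefix
// reused across `requests` calls. Rates are per million tokens; defaults
// assume Sonnet 4.6 input at $3/1M, a 25% cache-write premium, and a
// 90% cache-read discount.
function cachedInputCost(prefixTokens: number, requests: number, baseRate = 3): number {
  const write = (prefixTokens / 1_000_000) * baseRate * 1.25;            // first request
  const reads = (prefixTokens / 1_000_000) * baseRate * 0.1 * (requests - 1);
  return write + reads;
}

function uncachedInputCost(prefixTokens: number, requests: number, baseRate = 3): number {
  return (prefixTokens / 1_000_000) * baseRate * requests;
}
```

With a 10K-token prefix reused 100 times, the cached path costs a small fraction of the uncached one; at only 2 requests it is roughly 30% cheaper, because the write premium erodes the benefit.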

Adaptive Thinking

Adaptive thinking enables Claude to dynamically decide when and how deeply to reason before producing its final response — dramatically improving accuracy on math, coding, and complex analysis. On Claude 4.6 models, adaptive thinking is the recommended approach (the older budget_tokens parameter is deprecated).

// Enable adaptive thinking (recommended for Opus 4.6 and Sonnet 4.6):
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 16000,
  thinking: {
    type: 'adaptive',  // Claude decides when and how much to think
  },
  messages: [
    {
      role: 'user',
      content: `A train leaves Station A at 60 mph heading to Station B,
        250 miles away. 30 minutes later, another train leaves Station B
        at 80 mph heading to Station A. When and where do they meet?`,
    },
  ],
});

// Response contains both thinking blocks and the final answer:
for (const block of response.content) {
  if (block.type === 'thinking') {
    console.log('Thinking:', block.thinking);  // Claude's reasoning process
  } else if (block.type === 'text') {
    console.log('Answer:', block.text);        // Final answer to user
  }
}
// Control thinking depth with the effort parameter:
const response = await anthropic.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 16000,
  thinking: { type: 'adaptive' },
  output_config: { effort: 'high' },  // low | medium | high | max (Opus only)
  messages: [{ role: 'user', content: complexCodingTask }],
});

// Streaming with adaptive thinking:
const stream = await anthropic.messages.stream({
  model: 'claude-sonnet-4-6',
  max_tokens: 64000,
  thinking: { type: 'adaptive' },
  messages: [{ role: 'user', content: complexCodingTask }],
});

for await (const event of stream) {
  if (event.type === 'content_block_delta') {
    if (event.delta.type === 'thinking_delta') {
      process.stdout.write(event.delta.thinking);
    } else if (event.delta.type === 'text_delta') {
      process.stdout.write(event.delta.text);
    }
  }
}

When to use adaptive thinking:

  • Math and logic problems
  • Complex code generation or debugging
  • Multi-step reasoning tasks
  • Analysis requiring several considerations

Cost note: thinking tokens are billed at output rates. Use the effort parameter to control the cost-quality tradeoff.


Tool Use (Function Calling)

const tools: Anthropic.Messages.Tool[] = [
  {
    name: 'search_web',
    description: 'Search the web for current information',
    input_schema: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'Search query' },
        max_results: { type: 'number', description: 'Number of results', default: 5 },
      },
      required: ['query'],
    },
  },
  {
    name: 'run_code',
    description: 'Execute Python code and return the output',
    input_schema: {
      type: 'object',
      properties: {
        code: { type: 'string', description: 'Python code to execute' },
      },
      required: ['code'],
    },
  },
];

async function runAgentLoop(userMessage: string): Promise<string> {
  const messages: Anthropic.Messages.MessageParam[] = [
    { role: 'user', content: userMessage },
  ];

  while (true) {
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 4096,
      tools,
      messages,
    });

    // Stop if Claude finished without calling tools:
    if (response.stop_reason === 'end_turn') {
      return response.content
        .filter((b) => b.type === 'text')
        .map((b) => (b as Anthropic.Messages.TextBlock).text)
        .join('');
    }

    // Handle tool calls:
    if (response.stop_reason === 'tool_use') {
      // Add Claude's response (may include text + tool_use blocks):
      messages.push({ role: 'assistant', content: response.content });

      // Execute each tool:
      const toolResults: Anthropic.Messages.ToolResultBlockParam[] = [];
      for (const block of response.content) {
        if (block.type !== 'tool_use') continue;

        let result: unknown;
        try {
          result = await executeTool(block.name, block.input as Record<string, unknown>);
        } catch (err) {
          result = `Error: ${err instanceof Error ? err.message : 'Unknown error'}`;
        }

        toolResults.push({
          type: 'tool_result',
          tool_use_id: block.id,
          content: JSON.stringify(result),
          // is_error: true,  // Uncomment for error results
        });
      }

      messages.push({ role: 'user', content: toolResults });
      continue;
    }

    // Any other stop reason (max_tokens, stop_sequence): bail out instead of looping
    throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
  }
}
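The loop above calls executeTool, which the snippet leaves undefined. One way to sketch it is a registry of handlers keyed by tool name; the handler bodies here are stubs standing in for a real search backend and a sandboxed runner:

```typescript
// Registry mapping tool names to handlers. Handlers receive the parsed
// input object from the tool_use block and return a JSON-serializable result.
type ToolHandler = (input: Record<string, unknown>) => Promise<unknown>;

const toolHandlers: Record<string, ToolHandler> = {
  // Stub: a real implementation would call a search service here.
  search_web: async (input) => ({ query: input.query, results: [] }),
  // Stub: a real implementation would execute the code in a sandbox.
  run_code: async (input) => ({ code: input.code, stdout: '' }),
};

async function executeTool(name: string, input: Record<string, unknown>): Promise<unknown> {
  const handler = toolHandlers[name];
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  return handler(input);
}
```

Throwing on an unknown name is deliberate: the agent loop catches it and feeds the error string back to the model as a tool_result.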

Vision: Analyzing Images

// Image from URL:
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'url',
            url: 'https://example.com/screenshot.png',
          },
        },
        {
          type: 'text',
          text: 'What UI issues do you see in this screenshot? Be specific.',
        },
      ],
    },
  ],
});
// Image from file (base64):
import fs from 'fs';

const imageBuffer = fs.readFileSync('diagram.png');
const base64Image = imageBuffer.toString('base64');

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 2048,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'base64',
            media_type: 'image/png',  // 'image/jpeg', 'image/gif', 'image/webp'
            data: base64Image,
          },
        },
        { type: 'text', text: 'Explain this architecture diagram.' },
      ],
    },
  ],
});
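When reading images from disk, media_type has to match the actual format. A small helper mapping file extensions to the four supported types; it trusts the extension, which is an assumption (sniffing magic bytes is more robust):

```typescript
// Map a filename extension to one of the image media types Claude accepts.
const MEDIA_TYPES: Record<string, string> = {
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  gif: 'image/gif',
  webp: 'image/webp',
};

function imageMediaType(filename: string): string {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  const mediaType = MEDIA_TYPES[ext];
  if (!mediaType) throw new Error(`Unsupported image type: .${ext}`);
  return mediaType;
}
```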

Message History (Multi-turn Conversations)

// Stateless pattern — manage history yourself:
const conversationHistory: Anthropic.Messages.MessageParam[] = [];

async function chat(userMessage: string): Promise<string> {
  // Add user message:
  conversationHistory.push({ role: 'user', content: userMessage });

  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant.',
    messages: conversationHistory,
  });

  const assistantMessage = response.content
    .filter((b) => b.type === 'text')
    .map((b) => (b as Anthropic.Messages.TextBlock).text)
    .join('');

  // Add assistant response to history:
  conversationHistory.push({ role: 'assistant', content: response.content });

  return assistantMessage;
}

Cost Optimization Checklist

1. Model selection:
   → Haiku 4.5 for classification, simple extraction, routing ($1/$5 per 1M)
   → Sonnet 4.6 for most production tasks — best value ($3/$15 per 1M)
   → Opus 4.6 for agentic workflows and complex reasoning ($5/$25 per 1M)

2. Prompt caching:
   → Cache system prompts >1024 tokens
   → Cache large context (docs, codebase, examples)
   → Saves 90% on input tokens for cached content

3. Token budgeting:
   → Set max_tokens to ~16000 for non-streaming, ~64000 for streaming
   → Use streaming to stop early when you have enough output
   → Use the effort parameter (low/medium/high/max; Opus only) to control thinking costs

4. Batching:
   → Use Message Batches API for offline/async workloads
   → 50% cost reduction, up to 24hr processing window
   → Great for: bulk classification, data extraction, eval runs
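A batch submission is an array of request objects, each carrying a custom_id for joining results back to your records. A sketch of building that payload for a bulk classification run; the labels and prompt are illustrative:

```typescript
// Build a Batches API payload for bulk classification. Each item becomes
// one request; custom_id lets you match results back to your records.
function buildBatchRequests(items: { id: string; text: string }[]) {
  return items.map((item) => ({
    custom_id: item.id,
    params: {
      model: 'claude-haiku-4-5',  // cheap model for high-volume classification
      max_tokens: 16,
      messages: [
        {
          role: 'user' as const,
          content: `Classify as "spam" or "ham". Reply with one word.\n\n${item.text}`,
        },
      ],
    },
  }));
}

// Submitting (API call, shown for shape only):
// const batch = await anthropic.messages.batches.create({
//   requests: buildBatchRequests(items),
// });
```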

Claude vs. OpenAI API: Key Differences

If you're migrating from OpenAI or building on both APIs, several Claude API patterns differ in ways that will break direct ports.

Content blocks vs. string content: OpenAI's choices[0].message.content is a string. Claude's content is an array of typed blocks (TextBlock, ToolUseBlock, ThinkingBlock). Always check block.type before accessing block-specific fields. When you need just the text, filter for type === 'text' blocks and join.

Tool response format: OpenAI tool responses go in messages with role: "tool". Claude tool results go in a user message with content type tool_result. The structure is similar but the field names and nesting differ enough to require careful porting.

System prompt placement: OpenAI takes system as a message with role: "system" inside the messages array. Claude takes system as a top-level parameter on the request. This matters when you're building conversation history — Claude's system prompt is separate from the conversation messages.

Stop reasons: OpenAI uses finish_reason: "stop" or "tool_calls". Claude uses stop_reason: "end_turn" or "tool_use". Update your checks accordingly. Claude also returns "max_tokens" (truncated) and "stop_sequence" (hit a custom stop sequence) as stop reasons.

Pricing model: Both APIs price per token, but Anthropic's prompt caching changes the math for repeated context: cached input reads cost about $0.30/1M for Sonnet 4.6 versus $3/1M uncached. For workloads with large stable context (RAG, chatbots with long system prompts), prompt caching can make Anthropic significantly cheaper than OpenAI at equivalent quality levels.

Error Handling and Rate Limits

The Anthropic API uses standard HTTP status codes with structured error responses. The errors you'll encounter in production:

429 (Rate Limited): Anthropic's rate limits are per-organization and vary by tier. Free tier accounts have tight limits; production accounts should request a limit increase via the Anthropic console before launch. The SDK includes automatic retry with exponential backoff by default (maxRetries: 2 in the default configuration). For sustained high-volume workloads, use the Message Batches API instead of synchronous requests.
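Beyond the SDK's built-in retries, you may want your own jittered backoff when orchestrating many requests. A sketch; the full-jitter schedule is a common pattern, not an Anthropic recommendation:

```typescript
// Raise the SDK's built-in retry count at construction (default is 2):
// const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY, maxRetries: 5 });

// Exponential backoff with full jitter, capped. Pure, so it is easy to test.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);  // full jitter: uniform in [0, exp)
}
```

Full jitter spreads retries out so a burst of rate-limited clients does not hammer the API in lockstep.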

529 (Overloaded): Anthropic returns 529 when their systems are under high load. This is temporary — retry with backoff. The SDK handles this automatically under the same retry policy as 429s. If you see frequent 529s, it usually means your traffic spike coincided with high global demand on the API. Route non-time-sensitive requests to the Batch API during these periods.

400 (Invalid Request): Usually a malformed message structure: a missing role, the wrong content block format, or a max_tokens value above the model's maximum. Check that max_tokens is at most 8,192 for Haiku and 64,000 for Sonnet/Opus. When using adaptive thinking, leave at least 1,000 tokens of max_tokens headroom for the thinking budget.

Context window exceeded: For 200K context windows, you'll hit this if you accumulate too much conversation history or include large documents without truncation. Track usage.input_tokens in responses and truncate the middle of conversation history when you approach 180K tokens (keeping the system prompt and recent turns intact).
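The middle-truncation strategy can be sketched as follows. The chars/4 token estimate is a crude stand-in; production code should budget from the usage.input_tokens figures reported in responses:

```typescript
type Msg = { role: 'user' | 'assistant'; content: string };

// Crude token estimate (~4 chars per token). Use real usage figures in production.
const estimateTokens = (msgs: Msg[]) =>
  Math.ceil(msgs.reduce((sum, m) => sum + m.content.length, 0) / 4);

// Drop messages from the middle until under budget, keeping the first
// exchange (which often contains task setup) and the most recent turns.
function truncateMiddle(history: Msg[], maxTokens: number, keepHead = 2): Msg[] {
  const msgs = [...history];
  while (estimateTokens(msgs) > maxTokens && msgs.length > keepHead + 2) {
    msgs.splice(keepHead, 1);  // remove the oldest message after the head
  }
  return msgs;
}
```

The system prompt is untouched by this because Claude takes it as a separate top-level parameter, not a message.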

Production Hardening

Building a production Claude integration requires thinking beyond happy-path completions.

Content filtering: Claude's safety training means it will occasionally refuse requests that are legitimate for your use case, especially if your system prompt contains words that could trigger caution (security, exploit, hack, etc. in the context of security tooling). If you're building in a sensitive domain, test your prompts with Claude before launch and consider adding explicit permission context in the system prompt ("You are assisting authorized security researchers..."). Monitor for stop_reason: "content_filtered" in responses.

Structured output validation: When you need structured JSON from Claude, always validate the output against your expected schema before using it. Claude is excellent at producing valid JSON, but rare edge cases exist — truncated responses due to max_tokens limits, or unusual inputs that cause format drift. Use a library like Zod to validate the parsed JSON and have a fallback path for validation failures (retry, or return a default value).
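A minimal version of that validation with a hand-rolled type guard (a library like Zod replaces the guard with a declared schema); the Review shape is illustrative:

```typescript
// Expected shape of the model's JSON output (illustrative).
interface Review {
  rating: number;
  summary: string;
}

function isReview(value: unknown): value is Review {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.rating === 'number' && typeof v.summary === 'string';
}

// Parse and validate, with a null fallback instead of throwing on bad output.
function parseReview(raw: string): Review | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isReview(parsed) ? parsed : null;
  } catch {
    return null;  // truncated or non-JSON output
  }
}
```

The null return gives the caller one place to decide between retrying and falling back to a default.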

Model versioning: Anthropic periodically updates models without changing the model ID string. claude-sonnet-4-6 today may behave differently than claude-sonnet-4-6 in six months. For production systems where consistency is critical, use date-stamped model IDs (e.g., claude-sonnet-4-6-20251022) when they're available. For most applications, the ongoing improvements to the latest model version are desirable rather than a concern.

Observability: Track token usage per request (response.usage.input_tokens, response.usage.output_tokens, cache_read_input_tokens) and log it alongside your application metrics. Token usage correlates with cost and latency. A sudden spike in input_tokens means something in your prompt construction changed — often a bug where context is being appended instead of replaced. A spike in output_tokens means Claude is generating longer responses than expected, which may indicate a prompt change or model behavior shift. Set up alerts on both. For multi-turn conversations, track total conversation token count and log when it crosses 100K, 150K, and 180K so you can investigate before hitting the context limit.
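The threshold logging can be a pure check run after each turn. A sketch using the thresholds from the text:

```typescript
// Thresholds at which to log a conversation's cumulative token count.
const THRESHOLDS = [100_000, 150_000, 180_000];

// Return the thresholds newly crossed when the running total grows
// from prevTokens to newTokens.
function crossedThresholds(prevTokens: number, newTokens: number): number[] {
  return THRESHOLDS.filter((t) => prevTokens < t && newTokens >= t);
}

// After each response (illustrative wiring):
// total += response.usage.input_tokens + response.usage.output_tokens;
// for (const t of crossedThresholds(prevTotal, total)) {
//   console.warn(`Conversation crossed ${t} tokens`);
// }
```

Comparing against the previous total means each threshold fires exactly once per conversation.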

Testing: Use gpt-4o-mini or a similar small model as a cheap proxy during unit test development, then run integration tests against Claude directly before deploying changes to system prompts. Claude's behavior is consistent enough that if your integration tests pass, production usually behaves as expected — but always run at least 20-50 representative examples through the actual model before shipping prompt changes.

Methodology

Pricing data is sourced from Anthropic's pricing page (anthropic.com/pricing) as of early 2026. Prompt cache TTL (5 minutes for the ephemeral type) is documented in Anthropic's prompt caching guide. The 90% cost reduction figure for cache reads vs. uncached input is Anthropic's published rate. Adaptive thinking with the type: 'adaptive' field is the current API design for Claude 4.x models; the older extended_thinking with budget_tokens applies to Claude 3.7 and earlier. The output_config.effort parameter for controlling thinking depth is available on Opus 4.6; Sonnet 4.6 uses adaptive thinking without explicit effort control. All code examples use @anthropic-ai/sdk v0.30+. The caching comparison figures ($0.30/1M cached vs. $3/1M uncached input) use Sonnet 4.6 rates; check anthropic.com/pricing for current figures as prices change with model updates. Cache write costs (25% premium over base input) are documented in Anthropic's prompt caching guide and apply to the first request that populates a cache entry.


Compare all AI APIs including Anthropic at APIScout.


Related: MCP Server Security: Best Practices 2026, Anthropic MCP vs OpenAI Plugins vs Gemini Extensions, Anthropic vs Google Gemini
