How to Build an AI Chatbot with the Anthropic API (2026)
Claude is one of the most capable AI models for building chatbots — strong at reasoning, following instructions, and maintaining coherent conversations. This guide covers everything from basic chat to streaming, tool use, and production deployment.
What You'll Build
- Conversational chatbot with message history
- Streaming responses for real-time output
- System prompts for personality and behavior
- Tool use (function calling) for dynamic actions
- Production-ready error handling and rate limiting
Prerequisites: Node.js 18+, Anthropic API key (from console.anthropic.com).
1. Setup
Install the SDK
npm install @anthropic-ai/sdk
Initialize the Client
// lib/anthropic.ts
import Anthropic from '@anthropic-ai/sdk';
export const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
Environment Variables
# .env.local
ANTHROPIC_API_KEY=sk-ant-...
2. Basic Chat
Simple Message
const message = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
messages: [
{ role: 'user', content: 'What is an API?' }
],
});
// content is an array of typed blocks; narrow before reading .text
const firstBlock = message.content[0];
if (firstBlock.type === 'text') console.log(firstBlock.text);
With System Prompt
System prompts define your chatbot's personality, knowledge, and behavior:
const message = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
system: `You are a helpful API expert assistant. You help developers choose
the right APIs for their projects. Be concise, technical, and always include
code examples when relevant. If you don't know something, say so.`,
messages: [
{ role: 'user', content: 'Which email API should I use for transactional emails?' }
],
});
Conversation with History
Maintain context by sending the full conversation history:
const conversationHistory: Anthropic.MessageParam[] = [];
async function chat(userMessage: string) {
conversationHistory.push({
role: 'user',
content: userMessage,
});
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
system: 'You are a helpful assistant.',
messages: conversationHistory,
});
const firstBlock = response.content[0];
const assistantMessage = firstBlock.type === 'text' ? firstBlock.text : '';
conversationHistory.push({
role: 'assistant',
content: assistantMessage,
});
return assistantMessage;
}
// Usage
await chat('What is REST?');
await chat('How does it compare to GraphQL?'); // Knows context
await chat('Which should I use for my mobile app?'); // Remembers both
3. Streaming Responses
Streaming shows text as it's generated — essential for a good chatbot UX:
const stream = anthropic.messages.stream({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
messages: [
{ role: 'user', content: 'Explain API rate limiting' }
],
});
for await (const event of stream) {
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
process.stdout.write(event.delta.text);
}
}
Streaming API Route (Next.js)
// app/api/chat/route.ts
import { anthropic } from '@/lib/anthropic';
export async function POST(req: Request) {
const { messages } = await req.json();
const stream = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 2048,
stream: true,
system: 'You are a helpful API expert.',
messages,
});
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
for await (const event of stream) {
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
controller.enqueue(encoder.encode(event.delta.text));
}
}
controller.close();
},
});
return new Response(readable, {
headers: { 'Content-Type': 'text/plain; charset=utf-8' },
});
}
Streaming Client Component
// components/Chat.tsx
'use client';
import { useState, useRef } from 'react';
type Message = { role: 'user' | 'assistant'; content: string };
export function Chat() {
const [messages, setMessages] = useState<Message[]>([]);
const [input, setInput] = useState('');
const [isStreaming, setIsStreaming] = useState(false);
const handleSubmit = async (e: React.FormEvent) => {
e.preventDefault();
if (!input.trim() || isStreaming) return;
const userMessage: Message = { role: 'user', content: input };
const updatedMessages = [...messages, userMessage];
setMessages(updatedMessages);
setInput('');
setIsStreaming(true);
const res = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages: updatedMessages }),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let assistantContent = '';
setMessages([...updatedMessages, { role: 'assistant', content: '' }]);
while (true) {
const { done, value } = await reader.read();
if (done) break;
assistantContent += decoder.decode(value);
setMessages([
...updatedMessages,
{ role: 'assistant', content: assistantContent },
]);
}
setIsStreaming(false);
};
return (
<div>
<div className="messages">
{messages.map((m, i) => (
<div key={i} className={m.role}>
<strong>{m.role}:</strong> {m.content}
</div>
))}
</div>
<form onSubmit={handleSubmit}>
<input
value={input}
onChange={(e) => setInput(e.target.value)}
placeholder="Ask about APIs..."
disabled={isStreaming}
/>
<button type="submit" disabled={isStreaming}>Send</button>
</form>
</div>
);
}
4. Tool Use (Function Calling)
Give your chatbot the ability to perform actions — look up data, call APIs, execute functions:
const tools: Anthropic.Tool[] = [
{
name: 'search_apis',
description: 'Search the API directory for APIs matching a query',
input_schema: {
type: 'object',
properties: {
query: { type: 'string', description: 'Search query' },
category: {
type: 'string',
enum: ['ai', 'payments', 'email', 'auth', 'search'],
description: 'API category filter',
},
},
required: ['query'],
},
},
{
name: 'compare_apis',
description: 'Compare two APIs side by side',
input_schema: {
type: 'object',
properties: {
api_a: { type: 'string', description: 'First API name' },
api_b: { type: 'string', description: 'Second API name' },
},
required: ['api_a', 'api_b'],
},
},
];
// Handle tool use in conversation
async function chatWithTools(userMessage: string) {
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
tools,
messages: [{ role: 'user', content: userMessage }],
});
// Check if Claude wants to use a tool
for (const block of response.content) {
if (block.type === 'tool_use') {
// executeTool is your own dispatcher mapping tool names to implementations
const toolResult = await executeTool(block.name, block.input as Record<string, unknown>);
// Send tool result back to Claude
const followUp = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
tools,
messages: [
{ role: 'user', content: userMessage },
{ role: 'assistant', content: response.content },
{
role: 'user',
content: [{
type: 'tool_result',
tool_use_id: block.id,
content: JSON.stringify(toolResult),
}],
},
],
});
return followUp;
}
}
return response;
}
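The `executeTool` helper above is left undefined; here is one minimal sketch of a dispatcher. The handler bodies are placeholders, not a real directory lookup:

```typescript
type ToolHandler = (input: Record<string, unknown>) => Promise<unknown>;

const toolHandlers: Record<string, ToolHandler> = {
  search_apis: async (input) => {
    const { query } = input as { query: string };
    // Placeholder: a real handler would query your API directory here
    return { results: [`stub result for "${query}"`] };
  },
  compare_apis: async (input) => {
    const { api_a, api_b } = input as { api_a: string; api_b: string };
    return { comparison: `${api_a} vs ${api_b}` };
  },
};

async function executeTool(name: string, input: Record<string, unknown>) {
  const handler = toolHandlers[name];
  if (!handler) {
    // Return an error payload instead of throwing, so it can be sent
    // back to Claude as a tool_result and the model can recover
    return { error: `Unknown tool: ${name}` };
  }
  return handler(input);
}
```

Returning an error object for unknown tools (rather than throwing) keeps the conversation loop alive: the model sees the failure in the tool_result and can adjust.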
5. Extended Thinking
For complex questions, enable extended thinking to let Claude reason before responding:
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 16000,
thinking: {
type: 'enabled',
budget_tokens: 10000,
},
messages: [
{
role: 'user',
content: 'Design an API architecture for a multi-tenant SaaS platform with real-time features',
},
],
});
// Response includes thinking blocks + text blocks
for (const block of response.content) {
if (block.type === 'thinking') {
console.log('Thinking:', block.thinking);
} else if (block.type === 'text') {
console.log('Response:', block.text);
}
}
6. Model Selection
| Model | Best For | Speed | Cost |
|---|---|---|---|
| claude-sonnet-4-20250514 | General chatbot, balanced quality/speed | Fast | Medium |
| claude-opus-4-20250514 | Complex reasoning, high-stakes responses | Slower | Higher |
| claude-3-5-haiku-20241022 | High-volume, simple queries | Fastest | Lowest |
For chatbots: Start with Sonnet. Use Haiku for high-volume, simple interactions. Escalate to Opus for complex queries.
7. Production Best Practices
Rate Limiting
// Implement per-user rate limiting
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(),
limiter: Ratelimit.slidingWindow(20, '1 m'), // 20 messages per minute
});
// In your API route (NextResponse comes from 'next/server')
const { success } = await ratelimit.limit(userId);
if (!success) {
return NextResponse.json({ error: 'Rate limited' }, { status: 429 });
}
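For local development without Upstash, a minimal in-memory sliding window gives the same `{ success }` shape. This is single-process only (state is lost on restart), so keep a shared store like Redis for production:

```typescript
// Minimal in-memory sliding-window limiter: allows `limit` calls per
// `windowMs` per key. The `now` parameter exists so tests can be deterministic.
const hits = new Map<string, number[]>();

function rateLimit(key: string, limit = 20, windowMs = 60_000, now = Date.now()) {
  const cutoff = now - windowMs;
  // Drop timestamps that have fallen out of the window
  const recent = (hits.get(key) ?? []).filter((t) => t > cutoff);
  if (recent.length >= limit) {
    hits.set(key, recent);
    return { success: false };
  }
  recent.push(now);
  hits.set(key, recent);
  return { success: true };
}
```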
Error Handling
try {
const response = await anthropic.messages.create({ ... });
} catch (error) {
if (error instanceof Anthropic.RateLimitError) {
// Wait and retry
} else if (error instanceof Anthropic.APIError) {
// Log and return user-friendly error
}
}
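A generic wrapper for the "wait and retry" branch. This is a sketch; tune the delays and retry counts against your actual rate limits:

```typescript
// Retry an async call with exponential backoff. `isRetryable` decides
// which errors are worth retrying (e.g. 429s and 5xx, not 400s).
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { retries?: number; baseDelayMs?: number; isRetryable?: (e: unknown) => boolean } = {}
): Promise<T> {
  const { retries = 3, baseDelayMs = 500, isRetryable = () => true } = opts;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e;
      if (attempt === retries || !isRetryable(e)) throw e;
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Usage: `withRetry(() => anthropic.messages.create({...}), { isRetryable: (e) => e instanceof Anthropic.RateLimitError })`.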
Context Window Management
Claude has a large context window but costs increase with token count. Manage conversation length:
// Rough heuristic: ~4 characters per token for English text
// (Message = { role: 'user' | 'assistant'; content: string } as defined earlier)
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function trimConversation(messages: Message[], maxTokens: number = 50000) {
  if (estimateTokens(messages) > maxTokens) {
    // Keep the opening message for framing plus the last 10 messages;
    // summarizing the dropped middle section is a better (but costlier) option
    return [messages[0], ...messages.slice(-10)];
  }
  return messages;
}
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Not streaming | Users wait for full response — feels slow | Always stream in chat UIs |
| Sending full history forever | Costs increase, hits context limits | Trim or summarize old messages |
| No rate limiting | One user exhausts your API budget | Per-user rate limits |
| Exposing API key to client | Account compromise | Server-side only |
| Ignoring stop reasons | Missing tool calls, truncated responses | Check stop_reason in response |
| No error handling | Crashes on 429/500 responses | Try/catch with retry logic |
Persistent Conversation Storage
The in-memory array approach shown earlier works fine for a single browser session, but real chatbots need to survive page reloads, cross-device access, and server restarts. Once a user closes the tab, that conversation history is gone. For production applications, you need to persist messages to a database and load them back when a conversation resumes.
The data model is straightforward: a conversations table with an ID and metadata, and a messages table with a foreign key to the conversation, plus role, content, and a timestamp. Here's the pattern using a Prisma-style database client:
// lib/conversation-store.ts
import { db } from '@/lib/database';
export async function getConversationMessages(conversationId: string) {
return db.messages.findMany({
where: { conversationId },
orderBy: { createdAt: 'asc' },
select: { role: true, content: true }
});
}
export async function appendMessage(
conversationId: string,
role: 'user' | 'assistant',
content: string
) {
return db.messages.create({
data: { conversationId, role, content }
});
}
For conversation ID management, generate a UUID on the first message and return it to the client in the API response. The client stores it in localStorage (or a session cookie if you want it tied to authentication) and includes it in every subsequent request. On component mount, load the stored conversation ID from localStorage and fetch the previous messages to populate the chat history.
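A sketch of that ID handling with the storage injected, so the same helper works in the browser (pass `window.localStorage`) and in tests. The key name and interface are illustrative:

```typescript
// Anything with the localStorage getItem/setItem shape works here.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const CONVERSATION_KEY = 'conversationId';

function getOrCreateConversationId(
  store: KVStore,
  // crypto.randomUUID is available in browsers and recent Node versions
  newId: () => string = () => crypto.randomUUID()
): string {
  const existing = store.getItem(CONVERSATION_KEY);
  if (existing) return existing; // resume the stored conversation
  const id = newId();
  store.setItem(CONVERSATION_KEY, id);
  return id;
}
```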
One practical concern at scale: storing every message in a relational database adds write latency to every API call. A pattern that works well is writing messages to Redis first (fast writes, short retention) and asynchronously flushing to PostgreSQL for long-term storage. Redis's built-in TTL also handles automatic expiration — set a 30-day TTL on conversation keys and old conversations clean themselves up without a background job.
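The Redis-first pattern can be sketched with the stores abstracted away. Here the fast layer is a plain array standing in for Redis, and `durable.saveBatch` stands in for a batched PostgreSQL insert; both would be real clients in production:

```typescript
type StoredMessage = { conversationId: string; role: string; content: string };

// Write-behind buffer: every write lands in the fast layer immediately,
// and a periodic flush() drains accumulated messages to durable storage.
function createWriteBehindBuffer(durable: { saveBatch(msgs: StoredMessage[]): Promise<void> }) {
  const buffer: StoredMessage[] = [];
  return {
    append(msg: StoredMessage) {
      buffer.push(msg); // in production: also push to Redis with a 30-day TTL
    },
    async flush() {
      if (buffer.length === 0) return 0;
      // splice atomically takes the batch so appends during the await are kept
      const batch = buffer.splice(0, buffer.length);
      await durable.saveBatch(batch);
      return batch.length;
    },
  };
}
```

In production you would call `flush()` on a timer or from a background worker, so the chat request path never waits on the relational write.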
For long-running conversations, also consider a summarization strategy. If a user has been chatting with your bot for two hours, their conversation history may contain thousands of tokens. Rather than trimming arbitrarily, you can periodically ask Claude to summarize the older portion of the conversation — "Summarize the first 20 messages in 3-4 sentences" — and replace those messages with the summary in your storage. This keeps context coherent without unbounded token growth.
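A sketch of that compaction step, with the summarizer injected (in production it would be a Claude call with a "summarize these messages" prompt). The summary is folded into the first retained message so user/assistant roles still alternate:

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Once the history exceeds maxMessages, summarize everything except the
// last keepRecent messages and fold the summary into the first kept message.
async function compactHistory(
  messages: ChatMessage[],
  maxMessages: number,
  keepRecent: number,
  summarize: (older: ChatMessage[]) => Promise<string>
): Promise<ChatMessage[]> {
  if (messages.length <= maxMessages) return messages;
  const older = messages.slice(0, messages.length - keepRecent);
  const recent = messages.slice(-keepRecent);
  const summary = await summarize(older);
  const [first, ...rest] = recent;
  return [
    { ...first, content: `[Earlier conversation summary: ${summary}]\n\n${first.content}` },
    ...rest,
  ];
}
```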
Multi-Tenant Chatbots: Per-Customer System Prompts
Most production chatbots need different behavior for different customers or contexts. A customer support bot for a B2B SaaS company might need to know which plan a user is on, what features they have access to, and the tone appropriate for their company. A white-label chatbot deployed across multiple clients needs each client to have entirely separate personalities and knowledge.
The pattern is to store system prompts in your database per tenant (or per user tier), load the appropriate prompt at conversation initialization, and pass it to the Anthropic API on every request. This is more flexible than hardcoding system prompts in your API route.
System prompt composition comes in three common patterns. Static prompts are fixed for all users of a tenant — every user of Acme Corp's deployment gets the same "You are a helpful assistant for Acme Corp customers" prompt. Dynamic prompts inject user-specific context — the user's name, their subscription plan, their role in the organization. This personalizes the experience without requiring separate prompts per user. RAG-augmented prompts include retrieved documents from your knowledge base or the user's account data, allowing the model to answer questions it couldn't answer from training data alone.
const systemPrompt = `${basePrompt}
Context from the customer's account:
- Plan: ${user.plan}
- Company: ${user.companyName}
- Recent activity: ${recentActivity}`;
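Putting the three patterns together, one way to sketch a prompt builder (the field names here are illustrative):

```typescript
// Compose a system prompt from a static per-tenant base, optional
// per-user context, and optional RAG-retrieved documents.
function buildSystemPrompt(opts: {
  tenantPrompt: string;                  // static, loaded from your DB per tenant
  user?: { name: string; plan: string }; // dynamic, per-user context
  retrievedDocs?: string[];              // RAG context, if any
}): string {
  const parts = [opts.tenantPrompt];
  if (opts.user) {
    parts.push(`User context:\n- Name: ${opts.user.name}\n- Plan: ${opts.user.plan}`);
  }
  if (opts.retrievedDocs?.length) {
    parts.push(`Relevant documentation:\n${opts.retrievedDocs.join('\n---\n')}`);
  }
  return parts.join('\n\n');
}
```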
A few things to watch for with this pattern: prompt injection attacks are a real concern if any part of the system prompt comes from user-controlled input. Always sanitize and validate data before interpolating it into prompts. Also be mindful of token counts — a verbose system prompt with a lot of retrieved context can be expensive at scale. Profile your average system prompt token count and factor it into your cost model.
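A minimal sanitizer sketch for user-controlled values headed into a prompt; this reduces risk but is not a complete defense against prompt injection:

```typescript
// Basic hygiene before interpolating user-controlled values into a system
// prompt: cap length, strip control characters, and neutralize text that
// tries to impersonate a role marker.
function sanitizeForPrompt(value: string, maxLength = 200): string {
  return value
    .slice(0, maxLength)
    .replace(/[\u0000-\u001f\u007f]/g, ' ')     // control characters
    .replace(/\b(system|assistant):/gi, '$1 -') // fake role markers
    .trim();
}
```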
Cost Optimization at Scale
Claude's token-based pricing makes cost modeling straightforward, but the numbers can surprise you at scale. At the time of writing, Claude 3.5 Haiku is priced at $0.80 per million input tokens and $4 per million output tokens, while Claude Sonnet is $3 per million input and $15 per million output. Those are very different numbers at volume.
For a typical chatbot turn with 500 input tokens and 300 output tokens, Haiku costs about $0.0016 per turn and Sonnet costs about $0.006 per turn, nearly a 4x difference. At 100,000 turns per month, Haiku costs roughly $160 and Sonnet costs roughly $600. At a million turns per month, the gap becomes $1,600 versus $6,000.
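The per-turn arithmetic as a helper, with prices passed in explicitly since published rates change over time:

```typescript
// Cost of one chatbot turn from token counts. Prices are per million tokens.
function costPerTurn(
  inputTokens: number,
  outputTokens: number,
  pricing: { inputPerM: number; outputPerM: number }
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerM +
    (outputTokens / 1_000_000) * pricing.outputPerM
  );
}
```

For example, a 500-input / 300-output turn at Sonnet's $3/$15 rates comes to $0.006.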
The practical optimization that most teams underuse is query routing — not every message needs Sonnet. A user typing "hi" or "thanks" or asking a simple factual question that any small model handles well doesn't need a frontier model. You can classify incoming messages with a lightweight heuristic (or with Haiku itself, cheaply) and route simple queries to Haiku while reserving Sonnet for complex reasoning, code generation, or nuanced instruction-following.
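A crude version of that routing heuristic; the length threshold and keyword list are illustrative, not tuned:

```typescript
// Short pleasantries and trivial messages go to the cheap model; anything
// long or containing reasoning/code hints goes to the stronger one.
const COMPLEX_HINTS = /\b(code|design|architect|compare|debug|why|how)\b/i;

function pickModel(message: string): string {
  const trimmed = message.trim();
  if (trimmed.length < 20 && !COMPLEX_HINTS.test(trimmed)) {
    return 'claude-3-5-haiku-20241022';
  }
  return 'claude-sonnet-4-20250514';
}
```

A misrouted simple query costs you a little extra; a misrouted complex query costs you answer quality, so bias the heuristic toward the stronger model when in doubt.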
Other cost levers worth implementing: cache common responses in Redis with a short TTL. FAQ-style questions — "How do I reset my password?" "What's your refund policy?" — return identical or near-identical responses across users. Caching the response for an hour eliminates redundant API calls. Aggressive conversation trimming also matters — keeping only the last five to ten turns rather than the full history reduces input token counts significantly on long conversations. Finally, set a sensible max_tokens limit on responses. If your chatbot is answering support questions, it rarely needs to generate 4,000 tokens. A max_tokens of 500-800 prevents runaway long responses that inflate your bill without adding value.
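A sketch of the FAQ response cache using an in-memory map with TTL; in production this would live in Redis so all server instances share it:

```typescript
// Cache responses keyed on a normalized question so trivial wording
// differences ("How do I reset my password?" vs "how do i reset my password")
// still hit the same entry. `now` is a parameter for testability.
const responseCache = new Map<string, { value: string; expiresAt: number }>();

function normalizeQuestion(q: string): string {
  return q.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim();
}

function getCachedResponse(q: string, now = Date.now()): string | undefined {
  const entry = responseCache.get(normalizeQuestion(q));
  if (!entry || entry.expiresAt <= now) return undefined;
  return entry.value;
}

function setCachedResponse(q: string, value: string, ttlMs = 3_600_000, now = Date.now()) {
  responseCache.set(normalizeQuestion(q), { value, expiresAt: now + ttlMs });
}
```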
Beyond cost monitoring, think carefully about evaluation. Automated metrics — response latency, error rate, token usage per session — are easy to instrument but don't tell you whether the chatbot is actually helping users. Build in explicit feedback signals: thumbs up/down ratings, escalation-to-human rates, and conversation abandonment rates (sessions where the user stops responding without their issue resolved). These behavioral indicators are better proxies for chatbot quality than any automated evaluation score.
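Those behavioral signals are easy to roll up once you log them per session; a sketch with an illustrative record shape:

```typescript
// Per-session record: optional explicit rating, plus the behavioral flags.
type SessionRecord = { rating?: 'up' | 'down'; escalated: boolean; abandoned: boolean };

function chatQualityMetrics(sessions: SessionRecord[]) {
  const total = sessions.length || 1; // avoid divide-by-zero on empty input
  const rated = sessions.filter((s) => s.rating);
  return {
    // Share of explicit ratings that were positive (only among rated sessions)
    thumbsUpRate: rated.length ? rated.filter((s) => s.rating === 'up').length / rated.length : 0,
    escalationRate: sessions.filter((s) => s.escalated).length / total,
    abandonmentRate: sessions.filter((s) => s.abandoned).length / total,
  };
}
```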
Review a random sample of conversations weekly during the first few months of deployment. You'll quickly identify patterns: guardrails that are too aggressive and refuse legitimate requests, responses that hallucinate product details when the system prompt context is incomplete, answers so long that users abandon reading before reaching the relevant information. Logging full conversation turns — with appropriate privacy controls and data retention policies — makes diagnosing these failure patterns straightforward. For regulated industries, verify what conversation data you can retain before building your logging infrastructure; some compliance frameworks restrict the storage of user message content even in internal systems, making logging infrastructure a design decision that warrants early legal review rather than a retrofit.
Building with the Anthropic API? Explore AI API comparisons and integration guides on APIScout — Claude vs GPT, Claude vs Gemini, and more.
Evaluate Anthropic and compare alternatives on APIScout.
Related: Building an AI Agent in 2026, How to Build a RAG App with Cohere Embeddings, How to Choose an LLM API in 2026