How to Monitor API Performance: Latency, Errors, and SLAs
You can't improve what you don't measure. API performance monitoring tracks latency, error rates, throughput, and availability — the metrics that determine whether your API is meeting its commitments. Here's what to measure, how to measure it, and when to alert.
TL;DR
- Measure latency percentiles (p50, p95, p99) — averages hide tail latency problems that affect real users
- Set SLOs tighter than your SLA commitments to catch degradation before it breaches customer agreements
- Alert on symptoms (high error rate, elevated p99), not causes (high CPU) — cause-based alerts lead to alert fatigue
- OpenTelemetry is the standard for distributed tracing — instrument once, send to any backend
- Error budgets give you a principled framework for balancing reliability and velocity
The Four Golden Signals
Google SRE's four golden signals apply directly to APIs:
1. Latency
What: Time from request received to response sent.
Measure percentiles, not averages:
| Percentile | Meaning | Use |
|---|---|---|
| p50 (median) | Half of requests are faster | Typical experience |
| p95 | 95% of requests are faster | Most users' experience |
| p99 | 99% of requests are faster | Worst-case normal experience |
| p99.9 | 99.9% are faster | Tail latency |
Why not averages? An average of 100ms hides that 1% of requests take 5 seconds. p99 catches that.
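The difference is easy to see on raw samples. A minimal sketch using the nearest-rank method (illustrative `percentile` helper, not from any particular library):

```typescript
// Nearest-rank percentile over raw latency samples (illustrative helper).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Rank of the smallest value with p% of samples at or below it
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 100 samples: 98 requests at 100ms, 2 slow requests at 5000ms.
const latencies = [...Array(98).fill(100), 5000, 5000];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;

console.log(mean);                      // 198 — the average looks fine
console.log(percentile(latencies, 50)); // 100 — the median looks fine
console.log(percentile(latencies, 99)); // 5000 — p99 exposes the tail
```

In production you would not keep raw samples in memory; metrics libraries use histogram buckets for exactly this reason (see the Prometheus section below).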
Targets:
| Endpoint Type | p50 | p95 | p99 |
|---|---|---|---|
| Simple read | <50ms | <200ms | <500ms |
| Database query | <100ms | <500ms | <1s |
| Search | <200ms | <1s | <2s |
| Write operation | <100ms | <500ms | <1s |
| External API call | <500ms | <2s | <5s |
2. Error Rate
What: Percentage of requests returning errors (4xx/5xx).
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| 5xx rate | <0.1% | 0.1-1% | >1% |
| 4xx rate | <5% | 5-10% | >10% |
| Total error rate | <1% | 1-5% | >5% |
Track by status code: Distinguish between client errors (4xx — usually the client's fault) and server errors (5xx — your fault).
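A sketch of that split, assuming you have the raw status codes for a window (the `classify` and `errorRates` helpers are hypothetical, for illustration only):

```typescript
// Classify a status code into the buckets from the table above.
type ErrorClass = 'ok' | 'client_error' | 'server_error';

function classify(status: number): ErrorClass {
  if (status >= 500) return 'server_error'; // your fault
  if (status >= 400) return 'client_error'; // usually the client's fault
  return 'ok';
}

// Compute 4xx/5xx rates over a window of recorded status codes.
function errorRates(statuses: number[]) {
  const total = statuses.length;
  const count = (c: ErrorClass) => statuses.filter((s) => classify(s) === c).length;
  return {
    clientErrorRate: count('client_error') / total,
    serverErrorRate: count('server_error') / total,
  };
}

const { serverErrorRate } = errorRates([200, 200, 201, 404, 500, 200, 200, 200, 200, 200]);
console.log(serverErrorRate); // 0.1 — 10%, far above the 1% critical threshold
```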
3. Throughput
What: Requests per second (RPS) or requests per minute (RPM).
Track throughput to:
- Capacity plan (are you approaching limits?)
- Detect anomalies (sudden spike = attack? sudden drop = outage?)
- Correlate with latency (does latency increase with load?)
4. Saturation
What: How close your system is to capacity.
| Resource | Metric | Alert Threshold |
|---|---|---|
| CPU | Utilization % | >80% sustained |
| Memory | Usage / available | >85% |
| Database connections | Active / max pool | >80% |
| Disk I/O | IOPS / max IOPS | >70% |
| Network | Bandwidth usage | >70% |
SLA / SLO / SLI
SLI (Service Level Indicator)
A measurable metric, e.g. "the proportion of requests that complete in under 500ms."
SLO (Service Level Objective)
Your internal target: "p99 latency < 500ms, error rate < 0.1%."
SLA (Service Level Agreement)
Your external commitment with consequences: "99.9% uptime or service credits."
Set SLOs tighter than SLAs. If your SLA promises 99.9% uptime, set your SLO at 99.95% so you have a buffer before breaching the SLA.
Uptime Targets
| Uptime | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
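The table values fall out of a one-line calculation, using an average month of 365.25 / 12 days; a quick sketch:

```typescript
// Allowed downtime (in minutes) per period for a given uptime target.
// Uses an average month of 365.25 / 12 days, matching the table above.
function downtimeMinutes(uptimePercent: number, periodDays: number): number {
  return (1 - uptimePercent / 100) * periodDays * 24 * 60;
}

const avgMonthDays = 365.25 / 12;
console.log(downtimeMinutes(99.9, avgMonthDays).toFixed(1)); // 43.8 minutes/month
console.log(downtimeMinutes(99.99, 365.25).toFixed(1));      // 52.6 minutes/year
```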
Alerting Strategy
Alert on Symptoms, Not Causes
Good alerts (symptoms):
- p99 latency > 2s for 5 minutes
- Error rate > 1% for 3 minutes
- Throughput dropped 50% vs same hour last week
Bad alerts (causes):
- CPU > 80% (may not affect users)
- Memory > 90% (may be normal)
- Single health check failed (transient)
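The "throughput dropped 50% vs same hour last week" symptom reduces to a simple comparison; a sketch (hypothetical helper, and the 50% threshold is just the example from the list above):

```typescript
// Symptom-based throughput alert: fire when current RPS drops more than
// `dropThreshold` relative to the same hour last week.
function throughputDropped(
  currentRps: number,
  sameHourLastWeekRps: number,
  dropThreshold = 0.5,
): boolean {
  if (sameHourLastWeekRps === 0) return false; // no baseline, nothing to compare
  const drop = (sameHourLastWeekRps - currentRps) / sameHourLastWeekRps;
  return drop >= dropThreshold;
}

console.log(throughputDropped(400, 1000)); // true  — 60% drop, alert
console.log(throughputDropped(900, 1000)); // false — 10% drop, normal
```

Comparing against the same hour last week (rather than the previous hour) avoids false alarms from normal daily and weekly traffic cycles.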
Alert Severity
| Severity | Criteria | Response |
|---|---|---|
| P1 - Critical | Service down, data loss | Page on-call, all hands |
| P2 - High | Degraded performance, partial outage | Page on-call, investigate |
| P3 - Medium | Non-critical service degraded | Next business day |
| P4 - Low | Cosmetic, minor issue | Backlog |
Monitoring Tools
| Tool | Best For | Price |
|---|---|---|
| Datadog | Full observability | From $5/host/mo |
| Grafana + Prometheus | Self-hosted, open source | Free |
| Better Stack | Uptime + incidents | Free (10 monitors) |
| Checkly | Synthetic monitoring | Free (5 checks) |
| Sentry | Error tracking | Free (5K events) |
| PostHog | Product analytics | Free (1M events) |
Dashboard Essentials
Every API monitoring dashboard should show:
- Request volume — RPS over time (detect anomalies)
- Latency percentiles — p50, p95, p99 over time
- Error rate — 4xx and 5xx separately
- Top errors — most frequent error codes/messages
- Slowest endpoints — which endpoints need optimization
- Uptime — current and 30-day availability
OpenTelemetry for APIs
OpenTelemetry (OTel) is the industry-standard framework for distributed tracing, metrics, and logs. It provides vendor-neutral instrumentation so you can switch between Datadog, Grafana, Honeycomb, and Jaeger without rewriting your instrumentation code. For API teams, OTel solves a specific problem: when a slow API request spans multiple services, OTel traces show exactly where time was spent — which database query, which downstream service call, which function.
Auto-instrumentation in Node.js is the fastest path to distributed tracing. The @opentelemetry/auto-instrumentations-node package automatically instruments Express, Hono, Fastify, HTTP clients (fetch, axios), databases (pg, mysql2, mongoose), and Redis without any code changes:
```typescript
// instrument.ts — must be loaded BEFORE all other code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // e.g., https://api.honeycomb.io/v1/traces
    headers: {
      'x-honeycomb-team': process.env.HONEYCOMB_API_KEY,
    },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

```jsonc
// package.json — load instrumentation before app code
{
  "scripts": {
    "start": "node -r ./dist/instrument.js dist/server.js"
  }
}
```
Manual span creation adds business context that auto-instrumentation cannot infer. When you have a complex operation (order processing, payment flow, batch job), manual spans make traces far more useful:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-api');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'order.source': 'api',
    });
    try {
      const order = await tracer.startActiveSpan('db.getOrder', async (dbSpan) => {
        const result = await db.orders.findUnique({ where: { id: orderId } });
        dbSpan.setAttributes({ 'db.rows_affected': result ? 1 : 0 });
        dbSpan.end();
        return result;
      });
      await tracer.startActiveSpan('payment.charge', async (paymentSpan) => {
        paymentSpan.setAttributes({ 'payment.amount': order.total });
        await chargePayment(order);
        paymentSpan.end();
      });
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Trace context propagation ensures that a trace started in your API gateway is continued through downstream microservices. OTel propagates context via the traceparent HTTP header (W3C Trace Context standard). When you use auto-instrumentation for HTTP clients, this happens automatically. For custom queue consumers or background workers, you may need to extract context manually from the message payload.
For sending traces to Grafana Tempo or Datadog, replace the OTLPTraceExporter URL with the appropriate endpoint. The instrumentation code stays identical — OTel's vendor neutrality is genuine.
Prometheus + Grafana Setup
Prometheus is the de facto standard for metrics collection in production API infrastructure. It works by scraping a /metrics endpoint exposed by your API, storing time-series data, and enabling alerting rules. Grafana visualizes Prometheus data and manages alert notifications.
Exposing /metrics from your API using prom-client (Node.js):
```typescript
import type { NextFunction, Request, Response } from 'express';
import { Registry, Counter, Histogram, collectDefaultMetrics } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register }); // CPU, memory, event loop lag

// Request counter
const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

// Latency histogram (buckets in seconds)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

// Express middleware
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status_code: String(res.statusCode),
    };
    httpRequestTotal.inc(labels);
    end(labels);
  });
  next();
}

// Metrics endpoint (assumes an existing Express `app`)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
Prometheus alerting rules for the critical thresholds (Alertmanager then routes the fired alerts to your notification channels). Note the `sum()` aggregations: without them, the error-rate division matches each series against itself, and the quantile is computed per label combination rather than globally:

```yaml
# prometheus/alerts.yml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency {{ $value }}s exceeds 2s threshold"
```
Grafana dashboards for APIs should display the USE method (Utilization, Saturation, Errors) for your infrastructure resources alongside the RED method (Rate, Errors, Duration) for your API endpoints. Community dashboards such as #1860 (Node Exporter Full) cover the host-level USE metrics; build the RED panels from the http_requests_total and http_request_duration_seconds metrics above and customize for your specific endpoints.
Synthetic Monitoring
Real user monitoring tells you what happened. Synthetic monitoring tells you what is happening right now, proactively. Synthetic monitors are scripted tests that run continuously from multiple geographic locations, verifying that your API endpoints respond correctly.
Checkly is the leading synthetic monitoring tool for APIs. You write tests in Playwright (for browser flows) or plain JavaScript (for API checks), and Checkly runs them on a schedule (every 1 minute to every 24 hours) from 20+ global locations:
```typescript
// checkly.config.ts
import { ApiCheck, AssertionBuilder } from '@checkly/cli/constructs';

// `slackAlertChannel` is an alert channel assumed to be defined elsewhere in the config.

new ApiCheck('api-health-check', {
  name: 'API Health Check',
  activated: true,
  frequency: 1, // every minute
  locations: ['us-east-1', 'eu-west-1', 'ap-southeast-1'],
  request: {
    url: 'https://api.example.com/health',
    method: 'GET',
    headers: [{ key: 'Accept', value: 'application/json' }],
    assertions: [
      AssertionBuilder.statusCode().equals(200),
      AssertionBuilder.responseTime().lessThan(1000),
      AssertionBuilder.jsonBody('$.status').equals('ok'),
    ],
  },
  alertChannels: [slackAlertChannel],
});
```
Synthetic vs real user monitoring (RUM): Synthetic monitoring runs scripted probes at regular intervals — it catches outages and performance regressions quickly, from your perspective. RUM measures what actual users experience — it captures the full distribution of real-world latency, error rates across devices, networks, and geographies. Both are necessary. Synthetic monitoring catches issues before users report them. RUM reveals issues that synthetic doesn't reproduce (e.g., performance problems for users on slow mobile connections in specific regions).
The most valuable synthetic test is one that exercises the critical path of your API — not just a health check endpoint, but the actual sequence of calls a user makes. Authentication → fetch data → write data. If that sequence fails, your users cannot use your product. A synthetic monitor on that path catches total outages in under 2 minutes.
In practice, synthetic monitoring is paired with alerting on real-user error rates for a comprehensive picture of API health.
Error Budgets
An error budget is the amount of unreliability you are allowed to have while still meeting your SLO. It is computed as: 100% - SLO target. If your SLO is 99.9% availability, your error budget is 0.1% — equivalent to 43.8 minutes of downtime per month.
Error budgets reframe reliability decisions. Instead of an abstract debate about "how reliable should we be?", error budgets make the tradeoff concrete: every deployment, every feature flag rollout, every risky infrastructure change spends error budget. When the budget is full, you can move fast. When the budget is depleted, you must slow down and focus on reliability.
Error budget burn rate measures how quickly you are consuming the budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the period. A burn rate of 2x means you'll exhaust it in half the time. Fast burn alerts catch acute incidents; slow burn alerts catch gradual degradation.
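Burn rate is just the observed error rate divided by the error budget; a minimal sketch (hypothetical helper):

```typescript
// Burn rate: how fast the error budget is being consumed.
// burnRate = observedErrorRate / errorBudget, where errorBudget = 1 - SLO target.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  return observedErrorRate / budget;
}

// With a 99.9% SLO the budget is 0.1%. A sustained 1.44% error rate
// burns the budget at 14.4x — the classic fast-burn paging threshold.
console.log(burnRate(0.0144, 0.999).toFixed(1)); // 14.4
console.log(burnRate(0.001, 0.999).toFixed(1));  // 1.0 — consuming exactly on budget
```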
Google SRE recommends a two-tier alerting strategy for error budgets:
Fast burn alert (high severity, page on-call):
- Triggered when: burn rate > 14.4x over a 1-hour window
- Meaning: you're consuming about 2% of a 30-day error budget per hour and will exhaust it in roughly 2 days (720h / 14.4)
- Response: immediate investigation and response
Slow burn alert (lower severity, ticket):
- Triggered when: burn rate > 1x over a 72-hour window
- Meaning: you're consuming budget faster than it replenishes
- Response: investigate root cause, plan improvements
In Prometheus:

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > 0.001 * 14.4  # 14.4x burn on a 0.1% budget
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget fast burn: monthly budget gone in ~2 days at this rate"
```
When the error budget is exhausted, the engineering principle is: stop shipping new features and spend all effort on reliability improvements. This is the organizational forcing function that makes SLOs meaningful rather than aspirational.
Incident Management
Even well-monitored APIs have incidents. The difference between teams that recover quickly and teams that don't is preparation: documented runbooks, clear incident command structure, and a blameless postmortem culture.
Runbook structure for API incidents should be concise and action-oriented. A runbook is not a design document — it is a checklist for a stressed engineer at 3am. Structure:
- Symptoms: What alerts fired? What are users experiencing?
- Initial triage: Which service? Which endpoints? What changed recently?
- Diagnostic commands: Specific queries to run to identify the root cause
- Mitigation steps: Actions to take in order (rollback, scale up, disable feature flag)
- Escalation path: Who to page if this runbook doesn't resolve the incident
Incident command separates the person driving technical resolution from the person managing communication. The incident commander (IC) coordinates work, decides escalation, and makes calls about risk tradeoffs. The communications lead handles stakeholder updates, status page messages, and customer notifications. Without this separation, the technical lead is interrupted by stakeholder requests at exactly the moment they need to focus.
Timeline recording is essential for the postmortem. Use an incident chat channel (Slack, Discord) where every action and observation is timestamped. When did the first alert fire? When did the team start investigating? When was the problem identified? When was the mitigation applied? When did metrics return to normal? This timeline is the foundation of the postmortem.
Blameless postmortems focus on systemic factors, not individual mistakes. The goal is not to find who caused the incident — it is to find the conditions that made the incident possible and likely. Good postmortem questions: Why did the monitoring not catch this sooner? Why was the mitigation so slow? What made this problem hard to diagnose? What process or tool change would prevent recurrence?
MTTR (Mean Time to Recover) measures how long incidents last. Track MTTR over time — a rising MTTR trend indicates that incidents are getting more complex or that your runbooks and tooling are not keeping pace with system complexity. For a comprehensive view of building reliable API infrastructure, see our guides on API gateway patterns and API rate limiting best practices.
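MTTR itself is a simple average over incident durations, which is why the timestamped timeline matters so much; a sketch (the `Incident` shape is hypothetical):

```typescript
// Mean Time to Recover, in minutes, over a list of incidents.
interface Incident {
  startedAt: Date;  // first alert fired
  resolvedAt: Date; // metrics returned to normal
}

function mttrMinutes(incidents: Incident[]): number {
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.resolvedAt.getTime() - i.startedAt.getTime()),
    0,
  );
  return totalMs / incidents.length / 60_000;
}

const incidents: Incident[] = [
  { startedAt: new Date('2026-01-03T02:10:00Z'), resolvedAt: new Date('2026-01-03T02:40:00Z') }, // 30 min
  { startedAt: new Date('2026-02-11T14:00:00Z'), resolvedAt: new Date('2026-02-11T15:30:00Z') }, // 90 min
];
console.log(mttrMinutes(incidents)); // 60 — (30 + 90) / 2 minutes
```

The start and end timestamps come straight from the incident timeline; if the timeline is sloppy, the MTTR trend is noise.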
Conclusion
API performance monitoring is not a one-time setup — it is a continuous practice. The four golden signals (latency, error rate, throughput, saturation) provide the measurement foundation. SLOs and error budgets provide the organizational framework for reliability decisions. OpenTelemetry provides vendor-neutral instrumentation. Synthetic monitoring provides proactive outage detection. And incident management runbooks ensure that when things go wrong, recovery is fast and learning is systematic.
Build monitoring into your API from day one, not as an afterthought. The cost of retroactively instrumenting a complex API is far higher than instrumenting it during initial development. And explore the full API tooling directory for observability tools, APM platforms, and monitoring services that integrate with your stack.
Related: Building an AI Agent in 2026, Building an AI-Powered App: Choosing Your API Stack, Building an API Marketplace