How to Monitor API Performance: Latency, Errors, and SLAs
You can't improve what you don't measure. API performance monitoring tracks latency, error rates, throughput, and availability — the metrics that determine whether your API is meeting its commitments. Here's what to measure, how to measure it, and when to alert.
TL;DR
- Measure latency percentiles (p50, p95, p99) — averages hide tail latency problems that affect real users
- Set SLOs tighter than your SLA commitments to catch degradation before it breaches customer agreements
- Alert on symptoms (high error rate, elevated p99), not causes (high CPU) — cause-based alerts lead to alert fatigue
- OpenTelemetry is the standard for distributed tracing — instrument once, send to any backend
- Error budgets give you a principled framework for balancing reliability and velocity
The Four Golden Signals
Google SRE's four golden signals apply directly to APIs:
1. Latency
What: Time from request received to response sent.
Measure percentiles, not averages:
| Percentile | Meaning | Use |
|---|---|---|
| p50 (median) | Half of requests are faster | Typical experience |
| p95 | 95% of requests are faster | Most users' experience |
| p99 | 99% of requests are faster | Worst-case normal experience |
| p99.9 | 99.9% are faster | Tail latency |
Why not averages? An average of 100ms hides that 1% of requests take 5 seconds. p99 catches that.
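The difference is easy to see on raw samples. A minimal sketch using the nearest-rank method (illustrative `percentile` helper, not from any particular library):

```typescript
// Nearest-rank percentile over raw latency samples (illustrative helper).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Rank of the smallest value with p% of samples at or below it
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 100 samples: 98 requests at 100ms, 2 slow requests at 5000ms.
const latencies = [...Array(98).fill(100), 5000, 5000];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;

console.log(mean);                      // 198 — the average looks fine
console.log(percentile(latencies, 50)); // 100 — the median looks fine
console.log(percentile(latencies, 99)); // 5000 — p99 exposes the tail
```

In production you would not keep raw samples in memory; metrics libraries use histogram buckets for exactly this reason (see the Prometheus section below).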
Targets:
| Endpoint Type | p50 | p95 | p99 |
|---|---|---|---|
| Simple read | <50ms | <200ms | <500ms |
| Database query | <100ms | <500ms | <1s |
| Search | <200ms | <1s | <2s |
| Write operation | <100ms | <500ms | <1s |
| External API call | <500ms | <2s | <5s |
2. Error Rate
What: Percentage of requests returning errors (4xx/5xx).
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| 5xx rate | <0.1% | 0.1-1% | >1% |
| 4xx rate | <5% | 5-10% | >10% |
| Total error rate | <1% | 1-5% | >5% |
Track by status code: Distinguish between client errors (4xx — usually the client's fault) and server errors (5xx — your fault).
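A sketch of that split, assuming you have the raw status codes for a window (the `classify` and `errorRates` helpers are hypothetical, for illustration only):

```typescript
// Classify a status code into the buckets from the table above.
type ErrorClass = 'ok' | 'client_error' | 'server_error';

function classify(status: number): ErrorClass {
  if (status >= 500) return 'server_error'; // your fault
  if (status >= 400) return 'client_error'; // usually the client's fault
  return 'ok';
}

// Compute 4xx/5xx rates over a window of recorded status codes.
function errorRates(statuses: number[]) {
  const total = statuses.length;
  const count = (c: ErrorClass) => statuses.filter((s) => classify(s) === c).length;
  return {
    clientErrorRate: count('client_error') / total,
    serverErrorRate: count('server_error') / total,
  };
}

const { serverErrorRate } = errorRates([200, 200, 201, 404, 500, 200, 200, 200, 200, 200]);
console.log(serverErrorRate); // 0.1 — 10%, far above the 1% critical threshold
```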
3. Throughput
What: Requests per second (RPS) or requests per minute (RPM).
Track throughput to:
- Capacity plan (are you approaching limits?)
- Detect anomalies (sudden spike = attack? sudden drop = outage?)
- Correlate with latency (does latency increase with load?)
4. Saturation
What: How close your system is to capacity.
| Resource | Metric | Alert Threshold |
|---|---|---|
| CPU | Utilization % | >80% sustained |
| Memory | Usage / available | >85% |
| Database connections | Active / max pool | >80% |
| Disk I/O | IOPS / max IOPS | >70% |
| Network | Bandwidth usage | >70% |
SLA / SLO / SLI
SLI (Service Level Indicator)
A measurable metric, e.g. "the proportion of requests that complete in under 500ms."
SLO (Service Level Objective)
Your internal target: "p99 latency < 500ms, error rate < 0.1%."
SLA (Service Level Agreement)
Your external commitment with consequences: "99.9% uptime or service credits."
Set SLOs tighter than SLAs. If your SLA promises 99.9% uptime, set your SLO at 99.95% so you have a buffer before breaching the SLA.
Uptime Targets
| Uptime | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
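The table values fall out of a one-line calculation, using an average month of 365.25 / 12 days; a quick sketch:

```typescript
// Allowed downtime (in minutes) per period for a given uptime target.
// Uses an average month of 365.25 / 12 days, matching the table above.
function downtimeMinutes(uptimePercent: number, periodDays: number): number {
  return (1 - uptimePercent / 100) * periodDays * 24 * 60;
}

const avgMonthDays = 365.25 / 12;
console.log(downtimeMinutes(99.9, avgMonthDays).toFixed(1)); // 43.8 minutes/month
console.log(downtimeMinutes(99.99, 365.25).toFixed(1));      // 52.6 minutes/year
```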
Alerting Strategy
Alert on Symptoms, Not Causes
Good alerts (symptoms):
- p99 latency > 2s for 5 minutes
- Error rate > 1% for 3 minutes
- Throughput dropped 50% vs same hour last week
Bad alerts (causes):
- CPU > 80% (may not affect users)
- Memory > 90% (may be normal)
- Single health check failed (transient)
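The "throughput dropped 50% vs same hour last week" symptom reduces to a simple comparison; a sketch (hypothetical helper, and the 50% threshold is just the example from the list above):

```typescript
// Symptom-based throughput alert: fire when current RPS drops more than
// `dropThreshold` relative to the same hour last week.
function throughputDropped(
  currentRps: number,
  sameHourLastWeekRps: number,
  dropThreshold = 0.5,
): boolean {
  if (sameHourLastWeekRps === 0) return false; // no baseline, nothing to compare
  const drop = (sameHourLastWeekRps - currentRps) / sameHourLastWeekRps;
  return drop >= dropThreshold;
}

console.log(throughputDropped(400, 1000)); // true  — 60% drop, alert
console.log(throughputDropped(900, 1000)); // false — 10% drop, normal
```

Comparing against the same hour last week (rather than the previous hour) avoids false alarms from normal daily and weekly traffic cycles.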
Alert Severity
| Severity | Criteria | Response |
|---|---|---|
| P1 - Critical | Service down, data loss | Page on-call, all hands |
| P2 - High | Degraded performance, partial outage | Page on-call, investigate |
| P3 - Medium | Non-critical service degraded | Next business day |
| P4 - Low | Cosmetic, minor issue | Backlog |
Monitoring Tools
| Tool | Best For | Price |
|---|---|---|
| Datadog | Full observability | From $5/host/mo |
| Grafana + Prometheus | Self-hosted, open source | Free |
| Better Stack | Uptime + incidents | Free (10 monitors) |
| Checkly | Synthetic monitoring | Free (5 checks) |
| Sentry | Error tracking | Free (5K events) |
| PostHog | Product analytics | Free (1M events) |
Dashboard Essentials
Every API monitoring dashboard should show:
- Request volume — RPS over time (detect anomalies)
- Latency percentiles — p50, p95, p99 over time
- Error rate — 4xx and 5xx separately
- Top errors — most frequent error codes/messages
- Slowest endpoints — which endpoints need optimization
- Uptime — current and 30-day availability
OpenTelemetry for APIs
OpenTelemetry (OTel) is the industry-standard framework for distributed tracing, metrics, and logs. It provides vendor-neutral instrumentation so you can switch between Datadog, Grafana, Honeycomb, and Jaeger without rewriting your instrumentation code. For API teams, OTel solves a specific problem: when a slow API request spans multiple services, OTel traces show exactly where time was spent — which database query, which downstream service call, which function.
Auto-instrumentation in Node.js is the fastest path to distributed tracing. The @opentelemetry/auto-instrumentations-node package automatically instruments Express, Hono, Fastify, HTTP clients (fetch, axios), databases (pg, mysql2, mongoose), and Redis without any code changes:
```typescript
// instrument.ts — must be loaded BEFORE all other code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // e.g., https://api.honeycomb.io/v1/traces
    headers: {
      'x-honeycomb-team': process.env.HONEYCOMB_API_KEY,
    },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

```jsonc
// package.json — load instrumentation before app code
{
  "scripts": {
    "start": "node -r ./dist/instrument.js dist/server.js"
  }
}
```
Manual span creation adds business context that auto-instrumentation cannot infer. When you have a complex operation (order processing, payment flow, batch job), manual spans make traces far more useful:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-api');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'order.source': 'api',
    });
    try {
      const order = await tracer.startActiveSpan('db.getOrder', async (dbSpan) => {
        const result = await db.orders.findUnique({ where: { id: orderId } });
        dbSpan.setAttributes({ 'db.rows_affected': result ? 1 : 0 });
        dbSpan.end();
        return result;
      });
      await tracer.startActiveSpan('payment.charge', async (paymentSpan) => {
        paymentSpan.setAttributes({ 'payment.amount': order.total });
        await chargePayment(order);
        paymentSpan.end();
      });
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Trace context propagation ensures that a trace started in your API gateway is continued through downstream microservices. OTel propagates context via the traceparent HTTP header (W3C Trace Context standard). When you use auto-instrumentation for HTTP clients, this happens automatically. For custom queue consumers or background workers, you may need to extract context manually from the message payload.
For sending traces to Grafana Tempo or Datadog, replace the OTLPTraceExporter URL with the appropriate endpoint. The instrumentation code stays identical — OTel's vendor neutrality is genuine.
Prometheus + Grafana Setup
Prometheus is the de facto standard for metrics collection in production API infrastructure. It works by scraping a /metrics endpoint exposed by your API, storing time-series data, and enabling alerting rules. Grafana visualizes Prometheus data and manages alert notifications.
Exposing /metrics from your API using prom-client (Node.js):
```typescript
import type { NextFunction, Request, Response } from 'express';
import { Registry, Counter, Histogram, collectDefaultMetrics } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register }); // CPU, memory, event loop lag

// Request counter
const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

// Latency histogram (buckets in seconds)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

// Express middleware
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status_code: String(res.statusCode),
    };
    httpRequestTotal.inc(labels);
    end(labels);
  });
  next();
}

// Metrics endpoint (assumes an existing Express `app`)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
Prometheus alerting rules for the critical thresholds (Alertmanager then routes the fired alerts to your notification channels). Note the `sum()` aggregations: without them, the error-rate division matches each series against itself, and the quantile is computed per label combination rather than globally:

```yaml
# prometheus/alerts.yml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency {{ $value }}s exceeds 2s threshold"
```
Grafana dashboards for APIs should display the USE method (Utilization, Saturation, Errors) for your infrastructure resources alongside the RED method (Rate, Errors, Duration) for your API endpoints. Community dashboards such as #1860 (Node Exporter Full) cover the host-level USE metrics; build the RED panels from the http_requests_total and http_request_duration_seconds metrics above and customize for your specific endpoints.
Synthetic Monitoring
Real user monitoring tells you what happened. Synthetic monitoring tells you what is happening right now, proactively. Synthetic monitors are scripted tests that run continuously from multiple geographic locations, verifying that your API endpoints respond correctly.
Checkly is the leading synthetic monitoring tool for APIs. You write tests in Playwright (for browser flows) or plain JavaScript (for API checks), and Checkly runs them on a schedule (every 1 minute to every 24 hours) from 20+ global locations:
```typescript
// checkly.config.ts
import { ApiCheck, AssertionBuilder } from '@checkly/cli/constructs';

// `slackAlertChannel` is an alert channel assumed to be defined elsewhere in the config.

new ApiCheck('api-health-check', {
  name: 'API Health Check',
  activated: true,
  frequency: 1, // every minute
  locations: ['us-east-1', 'eu-west-1', 'ap-southeast-1'],
  request: {
    url: 'https://api.example.com/health',
    method: 'GET',
    headers: [{ key: 'Accept', value: 'application/json' }],
    assertions: [
      AssertionBuilder.statusCode().equals(200),
      AssertionBuilder.responseTime().lessThan(1000),
      AssertionBuilder.jsonBody('$.status').equals('ok'),
    ],
  },
  alertChannels: [slackAlertChannel],
});
```
Synthetic vs real user monitoring (RUM): Synthetic monitoring runs scripted probes at regular intervals — it catches outages and performance regressions quickly, from your perspective. RUM measures what actual users experience — it captures the full distribution of real-world latency, error rates across devices, networks, and geographies. Both are necessary. Synthetic monitoring catches issues before users report them. RUM reveals issues that synthetic doesn't reproduce (e.g., performance problems for users on slow mobile connections in specific regions).
The most valuable synthetic test is one that exercises the critical path of your API — not just a health check endpoint, but the actual sequence of calls a user makes. Authentication → fetch data → write data. If that sequence fails, your users cannot use your product. A synthetic monitor on that path catches total outages in under 2 minutes.
In practice, synthetic monitoring is paired with alerting on real-user error rates for a comprehensive picture of API health.
Error Budgets
An error budget is the amount of unreliability you are allowed to have while still meeting your SLO. It is computed as: 100% - SLO target. If your SLO is 99.9% availability, your error budget is 0.1% — equivalent to 43.8 minutes of downtime per month.
Error budgets reframe reliability decisions. Instead of an abstract debate about "how reliable should we be?", error budgets make the tradeoff concrete: every deployment, every feature flag rollout, every risky infrastructure change spends error budget. When the budget is full, you can move fast. When the budget is depleted, you must slow down and focus on reliability.
Error budget burn rate measures how quickly you are consuming the budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the period. A burn rate of 2x means you'll exhaust it in half the time. Fast burn alerts catch acute incidents; slow burn alerts catch gradual degradation.
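Burn rate is just the observed error rate divided by the error budget; a minimal sketch (hypothetical helper):

```typescript
// Burn rate: how fast the error budget is being consumed.
// burnRate = observedErrorRate / errorBudget, where errorBudget = 1 - SLO target.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  return observedErrorRate / budget;
}

// With a 99.9% SLO the budget is 0.1%. A sustained 1.44% error rate
// burns the budget at 14.4x — the classic fast-burn paging threshold.
console.log(burnRate(0.0144, 0.999).toFixed(1)); // 14.4
console.log(burnRate(0.001, 0.999).toFixed(1));  // 1.0 — consuming exactly on budget
```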
Google SRE recommends a two-tier alerting strategy for error budgets:
Fast burn alert (high severity, page on-call):
- Triggered when: burn rate > 14.4x over a 1-hour window
- Meaning: you're consuming about 2% of a 30-day error budget per hour and will exhaust it in roughly 2 days (720h / 14.4)
- Response: immediate investigation and response
Slow burn alert (lower severity, ticket):
- Triggered when: burn rate > 1x over a 72-hour window
- Meaning: you're consuming budget faster than it replenishes
- Response: investigate root cause, plan improvements
In Prometheus:

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > 0.001 * 14.4  # 14.4x burn on a 0.1% budget
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget fast burn: monthly budget gone in ~2 days at this rate"
```
When the error budget is exhausted, the engineering principle is: stop shipping new features and spend all effort on reliability improvements. This is the organizational forcing function that makes SLOs meaningful rather than aspirational.
Incident Management
Even well-monitored APIs have incidents. The difference between teams that recover quickly and teams that don't is preparation: documented runbooks, clear incident command structure, and a blameless postmortem culture.
Runbook structure for API incidents should be concise and action-oriented. A runbook is not a design document — it is a checklist for a stressed engineer at 3am. Structure:
- Symptoms: What alerts fired? What are users experiencing?
- Initial triage: Which service? Which endpoints? What changed recently?
- Diagnostic commands: Specific queries to run to identify the root cause
- Mitigation steps: Actions to take in order (rollback, scale up, disable feature flag)
- Escalation path: Who to page if this runbook doesn't resolve the incident
Incident command separates the person driving technical resolution from the person managing communication. The incident commander (IC) coordinates work, decides escalation, and makes calls about risk tradeoffs. The communications lead handles stakeholder updates, status page messages, and customer notifications. Without this separation, the technical lead is interrupted by stakeholder requests at exactly the moment they need to focus.
Timeline recording is essential for the postmortem. Use an incident chat channel (Slack, Discord) where every action and observation is timestamped. When did the first alert fire? When did the team start investigating? When was the problem identified? When was the mitigation applied? When did metrics return to normal? This timeline is the foundation of the postmortem.
Blameless postmortems focus on systemic factors, not individual mistakes. The goal is not to find who caused the incident — it is to find the conditions that made the incident possible and likely. Good postmortem questions: Why did the monitoring not catch this sooner? Why was the mitigation so slow? What made this problem hard to diagnose? What process or tool change would prevent recurrence?
MTTR (Mean Time to Recover) measures how long incidents last. Track MTTR over time — a rising MTTR trend indicates that incidents are getting more complex or that your runbooks and tooling are not keeping pace with system complexity. For a comprehensive view of building reliable API infrastructure, see our guides on API gateway patterns and API rate limiting best practices.
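MTTR itself is a simple average over incident durations, which is why the timestamped timeline matters so much; a sketch (the `Incident` shape is hypothetical):

```typescript
// Mean Time to Recover, in minutes, over a list of incidents.
interface Incident {
  startedAt: Date;  // first alert fired
  resolvedAt: Date; // metrics returned to normal
}

function mttrMinutes(incidents: Incident[]): number {
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.resolvedAt.getTime() - i.startedAt.getTime()),
    0,
  );
  return totalMs / incidents.length / 60_000;
}

const incidents: Incident[] = [
  { startedAt: new Date('2026-01-03T02:10:00Z'), resolvedAt: new Date('2026-01-03T02:40:00Z') }, // 30 min
  { startedAt: new Date('2026-02-11T14:00:00Z'), resolvedAt: new Date('2026-02-11T15:30:00Z') }, // 90 min
];
console.log(mttrMinutes(incidents)); // 60 — (30 + 90) / 2 minutes
```

The start and end timestamps come straight from the incident timeline; if the timeline is sloppy, the MTTR trend is noise.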
Conclusion
API performance monitoring is not a one-time setup — it is a continuous practice. The four golden signals (latency, error rate, throughput, saturation) provide the measurement foundation. SLOs and error budgets provide the organizational framework for reliability decisions. OpenTelemetry provides vendor-neutral instrumentation. Synthetic monitoring provides proactive outage detection. And incident management runbooks ensure that when things go wrong, recovery is fast and learning is systematic.
Build monitoring into your API from day one, not as an afterthought. The cost of retroactively instrumenting a complex API is far higher than instrumenting it during initial development. And explore the full API tooling directory for observability tools, APM platforms, and monitoring services that integrate with your stack.
Related: Building an AI Agent in 2026, Building an AI-Powered App: Choosing Your API Stack, Building an API Marketplace