Best Event-Driven Architecture APIs 2026
Services That Talk Directly Are Services That Fail Together
When Service A calls Service B synchronously over HTTP, A fails when B fails. When A's throughput exceeds B's capacity, A queues up requests until B times out. When you need to add Services C and D to receive the same events as B, A needs to know about all of them.
Event-driven architecture decouples these services: A publishes an event to a message broker; B, C, and D subscribe to that event independently. B can go down, restart, and process missed events from the queue. A's throughput is no longer bottlenecked by B's capacity. Adding D doesn't require changing A.
In 2026, four options define the event-driven messaging landscape: Apache Kafka (distributed streaming platform, self-hosted or managed), RabbitMQ (open-source message broker for task queues and routing), AWS SNS/SQS (fully managed cloud messaging for AWS-native architectures), and Confluent Cloud (managed Kafka with additional stream processing).
TL;DR
AWS SQS/SNS is the right default for AWS-native architectures — fully managed, $0.40-$0.50/million messages, no infrastructure to operate, and deep integration with every other AWS service. Apache Kafka is the right choice for high-throughput event streaming where message replay, log compaction, and consumer groups at scale matter — self-host for control, use Amazon MSK or Confluent for managed options. RabbitMQ is purpose-built for task queues and complex routing (different messages to different queues based on routing keys) — simpler than Kafka for traditional queue workloads. Confluent Cloud adds stream processing (ksqlDB, Flink) on top of Kafka for applications that need to transform and aggregate streams.
Key Takeaways
- AWS SQS charges $0.40/million requests — simple queue semantics, at-least-once delivery, 14-day message retention max.
- AWS SNS charges $0.50/million notifications — fan-out to multiple SQS queues, HTTP endpoints, Lambda, email, and mobile push.
- Apache Kafka is free to self-host — significant operational overhead (ZooKeeper/KRaft, replication, partition management). Amazon MSK starts at ~$0.20/broker-hour.
- Confluent Cloud starts at $0.10/GB for Kafka storage plus $0.50 per CKU-hour (Confluent Kafka Unit) for throughput — managed Kafka with ksqlDB for stream processing.
- RabbitMQ is open-source (MPL 2.0) and free to self-host. CloudAMQP (managed RabbitMQ) starts at $19/month.
- Kafka retains messages by default (configurable retention period) — consumers can replay past events, which queues like SQS don't support after acknowledgment.
- Fan-out patterns: SNS + SQS fan-out (one SNS topic → multiple SQS queues) is the AWS-native alternative to Kafka consumer groups.
Pricing Comparison
| Platform | Pricing | Managed Option | Self-Host |
|---|---|---|---|
| AWS SQS | $0.40/million requests | Yes (fully managed) | No |
| AWS SNS | $0.50/million notifications | Yes (fully managed) | No |
| Apache Kafka | Free (self-host) | MSK: $0.20/broker-hour | Yes |
| Confluent Cloud | $0.10/GB + $0.50/CKU-hour | Yes | No (Kafka itself is) |
| RabbitMQ | Free (self-host) | CloudAMQP: $19/month | Yes |
AWS SNS/SQS
Best for: AWS-native architectures, event fan-out, simple queues, task distribution, zero ops
AWS SNS (Simple Notification Service) and SQS (Simple Queue Service) are the fully managed AWS-native messaging pair. SNS handles fan-out (one message → many subscribers); SQS handles point-to-point task queuing. The standard pattern: SNS topic → multiple SQS queues → Lambda or ECS workers.
Pricing
- SQS Standard: $0.40/million requests (first 1M/month free)
- SQS FIFO: $0.50/million requests (guaranteed ordering, exactly-once processing via deduplication)
- SNS: $0.50/million notifications (first 1M free)
SNS Fan-Out Pattern
```python
import boto3
import json

sns = boto3.client("sns", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Publish to SNS — delivers to all subscribers simultaneously
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789:order-events",
    Message=json.dumps({
        "event_type": "order.created",
        "order_id": "ord_abc123",
        "user_id": "user_456",
        "amount": 99.99,
        "items": [{"sku": "WIDGET-001", "qty": 2}],
    }),
    MessageAttributes={
        "event_type": {
            "DataType": "String",
            "StringValue": "order.created",
        }
    },
)

# Multiple SQS queues subscribed to this topic:
# - order-fulfillment-queue (warehouse processing)
# - order-analytics-queue (metrics recording)
# - order-email-queue (customer notification)
# - order-fraud-check-queue (fraud detection)
```
SQS Worker (Lambda)
```python
# Lambda triggered by SQS — processes tasks from queue
import json

def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["body"])
        order_id = message["order_id"]
        # Process the order
        fulfill_order(order_id, message["items"])
    # Messages are automatically deleted on successful return
    # On exception: messages return to the queue after the visibility timeout
    return {"statusCode": 200, "body": "Processed"}
```
Dead Letter Queue
```python
# Configure DLQ for failed messages
queue = sqs.create_queue(
    QueueName="order-fulfillment",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:...:order-fulfillment-dlq",
            "maxReceiveCount": "3",  # Move to DLQ after 3 failed attempts
        }),
        "VisibilityTimeout": "300",  # 5 minutes to process before re-delivery
    },
)
```
When to Choose AWS SNS/SQS
AWS-native applications where Lambda or ECS workers process events, fan-out patterns where one event needs to trigger multiple independent downstream processes, or teams that want zero infrastructure management at the cost of Kafka's replay and consumer group capabilities.
Apache Kafka
Best for: High-throughput event streaming, message replay, audit logs, stream processing, polyglot architectures
Apache Kafka is not a message queue — it's a distributed event log. Messages are written to partitioned, replicated topics and retained for a configurable period (days, weeks, or indefinitely). Multiple consumer groups can independently read from the same topic at different offsets — each group maintains its own position in the log.
Kafka vs. SQS: Core Difference
- SQS: Message disappears after acknowledgment — no replay
- Kafka: Message persists in the log — any consumer group can replay from any offset
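The difference can be made concrete with a toy in-memory log (plain Python, not a real Kafka client): the log retains every event, each consumer group is just a cursor into it, and rewinding a cursor replays history. The class and method names here are illustrative only.

```python
# Toy illustration of Kafka's offset model, not a real broker:
# the log retains every event; each group only tracks a cursor.
class EventLog:
    def __init__(self):
        self.events = []   # append-only, retained after consumption
        self.offsets = {}  # group id -> next offset to read

    def produce(self, event):
        self.events.append(event)

    def poll(self, group_id):
        offset = self.offsets.get(group_id, 0)
        if offset >= len(self.events):
            return None
        self.offsets[group_id] = offset + 1  # "commit" the next offset
        return self.events[offset]

    def seek(self, group_id, offset):
        # Replay: rewind this group's cursor; other groups are untouched
        self.offsets[group_id] = offset

log = EventLog()
for i in range(3):
    log.produce({"order_id": f"ord_{i}"})

# Two groups read the same events independently
fulfillment = [log.poll("fulfillment") for _ in range(3)]
analytics = [log.poll("analytics") for _ in range(3)]

# Replay from the beginning for one group only
log.seek("analytics", 0)
replayed = log.poll("analytics")
```

In SQS terms, `poll` plus acknowledgment would delete the event for everyone; here it only advances one group's cursor, which is why replay is possible.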
Self-Hosted Configuration (Docker)
```yaml
# docker-compose.yml — single-node Kafka with KRaft (no ZooKeeper)
version: "3.8"
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      CLUSTER_ID: your-cluster-id  # Generate one with: kafka-storage random-uuid
```
Producer (Python)
```python
from confluent_kafka import Producer
import json

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "acks": "all",  # Wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,  # Idempotent producer: no duplicates on retry
})

def delivery_report(err, msg):
    if err is not None:
        print(f"Message delivery failed: {err}")
    else:
        print(f"Message delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

# Produce event
producer.produce(
    "order-events",
    key="ord_abc123",
    value=json.dumps({
        "event_type": "order.created",
        "order_id": "ord_abc123",
        "user_id": "user_456",
        "amount": 99.99,
    }),
    callback=delivery_report,
)
producer.flush()  # Wait for all messages to be delivered
```
Consumer Groups
```python
from confluent_kafka import Consumer
import json

# Consumer group for order fulfillment
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "order-fulfillment-service",  # Unique per consumer group
    "auto.offset.reset": "earliest",  # Start from the beginning for a new group
    "enable.auto.commit": False,  # Commit manually after processing (at-least-once)
})
consumer.subscribe(["order-events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue
    event = json.loads(msg.value())
    # Process event
    process_order(event["order_id"])
    # Commit offset only after successful processing
    consumer.commit(message=msg)
```
When to Choose Kafka
High-throughput event streaming (millions of events/day), event sourcing where the full event history is the source of truth, audit logging requiring indefinite message retention, stream processing with consumer group replay, or polyglot architectures where Go, Python, Java, and Node.js services all consume from the same topics.
RabbitMQ
Best for: Task queues, complex routing, work distribution, traditional queue patterns
RabbitMQ is the open-source message broker for workloads that fit the traditional queue model — task queues, work distribution, and complex routing (different messages to different queues based on message attributes or routing keys). The AMQP protocol and exchange/binding model give fine-grained control over message routing.
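In topic exchanges, `*` matches exactly one dot-separated word and `#` matches zero or more. RabbitMQ evaluates these patterns on the broker; the sketch below re-implements the matching semantics in Python purely for illustration.

```python
def topic_matches(pattern: str, key: str) -> bool:
    """Illustrative AMQP topic-exchange matching:
    '*' matches exactly one word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k          # pattern exhausted: match iff key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' absorbs zero or more words
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False
    return match(pattern.split("."), key.split("."))
```

So the binding `order.created.*` from the example below matches `order.created.us` but not `order.created`, while `order.#` matches `order` and any deeper key.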
Pricing
- Self-hosted: Free (MPL 2.0)
- CloudAMQP (managed): $19/month (Lemur plan, shared cluster)
Producer/Consumer (Python with Pika)
```python
import pika
import json

# Connection
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq")
)
channel = connection.channel()

# Declare exchange and queue
channel.exchange_declare(exchange="orders", exchange_type="topic", durable=True)
channel.queue_declare(queue="order.fulfillment", durable=True)

# Bind queue to exchange with routing key pattern
channel.queue_bind(
    exchange="orders",
    queue="order.fulfillment",
    routing_key="order.created.*",  # Match order.created.us, order.created.eu, etc.
)

# Publish message
channel.basic_publish(
    exchange="orders",
    routing_key="order.created.us",
    body=json.dumps({"order_id": "ord_123", "region": "us"}),
    properties=pika.BasicProperties(delivery_mode=2),  # Persistent
)

# Consumer with manual ack
def process_order(ch, method, properties, body):
    order = json.loads(body)
    fulfill_order(order["order_id"])
    ch.basic_ack(delivery_tag=method.delivery_tag)  # Acknowledge success

channel.basic_qos(prefetch_count=1)  # One unacked message at a time per worker
channel.basic_consume(queue="order.fulfillment", on_message_callback=process_order)
channel.start_consuming()
```
When to Choose RabbitMQ
Traditional task queue patterns (job queues, email sending, image processing), applications requiring complex message routing (different queue for different message types/regions), or teams coming from PHP/Python/Ruby backgrounds where AMQP is the established pattern.
Decision Framework
| Requirement | AWS SNS/SQS | Kafka | RabbitMQ |
|---|---|---|---|
| Zero infrastructure ops | Yes | No | No |
| Message replay | No | Yes | No |
| Fan-out to many consumers | Yes (SNS) | Yes (consumer groups) | Yes (exchanges) |
| High throughput (10M+/day) | Yes | Yes | Yes |
| Complex routing | Limited | No | Yes (exchanges) |
| Stream processing | No | Yes (ksqlDB, Flink) | No |
| AWS-native integration | Best | MSK | Limited (Amazon MQ) |
| Self-hosted option | No | Yes | Yes |
| Exactly-once delivery | FIFO only | With transactions | No (at-least-once with confirms) |
Event Schema Design and the CloudEvents Standard
Event-driven systems fail in predictable ways when events don't have consistent, well-defined schemas. A payment event published by the payments service needs to be understood by the fulfillment service, the analytics service, and the fraud detection service — independently, without coordination. Without a schema standard and registry, each consumer builds a fragile implicit understanding of what fields to expect.
CloudEvents is the CNCF-standardized event envelope format — a common set of metadata attributes that wrap your event payload. The required attributes are: specversion (always "1.0"), id (unique event identifier), source (producing service), and type (event category, e.g., "com.example.orders.created"). Optional attributes include datacontenttype (typically "application/json"), time (ISO 8601 timestamp), and subject (the resource the event is about, e.g., the order ID).
Kafka, EventBridge, and most managed messaging platforms support CloudEvents natively or via adapter. Using CloudEvents means consumers can introspect event type without parsing the payload — a significant advantage for routing and dead letter queue analysis where you need to understand event category without deserializing every message body.
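A minimal sketch of wrapping a payload in a CloudEvents 1.0 envelope, using plain dicts rather than an SDK (the `to_cloudevent` and `route` helpers are illustrative, not a library API):

```python
import uuid
from datetime import datetime, timezone

def to_cloudevent(event_type: str, source: str, data: dict) -> dict:
    # Required CloudEvents 1.0 attributes: specversion, id, source, type
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "datacontenttype": "application/json",  # optional but recommended
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }

event = to_cloudevent(
    "com.example.orders.created",
    "/services/orders",
    {"order_id": "ord_abc123", "amount": 99.99},
)

# A router or DLQ analyzer can dispatch on the envelope's "type"
# without ever deserializing the payload under "data"
def route(envelope: dict) -> str:
    if envelope["type"].endswith(".created"):
        return "fulfillment-queue"
    return "dlq-analysis"
```

The payload stays opaque to infrastructure code, which is exactly the property that makes envelope-level routing and dead letter triage cheap.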
A schema registry is the production complement to CloudEvents. Confluent Schema Registry (for Kafka) stores Avro, JSON Schema, or Protobuf schemas keyed by subject (typically the topic name plus a key or value suffix). Producers register a schema before publishing; consumers fetch the schema by the ID embedded in each message. Schema evolution rules (backward, forward, or full compatibility) enforce that schema changes don't silently break consumers — under backward compatibility, adding a required field without a default is rejected, because consumers on the new schema could no longer read data written with the old one.
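The compatibility direction is easy to get backwards, so here is a deliberately simplified check: under backward compatibility, a consumer on the new schema must still be able to read data written with the old one, which permits dropping a field but rejects adding a required one. Real registries compare full Avro/JSON Schema/Protobuf structures; this toy treats a schema as a flat field-to-type dict.

```python
# Toy backward-compatibility check. Simplification: every field is
# required, and a schema is just {field_name: type_name}. Real
# registries (e.g. Confluent Schema Registry) compare full schema trees
# and understand defaults, optional fields, and unions.
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    for field, ftype in new_schema.items():
        if field not in old_schema:
            return False  # new required field: old data doesn't carry it
        if old_schema[field] != ftype:
            return False  # type change breaks deserialization of old data
    return True

v1 = {"order_id": "string", "amount": "double"}
v2_drop = {"order_id": "string"}              # field removed: old data still readable
v2_add = {**v1, "currency": "string"}         # new required field: rejected
```

Forward compatibility is the mirror image (old consumers must read new data), and "full" enforces both directions at once.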
For AWS SQS/SNS architectures, AWS EventBridge Schema Registry performs the same function — auto-discovering event schemas from EventBridge traffic and generating typed code bindings in Python, TypeScript, and Java. Event schemas in EventBridge are versioned and discoverable, reducing the "what fields does this event have" coordination overhead between teams.
Event naming conventions matter as much as schema structure. Use past tense (OrderPlaced, PaymentFailed, InventoryReserved) rather than imperative commands (PlaceOrder, FailPayment) for events that describe something that happened. Include the domain and the action: com.company.orders.placed is better than order_event — parseable, searchable, and self-documenting in dead letter queues and monitoring dashboards.
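A convention only helps if it's enforced. A small illustrative helper (not part of any library) can validate the reverse-DNS form at publish time and split it into the domain and action parts used by dashboards and DLQ triage:

```python
import re

# Illustrative validator for reverse-DNS event types like
# "com.company.orders.placed": at least two dotted domain segments,
# then a final action segment.
EVENT_TYPE = re.compile(
    r"^(?P<domain>[a-z][a-z0-9]*(\.[a-z][a-z0-9]*)+)\.(?P<action>[a-z][a-z0-9]*)$"
)

def parse_event_type(event_type: str) -> tuple[str, str]:
    """Return (domain, action), or raise ValueError for names
    that don't follow the convention (e.g. "order_event")."""
    m = EVENT_TYPE.match(event_type)
    if m is None:
        raise ValueError(f"bad event type: {event_type!r}")
    return m.group("domain"), m.group("action")
```

Rejecting `order_event` at publish time is much cheaper than discovering three incompatible naming styles in a DLQ during an incident.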
Dead Letter Queues and Failure Handling
Every message broker eventually delivers an unprocessable message — a malformed payload, a consumer bug, a downstream dependency that's unavailable. Without explicit failure handling, these messages either block the queue, get silently dropped, or cause unbounded retry loops.
AWS SQS has native Dead Letter Queue (DLQ) support. Configure a maxReceiveCount (e.g., 5) and a DLQ target queue; after a message fails maxReceiveCount times, SQS moves it to the DLQ automatically. The DLQ retains the original message body, and you can inspect failed messages in the SQS console or via the AWS CLI. The critical operational pattern: set a CloudWatch alarm on DLQ depth that fires when any message appears — a non-empty DLQ is always an action item.
Kafka doesn't have native DLQ semantics (messages are retained by time/size, not by consumer state). The idiomatic Kafka pattern is consumer-side retry logic: when a message fails processing, publish it to a dedicated retry topic (with a backoff delay implemented via scheduled processing or a separate slow-consumer worker), and after N retries, publish to a dead letter topic. This requires more application code than SQS's built-in DLQ, but provides replay capabilities — Kafka's retention means you can replay the DLQ topic and reprocess all failed messages once a bug is fixed.
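The retry routing described above reduces to a pure decision function: given a failed message's attempt count, pick the next topic and a backoff delay. The topic naming scheme, retry cap, and backoff schedule below are conventions assumed for illustration, not a Kafka feature.

```python
# Sketch of consumer-side retry/DLQ routing for Kafka. The
# ".retry.N" / ".dlq" topic names and the 10s-base exponential
# backoff are assumed conventions, not anything Kafka provides.
MAX_RETRIES = 3

def next_destination(topic_base: str, attempt: int) -> tuple[str, int]:
    """Return (destination topic, delay in seconds) after a failure
    on the given attempt (0 = first processing attempt failed)."""
    if attempt >= MAX_RETRIES:
        return f"{topic_base}.dlq", 0       # give up: park in dead letter topic
    delay = 2 ** attempt * 10               # 10s, 20s, 40s
    return f"{topic_base}.retry.{attempt + 1}", delay
```

A slow consumer on each retry topic sleeps for the delay before reprocessing; because the DLQ is itself a retained Kafka topic, it can be replayed wholesale once the underlying bug is fixed.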
RabbitMQ implements Dead Letter Exchanges (DLX) at the queue level. Configure x-dead-letter-exchange and x-message-ttl per queue; messages that expire or are rejected (nacked without requeue) are automatically routed to the DLX. The DLX can route failed messages by their original routing key to a monitoring queue.
Regardless of the platform, the operational discipline for DLQ handling matters as much as the technical setup. Teams that configure a DLQ and then never check it are no better off than teams without one. DLQ contents must flow into alerting — a message in the DLQ represents a failed business operation (an unfulfilled order, an unsent notification, an unbilled event) that needs human attention. Automated DLQ replay tooling, where failed messages can be reprocessed in bulk once the underlying bug is fixed, dramatically reduces the operational burden of incidents that produce large DLQ backlogs.
Verdict
AWS SNS/SQS is the default for AWS-native architectures — zero infrastructure, deep AWS service integration, and simple pricing. The fan-out pattern (SNS → multiple SQS) covers most decoupling requirements without Kafka's operational complexity.
Apache Kafka is the right choice when message replay, event sourcing, or high-throughput consumer groups are requirements. The operational overhead is real — use Amazon MSK or Confluent Cloud for managed options if the self-hosted operational load is prohibitive.
RabbitMQ fills the niche for traditional queue workloads and complex routing patterns — mature, stable, and the right tool when AMQP's exchange/binding model matches your message routing requirements.
A practical decision heuristic: if you need message replay (the ability to re-consume past events), use Kafka. If replay is not a requirement and you're in AWS, use SQS/SNS. If you're on-prem or need complex routing rules without the AWS ecosystem, use RabbitMQ. The Kafka-vs-SQS decision for AWS teams often comes down to throughput and ordering requirements — SQS FIFO queues provide per-message-group ordering at up to 3,000 messages/second with batching (300/second without), which is sufficient for most application-level event workloads. Kafka's per-partition ordering and near-unlimited throughput matter primarily for high-velocity event streams (IoT telemetry, user activity tracking at scale, financial market data) where SQS throughput or cost ($0.40/million requests × billions of events) becomes a constraint. Most SaaS applications never approach the scale where Kafka's throughput advantage over SQS justifies the operational overhead.
Compare event messaging platform pricing, throughput capacity, and integration documentation at APIScout — find the right messaging infrastructure for your event-driven system.
Related: Event-Driven APIs: Webhooks, WebSockets & SSE, Motia API Workflows: Steps & Events 2026, Motia: Event-Driven API Workflows in 2026