Every decision workflow in Union Architecture eventually hits a fork: should we process tasks through a queue, or react to events as they occur? The answer shapes latency, reliability, and how easily the system evolves. This guide compares both patterns at a conceptual level, using examples from typical Union deployments — moderation pipelines, scoring engines, and stateful validators. We'll avoid dogma and focus on the trade-offs that matter when you're the one wiring the components together.
Why This Choice Matters Now
Union Architecture encourages loose coupling between services, but loose coupling doesn't mean zero coordination. Decision workflows — sequences of checks, transformations, and approvals — are where coupling hides. A queue-based approach imposes a predictable order; an event-driven approach lets services react independently. The choice affects how you handle failures, scale under load, and add new decision stages later.
Consider a content moderation pipeline: incoming posts need to pass through profanity filters, image checks, and human review. If you model this as a queue, each post waits in line for each stage. If you model it as events, each stage subscribes to a 'post submitted' event and publishes its own result. Both work, but they behave very differently under load spikes, partial failures, or when you need to add a new check mid-stream.
Teams that pick the wrong pattern often discover the mismatch during production incidents. A queue that becomes a bottleneck. An event storm that overwhelms downstream services. A missing event that leaves a decision hanging. Understanding the core mechanisms — and the edge cases — is the only way to make an informed choice.
The Cost of Getting It Wrong
In one composite scenario, a team built an event-driven approval workflow for financial transactions. Each step published an event, and downstream listeners updated the transaction state. It worked well until a network partition caused duplicate events, and some transactions were approved twice. The team had to retrofit idempotency keys, a safeguard that any at-least-once delivery mechanism needs, queues included. This isn't a knock on event-driven design; it's a reminder that each pattern carries implicit guarantees that you must understand upfront.
Core Idea in Plain Language
Queue-based workflows treat each decision as a discrete job that sits in a line until a worker picks it up. The queue owns the state: it knows which jobs are pending, in progress, or failed. Workers pull jobs when they're ready, and the queue can retry failed jobs automatically. This pattern gives you ordering (first-in, first-out on many brokers, though some offer only best-effort ordering) and built-in backpressure: if workers are slow, the queue grows and jobs wait instead of being lost, within the queue's size and retention limits.
Event-driven workflows, by contrast, treat decisions as reactions to events. When a service publishes an event (e.g., 'order placed'), any interested service can consume it and act. There is no central queue; each consumer manages its own state. Events are typically broadcast to multiple subscribers, enabling parallel processing. But ordering is not guaranteed unless you use partitioned event streams (like Kafka topics with keys), and backpressure is your responsibility — if a consumer falls behind, events may be dropped or buffered at the broker.
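To make the contrast concrete, here is a minimal in-memory sketch, not tied to any real broker. The names (`jobs`, `subscribe`, `publish`, the worker and topic names) are illustrative assumptions. It shows the one behavioral difference that matters most: a queued job goes to exactly one worker, while a published event reaches every subscriber.

```python
from collections import deque, defaultdict

# Queue: each job is handed to exactly one worker (competing consumers).
jobs = deque(["job-1", "job-2", "job-3"])
workers = ["worker-a", "worker-b"]   # hypothetical worker names
handled_by = {}
turn = 0
while jobs:
    job = jobs.popleft()             # FIFO: the oldest job leaves first
    handled_by[job] = workers[turn % len(workers)]
    turn += 1

# Event bus: every subscriber of a topic sees every event (fan-out).
subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    for handler in subscribers[topic]:
        handler(event)

seen_by_audit, seen_by_email = [], []
subscribe("order_placed", seen_by_audit.append)
subscribe("order_placed", seen_by_email.append)
publish("order_placed", {"order_id": 42})
```

After this runs, each job appears once in `handled_by`, while both `seen_by_audit` and `seen_by_email` hold the same event.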
Real-World Analogies
Think of a queue as a conveyor belt in a factory. Each item moves at the belt's pace; workers at each station pick items off the belt, process them, and put them back. The belt ensures order and prevents items from piling up at any station. An event-driven system is more like a newsroom: a story breaks, editors, fact-checkers, and graphic designers all react independently. The story gets covered quickly, but without coordination, two editors might write the same piece, or a fact-check might arrive after the graphic is finalized.
Key Differences at a Glance
- State ownership: Queue owns the job state; in event-driven, each consumer owns its state.
- Ordering: Queue typically guarantees FIFO; event-driven requires partitioning for ordering.
- Backpressure: Queue naturally throttles producers; event-driven needs explicit flow control.
- Scalability: Queue scales workers; event-driven scales consumers and partitions.
- Failure handling: Queue can retry with dead-letter queues; event-driven needs event replay or compensating actions.
How It Works Under the Hood
Let's look at the internals of both patterns in a Union Architecture context. In a queue-based workflow, you typically set up a message broker (like RabbitMQ or Amazon SQS) and define a queue per decision stage. A producer publishes a message with the decision context (e.g., a user ID and a 'verify identity' command). Workers poll the queue, process the decision, and either publish a new message to the next queue or write the result to a database. If a worker crashes mid-process, the message becomes visible again after a timeout, and another worker picks it up. This gives you at-least-once processing by default.
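The redelivery behavior described above can be sketched with a toy queue that hides a message for a fixed period after it is received. `VisibilityQueue` and its methods are hypothetical names for illustration, not the API of RabbitMQ or SQS, and time is passed in explicitly so the example stays deterministic.

```python
import itertools

class VisibilityQueue:
    """Toy queue: a received message is hidden until its timeout expires."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}          # message id -> body
        self._invisible_until = {}   # message id -> time it becomes visible again
        self._ids = itertools.count(1)

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._invisible_until[msg_id] = 0.0
        return msg_id

    def receive(self, now):
        # Return the oldest message that is currently visible, and hide it.
        for msg_id in sorted(self._messages):
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, self._messages[msg_id]
        return None

    def ack(self, msg_id):
        # Acknowledging removes the message for good.
        self._messages.pop(msg_id, None)
        self._invisible_until.pop(msg_id, None)

queue = VisibilityQueue(visibility_timeout=30.0)
queue.send({"user_id": 7, "command": "verify identity"})

first = queue.receive(now=0.0)        # worker A takes the message, then crashes
hidden = queue.receive(now=10.0)      # still invisible: nobody else can take it
second = queue.receive(now=31.0)      # timeout elapsed: redelivered to worker B
queue.ack(second[0])                  # worker B finishes and acknowledges
after_ack = queue.receive(now=100.0)  # nothing left to deliver
```

The same message is delivered twice (`first` and `second`), which is exactly why at-least-once processing requires handlers that tolerate repeats.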
In an event-driven workflow, you use an event bus or log (like Kafka or EventBridge; Redis pub/sub is a lighter option, but it does not retain messages for later replay). Services emit events when something happens: 'user registered', 'payment succeeded', 'risk score computed'. Downstream services subscribe to the events they care about. On log-based brokers such as Kafka, each subscriber maintains its own offset or cursor, so it can replay events if needed. Because events are broadcast, multiple subscribers can process the same event independently, which is useful for parallel tasks like logging, analytics, and triggering notifications.
State and Idempotency
Queue-based workflows naturally track state in the queue itself. The queue knows which messages are unacked, and workers can signal success or failure. Event-driven workflows shift state management to each consumer. If a consumer needs to ensure it doesn't process the same event twice, it must store processed event IDs and check them — a pattern called idempotent consumer. This adds complexity but also flexibility: consumers can be stateless for pure reactions, or stateful for accumulations.
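A minimal sketch of the idempotent consumer pattern follows. It assumes each event carries a unique `event_id` field (a hypothetical name), and it uses an in-memory set where a real deployment would need a durable store.

```python
class IdempotentConsumer:
    """Skips events whose ID has already been processed."""

    def __init__(self):
        self.processed_ids = set()   # stand-in for a durable deduplication store
        self.approved = []

    def handle(self, event):
        if event["event_id"] in self.processed_ids:
            return False             # duplicate delivery: do nothing
        self.approved.append(event["transaction_id"])
        self.processed_ids.add(event["event_id"])
        return True

consumer = IdempotentConsumer()
event = {"event_id": "evt-1", "transaction_id": "txn-9"}
first_result = consumer.handle(event)
second_result = consumer.handle(event)   # the broker redelivers the same event
```

In a real system the side effect and the record of the processed ID should be committed together (for example, in one database transaction); otherwise a crash between the two steps reintroduces the duplicate.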
Scaling Patterns
Queues scale by adding more workers. The queue acts as a shock absorber — if traffic spikes, the queue depth increases, and workers catch up when they can. Event-driven systems scale by adding more partitions (in Kafka) or more consumer instances. But scaling consumers is trickier: if you add too many consumers for a partition, only one consumer per partition is active; the rest sit idle. You need to balance the number of partitions with expected throughput.
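The idle-consumer effect is easy to see with a simplified round-robin assignment. This is an illustration only, not Kafka's actual assignor, and `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers; each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Three partitions, five consumers in the same group.
assignment = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
idle = [consumer for consumer, owned in assignment.items() if not owned]
```

With three partitions and five consumers, `idle` contains the two consumers that received nothing, so adding instances beyond the partition count buys no extra throughput.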
Worked Example: A Compliance Checker
Imagine you're building a compliance checker for a financial platform. Each transaction must pass three checks: AML screening, sanctions list lookup, and fraud scoring. The checks are independent, but the results must be combined before the transaction is approved.
Queue-based approach: You create three queues — one per check. A coordinator service publishes a 'check transaction' message to all three queues simultaneously. Each queue has a worker pool that processes the check and writes the result to a shared database. The coordinator polls the database until all three results are available, then makes the final decision. This works, but the coordinator becomes a bottleneck and adds latency (polling interval).
Event-driven approach: A 'transaction submitted' event is published to a topic. Three services subscribe: AML, sanctions, fraud. Each processes independently and publishes its own event ('aml_cleared', 'sanctions_cleared', 'fraud_cleared'). A 'decision aggregator' service subscribes to all three events and, once it has received all three for a transaction, publishes the final approval or rejection. This is more responsive — the aggregator reacts immediately to each event — but you must handle the case where one event never arrives (timeout logic) or arrives after a delay (late-arriving data).
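The aggregator's bookkeeping, including the timeout case, might look like the sketch below. The class and method names are assumptions, the "needs_review" fallback is an illustrative choice, and only the three "cleared" events from the example are modeled; a real aggregator would also handle rejection events.

```python
REQUIRED = {"aml_cleared", "sanctions_cleared", "fraud_cleared"}

class DecisionAggregator:
    """Collects check results per transaction and decides once all have arrived."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}     # transaction id -> {"started": time, "seen": set()}
        self.decisions = {}   # transaction id -> final decision

    def on_submitted(self, txn_id, now):
        self.pending[txn_id] = {"started": now, "seen": set()}

    def on_check_event(self, txn_id, event_type):
        state = self.pending.get(txn_id)
        if state is None:
            return            # late or unknown event: the decision is already made
        state["seen"].add(event_type)
        if state["seen"] >= REQUIRED:
            self.decisions[txn_id] = "approved"
            del self.pending[txn_id]

    def expire(self, now):
        # Called periodically: anything still pending past the timeout is escalated.
        for txn_id, state in list(self.pending.items()):
            if now - state["started"] > self.timeout_seconds:
                self.decisions[txn_id] = "needs_review"
                del self.pending[txn_id]

agg = DecisionAggregator(timeout_seconds=60)
agg.on_submitted("t1", now=0)
agg.on_submitted("t2", now=0)
for event_type in sorted(REQUIRED):
    agg.on_check_event("t1", event_type)
agg.on_check_event("t2", "aml_cleared")
agg.expire(now=61)                         # t2 never received all three results
agg.on_check_event("t2", "fraud_cleared")  # arrives late and is ignored
```

Here `t1` ends up approved and `t2` is escalated, which covers both the missing-event and the late-arrival cases mentioned above.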
Trade-offs in This Example
The queue-based approach is simpler to reason about: you know each check runs at least once (retries can cause repeats, so the checks should be idempotent), and the coordinator is the single place where results are combined. But it's slower and has a single point of coordination. The event-driven approach is faster and avoids the polling coordinator, although the aggregator is still a component that must stay available, and you need to handle partial results, duplicates, and eventual consistency. In practice, many teams choose a hybrid: use queues for the critical checks (AML, sanctions) and events for less critical ones (fraud scoring), with a timeout-based fallback.
Edge Cases and Exceptions
Both patterns break in predictable ways. Here are the edge cases you should design for.
Queue Poison Messages
A message that always fails, due to corrupt data or a deterministic bug, will be retried indefinitely and can block the messages behind it. Solutions include dead-letter queues (move the message aside after N retries) and poison message detection (track failure counts and skip). In event-driven systems, a failing consumer can skip the event or park it on a retry topic so that other events keep flowing. Be aware that on partitioned logs, a consumer that refuses to advance past a bad event blocks everything behind it in that partition, so the skip-or-park decision still has to be made explicitly.
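A dead-letter policy can be sketched in a few lines. The `drain` function, the `MAX_ATTEMPTS` value, and the use of `ValueError` as the failure signal are all illustrative assumptions; real brokers usually track the delivery count for you.

```python
from collections import deque

MAX_ATTEMPTS = 3   # hypothetical retry budget

def drain(queue, handler, max_attempts=MAX_ATTEMPTS):
    """Process every message; park the ones that keep failing."""
    done, dead_letters, attempts = [], [], {}
    while queue:
        message = queue.popleft()
        try:
            handler(message)
            done.append(message)
        except ValueError:
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] >= max_attempts:
                dead_letters.append(message)   # give up: move it aside
            else:
                queue.append(message)          # requeue at the back and carry on
    return done, dead_letters

def handler(message):
    if message == "corrupt":
        raise ValueError("cannot parse message")

done, dead = drain(deque(["a", "corrupt", "b"]), handler)
```

The healthy messages complete, and the corrupt one lands in `dead` after three attempts instead of blocking the line.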
Event Ordering and Causality
In event-driven systems, if you need strict ordering (e.g., 'user created' before 'user updated'), you must partition events by a key (e.g., user ID) so all events for that user go to the same partition. Even then, if a consumer crashes and restarts, it resumes from its last committed offset and may re-apply events it has already handled, so handlers need to tolerate replays. Queue-based systems preserve order within a single queue as long as one consumer processes messages at a time; competing consumers and redeliveries can still reorder processing. And if you need to fan out to multiple queues, ordering across queues is lost.
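Key-based partitioning can be illustrated with a stable hash. This sketch uses CRC32 purely for demonstration (Kafka's default partitioner uses a different hash), and `partition_for` is a hypothetical helper.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

partitions = {index: [] for index in range(NUM_PARTITIONS)}
events = [
    ("user-1", "user created"),
    ("user-2", "user created"),
    ("user-1", "user updated"),
]
for key, event_type in events:
    partitions[partition_for(key)].append((key, event_type))

# Within its partition, user-1's events keep their publish order.
user_1_events = [e for e in partitions[partition_for("user-1")] if e[0] == "user-1"]
```

Ordering holds per key only: nothing here says whether user-2's event is processed before or after user-1's.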
Backpressure in Event-Driven Systems
When a consumer falls behind, events pile up in the broker. If the broker's retention limit is exceeded, events are dropped. You can mitigate this by increasing retention, using consumer groups with more instances, or implementing backpressure signals (e.g., the consumer tells the producer to slow down via a separate channel). Queues handle backpressure naturally: producers can block when the queue is full, or you can configure a max queue size and reject new messages.
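The "max queue size, reject new messages" option can be demonstrated with Python's standard `queue.Queue`, standing in here for a broker-side limit. The job names are placeholders.

```python
import queue

buffer = queue.Queue(maxsize=2)   # bounded: at most two jobs may wait
accepted, rejected = [], []

for job in ["job-1", "job-2", "job-3"]:
    try:
        buffer.put_nowait(job)    # raises queue.Full instead of blocking
        accepted.append(job)
    except queue.Full:
        rejected.append(job)      # the producer learns it must slow down or retry

finished = buffer.get_nowait()    # a worker completes one job, freeing a slot
buffer.put_nowait("job-3")        # the retry now succeeds
```

Using `put` with a timeout instead of `put_nowait` gives the blocking variant, where the producer simply waits for capacity.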
Duplicate Events and At-Least-Once Semantics
Both patterns can produce duplicates. Queues can redeliver messages if a worker crashes after processing but before acking. Event buses can re-send events if the broker fails. In queue-based systems, workers must implement idempotency (e.g., checking a processed flag in the database). In event-driven systems, consumers must track processed event IDs. The difference is that in queues, duplicates are rare and usually caused by worker crashes; in event-driven systems, duplicates can be more frequent due to broker retries and consumer rebalancing.
Limits of the Approach
No pattern is a silver bullet. Queue-based workflows struggle with workflows that require dynamic routing (e.g., 'if risk score > 0.8, send to human review; otherwise, auto-approve'). You can implement conditional routing by having workers inspect the message and publish to different queues, but this logic lives in the worker, not the infrastructure. Event-driven workflows handle this naturally — a service can publish different events based on conditions, and subscribers filter accordingly.
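In the queue-based variant, the routing rule from the example lives in worker code along these lines. The queue names, the `risk_score` field, and the `route` function are assumptions for illustration.

```python
from collections import deque

# Hypothetical downstream queues, one per outcome.
queues = {"human_review": deque(), "auto_approve": deque()}

def route(message, threshold=0.8):
    """Worker-side routing: inspect the message and pick the next queue."""
    target = "human_review" if message["risk_score"] > threshold else "auto_approve"
    queues[target].append(message)
    return target

high = route({"id": 1, "risk_score": 0.93})
borderline = route({"id": 2, "risk_score": 0.8})   # not strictly above the threshold
```

Changing the threshold or adding a third outcome means redeploying the worker, which is the coupling the paragraph above warns about.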
Event-driven workflows, for their part, struggle with workflows that need a global view of state. For example, a 'session timeout' decision requires knowing the last activity timestamp across multiple services. In a queue-based system, you could have a single worker that processes all session-related messages and maintains that state in one place (at the cost of losing it if the worker restarts, unless it is persisted). In an event-driven system, you'd need a state store (like a database or a stream processor) that aggregates events from multiple sources.
Operational Complexity
Queues are operationally simpler: you monitor queue depth, worker count, and dead-letter queues. Event-driven systems require monitoring consumer lag, partition distribution, and event throughput. Both need infrastructure for retries, but event-driven systems often need a more sophisticated error handling strategy (e.g., retry topics with exponential backoff).
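As a small illustration of the exponential backoff mentioned above, the delay schedule for a chain of retry topics could be computed like this. The base delay, attempt count, and cap are arbitrary example values, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(base_seconds, max_attempts, cap_seconds):
    """Delay before each retry: doubles every attempt, never exceeds the cap."""
    return [min(base_seconds * 2 ** attempt, cap_seconds)
            for attempt in range(max_attempts)]

delays = backoff_delays(base_seconds=1.0, max_attempts=5, cap_seconds=10.0)
# delays == [1.0, 2.0, 4.0, 8.0, 10.0]
```

Production setups often add random jitter to each delay so that many failed consumers do not retry at the same instant.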
When Neither Works Well
For workflows that require strong consistency across multiple steps (e.g., a multi-step transaction that must either fully succeed or fully roll back), neither queue nor event-driven patterns are sufficient on their own. You need a saga pattern or a distributed transaction coordinator. Queues can implement sagas with compensating actions, and event-driven systems can use choreographed sagas, but both add significant complexity.
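The core of a saga with compensating actions is small enough to sketch. The step names, the `run_saga` function, and the use of `RuntimeError` as the failure signal are all illustrative assumptions; a real saga must also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed, log = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except RuntimeError:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
    return True, log

ledger = []

def fail_transfer():
    raise RuntimeError("transfer rejected")

steps = [
    ("reserve_funds", lambda: ledger.append("reserved"), lambda: ledger.append("released")),
    ("book_transfer", fail_transfer, lambda: ledger.append("transfer reversed")),
    ("notify", lambda: ledger.append("notified"), lambda: None),
]
ok, log = run_saga(steps)
```

The second step fails, so the reservation is released and the notification never runs. Note that this yields eventual rollback, not atomicity: other services can observe the reservation before it is undone.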
Reader FAQ
Can I use both patterns in the same workflow?
Yes, and many teams do. A common hybrid is to use queues for the critical path (where ordering and reliability are paramount) and events for side effects (logging, notifications, analytics). For example, in an order processing workflow, the main 'fulfill order' steps run through a queue, while 'send confirmation email' and 'update inventory' are triggered by events. This gives you the best of both worlds, but adds the complexity of managing two infrastructure patterns.
How do I choose between them for a new project?
Start by listing your non-negotiable requirements: strict ordering? Exactly-once processing? Low latency? Dynamic routing? If ordering and at-least-once are critical, lean toward queues. If you need parallel processing and loose coupling, lean toward events. If you're unsure, prototype the most complex decision path with both patterns — the one that feels more natural for your team's mental model is often the right choice.
What about cloud-managed services?
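Managed services like AWS SQS (queue) and EventBridge (event bus) reduce operational overhead. SQS offers FIFO queues for strict ordering. EventBridge supports content-based filtering and schema discovery. The trade-offs remain the same, but managed services handle scaling and durability for you. Just watch out for vendor-specific quirks and check the current quotas before you commit: by default, SQS FIFO queues allow 300 API calls per second per action (up to 3,000 messages per second with batching) unless high-throughput mode is enabled, and EventBridge retries a failed delivery for up to 24 hours by default, after which the event is dropped unless you configure a dead-letter queue.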
How do I handle long-running decisions?
Both patterns can handle decisions that take minutes or hours. In queue-based systems, you set a long visibility timeout and have the worker send heartbeats that extend it, so the message is not redelivered while work is still in progress (brokers cap this; SQS, for example, allows at most 12 hours). In event-driven systems, you can use a 'process started' event and a separate 'process completed' event, with a timeout check in between. For very long workflows (days), consider using a workflow engine (like Temporal or AWS Step Functions) that combines queues and events with state persistence.
Practical Takeaways
Choosing between queue-based and event-driven decision workflows isn't a one-time architectural decision. It's a series of small choices for each decision path in your system. Here are concrete steps to apply what we've covered.
1. Map your decision paths. For each workflow, list the steps, their dependencies, and the ordering requirements. If steps are independent and can run in parallel, events are a natural fit. If steps must run in sequence and you need strict ordering, use a queue.
2. Identify critical vs. non-critical paths. Use queues for the path that must never lose a message and must retry on failure. Use events for paths where occasional loss is acceptable (e.g., analytics) or where you can reconstruct state from other sources.
3. Implement idempotency early. Whether you choose queues or events, assume duplicates will happen. Add a unique ID to each message or event, and store processed IDs in a deduplication store (Redis, DynamoDB). This one practice will save you countless debugging hours.
4. Monitor the right metrics. For queues: queue depth, age of oldest message, dead-letter count. For events: consumer lag, event throughput, error rate. Set alerts on these metrics before you go to production.
5. Plan for change. Your workflow will evolve. Design your message schemas with versioning (e.g., a 'version' field) and avoid tight coupling between producers and consumers. In event-driven systems, use schema registries. In queue-based systems, use a common envelope format that can be extended.
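Step 5 can be sketched as a versioned envelope with an upgrade path for old messages. The field names and the v1-to-v2 change (renaming `uid` to `user_id`) are hypothetical, chosen only to show the mechanism.

```python
import json

CURRENT_VERSION = 2

def wrap(event_type, payload, version=CURRENT_VERSION):
    """Serialize a payload inside a common, versioned envelope."""
    return json.dumps({"version": version, "type": event_type, "payload": payload})

def unwrap(raw):
    """Parse an envelope and upgrade older payloads to the current shape."""
    envelope = json.loads(raw)
    version = envelope.get("version", 1)   # messages without the field count as v1
    payload = dict(envelope["payload"])
    if version == 1:
        # Hypothetical migration: v1 called the field "uid", v2 calls it "user_id".
        payload["user_id"] = payload.pop("uid")
    elif version != CURRENT_VERSION:
        raise ValueError(f"unsupported envelope version: {version}")
    return envelope["type"], payload

old_message = json.dumps({"type": "user registered", "payload": {"uid": 7}})
new_message = wrap("user registered", {"user_id": 7})
```

Both `unwrap(old_message)` and `unwrap(new_message)` yield the same event type and payload, so consumers only ever deal with the current shape.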
Ultimately, the best pattern is the one your team can operate confidently. Start simple, measure, and iterate. Union Architecture is about making the right trade-offs for your context — and now you have a clearer map of the terrain.