In distributed environments, operations commonly span multiple independent components. These interactions should execute consistently: either all operations succeed or, if something goes wrong, no operation completes and every component reverts to its original state. In a non-distributed or tightly controlled environment, transaction mechanisms usually ensure this type of consistency, guaranteeing the classic ACID properties: operations are atomic, consistent, isolated, and durable. Databases and TP Monitors are classic examples of components that guarantee ACID properties. In distributed, loosely coupled systems, however, such ACID transactions are typically not available or not practical.
In the absence of traditional "ACID" transactions, Pat Helland proposed in 2007 to rephrase the acronym to: Associative, Commutative, Idempotent, and Distributed ([Helland]). While the original ACID properties are about correctness and precision, the new ones are about flexibility and robustness. Being associative means that a chain of operations can be performed in any grouping, the classical example being (A + B) + C = A + (B + C) for the associative operator ‘+’. Likewise, the operands of a commutative operation can be swapped without affecting the result: A + B = B + A. Finally, a function is idempotent if applying it repeatedly does not change the result, i.e. f(x) = f(f(x)). In combination, the "new ACID" properties mean that operations can be performed out of order or duplicated without affecting the results.
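To make the "new ACID" properties concrete, the following minimal sketch uses set union as the operation, because it happens to be associative, commutative, and idempotent all at once. The `merge` function and the order identifiers are hypothetical names chosen for illustration; they do not come from [Helland].

```python
from functools import reduce
import random


def merge(state: frozenset, update: frozenset) -> frozenset:
    """Set union: associative, commutative, and idempotent."""
    return state | update


# Hypothetical updates arriving from different components.
updates = [frozenset({"order-1"}), frozenset({"order-2"}), frozenset({"order-3"})]

# Reference result: every update applied exactly once, in order.
in_order = reduce(merge, updates, frozenset())

# Simulate an unreliable channel: some updates are delivered twice
# and everything arrives in a shuffled order.
noisy = updates + [updates[0], updates[2]]
random.Random(42).shuffle(noisy)
out_of_order = reduce(merge, noisy, frozenset())

# Reordering and duplication do not change the outcome.
assert in_order == out_of_order
```

An operation such as plain addition would pass the reordering part of this experiment but not the duplication part, since adding the same amount twice changes the sum.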
When the communication infrastructure does not provide traditional "ACID" guarantees, components have to engage in conversations to achieve consistency. For example, they may expect acknowledgments to their messages, resend requests, or perform tentative operations to be confirmed later. When engaging in a conversation to ensure consistency, participants have to balance the following considerations:
- Reducing uncertainty: When a system communicates with another system over a potentially unreliable channel, it is inherently uncertain about the other system’s state. For example, if a service consumer does not receive a response to a request, it cannot be certain whether the service provider processed the request or not.
- Detecting errors: Due to the inherent uncertainty across services, detecting an error condition itself may prove difficult, even more so if external systems are involved. For example, how does one detect that a letter sent with regular mail did not arrive?
- Mitigating risk: In the absence of traditional ACID transactions, participants have to accept that consistency cannot be achieved in all cases. Therefore, the conversation should try to minimize the probability and potential downside of inconsistent outcomes.
- Optimistic vs. Pessimistic: Participants can choose to execute optimistically, optimizing for the case where everything goes well, usually gaining simplicity and high throughput. In return, they have to accept failure scenarios that are complex or difficult to recover from. Alternatively, services can be pessimistic, trying to minimize the frequency and severity of failure scenarios, typically at the expense of lower throughput or higher complexity. Some systems are so optimistic that they do not handle errors at all.
- Idempotency: One of the simplest strategies to cope with failed operations is to retry the operation. In the face of uncertainty across distributed systems this can mean, though, that a service is asked to repeat an already successfully executed operation. Idempotent services recognize such a situation and avoid duplicate execution.
- Certainty vs. Complexity: More complex conversations, including multiple acknowledgments, can help ensure a consistent outcome, but also increase the complexity of the interaction, which may reduce throughput and make it more difficult to implement.
- Layered Protocols: Lower protocol layers may be able to deal with failure scenarios so that the application-level protocol does not have to worry about them. For example, reliable messaging infrastructures typically implement sophisticated store-and-forward mechanisms that can deal with lost connections, unavailable service providers, and other mishaps. However, not all communication errors can be completely hidden from the application layer. For example, if a message still cannot be delivered after many attempts, the application may have to inform the user or pursue an alternate path of execution.
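The interplay of uncertainty, retry, and idempotency described above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: `PaymentService`, `LossyChannel`, and `debit_with_retry` are hypothetical names, and the channel is assumed to lose responses only after the provider has already processed the request, which is exactly the case the consumer cannot distinguish from a lost request.

```python
import uuid


class PaymentService:
    """Hypothetical provider that deduplicates requests by request ID."""

    def __init__(self) -> None:
        self.balance = 100
        self._processed = {}  # request_id -> result of the first execution

    def debit(self, request_id: str, amount: int) -> int:
        if request_id in self._processed:
            # Duplicate request: return the earlier result, do not debit again.
            return self._processed[request_id]
        self.balance -= amount
        self._processed[request_id] = self.balance
        return self.balance


class LossyChannel:
    """Delivers every request but drops the first `drops` responses."""

    def __init__(self, service: PaymentService, drops: int) -> None:
        self.service = service
        self.drops = drops

    def debit(self, request_id: str, amount: int) -> int:
        result = self.service.debit(request_id, amount)  # provider did the work
        if self.drops > 0:
            self.drops -= 1
            raise TimeoutError("response lost")  # consumer never learns of it
        return result


def debit_with_retry(channel: LossyChannel, amount: int, max_attempts: int = 3) -> int:
    request_id = str(uuid.uuid4())  # the same ID is reused for every attempt
    for _ in range(max_attempts):
        try:
            return channel.debit(request_id, amount)
        except TimeoutError:
            continue  # uncertain outcome: safe to retry because of the ID
    raise RuntimeError(f"giving up after {max_attempts} attempts")


service = PaymentService()
result = debit_with_retry(LossyChannel(service, drops=2), amount=30)
assert service.balance == 70  # debited once despite three deliveries
```

Without the request ID check, the same three deliveries would have debited the account three times. The final `RuntimeError` corresponds to the case where the error can no longer be hidden and the application has to pursue an alternate path.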
The patterns in this section fall into two categories:
Consistency Strategies
- Ignore Error: this unlikely strategy can often be the most effective: do nothing in case of errors.
- Isolate Error: ignore the error in the context of the current conversation, but handle all errors afterwards.
- Retry: if you don't succeed at first, try again.
- Compensating Action: use a second action that undoes a prior action to regain a consistent state.
- Start Over: if you cannot undo an action, revert to the beginning and rebuild the desired state.
- Tentative Operation: an interaction is split into two parts, a pending operation, followed by a confirmation or a time-out, which leads to cancellation. (try-confirm-cancel)
- Coordinated Agreement: a dedicated coordinator interacts with participants to find a jointly agreeable solution.
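As an illustration of one of these strategies, the try-confirm-cancel flow of a Tentative Operation might look roughly like the sketch below. The `SeatInventory` class, its method names, and the 30-second hold are assumptions made for this example. Time is passed in explicitly as `now` so that the time-out can be shown without waiting.

```python
class SeatInventory:
    """Hypothetical participant that supports try-confirm-cancel."""

    def __init__(self, seats: int, hold_seconds: float) -> None:
        self.available = seats
        self.hold_seconds = hold_seconds
        self._pending = {}  # reservation_id -> time at which the hold expires
        self.confirmed = set()

    def try_reserve(self, reservation_id: str, now: float) -> bool:
        """Tentatively take a seat; it stays pending until confirmed."""
        self.expire(now)
        if self.available == 0:
            return False
        self.available -= 1
        self._pending[reservation_id] = now + self.hold_seconds
        return True

    def confirm(self, reservation_id: str, now: float) -> bool:
        """Make a pending reservation permanent, if it has not timed out."""
        self.expire(now)
        if reservation_id not in self._pending:
            return False
        del self._pending[reservation_id]
        self.confirmed.add(reservation_id)
        return True

    def cancel(self, reservation_id: str) -> None:
        """Release a pending seat; unknown IDs are ignored."""
        if self._pending.pop(reservation_id, None) is not None:
            self.available += 1

    def expire(self, now: float) -> None:
        """A time-out leads to cancellation of the pending operation."""
        for rid, deadline in list(self._pending.items()):
            if now >= deadline:
                self.cancel(rid)


inventory = SeatInventory(seats=2, hold_seconds=30)
inventory.try_reserve("r1", now=0)
inventory.try_reserve("r2", now=0)
in_time = inventory.confirm("r1", now=10)   # confirmed within the hold period
too_late = inventory.confirm("r2", now=45)  # hold expired, seat was released
assert in_time and not too_late
```

Because `cancel` ignores unknown reservation IDs, it can be repeated safely, which combines well with the Retry strategy.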
Mitigation Strategies
- Perform Most Likely to Fail Action First: failing early reduces the number of actions to be undone in case of failure.
- Perform Hardest to Revert Action Last: later actions are less likely to have to be undone due to subsequent failures.
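The two mitigation strategies can be combined with a Compensating Action, as in the following sketch. It rests on assumptions that are not part of the patterns themselves: each step carries an estimated `failure_likelihood` and a relative `revert_cost`, and the `plan` function resolves the two heuristics by placing the single hardest-to-revert step last and ordering the remaining steps by decreasing failure likelihood. Other ways of combining the two heuristics are equally plausible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    failure_likelihood: float  # assumed estimate between 0 and 1
    revert_cost: float         # assumed relative cost of undoing the step
    action: Callable[[], None]
    undo: Callable[[], None]   # compensating action


def plan(steps: List[Step]) -> List[Step]:
    """Hardest-to-revert step last; the rest ordered most-likely-to-fail first."""
    hardest = max(steps, key=lambda s: s.revert_cost)
    rest = [s for s in steps if s is not hardest]
    rest.sort(key=lambda s: s.failure_likelihood, reverse=True)
    return rest + [hardest]


def run(steps: List[Step]) -> bool:
    """Execute the plan; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in plan(steps):
        try:
            step.action()
        except Exception:
            for finished in reversed(done):
                finished.undo()
            return False
        done.append(step)
    return True


log: List[str] = []


def record(message: str) -> Callable[[], None]:
    return lambda: log.append(message)


def out_of_stock() -> None:
    raise RuntimeError("out of stock")


# Hypothetical order-processing steps; the inventory step is made to fail.
steps = [
    Step("send email", 0.01, 10.0, record("email sent"), record("apology sent")),
    Step("reserve inventory", 0.05, 1.0, out_of_stock, record("inventory released")),
    Step("charge card", 0.20, 2.0, record("card charged"), record("card refunded")),
]
order = [s.name for s in plan(steps)]
succeeded = run(steps)
assert log == ["card charged", "card refunded"]
```

In this run, the email, which is the hardest action to take back, is never sent because the failure occurred before its turn, so only the card charge has to be compensated.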