Request-Response with Retry

Two services engage in an Asynchronous Request-Response conversation.

How can a consumer deal with a missing response message in a Request-Response conversation?

Because Asynchronous Request-Response uses separate messages for the request and the response, the initiator may expect a response message, but never receive it.
Remote communication can fail for a number of reasons, many of which may be intermittent. For example, a network router may be unavailable, or a service may not be operating. Such circumstances can cause a valid operation to fail at one time and succeed at another.
If both initiator and provider are Transactional Clients in combination with Guaranteed Delivery, no messages are lost – the message is either in the request channel, being processed by the service, or in the reply channel. However, using this type of interchange requires three transactions: one to enqueue the request, one to process it, and one to consume the response [19].
Even if no message is lost, the initiator may become "impatient". For example, the request may be processed by a very slow service provider. The initiator may get a faster response by sending a new request.

Have the consumer retry the request if it does not receive a response within a certain time interval. Make both the initiator and the service idempotent so they can deal with duplicate messages.

The Request-Response with Retry conversation involves the following participants:

The Requestor initiates the conversation by sending a Request message. If no Response message is received within a certain time interval, the Requestor re-sends the request.
The Provider waits for incoming Request messages and replies with Response messages.

The "happy path" of the Request-Response with Retry conversation is identical to Asynchronous Request-Response. However, the Requestor uses a time-out condition to decide how long to wait for a Response message. If the Response message does not arrive within that time window, the Requestor resends the Request message. The challenge in most conversations, which even this very simple pattern exemplifies, is that the participants cannot know the real state of other participants: they have to derive that state from the messages they observe, which typically run the risk of getting lost.
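The Requestor's side of this conversation can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name request_with_retry, the send callable, the in-memory reply queue, and the default time-out and attempt values are all assumptions made for the example. Note that every resend reuses the same correlation ID.

```python
import queue
import time
import uuid


def request_with_retry(send, replies, payload, timeout_s=0.05, max_attempts=3):
    """Send a request and resend it whenever no response arrives in time.

    send    -- hypothetical callable that delivers a (correlation_id, payload) tuple
    replies -- queue.Queue on which (correlation_id, response) tuples arrive
    Returns (response, number_of_attempts) or raises TimeoutError.
    """
    correlation_id = str(uuid.uuid4())  # reused unchanged for every resend
    for attempt in range(1, max_attempts + 1):
        send((correlation_id, payload))
        deadline = time.monotonic() + timeout_s
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # time-out: request or response may have been lost
            try:
                reply_id, response = replies.get(timeout=remaining)
            except queue.Empty:
                break  # time-out while waiting on the reply channel
            if reply_id == correlation_id:
                return response, attempt
            # A reply that belongs to another conversation is ignored.
    raise TimeoutError(f"no response after {max_attempts} attempts")
```

The Requestor cannot tell from the time-out alone which message went missing, so the sketch does the only thing available to it: it resends and waits again.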

Idempotency

For example, the time-out in Request-Response with Retry may have been triggered as a result of a lost (or delayed) Response message. The Requestor cannot distinguish this scenario from one where the Request got lost, so its only recourse is to resend the Request message. As a result, the Provider receives a Request message a second time that it has already processed. If the operation is inherently idempotent (such as a read operation), the Provider can simply re-execute the operation and send the new Response message to avoid holding any state. If the service provides rapidly changing data, it may even be desirable that the Provider sends "fresh" data on a new request. However, as this conversation does not distinguish a "Retry" from a new conversation instance, it ends up following the Asynchronous Request-Response conversation without any error handling.

If the Provider wants to avoid performing the requested action a second time, it has to be built as an Idempotent Receiver, i.e., it must be able to distinguish a resent message from two distinct requests that happen to contain the same data. To aid the service in detecting duplicates, the initiator can equip the resent message with the same Correlation Identifier. If the Provider has already processed a message with the same Correlation Identifier, it can skip the requested operation and simply return the previously constructed Response message. Being able to re-send Response messages requires the Provider to cache them for the convenience of the Requestor, so the conversation is no longer stateless from the Provider's perspective: the Provider needs to keep a list of received conversation IDs, and possibly a list of already sent responses. Holding state for the convenience of a conversation is a form of coupling that should be managed, e.g. by using Resource Management patterns.
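A Provider that replays cached responses might look like the sketch below. The class name IdempotentProvider, the handle method, and the unbounded in-memory dictionary are assumptions made for illustration; a real Provider would also need a policy for discarding old entries.

```python
class IdempotentProvider:
    """Executes each conversation's operation once and replays the cached response."""

    def __init__(self, operation):
        self._operation = operation  # hypothetical callable doing the real work
        self._responses = {}         # correlation ID -> response already sent

    def handle(self, correlation_id, payload):
        if correlation_id in self._responses:
            # Resent request: skip the operation, return the earlier response.
            return self._responses[correlation_id]
        response = self._operation(payload)
        self._responses[correlation_id] = response  # state kept per conversation
        return response
```

Duplicates are detected by correlation ID rather than by payload, so two distinct requests that carry the same data are still executed separately.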

The initiator has to deal with duplicate messages as well. The service provider might have sent a Response through an asynchronous message queue just as the consumer decided to give up waiting. In this case the consumer resends the Request message, only to receive the original Response message a fraction of a second later. This makes the consumer happy, but a little while later it will receive another Response message from the Provider, which has processed the resent request. The Requestor should ignore this message because the previous Response message has already been processed. Thus the Requestor has to be idempotent as well. This consideration highlights the symmetry of the Asynchronous Request-Response conversation in contrast to the inherently asymmetric method invocation, which serves as the role model for Remote Procedure Invocation.
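On the Requestor's side, ignoring the late duplicate can be sketched as a small filter in front of the application's response handler. The names ResponseDeduplicator, receive, and on_response are hypothetical, and the set of completed IDs is kept in memory without any expiry, which is a simplification.

```python
class ResponseDeduplicator:
    """Passes the first response per conversation on and drops later duplicates."""

    def __init__(self, on_response):
        self._on_response = on_response  # hypothetical application callback
        self._completed = set()          # correlation IDs already answered

    def receive(self, correlation_id, response):
        """Return True if the response was processed, False if it was a duplicate."""
        if correlation_id in self._completed:
            return False  # response to a resent request: ignore it
        self._completed.add(correlation_id)
        self._on_response(response)
        return True
```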

Dynamic System Behavior

A simple extension of Asynchronous Request-Response with a Retry message makes the dynamic behavior of the overall system significantly more complex. For example, if a Provider serves many Requestors, a heavy load may increase response times, which in turn causes the Requestors to start detecting time-out conditions and to send additional Request messages with the intent to retry the operation. The additional messages only increase the load on the Provider, exacerbating the situation. What was intended as a means to increase the reliability of the system has now become a reason for its decreased stability. Guaranteed Delivery can help avoid such situations by relegating retries to the lower-level messaging system, giving the application-layer protocol the assurance that the message won't be lost.

In systems without Guaranteed Delivery, three strategies are common:

A maximum retry count avoids overloading the system with infinite retries. Instead the Requestor reports an error to the application layer after a specified number of resent request messages.
Exponential backoff increases the timeout interval after each time-out condition by a factor, e.g. 2, which doubles the timeout after each "missed" Response message.
Circuit breakers track the state of a Provider: if previous Request messages have led to timeouts, the Requestor assumes the Provider is either hopelessly overloaded or in an error state and will return an error immediately to the application instead of making a new request, which likely results in another timeout.
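The three strategies can be sketched together as shown below. All names and default values (backoff_timeouts, CircuitBreaker, failure_threshold, reset_after_s) are assumptions for illustration. The trial request allowed after reset_after_s is a common refinement that is not described above; it is included only so that the breaker in this sketch can close again.

```python
import time


def backoff_timeouts(initial_s, factor=2.0, max_retries=3):
    """Yield the time-out for the first attempt and for each permitted retry."""
    timeout = initial_s
    for _ in range(max_retries + 1):  # maximum retry count bounds the sequence
        yield timeout
        timeout *= factor             # exponential backoff


class CircuitBreaker:
    """Fails fast once a Provider has caused several consecutive time-outs."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Assumption: after a quiet period, let one trial request through.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_timeout(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open (or re-open) the breaker

    def record_success(self):
        self._failures = 0
        self._opened_at = None
```

A Requestor would ask allow_request() before sending, report an error to the application immediately if it returns False, and otherwise take its time-outs from backoff_timeouts, reporting an error once the generator is exhausted.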

Application-level Error Conditions

If the Requestor application fails, it may or may not be able to determine the state of a conversation that was in progress. If all parties are Transactional Clients, and the message channel can provide information as to which message IDs have been pushed to the channel, the Requestor can recover the conversation state without resending messages [19]. If the system is not transactional, the client would likely resend the message.

Error recovery in conversations is a complex topic. Request-Response with Retry is the application of the Retry error recovery pattern in an Asynchronous Request-Response conversation. For more detail on error handling strategies, see Ensuring Consistency.

Example: RosettaNet Implementation Framework

The RosettaNet Implementation Framework 02.00.00 states in Sections 2.6.5 Receipt Acknowledgment and 2.6.6 Handling Retries and Late Acknowledgments:

As established earlier, the trading partner sending an action message retries the message until either a Signal (Receipt Acknowledgment or Exception) is received or a timeout condition occurs. Hence, the receiver MUST be prepared to receive the same action message more than once. In such a case, if the action requires a Receipt Acknowledgment, the Receipt Acknowledgment (or Exception if there is a failure) MUST be resent.

The RosettaNet PIPs (Partner Interface Processes) specify the Time To Acknowledge, Time To Perform, and Retry Count parameters to guide when and how often an action can be retried.