Recoverability

In message-driven systems, failures in message receivers do not impact the message senders. Messages are persistent, and even failed messages remain in the system until issues are resolved and they can be successfully processed. This increases system resiliency as failures in one component do not affect other components and no data is lost. Different strategies can be applied to deal with different types of failures:

Transient errors

Transient failures are temporary and are not caused by errors in business logic. Typical causes include network issues, throttling, and concurrency conflicts. Resilient applications absorb such failures through self-healing, and retries are a good way to do so. There are two common retry patterns:

  • Immediate retries: Many transient failures (e.g. concurrency errors) can be resolved by immediately retrying messages. However, immediate retries might not be the best approach when the root cause is an overloaded or throttled resource, as they may exacerbate the problem.
  • Delayed retries: Infrastructure-related transient failures (e.g. network problems) might require more time to resolve. In these cases, it makes more sense to retry after a delay. Different delayed retry strategies can be used, such as fixed intervals, exponential backoff, or exception-based values.
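The two retry patterns can be combined. The following is a minimal, language-neutral sketch in Python, not NServiceBus code: the names `process_with_retries` and `exponential_backoff`, the default retry counts, and the 10-second base delay are all assumptions made for illustration. It also assumes that each delayed round gets its own set of immediate attempts, and it waits in-process, whereas a real messaging system would typically schedule the message for later delivery instead.

```python
import time
from typing import Callable, Iterator


def exponential_backoff(base_seconds: float, rounds: int) -> Iterator[float]:
    """Yield delays base, 2*base, 4*base, ... one per delayed retry round."""
    for round_index in range(rounds):
        yield base_seconds * (2 ** round_index)


def process_with_retries(
    handler: Callable[[dict], None],
    message: dict,
    immediate_retries: int = 3,      # illustrative default
    delayed_retries: int = 2,        # illustrative default
    base_delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run handler; retry immediately first, then after growing delays.

    Re-raises the last exception once every retry has been used up.
    """
    # The first round starts without waiting; each later round waits first.
    delays = [None] + list(exponential_backoff(base_delay_seconds, delayed_retries))
    last_error = None
    for delay in delays:
        if delay is not None:
            sleep(delay)
        # One initial attempt plus the configured immediate retries.
        for _ in range(1 + immediate_retries):
            try:
                handler(message)
                return
            except Exception as error:
                last_error = error
    raise last_error
```

With these defaults a message that always fails is attempted (1 + 3) × (1 + 2) = 12 times before the exception is re-raised. Swapping `exponential_backoff` for a generator that yields a constant value would give the fixed-interval strategy.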

Blog: I caught an exception. Now what? →

Persistent errors

If errors cannot be resolved after a certain number of automated retries, they are considered persistent errors. Persistent errors typically require manual intervention to resolve the root cause before the failed messages are retried.

To prevent persistently failing messages from being retried indefinitely, they can be moved to dedicated error queues (many message queueing technologies use dead-letter queues). This sets the messages aside in a place where they can be manually inspected, so they do not clog up the system. However, it also means error queues need to be actively monitored.
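As a rough illustration of setting failed messages aside, the sketch below uses in-memory queues. Everything in it is hypothetical: the `receive` function, the `error` queue name, the `failed_queue` and `exception` header names, and the message shape (a dict with `headers` and `body`) are assumptions for the example, not the behavior of any particular queueing technology.

```python
from collections import deque
from typing import Callable


def receive(
    queue_name: str,
    queues: dict[str, deque],
    handler: Callable[[object], None],
    max_attempts: int = 3,           # illustrative retry budget
) -> None:
    """Drain one input queue, moving messages that keep failing to 'error'."""
    source = queues[queue_name]
    while source:
        message = source.popleft()
        failure = None
        for _ in range(max_attempts):
            try:
                handler(message["body"])
                break
            except Exception as error:
                failure = error
        else:
            # Every attempt failed: record where the message came from and
            # why it failed, then set it aside so the queue keeps moving.
            message["headers"]["failed_queue"] = queue_name
            message["headers"]["exception"] = repr(failure)
            queues["error"].append(message)
```

Recording the source queue and the exception alongside the message is what makes later inspection and retrying practical.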

Once the root cause of a persistent error has been resolved, messages can be moved back to their intended queues to be retried.
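Returning messages to their intended queues could look roughly like the sketch below. It assumes, purely for illustration, that each failed message carries a hypothetical `failed_queue` header naming its source queue and that queues are in-memory `deque` objects keyed by name; in practice this step is usually performed with dedicated tooling rather than hand-written code.

```python
from collections import deque


def retry_failed_messages(queues: dict[str, deque]) -> int:
    """Move messages in 'error' back to the queue named in their headers.

    Returns the number of messages moved. Messages without a usable
    'failed_queue' header stay in the error queue for manual inspection.
    """
    error_queue = queues["error"]
    moved = 0
    # Visit each message currently in the error queue exactly once.
    for _ in range(len(error_queue)):
        message = error_queue.popleft()
        target = message["headers"].get("failed_queue")
        if target not in queues or target == "error":
            error_queue.append(message)
            continue
        # Strip the failure diagnostics before the message is retried.
        del message["headers"]["failed_queue"]
        message["headers"].pop("exception", None)
        queues[target].append(message)
        moved += 1
    return moved
```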

Video: An exception occurred... Try again →

Best practices

  • Don't catch exceptions in business logic invoked by messages. With message-level recoverability mechanisms in place, exceptions thrown from business logic cause messages to be retried without additional error handling code, such as try/catch blocks or retry libraries like Polly.
  • Configure and customize recoverability policies using the dedicated NServiceBus configuration options to avoid leaking infrastructure-related issues into business logic.
  • Review consistency strategies for guidance on dealing with consistency while re-running business logic during retries.
  • Don't build custom retry mechanisms. Building custom recoverability logic is risky and error-prone. The NServiceBus retry mechanisms are proven and thoroughly tested.
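The first two practices amount to keeping the retry decision out of the handler. The sketch below illustrates that separation conceptually in Python; it is not the NServiceBus configuration API, and it is not a retry mechanism meant for production use. The exception types, the `Decision` class, the `recoverability_policy` function, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


class ConcurrencyConflict(Exception):
    """Hypothetical error: another writer changed the record first."""


class Throttled(Exception):
    """Hypothetical error: a downstream service asked the caller to slow down."""

    def __init__(self, retry_after_seconds: float):
        super().__init__(f"retry after {retry_after_seconds}s")
        self.retry_after_seconds = retry_after_seconds


@dataclass(frozen=True)
class Decision:
    action: str                 # "immediate", "delayed", or "error"
    delay_seconds: float = 0.0


def recoverability_policy(error: Exception, attempts_so_far: int) -> Decision:
    """Infrastructure-side policy: map a failure to the next recovery step."""
    if attempts_so_far >= 5:                    # illustrative retry budget
        return Decision("error")
    if isinstance(error, ConcurrencyConflict):
        return Decision("immediate")
    if isinstance(error, Throttled):
        # Exception-based delay: honor the wait the service asked for.
        return Decision("delayed", error.retry_after_seconds)
    return Decision("delayed", 10.0 * attempts_so_far)


def reserve_stock(order: dict, record: dict) -> None:
    """Business logic: no try/except; failures simply propagate."""
    if record["version"] != order["expected_version"]:
        raise ConcurrencyConflict(order["sku"])
    record["quantity"] -= order["quantity"]
    record["version"] += 1
```

Because `reserve_stock` contains no error handling, changing how failures are retried only touches the policy, never the business code.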