Critical Errors

Component: NServiceBus
NuGet Package: NServiceBus (8.1)

NServiceBus has built-in recoverability, but in certain scenarios it is not possible to handle errors gracefully. In these cases NServiceBus does not have enough context to make a sensible decision on how to proceed after the error has occurred.

Examples of critical errors include:

  • An exception occurs when NServiceBus attempts to execute the recoverability policy, including moving a message to the error queue. The context will contain an error message starting with "Failed to execute recoverability policy for message with native ID:", followed by the native message ID.
  • There are repeated failures in reading information from a required storage.
  • An exception occurs reading from the input queue.

Default behavior

The default behavior is to log the exception and keep retrying indefinitely.

Custom handling

A custom critical error handler can be provided to override the default behavior.

Reasons to consider implementing a custom critical error action include:

  • Restarting the endpoint and resetting the transport connection may resolve underlying issues in receiving or dispatching messages.
  • Support personnel need to be notified when the endpoint raises a critical error.
  • The endpoint contains a handler which must not be executed beyond the configured recoverability policy.

Define a custom handler using the following code.

endpointConfiguration.DefineCriticalErrorAction(OnCriticalError);

Stopping the endpoint

When a critical error occurs, it is often unknown whether the issue is recoverable. A sound and resilient strategy is to terminate the process (e.g. via Environment.FailFast or IHostApplicationLifetime.StopApplication) and rely on the process hosting environment to restart the process as a recovery mechanism.

Before terminating the process, the NServiceBus endpoint can attempt a graceful shutdown which can be useful in non-transactional processing environments:

await criticalErrorContext.Stop(cancellationToken);

The Microsoft Generic Host's IHostApplicationLifetime.StopApplication method also stops the NServiceBus endpoint gracefully.

Calling criticalErrorContext.Stop without terminating the host process stops only the NServiceBus endpoint; the host process and other components running within it are not affected. It is recommended to restart the process after stopping the endpoint.

Host OS recoverability

Whenever possible, rely on the environment hosting the endpoint process to restart it automatically.

A possible custom implementation

The following implementation assumes that the endpoint instance is hosted in isolation and that the hosting environment of the process will restart the process after it has been killed.

async Task OnCriticalError(ICriticalErrorContext context, CancellationToken cancellationToken)
{
    try
    {
        // To leave the process active, stop the endpoint.
        // When it is stopped, attempts to send messages will cause an ObjectDisposedException.
        await context.Stop(cancellationToken);
        // Perform custom actions here, e.g.
        // NLog.LogManager.Shutdown();
    }
    finally
    {
        var failMessage = $"Critical error shutting down:'{context.Error}'.";
        Environment.FailFast(failMessage, context.Exception);
    }
}

Implementation concerns

If the endpoint is stopped without exiting the process, then any Send or Publish operation will result in an ObjectDisposedException being thrown.

When implementing a custom critical error callback:

  • Decide whether the process can be exited/terminated, and if so use the Environment.FailFast method to exit it. If the process has threads running that should complete before shutdown (e.g. non-transactional operations), the Environment.Exit method can also be used.
  • Wrap the code in a try...finally statement. In the try block perform any custom operations; in the finally block call the method that exits the process.
  • The custom operations should include flushing any in-memory state and cached data that is normally persisted at a certain interval or during graceful shutdown. For example, when using buffered or asynchronous logging, flush the appenders by calling Log.CloseAndFlush(); for Serilog, or LogManager.Shutdown(); for NLog and log4net.
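
The ordering described in the list above can be isolated in a small helper so that it can be exercised in a test without killing the test process. This is a minimal sketch under stated assumptions: CriticalErrorPolicy and its three delegate parameters are hypothetical names introduced here for illustration and are not part of NServiceBus.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of NServiceBus.
public static class CriticalErrorPolicy
{
    // Runs the custom operations and then always calls 'terminate',
    // even when stopping the endpoint or flushing state throws.
    public static async Task Run(Func<Task> stopEndpoint, Action flushState, Action terminate)
    {
        try
        {
            // Custom operations: graceful endpoint stop, then flushing buffered state.
            await stopEndpoint();
            flushState();
        }
        finally
        {
            // The call that exits the process belongs in the finally block.
            terminate();
        }
    }
}
```

Inside a critical error action, the delegates could be supplied as () => context.Stop(cancellationToken), the logging library's flush method, and () => Environment.FailFast(failMessage, context.Exception). In a test, the last delegate can be replaced with one that only records that it was called.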

Raising a critical error

Any code in the endpoint can invoke the Critical Error action.

// 'criticalError' is an instance of NServiceBus.CriticalError
// This instance can be resolved from dependency injection.
criticalError.Raise(errorMessage, exception);
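
As an illustration, a message handler could receive the CriticalError instance through constructor injection and raise a critical error when it detects a condition it cannot recover from. This is a sketch only: the SyncInventory message, the InventoryStoreUnavailableException type, and the UpdateStore method are hypothetical placeholders, and whether to rethrow after raising depends on the endpoint's requirements.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message and exception types, used only for illustration.
public class SyncInventory : ICommand
{
}

public class InventoryStoreUnavailableException : Exception
{
    public InventoryStoreUnavailableException(string message) : base(message)
    {
    }
}

public class SyncInventoryHandler : IHandleMessages<SyncInventory>
{
    readonly CriticalError criticalError;

    // The CriticalError instance is resolved from dependency injection.
    public SyncInventoryHandler(CriticalError criticalError)
    {
        this.criticalError = criticalError;
    }

    public async Task Handle(SyncInventory message, IMessageHandlerContext context)
    {
        try
        {
            await UpdateStore(message);
        }
        catch (InventoryStoreUnavailableException exception)
        {
            // Invoke the configured critical error action, then rethrow so the
            // message still goes through the regular recoverability policy.
            criticalError.Raise("The inventory store is unavailable.", exception);
            throw;
        }
    }

    // Placeholder for the real work; assumed to throw when the store is down.
    static Task UpdateStore(SyncInventory message) => Task.CompletedTask;
}
```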

Heartbeat functionality

The Heartbeat functionality is configured to start pinging ServiceControl immediately after the endpoint starts. It only stops when the process exits. The only way for a critical error to result in a heartbeat failure in ServicePulse/ServiceControl is for the critical error to kill the process.

