Critical Errors

What are critical errors?

NServiceBus handles message processing failures through the recoverability feature. However, other types of errors can occur outside of message processing, where NServiceBus does not have enough context to handle them gracefully. These tend to be deeper infrastructure issues that recoverability cannot catch. NServiceBus raises these as critical errors.

Examples of critical errors include:

An exception occurs when NServiceBus attempts to execute the recoverability policy, including moving a message to the error queue. In this case the context contains the error message "Failed to execute recoverability policy for message with native ID:" followed by the ID of the affected message.
There are repeated failures in reading information from a required storage.
An exception occurs reading from the input queue.

What happens when a critical error occurs in NServiceBus?

The default behavior is to log the exception and keep retrying indefinitely.

Often, critical errors are transient (e.g. a database was temporarily unavailable). In these cases an immediate retry can succeed, and the system continues processing where it left off.

However, sometimes, critical errors are persistent.

How do I deal with persistent critical errors?

When a critical error persists, it is often unknown if the issue is recoverable. Stopping the endpoint along with terminating and restarting the process is recommended.

Stop the endpoint

When the endpoint is hosted in the Microsoft Generic Host, calling the IHostApplicationLifetime.StopApplication method stops the NServiceBus endpoint gracefully.

Alternatively, a call to criticalErrorContext.Stop can be used.

await criticalErrorContext.Stop(cancellationToken);

Calling criticalErrorContext.Stop without terminating the host process will only stop the NServiceBus endpoint without affecting the host process and other components running within the same process. This is why restarting the process after stopping the endpoint is the recommended approach.

Terminate and restart the process

Terminate the process. If using Environment.Exit or IHostApplicationLifetime.StopApplication, the NServiceBus endpoint can attempt a graceful shutdown, which can be useful in non-transactional processing environments. Environment.FailFast terminates the process immediately and does not allow a graceful shutdown.
Ensure the environment is configured to automatically restart processes when they stop.

IIS: The IIS host will automatically spawn a new instance.
Windows Service: The OS can restart the service after 1 minute if Windows Service Recovery is enabled.
Docker: Ensure that containers are configured with the restart policy always (for example, docker run --restart always). See Start containers automatically (Docker.com).

What if I need to override the default behavior?

The default behavior is often appropriate for the lifetime of most systems. However, it is possible to override the default behavior to accommodate business needs.

For example, the default behavior can be modified with:

Sending a real-time notification to support personnel when the endpoint has raised a critical error.
Limiting how many times the endpoint retries, e.g. when continued retrying might affect costs.
Automatically restarting the endpoint and resetting the transport connection to attempt to resolve underlying issues in receiving or dispatching messages.

To override the default behavior a custom action needs to be provided:

endpointConfiguration.DefineCriticalErrorAction(OnCriticalError);

Example of a custom implementation

The following implementation assumes that the endpoint instance is hosted in isolation and that the hosting environment of the process will restart the process after it has been killed.

async Task OnCriticalError(ICriticalErrorContext context, CancellationToken cancellationToken)
{
    try
    {
        // To leave the process active, stop the endpoint.
        // When it is stopped, attempts to send messages will cause an ObjectDisposedException.
        await context.Stop(cancellationToken);
        // Perform custom actions here, e.g.
        // NLog.LogManager.Shutdown();
    }
    finally
    {
        var failMessage = $"Critical error shutting down:'{context.Error}'.";
        Environment.FailFast(failMessage, context.Exception);
    }
}
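
As a variation, the "limiting the retries" use case listed earlier could be approached by counting critical errors over a sliding time window and only shutting down once a threshold is exceeded. The following is a minimal sketch under that assumption. CriticalErrorWindow and RecordAndCheckExceeded are hypothetical names introduced for illustration and are not part of NServiceBus.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of NServiceBus: counts critical errors
// inside a sliding time window so that a callback can decide when to give up.
public sealed class CriticalErrorWindow
{
    readonly int maxErrors;
    readonly TimeSpan window;
    readonly Queue<DateTime> timestamps = new Queue<DateTime>();
    readonly object gate = new object();

    public CriticalErrorWindow(int maxErrors, TimeSpan window)
    {
        if (maxErrors < 1) throw new ArgumentOutOfRangeException(nameof(maxErrors));
        if (window <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(window));
        this.maxErrors = maxErrors;
        this.window = window;
    }

    // Records one critical error at 'nowUtc' and returns true once more than
    // 'maxErrors' errors have been recorded within the window.
    public bool RecordAndCheckExceeded(DateTime nowUtc)
    {
        lock (gate)
        {
            timestamps.Enqueue(nowUtc);

            // Drop entries that have fallen out of the window.
            while (timestamps.Count > 0 && nowUtc - timestamps.Peek() > window)
            {
                timestamps.Dequeue();
            }

            return timestamps.Count > maxErrors;
        }
    }
}
```

Inside a callback such as OnCriticalError, a shared instance could be consulted with RecordAndCheckExceeded(DateTime.UtcNow). When it returns true, the callback would stop the endpoint and terminate the process as shown above; otherwise it could log the error and return, on the assumption that returning leaves the endpoint running and retrying.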

Implementation concerns

If the endpoint is stopped without exiting the process, then any Send or Publish operation will result in a KeyNotFoundException being thrown.

When implementing a custom critical error callback:

Decide whether the process can be terminated. If so, use the Environment.FailFast method to exit it. If the process has threads running work that should complete before shutdown (e.g. non-transactional operations), the Environment.Exit method can be used instead.
The code should be wrapped in a try...finally clause. In the try block perform any custom operations; in the finally block call the method that exits the process.
The custom operations should include flushing any in-memory state and cached data that is normally persisted at an interval or during graceful shutdown. For example, when using buffered or asynchronous logging, flush the appenders by calling Log.CloseAndFlush() for Serilog, or LogManager.Shutdown() for NLog and log4net.
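
The try...finally guidance above can be captured in a small reusable helper. This is a sketch only: CriticalErrorShutdown and RunThenTerminate are hypothetical names, and the terminate delegate stands in for whichever exit method is chosen (Environment.FailFast or Environment.Exit).

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of NServiceBus.
public static class CriticalErrorShutdown
{
    // Runs the custom operations (stopping the endpoint, flushing logs, ...)
    // and then always invokes 'terminate', even if those operations throw.
    public static async Task RunThenTerminate(Func<Task> customOperations, Action terminate)
    {
        try
        {
            await customOperations().ConfigureAwait(false);
        }
        finally
        {
            // In production this would typically be something like
            // () => Environment.FailFast(message, exception).
            terminate();
        }
    }
}
```

Passing the exit call as a delegate keeps the ordering guarantee in one place and allows the callback logic to be exercised in a test without killing the test process.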

Raising a critical error

Any part of the implementation of the endpoint can invoke the criticalError action.

// 'criticalError' is an instance of NServiceBus.CriticalError
// This instance can be resolved from dependency injection.
criticalError.Raise(errorMessage, exception);
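
As an illustration of where such a call might live, the sketch below shows a component that receives CriticalError through constructor injection and raises a critical error when a required resource is unavailable. StorageProbe, Check, and the isStorageReachable delegate are hypothetical and exist only for this example; only CriticalError and its Raise method come from NServiceBus. The example requires a reference to the NServiceBus package.

```csharp
using System;
using NServiceBus;

// Hypothetical component; only CriticalError and Raise are NServiceBus APIs.
public class StorageProbe
{
    readonly CriticalError criticalError;

    // The CriticalError instance is expected to be supplied by dependency injection.
    public StorageProbe(CriticalError criticalError)
    {
        this.criticalError = criticalError;
    }

    // 'isStorageReachable' is a caller-supplied check (an assumption of this sketch).
    public void Check(Func<bool> isStorageReachable)
    {
        try
        {
            if (!isStorageReachable())
            {
                throw new InvalidOperationException("Required storage is unreachable.");
            }
        }
        catch (Exception exception)
        {
            // Hands the failure to the configured critical error action.
            criticalError.Raise("Storage probe failed", exception);
        }
    }
}
```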

Heartbeat functionality

The Heartbeat functionality is configured to start pinging ServiceControl immediately after the endpoint starts. It only stops when the process exits. The only way for a critical error to result in a heartbeat failure in ServicePulse/ServiceControl is for the critical error to kill the process.
