Monitoring NServiceBus endpoints with Prometheus and Grafana

This article is part of the NServiceBus Learning Path.

Prometheus is a monitoring solution for storing time series data like metrics. Grafana visualizes the data stored in Prometheus (and other sources). This sample demonstrates how to capture NServiceBus OpenTelemetry metrics, store them in Prometheus, and visualize these metrics using a Grafana dashboard.

Grafana NServiceBus fetched, processed, and errored messages

Prerequisites

To run this sample, Prometheus and Grafana are required. This sample uses Docker and a docker-compose.yml file to run the stack.

Code overview

The sample simulates message load with a random 10% failure rate using the LoadSimulator class:

var simulator = new LoadSimulator(endpointInstance, TimeSpan.Zero, TimeSpan.FromSeconds(10));
simulator.Start(cancellation.Token);
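
The LoadSimulator source is not shown here. As a minimal sketch of where a random 10% failure rate could come from, the handler below throws for roughly one message in ten. The SomeCommand message type and System.Exception failure type match the raw metrics output later in this article; the handler name and the placement of the failure logic in the handler are assumptions.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Message type name taken from the metrics output; its shape is assumed.
public class SomeCommand : ICommand
{
}

// Hypothetical handler: fails about 10% of the time so that
// retries and failures show up in the reported metrics.
public class SomeCommandHandler : IHandleMessages<SomeCommand>
{
    public Task Handle(SomeCommand message, IMessageHandlerContext context)
    {
        // Random.Shared.Next(10) returns 0..9, so 0 occurs about one time in ten.
        if (Random.Shared.Next(10) == 0)
        {
            throw new Exception("Simulated failure");
        }

        return Task.CompletedTask;
    }
}
```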

Reporting metric values

NServiceBus uses the OpenTelemetry standard to report metrics. The metrics are disabled by default and must be enabled on the endpoint configuration.

var endpointConfiguration = new EndpointConfiguration(EndpointName);
endpointConfiguration.EnableOpenTelemetry();

Opt into the metrics of a specific meter, either by name or by wildcard:

var meterProviderBuilder = Sdk.CreateMeterProviderBuilder()
    .SetResourceBuilder(resourceBuilder)
    .AddMeter("NServiceBus.Core");
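
For context, the snippet above can be placed in a complete setup along the following lines. This is a sketch rather than the sample's exact code: the service name "OpenTelemetryDemo" is a placeholder, and the wildcard form is shown only to illustrate the "by wildcard" option. An exporter still has to be added before the metrics become visible anywhere.

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

// Describes the service that produces the metrics; the name is a placeholder.
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddService("OpenTelemetryDemo");

// AddMeter accepts exact meter names as well as wildcard patterns.
// "NServiceBus.*" subscribes to every meter whose name starts with "NServiceBus.".
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .SetResourceBuilder(resourceBuilder)
    .AddMeter("NServiceBus.*")
    .Build();
```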

Each reported metric is tagged with the following additional information:

the queue name of the endpoint
the uniquely addressable address for the endpoint (if set)
the .NET fully qualified type information for the message being processed
the exception type name (if applicable)

Additional metrics

Recoverability and processing-related metrics currently emitted by the metrics package are not yet available as native OpenTelemetry metrics (based on System.Diagnostics.Metrics), so a shim is required to expose them as OpenTelemetry metrics.

class EmitNServiceBusMetrics : Feature
{
    public EmitNServiceBusMetrics()
    {
        EnableByDefault();
    }

    protected override void Setup(FeatureConfigurationContext context)
    {
        var queueName = context.LocalQueueAddress().BaseAddress;
        var discriminator = context.InstanceSpecificQueueAddress()?.Discriminator;

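        // RecoverabilitySettings has no public constructor, so an instance is created
        // through reflection to subscribe to the retry and error queue notifications.
        var recoverabilitySettings = (RecoverabilitySettings)typeof(RecoverabilitySettings).GetConstructor(
              BindingFlags.NonPublic | BindingFlags.Instance,
              null, [typeof(SettingsHolder)],
              null).Invoke([(SettingsHolder)context.Settings]);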

        recoverabilitySettings.Immediate(i => i.OnMessageBeingRetried((m, _) => RecordRetry(m.Headers, queueName, discriminator, true)));
        recoverabilitySettings.Delayed(d => d.OnMessageBeingRetried((m, _) => RecordRetry(m.Headers, queueName, discriminator, false)));
        recoverabilitySettings.Failed(f => f.OnMessageSentToErrorQueue((m, _) => RecordFailure(m.Headers, queueName, discriminator)));

        context.Pipeline.OnReceivePipelineCompleted((e, _) =>
        {
            e.TryGetMessageType(out var messageType);

            var tags = new TagList(
            [
                new(Tags.QueueName, queueName ?? ""),
                new(Tags.EndpointDiscriminator, discriminator ?? ""),
                new(Tags.MessageType, messageType ?? ""),
            ]);

            ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalMilliseconds, tags);

            if (e.TryGetDeliverAt(out DateTimeOffset startTime) || e.TryGetTimeSent(out startTime))
            {
                CriticalTime.Record((e.CompletedAt - startTime).TotalMilliseconds, tags);
            }

            return Task.CompletedTask;
        });
    }

    static Task RecordRetry(Dictionary<string, string> headers, string queueName, string discriminator, bool immediate)
    {
        headers.TryGetMessageType(out var messageType);

        var tags = new TagList(
        [
            new(Tags.QueueName, queueName ?? ""),
            new(Tags.EndpointDiscriminator, discriminator ?? ""),
            new(Tags.MessageType, messageType ?? ""),
        ]);

        if (immediate)
        {
            ImmedidateRetries.Add(1, tags);
        }
        else
        {
            DelayedRetries.Add(1, tags);
        }
        Retries.Add(1, tags);

        return Task.CompletedTask;
    }

    static Task RecordFailure(Dictionary<string, string> headers, string queueName, string discriminator)
    {
        headers.TryGetMessageType(out var messageType);

        var tags = new TagList(
        [
            new(Tags.QueueName, queueName ?? ""),
            new(Tags.EndpointDiscriminator, discriminator ?? ""),
            new(Tags.MessageType, messageType ?? "")
        ]);

        MessageSentToErrorQueue.Add(1, tags);

        return Task.CompletedTask;
    }

    static readonly Meter NServiceBusMeter = new Meter("NServiceBus.Core", "0.1.0");

    public static readonly Counter<long> ImmedidateRetries =
        NServiceBusMeter.CreateCounter<long>("nservicebus.recoverability.immediate_retries", description: "Number of immediate retries performed by the endpoint.");

    public static readonly Counter<long> DelayedRetries =
        NServiceBusMeter.CreateCounter<long>("nservicebus.recoverability.delayed_retries", description: "Number of delayed retries performed by the endpoint.");

    public static readonly Counter<long> Retries =
        NServiceBusMeter.CreateCounter<long>("nservicebus.recoverability.retries", description: "Number of retries performed by the endpoint.");

    public static readonly Counter<long> MessageSentToErrorQueue =
        NServiceBusMeter.CreateCounter<long>("nservicebus.recoverability.moved_to_error", description: "Number of messages sent to the error queue.");

    public static readonly Histogram<double> ProcessingTime =
        NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");

    public static readonly Histogram<double> CriticalTime =
        NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.criticaltime", "ms", "The time in milliseconds between when the message was sent until processed by the endpoint.");

    public static class Tags
    {
        public const string EndpointDiscriminator = "nservicebus.discriminator";
        public const string QueueName = "nservicebus.queue";
        public const string MessageType = "nservicebus.message_type";
    }
}
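
To check locally that instruments like the ones in the shim emit measurements, without running Prometheus, a MeterListener from System.Diagnostics.Metrics can print them to the console. The following is a minimal, self-contained sketch: it creates a stand-in meter and counter that reuse the names from the shim and does not depend on NServiceBus. The queue name value is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Stand-in meter and counter that mirror the names used by the shim.
using var meter = new Meter("NServiceBus.Core", "0.1.0");
var retries = meter.CreateCounter<long>("nservicebus.recoverability.retries");

using var listener = new MeterListener();

// Subscribe only to instruments published by the meter of interest.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "NServiceBus.Core")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Print every long measurement together with its tags.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    var parts = new List<string>();
    foreach (var tag in tags)
    {
        parts.Add($"{tag.Key}={tag.Value}");
    }

    Console.WriteLine($"{instrument.Name} +{value} [{string.Join(", ", parts)}]");
});

listener.Start();

// Record one measurement tagged with a placeholder queue name.
retries.Add(1, new KeyValuePair<string, object?>("nservicebus.queue", "OpenTelemetryDemo"));
```

The histograms in the shim record double values, so observing them would need a second callback registered with SetMeasurementEventCallback<double>.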

Message processing counters

To monitor the rate of messages being fetched from the queuing system, processed successfully, and failed for the endpoint, use:

nservicebus.messaging.fetches
nservicebus.messaging.successes
nservicebus.messaging.failures

Recoverability

To monitor recoverability metrics use:

nservicebus.recoverability.immediate_retries
nservicebus.recoverability.delayed_retries
nservicebus.recoverability.retries
nservicebus.recoverability.moved_to_error

Critical time and processing time

To monitor critical time and processing time (in milliseconds) for successfully processed messages use:

nservicebus.messaging.processingtime
nservicebus.messaging.criticaltime

Exporting metrics

The metrics are gathered on the endpoint using OpenTelemetry standards and must be reported to and collected by an external service. A Prometheus HTTP listener exposes this data so that the Prometheus service, hosted as a Docker service, can retrieve and store it.

The listener is available via the OpenTelemetry.Exporter.Prometheus.HttpListener NuGet package. In this sample, the service that exposes the data to scrape is hosted on http://127.0.0.1:9464/metrics:

meterProviderBuilder.AddPrometheusHttpListener(options => options.UriPrefixes = new[] { "http://127.0.0.1:9464" });

127.0.0.1 is used so that the Prometheus service running in Docker can reach it over the network.
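
On the Prometheus side, a scrape job has to point at this listener. The fragment below is a sketch of what such a prometheus.yml could look like, not the file shipped with the sample: the job name, the scrape interval, and the use of host.docker.internal to reach the host from inside the container are all assumptions.

```yaml
# prometheus.yml (illustrative; values are assumptions)
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: nservicebus-endpoint
    metrics_path: /metrics
    static_configs:
      # host.docker.internal resolves to the Docker host on Docker Desktop.
      # On Linux it may require an extra_hosts entry in docker-compose.yml.
      - targets: ['host.docker.internal:9464']
```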

The raw metrics retrieved through the scraping endpoint look as follows:

# HELP nservicebus_messaging_successes Total number of messages processed successfully by the endpoint.
# TYPE nservicebus_messaging_successes counter
nservicebus_messaging_successes{nservicebus_discriminator="main",nservicebus_message_type="SomeCommand, Endpoint, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null",nservicebus_queue="OpenTelemetryDemo"} 850 1657693075515

# HELP nservicebus_messaging_fetches Total number of messages fetched from the queue by the endpoint.
# TYPE nservicebus_messaging_fetches counter
nservicebus_messaging_fetches{nservicebus_discriminator="main",nservicebus_message_type="SomeCommand, Endpoint, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null",nservicebus_queue="OpenTelemetryDemo"} 1060 1657693075515

# HELP nservicebus_messaging_failures Total number of messages processed unsuccessfully by the endpoint.
# TYPE nservicebus_messaging_failures counter
nservicebus_messaging_failures{nservicebus_discriminator="main",nservicebus_failure_type="System.Exception",nservicebus_message_type="SomeCommand, Endpoint, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null",nservicebus_queue="OpenTelemetryDemo"} 210 1657693075515

The diagram below shows the overall component interactions:

graph TD
    A[NServiceBus Endpoint] -->|Report Metrics| B(Prometheus Exporter)
    B -->|Expose| C{Metric Endpoint}
    C -->|No Metrics| D[Status 200]
    C -->|Has Metrics| E[Return Metrics]
    F[Prometheus Service] -->|Poll Metrics| E
    F -->|Store Metrics| F
    G[Grafana] -->|Query Data| F

Docker stack

The Prometheus service must be configured to retrieve the metrics data from the endpoint. Grafana must also be configured to get the data from Prometheus and visualize it as graphs.

To run the Docker stack, run docker-compose up -d in the directory where the docker-compose.yml file is located.
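
The docker-compose.yml file itself is not reproduced in this article. As a rough sketch of what a stack of this kind typically contains, the fragment below starts Prometheus and Grafana from their public images. The host ports, the mounted configuration file path, and the extra_hosts entry are assumptions and may differ from the sample's file.

```yaml
# docker-compose.yml (illustrative; not the sample's actual file)
services:
  prometheus:
    image: prom/prometheus
    ports:
      # Host port is an assumption; match it to the URL used to open Prometheus.
      - "9090:9090"
    volumes:
      # Mounts a local scrape configuration into the container.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    extra_hosts:
      # Lets the container reach the host on Linux.
      - "host.docker.internal:host-gateway"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```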

Show a graph

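Open Prometheus on http://localhost:9090/graph (9090 is the Prometheus default port; use the host port mapped in docker-compose.yml if it differs).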

NServiceBus reports counters for fetched, successfully processed, and failed messages. To graph them, these counters must be converted to rates by a query. For example, the nservicebus_messaging_successes_total metric can be queried as:

avg(rate(nservicebus_messaging_successes_total[5m]))
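
A few further queries in the same style are sketched below. The label names come from the raw metrics output shown earlier. The failure and fetch metric names assume the same _total suffix as the success counter, and the histogram name is an assumption, so verify the exact names against the /metrics output before using them.

```promql
# Per-second success rate, split by message type, over the last 5 minutes
sum by (nservicebus_message_type) (rate(nservicebus_messaging_successes_total[5m]))

# Fraction of fetched messages that failed
sum(rate(nservicebus_messaging_failures_total[5m]))
  / sum(rate(nservicebus_messaging_fetches_total[5m]))

# 95th percentile processing time; the exported histogram name is an assumption
# (the exporter may append a unit suffix such as _milliseconds)
histogram_quantile(0.95, sum by (le) (rate(nservicebus_messaging_processingtime_milliseconds_bucket[5m])))
```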

Prometheus graphs based on query

Grafana

Grafana must be installed and configured to display the data scraped and stored in Prometheus. For more information on how to install Grafana, refer to the Grafana installation guide. In this sample, the Grafana service runs as part of the Docker stack mentioned above.
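
When Grafana runs in the same Docker stack, the Prometheus data source can also be registered through a provisioning file instead of the UI. The fragment below is a sketch of such a file. Whether the sample provisions Grafana this way is not stated here, and the URL assumes a Compose service named prometheus listening on port 9090.

```yaml
# Grafana data source provisioning file (illustrative),
# typically placed under /etc/grafana/provisioning/datasources/
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumed Compose service name and port.
    url: http://prometheus:9090
    isDefault: true
```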

Dashboard

To graph the metrics, the following steps must be performed:

Add a new dashboard
Add a graph
Click its title to edit
From the Data source dropdown, select Prometheus
For the query, open the Metrics dropdown and select one of the metrics. Built-in functions (e.g. rate) can also be applied.

Grafana dashboard with NServiceBus OpenTelemetry metrics

The sample includes an export of the Grafana dashboard which can be imported as a reference.
