Monitoring NServiceBus Demo - Message backlogs

Which endpoints have the most work to do?

Each endpoint has a backlog of messages to process in its input queue. As each endpoint processes its own backlog it is also generating new messages that go into the backlog of other endpoints in the system. When a spike in traffic occurs, either due to increased activity by end users or because a large batch process kicks off, it can be useful to monitor the changes to backlog load across the system. Using this information you can decide to scale some of your endpoints out to meet the increased demand.

This part of the tutorial guides you to the endpoints with the greatest backlog of work to do.

This walkthrough uses the sample solution included in the demo download package to illustrate monitoring concepts in the Particular Service Platform. For instructions on how to download and run this sample, see Monitoring NServiceBus solutions: Demo.

Metrics

To measure endpoint backlogs we're going to use two metrics: queue length and critical time.

Queue length is an estimate of how many messages are in the input queue of an endpoint, representing the backlog of messages to be processed. If an endpoint is processing messages faster than it is receiving them then queue length will decrease. Conversely, when an endpoint is receiving messages faster than it can process them, queue length will increase.

Critical time is the time between when a message is initially sent and when it has been completely processed. Critical time includes the time it takes for a message to get from a sending endpoint to the destination queue, the time the message spends waiting to get to the front of that queue to be processed, and the time it takes the receiving endpoint to process the message. Critical time is the time it takes for your system to react to a specific message.

Queue length and critical time are related. As queue length increases, messages spend more time waiting in an endpoint's input queue, and so critical time also increases.

Both critical time and queue length are approximations and are not exact measurements.

Critical time is calculated using timestamps generated by two different machines (sender and receiver). If there is a significant difference in clocks on these machines then it will introduce an error into the critical time calculation.

Queue length is approximated by having sending and receiving endpoints report how many messages of a specific type they have sent or received respectively. If reports are received out of order or if there is a significant delay in either of these reports arriving, then it can introduce an error into the queue length calculation.

Sample walkthrough

The following walk through shows you how to identify changes in endpoint backlog and use that information to scale out endpoints to handle load.

Run the sample solution. Open ServicePulse to the Monitoring tab.

Service Pulse monitoring tab showing sample endpoints

Look at the queue length and critical times for each endpoint. By default, the ClientUI endpoint is sending one order per second into the system and the system is able to keep up. Queue lengths are hovering around 0 as messages are not waiting around in input queues. Critical times are remaining very near processing time as most of a messages critical time is the amount of time it takes to process.

Let's see what happens when we apply more pressure to the system.

Find the ClientUI endpoint window and toggle High-Throughput mode. Now go back to the ServicePulse Monitoring Tab.

Service Pulse monitoring tab showing sample endpoints in high throuhgput mode

Now look at the queue length and critical time for each endpoint. Notice that there isn't that much change for the Billing and Shipping endpoints. The Sales endpoint is quickly running in to difficulties though. Once it hits it's throughput limit, the Sales queue length starts to ramp up. As it does, the critical time also starts to climb. The Sales endpoint is becoming less responsive as it's backlog of work increases.

The sample endpoint included in the demo has been artificially limited so that it can only handle 4 messages concurrently.

Once there are 1,000 messages in the backlog, turn off High-Throughput mode.

Now that the spike in traffic is over, there is still a large backlog of messages to get through. At a throughput of 4 messages per second, it's going to take 250 seconds (just over 4 minutes) to get through all of the backlog. That doesn't account for the new messages still being added to the Sales input queue by the ClientUI endpoint.

To handle this load of traffic we are going to scale out the Sales endpoint.

In the Monitoring folder, run scale.bat.

This batch file will start 3 more instances of the Sales endpoint (for a total of 4) with a 20 second delay in between starting each instance. As each new instance comes online, the critical time for the Sales endpoint comes down and the throughput increases.

As the throughput of the Sales endpoint increases watch what happens to the Billing and Shipping endpoints. As these endpoints are waiting for OrderPlaced events from the Sales endpoint, increasing the throughput of Sales also increases the throughput of these downstream endpoints. Neither of them seems to be building up a large queue length or critical time, so we don't need to scale them out.

This scale-out example is a simulation, and it only works on a single machine because the message handlers are using little/no system resources (simulating message processing latency with Task.Delay). In a production scenario you'd likely run a second instance of the endpoint on a different machine.

Once you have more than one instance of an endpoint running, ServicePulse will show a count of how many instances are running in the overview screen.

In the ServicePulse monitoring tab, click the Sales endpoint to open a detailed view.

In this view you can see all of the instances that are running.

ServicePulse Details tab showing instances of scaled out Sales endpoint

Each instance provides it's own breakdown of metric data. If one particular instance is having a problem you'll be able to see it here.

Find the window for instance-2 of the Sales endpoint. Close it.

When an endpoint instance stops sending data, it will be displayed with an error icon in ServicePulse.

ServicePulse Monitoring tab showing instance of Sales endpoint has stopped sending data

Critical time is going up, but so is processing time

Critical time covers the entire life-time of a message, from when it is first sent to when it has been completely processed. This includes the length of time it takes for the endpoint to process the message, which is already measured by processing time.

As processing time is a component of critical time, if processing time increases, so will critical time. Not only that, but an increase in processing time is often accompanied by a decrease in throughput (see Which message types take the longest to process? to find out why). This decrease in throughput can result in a backlog of messages (increased queue length) meaning that messages spend more time waiting to be processed, and that also increases the critical time.

If processing time is increasing, it is often a sign that there is some resource that is failing and the endpoint relies on that resource to process it's messages. i.e. A database server or remote web server that is taking longer to process requests. If this is the case, scaling out your endpoint might just make the problem worse as you'll be putting additional pressure on these resources which may slow down even further, or in the worst case, start to break.

You need to do some deeper investigation to resolve this type of issue. First, use the details view to find the message types that are taking the longest to process. Look at the handler code for these message types and try to identify any calls that are I/O-bound. Identify which resources are being accessed. Look at the behavior of these resources and see why they might be slowing down. Is there any way to prevent that from happening? Should you be reducing your concurrency or even throttling your requests to the resource to ensure it stays up and running?

The queue length on my endpoint has started going down but critical time still increasing

Critical time is a delayed measurement. It measures the amount of time a message took to get processed after it was sent. When queue length, network latency, and processing time are relatively stable, then critical time can be used to predict how long a new message will take to get processed. If any of those factors are changing significantly then critical time is less useful as a predictive measurement.

Next Steps

Explore the demo:

If you have your own NServiceBus solution that you would like to add monitoring to, go to Adding monitoring to an existing NServiceBus solution.

If you want to learn more about NServiceBus, go to NServiceBus Quick Start


Last modified