 

Analyzing high WSGI/Response time data of django with mod_wsgi on NewRelic

Project deployed as: Django with Apache mod_wsgi

I have the New Relic Lite version configured to track web performance.
In New Relic, under Monitoring -> Web transactions, sorting by "Most time consuming" and selecting the first entry shows a breakdown table in which WSGI/Response seems to consume 81% of the time.

  • What exactly does WSGI/Response mean in this context?
  • What sub-components does it include?
  • Is it possible to get a breakdown table for WSGI/Response?
  • What might be the possible causes of such a high WSGI/Response percentage of time?
  • Any good ideas for analyzing this? (free solutions only)
asked Nov 26 '13 by Subhranath Chunder


1 Answer

Consider a simple WSGI hello world program.

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    # Declare the response headers and signal the start of the response.
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    # Return the response body as an iterable yielding a single string.
    return [output]

From a WSGI server's perspective there are two key phases in having the WSGI application handle a request.

The first is calling the actual WSGI application and having it return the result.

The second is the act by the WSGI server of consuming the result, where the result is required to be some form of iterable yielding strings.

Under New Relic, 'WSGI/Application' tracks the time spent in the first phase. 'WSGI/Response' tracks the second phase, that is, how much time was spent consuming the strings from the returned iterable.
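
To make the two phases concrete, here is a rough sketch of what a WSGI server does for each request. It is illustrative only, not how mod_wsgi is actually implemented, and the write_to_client() helper is a made-up stand-in for whatever mechanism the server uses to send data back to the HTTP client.

def handle_request(application, environ, start_response, write_to_client):
    # Phase 1: call the WSGI application and get back the result iterable.
    # New Relic records this under 'WSGI/Application'.
    result = application(environ, start_response)

    # Phase 2: consume the iterable, writing each block back to the client.
    # New Relic records this under 'WSGI/Response'.
    try:
        for block in result:
            write_to_client(block)  # may take time for large or slow responses
    finally:
        # The WSGI specification requires close() to be called if it exists.
        if hasattr(result, 'close'):
            result.close()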

To understand why 'WSGI/Response' may show a large amount of time, it is necessary to understand in a bit more depth what is actually going on.

The first non-obvious thing is that after each string is yielded from the iterable, the WSGI server has to write that string back to the HTTP client making the request. It doesn't need to wait until the client has received it, but it does need to do enough to ensure that delivery will continue in parallel after it moves on to getting the next string from the iterable.

The time recorded by 'WSGI/Response' therefore encompasses not only the time taken to yield each item from the iterable returned by the WSGI application, but also the time taken to write the response back to the HTTP client.

As it includes the time taken to write back the response, a few issues can arise.

These are:

  • Very large responses can take time to be written back.
  • A slow client or network can make that time longer if the underlying web server blocks at any point.
  • A WSGI application itself can make it all worse if it yields a large number of small strings, rather than a single string or at least a smaller number of larger strings (as illustrated below).
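
As a rough illustration of the last point, here are two versions of the same response. Exact behaviour depends on the WSGI server, but the first forces a separate write and flush for every small string, while the second writes the whole body in one go.

def application_many_small(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    # Each of the 10000 items is written and flushed back to the client
    # individually.
    return [('%d\n' % n) for n in range(10000)]

def application_one_large(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    # The same body, pre-joined into a single string and written in one go.
    return ['\n'.join(str(n) for n in range(10000)) + '\n']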

The worst case scenario is a WSGI application which has a bug in it whereby it returns a string object as the result, rather than a list or iterable which yields the string. This is bad because each single character in the string will be written one at a time, with the corresponding flush occurring to have it written back to the client. This can cause excessive amounts of overhead and increase the time taken.
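
For comparison, a minimal version of that buggy pattern looks like the following; iterating over a string yields one character at a time, so each character is written back separately.

def application(environ, start_response):
    output = 'Hello World!'
    start_response('200 OK', [('Content-type', 'text/plain'),
                              ('Content-Length', str(len(output)))])
    # BUG: should be [output]; returning the bare string means each
    # character is yielded and written back one at a time.
    return output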

If using Apache/mod_wsgi there should be warning messages logged to the Apache error log if a WSGI application is buggy in this way. Some other WSGI servers, such as uWSGI, silently correct the mistake as an optimisation, even though technically that is violating the WSGI specification and how it says the result should be handled. A WSGI server silently correcting it is bad practice because it gives a false sense of security that it is working okay, but when you move to a WSGI server that conforms with the WSGI specification, performance will degrade.

In order to determine which of these may be the cause, the New Relic Python agent also records custom metrics for each response about the number of bytes returned in the response and how many separate strings were yielded. When you have a slow transaction sample trace, these will show in the trace summary under 'WSGI/Output/Bytes' and 'WSGI/Output/Calls/yield'. There is also 'WSGI/Output/Time', which records the time from first byte sent to last byte sent. If it helps to get a view for these across the whole WSGI application, these can also be charted with a custom dashboard.

As well as the above, another issue can come into play where the iterable returned is not simply a list, but a generator or other custom iterable.

In this case, 'WSGI/Response' is also recording the time the WSGI application takes to generate each yielded string. So if a WSGI application generates the response slowly as it is calculating it on demand in some way then that can also cause an increase in time recorded under 'WSGI/Response'.
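
A hedged sketch of that pattern is shown below; build_chunk() is a made-up name standing in for whatever expensive per-chunk work the application does while the server is iterating over the result.

def application(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])

    def generate():
        for i in range(10):
            # Work done here happens while the WSGI server is consuming the
            # iterable, so it is attributed to 'WSGI/Response' rather than
            # 'WSGI/Application'.
            yield build_chunk(i)

    return generate()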

So overall there is potentially a lot going on which is recorded under 'WSGI/Response', the primary things being:

  • Time taken to yield all strings making up the response from the iterable returned by the WSGI application.
  • Time taken to write the response back to the HTTP client which made the request.

In some cases, especially with WSGI applications that implement server push or deliver very large responses, the fact that this time is included in the response time can be an issue and skew response time averages for the whole web application.

The use of the percentiles view for response times rather than averages can help to separate out these outliers, but in other cases a little extra work may be necessary.

In these special cases, what can be done is to use the Python agent API in the handlers for the affected web transactions to prematurely terminate recording of the transaction, so as not to count any excessive amount of time spent delivering the response. There are also options to completely disable monitoring for specific web transactions, or web transactions can be flagged so as to be recorded as a background task instead, thereby removing any influence they have on web transaction averages for response times.

The three agent API functions for these purposes are listed below, with a short usage sketch after the list:

  • newrelic.agent.end_of_transaction()
  • newrelic.agent.ignore_transaction()
  • newrelic.agent.set_background_task()
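
A minimal sketch of how these might be used from Django views follows. The view names and the generate_large_export()/wait_for_events() helpers are made up for illustration; only the newrelic.agent calls come from the agent API.

import newrelic.agent
from django.http import HttpResponse, StreamingHttpResponse

def export_view(request):
    # Stop recording time for this transaction from this point on, so slow
    # delivery of a very large response does not skew response time averages.
    newrelic.agent.end_of_transaction()
    return StreamingHttpResponse(generate_large_export())

def health_check(request):
    # Exclude this transaction from monitoring completely.
    newrelic.agent.ignore_transaction()
    return HttpResponse('OK')

def long_poll(request):
    # Record this transaction as a background task rather than a web
    # transaction, removing it from web response time averages.
    newrelic.agent.set_background_task()
    return StreamingHttpResponse(wait_for_events())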

When using Apache/mod_wsgi, you also have some ability to apply these through the Apache configuration file. For example, you can flag specific URLs to be ignored using:

<Location /some/sub/url>
SetEnv newrelic.ignore_transaction true
</Location>

As for getting better visibility into what is going on: if the time is due to large responses or slow HTTP clients, there isn't much more you can do beyond looking at the transaction metrics recorded against slow transaction samples for output bytes, number of yield calls and so on.

If the iterable is actually a generator which is doing work, then you can possibly get better insight by using key transactions and X-Ray sessions in New Relic to get a thread profile dump. That will help you to narrow down where the time is being spent. For more enduring visibility, you can then apply additional function traces to your code so additional functions called when a WSGI application is yielding the response are also tracked and shown in the performance breakdown.
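
For example, if the generator calls a helper to produce each chunk, wrapping that helper with a function trace makes the time it takes show up as its own segment in the breakdown; build_chunk() and render_fragment() here are made-up names for illustration.

import newrelic.agent

@newrelic.agent.function_trace()
def build_chunk(i):
    # Per-chunk work done while the response is being yielded; with the
    # function trace applied it appears in the transaction breakdown.
    return render_fragment(i)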

answered Sep 22 '22 by Graham Dumpleton