Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

haproxy stats: qtime,ctime,rtime,ttime?

Running a web app behind HAProxy 1.6.3-1ubuntu0.1, I'm getting haproxy stats qtime,ctime,rtime,ttime values of 0,0,0,2704.

From the docs (https://www.haproxy.org/download/1.6/doc/management.txt):

 58. qtime [..BS]: the average queue time in ms over the 1024 last requests
 59. ctime [..BS]: the average connect time in ms over the 1024 last requests
 60. rtime [..BS]: the average response time in ms over the 1024 last requests
     (0 for TCP)
 61. ttime [..BS]: the average total session time in ms over the 1024 last requests

I'm expecting response times in the 0-10ms range. ttime of 2704 milliseconds seems unrealistically high. Is it possible the units are off and this is 2704 microseconds rather than 2704 millseconds?

Secondly, it seems suspicious that ttime isn't even close to qtime+ctime+rtime. Is total response time not the sum of the time to queue, connect, and respond? What is the other time, that is included in total but not queue/connect/response? Why can my response times be <1ms, but my total response times be ~2704 ms?

Here is my full csv stats:

$ curl "http://localhost:9000/haproxy_stats;csv"
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
http-in,FRONTEND,,,4707,18646,50000,5284057,209236612829,42137321877,0,0,997514,,,,,OPEN,,,,,,,,,1,2,0,,,,0,4,0,2068,,,,0,578425742,0,997712,22764,1858,,1561,3922,579448076,,,0,0,0,0,,,,,,,,
servers,server1,0,0,0,4337,20000,578546476,209231794363,41950395095,,0,,22861,1754,95914,0,no check,1,1,0,,,,,,1,3,1,,578450562,,2,1561,,6773,,,,0,578425742,0,198,0,0,0,,,,29,1751,,,,,0,,,0,0,0,2704,
servers,BACKEND,0,0,0,5919,5000,578450562,209231794363,41950395095,0,0,,22861,1754,95914,0,UP,1,1,0,,0,320458,0,,1,3,0,,578450562,,1,1561,,3922,,,,0,578425742,0,198,22764,1858,,,,,29,1751,0,0,0,0,0,,,0,0,0,2704,
stats,FRONTEND,,,2,5,2000,5588,639269,8045341,0,0,29,,,,,OPEN,,,,,,,,,1,4,0,,,,0,1,0,5,,,,0,5374,0,29,196,0,,1,5,5600,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,1,200,196,639269,8045341,0,0,,196,0,0,0,UP,0,0,0,,0,320458,0,,1,4,0,,0,,1,0,,5,,,,0,0,0,0,196,0,,,,,0,0,0,0,0,0,0,,,0,0,0,0,
like image 795
clay Avatar asked Nov 07 '22 10:11

clay


1 Answers

In haproxy >2 you now get two values n / n which is the max within a sliding window and the average for that window. The max value remains the max across all sample windows until a higher value is found. On 1.8 you only get the average.

Example of haproxy 2 v 1.8. Note these proxies are used very differently and with dramatically different loads.

haproxy 2.2.8

haproxy 1.8

So looks like the average response times at least since last reboot are 66m and 275ms.

The average is computed as:

data time + cumulative http connections - 1 / cumulative http connections

This might not be a perfect analysis so if anyone has improvements it'd be appreciated. This is meant to show how I came to the answer above so you can use it to gather more insight into the other counters you asked about. Most of this information was gathered from reading stats.c. The counters you asked about are defined here.

unsigned int q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
unsigned int qtime_max, ctime_max, dtime_max, ttime_max; /* maximum of conn_time, queue_time, data_time, total_time observed */```

The stats page values are built from this code:

if (strcmp(field_str(stats, ST_F_MODE), "http") == 0)
            chunk_appendf(out, "<tr><th>- Responses time:</th><td>%s / %s</td><td>ms</td></tr>",
                      U2H(stats[ST_F_RT_MAX].u.u32), U2H(stats[ST_F_RTIME].u.u32));
            chunk_appendf(out, "<tr><th>- Total time:</th><td>%s / %s</td><td>ms</td></tr>",
                  U2H(stats[ST_F_TT_MAX].u.u32), U2H(stats[ST_F_TTIME].u.u32));   

You asked about all the counter but I'll focus on one. As can be seen in the snippit above for "Response time:" ST_F_RT_MAX and ST_F_RTIME are the values displayed on the stats page as n (rtime_max) / n (rtime) respectively. These are defined as follows:

[ST_F_RT_MAX] = { .name = "rtime_max", .desc = "Maximum observed time spent waiting for a server response, in milliseconds (backend/server)" },
[ST_F_RTIME] = { .name = "rtime", .desc = "Time spent waiting for a server response, in milliseconds, averaged over the 1024 last requests (backend/server)" },

These set a "metric" value (among other things) in a case statement further down in the code:

case ST_F_RT_MAX:
    metric = mkf_u32(FN_MAX, sv->counters.dtime_max);
    break;
case ST_F_RTIME:
    metric = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, srv_samples_window));
    break;

These metric values give us a good look at what the stats page is telling us. The first value in the "Responses time: 0 / 0" ST_F_RT_MAX, is some max value time spent waiting. The second value in "Responses time: 0 / 0" ST_F_RTIME is an average time taken for each connection. These are the max and average taken within a window of time, i.e. however long it takes for you to get 1024 connections.

For example "Responses time: 10000 / 20":

  • max time spent waiting (max value ever reached including http keepalive time) over the last 1024 connections 10 seconds
  • average time over the last 1024 connections 20ms

So for all intents and purposes

rtime_max = dtime_max
rtime = swrate_avg(d_time, srv_samples_window)

Which begs the question what is dtime_max d_time and srv_sample_window? These are the data time windows, I couldn't actually figure how these time values are being set, but at face value it's "some time" for the last 1024 connections. As pointed out here keepalive times are included in max totals which is why the numbers are high.

Now that we know ST_F_RT_MAX is a max value and ST_F_RTIME is an average, an average of what?

/* compue time values for later use */
if (selected_field == NULL || *selected_field == ST_F_QTIME ||
    *selected_field == ST_F_CTIME || *selected_field == ST_F_RTIME ||
    *selected_field == ST_F_TTIME) {
    srv_samples_counter = (px->mode == PR_MODE_HTTP) ? sv->counters.p.http.cum_req : sv->counters.cum_lbconn;
    if (srv_samples_counter < TIME_STATS_SAMPLES && srv_samples_counter > 0)
        srv_samples_window = srv_samples_counter;
}

TIME_STATS_SAMPLES value is defined as

#define TIME_STATS_SAMPLES 512
unsigned int srv_samples_window = TIME_STATS_SAMPLES;

In mode http srv_sample_counter is sv->counters.p.http.cum_req. http.cum_req is defined as ST_F_REQ_TOT.

[ST_F_REQ_TOT]  = { .name = "req_tot",  .desc = "Total number of HTTP requests processed by this object since the worker process started" },

For example if the value of http.cum_req is 10, then srv_sample_counter will be 10. The sample appears to be the number of successful requests for a given sample window for a given backends server. d_time (data time) is passed as "sum" and is computed as some non-negative value or it's counted as an error. I thought I found the code for how d_time is created but I wasn't sure so I haven't included it.

/* Returns the average sample value for the sum <sum> over a sliding window of
 * <n> samples. Better if <n> is a power of two. It must be the same <n> as the
 * one used above in all additions.
 */
static inline unsigned int swrate_avg(unsigned int sum, unsigned int n)
{
    return (sum + n - 1) / n;
}
like image 58
b0bu Avatar answered Dec 15 '22 16:12

b0bu