
Using Jsoup connect() in a loop. The first request is always much slower than all other subsequent ones

I'm creating a small app to measure how long it takes an HTML document to load, checking every x number of seconds.

I'm using jsoup in a loop:

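    import java.io.IOException;
    import java.net.SocketTimeoutException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;

    for (int i = 0; i < totalGets; i++) {
        // Reset on every iteration so a failed request cannot report the previous response
        Connection.Response response = null;
        long startTime = System.currentTimeMillis();

        try {
            response = Jsoup.connect(url)
                    .userAgent(USER_AGENT)  // just using a Firefox user-agent
                    .timeout(30_000)
                    .execute();
        } catch (SocketTimeoutException e) {
            System.out.println("Request timed out after 30 seconds!");
        } catch (IOException e) {
            System.out.println("Request failed: " + e.getMessage());
        }

        long elapsed = System.currentTimeMillis() - startTime;

        // response is still null if the request threw, so guard before reading the status code
        String status = (response != null) ? String.valueOf(response.statusCode()) : "n/a";
        System.out.println("Response time: " + elapsed + "ms" + "\tResponse code: " + status);

        sleep(2000);
    }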

The issue I'm having is that the very first execution of the jsoup connection is always slower than all subsequent ones, no matter which website I test.

Here is my output on https://www.google.com

Response time: 934ms    Response code: 200
Response time: 149ms    Response code: 200
Response time: 122ms    Response code: 200
Response time: 136ms    Response code: 200
Response time: 128ms    Response code: 200

Here is what I get on http://stackoverflow.com

Response time: 440ms    Response code: 200
Response time: 182ms    Response code: 200
Response time: 187ms    Response code: 200
Response time: 193ms    Response code: 200
Response time: 185ms    Response code: 200

Why is it always faster after the first connect? Is there a better way to determine the document's load speed?

Andrio asked Dec 15 '15 16:12


2 Answers

1. Jsoup must run some boilerplate code before the first request can be fired (one-time initialization such as class loading). I would not count the first request in your measurements, since that initialization skews its time.

2. As mentioned in the comments, many websites cache responses for a couple of seconds. Depending on the website you want to measure, you can use some tricks to get the web server to produce a fresh page each time. One such trick is to add a timestamp parameter; usually _ is used for that (like http://url/path/?parameter1=val1&_=ts). Or you could send no-cache headers in the HTTP request. However, none of these tricks can force a web server to behave the way you want. Alternatively, you can wait longer than 30 seconds between requests.
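The two suggestions above (leaving the warm-up request out of the measurement and adding a `_` timestamp parameter) could be sketched roughly as follows. This is only an illustration using the standard library: the class `TimingHelpers` and both method names are made up for this example, and it assumes the URL has no `#fragment` part.

```java
import java.util.Arrays;

final class TimingHelpers {
    /** Appends a "_" timestamp parameter so that every request URL is unique. */
    static String withCacheBuster(String url, long timestamp) {
        // Assumption: the URL has no "#fragment"; use "&" if a query string already exists
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "_=" + timestamp;
    }

    /** Averages the samples while ignoring the first warmUp entries. */
    static double averageAfterWarmUp(long[] samplesMs, int warmUp) {
        if (warmUp < 0 || warmUp >= samplesMs.length) {
            throw new IllegalArgumentException("warmUp must leave at least one sample");
        }
        return Arrays.stream(samplesMs).skip(warmUp).average().getAsDouble();
    }
}
```

In the loop from the question, the URL passed to `Jsoup.connect(...)` would become `TimingHelpers.withCacheBuster(url, System.currentTimeMillis())`, and the collected times would be averaged with `warmUp` set to 1.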

luksch answered Nov 14 '22 22:11


I think that, in addition to @luksch's points, there is another factor: Java appears to keep the connection alive for a few seconds, which probably saves protocol round trips (such as connection setup) on the following requests.

If you use .header("Connection", "close") you'll see more consistent times.

You can check that connections are kept alive with a packet sniffer. At least on my machine, I can see the source port numbers being reused.

EDIT:

Another thing that may add time to the first request is the DNS lookup ...
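To get a rough idea of how much of the first request is name resolution, the lookup can be timed on its own. Below is a minimal sketch that uses only `java.net.InetAddress`; the class `DnsTimer` and its method name are hypothetical. The JVM caches successful lookups for some time, so a second call for the same host will typically return much faster than the first.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

final class DnsTimer {
    /** Returns how long resolving the given host name took, in milliseconds. */
    static long timeLookupMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);  // performs the lookup (or hits the JVM's cache)
        } catch (UnknownHostException e) {
            // Rethrow unchecked so callers do not need a try/catch for a quick measurement
            throw new UncheckedIOException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling `DnsTimer.timeLookupMillis("www.google.com")` twice in a row and comparing the two values should show whether the lookup contributes noticeably to the slow first request.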

fonkap answered Nov 14 '22 22:11