Java HttpClient seems to be caching content

I'm building a simple web scraper and I need to fetch the same page a few hundred times. The page contains a dynamic attribute that should change on every request. I've built a multithreaded class based on HttpClient to process the requests, and I'm using an ExecutorService to create a thread pool and run the threads. The problem is that the dynamic attribute sometimes doesn't change between requests, and I end up getting the same value on 3 or 4 consecutive threads. I've read a lot about HttpClient and I really can't find where this problem comes from. Could it be caused by caching, or something like it?

Update: here is the code executed in each thread:

HttpContext localContext = new BasicHttpContext();

HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params,
        HTTP.DEFAULT_CONTENT_CHARSET);
HttpProtocolParams.setUseExpectContinue(params, true);

ClientConnectionManager connman = new ThreadSafeClientConnManager();

DefaultHttpClient httpclient = new DefaultHttpClient(connman, params);

HttpHost proxy = new HttpHost(inc_proxy, Integer.valueOf(inc_port));
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
        proxy);

HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

String iden = null;
int timeoutConnection = 10000;
HttpConnectionParams.setConnectionTimeout(httpGet.getParams(),
        timeoutConnection);

try {

    HttpResponse response = httpclient.execute(httpGet, localContext);

    HttpEntity entity = response.getEntity();

    if (entity != null) {

        InputStream instream = entity.getContent();
        String result = convertStreamToString(instream);
        // System.out.printf("Result\n %s",result +"\n");
        instream.close();

        iden = StringUtils
                .substringBetween(result,
                        "<input name=\"iden\" value=\"",
                        "\" type=\"hidden\"/>");
        System.out.printf("IDEN:%s\n", iden);
        EntityUtils.consume(entity);
    }

}

catch (ClientProtocolException e) {
    // HTTP protocol error; report it instead of silently dropping the cause
    System.out.println("ClientProtocolException: " + e.getMessage());

} catch (IOException e) {
    // Connection problem or aborted request
    System.out.println("IOException: " + e.getMessage());
}
Trota asked Mar 09 '12 22:03


2 Answers

HttpClient does not cache by default (that is, when you use only the DefaultHttpClient class). It caches only if you use CachingHttpClient, a decorator for the HttpClient interface that adds caching:

HttpClient client = new CachingHttpClient(new DefaultHttpClient(), cacheConfiguration);

It then uses the If-Modified-Since and If-None-Match headers to decide whether to send the request to the remote server or to return the result from the cache.

I suspect that your issue is caused by the proxy server sitting between your application and the remote server.
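If a caching proxy is the suspect, one thing to try is asking caches on the route to revalidate with the origin server by sending Cache-Control: no-cache (and the older Pragma: no-cache) with each request. The sketch below uses the JDK's HttpURLConnection rather than Apache HttpClient so that it needs no extra libraries. The class name NoCacheRequest and the method prepare are made up for illustration, and a proxy that ignores these headers will not be affected by them.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

public class NoCacheRequest {

    // Prepares (but does not send) a GET request routed through the given
    // HTTP proxy, with headers asking caches to revalidate with the origin.
    // NoCacheRequest and prepare are hypothetical names for this sketch.
    static HttpURLConnection prepare(String url, String proxyHost, int proxyPort)
            throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);

        // Tell the JDK itself not to use any local response cache.
        conn.setUseCaches(false);

        // Request directives aimed at caches between client and origin.
        conn.setRequestProperty("Cache-Control", "no-cache");
        conn.setRequestProperty("Pragma", "no-cache");
        return conn;
    }
}
```

With the HttpGet from the question, the corresponding call would be httpGet.setHeader("Cache-Control", "no-cache"), set in the same way as the User-Agent header there.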

To check whether the proxy is responsible, you can use curl. First, run a number of requests that bypass the proxy:

#!/bin/bash

for i in {1..50}
do
  echo "*** Performing request number $i"
  curl -D - http://yourserveraddress.com -o $i -s
done

Then run diff across the downloaded files. Every file should differ in the attribute you mentioned. Next, add the -x/--proxy <host[:port]> option to curl, run the script again, and compare the files once more. If some responses are now identical to others, you can be sure the proxy server is the cause.
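Comparing 50 files pairwise with diff gets tedious, so the comparison step could also be automated. The following is a minimal sketch, assuming the response bodies were saved as separate files as in the curl loop. The class name DuplicateResponses and the method findDuplicates are hypothetical, and the check is a strict byte-for-byte comparison of whole files.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateResponses {

    // Groups the given files by exact byte content and returns only the
    // groups holding more than one file, i.e. the repeated responses.
    static List<List<Path>> findDuplicates(List<Path> files) throws IOException {
        // ByteBuffer compares by content, so it can serve as an exact-match key.
        Map<ByteBuffer, List<Path>> byContent = new LinkedHashMap<>();
        for (Path file : files) {
            ByteBuffer key = ByteBuffer.wrap(Files.readAllBytes(file));
            byContent.computeIfAbsent(key, k -> new ArrayList<>()).add(file);
        }

        List<List<Path>> duplicates = new ArrayList<>();
        for (List<Path> group : byContent.values()) {
            if (group.size() > 1) {
                duplicates.add(group);
            }
        }
        return duplicates;
    }
}
```

Passing the paths of the files 1 to 50 to findDuplicates should return an empty list for the run without the proxy. Any group returned for the run through the proxy names files with identical content.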

omnomnom answered Sep 26 '22 15:09


Generally speaking, in order to test whether or not HTTP requests are being made over the wire, you can use a "sniffing" tool that analyzes network traffic, for example:

  • Fiddler ( http://fiddler2.com/fiddler2/ ) - I would start with this
  • Wireshark ( http://www.wireshark.org/ ) - more low level

I highly doubt HttpClient is performing caching of any sort (this would imply it needs to store pages in memory or on disk - not one of its capabilities).

While this is not an answer, it's a point to ponder: is it possible that the server (or some proxy in between) is returning cached content? If you are performing many requests (simultaneously or nearly so) for the same content, the server may be returning cached content because it has decided that the information has not "expired" yet. In fact, the HTTP protocol provides caching directives for exactly this purpose. Here is a site that gives a high-level overview of the different HTTP caching mechanisms:

http://betterexplained.com/articles/how-to-optimize-your-site-with-http-caching/
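One way to look for such caching from the client side is to inspect the response headers. The sketch below is a rough heuristic under stated assumptions, not a definitive test: an Age header implies that a cache answered, a Via header shows that the response passed through a proxy or gateway, and a Cache-Control value without no-store, no-cache, or private leaves shared caches free to reuse the response. The class name CacheHints and the method inspect are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CacheHints {

    // Returns human-readable hints that a response may have been served by,
    // or passed through, a cache. Header names are matched case-insensitively.
    static List<String> inspect(Map<String, String> responseHeaders) {
        Map<String, String> h = new HashMap<>();
        for (Map.Entry<String, String> e : responseHeaders.entrySet()) {
            if (e.getKey() != null) {
                h.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
            }
        }

        List<String> hints = new ArrayList<>();
        if (h.containsKey("age")) {
            hints.add("Age: " + h.get("age") + " (a cache answered this request)");
        }
        if (h.containsKey("via")) {
            hints.add("Via: " + h.get("via") + " (passed through a proxy or gateway)");
        }

        String cc = h.get("cache-control");
        cc = (cc == null) ? "" : cc.toLowerCase(Locale.ROOT);
        if (!cc.contains("no-store") && !cc.contains("no-cache") && !cc.contains("private")) {
            hints.add("Cache-Control has no no-store, no-cache, or private directive");
        }
        return hints;
    }
}
```

The headers printed by curl -D - or shown in Fiddler can be checked by eye against the same three rules.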

I hope this gives you a starting point. If you have already considered these avenues then that's great.

SuperPomodoro answered Sep 26 '22 15:09