Apache HTTPClient throws java.net.SocketException: Connection reset for many domains

I'm creating a (well-behaved) web spider, and I notice that some servers cause Apache HttpClient to throw a SocketException, specifically:

java.net.SocketException: Connection reset

The code that causes this is:

// Execute the request
HttpResponse response; 
try {
    response = httpclient.execute(httpget); //httpclient is of type HttpClient
} catch (NullPointerException e) {
    return;//deep down in apache http sometimes throws a null pointer...  
}

For most servers it's just fine. But for others, it immediately throws a SocketException.

Example of site that causes immediate SocketException: http://www.bhphotovideo.com/

Works great (as do most websites): http://www.google.com/

Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use Apache's HttpClient, for example with code like this:

 HttpURLConnection c = (HttpURLConnection)url.openConnection();  
 BufferedInputStream in = new BufferedInputStream(c.getInputStream());  
 Reader r = new InputStreamReader(in);     

 int i;  
 while ((i = r.read()) != -1) {  
      source.append((char) i);  
 }

So, why don't I just use this code instead? Well, there are some key features in Apache's HttpClient that I need to use.

Does anyone know why some servers trigger this exception?

Research so far:

The problem occurs on my local Mac dev machines AND on an AWS EC2 instance, so it's not a local firewall.
It seems the error isn't caused by the remote machine, because the exception doesn't say "by peer".
This Stack Overflow question seems relevant: "java.net.SocketException: Connection reset". However, its answers don't explain why this would happen only with Apache HttpClient and not with other approaches.

Bonus question: I'm doing a fair amount of crawling with this system. Is there generally a better Java class for this other than Apache HTTP Client? I've found a number of issues (such as the NullPointerException I have to catch in the code above). It seems that HTTPClient is very picky about server communications -- more picky than I'd like for a crawler that can't just break when a server doesn't behave.
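
For the "can't just break when a server doesn't behave" part, one common stopgap is to wrap each fetch so that a reset is retried a few times and then skipped. Below is a minimal JDK-only sketch of that idea. The names TolerantFetch, Fetcher and fetchWithRetry are hypothetical (they are not part of HttpClient or Bixo), and the retry count and backoff values are arbitrary assumptions. It does not fix whatever makes the server reset the connection; it only keeps one bad host from stopping the crawl.

```java
import java.io.IOException;
import java.net.SocketException;

public class TolerantFetch {

    /** Hypothetical callback: performs one fetch attempt and returns the body. */
    interface Fetcher {
        String fetch() throws IOException;
    }

    /**
     * Tries the fetch up to maxAttempts times. Only SocketException (which
     * covers "Connection reset") is retried; any other IOException gives up
     * immediately. Returns null when the URL should be skipped.
     */
    static String fetchWithRetry(Fetcher fetcher, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch();
            } catch (SocketException e) {
                // Wait a little longer before each new attempt.
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis * attempt);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return null;
                    }
                }
            } catch (IOException e) {
                return null; // other I/O failure: skip this URL
            }
        }
        return null; // every attempt was reset
    }

    public static void main(String[] args) {
        // Simulated fetcher: the first two attempts are reset, the third succeeds.
        int[] calls = {0};
        String body = fetchWithRetry(() -> {
            if (++calls[0] < 3) throw new SocketException("Connection reset");
            return "<html>ok</html>";
        }, 5, 10L);
        System.out.println("attempts=" + calls[0] + " body=" + body);
    }
}
```

In a real crawler the lambda would wrap the actual request (for example the httpclient.execute(httpget) call above), and a null result would be logged against the URL so the crawl can move on.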

Thanks all!

Solution

Honestly, I don't have a perfect solution, but it works, so that's good enough for me.

As pointed out by oleg below, Bixo has created a crawler that customizes HttpClient to be more forgiving to servers. To "get around" the issue more than fix it, I just used SimpleHttpFetcher provided by Bixo here: (link removed - SO thinks I'm a spammer, so you'll have to google it yourself)

SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname","[email protected]","ENTER URL"));
try {
    FetchedResult result = fetch.fetch("ENTER URL");
    System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
    e.printStackTrace();
}

The downside to this solution is that Bixo has a lot of dependencies, so this may not be a good workaround for everyone. However, you can always work through their use of DefaultHttpClient and see how they instantiated it to get it to work. I decided to use the whole class because it handles some helpful things for me, like automatic redirect following (and reporting the final destination URL).

Thanks for the help all.

Edit: TinyBixo

Hi all. So, I loved how Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a vastly simplified Bixo, without all the dependencies. If you're running into the problems above, I would recommend using it (and feel free to make pull requests if you'd like to update it!)

It's available here: https://github.com/juliuss/TinyBixo

829

asked Mar 12 '11 04:03

nostromo

1 Answer

First, to answer your question:

The connection reset was caused by a problem on the server side. Most likely the server failed to parse the request, or was unable to process it, and dropped the connection without returning a valid response. There is likely something in the HTTP requests generated by HttpClient that causes the server-side logic to fail, probably due to a server-side bug. Just because the error message does not say 'by peer' does not mean the connection reset took place on the client side.
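
To illustrate how a server-side abort shows up on the client, here is a small self-contained sketch using only the JDK. It is a toy, not a reproduction of what bhphotovideo.com actually does: the "picky" server below is hypothetical and simply aborts any request that lacks a User-Agent header (an assumption chosen purely for the demo). The class and method names (ResetDemo, startPickyServer, probe) are made up as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ResetDemo {

    // Toy server thread: handles one connection. It replies 200 only if the
    // request carries a User-Agent header; otherwise it aborts the connection.
    static void startPickyServer(ServerSocket server) {
        Thread t = new Thread(() -> {
            try (Socket s = server.accept()) {
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        s.getInputStream(), StandardCharsets.ISO_8859_1));
                boolean hasUserAgent = false;
                String line;
                // Read the request line and headers up to the blank line.
                while ((line = in.readLine()) != null && !line.isEmpty()) {
                    if (line.regionMatches(true, 0, "User-Agent:", 0, 11)) {
                        hasUserAgent = true;
                    }
                }
                if (hasUserAgent) {
                    OutputStream out = s.getOutputStream();
                    out.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
                            .getBytes(StandardCharsets.ISO_8859_1));
                    out.flush();
                } else {
                    // Linger time 0 makes close() abort the connection (TCP RST)
                    // instead of shutting it down gracefully.
                    s.setSoLinger(true, 0);
                }
            } catch (IOException ignored) {
                // Nothing to do in this toy server.
            }
        });
        t.setDaemon(true);
        t.start();
    }

    // Sends one request to a fresh toy server and reports what the client saw:
    // the status line on success, or a "dropped: ..." description otherwise.
    static String probe(boolean sendUserAgent) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            startPickyServer(server);
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                client.setSoTimeout(5000);
                StringBuilder req = new StringBuilder("GET / HTTP/1.1\r\nHost: localhost\r\n");
                if (sendUserAgent) {
                    req.append("User-Agent: demo-bot\r\n");
                }
                req.append("\r\n");
                OutputStream out = client.getOutputStream();
                out.write(req.toString().getBytes(StandardCharsets.ISO_8859_1));
                out.flush();
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        client.getInputStream(), StandardCharsets.ISO_8859_1));
                String status = in.readLine();
                return status != null ? status : "dropped: closed without a response";
            } catch (IOException e) {
                return "dropped: " + e;
            }
        } catch (IOException e) {
            return "setup failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println("with User-Agent:    " + probe(true));
        System.out.println("without User-Agent: " + probe(false));
    }
}
```

On typical JVMs the second call reports java.net.SocketException: Connection reset, without the words "by peer", even though the reset was entirely the server's doing. The same with/without approach can be used against a real server to narrow down which part of the request it objects to.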

A few remarks:

(1) Several popular web crawlers, such as Bixo (http://openbixo.org/), use HttpClient without major issues, but pretty much all of them had to tweak HttpClient's behavior to make it more lenient about common HTTP protocol violations. By default, HttpClient is rather strict about HTTP protocol compliance.

(2) Why didn't you report the NPE problem, or any other problem you have been experiencing, to the HttpClient project?

116

answered Nov 15 '22 06:11

ok2c


Tags: java, apache, sockets, web-crawler, httpclient