I am making a crawler, and I need to get the data from the stream regardless of whether it is a 200 or not. curl does this, as does any standard browser.
The following will not actually get the content of the request, even though there is some; an exception is thrown with the HTTP error status code. I want the output regardless: is there a way? I prefer to use this library because it does persistent connections, which is perfect for the kind of crawling I am doing.
    package test;

    import java.net.*;
    import java.io.*;

    public class Test {
        public static void main(String[] args) {
            try {
                URL url = new URL("http://github.com/XXXXXXXXXXXXXX");
                URLConnection connection = url.openConnection();
                DataInputStream inStream = new DataInputStream(connection.getInputStream());
                String inputLine;
                while ((inputLine = inStream.readLine()) != null) {
                    System.out.println(inputLine);
                }
                inStream.close();
            } catch (MalformedURLException me) {
                System.err.println("MalformedURLException: " + me);
            } catch (IOException ioe) {
                System.err.println("IOException: " + ioe);
            }
        }
    }
Worked, thanks. Here is what I came up with, just as a rough proof of concept:
    import java.net.*;
    import java.io.*;

    public class Test {
        public static void main(String[] args) {
            URLConnection connection = null;
            String inputLine = "";
            try {
                URL url = new URL("http://verelo.com/asdfrwdfgdg");
                connection = url.openConnection();
                DataInputStream inStream = new DataInputStream(connection.getInputStream());
                while ((inputLine = inStream.readLine()) != null) {
                    System.out.println(inputLine);
                }
                inStream.close();
            } catch (MalformedURLException me) {
                System.err.println("MalformedURLException: " + me);
            } catch (IOException ioe) {
                System.err.println("IOException: " + ioe);
                // The body of a 4xx/5xx response is on the error stream,
                // which can be null if the server sent no body.
                InputStream error = ((HttpURLConnection) connection).getErrorStream();
                if (error != null) {
                    try {
                        int data = error.read();
                        while (data != -1) {
                            inputLine = inputLine + (char) data;
                            data = error.read();
                        }
                    } catch (IOException ex) {
                        // nothing more to read; fall through to close
                    } finally {
                        try {
                            error.close();
                        } catch (IOException e) {
                            // ignore
                        }
                    }
                }
            }
            System.out.println(inputLine);
        }
    }
URLConnection is the base class. HttpURLConnection is a derived class which you can use when you need the extra API and you are dealing with HTTP or HTTPS only. HttpsURLConnection is a 'more derived' class which you can use when you need the 'more extra' API and you are dealing with HTTPS only.
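As a rough sketch of that hierarchy (the URL and the class name Hierarchy are placeholders of mine, not from the question): since HttpsURLConnection extends HttpURLConnection, test for the more derived type first:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    import javax.net.ssl.HttpsURLConnection;

    public class Hierarchy {
        public static void main(String[] args) throws IOException {
            URLConnection conn = new URL("https://example.com/").openConnection(); // placeholder URL

            if (conn instanceof HttpsURLConnection) {
                // HTTPS only: adds TLS-specific API such as getCipherSuite()
                HttpsURLConnection https = (HttpsURLConnection) conn;
                https.connect();
                System.out.println("Cipher suite: " + https.getCipherSuite());
            } else if (conn instanceof HttpURLConnection) {
                // HTTP or HTTPS: adds HTTP-specific API such as getResponseCode()
                HttpURLConnection http = (HttpURLConnection) conn;
                System.out.println("Status: " + http.getResponseCode());
            }
        }
    }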
A URLConnection with support for HTTP-specific features. See the spec for details. Each HttpURLConnection instance is used to make a single request but the underlying network connection to the HTTP server may be transparently shared by other instances.
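A minimal sketch of what that transparent sharing means in practice (example.com is a placeholder): use one HttpURLConnection per request, and read each body to EOF and close it so the JVM's keep-alive cache can reuse the underlying socket. The standard "http.keepAlive" and "http.maxConnections" system properties control the cache.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class KeepAliveDemo {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 2; i++) {
                // A fresh instance per request; the socket underneath may be reused.
                HttpURLConnection conn =
                        (HttpURLConnection) new URL("http://example.com/").openConnection();
                try (InputStream in = conn.getInputStream()) {
                    byte[] buf = new byte[8192];
                    // Drain the body fully; an unread body can prevent socket reuse.
                    while (in.read(buf) != -1) { }
                }
                // No conn.disconnect() here: disconnecting signals that the
                // connection is not needed again and may close the socket.
            }
        }
    }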
The URLConnection class in Java is an abstract class that represents a connection to a resource specified by a URL. It is part of the java.net package.
Simple:
    URLConnection connection = url.openConnection();
    InputStream is;
    if (connection instanceof HttpURLConnection) {
        HttpURLConnection httpConn = (HttpURLConnection) connection;
        int statusCode = httpConn.getResponseCode();
        if (statusCode != 200 /* or statusCode >= 200 && statusCode < 300 */) {
            // getInputStream() would throw here; the body is on the error stream
            is = httpConn.getErrorStream();
        } else {
            is = httpConn.getInputStream();
        }
    } else {
        is = connection.getInputStream();
    }
You can refer to the Javadoc for an explanation. The way I would handle this is as follows:
    URLConnection connection = url.openConnection();
    InputStream is = null;
    try {
        is = connection.getInputStream();
    } catch (IOException ioe) {
        if (connection instanceof HttpURLConnection) {
            HttpURLConnection httpConn = (HttpURLConnection) connection;
            int statusCode = httpConn.getResponseCode();
            if (statusCode != 200) {
                // Note: getErrorStream() returns null if the server sent no body
                is = httpConn.getErrorStream();
            }
        }
    }
You need to do the following after calling openConnection:

1. Cast the URLConnection to HttpURLConnection.
2. Call getResponseCode.
3. If the response is a success, use getInputStream; otherwise use getErrorStream.

(The test for success should be 200 <= code < 300, because there are valid HTTP success codes apart from 200.) The sketch below puts those three steps together.
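A minimal sketch of those steps (the class and method names are mine, and the URL is a placeholder):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchAnyStatus {
        public static InputStream open(String address) throws Exception {
            URL url = new URL(address);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // 1. cast
            int code = conn.getResponseCode();                                 // 2. status code
            return (code >= 200 && code < 300)                                 // 3. choose stream
                    ? conn.getInputStream()
                    : conn.getErrorStream(); // may be null if the server sent no body
        }
    }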
I am making a crawler, and need to get the data from the stream regardless if it is a 200 or not.
Just be aware that if the code is a 4xx or 5xx, then the "data" is likely to be an error page of some kind.
The final point that should be made is that you should always respect the "robots.txt" file ... and read the Terms of Service before crawling / scraping the content of a site whose owners might care. Simply blatting off GET requests is likely to annoy site owners ... unless you've already come to some sort of "arrangement" with them.
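If it helps, here is a deliberately naive sketch of fetching robots.txt before crawling; a real crawler should use a proper robots.txt parser, and the host here is a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            URL robots = new URL("http://example.com/robots.txt"); // placeholder host
            HttpURLConnection conn = (HttpURLConnection) robots.openConnection();
            if (conn.getResponseCode() == 200) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // Naive scan: print the paths the site asks crawlers to avoid.
                        if (line.trim().startsWith("Disallow:")) {
                            System.out.println(line.trim());
                        }
                    }
                }
            }
        }
    }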