 

Best java lib for http connections?

Tags:

java

Hi all, I'm writing a simple web crawling script that needs to connect to a web page, follow 302 redirects automatically, give me the final URL of the link, and let me grab the HTML.

What's the preferred java lib for doing these kinds of things?

thanks

James asked Jul 02 '10 03:07



2 Answers

You can use Apache HttpComponents Client for this (or the "plain vanilla" but verbose URLConnection API built into Java SE). For parsing, traversing, and manipulating the HTML, Jsoup may be useful.
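To keep the example dependency-free, here is a minimal sketch using only the Java SE `java.net` classes mentioned above. The `Redirects` class name and the manual redirect loop are my own illustration, not from either answer: it disables automatic redirect handling so the crawler can record each hop and report the final URL, which `HttpURLConnection` won't do across protocols (e.g. HTTP to HTTPS) on its own.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

public class Redirects {

    // Resolve a Location header (which may be relative) against the current URL.
    static String resolve(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    // Follow up to maxHops 3xx redirects by hand, then return the final URL.
    // Reading the connection's body at that point gives you the HTML to feed
    // into a parser such as Jsoup.
    static String finalUrl(String start, int maxHops) throws Exception {
        String url = start;
        for (int i = 0; i < maxHops; i++) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setInstanceFollowRedirects(false); // we track the hops ourselves
            int code = conn.getResponseCode();
            if (code < 300 || code >= 400) {
                conn.disconnect();
                return url; // not a redirect: this is the final URL
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                return url; // malformed redirect; stop here
            }
            url = resolve(url, location);
        }
        return url; // gave up after maxHops redirects
    }
}
```

With Apache HttpComponents Client the loop disappears entirely, since the client follows redirects by default and exposes the final request URI through its context.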

Note that any decent crawler should obey robots.txt. You may also want to look at existing Java-based web crawlers, such as J-Spider and Apache Nutch.
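Honoring robots.txt can be sketched in a few lines. The `RobotsTxt` class below is a hypothetical, deliberately simplified parser: it only reads `Disallow` rules from the `User-agent: *` group and does prefix matching, ignoring `Allow` rules, wildcards, and per-crawler groups that a real implementation (or a library like the one bundled with Nutch) would handle.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    // Collect Disallow rules from the group that applies to every crawler.
    RobotsTxt(String robotsTxt) {
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = line.substring(11).trim().equals("*");
            } else if (inStarGroup && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) { // an empty Disallow means "allow everything"
                    disallowed.add(path);
                }
            }
        }
    }

    // A path is allowed unless it falls under some disallowed prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch `http://host/robots.txt` once per host, build one of these, and check `isAllowed(path)` before queueing each URL.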

BalusC answered Oct 11 '22 21:10


As BalusC said, have a look at Apache's HttpComponents Client. The Nutch project has solved lots of hard crawling/fetching/indexing problems, so if you want to see how they handle following 302 redirects, have a look at http://svn.apache.org/viewvc/nutch/trunk/src/

labratmatt answered Oct 11 '22 21:10