hi all, I'm writing a simple web crawling script that needs to connect to a webpage, follow 302 redirects automatically, give me the final URL of the link, and let me grab the HTML.
What's the preferred Java lib for doing these kinds of things?
thanks
Is there any way to use it in Java 8? No, because the jdk.incubator.http module was only added in Java 9 (and was standardized as java.net.http in Java 11).
Once created, an HttpClient instance is immutable, thus automatically thread-safe, and you can send multiple requests with it. By default, the client tries to open an HTTP/2 connection. If the server answers with HTTP/1.1, the client automatically falls back to this version.
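For example, here's a minimal sketch using the standardized java.net.http API (Java 11+; the URL is a placeholder for whatever page you're crawling) that follows redirects and reports the final URL and HTML:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchFinalUrl {
    public static void main(String[] args) throws Exception {
        // Reusable, thread-safe client that follows redirects (302 included).
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com")) // placeholder URL
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // response.uri() is the final URI after all redirects were followed.
        System.out.println("Final URL: " + response.uri());
        System.out.println(response.body()); // the HTML
    }
}
```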
This library provides methods to handle the web requests typical apps make, such as calls to RESTful APIs and image downloads. It also simplifies management of the HTTP lifecycle by providing asynchronous calls, transparent HTTP caching, automatic scheduling of network requests, and request prioritization.
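The answer doesn't name the library in this excerpt; the feature list reads like OkHttp's, so assuming OkHttp is meant (an assumption on my part), a minimal sketch with a placeholder URL might look like:

```java
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class OkHttpFetch {
    public static void main(String[] args) throws Exception {
        // Assumption: the library described above is OkHttp.
        // OkHttp follows redirects (including 302s) by default.
        OkHttpClient client = new OkHttpClient();

        Request request = new Request.Builder()
                .url("https://example.com") // placeholder URL
                .build();

        try (Response response = client.newCall(request).execute()) {
            // The request attached to the response carries the final URL.
            System.out.println("Final URL: " + response.request().url());
            System.out.println(response.body().string()); // the HTML
        }
    }
}
```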
You can use Apache HttpComponents Client for this (or the "plain vanilla" but verbose URLConnection API built into Java SE). For the HTML parsing/traversing/manipulation part, Jsoup may be useful.
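Conveniently, Jsoup can also do the fetching itself; a minimal sketch (placeholder URL; Jsoup follows redirects by default) that yields both the final URL and the parsed HTML:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) throws Exception {
        // execute() returns the raw response, so the post-redirect URL is visible.
        Connection.Response response = Jsoup.connect("https://example.com") // placeholder URL
                .followRedirects(true) // the default, shown here for clarity
                .execute();

        System.out.println("Final URL: " + response.url());

        // Parse the body into a Document for traversing/manipulation.
        Document doc = response.parse();
        System.out.println("Title: " + doc.title());
    }
}
```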
Note that any decent crawler should obey robots.txt. You may also want to take a look at existing Java-based web crawlers, like J-Spider and Apache Nutch.
As BalusC said, have a look at Apache's HttpComponents Client. The Nutch project has solved lots of hard crawling/fetching/indexing problems, so if you want to see how they handle following 302s, have a look at http://svn.apache.org/viewvc/nutch/trunk/src/
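For reference, here's a rough sketch of redirect handling with HttpComponents Client, assuming the 4.x API (the URL is again a placeholder):

```java
import java.net.URI;
import java.util.List;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ApacheFetch {
    public static void main(String[] args) throws Exception {
        // The default client follows 301/302 redirects for GET automatically.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpClientContext context = HttpClientContext.create();
            HttpGet get = new HttpGet("https://example.com"); // placeholder URL

            try (CloseableHttpResponse response = client.execute(get, context)) {
                String html = EntityUtils.toString(response.getEntity());

                // The redirect chain lives in the context; the last entry is the
                // final location (null/empty means no redirect happened).
                List<URI> redirects = context.getRedirectLocations();
                URI finalUrl = (redirects == null || redirects.isEmpty())
                        ? get.getURI()
                        : redirects.get(redirects.size() - 1);

                System.out.println("Final URL: " + finalUrl);
                System.out.println(html);
            }
        }
    }
}
```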