hi all, I'm writing a simple web crawling script that needs to connect to a webpage, follow 302 redirects automatically, give me the final URL of the link, and let me grab the HTML.
What's the preferred Java lib for doing these kinds of things?
thanks
Is there any way to use it in Java 8? No, because the jdk.incubator.http module was only added in Java 9 (and was standardized as java.net.http in Java 11).
Once created, an HttpClient instance is immutable, thus automatically thread-safe, and you can send multiple requests with it. By default, the client tries to open an HTTP/2 connection. If the server answers with HTTP/1.1, the client automatically falls back to this version.
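For example, here's a minimal sketch using the standardized java.net.http API (Java 11+; the URL is a placeholder for whatever page you're crawling) that follows redirects and reports the final URL and HTML:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchFinalUrl {
    public static void main(String[] args) throws Exception {
        // Reusable, thread-safe client that follows redirects (302 included).
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com")) // placeholder URL
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // response.uri() is the final URI after all redirects were followed.
        System.out.println("Final URL: " + response.uri());
        System.out.println(response.body()); // the HTML
    }
}
```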
This library provides methods to handle the web requests typical apps make, such as calls to RESTful APIs and image downloads. It also simplifies management of the HTTP lifecycle by providing asynchronous calls, transparent HTTP caching, automatic scheduling of network requests, and request prioritization.
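The answer doesn't name the library in this excerpt; the feature list reads like OkHttp's, so assuming OkHttp is meant (an assumption on my part), a minimal sketch with a placeholder URL might look like:

```java
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class OkHttpFetch {
    public static void main(String[] args) throws Exception {
        // Assumption: the library described above is OkHttp.
        // OkHttp follows redirects (including 302s) by default.
        OkHttpClient client = new OkHttpClient();

        Request request = new Request.Builder()
                .url("https://example.com") // placeholder URL
                .build();

        try (Response response = client.newCall(request).execute()) {
            // The request attached to the response carries the final URL.
            System.out.println("Final URL: " + response.request().url());
            System.out.println(response.body().string()); // the HTML
        }
    }
}
```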
You can use Apache HttpComponents Client for this (or the "plain vanilla" but verbose URLConnection API built into Java SE). For the HTML parsing/traversing/manipulation part, Jsoup may be useful.
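Conveniently, Jsoup can also do the fetching itself; a minimal sketch (placeholder URL; Jsoup follows redirects by default) that yields both the final URL and the parsed HTML:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) throws Exception {
        // execute() returns the raw response, so the post-redirect URL is visible.
        Connection.Response response = Jsoup.connect("https://example.com") // placeholder URL
                .followRedirects(true) // the default, shown here for clarity
                .execute();

        System.out.println("Final URL: " + response.url());

        // Parse the body into a Document for traversing/manipulation.
        Document doc = response.parse();
        System.out.println("Title: " + doc.title());
    }
}
```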
Note that any decent crawler should obey robots.txt. You may also want to take a look at existing Java-based web crawlers, like J-Spider and Apache Nutch.
As BalusC said, have a look at Apache's HttpComponents Client. The Nutch project has solved lots of hard crawling/fetching/indexing problems, so if you want to see how they handle following 302s, have a look at http://svn.apache.org/viewvc/nutch/trunk/src/
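For reference, here's a rough sketch of redirect handling with HttpComponents Client, assuming the 4.x API (the URL is again a placeholder):

```java
import java.net.URI;
import java.util.List;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ApacheFetch {
    public static void main(String[] args) throws Exception {
        // The default client follows 301/302 redirects for GET automatically.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpClientContext context = HttpClientContext.create();
            HttpGet get = new HttpGet("https://example.com"); // placeholder URL

            try (CloseableHttpResponse response = client.execute(get, context)) {
                String html = EntityUtils.toString(response.getEntity());

                // The redirect chain lives in the context; the last entry is the
                // final location (null/empty means no redirect happened).
                List<URI> redirects = context.getRedirectLocations();
                URI finalUrl = (redirects == null || redirects.isEmpty())
                        ? get.getURI()
                        : redirects.get(redirects.size() - 1);

                System.out.println("Final URL: " + finalUrl);
                System.out.println(html);
            }
        }
    }
}
```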