Try using the Web-Harvest project.

Web scraping, screen scraping, data mining tips? [closed]

I'm working on a project that requires a lot of screen scraping, and I need to collect the data as fast as possible. Does anyone know of any good APIs or resources to help me out?

I'm using Java, by the way.

Here's what my workflow has been so far:

1. Connect to a website (using Apache HttpComponents).
2. The website contains a section with a bunch of links that I need to visit. I use the built-in Java HTML parsers to figure out which links those are, and the resulting code is annoying and messy.
3. Visit all the links that I found.
4. For each link that I visit, there is more data to extract, spread out over multiple pages, so I may need to visit more links.

Thoughts:

- Does anyone know of a higher-level, more intelligent HTML parser than the built-in Java one?
- Basically it's a depth-first search. I imagine I'd like to make this multithreaded at some point so I can visit some of these links in parallel.
- Maybe what I'm really looking for is a multithreaded web crawling library.

If you haven't figured it out, this is my first time messing around with this, so I'm having a difficult time articulating exactly what my needs are. I would greatly appreciate any input from those of you who have done this before.

asked Nov 02 '10 16:11

JPC

3 Answers

I've found JSoup really good for HTML parsing.

For more pointers check this article out: How to write a multi-threaded webcrawler

answered Nov 14 '22 22:11
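To make the multithreaded idea from the question a bit more concrete, here is a minimal sketch that uses only the JDK. It is not taken from the linked article. The class name `CrawlSketch`, the `linkExtractor` hook, and the `maxPages` cap are all made up for illustration. In a real crawler, `linkExtractor` would download the page (for example with HttpComponents) and pull the links out with an HTML parser such as JSoup. Note that it walks the site level by level (breadth-first) rather than depth-first, because that makes it easy to fetch a whole batch of links in parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class CrawlSketch {

    /**
     * Crawls level by level, fetching every page of a level in parallel.
     * linkExtractor is a hypothetical hook: given a URL, it fetches the page
     * and returns the links found on it.
     */
    static Set<String> crawl(String seed,
                             Function<String, List<String>> linkExtractor,
                             int threads,
                             int maxPages) {
        // Only the calling thread touches this set, so no locking is needed.
        Set<String> visited = new LinkedHashSet<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> frontier = new ArrayList<>();
            if (maxPages > 0) {
                visited.add(seed);
                frontier.add(seed);
            }
            while (!frontier.isEmpty()) {
                // One task per URL in the current level; these run in parallel.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (String url : frontier) {
                    pending.add(pool.submit(() -> linkExtractor.apply(url)));
                }
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pending) {
                    List<String> links;
                    try {
                        links = result.get();
                    } catch (ExecutionException e) {
                        continue; // skip pages that failed to download or parse
                    }
                    for (String link : links) {
                        // add() returns false for URLs that were already seen
                        if (visited.size() < maxPages && visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop early, keep what we have
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a real site so the sketch runs offline.
        Map<String, List<String>> site = Map.of(
                "/index", List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/c", List.of("/index"));
        Set<String> pages = crawl("/index", u -> site.getOrDefault(u, List.of()), 4, 100);
        System.out.println(pages); // prints [/index, /a, /b, /c]
    }
}
```

Keeping the fetch-and-parse step behind a plain `Function` means you can swap parsers without touching the threading code. The `maxPages` cap is there so a site with cyclic links cannot keep the crawl running forever.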

Tags: java, html-parsing, web-scraping, screen-scraping, data-mining

Answers by: dogbane, harshit, Boris Pavlović