Web scraping, screen scraping, data mining tips? [closed]

I'm working on a project and I need to do a lot of screen scraping to get a lot of data as fast as possible. I'm wondering if anyone knows of any good APIs or resources to help me out.

I'm using Java, by the way.

Here's what my workflow has been so far:

  1. Connect to a website (using HttpComponents from Apache)
  2. The website contains a section with a bunch of links that I need to visit (I use the built-in Java HTML parsers to work out which links those are; this is annoying and messy code)
  3. Visit all the links that I found
  4. For each link that I visit, there's more data that I need to extract, spread out over multiple pages, so I may need to visit more links
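
The four steps above amount to a traversal with a "visited" set. Here is a minimal sketch of that loop, under stated assumptions: the class name `DepthFirstCrawl`, the `maxPages` cap, and the `linksOf` function are all hypothetical and not part of the original workflow. `linksOf` stands in for "download the page and parse out its links", whichever HTTP client and HTML parser end up doing that.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Sequential depth-first traversal of the pages reachable from a start URL. */
final class DepthFirstCrawl {

    /**
     * Returns the visited URLs in visit order.
     * linksOf is a hypothetical stand-in for "fetch the page and extract its links";
     * maxPages is an assumed safety cap so the crawl cannot run forever.
     */
    static List<String> crawl(String start, int maxPages,
                              Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);

        while (!stack.isEmpty() && visited.size() < maxPages) {
            String url = stack.pop();
            // Set.add returns false for a URL seen before, which breaks link cycles.
            if (!visited.add(url)) {
                continue;
            }
            List<String> links = linksOf.apply(url);
            // Push in reverse so the first link on the page is explored first.
            for (int i = links.size() - 1; i >= 0; i--) {
                stack.push(links.get(i));
            }
        }
        return new ArrayList<>(visited);
    }
}
```

An explicit stack is used instead of recursion so that a long chain of pages cannot overflow the call stack. Because the fetching is passed in as a function, the traversal can be tried on an in-memory map of links before it touches a real site.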

Thoughts:

  • Does anyone know of any higher-level/more intelligent HTML parsers than the built-in Java one?
  • Basically it's a depth-first search. I imagine I would like to make this multithreaded at some point so I can visit some of these links in parallel.
  • Maybe what I'm really looking for is a multithreaded web crawling library.
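
To make the multithreading idea concrete, here is a rough sketch of a parallel crawl using only `java.util.concurrent`. It is one possible shape, not a recommendation over a dedicated crawling library. The names `ParallelCrawl` and `linksOf` are hypothetical; `linksOf` again stands in for "fetch the page and return the links found on it". Note that once pages are fetched in parallel, the visit order is no longer strictly depth-first.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Visits every page reachable from a start URL, fetching pages on a thread pool. */
final class ParallelCrawl {

    /** Blocks until the whole crawl has finished and returns the set of visited URLs. */
    static Set<String> crawl(String start, int threads,
                             Function<String, List<String>> linksOf) {
        // Thread-safe set: add() is atomic, so each URL is claimed by exactly one task.
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            visit(start, linksOf, visited, pool).join();
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    private static CompletableFuture<Void> visit(String url,
                                                 Function<String, List<String>> linksOf,
                                                 Set<String> visited,
                                                 ExecutorService pool) {
        if (!visited.add(url)) {
            // Another task already claimed this URL.
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
                // Fetch and parse the page on a pool thread.
                .supplyAsync(() -> linksOf.apply(url), pool)
                // Then schedule every link found; finish when all of them have finished.
                .thenCompose(links -> CompletableFuture.allOf(
                        links.stream()
                             .map(link -> visit(link, linksOf, visited, pool))
                             .toArray(CompletableFuture[]::new)));
    }
}
```

No pool thread ever waits on another task here; each task only schedules follow-up work, so a small fixed pool cannot deadlock on itself. If `linksOf` throws for some page, `join()` rethrows that failure wrapped in a `CompletionException`.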

If you haven't figured it out, this is my first time messing around with this, so I'm having a difficult time articulating exactly what my needs are. I would greatly appreciate any input from those of you who have done this before.

JPC asked Nov 02 '10 16:11


3 Answers

I've found JSoup really good for HTML parsing.

For more pointers, check out this article: How to write a multi-threaded webcrawler

dogbane answered Nov 14 '22 22:11


I used Bixo to extract hyperlinks and images with a depth-first search. It is built on top of Hadoop and Cascading, so there is a learning curve, but the provided example is good enough to configure the changes ...

harshit answered Nov 14 '22 21:11


Try the Web-Harvest project.

Boris Pavlović answered Nov 14 '22 23:11