I'm working on a project and need to do a lot of screen scraping to collect a large amount of data as fast as possible. Does anyone know of any good APIs or resources to help me out?
I'm using Java, by the way.
Here's what my workflow has been so far:
Thoughts:
If you haven't figured it out, this is my first time messing around with this, so I'm having a difficult time articulating exactly what my needs are. I would greatly appreciate input from anyone who has done this before.
Data scraping, a close relative of screen scraping, is a technique in which a program extracts structured, human-readable data from documents and web applications. It is mostly used to exchange data with a legacy system and make that data readable by modern applications.
Websites detect web crawlers and scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If a site finds the traffic suspicious, you start receiving CAPTCHAs, and eventually your requests are blocked because your crawler has been detected.
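As a rough illustration of the user-agent and behavior points above, here is a minimal Java sketch (Java 11+, using `java.net.http`) that builds a request with an explicit User-Agent and works out how long to wait before the next request. The class name `PoliteRequests`, the user-agent string, and the spacing helper are all assumptions made up for this example, and nothing here guarantees that a site will accept your traffic.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class PoliteRequests {
    // Hypothetical identifying string; replace with your own project and contact details.
    static final String USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)";

    /** Builds a GET request that identifies the client and times out after 10 seconds. */
    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /**
     * Milliseconds still to wait so that two requests to the same host
     * are at least minGapMillis apart; 0 if enough time has already passed.
     */
    static long remainingDelay(long lastRequestMillis, long nowMillis, long minGapMillis) {
        return Math.max(0, minGapMillis - (nowMillis - lastRequestMillis));
    }
}
```

You would pass the built request to `HttpClient.send(...)` and sleep for `remainingDelay(...)` before hitting the same host again. Spacing requests out trades raw speed for a lower chance of being blocked.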
Web scraping refers to the process of extracting content and data from websites using software. For example, most price comparison services use web scrapers to read price information from several online stores. Another example is Google, which routinely scrapes or “crawls” the web to index websites.
I've found JSoup really good for HTML parsing.
For more pointers, check out this article: How to write a multi-threaded webcrawler
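To give a feel for what a multi-threaded crawler involves, here is a minimal sketch using only the JDK. It is not taken from the linked article. The class `SimpleCrawler`, the `fetcher` parameter (a function from URL to HTML that you would back with a real HTTP client), and the regex-based link extraction are all assumptions for illustration; a proper HTML parser is much more robust than the regex used here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleCrawler {
    // Crude matcher for double-quoted href attributes; illustration only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Returns the href values found in the given HTML, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Breadth-first crawl: fetches maxDepth levels of pages starting at seed,
     * fetching each level in parallel on a fixed thread pool. Returns every URL
     * discovered; links found on the last fetched level are recorded but not fetched.
     */
    static Set<String> crawl(String seed, int maxDepth, int threads,
                             Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        visited.add(seed);
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> extractLinks(fetcher.apply(url)));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until every page on this level has been processed.
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (visited.add(link)) { // only queue URLs not seen before
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return visited;
    }
}
```

Only the worker tasks run concurrently; the visited set is touched by the calling thread alone, so it needs no locking. Relative links are not resolved against the page URL in this sketch, which a real crawler would have to do.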
I used Bixo to extract hyperlinks and images while doing a depth search. It is built on top of Hadoop and Cascading, so there is a learning curve, but the example provided is good enough to configure the changes ...
Try the Web-Harvest project.