We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this on a large scale (tens of millions of pages)?
We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.
So far, we have been using (don't laugh) PHP, cURL, and Simple HTML DOM Parser, but I don't think that's scalable to millions of pages, especially as PHP doesn't have proper multithreading.
We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can download millions of web pages in a reasonable amount of time. We're not really looking for a web crawler, because we don't need to follow links and index all content; we just need to extract one tag from each page on a list.
If you're really talking about large scale, then you'll probably want something that lets you scale horizontally, e.g., a MapReduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. BTW, Python is probably the language I'd use, thanks to libs like httplib2 for making the requests and lxml for parsing the results.
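A minimal sketch of that stack, with pymongo added for the MongoDB side (the URL, XPath, and database/collection names are placeholders):

```python
# A minimal sketch of the httplib2 + lxml combination, with pymongo for
# the MongoDB side. The URL, XPath, and database/collection names are
# placeholders.
import httplib2
from lxml import html
from pymongo import MongoClient

http = httplib2.Http()
collection = MongoClient().scraper.pages  # hypothetical db/collection

def scrape(url):
    response, content = http.request(url, "GET")
    if response.status != 200:
        return
    doc = html.fromstring(content)
    nodes = doc.xpath("//div[@id='target']")  # the tag you're after
    if nodes:
        collection.insert_one({"url": url, "value": nodes[0].text_content()})

scrape("http://example.com/page")
```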
If a MapReduce framework is overkill, you could keep it in Python and use multiprocessing.
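A minimal sketch of fanning the fetch-and-extract work out over a process pool (the URL list, pool size, and XPath are placeholders):

```python
# A minimal sketch of parallel fetch-and-extract with multiprocessing;
# the URL list, pool size, and XPath are placeholders.
from multiprocessing import Pool
import httplib2
from lxml import html

def fetch_and_extract(url):
    # Build the Http object inside the worker so each process has its own.
    response, content = httplib2.Http().request(url, "GET")
    nodes = html.fromstring(content).xpath("//div[@id='target']")
    return url, nodes[0].text_content() if nodes else None

if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b"]
    with Pool(processes=8) as pool:
        for url, value in pool.map(fetch_and_extract, urls):
            print(url, value)
```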
UPDATE:
If you don't want a MapReduce framework and you prefer a different language, check out ThreadPoolExecutor in Java. I would definitely use the Apache Commons HTTP client, though; the HTTP support in the JDK proper is far less programmer-friendly.
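Python's concurrent.futures offers the same thread-pool pattern, so for consistency with the sketches above, here's the shape of it in Python rather than Java (the URLs are placeholders; the Java ThreadPoolExecutor version follows the same structure):

```python
# A minimal sketch of the thread-pool pattern, shown with Python's
# concurrent.futures; the Java ThreadPoolExecutor has the same shape.
# URLs are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed
import httplib2

def fetch(url):
    # One Http object per call, since httplib2.Http isn't thread-safe.
    response, content = httplib2.Http().request(url, "GET")
    return url, response.status

urls = ["http://example.com/a", "http://example.com/b"]

with ThreadPoolExecutor(max_workers=32) as executor:
    futures = [executor.submit(fetch, u) for u in urls]
    for future in as_completed(futures):
        print(*future.result())
```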
You should probably use tools designed for testing web applications, such as WatiN or Selenium.
You can then compose your workflow, keeping it separate from the data, using a tool I've written:
https://github.com/leblancmeneses/RobustHaven.IntegrationTests
You shouldn't have to do any manual parsing when using WatiN or Selenium; instead, you'll write a CSS selector.
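Selenium also ships Python bindings, so a minimal sketch of the selector approach (the URL and CSS selector are placeholders) looks like:

```python
# A minimal sketch of CSS-selector extraction via Selenium's Python
# bindings; the URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("http://example.com/page")
    element = driver.find_element(By.CSS_SELECTOR, "div#target")
    print(element.text)
finally:
    driver.quit()
```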
Using TopShelf and NServiceBus, you can scale the number of workers horizontally.
FYI: with Mono, the tools I mention can run on Linux (although your mileage may vary).
If JavaScript doesn't need to be evaluated to load data dynamically: anything that requires loading the whole document into memory is going to waste time. If you know where your tag is, all you need is a SAX-style parser.
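In Python terms, the standard library's html.parser gives you exactly that event-driven style; a minimal sketch that pulls one tag without building a tree (the tag name is just an example):

```python
# A minimal sketch of event-driven (SAX-style) extraction with the
# standard library; pulls out <title> without building a DOM.
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.capturing = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing and self.value is None:
            self.value = data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.capturing = False

extractor = TagExtractor("title")
extractor.feed("<html><head><title>Hello</title></head><body></body></html>")
print(extractor.value)  # -> Hello
```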
I do something similar using Java with the Commons HttpClient library, although I avoid a DOM parser because I'm looking for a specific tag that can be found easily with a regex. The slowest part of the operation is making the HTTP requests.
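As a sketch of that regex approach in Python, for consistency with the examples above (safe only when the target tag is simple and never nested):

```python
# A minimal sketch of the regex-instead-of-DOM approach; assumes the
# target tag is simple and never nested, which is when this is safe.
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(page_html):
    match = TITLE_RE.search(page_html)
    return match.group(1).strip() if match else None

print(extract_title("<html><head><title> Example </title></head></html>"))
# -> Example
```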