
What technology for large scale scraping/parsing? [closed]

We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this on a large scale (tens of millions of pages)?

We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.

So far we have been using (don't laugh) PHP, curl, and Simple HTML DOM Parser, but I don't think that's scalable to millions of pages, especially since PHP doesn't have proper multithreading.

We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can download millions of webpages in a reasonable amount of time. We're not really looking for a web crawler, because we don't need to follow links and index all content; we just need to extract one tag from each page on a list.

asked Jun 29 '10 by Jonathan Knight


3 Answers

If you're really talking about large scale, then you'll probably want something that lets you scale horizontally, e.g., a Map-Reduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. BTW, Python is probably the language I'd use anyway, thanks to libraries like httplib2 for making the requests and lxml for parsing the results.
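
A minimal sketch of that httplib2 + lxml combination, assuming the goal is to pull the text of a single tag per page (the URL and XPath below are placeholders):

    import httplib2
    from lxml import html

    def extract_tag(url, xpath="//title/text()"):
        # One client per call keeps the sketch simple; reuse the client in real code.
        http = httplib2.Http(timeout=30)
        response, content = http.request(url, "GET")
        if response.status != 200:
            return None
        tree = html.fromstring(content)
        matches = tree.xpath(xpath)
        return matches[0].strip() if matches else None

    print(extract_tag("https://example.com/"))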

If a Map-Reduce framework is overkill, you could keep it in Python and use multiprocessing.
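
A rough sketch of the multiprocessing route, with the worker left as a placeholder for whatever fetch-and-parse function you settle on (e.g. the httplib2/lxml one above):

    from multiprocessing import Pool

    def extract_tag(url):
        # Placeholder worker: fetch the page and return the tag's text,
        # e.g. with the httplib2/lxml function sketched above.
        return None

    if __name__ == "__main__":
        urls = ["https://example.com/page1", "https://example.com/page2"]  # your page list
        with Pool(processes=16) as pool:  # tune to cores and bandwidth
            for url, value in zip(urls, pool.map(extract_tag, urls)):
                print(url, value)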

UPDATE: If you don't want a MapReduce framework, and you prefer a different language, check out the ThreadPoolExecutor in Java. I would definitely use the Apache Commons HTTP client, though; the HTTP support in the JDK proper is far less programmer-friendly.
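
To keep these sketches in one language, here is the same bounded-pool-of-workers pattern the answer describes, shown with Python's concurrent.futures.ThreadPoolExecutor rather than the Java classes it names (the URL list is a placeholder):

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()

    urls = ["https://example.com/a", "https://example.com/b"]  # your page list
    with ThreadPoolExecutor(max_workers=32) as executor:  # tune to bandwidth
        futures = [executor.submit(fetch, u) for u in urls]
        for future in as_completed(futures):
            url, body = future.result()
            print(url, len(body))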

answered Nov 18 '22 by Hank Gay


You should probably use the tools built for testing web applications, such as WatiN or Selenium.

You can then compose your workflow, separate from the data, using a tool I've written:

https://github.com/leblancmeneses/RobustHaven.IntegrationTests

You shouldn't have to do any manual parsing when using WatiN or Selenium; instead, you write a CSS selector (querySelector).
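
This answer's stack is .NET (WatiN, TopShelf, NServiceBus), but Selenium also has Python bindings; a minimal sketch of the CSS-selector approach, with a placeholder URL and selector:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # any WebDriver-backed browser works
    try:
        driver.get("https://example.com/page")                       # placeholder URL
        element = driver.find_element(By.CSS_SELECTOR, "div.price")  # placeholder selector
        print(element.text)
    finally:
        driver.quit()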

Using TopShelf and NServiceBus, you can scale the number of workers horizontally.

FYI: with Mono, the tools I mention can run on Linux (although your mileage may vary).

If JavaScript doesn't need to be evaluated to load data dynamically, then anything that requires loading the whole document into memory is going to waste time. If you know where your tag is, all you need is a SAX parser.
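
A sketch of that streaming idea using Python's built-in html.parser, which is event-driven like SAX, so the page is never built into an in-memory DOM (the target tag here is just an example):

    from html.parser import HTMLParser

    class TagExtractor(HTMLParser):
        def __init__(self, target_tag):
            super().__init__()
            self.target_tag = target_tag
            self.capturing = False
            self.captured = []

        def handle_starttag(self, tag, attrs):
            if tag == self.target_tag:
                self.capturing = True

        def handle_endtag(self, tag):
            if tag == self.target_tag:
                self.capturing = False

        def handle_data(self, data):
            if self.capturing:
                self.captured.append(data)

    parser = TagExtractor("title")  # example tag
    parser.feed("<html><head><title>Hello</title></head><body>...</body></html>")
    print("".join(parser.captured))  # -> Hello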

answered Nov 18 '22 by Leblanc Meneses


I do something similar in Java with the Commons HttpClient library, although I avoid a DOM parser because I'm looking for a specific tag that can be found easily with a regex.
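
The answer does this in Java with Commons HttpClient; the same idea sketched in Python, with the tag and URL as placeholders:

    import re
    import urllib.request

    TAG_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    with urllib.request.urlopen("https://example.com/") as resp:  # placeholder URL
        body = resp.read().decode("utf-8", errors="replace")

    match = TAG_RE.search(body)
    print(match.group(1).strip() if match else None)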

The slowest part of the operation is making the HTTP requests.

answered Nov 18 '22 by Quotidian