
Scrapy vs Nutch [closed]

I am planning to use web crawling in an application I am currently working on. I did some research on Nutch and ran some preliminary tests using it. But then I came across Scrapy. When I did some preliminary research and went through the Scrapy documentation, I found that it can capture only structured data (you have to give the name of the div from which you want to capture data). The backend of the application I am developing is based on Python, and I understand Scrapy is Python-based, and some have suggested that Scrapy is better than Nutch.

My requirement is to capture data from more than 1000 different webpages and search that information for relevant keywords. Is there any way Scrapy can satisfy this requirement?

1) If yes, can you point out an example of how it can be done?

2) Or is Nutch + Solr better suited to my requirement?

Vidhu asked Jun 19 '13 19:06


1 Answer

Scrapy would work perfectly in your case.

You are not required to give div names - you can get anything you want:

Scrapy comes with its own mechanism for extracting data. They’re called XPath selectors (or just “selectors”, for short) because they “select” certain parts of the HTML document specified by XPath expressions.

Plus, you can use BeautifulSoup and lxml for extracting the data from the page content.

Besides, Scrapy is based on Twisted and is completely asynchronous and fast.

There are plenty of examples of Scrapy spiders here on SO - just look through the questions under the scrapy tag. If you have a more specific question - just ask.

Hope that helps.

alecxe answered Oct 07 '22 13:10