I am planning to use web crawling in an application I am currently working on. I did some research on Nutch and ran some preliminary tests with it. But then I came across Scrapy. When I did some preliminary research and went through the Scrapy documentation, I found that it can capture only structured data (you have to give the div name from which you want to capture data). The backend of the application I am developing is based on Python, and I understand Scrapy is Python-based; some have suggested that Scrapy is better than Nutch.
My requirement is to capture data from more than 1,000 different webpages and then search that information for relevant keywords. Is there any way Scrapy can satisfy this requirement?
1) If yes, can you point out an example of how it can be done?
2) Or is Nutch + Solr better suited for my requirement?
Scrapy would work perfectly in your case.
You are not required to give div names - you can get anything you want:
Scrapy comes with its own mechanism for extracting data. They’re called XPath selectors (or just “selectors”, for short) because they “select” certain parts of the HTML document specified by XPath expressions.
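For instance, a minimal sketch of a spider that uses an XPath expression to pull all of the visible text from a page and check it for keywords (the spider name, start_urls, and KEYWORDS list below are placeholders you would replace with your own):

```python
import scrapy

KEYWORDS = ["python", "crawling"]  # example keywords - adjust to your needs


class KeywordSpider(scrapy.Spider):
    name = "keyword_spider"
    # Replace with (or generate) your list of ~1000 pages
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # "//body//text()" selects every text node on the page,
        # not just the contents of one particular div
        page_text = " ".join(response.xpath("//body//text()").getall()).lower()
        matched = [kw for kw in KEYWORDS if kw in page_text]
        if matched:
            yield {"url": response.url, "keywords": matched}
```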
Plus, you can use BeautifulSoup and lxml for extracting the data from the page content.
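For example, a sketch of a callback that hands the raw HTML to BeautifulSoup (with lxml as the parser) instead of using Scrapy's selectors; it assumes the beautifulsoup4 and lxml packages are installed, and the keyword check is again just an illustrative placeholder:

```python
import scrapy
from bs4 import BeautifulSoup


class SoupSpider(scrapy.Spider):
    name = "soup_spider"
    start_urls = ["https://example.com/"]  # placeholder URL

    def parse(self, response):
        # Parse the downloaded HTML with BeautifulSoup, using lxml underneath
        soup = BeautifulSoup(response.text, "lxml")
        text = soup.get_text(separator=" ").lower()
        if "relevant keyword" in text:  # replace with your own keyword logic
            yield {"url": response.url}
```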
Besides, Scrapy is based on Twisted and is completely asynchronous and fast.
There are plenty of example Scrapy spiders here on SO - just look through the questions under the scrapy tag. If you have a more specific question, just ask.
Hope that helps.