I am planning to use web crawling in an application I am currently working on. I did some research on Nutch and ran some preliminary tests with it. But then I came across Scrapy. When I did some preliminary research and went through the Scrapy documentation, I found that it can capture only structured data (you have to give the div name from which you want to capture data). The backend of the application I am developing is based on Python, and I understand Scrapy is Python-based; some have suggested that Scrapy is better than Nutch.
My requirement is to capture data from more than 1,000 different webpages and then search that information for relevant keywords. Is there any way Scrapy can satisfy this requirement?
1) If yes, can you point out an example of how it can be done?
2) Or is Nutch + Solr better suited for my requirement?
Scrapy would work perfectly in your case.
You are not required to give div names - you can get anything you want:
Scrapy comes with its own mechanism for extracting data. They’re called XPath selectors (or just “selectors”, for short) because they “select” certain parts of the HTML document specified by XPath expressions.
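For instance, a minimal sketch of a spider that uses an XPath expression to pull all of the visible text from a page and check it for keywords (the spider name, start_urls, and KEYWORDS list below are placeholders you would replace with your own):

```python
import scrapy

KEYWORDS = ["python", "crawling"]  # example keywords - adjust to your needs


class KeywordSpider(scrapy.Spider):
    name = "keyword_spider"
    # Replace with (or generate) your list of ~1000 pages
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # "//body//text()" selects every text node on the page,
        # not just the contents of one particular div
        page_text = " ".join(response.xpath("//body//text()").getall()).lower()
        matched = [kw for kw in KEYWORDS if kw in page_text]
        if matched:
            yield {"url": response.url, "keywords": matched}
```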
Plus, you can use BeautifulSoup and lxml for extracting the data from the page content.
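For example, a sketch of a callback that hands the raw HTML to BeautifulSoup (with lxml as the parser) instead of using Scrapy's selectors; it assumes the beautifulsoup4 and lxml packages are installed, and the keyword check is again just an illustrative placeholder:

```python
import scrapy
from bs4 import BeautifulSoup


class SoupSpider(scrapy.Spider):
    name = "soup_spider"
    start_urls = ["https://example.com/"]  # placeholder URL

    def parse(self, response):
        # Parse the downloaded HTML with BeautifulSoup, using lxml underneath
        soup = BeautifulSoup(response.text, "lxml")
        text = soup.get_text(separator=" ").lower()
        if "relevant keyword" in text:  # replace with your own keyword logic
            yield {"url": response.url}
```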
Besides, Scrapy is based on Twisted and is completely asynchronous and fast.
There are plenty of example Scrapy spiders here on SO - just look through the questions under the scrapy tag. If you have a more specific question, just ask.
Hope that helps.