I am conducting research that relates to distributing the indexing of the internet.
While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, Common-Crawl, etc.), mine is more focused on incentivising such behavior. I am looking for a simple way to crawl real webpages without knowing anything in advance about their URLs or HTML structure.
To clarify - this is only a Proof of Concept (PoC), so I don't mind if it doesn't scale, is slow, etc. I am aiming at scraping most of the text that is presented to the user, in most cases, with or without dynamic content, and with as little "garbage" (JavaScript functions, tags, keywords, etc.) as possible. A simple partial solution that works out of the box is preferred over a perfect solution that requires a lot of expertise to deploy.
A secondary issue is storing the (url, extracted text) pairs for indexing (by a different process?), but I think I will be able to figure that out myself with some more digging.
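One simple option there would probably be Scrapy's built-in feed exports (running the spider with -o pages.jl), or a minimal item pipeline along these lines (the file name and class name are just placeholders):

    import json

    class JsonLinesStoragePipeline:
        # Minimal sketch: append each scraped item as one JSON line.
        # It would need to be enabled via ITEM_PIPELINES in settings.py.
        def open_spider(self, spider):
            self.file = open('pages.jl', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item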
Any advice on how to augment "itsy"'s parse function will be highly appreciated!
import scrapy
from scrapy_1.tutorial.items import WebsiteItem


class FirstSpider(scrapy.Spider):
    name = 'itsy'

    # allowed_domains = ['dmoz.org']
    start_urls = [
        "http://www.stackoverflow.com"
    ]

    # def parse(self, response):
    #     filename = response.url.split("/")[-2] + '.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = WebsiteItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body_text'] = sel.xpath('text()').extract()
            yield item
What you are looking for here is Scrapy's CrawlSpider.
CrawlSpider lets you define crawling rules that are followed for every page. It's smart enough to avoid crawling images, documents and other files that are not web resources, and it pretty much does the whole thing for you.
Here's a good example of how your spider might look with CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item
This spider will crawl every webpage it can find on the website and log the title, URL and whole text body.
For the text body you might want to extract it in some smarter way (to exclude javascript and other unwanted text nodes), but that's an issue of its own to discuss.
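For example, a rough sketch of one slightly smarter extraction (a drop-in replacement for the item['body'] line above); the XPath is just one possible approach, assuming you only want visible text outside <script> and <style>:

        # keep only text nodes that are not inside <script> or <style> elements
        text_nodes = response.xpath(
            '//body//text()[not(ancestor::script) and not(ancestor::style)]'
        ).extract()
        item['body'] = '\n'.join(t.strip() for t in text_nodes if t.strip())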
Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. It's all very nicely described in the Broad Crawls documentation page.
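As a rough starting point, something like this in settings.py (the exact values are just examples to tune, loosely following the Broad Crawls recommendations):

    # settings.py - example broad-crawl tweaks; the numbers are guesses to tune
    CONCURRENT_REQUESTS = 100          # crawl many pages/domains in parallel
    REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
    COOKIES_ENABLED = False            # broad crawls rarely need cookies
    RETRY_ENABLED = False              # don't spend time retrying failures
    DOWNLOAD_TIMEOUT = 15              # give up on slow pages quickly
    LOG_LEVEL = 'INFO'                 # DEBUG logging gets huge on broad crawls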