Scraping all text using Scrapy without knowing webpages' structure

I am conducting research related to distributing the indexing of the internet.

While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, Common-Crawl etc.), mine is more focused on incentivising such behavior. I am looking for a simple way to crawl real webpages without knowing anything about their URL or HTML structure and:

  1. Extract all their text (in order to index it)
  2. Collect all their URLs and add them to the list of URLs to crawl
  3. Avoid crashing and continue gracefully (even without the scraped text) when a webpage is malformed (a rough sketch of what I mean follows this list)
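To illustrate point 3, here is the rough kind of thing I have in mind (an untested sketch; the spider name, the errback, the catch-all except, and the naive following of every link are my own assumptions, not working code):

import scrapy


class ResilientSpider(scrapy.Spider):
    name = 'resilient'
    start_urls = ["http://www.stackoverflow.com"]

    def start_requests(self):
        for url in self.start_urls:
            # errback lets the crawl continue when a request fails outright
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        try:
            text = ' '.join(response.xpath('//text()').extract())
        except Exception:
            # malformed page: keep crawling, just drop the text
            text = ''
        yield {'url': response.url, 'text': text}
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        # log and move on instead of crashing the whole crawl
        self.logger.warning('Request failed: %r', failure)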

To clarify - this is only a Proof of Concept (PoC), so I don't mind if it doesn't scale, is slow, etc. I am aiming to scrape most of the text that is presented to the user, in most cases, with or without dynamic content, and with as little "garbage" as possible, such as functions, tags, keywords, etc. A simple partial solution that works out of the box is preferred over a perfect solution that requires a lot of expertise to deploy.

A secondary issue is storing the (url, extracted text) pairs for indexing (by a different process?), but I think I will be able to figure that out myself with some more digging.
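For instance, I imagine something as simple as an item pipeline that appends each (url, text) pair to a JSON-lines file (just a sketch; the filename, class name, and the assumption that items carry a url and a text field are mine):

import json


class TextStoragePipeline(object):
    """Append every scraped item (url + extracted text) to a JSON-lines file."""

    def open_spider(self, spider):
        self.file = open('scraped_text.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # works for both plain dicts and scrapy.Item instances
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

It would then be enabled in settings.py via ITEM_PIPELINES = {'myproject.pipelines.TextStoragePipeline': 300} (the module path is a placeholder).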

Any advice on how to augment "itsy"'s parse function will be highly appreciated!

import scrapy

from scrapy_1.tutorial.items import WebsiteItem


class FirstSpider(scrapy.Spider):
    name = 'itsy'

    # allowed_domains = ['dmoz.org']

    start_urls = [
        "http://www.stackoverflow.com"
    ]

    # def parse(self, response):
    #     filename = response.url.split("/")[-2] + '.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = WebsiteItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body_text'] = sel.xpath('text()').extract()
            yield item
asked Aug 25 '16 by UriCS




1 Answer

What you are looking for here is Scrapy's CrawlSpider.

CrawlSpider lets you define crawling rules that are followed for every page. It's smart enough to avoid crawling images, documents and other files that are not web resources and it pretty much does the whole thing for you.

Here's a good example of how your spider might look with CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item

This spider will crawl every webpage it can find on the website and log the title, URL and whole text body.
For the text body you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's an issue of its own to discuss. Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
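If you do want cleaner text rather than the full source, one rough option (a sketch layered on the spider above, not a complete solution) is to exclude script and style nodes in the XPath:

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        # keep only visible-ish text: drop text nodes inside <script>/<style>
        texts = response.xpath(
            '//body//text()[not(ancestor::script) and not(ancestor::style)]'
        ).extract()
        item['body'] = '\n'.join(t.strip() for t in texts if t.strip())
        return item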

There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. They are very nicely described on the Broad Crawls docs page.
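For example, a few of the tweaks from that page (the values below are only illustrative, adjust them for your hardware):

# settings.py -- typical broad-crawl adjustments (illustrative values)
CONCURRENT_REQUESTS = 100          # crawl many pages/domains in parallel
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # reduce logging overhead
COOKIES_ENABLED = False            # broad crawls rarely need cookies
RETRY_ENABLED = False              # don't retry failed pages
DOWNLOAD_TIMEOUT = 15              # give up on slow pages quickly
REDIRECT_ENABLED = False           # optionally ignore redirects
AJAXCRAWL_ENABLED = True           # handle "AJAX crawlable" pages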

answered Oct 05 '22 by Granitosaurus