
Scrapy crawling stackoverflow questions matching multiple tags

I am trying out Scrapy. I ran the example code from the overview page at http://doc.scrapy.org/en/1.0/intro/overview.html and extracted the recent questions tagged 'bigdata'. That worked well. But when I tried to extract questions tagged with both 'bigdata' and 'python', the results were wrong: questions carrying only the 'bigdata' tag showed up. In the browser, the same URL correctly returns only questions with both tags. The code is below:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/bigdata?page=1&sort=newest&pagesize=50']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

When I change start_urls to

start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python?page=1&sort=newest&pagesize=50']

the results contain questions that have only the 'bigdata' tag. How can I get only the questions that have both tags?

Edit: I think Scrapy is following links to the 'bigdata' tag page from the page I gave, because each tag shown on a question is a link to the main page for that tag. How can I change this code so that Scrapy does not follow the tag links and only visits the questions listed on that page? I tried a rule like the one below, but the results were still wrong.

rules = (Rule(LinkExtractor(restrict_css='.question-summary h3 a::attr(href)'), callback='parse_question'),)
Joswin K J asked Mar 24 '26 10:03



1 Answer

The URL you have (as well as your original CSS selectors) is correct. It can be written more simply as:

start_urls = ['https://stackoverflow.com/questions/tagged/python+bigdata']

By extension, this also works:

start_urls = ['https://stackoverflow.com/questions/tagged/bigdata%20python']
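As a rough sketch of how such URLs could be assembled for any list of tags, the helper below joins the tags with `+` and percent-encodes each one. `tagged_url` and `BASE` are names made up for this illustration (they are not part of Scrapy), and encoding a literal `+` inside a tag name as `%2B` is an assumption about what the site expects.

```python
from urllib.parse import quote

# Hypothetical constant and helper for illustration; not part of Scrapy.
BASE = "https://stackoverflow.com/questions/tagged/"

def tagged_url(tags):
    """Build a tag-listing URL, joining the tags with '+'."""
    # Percent-encode each tag on its own so that a literal '+' inside a
    # tag name is not confused with the '+' separator between tags.
    return BASE + "+".join(quote(tag, safe="") for tag in tags)

print(tagged_url(["python", "bigdata"]))
```

The result can be dropped straight into start_urls.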

The issue you are running into, however, is that Stack Overflow appears to require you to be logged in to use the multiple-tag search. To see this, log out of your Stack Overflow session and try the same URL in your browser: it redirects you to a page of results for the first of the two tags only.

TL;DR: the only way to get the multiple-tag feature appears to be logging in (enforced via session cookies).
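One way to notice this redirect from code, sketched under the assumption that the tags sit in the last path segment separated by `+` or `%20` (as in the URLs above), is to compare the tags in the requested URL with the tags in the URL you actually ended up on. `tags_in_url` and `dropped_tags` are hypothetical helper names, not Scrapy functions.

```python
import re
from urllib.parse import urlsplit, unquote

def tags_in_url(url):
    """Return the set of tags named in a /questions/tagged/... URL."""
    # The last path segment holds the tags; the query string is ignored.
    segment = urlsplit(url).path.rsplit("/", 1)[-1]
    # Assumption: tags are separated by '+' or by an encoded space.
    parts = re.split(r"\+|%20", segment)
    return {unquote(part).lower() for part in parts if part}

def dropped_tags(requested_url, final_url):
    """Return the tags that were requested but are missing after a redirect."""
    return tags_in_url(requested_url) - tags_in_url(final_url)

requested = "https://stackoverflow.com/questions/tagged/bigdata+python?page=1"
final = "https://stackoverflow.com/questions/tagged/bigdata"
print(dropped_tags(requested, final))
```

Inside a Scrapy callback, response.url is the URL after any redirects, so a non-empty result for the start URL would suggest the session is not logged in.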

Thus, when using Scrapy, the fix is to authenticate the session (log in) before doing anything else, and then parse as normal. To do this, you can use an InitSpider instead of a Spider and add the appropriate login methods. Assuming you log in with Stack Overflow directly (as opposed to through Google or the like), I was able to get it working as expected like this:

import scrapy
import getpass
from scrapy.spiders.init import InitSpider

class StackOverflowSpider(InitSpider):
    name = 'stackoverflow'
    login_page = 'https://stackoverflow.com/users/login'
    start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python']

    def parse(self, response):
        ...

    def parse_question(self, response):
        ...

    def init_request(self):
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # 'user@example.com' is a placeholder; use your own login email
        return scrapy.FormRequest.from_response(response,
                    formdata={'email': 'user@example.com',
                              'password': getpass.getpass()},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        # response.body is bytes, so compare against a bytes literal
        if b"/users/logout" in response.body:
            self.log("Successfully logged in")
            return self.initialized()
        else:
            self.log("Failed login")
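As an extra safety net, independent of the login, you could also check each scraped question's tags before yielding it. This is only a sketch: `REQUIRED_TAGS` and `has_all_tags` are hypothetical names, and the tag list is assumed to be what the `.post-tag::text` selector in parse_question returns.

```python
# Hypothetical constant for illustration: the tags every question must carry.
REQUIRED_TAGS = {"bigdata", "python"}

def has_all_tags(item_tags, required=REQUIRED_TAGS):
    """Return True if every required tag appears in item_tags."""
    # Normalize whitespace and case before comparing.
    normalized = {tag.strip().lower() for tag in item_tags}
    return set(required) <= normalized

print(has_all_tags(["bigdata", "python", "hadoop"]))
print(has_all_tags(["bigdata"]))
```

In parse_question, you would build the item dict first and yield it only when has_all_tags(item['tags']) is true.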
lemonhead answered Mar 25 '26 23:03



