 

Scrapy: HTTP status code is not handled or not allowed?

I want to get the product title, link, and price from the category https://tiki.vn/dien-thoai-may-tinh-bang/c1789

But it fails with "HTTP status code is not handled or not allowed":

(error log screenshot)

My file: spiders/tiki.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from stackdata.items import StackdataItem


class StackdataSpider(CrawlSpider):
    name = "tiki"
    allowed_domains = ["tiki.vn"]
    start_urls = [
        "https://tiki.vn/dien-thoai-may-tinh-bang/c1789",
    ]

    rules = (
        Rule(LinkExtractor(allow=r"\?page=2"),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        questions = response.xpath('//div[@class="product-item"]')

        for question in questions:
            # Use a relative XPath (leading dot) so each iteration
            # only selects links inside the current product node
            question_location = question.xpath(
                './/a/@href').extract()[0]
            full_url = response.urljoin(question_location)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        item = StackdataItem()
        item["title"] = response.css(
            ".item-box h1::text").extract()[0]
        item["url"] = response.url
        # items.py defines a "price" field, not "content"
        item["price"] = response.css(
            ".price span::text").extract()[0]
        yield item

File: items.py

import scrapy


class StackdataItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()

Please help me, thanks!

asked Oct 14 '17 by gait



1 Answer

tl;dr

You are being blocked based on scrapy's default user-agent.

You have two options:

  1. Respect the website's wish and do not scrape it, or
  2. Change your user-agent

I assume you want to take option 2.

Go to your settings.py in your scrapy project and set your user-agent to a non-default value. Either your own project name (it probably should not contain the word scrapy) or a standard browser's user-agent.

USER_AGENT='my-cool-project (http://example.com)'

Detailed error analysis

We all want to learn, so here is an explanation of how I got to this result and what you can do if you see such behavior again.

The website tiki.vn seems to return HTTP status 404 for all requests of your spider. You can see in your screenshot that you get a 404 for both your requests to /robots.txt and /dien-thoai-may-tinh-bang/c1789.

404 means "not found", and web servers use it to signal that a URL does not exist. However, if we check the same URLs manually, we can see that both pages contain valid content. It is technically possible for a site to return content together with a 404 status code, but we can check this with the developer console of a browser (e.g. Chrome or Firefox).

(screenshot of the browser developer console)

Here we can see that robots.txt returns a valid 200 status code.

Further investigations to be done

Many websites try to restrict scraping, so they look for indicators of scraping behavior and then decide whether to serve content or block the request. I assume that is exactly what is happening to you.

I once wanted to crawl a website that worked totally fine from my home PC, but did not respond at all (not even with a 404) to any request from my server (scrapy, wget, curl, ...).

Next steps to analyze the reason for this issue:

  • Can you reach the website from your home PC (and do you get Status code 200)?
  • What happens if you run scrapy from your home PC? Still 404?
  • Try to load the website from the server on which you run scrapy (e.g. with wget or curl)

You can fetch it with wget like this:

wget https://tiki.vn/dien-thoai-may-tinh-bang/c1789

wget sends its own user-agent (Wget/&lt;version&gt;), so you might want to set it to a web browser's user-agent if this command does not work (it does from my PC).

wget -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' https://tiki.vn/dien-thoai-may-tinh-bang/c1789

This will help you to find out if the problem is with the server (e.g. they blocked the IP or a whole IP range) or if you need to make some modifications to your spider.
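If you prefer to stay in Python for this check, the same comparison can be done with the standard library's urllib (a quick sketch; the timeout value and user-agent strings are just examples):

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent):
    """Return the HTTP status code the server sends for a given user-agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful
        return err.code


# Compare how the server reacts to different user-agents, e.g.:
# fetch_status("https://tiki.vn/dien-thoai-may-tinh-bang/c1789",
#              "Scrapy/1.3.0 (+http://scrapy.org)")
# fetch_status("https://tiki.vn/dien-thoai-may-tinh-bang/c1789",
#              "Mozilla/5.0 ...")
```

If the two calls return different status codes, you know the server is discriminating on the user-agent alone.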

Checking the user-agent

If it works with wget from your server, I would suspect scrapy's user-agent to be the problem. According to the documentation, scrapy uses Scrapy/VERSION (+http://scrapy.org) as the user-agent unless you set it yourself. It's quite possible that they block your spider based on the user-agent.

So, you have to go to settings.py in your scrapy project and look for the setting USER_AGENT there. Now, set this to anything which does not contain the keyword scrapy. If you want to be nice, use your project name + domain, otherwise use a standard browser user-agent.

Nice variant:

USER_AGENT='my-cool-project (http://example.com)'

Not so nice (but common in scraping) variant:

USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

In fact, I was able to verify that they block on the user-agent with this wget command from my local PC:

wget -U 'Scrapy/1.3.0 (+http://scrapy.org)' https://tiki.vn/dien-thoai-may-tinh-bang/c1789

which results in

--2017-10-14 18:54:04--  https://tiki.vn/dien-thoai-may-tinh-bang/c1789
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving tiki.vn... 203.162.81.188
Connecting to tiki.vn|203.162.81.188|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-10-14 18:54:06 ERROR 404: Not Found.
answered Oct 08 '22 by aufziehvogel