I want to get the product title, link and price for the category https://tiki.vn/dien-thoai-may-tinh-bang/c1789,
but my spider fails with "HTTP status code is not handled or not allowed":
My file: spiders/tiki.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from stackdata.items import StackdataItem


class StackdataSpider(CrawlSpider):
    name = "tiki"
    allowed_domains = ["tiki.vn"]
    start_urls = [
        "https://tiki.vn/dien-thoai-may-tinh-bang/c1789",
    ]

    rules = (
        Rule(LinkExtractor(allow=r"\?page=2"),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        products = response.xpath('//div[@class="product-item"]')
        for product in products:
            # './/a/@href' is relative to the current product node;
            # a bare '//a/@href' would search the whole page every iteration
            product_location = product.xpath('.//a/@href').extract_first()
            full_url = response.urljoin(product_location)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        item = StackdataItem()
        item["title"] = response.css(".item-box h1::text").extract_first()
        item["url"] = response.url
        # the item class defines "price", not "content"
        item["price"] = response.css(".price span::text").extract_first()
        yield item
File: items.py
import scrapy


class StackdataItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()
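As a side note on the spider above: the `response.urljoin(...)` call in `parse_item` resolves the extracted relative hrefs against the page URL. The stdlib `urllib.parse.urljoin` behaves the same way (the product path below is a made-up example, not a real tiki.vn URL):

```python
# How relative links get resolved against the category URL;
# the product path is a hypothetical example.
from urllib.parse import urljoin

base = "https://tiki.vn/dien-thoai-may-tinh-bang/c1789"

print(urljoin(base, "/some-product-p123.html"))
# -> https://tiki.vn/some-product-p123.html

print(urljoin(base, "?page=2"))
# -> https://tiki.vn/dien-thoai-may-tinh-bang/c1789?page=2
```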
Please help me, thanks!
You are being blocked based on scrapy's default user-agent.
You have two options: stop scraping the site (and respect that it does not want to be scraped), or change your user-agent. I assume you want to take option 2.
Go to settings.py in your scrapy project and set the user-agent to a non-default value: either your own project name (it probably should not contain the word scrapy) or a standard browser's user-agent.
USER_AGENT='my-cool-project (http://example.com)'
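For context, here is how that line can sit in settings.py. ROBOTSTXT_OBEY and DOWNLOAD_DELAY are standard Scrapy settings I'm adding as a polite-crawling sketch, not something from the question; the delay value is an arbitrary example:

```python
# settings.py (excerpt) -- USER_AGENT as suggested above; the other two
# settings are optional polite-crawling additions, not from the question.
USER_AGENT = 'my-cool-project (http://example.com)'
ROBOTSTXT_OBEY = True      # respect robots.txt
DOWNLOAD_DELAY = 1.0       # seconds between requests (example value)
```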
We all want to learn, so here is an explanation of how I got to this result and what you can do if you see such behavior again.
The website tiki.vn seems to return HTTP status 404 for all requests from your spider. You can see in your screenshot that you get a 404 for both of your requests, to /robots.txt and to /dien-thoai-may-tinh-bang/c1789.
404 means "not found", and web servers use it to signal that a URL does not exist. However, if we check the same pages manually, we can see that both contain valid content. Now, it could technically be possible that a site returns both content and a 404 status code at the same time, but we can check this with the developer console of a browser (e.g. Chrome or Firefox).
Here we can see that robots.txt returns a valid 200 status code.
Many websites try to restrict scraping, so they try to detect scraping behavior: they look at certain indicators and decide whether to serve content to you or to block your request. I assume that exactly this is what's happening to you.
I once wanted to crawl a website which worked totally fine from my home PC, but did not respond at all (not even with a 404) to any request from my server (scrapy, wget, curl, ...).
The next step in analyzing the reason for this issue is to check whether the server answers a plain request from the machine your spider runs on. You can fetch the page with wget like this:
wget https://tiki.vn/dien-thoai-may-tinh-bang/c1789
wget sends its own default user-agent (Wget/VERSION), so you might want to set it to a web browser's user-agent if this command does not work (it does from my PC):
wget -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' https://tiki.vn/dien-thoai-may-tinh-bang/c1789
This will help you to find out if the problem is with the server (e.g. they blocked the IP or a whole IP range) or if you need to make some modifications to your spider.
If it works with wget from your server, I would suspect the user-agent of scrapy to be the problem. According to the documentation, scrapy uses Scrapy/VERSION (+http://scrapy.org) as the user-agent unless you set one yourself. It's quite possible that they block your spider based on this user-agent.
So, go to settings.py in your scrapy project and look for the USER_AGENT setting. Set it to anything that does not contain the keyword scrapy. If you want to be nice, use your project name plus domain; otherwise use a standard browser user-agent.
Nice variant:
USER_AGENT='my-cool-project (http://example.com)'
Not so nice (but common in scraping) variant:
USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
In fact, I was able to verify that they block on the user-agent with this wget command from my local PC:
wget -U 'Scrapy/1.3.0 (+http://scrapy.org)' https://tiki.vn/dien-thoai-may-tinh-bang/c1789
which results in
--2017-10-14 18:54:04-- https://tiki.vn/dien-thoai-may-tinh-bang/c1789
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving tiki.vn... 203.162.81.188
Connecting to tiki.vn|203.162.81.188|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-10-14 18:54:06 ERROR 404: Not Found.
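If wget is not available, the same user-agent experiment can be reproduced with Python's standard library. `status_for` is a helper name I made up, and actually running it against tiki.vn requires network access:

```python
# Fetch a URL with a chosen User-Agent and return the HTTP status code,
# mirroring the wget -U experiment above.
import urllib.request
from urllib.error import HTTPError

def status_for(url: str, user_agent: str) -> int:
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still available
        return err.code

# Usage (requires network; per the answer, the Scrapy UA should yield 404):
# status_for("https://tiki.vn/dien-thoai-may-tinh-bang/c1789",
#            "Scrapy/1.3.0 (+http://scrapy.org)")
```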