I tried to crawl a local HTML file stored on my desktop with the code below, but I run into the following error before the crawl even starts: "No such file or directory: '/robots.txt'".
[Scrapy command]
$ scrapy crawl test -o test01.csv
[Scrapy spider]
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']
[Errors]
2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
When working with Scrapy locally, I never specify allowed_domains. Try taking that line out and see if it works. In your error, it's testing the 'empty' domain that you have given it.
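Roughly like this, for example (the parse callback here is only a placeholder to show where your extraction logic would go; adjust it to whatever you actually need):

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains removed, so the offsite check against an empty domain no longer applies
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

    def parse(self, response):
        # placeholder extraction; replace with your own selectors
        yield {'title': response.css('title::text').extract_first()}

Then re-run the same command as before, e.g. scrapy crawl test -o test01.csv.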