
How to crawl local HTML file with Scrapy

Tags:

python

scrapy

I tried to crawl a local HTML file stored on my desktop with the code below, but I get errors before the crawl starts, such as "No such file or directory: '/robots.txt'".

  • Is it possible to crawl local HTML files on a local computer (Mac)?
  • If so, how should I set parameters such as "allowed_domains" and "start_urls"?

[Scrapy command]

$ scrapy crawl test -o test01.csv

[Scrapy spider]

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

[Errors]

2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
Baka asked Oct 22 '25


1 Answer

When working locally, I never specify allowed_domains. Try taking that line out and see if it works.

In your error, Scrapy is checking the 'empty' domain you've given it.
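
For reference, a minimal sketch of the spider with allowed_domains removed. The parse callback and the ROBOTSTXT_OBEY setting are illustrative additions, not part of the original answer; ROBOTSTXT_OBEY = False tells Scrapy's RobotsTxtMiddleware not to request file:///robots.txt, which is what the retries in the log above are doing.

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # No allowed_domains: a file:// URL has no host to match against.
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

    # Illustrative: skip robots.txt lookups for local files. This is a
    # standard Scrapy setting, applied per-spider via custom_settings.
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def parse(self, response):
        # Illustrative extraction; adjust the selector to the actual HTML.
        for title in response.css('title::text').getall():
            yield {'title': title}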

Japes answered Oct 24 '25