I have the following directory structure:
my_project/
    __init__.py
    spiders/
        __init__.py
        my_spider.py
        other_spider.py
    pipelines.py
    # other files
Right now I can be in the my_project directory and start my crawl using scrapy crawl my_spider.
What I'd like to achieve is to be able to run scrapy crawl my_spider with this updated structure:
my_project/
    __init__.py
    spiders/
        __init__.py
        subtopic1/
            __init__.py    # <-- I get the same error whether this is present or not
            my_spider.py
        subtopicx/
            other_spider.py
    pipelines.py
    # other files
But right now I get this error:
KeyError: 'Spider not found: my_spider'
What is the appropriate way to organize Scrapy spiders into directories?
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
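For illustration, here is a minimal CrawlSpider sketch; the domain, start URL, link pattern and extracted fields are assumptions for the example, not taken from the project above:

# Minimal CrawlSpider sketch (illustrative names and URLs)
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = "example_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]  # links the spider starts crawling from

    # Rules tell a CrawlSpider which links to follow and which callback handles them.
    rules = (
        Rule(LinkExtractor(allow=r"/category/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Extract structured data (items) from each followed page.
        yield {"url": response.url, "title": response.css("title::text").get()}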
Scrapy is a web scraping framework used to scrape, parse and collect web data. Scraped data is handled in the pipelines.py file, which passes each item through a series of components (pipeline classes) that are executed sequentially.
Run the crawl with scrapy crawl my_spider -o next_page.json and check the result.
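As a sketch of such a pipeline component (the class name, the checked field and the priority value are assumptions, not from the project above):

# my_project/pipelines.py (illustrative)
from scrapy.exceptions import DropItem

class ValidateItemPipeline:
    def process_item(self, item, spider):
        # Every enabled pipeline component receives each scraped item in turn.
        if not item.get("title"):
            raise DropItem("missing title")
        return item

It would be enabled in settings.py with ITEM_PIPELINES = {"my_project.pipelines.ValidateItemPipeline": 300}.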
I know this is long overdue, but this is the right way to organize your spiders in nested directories: list every package that contains spiders in the SPIDER_MODULES setting of your project settings.
Example:
SPIDER_MODULES = ['my_project.spiders', 'my_project.spiders.subtopic1', 'my_project.spiders.subtopicx']
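To show how the pieces fit together, here is a minimal sketch; the spider body and start URL are assumptions, only the SPIDER_MODULES entries and the name attribute come from the question. scrapy crawl my_spider resolves the spider by its name attribute, not by the file path, so the file can live in any package listed in SPIDER_MODULES:

# my_project/settings.py
SPIDER_MODULES = [
    "my_project.spiders",
    "my_project.spiders.subtopic1",
    "my_project.spiders.subtopicx",
]

# my_project/spiders/subtopic1/my_spider.py (sketch)
import scrapy

class MySpider(scrapy.Spider):
    # scrapy crawl looks spiders up by this name, not by the file path
    name = "my_spider"
    start_urls = ["https://example.com/"]  # hypothetical start URL

    def parse(self, response):
        yield {"url": response.url}

Each subdirectory listed in SPIDER_MODULES should also contain an __init__.py so it can be imported as a regular Python package.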