How can I organize my spiders into nested directories in Scrapy?

I have the following directory structure:

my_project/
  __init__.py
  spiders/
    __init__.py
    my_spider.py
    other_spider.py
  pipelines.py
  # other files

Right now I can start my crawl from the my_project directory with scrapy crawl my_spider.

I'd like to be able to run scrapy crawl my_spider with this updated structure:

my_project/
  __init__.py
  spiders/
    __init__.py
    subtopic1/
      __init__.py # <-- I get the same error whether this is present or not
      my_spider.py
    subtopicx/
      other_spider.py
  pipelines.py
  # other files

But right now I get this error:

KeyError: 'Spider not found: my_spider'

What is the appropriate way to organize Scrapy spiders into directories?

YPCrumble asked Jun 20 '16 06:06


1 Answer

I know this is long overdue, but the way to organize your spiders in nested directories is to list each spider package in the SPIDER_MODULES setting in your project's settings.py.

Example:

SPIDER_MODULES = ['my_project.spiders', 'my_project.spiders.subtopic1', 'my_project.spiders.subtopicx']
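
If the number of subdirectories grows, one option is to build that list in settings.py instead of typing each entry by hand. The following is only a minimal sketch, assuming the layout from the question; spider_modules is a hypothetical helper written for illustration, not something Scrapy provides.

```python
from pathlib import Path


def spider_modules(spiders_dir, base_package):
    """Return dotted paths for spiders_dir and every directory nested under it.

    Hypothetical helper (not part of Scrapy). Directories whose names are not
    valid Python identifiers, and __pycache__ folders, are skipped.
    """
    root = Path(spiders_dir)
    modules = [base_package]
    for path in sorted(p for p in root.rglob("*") if p.is_dir()):
        parts = path.relative_to(root).parts
        if "__pycache__" in parts:
            continue
        if not all(part.isidentifier() for part in parts):
            continue
        modules.append(".".join((base_package, *parts)))
    return modules


# Possible use inside my_project/settings.py (paths assumed from the question):
# SPIDER_MODULES = spider_modules(Path(__file__).parent / "spiders",
#                                 "my_project.spiders")
```

For the structure in the question this would yield the same three entries as the hand-written list above. Each subdirectory should still contain an __init__.py so that it can be imported as a package.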
pariola answered Jan 02 '23 19:01