
Scrapy Pipelines to Seperate Folder/Files - Abstraction

I am currently finalising a Scrapy project; however, I have quite a lengthy pipelines.py file.

I noticed that in my settings.py the pipelines are shown as follows (trimmed down):

ITEM_PIPELINES = {
    'proj.pipelines.MutatorPipeline': 200,
    'proj.pipelines.CalculatorPipeline': 300,
    'proj.pipelines.SaveToFilePipeline': 500,
}

I have tried the following ways to rectify this.

1.) I created a new folder and file and tried referencing them from the pipeline settings in the same manner.

The folder/file was myPipelines/Test.py with a class named TestPipeline, referenced in the pipeline settings as 'proj.myPipelines.Test.TestPipeline': 100,.

This threw errors.

I then thought I could export the module and import it into my current pipelines.py, so the reference would be picked up from there. I added an empty __init__.py to my myPipelines directory and then added from myPipelines.Test import TestPipeline, but Scrapy still throws an error of...

raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
exceptions.NameError: Module 'proj.pipelines' doesn't define any object named 'TestPipeline'.

Many thanks in advance!

asked Jan 04 '23 by Matt The Ninja

1 Answer

When you start a scrapy project, you get a directory tree like this:

$ scrapy startproject multipipeline
$ tree
.
├── multipipeline
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       └── __init__.py
└── scrapy.cfg

And the generated pipelines.py looks like this:

$ cat multipipeline/pipelines.py 
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MultipipelinePipeline(object):
    def process_item(self, item, spider):
        return item

But your Scrapy project can reference any Python class as an item pipeline. One option is to convert the generated single-file pipelines module into a package in its own directory, with submodules. Notice the __init__.py file inside the pipelines/ dir:

$ tree
.
├── multipipeline
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines
│   │   ├── __init__.py
│   │   ├── one.py
│   │   ├── three.py
│   │   └── two.py
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       └── __init__.py
└── scrapy.cfg

The individual modules inside the pipelines/ dir could look like this:

$ cat multipipeline/pipelines/two.py 
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging


logger = logging.getLogger(__name__)


class MyPipelineTwo(object):
    def process_item(self, item, spider):
        logger.debug(self.__class__.__name__)
        return item

You can read more about packages here.

The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path. In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable, described later.
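If you would rather keep the shorter paths like 'multipipeline.pipelines.MyPipelineTwo' working, the package's __init__.py can re-export the classes from its submodules. A sketch, using the module and class names from the tree above:

```python
# multipipeline/pipelines/__init__.py
from multipipeline.pipelines.one import MyPipelineOne
from multipipeline.pipelines.two import MyPipelineTwo
from multipipeline.pipelines.three import MyPipelineThree

__all__ = ['MyPipelineOne', 'MyPipelineTwo', 'MyPipelineThree']
```

With that in place, both the dotted submodule paths and the package-level paths resolve to the same classes.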

And your settings.py would contain something like this:

ITEM_PIPELINES = {
    'multipipeline.pipelines.one.MyPipelineOne': 100,
    'multipipeline.pipelines.two.MyPipelineTwo': 200,
    'multipipeline.pipelines.three.MyPipelineThree': 300,
}
answered Jan 16 '23 by paul trmbrth