I am currently finalising a Scrapy project, but I have quite a lengthy pipelines.py file.
I noticed that in my settings.py the pipelines are shown as follows (trimmed down):
ITEM_PIPELINES = {
    'proj.pipelines.MutatorPipeline': 200,
    'proj.pipelines.CalculatorPipeline': 300,
    'proj.pipelines.SaveToFilePipeline': 500,
}
I have tried the following ways to rectify this.
1.) I created a new folder/file and tried referencing it from the pipeline settings in the same manner. The folder was myPipelines/Test.py with a class named TestPipeline, referenced in the pipeline settings as 'proj.myPipelines.Test.TestPipeline': 100,. This threw errors.
I then thought I could export the module and import it into my current pipelines.py, and it would take the reference from that. I added an empty __init__.py in my myPipelines directory and then added from myPipelines.Test import TestPipeline, but scrapy still throws an error of...
raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
exceptions.NameError: Module 'proj.pipelines' doesn't define any object named 'TestPipeline'.
Many thanks in advance!
When you start a scrapy project, you get a directory tree like this:
$ scrapy startproject multipipeline
$ tree
.
├── multipipeline
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       └── __init__.py
└── scrapy.cfg
And the generated pipelines.py
looks like this:
$ cat multipipeline/pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class MultipipelinePipeline(object):
    def process_item(self, item, spider):
        return item
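This generated class would be enabled by its full dotted path in ITEM_PIPELINES, e.g. (the order value 300 here is arbitrary, chosen just for illustration):
ITEM_PIPELINES = {
    'multipipeline.pipelines.MultipipelinePipeline': 300,
}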
But your scrapy project can reference any Python class as an item pipeline. One option is to convert the generated one-file pipelines module into a package in its own directory, with submodules.
Notice the __init__.py file inside the pipelines/ dir:
$ tree
.
├── multipipeline
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines
│   │   ├── __init__.py
│   │   ├── one.py
│   │   ├── three.py
│   │   └── two.py
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       └── __init__.py
└── scrapy.cfg
└── scrapy.cfg
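The contents of pipelines/__init__.py aren't shown above, so as a sketch: the file can simply be left empty, or (optionally, as an assumption not spelled out in the answer) it can re-export the pipeline classes so that the shorter dotted path 'multipipeline.pipelines.MyPipelineTwo' also resolves in ITEM_PIPELINES:
# multipipeline/pipelines/__init__.py
# Can be left empty. Optionally re-export the pipeline classes so they
# are also reachable directly on the package -- a convenience, not
# required for the settings shown further below.
from .one import MyPipelineOne
from .two import MyPipelineTwo
from .three import MyPipelineThree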
The individual modules inside the pipelines/
dir could look like this:
$ cat multipipeline/pipelines/two.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging
logger = logging.getLogger(__name__)
class MyPipelineTwo(object):
    def process_item(self, item, spider):
        logger.debug(self.__class__.__name__)
        return item
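A debug line like this is handy: when the spider runs, each pipeline logs its own class name as the item passes through, so you can confirm the pipelines fire in ascending order of their ITEM_PIPELINES values.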
You can read more about packages in the Python tutorial (https://docs.python.org/3/tutorial/modules.html#packages):
The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path. In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable, described later.
And your settings.py
would contain something like this:
ITEM_PIPELINES = {
    'multipipeline.pipelines.one.MyPipelineOne': 100,
    'multipipeline.pipelines.two.MyPipelineTwo': 200,
    'multipipeline.pipelines.three.MyPipelineThree': 300,
}
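Incidentally, this also explains the NameError in the question: Scrapy resolves each key in ITEM_PIPELINES with its load_object helper, which imports the module part of the dotted path and then looks the class name up on that module. You can sanity-check a path from a Python shell (assuming the project is on sys.path):
# Sanity-check a pipeline path the same way Scrapy resolves it:
# load_object() imports the module part of the dotted path, then looks
# up the class name on it. A wrong path raises the same
# "Module '...' doesn't define any object named '...'" NameError
# quoted in the question.
from scrapy.utils.misc import load_object

cls = load_object('multipipeline.pipelines.two.MyPipelineTwo')
print(cls)  # <class 'multipipeline.pipelines.two.MyPipelineTwo'>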