
How to build a web crawler based on Scrapy to run forever?

I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:

  1. Run forever

    Meaning it will periodically re-visit some portal pages to get updates.

  2. Schedule priorities.

    Give different priorities to different types of URLs.

  3. Multi-threaded fetching

I've read the Scrapy documentation but haven't found anything related to what I listed (maybe I wasn't careful enough). Does anyone here know how to do this, or can you give some ideas/examples? Thanks!

asked Feb 28 '10 by superb

People also ask

Which command is used to crawl data from a website using the Scrapy library?

You run a crawler on a web page using the fetch command in the Scrapy shell. A crawler or spider goes through a webpage, downloading its text and metadata.
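
For example, a quick session might look like this (the URL is illustrative; `fetch` and `response` are provided by the Scrapy shell):

```
$ scrapy shell
>>> fetch("https://example.com/news")
>>> response.css("img::attr(src)").extract()
```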

How do I make a web crawler?

Here are the basic steps to build a crawler: Step 1: Add one or several URLs to be visited. Step 2: Pop a link from the URLs to be visited and add it to the visited URLs list. Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
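
The three steps translate almost directly into code. Below is a minimal, illustrative sketch that uses plain `requests` and BeautifulSoup in place of the ScrapingBot API; the seed URL and the "scrape images" logic are placeholders for whatever you actually want to extract:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

to_visit = deque(["https://example.com/news"])   # Step 1: seed URL(s) to be visited
visited = set()

while to_visit:
    url = to_visit.popleft()                     # Step 2: pop a link from the queue...
    if url in visited:
        continue
    visited.add(url)                             # ...and add it to the visited URLs list

    resp = requests.get(url, timeout=10)         # Step 3: fetch the page's content
    soup = BeautifulSoup(resp.text, "html.parser")

    # Scrape whatever you're interested in (here: image URLs)
    images = [urljoin(url, img["src"]) for img in soup.find_all("img", src=True)]
    print(url, "->", len(images), "images")

    # Queue newly discovered links for later visits
    for a in soup.find_all("a", href=True):
        to_visit.append(urljoin(url, a["href"]))
```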


1 Answer

Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box: you will probably have to get relatively familiar with the module for some tasks.

  1. Running forever is up to your application that calls Scrapy. You tell the spiders where to go and when to go there (a minimal driver sketch follows this list).
  2. Giving priorities is the job of scheduler middleware, which you'd have to create and plug into Scrapy. The documentation on this appears spotty and I've not looked at the code; in principle the facility is there (the spider sketch below shows a shortcut that current versions offer).
  3. Scrapy is inherently, fundamentally asynchronous, which may well be what you are after: request B can be satisfied while request A is still outstanding. The underlying connection engine does not prevent you from bona fide multi-threading, but Scrapy doesn't provide threading services.
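
For point 1, the "application that calls Scrapy" can be a small driver script. The sketch below is one possible shape for it, assuming a reasonably recent Scrapy version (CrawlerRunner postdates this answer); `NewsImageSpider`, its import path, and the one-hour interval are placeholders:

```python
from twisted.internet import reactor, defer, task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders.news import NewsImageSpider  # hypothetical spider module

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_forever():
    while True:
        yield runner.crawl(NewsImageSpider)                    # run one full crawl
        yield task.deferLater(reactor, 60 * 60, lambda: None)  # then wait an hour

crawl_forever()
reactor.run()  # blocks; the loop above keeps re-crawling until the process is stopped
```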

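For points 2 and 3, recent Scrapy versions expose both directly, so you may not need to write scheduler middleware yourself: `Request` accepts a `priority` argument (higher values are scheduled earlier) and concurrency is controlled by settings. A sketch, with made-up spider name, URLs and numbers:

```python
import scrapy

class NewsImageSpider(scrapy.Spider):
    name = "news_images"

    # Point 3: Scrapy fetches asynchronously; these settings bound how many
    # requests are in flight at once (concurrent I/O, not OS threads).
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    }

    def start_requests(self):
        # Point 2: front pages are scheduled ahead of archive pages.
        yield scrapy.Request("https://portal.example/front-page",
                             priority=10, callback=self.parse)
        yield scrapy.Request("https://portal.example/archive",
                             priority=0, callback=self.parse)

    def parse(self, response):
        for src in response.css("img::attr(src)").extract():
            yield {"image_url": response.urljoin(src)}
```
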
Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to write.

answered Oct 14 '22 by msw