I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples for SgmlLinkExtractor?

Scrapy - how to identify already scraped urls

1 Answer

You can do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/

To use it, copy the code from the link into a file in your Scrapy project. Then enable it by adding the following line to your settings.py (the dotted path assumes the file is saved as project/middlewares/ignore.py, so adjust it to match where you put the code):

SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }

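The number sets where the middleware sits in the processing chain. The specifics on why you pick the number that you do are explained here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html (that page describes downloader middlewares; spider middlewares are ordered the same way, relative to the entries in SPIDER_MIDDLEWARES_BASE).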

Finally, you'll need to modify your items.py so that each item class has the following fields:

visit_id = Field()
visit_status = Field()

And I think that's it. The next time you run your spider, it should automatically skip the pages it has already visited.
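If it helps to see the general idea behind skipping visited pages between daily runs, here is a minimal sketch using only the standard library. It is not the code from the linked snippet and it does not use any Scrapy API; the names SeenUrlStore, filter_new, and the JSON file path are hypothetical, chosen only for illustration. It assumes that hashing the raw URL is a good enough identity for a page.

```python
import hashlib
import json
import os


class SeenUrlStore:
    """Hypothetical helper: remembers URL fingerprints across runs in a JSON file."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        # Load fingerprints saved by a previous run, if any.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.seen = set(json.load(fh))

    @staticmethod
    def fingerprint(url):
        # Assumption: the raw URL string identifies the page.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def is_new(self, url):
        """Return True and record the URL if it has not been seen before."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        return True

    def save(self):
        # Persist the fingerprints so the next run can skip these URLs.
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.seen), fh)


def filter_new(urls, store):
    """Keep only the URLs the store has not seen, preserving order."""
    return [u for u in urls if store.is_new(u)]
```

In a spider, one could call is_new() on each extracted link before requesting it and call save() when the crawl finishes, so that the next day's run starts with the previous day's fingerprints.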

Good luck!

132

answered Oct 22 '22 20:10

Jama22


Tags:

python

scrapy

web-crawler

Avinash
