I am trying to crawl a website that is sophisticated enough to stop bots: it permits only a few requests, and after that Scrapy hangs.
Question 1: if Scrapy hangs, is there a way to restart the crawling process from the same point? To work around the blocking, I wrote my settings file like this:
BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'
SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'
This is my program:
class ypSpider(CrawlSpider):
    name = "yp"
    start_urls = [
        SOME URL
    ]
    rules = (
        # These are some rules
    )

    def parse_item(self, response):
        # Clean the HTML page by removing scripts and HTML tags
        hxs = HtmlXPathSelector(response)
The question is where I should configure the HTTP proxies, and whether I have to import any Tor-related classes. I am new to Scrapy, and I have learned a lot from this group. Now I am trying to learn how to use IP rotation or Tor.
As one of our members suggested, I started Tor and set http_proxy with
set http_proxy=http://localhost:8118
but it throws this error:
failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError' Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.
So I changed http_proxy to
set http_proxy=http://localhost:9051
Now the error is
failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.
I checked the Firefox network settings. There are no HTTP proxies configured there; instead it uses SOCKS v5 at 127.0.0.1:9051 (before Tor, it worked with no proxies). Please help: I still don't understand how to use Tor through Scrapy. Which Tor bundle am I supposed to use, and how? I hope both of my questions can be resolved.
Crawling using Scrapy with Tor: create ProxyMiddleware.py inside the middlewares folder and place the proxy middleware code in it. The function new_tor_identity sends a signal to the Tor controller to issue a new identity. Make sure to change the password PASSWORDHERE to the one you used earlier when configuring Tor.
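The code that snippet refers to is not reproduced here, so the following is only a minimal sketch of what such a ProxyMiddleware.py might look like, not the original author's code. It assumes Tor's control port is on 127.0.0.1:9051 with password authentication, and that an HTTP proxy such as Privoxy is listening on port 8118. Instead of a Tor client library, it speaks the plain-text control protocol over a socket. The constant names are my own, chosen for illustration.

```python
import socket

# Assumptions (not from the original post): Tor's control port listens on
# 127.0.0.1:9051 with password authentication, and an HTTP proxy (for
# example Privoxy) listens on port 8118.
TOR_CONTROL_ADDRESS = ("127.0.0.1", 9051)
TOR_PASSWORD = "PASSWORDHERE"
HTTP_PROXY = "http://127.0.0.1:8118"


def new_tor_identity(password=TOR_PASSWORD, address=TOR_CONTROL_ADDRESS,
                     timeout=10.0):
    """Ask the Tor controller for a new identity; return True on success."""
    # Escape backslashes and double quotes so the password stays a valid
    # quoted string in the control protocol.
    quoted = password.replace("\\", "\\\\").replace('"', '\\"')
    commands = 'AUTHENTICATE "%s"\r\nSIGNAL NEWNYM\r\nQUIT\r\n' % quoted
    with socket.create_connection(address, timeout=timeout) as conn:
        conn.sendall(commands.encode("utf-8"))
        reply = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the controller closes the connection after QUIT
                break
            reply += chunk
    lines = reply.decode("ascii", "replace").splitlines()
    # Each accepted command is answered with a line starting with "250".
    return len(lines) >= 2 and all(line.startswith("250") for line in lines[:2])


class ProxyMiddleware:
    """Downloader middleware that sends every request through HTTP_PROXY."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"].
        request.meta["proxy"] = HTTP_PROXY
        return None  # let Scrapy continue processing the request
```

To take effect, the class would need to be listed in the DOWNLOADER_MIDDLEWARES setting under whatever module path the file ends up at, with an order below the built-in HttpProxyMiddleware (750 in Scrapy's defaults) so the proxy is set before that middleware runs. When to call new_tor_identity (for example every N requests) is left open here.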
Tor by itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that Privoxy [1] is not running properly. Set up Privoxy correctly, then try again with the environment variable http_proxy=http://localhost:8118.
I have crawled through Tor using Privoxy with Scrapy successfully.
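To check the connection-refused diagnosis before running the spider again, a small sketch like this can test whether anything is accepting connections on the proxy port. The function name proxy_is_listening is my own, and localhost:8118 is simply the address from the question.

```python
import socket


def proxy_is_listening(host="localhost", port=8118, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    if proxy_is_listening():
        print("Something is listening on localhost:8118")
    else:
        print("Nothing is listening on localhost:8118; check that Privoxy is running")
```

A successful connection only shows that some process is listening there; it does not prove that Privoxy is forwarding to Tor. Privoxy is usually pointed at Tor's SOCKS listener with a forward-socks5 line in its config file, and the exact address depends on which Tor bundle you run.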
[1] http://www.privoxy.org/