Using Tor with the Scrapy framework

Tags: python, tor, scrapy

I am trying to crawl a website that is sophisticated enough to stop bots: it permits only a few requests, and after that Scrapy hangs.

Question 1: If Scrapy hangs, is there a way to restart the crawl from the same point? To try to get around this problem, I wrote my settings file like this:

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'

This is my program:

class ypSpider(CrawlSpider):

    name = "yp"

    start_urls = [
        SOME URL
    ]
    rules = (
        # These are some rules
    )

    def parse_item(self, response):
        # Clean the HTML page by removing scripts and HTML tags
        hxs = HtmlXPathSelector(response)

Question 2: Where should I configure the HTTP proxies, and do I have to import any Tor-related classes? I am new to Scrapy and have learned a lot from this group. Now I am trying to learn how to use IP rotation or Tor.

As one of our members suggested, I started Tor and set http_proxy to:

set http_proxy=http://localhost:8118

but it throws this error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to:

set http_proxy=http://localhost:9051

Now the error is:

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

I checked the Firefox network settings. There are no HTTP proxies configured there; instead it uses SOCKS v5, showing 127.0.0.1:9051. (Before Tor, it worked with no proxies.) Please help; I still do not understand how to use Tor through Scrapy. Which Tor bundle am I supposed to use, and how? I hope both of my questions can be resolved:

If a Scrapy crawler hangs for some reason (for example, a connection failure), I would like to resume the crawl from where it stopped.
How can I use rotating IPs in Scrapy?

asked Nov 10 '11 18:11

user1020058

1 Answer

Tor by itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that Privoxy [1] is not running properly. Set up Privoxy correctly, then try again with the environment variable http_proxy=http://localhost:8118.
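One quick way to tell whether anything is listening on the proxy port at all is to attempt a plain TCP connection to it. The snippet below is only a sketch: the helper name port_is_open is made up for illustration, and 8118 is simply the port from the question, so adjust it if your proxy listens elsewhere.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it only checks that
    something accepts connections, not that it is a working proxy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and name-resolution errors.
        return False


if __name__ == "__main__":
    # 8118 is the port used in the question (assumed proxy port).
    print("Proxy port reachable:", port_is_open("localhost", 8118))
```

If this prints False, the "connection refused" error comes from nothing listening on that port rather than from Scrapy itself.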

I have successfully crawled through Tor using Privoxy with Scrapy.
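As an alternative to the environment variable, the proxy can also be set per request, since Scrapy reads the proxy address from request.meta['proxy']. What follows is a minimal sketch and not the setup described above: the class name ProxyMiddleware, the module path in the comment, and the priority value are all assumptions made for illustration.

```python
class ProxyMiddleware:
    """Hypothetical downloader middleware that routes requests via one HTTP proxy.

    To enable it, something like the following would go in settings.py
    (module path and priority are illustrative assumptions):

        DOWNLOADER_MIDDLEWARES = {'yp.middlewares.ProxyMiddleware': 100}
    """

    def __init__(self, proxy_url="http://localhost:8118"):
        # Default is the proxy address from the question.
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Keep any proxy that was already set on the request; otherwise use ours.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None tells Scrapy to continue handling the request normally.
        return None
```

Using setdefault means a spider can still override the proxy for individual requests by setting request.meta['proxy'] itself.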

[1] http://www.privoxy.org/

answered Oct 16 '22 17:10

R. Max
