
Using Tor with the Scrapy framework

Tags:

python

tor

scrapy

I am trying to crawl a website that is sophisticated enough to stop bots: it permits only a few requests, and after that Scrapy hangs.

Question 1: If Scrapy hangs, is there a way to restart the crawl from the same point? To work around this problem, I wrote my settings file like this:

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER = True
COOKIES_ENABLED = False
RANDOMIZE_DOWNLOAD_DELAY = True
SCHEDULER_ORDER = 'BFO'

This is my program:

# Import paths assumed from the Scrapy 0.x-era APIs used below
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector

class ypSpider(CrawlSpider):

   name = "yp"

   start_urls = [
       SOME URL
   ]
   rules = (
      # These are some rules
   )

   def parse_item(self, response):
      # Clean the HTML page by removing scripts and HTML tags
      hxs = HtmlXPathSelector(response)

My question is where I should configure the HTTP proxies, and whether I have to import any Tor-related classes. I am new to Scrapy and have learned a lot from this group. Now I am trying to learn how to use IP rotation or Tor.

As one of the members suggested, I started Tor and set http_proxy to

set http_proxy=http://localhost:8118

but it throws an error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to

set http_proxy=http://localhost:9051

Now the error is

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

I checked the Firefox network settings. No HTTP proxy is configured there; instead it uses SOCKS v5 at 127.0.0.1:9051. (Before Tor, it worked with no proxies.) Please help; I still do not understand how to use Tor through Scrapy. Which Tor bundle am I supposed to use, and how? I hope both of my questions can be resolved:

  1. If a Scrapy crawler hangs for some reason (such as a connection failure), I would like to resume the crawl from where it stopped.
  2. How do I use rotating IPs in Scrapy?
user1020058 asked Nov 10 '11 18:11




1 Answer

Tor by itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that Privoxy [1] is not running properly. Set up Privoxy correctly, then try again with the environment variable http_proxy=http://localhost:8118.
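
A "connection refused" error usually means nothing is listening on the port. As a rough diagnostic, the sketch below checks whether a TCP connection can be opened to each local port. The helper name port_is_open is hypothetical. The ports are assumptions based on common defaults (8118 for Privoxy, 9050 for Tor's SOCKS listener), so adjust them to match your installation.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed default ports; change them if your setup differs.
    for name, port in (("Privoxy", 8118), ("Tor SOCKS", 9050)):
        state = "listening" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port}: {state}")
```

If the Privoxy port is not reachable, the crawler cannot use it as a proxy no matter how http_proxy is set.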

I have successfully crawled through Tor using Privoxy with Scrapy.
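
As an alternative to the environment variable, the proxy can be set per request. Scrapy reads the proxy for a request from request.meta['proxy'], so a small downloader middleware can point every request at Privoxy. This is a minimal sketch, not the setup used above: the class name PrivoxyProxyMiddleware, the module path yp.middlewares, and the priority value are all assumptions for illustration.

```python
# Hypothetical module, e.g. yp/middlewares.py

# Assumes Privoxy listens on its usual local port and forwards to Tor.
PRIVOXY_URL = "http://localhost:8118"


class PrivoxyProxyMiddleware:
    """Downloader middleware that routes requests through Privoxy."""

    def process_request(self, request, spider):
        # Leave any proxy that was already chosen for this request alone.
        request.meta.setdefault("proxy", PRIVOXY_URL)
        # Returning None tells Scrapy to continue processing the request.
        return None


# To enable it, settings.py would need an entry like this
# (module path and priority are assumptions):
# DOWNLOADER_MIDDLEWARES = {"yp.middlewares.PrivoxyProxyMiddleware": 100}
```

Privoxy itself still has to be configured to forward traffic to Tor's SOCKS port; the middleware only tells Scrapy where Privoxy is.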

[1] http://www.privoxy.org/
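
On the first question (resuming a crawl), Scrapy versions that support job persistence can store the scheduler's pending requests and the duplicate filter's state on disk through the JOBDIR setting. A crawl that is stopped and then started again with the same directory continues from where it left off. The sketch below is a settings fragment under that assumption. The directory name and the timeout and retry values are illustrative, not recommendations.

```python
# settings.py fragment (values are illustrative assumptions)

# Directory where Scrapy persists crawl state between runs.
JOBDIR = "crawls/yp-run-1"

# Give up on a stalled download after this many seconds instead of hanging.
DOWNLOAD_TIMEOUT = 30

# Retry failed requests (for example, dropped connections) a few times.
RETRY_ENABLED = True
RETRY_TIMES = 5
```

Use a separate JOBDIR for each crawl run you want to be resumable, and stop the crawler gracefully so the state is written out completely.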

R. Max answered Oct 16 '22 17:10
