I wrote a spider that fetches my IP from http://ip.42.pl/raw via a proxy.
It is my first spider.
I want to change the user agent on each request.
I followed this tutorial http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu
and completed all of its steps; this is my code.
BOT_NAME = 'CheckIP'

SPIDER_MODULES = ['CheckIP.spiders']
NEWSPIDER_MODULE = 'CheckIP.spiders'

USER_AGENT_LIST = [
    'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Linux; U; Android 2.3; en-us) AppleWebKit/999+ (KHTML, like Gecko) Safari/999.9',
    'Mozilla/5.0 (Linux; U; Android 2.3.5; zh-cn; HTC_IncredibleS_S710e Build/GRJ90) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
]

HTTP_PROXY = 'http://127.0.0.1:8123'

DOWNLOADER_MIDDLEWARES = {
    'CheckIP.middlewares.RandomUserAgentMiddleware': 400,
    'CheckIP.middlewares.ProxyMiddleware': 410,
    # disable the built-in user agent middleware
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
import random

from scrapy.conf import settings
from scrapy import log


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
        # this is just to check which user agent is being used for the request
        spider.log(
            u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
            level=log.DEBUG
        )


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
import time

from scrapy.spider import Spider
from scrapy.http import Request


class CheckIpSpider(Spider):
    name = 'checkip'
    allowed_domains = ["ip.42.pl"]
    url = "http://ip.42.pl/raw"

    def start_requests(self):
        yield Request(self.url, callback=self.parse)

    def parse(self, response):
        now = time.strftime("%c")
        ip = now + "-" + response.body + "\n"
        with open('ips.txt', 'a') as f:
            f.write(ip)
This is the information returned for the User-Agent:
2015-10-30 22:24:20+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-10-30 22:24:20+0200 [checkip] DEBUG: User-Agent: Scrapy/0.24.4 (+http://scrapy.org) <GET http://ip.42.pl/raw>
User-Agent: Scrapy/0.24.4 (+http://scrapy.org)
When I manually add the header to the request, everything works correctly:
def start_requests(self):
    yield Request(self.url, callback=self.parse, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})
This is the result returned in the console:
2015-10-30 22:50:32+0200 [checkip] DEBUG: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3 <GET http://ip.42.pl/raw>
How can I use USER_AGENT_LIST in my spider?
If you don't need a random user agent, you can just set USER_AGENT
in your settings file, like:
settings.py:
...
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
...
No need for the middleware. But if you really want to select a user agent at random, first make sure from the Scrapy logs that RandomUserAgentMiddleware
is being used; you should see something like this in your logs:
Enabled downloader middlewares:
[
...
'CheckIP.middlewares.RandomUserAgentMiddleware',
...
]
Check that CheckIP.middlewares
is the correct path to that middleware.
Now, maybe the settings are being incorrectly loaded in the middleware. I would recommend using the from_crawler
method to load them:
class RandomUserAgentMiddleware(object):
    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        o = cls(settings)
        return o
Now use self.settings.get('USER_AGENT_LIST')
inside the process_request
method to get what you want.
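Putting those two changes together, here is a minimal runnable sketch of the corrected middleware. The class and method names come from your code; the Fake* classes at the bottom are hypothetical stand-ins for Scrapy's crawler and request objects, added only so the sketch can run outside a Scrapy project:

```python
import random


class RandomUserAgentMiddleware(object):
    """Pick a random entry from the USER_AGENT_LIST setting per request."""

    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates the middleware through this hook and hands
        # over the crawler, whose settings replace the old global
        # `scrapy.conf.settings` import.
        return cls(crawler.settings)

    def process_request(self, request, spider):
        ua = random.choice(self.settings.get('USER_AGENT_LIST'))
        if ua:
            # setdefault keeps any User-Agent set explicitly on the request
            request.headers.setdefault('User-Agent', ua)


# --- hypothetical stand-ins so the sketch runs outside Scrapy ---
class FakeCrawler(object):
    def __init__(self, settings):
        self.settings = settings  # a plain dict mimics settings.get()


class FakeRequest(object):
    def __init__(self):
        self.headers = {}  # a plain dict mimics headers.setdefault()


mw = RandomUserAgentMiddleware.from_crawler(
    FakeCrawler({'USER_AGENT_LIST': ['UA-one', 'UA-two']}))
req = FakeRequest()
mw.process_request(req, spider=None)
print(req.headers['User-Agent'])  # prints either 'UA-one' or 'UA-two'
```

Inside a real project you would keep only the middleware class; Scrapy calls from_crawler and process_request for you once the class is listed in DOWNLOADER_MIDDLEWARES.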
Also, please update your Scrapy version; it looks like you are using 0.24
while 1.0 has already been released.