 

Intelligent screen scraping using different proxies and user-agents randomly?

I want to download a few HTML pages from http://abc.com/view_page.aspx?ID= where the ID comes from an array of different numbers.

I would like to visit multiple instances of this URL, saving each page as [ID].HTML, using different proxy IPs/ports.

I also want to use different user-agents and randomize the wait times before each download.

What is the best way of doing this? urllib2? pycURL? cURL? What do you prefer for the task at hand?

Please advise. Thanks guys!

asked May 10 '10 by ThinkCode


People also ask

How many proxies do I need for scraping?

To estimate the number of proxy servers you need, divide the total throughput of your web scraper (requests per hour) by a threshold of roughly 500 requests per IP per hour; the result approximates the number of different IP addresses you'll need.
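For example, a quick calculation using the 500 requests/IP/hour figure quoted above (the throughput value is hypothetical):

import math

requests_per_hour = 10000   # hypothetical scraper throughput
per_ip_threshold = 500      # requests per IP per hour, as quoted above

print(math.ceil(requests_per_hour / per_ip_threshold))   # -> 20 proxies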

What is User Agent scraping?

A user agent is a computer program representing a person, for example, a browser in a Web context. Besides a browser, a user agent could be a bot scraping webpages, a download manager, or another app accessing the Web.

What is proxy scraping?

A proxy service for scraping manages the proxies used in a scraping project. At its simplest, it can be a set of proxies used in parallel to create the appearance of separate users accessing the site at the same time.
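As a rough illustration of that parallel setup, here is a minimal Python 3 sketch (the proxy addresses and ID range are hypothetical); each URL is fetched concurrently through one of the proxies:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

proxies = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']  # hypothetical
urls = ['http://abc.com/view_page.aspx?ID=%d' % i for i in (1, 2, 3, 4)]

def fetch(url, proxy):
    # build an opener routed through this request's proxy
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy}))
    return opener.open(url).read()

# spread the URLs across the proxies round-robin and fetch in parallel
with ThreadPoolExecutor(max_workers=len(proxies)) as pool:
    pages = list(pool.map(fetch, urls,
                          [proxies[i % len(proxies)] for i in range(len(urls))]))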


2 Answers

Use something like:

import urllib2
import time
import random

MAX_WAIT = 5
ids = ...
agents = ...
proxies = ...

for id in ids:
    url = 'http://abc.com/view_page.aspx?ID=%d' % id
    # route this request through the current proxy with the current user-agent
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxies[0]}))
    html = opener.open(urllib2.Request(url, None, {'User-agent': agents[0]})).read()
    with open('%d.html' % id, 'w') as f:
        f.write(html)
    agents.append(agents.pop(0))    # rotate to the next user-agent
    proxies.append(proxies.pop(0))  # rotate to the next proxy
    time.sleep(MAX_WAIT * random.random())  # random wait of up to MAX_WAIT seconds
answered Oct 23 '22 by hoju
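Note that urllib2 is Python 2 only; under Python 3 the same approach would use urllib.request instead. A minimal sketch, keeping the placeholder lists from the answer above:

import urllib.request
import time
import random

MAX_WAIT = 5
ids = ...
agents = ...
proxies = ...

for id in ids:
    url = 'http://abc.com/view_page.aspx?ID=%d' % id
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxies[0]}))
    request = urllib.request.Request(url, None, {'User-agent': agents[0]})
    with open('%d.html' % id, 'wb') as f:  # responses are bytes in Python 3
        f.write(opener.open(request).read())
    agents.append(agents.pop(0))    # rotate user-agents
    proxies.append(proxies.pop(0))  # rotate proxies
    time.sleep(MAX_WAIT * random.random())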


Use the Unix tool wget. It has options to specify a custom user-agent and a delay between each retrieval of a page.

See the wget(1) man page for more information.

answered Oct 23 '22 by pajton
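For example, a minimal shell sketch of this approach (the proxy address, agent string, and ID range are placeholders):

for id in 1 2 3; do
    wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
         -e use_proxy=yes -e http_proxy=http://10.0.0.1:8080 \
         -O "$id.html" "http://abc.com/view_page.aspx?ID=$id"
    sleep $((RANDOM % 5))   # random 0-4 second pause (bash)
done

When passing several URLs to a single wget invocation, the --wait and --random-wait options handle the delay between retrievals instead.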