Spoofing IP address when web scraping (python)

I have written a web scraper in Python that tells me when free bet offers on various bookmaker websites have changed or new ones have been added.

However, the bookies tend to record information about IP traffic and MAC addresses in order to flag matched bettors.

How can I spoof my IP address when using Request() from the urllib.request module?

My code is below:

from urllib.request import Request, urlopen
import bs4

req = Request('https://www.888sport.com/online-sports-betting-promotions/', headers={'User-Agent': 'Mozilla/5.0'})
site = urlopen(req).read()
content = bs4.BeautifulSoup(site, 'html.parser')

asked Aug 05 '16 09:08

2 Answers

I faced the same problem a while ago. Here is the code snippet I use to scrape anonymously.

from urllib.request import Request, urlopen
from fake_useragent import UserAgent
import random
from bs4 import BeautifulSoup
from IPython.display import clear_output

# Here I provide some proxies for not getting caught while scraping
ua = UserAgent() # From here we generate a random user agent
proxies = [] # Will contain proxies [ip, port]

# Main function
def main():
  # Retrieve latest proxies
  proxies_req = Request('https://www.sslproxies.org/')
  proxies_req.add_header('User-Agent', ua.random)
  proxies_doc = urlopen(proxies_req).read().decode('utf8')

  soup = BeautifulSoup(proxies_doc, 'html.parser')
  proxies_table = soup.find(id='proxylisttable')

  # Save proxies in the array
  for row in proxies_table.tbody.find_all('tr'):
    proxies.append({
      'ip':   row.find_all('td')[0].string,
      'port': row.find_all('td')[1].string
    })

  # Choose a random proxy
  proxy_index = random_proxy()
  proxy = proxies[proxy_index]

  for n in range(1, 20):
    # Every 10 requests, switch to a new random proxy
    if n % 10 == 0:
      proxy_index = random_proxy()
      proxy = proxies[proxy_index]

    req = Request('http://icanhazip.com')
    req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'http')

    # Make the call
    try:
      my_ip = urlopen(req, timeout=10).read().decode('utf8')
      print('#' + str(n) + ': ' + my_ip)
      clear_output(wait = True)
    except Exception: # On any error, delete this proxy and pick another one
      del proxies[proxy_index]
      print('Proxy ' + proxy['ip'] + ':' + proxy['port'] + ' deleted.')
      if not proxies: # Nothing left to try
        break
      proxy_index = random_proxy()
      proxy = proxies[proxy_index]

# Retrieve a random index proxy (we need the index to delete it if not working)
def random_proxy():
  return random.randint(0, len(proxies) - 1)

if __name__ == '__main__':
  main()

That gives you a list of working proxies. Then add this part:

user_agent_list = (
    # Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
)

This gives you a pool of different User-Agent headers, so each request looks like it comes from a browser. Last but not least, just pass those into your request:

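from requests import get

# Make a GET request through a random proxy with a random User-Agent
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent, 'Accept-Language': 'en-US, en;q=0.5'}
proxy = random.choice(proxies)
# requests expects a scheme -> proxy URL mapping, not the raw ip/port dict
proxy_url = 'http://' + proxy['ip'] + ':' + proxy['port']
response = get('your url', headers=headers, proxies={'http': proxy_url, 'https': proxy_url})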
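
For what it's worth, here is a minimal sketch of how the scraped proxy entries and a user-agent list could be turned into arguments for requests.get. The helper names proxy_url and request_kwargs are my own and not part of requests, and I am assuming each proxy entry is a dict with 'ip' and 'port' keys, as built in main(). The point to note is that requests takes proxies as a mapping from URL scheme to proxy URL. The sample IP comes from a documentation-only address range, so it is a placeholder, not a real proxy.

```python
import random


def proxy_url(entry):
    """Turn {'ip': ..., 'port': ...} into 'http://ip:port'."""
    return 'http://{}:{}'.format(entry['ip'], entry['port'])


def request_kwargs(proxies, user_agents, rng=random):
    """Pick a random proxy and User-Agent and build keyword arguments
    that can be passed on as requests.get(url, **kwargs).

    The function names and the rng parameter are illustrative assumptions.
    """
    url = proxy_url(rng.choice(proxies))
    return {
        'headers': {'User-Agent': rng.choice(user_agents)},
        # requests maps each URL scheme to the proxy that should carry it
        'proxies': {'http': url, 'https': url},
        'timeout': 10,
    }


# Placeholder data for illustration only
sample_proxies = [{'ip': '203.0.113.5', 'port': '8080'}]
sample_agents = ['Mozilla/5.0 (X11; Linux x86_64)']

kwargs = request_kwargs(sample_proxies, sample_agents)
print(kwargs['proxies'])
```

With real data you would then call requests.get('your url', **kwargs) and draw fresh kwargs for each request, or every few requests.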

Hope that helps with your problem.

Otherwise look here: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

Cheers

answered Sep 20 '22 19:09

To get around IP-based rate limits and bans and to hide your real IP, you need to use proxies. Many services provide them, and using one is worth considering, since managing proxies yourself is a real headache and would cost much more. I suggest https://botproxy.net among others. They provide rotating proxies through a single endpoint. Here is how you can make requests using this service:

#!/usr/bin/env python
import urllib.request
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler(
        {'http': 'http://user-key:key-password@x.botproxy.net:8080',
         'https': 'http://user-key:key-password@x.botproxy.net:8080'}))
print(opener.open('https://httpbin.org/ip').read())

Or, using the requests library:

import requests

res = requests.get(
    'http://httpbin.org/ip',
    proxies={
        'http': 'http://user-key:key-password@x.botproxy.net:8080',
        'https': 'http://user-key:key-password@x.botproxy.net:8080'
        },
    headers={
        'X-BOTPROXY-COUNTRY': 'US'
        })
print(res.text)

They also have proxies in different countries.
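
Since the question uses Request() and urlopen() directly, one option is to install the proxy-aware opener globally so that existing urlopen calls go through the proxy without further changes. This is only a sketch: make_proxy_opener is a helper name I made up, and the credentials and host are the same placeholders as above, to be replaced with whatever your provider gives you.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends both http and https requests via proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder credentials and host: substitute your provider's values.
opener = make_proxy_opener('http://user-key:key-password@x.botproxy.net:8080')

# After this call, plain urllib.request.urlopen(...) uses the proxy as well.
urllib.request.install_opener(opener)

req = urllib.request.Request(
    'https://www.888sport.com/online-sports-betting-promotions/',
    headers={'User-Agent': 'Mozilla/5.0'})
# site = urllib.request.urlopen(req).read()  # uncomment to actually send the request
```

The rest of the scraper (reading the response and handing it to BeautifulSoup) can stay as it is.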

answered Sep 18 '22 19:09

mylh

Tags: python, tcp, web-scraping

Asked by Diran; answered by Yannik Suhre and mylh.