You are trying to connect directly to a SOCKS port - Tor rejects any non-SOCKS traffic. You can instead connect through a middleman - Privoxy - listening on port 8118.
Example:
import urllib2

proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()
Also note the value passed to ProxyHandler: there is no http:// prefix before the ip:port.
First, install PySocks:
pip install PySocks
Then:
import socket
import socks
import urllib2

ipcheck_url = 'http://checkip.amazonaws.com/'

# Actual IP.
print(urllib2.urlopen(ipcheck_url).read())

# Tor IP: route all new sockets through Tor's SOCKS5 port.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket
print(urllib2.urlopen(ipcheck_url).read())
Using just urllib2.ProxyHandler as in https://stackoverflow.com/a/2015649/895245 fails with:
Tor is not an HTTP Proxy
This is mentioned at: How can I use a SOCKS 4/5 proxy with urllib2?
Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.
The following code works on Python 3.4 (you need to keep the Tor Browser open while using it).
This script connects to Tor through SOCKS5, gets the IP from checkip.dyn.com, changes identity through the control port, and resends the request to get the new IP (it loops 10 times).
You need to install the appropriate libraries (PySocks, stem, requests, beautifulsoup4) to get this working. (Enjoy and don't abuse.)
import socks
import socket
import time
from stem.control import Controller
from stem import Signal
import requests
from bs4 import BeautifulSoup

err = 0
counter = 0
url = "http://checkip.dyn.com"

with Controller.from_port(port=9151) as controller:
    try:
        controller.authenticate()
        # Route all new sockets through the Tor Browser's SOCKS5 port.
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
        socket.socket = socks.socksocket
        while counter < 10:
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "html.parser")
            print(soup.find("body").text)
            counter = counter + 1
            # Request a new identity and wait until it is available.
            controller.signal(Signal.NEWNYM)
            time.sleep(controller.get_newnym_wait())
    except requests.HTTPError:
        print("Could not reach URL")
        err = err + 1

print("Used " + str(counter) + " IPs and got " + str(err) + " errors")
Using Privoxy as an HTTP proxy in front of Tor works for me - here's a crawler template:
import urllib2
import httplib
from BeautifulSoup import BeautifulSoup
from time import sleep

class Scraper(object):
    def __init__(self, proxy=None):
        if proxy is None:
            proxy = "http://localhost:8118/"  # Privoxy in front of Tor.
        self._open = self._get_opener(proxy)

    def _get_opener(self, proxy):
        proxy_handler = urllib2.ProxyHandler({'http': proxy})
        opener = urllib2.build_opener(proxy_handler)
        return opener.open

    def get_soup(self, url):
        # Retry until the page comes back parseable.
        soup = None
        while soup is None:
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', 'foo bar useragent')
                soup = BeautifulSoup(self._open(request))
            except (httplib.IncompleteRead, httplib.BadStatusLine,
                    urllib2.HTTPError, ValueError, urllib2.URLError) as err:
                sleep(1)
        return soup

class PageType(Scraper):
    _URL_TEMPL = "http://foobar.com/baz/%s"

    def items_from_page(self, url):
        nextpage = None
        soup = self.get_soup(url)
        items = []
        for item in soup.findAll("foo"):
            items.append(item["bar"])
            nextpage = item["href"]
        return nextpage, items

    def get_items(self):
        nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
        while nextpage is not None:
            nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
            items.extend(newitems)
        return items

pt = PageType()
print pt.get_items()
Here is code for downloading files through the Tor proxy in Python (update the URL as needed):
import urllib2

url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8) * (len(status) + 1)
    print status,

f.close()
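If you are on Python 3, a minimal port of the same downloader might look like this - a sketch, assuming Privoxy is still listening on port 8118 (urllib.request is where Python 3 keeps urllib2's ProxyHandler and build_opener):
import urllib.request

url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

# Same Privoxy-in-front-of-Tor setup as above (assumption: Privoxy on 8118).
proxy = urllib.request.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)

file_name = url.split('/')[-1]
with urllib.request.urlopen(url) as u, open(file_name, 'wb') as f:
    file_size = int(u.headers['Content-Length'])
    print("Downloading: %s Bytes: %s" % (file_name, file_size))
    while True:
        buffer = u.read(8192)
        if not buffer:
            break
        f.write(buffer)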
The following solutions work for me in Python 3. Adapted from CiroSantilli's answer:
With urllib (the Python 3 name for urllib2):
import socks
import socket
from urllib.request import urlopen
url = 'http://icanhazip.com/'
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket
response = urlopen(url)
print(response.read())
With requests:
import socks
import socket
import requests
url = 'http://icanhazip.com/'
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket
response = requests.get(url)
print(response.text)
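If you would rather not monkey-patch the socket module globally, requests can also talk to Tor's SOCKS port directly through its proxies argument. This needs the SOCKS extra (pip install requests[socks]), and the socks5h scheme makes DNS resolve through Tor as well - a minimal sketch:
import requests

url = 'http://icanhazip.com/'
# Assumption: Tor Browser's SOCKS port on 9150 (use 9050 for a system Tor).
proxies = {
    'http': 'socks5h://127.0.0.1:9150',
    'https': 'socks5h://127.0.0.1:9150',
}
response = requests.get(url, proxies=proxies)
print(response.text)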
With Selenium + PhantomJS:
from selenium import webdriver

url = 'http://icanhazip.com/'
service_args = ['--proxy=localhost:9150', '--proxy-type=socks5']
phantomjs_path = '/your/path/to/phantomjs'

driver = webdriver.PhantomJS(executable_path=phantomjs_path,
                             service_args=service_args)
driver.get(url)
print(driver.page_source)
driver.close()
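PhantomJS is no longer maintained, so if the above stops working, a similar setup with headless Firefox should behave the same way - a sketch, assuming geckodriver is installed and the Tor Browser's SOCKS port 9150 is open (the preference names are standard Firefox proxy settings):
from selenium import webdriver

url = 'http://icanhazip.com/'

options = webdriver.FirefoxOptions()
options.add_argument('-headless')
# Route traffic through Tor's SOCKS5 port, including DNS lookups.
options.set_preference('network.proxy.type', 1)  # 1 = manual proxy config
options.set_preference('network.proxy.socks', '127.0.0.1')
options.set_preference('network.proxy.socks_port', 9150)
options.set_preference('network.proxy.socks_remote_dns', True)

driver = webdriver.Firefox(options=options)
driver.get(url)
print(driver.page_source)
driver.quit()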
Note: If you are planning to use Tor often, consider making a donation to support their awesome work!