 

Scraping free proxy listing website

I am trying to scrape a free proxy listing website, but I can't manage to extract the proxies.

Below is my code:

import requests
import re

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}

source = requests.get(url, headers=headers, timeout=10).text

proxies = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?', source)

print(proxies)

I would highly appreciate it if someone could help me without the use of additional libraries/modules like BeautifulSoup.

wished asked Jan 24 '18




1 Answer

It is generally best to use a parser such as BeautifulSoup to extract data from HTML rather than regular expressions, because regex is brittle against even small changes in the markup and hard to get right; however, you can try this with pure regex:

import re
import requests

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text

# Each table cell is either <td class="hm">...</td> or a plain <td>...</td>;
# findall returns (group1, group2) tuples, so filter(None, i) keeps the non-empty one.
data = [list(filter(None, i))[0] for i in re.findall('<td class="hm">(.*?)</td>|<td>(.*?)</td>', source)]
# The cells repeat in groups of four per row: ip, port, country code, anonymity level.
groupings = [dict(zip(['ip', 'port', 'code', 'using_anonymous'], data[i:i+4])) for i in range(0, len(data), 4)]

Sample output (actual length is 300):

[{'ip': '47.88.242.10', 'port': '80', 'code': 'SG', 'using_anonymous': 'anonymous'}, {'ip': '118.189.172.136', 'port': '80', 'code': 'SG', 'using_anonymous': 'elite proxy'}, {'ip': '147.135.210.114', 'port': '54566', 'code': 'PL', 'using_anonymous': 'anonymous'}, {'ip': '5.148.150.155', 'port': '8080', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '186.227.8.21', 'port': '3128', 'code': 'BR', 'using_anonymous': 'anonymous'}, {'ip': '49.151.155.60', 'port': '8080', 'code': 'PH', 'using_anonymous': 'anonymous'}, {'ip': '52.170.255.17', 'port': '80', 'code': 'US', 'using_anonymous': 'anonymous'}, {'ip': '51.15.35.239', 'port': '3128', 'code': 'NL', 'using_anonymous': 'elite proxy'}, {'ip': '163.172.27.213', 'port': '3128', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '94.137.31.214', 'port': '8080', 'code': 'RU', 'using_anonymous': 'anonymous'}]
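As a side note, the reason the snippet in the question prints tuples of fragments rather than addresses is that `re.findall` returns the capturing groups whenever the pattern contains any. Making the groups non-capturing with `(?:...)` makes it return the whole match instead. A minimal sketch on a sample string (the HTML fragment is illustrative, not taken from the live page):

```python
import re

# Illustrative fragment standing in for the fetched page source.
sample = '<td>47.88.242.10</td><td>80</td> text 118.189.172.136:80 more'

# (?:...) groups repeat without capturing, so findall yields full matches;
# the port part is optional because some occurrences are bare IPs.
pattern = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{2,5})?'

print(re.findall(pattern, sample))
# → ['47.88.242.10', '118.189.172.136:80']
```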

Edit: to concatenate the ip and the port, iterate over each grouping and use string formatting:

final_groupings = [{'full_ip':"{ip}:{port}".format(**i)} for i in groupings]

Output:

[{'full_ip': '47.88.242.10:80'}, {'full_ip': '118.189.172.136:80'}, {'full_ip': '147.135.210.114:54566'}, {'full_ip': '5.148.150.155:8080'}, {'full_ip': '186.227.8.21:3128'}, {'full_ip': '49.151.155.60:8080'}, {'full_ip': '52.170.255.17:80'}, {'full_ip': '51.15.35.239:3128'}, {'full_ip': '163.172.27.213:3128'}, {'full_ip': '94.137.31.214:8080'}]
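Once you have the `full_ip` strings, the usual next step is to route a request through one of them via the `proxies` parameter of `requests.get`. A hedged sketch (the address shown is just one value from the sample output above, and free proxies are frequently dead, so the call is wrapped in a try/except):

```python
import requests

proxy = '47.88.242.10:80'  # illustrative value; in practice, pick from final_groupings
proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}

try:
    # httpbin echoes the origin IP, which lets you confirm the proxy is in use.
    resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
    print(resp.json())
except requests.RequestException:
    print('proxy unreachable')
```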
Ajax1234 answered Sep 19 '22