I am trying to scrape one of the free proxy listing websites, but I just can't manage to extract the proxies. Below is my code:
import requests
import re
url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
proxies = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?', source)
print(proxies)
I would highly appreciate it if someone could help me do this without additional libraries/modules like BeautifulSoup.
It is generally best to use a parser such as BeautifulSoup to extract data from HTML, rather than regular expressions, because it is very difficult to reproduce BeautifulSoup's accuracy with regex alone. (Incidentally, the reason your findall call prints tuples of fragments instead of addresses is that re.findall returns the capture-group contents, not the whole match, whenever the pattern contains groups.) However, you can try this with pure regex:
import re
import requests

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text

# findall returns (group1, group2) tuples, one of which is always empty;
# filter(None, ...) keeps whichever group actually matched the cell text.
data = [list(filter(None, i))[0] for i in re.findall('<td class="hm">(.*?)</td>|<td>(.*?)</td>', source)]

# The cell values repeat in groups of four: ip, port, country code, anonymity.
groupings = [dict(zip(['ip', 'port', 'code', 'using_anonymous'], data[i:i+4])) for i in range(0, len(data), 4)]
Sample output (the full list contains 300 entries):
[{'ip': '47.88.242.10', 'port': '80', 'code': 'SG', 'using_anonymous': 'anonymous'}, {'ip': '118.189.172.136', 'port': '80', 'code': 'SG', 'using_anonymous': 'elite proxy'}, {'ip': '147.135.210.114', 'port': '54566', 'code': 'PL', 'using_anonymous': 'anonymous'}, {'ip': '5.148.150.155', 'port': '8080', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '186.227.8.21', 'port': '3128', 'code': 'BR', 'using_anonymous': 'anonymous'}, {'ip': '49.151.155.60', 'port': '8080', 'code': 'PH', 'using_anonymous': 'anonymous'}, {'ip': '52.170.255.17', 'port': '80', 'code': 'US', 'using_anonymous': 'anonymous'}, {'ip': '51.15.35.239', 'port': '3128', 'code': 'NL', 'using_anonymous': 'elite proxy'}, {'ip': '163.172.27.213', 'port': '3128', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '94.137.31.214', 'port': '8080', 'code': 'RU', 'using_anonymous': 'anonymous'}]
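As an aside on why the original attempt printed tuples of fragments: when a pattern contains capture groups, re.findall returns the group contents rather than the whole match. A minimal, self-contained sketch of the difference (the HTML snippet here is made up for illustration):

import re

html = '<td>47.88.242.10</td><td>80</td>'  # hypothetical snippet

# With a capturing group, findall returns only the group's last repetition:
print(re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}', html))    # ['242.']

# With a non-capturing group (?:...), findall returns the whole match:
print(re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', html))  # ['47.88.242.10']

Note that the IP and port sit in separate <td> cells on this page, so even a corrected pattern cannot pull out ip:port pairs in one pass, which is why the cell-by-cell approach above is used.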
Edit: to concatenate the ip and the port, iterate over each grouping and use string formatting:
final_groupings = [{'full_ip':"{ip}:{port}".format(**i)} for i in groupings]
Output:
[{'full_ip': '47.88.242.10:80'}, {'full_ip': '118.189.172.136:80'}, {'full_ip': '147.135.210.114:54566'}, {'full_ip': '5.148.150.155:8080'}, {'full_ip': '186.227.8.21:3128'}, {'full_ip': '49.151.155.60:8080'}, {'full_ip': '52.170.255.17:80'}, {'full_ip': '51.15.35.239:3128'}, {'full_ip': '163.172.27.213:3128'}, {'full_ip': '94.137.31.214:8080'}]
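For comparison, here is what the parser-based approach mentioned at the top of this answer might look like. This is only a minimal sketch: it assumes beautifulsoup4 is installed and that the first <table> on the page is the proxy table, with the IP in the first column and the port in the second.

import requests
from bs4 import BeautifulSoup

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text

soup = BeautifulSoup(source, 'html.parser')
proxies = []
table = soup.find('table')  # assumption: the proxy list is the first table
if table:
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 2:  # header rows use <th> and are skipped
            proxies.append('{}:{}'.format(cells[0].get_text(strip=True), cells[1].get_text(strip=True)))
print(proxies[:10])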