I've created a script to scrape the name and email address from a webpage. When I run my script I get the name correctly, but for the email this is what I get: aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3. The string that I get instead of the email changes every time I run the script.
Website link
I've tried so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"
res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("[class$='-supplier-view-main-container'] > h1").text
email = soup.select_one("[class='__cf_email__']").get("data-cfemail")
print(f'{"Name: "}{name}\n{"Email: "}{email}')
Current output:
Name: Tri-Cor Flexible Packaging Inc
Email: aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3
Expected output:
Name: Tri-Cor Flexible Packaging Inc
Email: bryan@tri-cor.com
PS I'm not after any solution related to any browser simulator, as in selenium.
How can I get that email from that page using requests?
The short answer is that you have to decode the email string, because it is being obfuscated.
Below is the reason why you have to decode the email string that you obtained from seafoodsource.com.
The website seafoodsource.com uses Cloudflare, a U.S. company that provides customers with website security, DDoS mitigation and other services.
I determined that the site was using Cloudflare by pinging seafoodsource.com, which returned the IP address 104.24.19.99. According to ARIN (American Registry for Internet Numbers) this IP address belongs to the netblock 104.16.0.0 - 104.31.255.255, which is registered to Cloudflare.
The class name __cf_email__ in your soup is another indication that the email address is being protected by Cloudflare (CF). A further indication is the warning message that is displayed when you click the protected link while viewing the page source.
Cloudflare Email Address Obfuscation helps in spam prevention by hiding email addresses that appear on the target website from email harvesters and other bots, but the email is visible to normal site visitors.
Under this protection an email address becomes a hex-encoded series of bytes whose length depends on the length of the email address.
It is worth noting that this encoding method is not designed to securely encrypt an email address, because it is cryptographically weak. It is only designed to confuse non-intelligent web scrapers that search for mailto: links within the HTML code. In other words, it obfuscates an email address but does not enforce its confidentiality.
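The length relationship can be checked directly on the string from the question. This is only a small sketch using the standard library: the hex string is turned into raw bytes, and the count comes out one higher than the number of characters in the address, because the leading byte is the key discussed next.

```python
# Encoded value taken from the question's output.
encoded = "aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3"

# Two hex characters describe one byte.
raw = bytes.fromhex(encoded)

hex_chars = len(encoded)      # number of hex characters in the string
total_bytes = len(raw)        # number of bytes they represent
payload_bytes = len(raw) - 1  # bytes left once the leading key byte is set aside

print(hex_chars)      # 36
print(total_bytes)    # 18
print(payload_bytes)  # 17
```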
The encoded email address in your question is:
aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3
The first byte of this encoded string is ae, or hexadecimal 0xae. This byte is the key used to encrypt and decrypt the remaining bytes: each subsequent byte is bitwise XORed with the key.
For instance:
0xae ^ 0xcc is hexadecimal 62, which translates to b in ASCII
0xae ^ 0xdc is hexadecimal 72, which translates to r in ASCII
0xae ^ 0xd7 is hexadecimal 79, which translates to y in ASCII
0xae ^ 0xcf is hexadecimal 61, which translates to a in ASCII
0xae ^ 0xc0 is hexadecimal 6e, which translates to n in ASCII
This spells bryan, which is the first part of the decoded email address.
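The same arithmetic can be carried through the whole string rather than just the first five bytes. The loop below is a minimal sketch that prints one line per byte in the style of the list above; the variable names are only illustrative.

```python
# Encoded value taken from the question's output.
encoded = "aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3"

# The first two hex characters are the key byte.
key = int(encoded[:2], 16)

decoded_chars = []
# Walk the remaining hex string two characters (one byte) at a time.
for i in range(2, len(encoded), 2):
    byte = int(encoded[i:i + 2], 16)
    plain = byte ^ key  # XOR with the key recovers the original character code
    decoded_chars.append(chr(plain))
    print(f"0x{key:02x} ^ 0x{byte:02x} = 0x{plain:02x} -> {chr(plain)!r}")

decoded = "".join(decoded_chars)
print(decoded)
```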
The bitwise XORing is happening in this code:
chr(int(encoded_string[i:i+2], 16) ^ base_16)
Let me explain further:
The first byte of the encoding string is the cipher key, which in this case is ae or 0xae.
If we convert 0xae to decimal it becomes 174.
When we convert the next byte 0xcc to decimal it becomes 204.
Let's combine these two decimals using the bitwise operator ^.
^ Bitwise Exclusive XOR
Returns the result of bitwise XOR of two integers.
first_byte = 174 # ae
second_byte = 204 # cc
xor_decimal = first_byte ^ second_byte
print (xor_decimal)
# outputs
98
Let's convert the XOR result to hexadecimal (base-16). We can use the built-in function hex in Python to accomplish this.
first_byte = 174 # ae
second_byte = 204 # cc
xor_decimal = first_byte ^ second_byte
print (hex(xor_decimal))
# outputs
0x62
As I previously mentioned, hexadecimal 62 translates to b in ASCII.
Let's look at the next byte iteration in the encoded string.
first_byte = 174 # ae
next_byte = 220 # dc
xor_decimal = first_byte ^ next_byte
print (hex(xor_decimal))
# outputs
0x72
As I previously mentioned, hexadecimal 72 translates to r in ASCII.
I feel that it's relevant to show how to convert a hex string to a decimal.
# without the 0x prefix
decimal = int('ae', 16)
print (decimal)
# outputs
174
# with the 0x prefix
decimal = int('0xae', 0)
print (decimal)
# outputs
174
ASCII Text to Hex conversion for the obfuscated email address:
ASCII email address: bryan@tri-cor.com
Hex email address: 62 72 79 61 6e 40 74 72 69 2d 63 6f 72 2e 63 6f 6d
We can use the built-in function bytearray in Python to decode this hex string:
hex_string = '62 72 79 61 6e 40 74 72 69 2d 63 6f 72 2e 63 6f 6d'
ascii_conversion = bytearray.fromhex(hex_string).decode()
print (ascii_conversion)
# outputs
bryan@tri-cor.com
ASCII Text to Decimal conversion for the obfuscated email address:
ASCII email address: bryan@tri-cor.com
Decimal email address: 98 114 121 97 110 64 116 114 105 45 99 111 114 46 99 111 109
If we prepend decimal 174 (ae in the obfuscated string) to the head of the decimal email address:
Decimal email address: 174 98 114 121 97 110 64 116 114 105 45 99 111 114 46 99 111 109
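Text email address: ®bryan@tri-cor.com
It looks like ® was the character used as the cipher key for the obfuscated string in your question. Strictly speaking, ® (0xae) lies outside 7-bit ASCII; it is the character at that code point in Latin-1 and Unicode.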
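Because XOR is its own inverse, the obfuscation can be reproduced by running the same steps in the other direction. The sketch below is an assumption about how such a string could be built, not Cloudflare's actual code; encode_email and decode_email are hypothetical helper names. It also shows that a different key yields a different string for the same address, which would be consistent with the value changing on every run if a new key is picked each time.

```python
def encode_email(email: str, key: int) -> str:
    """Hypothetical encoder: key byte first, then every character XORed with the key."""
    if not 0 <= key <= 0xFF:
        raise ValueError("key must fit in a single byte")
    data = email.encode("ascii")
    return bytes([key] + [b ^ key for b in data]).hex()


def decode_email(encoded: str) -> str:
    """Reverse of encode_email: split off the key byte and XOR the rest with it."""
    raw = bytes.fromhex(encoded)
    key, payload = raw[0], raw[1:]
    return bytes(b ^ key for b in payload).decode("ascii")


same_key = encode_email("bryan@tri-cor.com", 0xAE)
other_key = encode_email("bryan@tri-cor.com", 0x5F)

print(same_key)                 # aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3
print(other_key != same_key)    # True
print(decode_email(other_key))  # bryan@tri-cor.com
```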
I would be remiss if I did not mention binary numbers and XOR operations.
First byte conversions: hexadecimal ae, decimal 174, binary 10101110
Second byte conversions: hexadecimal cc, decimal 204, binary 11001100
We can perform the same ^ Bitwise Exclusive XOR operations with the binary numbers above:
# the notation 0b in front of the number is used to express that the value is
# a binary literal
first_byte_binary = 0b10101110
second_byte_binary = 0b11001100
xor_binary = first_byte_binary ^ second_byte_binary
print (bin(xor_binary))
# outputs
0b1100010
print (xor_binary)
# outputs
98
print (hex(xor_binary))
# outputs
0x62
ascii_conversion = bytearray.fromhex(hex(xor_binary)[2:]).decode()
print (ascii_conversion)
# outputs
b
Here is how to decode Cloudflare obfuscated email addresses.
import requests
from bs4 import BeautifulSoup
url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"
raw_html = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(raw_html.text,'lxml')
company_information = []
def get_company_name(soup):
    company_name = soup.find('li', {'class': 'active'}).text
    company_information.append(company_name)
    return

def decode_cloudflare_protected_email(encoded_string):
    # the first byte of the encoded string is the XOR key; parse it as a base-16 integer
    base_16 = int(encoded_string[:2], 16)
    decoded_email = ''.join([chr(int(encoded_string[i:i+2], 16) ^ base_16) for i in range(2, len(encoded_string), 2)])
    company_information.append(decoded_email)
    return

get_company_name(soup)
encoded_email = soup.select_one("[class='__cf_email__']").get("data-cfemail")
decode_cloudflare_protected_email(encoded_email)
print (company_information)
# outputs
['Tri-Cor Flexible Packaging Inc', 'bryan@tri-cor.com']
If you are interested in exploring XOR encryption further, I would suggest looking at xortool, a GitHub project by Aleksei Hellman.
You have to decode the email.
import requests
from bs4 import BeautifulSoup
def cfDecodeEmail(encodedString):
    r = int(encodedString[:2],16)
    email = ''.join([chr(int(encodedString[i:i+2], 16) ^ r) for i in range(2, len(encodedString), 2)])
    return email

url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"
res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("[class$='-supplier-view-main-container'] > h1").text
email = cfDecodeEmail(soup.select_one("[class='__cf_email__']").get("data-cfemail"))
print(f'{"Name: "}{name}\n{"Email: "}{email}')
Output:
Name: Tri-Cor Flexible Packaging Inc
Email: bryan@tri-cor.com
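If a page holds several protected addresses, or you would rather not depend on one CSS selector, the same decoding can be applied to every encoded value found in the raw HTML. The following is a rough sketch using only the standard library. It assumes the two forms Cloudflare is commonly seen to emit: the data-cfemail attribute from the question and links of the form /cdn-cgi/l/email-protection#<hex>. The names decode_cfemail and find_protected_emails and the sample HTML are made up for illustration.

```python
import re

# Assumed patterns: the data-cfemail attribute and the email-protection link form.
CFEMAIL_PATTERN = re.compile(
    r'data-cfemail="([0-9a-fA-F]+)"|/cdn-cgi/l/email-protection#([0-9a-fA-F]+)'
)


def decode_cfemail(encoded):
    """Decode one value: the first byte is the key, the rest is XORed with it."""
    raw = bytes.fromhex(encoded)  # raises ValueError on odd-length input
    if len(raw) < 2:
        raise ValueError("expected a key byte followed by at least one data byte")
    key = raw[0]
    return bytes(b ^ key for b in raw[1:]).decode("utf-8")


def find_protected_emails(html):
    """Return the distinct decoded addresses in order of first appearance."""
    emails = []
    for attr_hex, link_hex in CFEMAIL_PATTERN.findall(html):
        try:
            email = decode_cfemail(attr_hex or link_hex)
        except ValueError:
            continue  # skip values that are not valid encoded addresses
        if email not in emails:
            emails.append(email)
    return emails


# Made-up HTML fragment that contains both forms of the same address.
sample_html = (
    '<a href="/cdn-cgi/l/email-protection#aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3">'
    '<span class="__cf_email__" data-cfemail="aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3">'
    "[email&#160;protected]</span></a>"
)
print(find_protected_emails(sample_html))  # ['bryan@tri-cor.com']
```

With requests, the call would be find_protected_emails(res.text). No browser simulator is involved.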