Can't get a certain item from a webpage using requests

I've created a script to scrape the name and email address from a webpage. When I run it, I get the name as expected, but instead of the email address I get a string such as aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3. That string changes every time I run the script.

Website link

I've tried so far:

import requests
from bs4 import BeautifulSoup

url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("[class$='-supplier-view-main-container'] > h1").text
email = soup.select_one("[class='__cf_email__']").get("data-cfemail")
print(f'{"Name: "}{name}\n{"Email: "}{email}')

Current output:

Name: Tri-Cor Flexible Packaging Inc
Email: aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3

Expected output:

Name: Tri-Cor Flexible Packaging Inc
Email: bryan@tri-cor.com

PS: I'm not looking for a solution that relies on a browser simulator such as Selenium.

How can I get that email from that page using requests?

MITHU asked Sep 28 '19 16:09


2 Answers

The short answer is that you have to decode the email string, because it is being obfuscated.

Below is the reason why you have to decode the email string that you obtained from seafoodsource.com.

The website seafoodsource.com uses Cloudflare, a U.S. company that provides customers with website security, DDoS mitigation and other services.

I determined that the site was using Cloudflare by pinging seafoodsource.com, which returned the IP address 104.24.19.99. According to ARIN (American Registry for Internet Numbers) this IP address belongs to the netblock 104.16.0.0 - 104.31.255.255, which is registered to Cloudflare.

The string cf_email in your soup is also an indication that the email address is being protected by Cloudflare (CF). Another indication is the warning message that is displayed when you click the protected link while viewing the page source.

Cloudflare Email Address Obfuscation helps in spam prevention by hiding email addresses that appear on the target website from email harvesters and other bots, but the email is visible to normal site visitors.

Under this protection an email address becomes a hex-encoded series of bytes whose length depends on the length of the email address.

It is worth noting that this encoding is not designed to encrypt an email address securely. It is cryptographically weak and is only meant to confuse simple web scrapers that search for mailto: links within the HTML code. In other words, it obfuscates an email address but does not enforce its confidentiality.
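To make the scheme concrete, here is a minimal sketch of what the encoding side might look like. The function names cf_encode_email and cf_decode_email are hypothetical, the address is a placeholder, and picking a random key per response is an assumption on my part (it would explain why the string in the question changes on every run).

```python
import secrets


def cf_encode_email(email, key=None):
    """Hypothetical encoder: key byte first, then every byte XORed with the key."""
    if key is None:
        # Assumption: any single byte can serve as the key.
        key = secrets.randbelow(256)
    data = email.encode("ascii")
    return bytes([key] + [b ^ key for b in data]).hex()


def cf_decode_email(encoded):
    """Reverse the scheme: read the key byte, XOR it with the remaining bytes."""
    raw = bytes.fromhex(encoded)
    key, payload = raw[0], raw[1:]
    return bytes(b ^ key for b in payload).decode("ascii")


address = "user@example.com"  # placeholder address, not from the page
first = cf_encode_email(address, key=0xAE)
second = cf_encode_email(address, key=0x3C)

# Different keys give different strings that decode to the same address.
print(first != second)
print(cf_decode_email(first) == cf_decode_email(second) == address)

# One extra byte for the key, two hex digits per byte.
print(len(first) == 2 * (len(address) + 1))
```

Because the key travels in the clear at the front of the string, anyone who knows the layout can undo the encoding, which is why it only counts as obfuscation.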

The encoded email address in your question is:

aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3

The first byte of this email address is ae or hexadecimal 0xae. This byte is a key used to encrypt and decrypt the remaining bytes by bitwise XORing the key with each subsequent byte.

For instance:

0xae ^ 0xcc is hexadecimal 62, which translates to b in ASCII

0xae ^ 0xdc is hexadecimal 72, which translates to r in ASCII

0xae ^ 0xd7 is hexadecimal 79, which translates to y in ASCII

0xae ^ 0xcf is hexadecimal 61, which translates to a in ASCII

0xae ^ 0xc0 is hexadecimal 6e, which translates to n in ASCII

This spells bryan, which is the first part of the decoded email address.
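The five XOR steps above can be reproduced with a short loop. This is only a sketch that uses the key byte and the first five payload bytes from the string in the question; the variable names are my own.

```python
# Key byte (ae) followed by the first five payload bytes from the question.
encoded = "aeccdcd7cfc0"

key = int(encoded[:2], 16)
decoded_chars = []

for i in range(2, len(encoded), 2):
    cipher_byte = int(encoded[i:i + 2], 16)
    plain_byte = cipher_byte ^ key
    decoded_chars.append(chr(plain_byte))
    print(f"0x{key:02x} ^ 0x{cipher_byte:02x} = 0x{plain_byte:02x} -> {chr(plain_byte)!r}")

prefix = "".join(decoded_chars)
print(prefix)
# outputs
# bryan
```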

The bitwise XORing is happening in this code:

chr(int(encoded_string[i:i+2], 16) ^ base_16)

Let me explain further:

The first byte of the encoding string is the cipher key, which in this case is ae or 0xae.

If we convert 0xae to decimal it becomes 174.

When we convert the next byte 0xcc to decimal it becomes 204.

Let's combine these two decimals using the bitwise operator ^.

^ is the bitwise exclusive OR (XOR) operator.

It returns the result of the bitwise XOR of two integers.

first_byte = 174 # ae
second_byte = 204 # cc
xor_decimal = first_byte ^ second_byte 
print (xor_decimal) 
# outputs 
98

Let's convert the result to hexadecimal (base-16). We can use the built-in function hex in Python to accomplish this.

first_byte = 174 # ae
second_byte = 204 # cc
xor_decimal = first_byte ^ second_byte 
print (hex(xor_decimal))
# outputs 
0x62

As I previously mentioned, hexadecimal 62 translates to b in ASCII.

Let's look at the next byte iteration in the encoded string.

first_byte = 174 # ae
next_byte = 220 # dc
xor_decimal = first_byte ^ next_byte 
print (hex(xor_decimal))
# outputs 
0x72

As I previously mentioned, hexadecimal 72 translates to r in ASCII.

I feel that it's relevant to show how to convert a hex string to a decimal.

 # without the 0x prefix
 decimal = int('ae', 16)
 print (decimal)
 # outputs
 174 

 # with the 0x prefix
 decimal = int('0xae', 0)
 print (decimal)
 # outputs
 174 
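As an alternative to converting one pair of hex digits at a time with int, the whole encoded string can be turned into bytes in one call. This is a small sketch using bytes.fromhex from the standard library; indexing a bytes object yields plain integers.

```python
encoded = "aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3"

# bytes.fromhex parses the string two hex digits at a time.
raw = bytes.fromhex(encoded)

print(raw[0])
# outputs
# 174

print(list(raw[:3]))
# outputs
# [174, 204, 220]

print(len(raw))
# outputs
# 18
```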

ASCII Text to Hex conversion for the obfuscated email address:

ASCII email address: bryan@tri-cor.com

Hex email address: 62 72 79 61 6e 40 74 72 69 2d 63 6f 72 2e 63 6f 6d

We can use the built-in function bytearray in Python to decode this hex string:

hex_string = '62 72 79 61 6e 40 74 72 69 2d 63 6f 72 2e 63 6f 6d'
ascii_conversion = bytearray.fromhex(hex_string).decode()
print (ascii_conversion)
# outputs
bryan@tri-cor.com

ASCII Text to Decimal conversion for the obfuscated email address:

ASCII email address: bryan@tri-cor.com

Decimal email address: 98 114 121 97 110 64 116 114 105 45 99 111 114 46 99 111 109

If we prepend decimal 174 (ae in the obfuscated string) to the head of the decimal email address, we get:

Decimal email address: 174 98 114 121 97 110 64 116 114 105 45 99 111 114 46 99 111 109

Text email address: ®bryan@tri-cor.com

It looks like ®, the Latin-1 character for byte 174, was the cipher key for the obfuscated string in your question.
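Strictly speaking, 174 lies outside the 7-bit ASCII range (0 to 127), so ® is better described as the Latin-1 character for that byte. A quick check, as a sketch:

```python
key_char = chr(174)
print(key_char)
# outputs
# ®

# The character fits in a single Latin-1 byte ...
print(key_char.encode("latin-1"))
# outputs
# b'\xae'

# ... but it cannot be encoded as ASCII.
try:
    key_char.encode("ascii")
    is_ascii = True
except UnicodeEncodeError:
    is_ascii = False

print(is_ascii)
# outputs
# False
```

This does not matter for decoding, because the key byte is only used for the XOR and is never part of the email address itself.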

I would be remiss if I did not mention binary numbers and XOR operations.

First byte (the key) conversions:

  • hex number: ae
  • decimal number: 174
  • binary number: 10101110
  • Latin-1 character: ®

Second byte conversions:

  • hex number: cc
  • decimal number: 204
  • binary number: 11001100
  • XOR with the key: hexadecimal 62 (decimal 98)
  • decoded ASCII text: b

We can perform the same bitwise XOR operation with the binary numbers above:

# the notation 0b in front of the number is used to express that the value is 
# a binary literal
first_byte_binary = 0b10101110
second_byte_binary = 0b11001100
xor_binary = first_byte_binary ^ second_byte_binary
print (bin(xor_binary))
# outputs
0b1100010

print (xor_binary)
# outputs 
98

print (hex(xor_binary))
# outputs
0x62

ascii_conversion = bytearray.fromhex(hex(xor_binary)[2:]).decode()
print (ascii_conversion)
# outputs
b
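The binary view also shows why a single key byte can both encode and decode: applying XOR with the same key twice returns the original value. A minimal sketch, with variable names of my own choosing:

```python
key = 0b10101110      # 0xae, the key byte from the question
plain = ord("b")      # 0x62

# Encoding: XOR the plain byte with the key.
cipher = plain ^ key
print(f"{cipher:08b}")
# outputs
# 11001100

# Decoding: XOR the result with the same key again.
print(cipher ^ key == plain)
# outputs
# True

# The round trip holds for every possible byte value.
round_trip_ok = all((b ^ key) ^ key == b for b in range(256))
print(round_trip_ok)
# outputs
# True
```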

Here is how to decode Cloudflare obfuscated email addresses.

import requests
from bs4 import BeautifulSoup

url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"

raw_html = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(raw_html.text,'lxml')

company_information = []

def get_company_name(soup):
  company_name = soup.find('li', {'class': 'active'}).text
  company_information.append(company_name)
  return

def decode_cloudflare_protected_email(encoded_string):
    # the first two hex digits of the encoded string are the XOR key
    base_16 = int(encoded_string[:2], 16)
    decoded_email = ''.join([chr(int(encoded_string[i:i+2], 16) ^ base_16) for i in range(2, len(encoded_string), 2)])
    company_information.append(decoded_email)
    return

get_company_name(soup)

encoded_email = soup.select_one("[class='__cf_email__']").get("data-cfemail")
decode_cloudflare_protected_email(encoded_email)

print (company_information)
# outputs
['Tri-Cor Flexible Packaging Inc', 'bryan@tri-cor.com']
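If you would rather have the decoder return its result than append to a shared list, something like the following sketch may be easier to reuse. The name decode_cfemail is hypothetical, and decoding the bytes as UTF-8 with replacement is my assumption; for plain ASCII addresses it gives the same result as the chr-based version above.

```python
def decode_cfemail(encoded_string):
    """Return the decoded address, or None if the input is not a usable hex payload."""
    try:
        raw = bytes.fromhex(encoded_string)
    except (TypeError, ValueError):
        # Not a string, odd number of digits, or non-hex characters.
        return None
    if len(raw) < 2:
        # Need at least the key byte plus one payload byte.
        return None
    key, payload = raw[0], raw[1:]
    # Assumption: the payload is UTF-8 compatible text.
    return bytes(b ^ key for b in payload).decode("utf-8", errors="replace")


print(decode_cfemail("aeccdcd7cfc0"))
# outputs
# bryan

print(decode_cfemail("not hex"))
# outputs
# None
```

Returning None also gives the caller a simple way to handle pages where the data-cfemail attribute is missing or malformed.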

If you are interested in exploring XOR encryption further, I would suggest looking at xortool, a GitHub project by Aleksei Hellman.

Life is complex answered Nov 09 '22 10:11


You have to decode the email.

import requests
from bs4 import BeautifulSoup

def cfDecodeEmail(encodedString):
    r = int(encodedString[:2],16)
    email = ''.join([chr(int(encodedString[i:i+2], 16) ^ r) for i in range(2, len(encodedString), 2)])
    return email

url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("[class$='-supplier-view-main-container'] > h1").text
email = cfDecodeEmail(soup.select_one("[class='__cf_email__']").get("data-cfemail"))
print(f'{"Name: "}{name}\n{"Email: "}{email}')

Output:

Name: Tri-Cor Flexible Packaging Inc
Email: bryan@tri-cor.com
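If a page contains more than one protected address, the same decoding can be applied to every data-cfemail attribute. The sketch below uses a regular expression on a made-up HTML snippet so that it runs without a network request or BeautifulSoup; decode_all_cfemails and sample_html are hypothetical, and on a real page soup.select("[data-cfemail]") would be the more robust way to collect the attributes.

```python
import re

# Matches the hex payload inside data-cfemail="...".
CFEMAIL_ATTR = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_all_cfemails(html):
    """Decode every data-cfemail payload found in the given HTML text."""
    emails = []
    for encoded in CFEMAIL_ATTR.findall(html):
        # Skip payloads that cannot hold a key byte plus at least one character.
        if len(encoded) % 2 or len(encoded) < 4:
            continue
        key = int(encoded[:2], 16)
        emails.append(
            "".join(
                chr(int(encoded[i:i + 2], 16) ^ key)
                for i in range(2, len(encoded), 2)
            )
        )
    return emails


# Made-up snippet: the payloads decode to "bryan" and "hi".
sample_html = (
    '<a class="__cf_email__" data-cfemail="aeccdcd7cfc0">protected</a>'
    '<span class="__cf_email__" data-cfemail="016968">protected</span>'
)

print(decode_all_cfemails(sample_html))
# outputs
# ['bryan', 'hi']
```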
Mrugesh Kadia answered Nov 09 '22 09:11