Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scraping of protected email

I need to scrape emails from the website. It's visible in a browser but when I try to scrape it with requests\BeautifulSoup I get this: "[email protected]"

I can do this with Selenium but it will take more time and I would like to know is it possible to scrape these emails with requests\BeautifulSoup? Maybe it's needed to use some libraries for working with js.

The email tag:

<span id="signature_email"><a class="__cf_email__" href="/cdn-cgi/l/email-protection" data-cfemail="30425f5e70584346515c5c531e535f5d">[email&#160;protected]</a><script data-cfhash='f9e31' type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script></span></span> <span class="separator">|</span>
like image 677
GiveItAwayNow Avatar asked Apr 28 '16 10:04

GiveItAwayNow


1 Answers

From the CF tag, in your supplied html, I assume you are scraping a cloudflare site. They offer a feature to obfuscate emails listed (see here) which encrypts the addresses in the HTML and using JavaScript decrypts them. Hence, using selenium you'll see email-addresses but using requests you won't.

Since the decryption method can be easily taken from the JavaScript, you can write your own decryption method in Python.

In JavaScript,

(function () {
    try {
        var s, a, i, j, r, c, l = document.getElementById("__cf_email__");
        a = l.className;
        if (a) {
            s = '';
            r = parseInt(a.substr(0, 2), 16);
            for (j = 2; a.length - j; j += 2) {
                c = parseInt(a.substr(j, 2), 16) ^ r;
                s += String.fromCharCode(c);
            }
            s = document.createTextNode(s);
            l.parentNode.replaceChild(s, l);
        }
    } catch (e) {}
})();

In Python,

def decodeEmail(e):
    de = ""
    k = int(e[:2], 16)

    for i in range(2, len(e)-1, 2):
        de += chr(int(e[i:i+2], 16)^k)

    return de
like image 109
sowa Avatar answered Sep 26 '22 00:09

sowa