Sites like Crunchbase and Glassdoor are all protected by Distil Networks, are there any ways to programmatically ways get data from these sites? I was trying Scrapy+Splash, but somehow they are able to detect this. Are there any other ways to make your requests/javascript validation indistinguishable from a browser?
Well, this might be not the very correct answer, and bit late too, but try just to trace browser with fiddler(my favorite), and check urls, headers,cookies having distil tags, headers, cookies.. you'll see .js requests having query params PID=.....
for example : Yellow collored requests is a part of what I get, when searched for "distil" in fiddler.. Next, first request you see there "/trsnsvdstl-ce.js" If you'd check source, you would fine that long PID=... number and X-Distil-Ajax header, also, you can see in respinse a lot of cookies containint D_XXX= And I think what's most important, you can see parameter p= if you make same requests, and then UrlDecode p, you would find it interesting, it having lots of your machine parameters, like tools you have in your browsers, resolution, etc.. it is a finger print..
Well, at this point, I cannot answer more, just started diging into this. Also, what does help a alot, but costs money is GOOD prox'ys, Iam not talking about free, slow ones, I am talking about something like amazon clouds, where you can set anonimity level, so even distil could not see, if it is proxy.
So, that's it for now, sorry for my shi*ty english and good luck ! :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With