
How to crawl Crunchbase with bot protection (Distil Networks)?

Sites like Crunchbase and Glassdoor are protected by Distil Networks. Is there any way to get data from these sites programmatically? I tried Scrapy + Splash, but somehow they are able to detect it. Is there another way to make my requests and JavaScript validation indistinguishable from a real browser's?

James Samovar asked Oct 18 '22 22:10


1 Answer

This may not be a complete answer, and it is a bit late, but try tracing the browser's traffic with Fiddler (my favorite) and look for URLs, headers, and cookies that carry Distil markers. You will see requests for .js files with a query parameter PID=.....
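If you export the capture as a HAR file (Fiddler can do this), the same search can be scripted. This is only a rough sketch: the function name `find_distil_entries` is made up, and treating any cookie whose name starts with `D_` as a Distil cookie is my assumption based on what I saw, not a documented rule.

```python
def find_distil_entries(har: dict, needle: str = "distil") -> list:
    """Return HAR entries whose URL, headers, or cookies look Distil-related."""
    needle = needle.lower()
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        url = request.get("url", "")
        headers = request.get("headers", []) + response.get("headers", [])
        cookies = request.get("cookies", []) + response.get("cookies", [])

        matched = []
        if needle in url.lower():
            matched.append("url")
        for header in headers:
            name, value = header.get("name", ""), header.get("value", "")
            if needle in name.lower() or needle in value.lower():
                matched.append("header:" + name)
        for cookie in cookies:
            name = cookie.get("name", "")
            # Assumption: Distil cookies are the ones named D_XXX.
            if name.startswith("D_"):
                matched.append("cookie:" + name)

        if matched:
            hits.append({"url": url, "matched": matched})
    return hits
```

Load the export with `json.load(open("capture.har"))` (the file name is a placeholder) and pass the result to the function.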

For example, in my Fiddler capture the requests highlighted in yellow are some of what I get when I search for "distil". Look at the first of them, "/trsnsvdstl-ce.js": if you check its source, you will find that long PID=... number and an X-Distil-Ajax header, and in the response you can see a lot of cookies containing D_XXX=. Most important, I think, is the parameter p=. If you replay the same requests and then URL-decode p, you will find it interesting: it holds a lot of parameters about your machine, such as the tools installed in your browser, the screen resolution, and so on. It is a fingerprint.
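To look inside p without decoding it by hand, something like the following should work. It is a minimal sketch using only the standard library; the helper name `decode_param` and the sample payload are invented for illustration, and the real contents of p will look different.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def decode_param(raw: str, name: str = "p") -> Optional[str]:
    """URL-decode one parameter from a full URL, a query string, or a form body."""
    # Accept either a complete URL or just the "a=1&b=2" part.
    query = urlsplit(raw).query if "://" in raw else raw
    values = parse_qs(query, keep_blank_values=True).get(name)
    return values[0] if values else None


# Hypothetical payload copied from a captured request body.
sample = "p=%7B%22resolution%22%3A%221920x1080%22%2C%22plugins%22%3A%5B%22pdf%22%5D%7D"
print(decode_param(sample))
```

`parse_qs` already percent-decodes the values, so no separate unquote step is needed.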

At this point I cannot say more, as I have only just started digging into this. One more thing that helps a lot but costs money is GOOD proxies. I am not talking about free, slow ones; I mean something like Amazon's cloud, where you can set the anonymity level, so that even Distil cannot tell it is a proxy.
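For reference, routing requests through a proxy in plain Python looks roughly like this. It is just a sketch with the standard library: the proxy address and credentials are placeholders, `build_proxied_opener` is a made-up helper name, and the code only routes traffic; it does nothing on its own to control the anonymity level.

```python
import urllib.request


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder address; replace with the proxy you actually pay for.
opener = build_proxied_opener("http://user:password@proxy.example.com:8080")
# opener.open("https://example.com/", timeout=10) would then go through the proxy.
```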

So that's it for now. Sorry for my poor English, and good luck! :)

DevyDev answered Oct 21 '22 06:10