
Websites that are particularly challenging to crawl and scrape? [closed]

I'm interested in public-facing sites (nothing behind a login or authentication) that have things like:

  • High use of internal 301 and 302 redirects
  • Anti-scraping measures (but not banning crawlers via robots.txt)
  • Non-semantic or invalid markup
  • Content loaded via AJAX in the form of onclicks or infinite scrolling
  • Lots of parameters used in URLs
  • Canonical problems
  • Convoluted internal link structure
  • and anything else that generally makes crawling a website a headache!

I have built a crawler / spider that performs a range of analyses on a website, and I'm on the lookout for sites that will make it struggle.

asked Sep 12 '13 by David Pratt

1 Answer

Here are some:

  • Content loaded via AJAX in the form of onclicks or infinite scrolling
    • Pinterest
    • comments in such a page
      This is a Chinese commodity page whose comments are loaded via AJAX, triggered by scrolling the scrollbar in a browser (or depending on the browser's viewport height). I had to use PhantomJS and xvfb to trigger such actions.
  • Anti-scraping measures (but not banning crawlers via robots.txt)
    • amazon next page
      I have crawled Amazon's site in China, and when I tried to crawl the next page on pages like that, it modified the requests so that I couldn't reach the real next page.
    • stackoverflow
      It limits visit frequency. A few days ago I wanted to collect all of the tags on Stack Overflow and set my spider's visit frequency to 10, but Stack Overflow warned me off. After that I had to use proxies to crawl it.
  • and anything else that generally makes crawling a website a headache
    • yihaodian
      This is a Chinese e-commerce site; when you visit it in a browser, it detects your location and offers commodities accordingly.
    • etc.
      There are many sites like the above that serve different content depending on your location. When you crawl such sites, what you get is not the same as what you see in a browser; you often need to set cookies when emitting a request from a spider.
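Scroll-triggered comment feeds like the one above can sometimes be scraped without driving a full browser through PhantomJS and xvfb: the page usually fetches a paginated JSON endpoint in the background, so a spider can request those URLs directly. A minimal sketch, assuming a hypothetical endpoint that takes `offset` and `limit` parameters (the parameter names are an illustration, not from the original post):

```python
def comment_page_urls(base_url, total_comments, page_size=20):
    """Build the endpoint URL for each 'scroll' of an AJAX comment feed.

    Each scroll of the real page would trigger one of these requests;
    fetching them directly skips the browser entirely.
    """
    pages = (total_comments + page_size - 1) // page_size  # ceiling division
    return [f"{base_url}?offset={i * page_size}&limit={page_size}"
            for i in range(pages)]
```

Finding the real endpoint is usually a matter of watching the browser's network tab while scrolling; a headless browser is only needed when the endpoint is obfuscated or signed.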
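For rate limits like Stack Overflow's, enforcing a per-host delay in the spider is the simplest defence before resorting to proxies. A minimal sketch (the delay value is an assumption; real limits vary by site and are rarely documented):

```python
import time

class Throttle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = {}  # host -> monotonic timestamp of last fetch

    def wait(self, host):
        """Block until at least `delay_seconds` has passed since the
        previous request to `host`, then record this request."""
        last = self.last_request.get(host)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last_request[host] = time.monotonic()
```

Calling `throttle.wait("stackoverflow.com")` before each fetch keeps the spider under the limit without slowing down requests to other hosts.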

Last year I encountered a site that required specific HTTP request headers and some cookies when emitting requests, but I don't remember which site it was....
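For sites like that, the usual fix is to replay the headers and cookies a real browser sends. A minimal sketch using only the standard library; the header values and cookie names below are placeholders for illustration, not taken from any particular site:

```python
import urllib.request

def build_request(url, referer=None, cookies=None):
    """Prepare a request that mimics a real browser.

    Some sites reject requests that are missing a User-Agent, Referer,
    or expected session cookies.
    """
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; example-spider)")
    if referer:
        req.add_header("Referer", referer)
    if cookies:
        # Serialise the cookie dict into a single Cookie header.
        cookie_str = "; ".join(f"{k}={v}" for k, v in cookies.items())
        req.add_header("Cookie", cookie_str)
    return req
```

The request can then be sent with `urllib.request.urlopen(req)`; which headers and cookies actually matter has to be discovered per site, typically by diffing a working browser request against the spider's.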

answered Sep 17 '22 by flyer