Do scrapers need to be written for every site they target?

I'm new to scraping. I've written a scraper that scrapes the Maplin store. I used Python as the language and BeautifulSoup to parse the store's pages.

What I want to ask is: if I need to scrape some other e-commerce store (say Amazon or Flipkart), do I have to customize my code, since they have different HTML schemas (the id and class names differ, among other things)? As it stands, the scraper I wrote will not work for another e-commerce store.

I want to know how price-comparison sites scrape data from all the online stores. Do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?

asked Dec 27 '14 by PythonEnthusiast

2 Answers

do I need to customize my code

Yes, certainly. It is not only because websites have different HTML schemas; it is also about the mechanisms involved in loading and rendering the page: some sites use AJAX to load partial content, others let JavaScript fill in the placeholders on the page, which makes scraping harder - there can be lots and lots of differences. Others use anti-web-scraping techniques: they check your headers and your behavior, ban you after you hit the site too often, etc.
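
As a rough illustration of the headers/behavior point (not a recipe for any particular store - the header values and the delay are just examples), this is roughly what a "polite" fetch looks like with the same requests/BeautifulSoup stack mentioned in the question:

import time

import requests
from bs4 import BeautifulSoup

# Browser-like headers: many stores reject the default python-requests
# User-Agent outright, so we send something that looks like a regular browser.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch(url):
    """Fetch a page with custom headers and a delay between hits."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    time.sleep(2)  # throttle so the site is not hammered
    return BeautifulSoup(response.text, "html.parser")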

I've also seen cases where prices were kept as images, or obfuscated with "noise" - different tags nested inside one another and hidden with various techniques (CSS rules, classes, JS code, "display: none", etc.) - so that to an end user in a browser the data looked normal, but to a web-scraping "robot" it was a mess.

want to know how price-comparison sites scrape data from all the online stores?

Usually, they use APIs whenever possible. But if not, web scraping and HTML parsing are always an option.


The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads the parameters or configuration passed to it. The dynamic part - an annotator or site-specific configuration - is usually a set of field-specific XPath expressions or CSS selectors.
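
As a rough sketch of that split (the store names and CSS selectors below are made up for illustration), the parsing logic stays generic and only the per-site selector configuration changes:

from bs4 import BeautifulSoup

# Dynamic part: per-site configuration, e.g. produced by an annotator.
SITE_CONFIGS = {
    "maplin": {"price": "div.product-price", "title": "h1.product-title"},
    "otherstore": {"price": "span#our-price", "title": "div.item-name"},
}

# Static part: generic parsing logic that only reads the configuration.
def parse_product(html, site):
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in SITE_CONFIGS[site].items():
        element = soup.select_one(selector)
        result[field] = element.get_text(strip=True) if element else None
    return result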

See, as an example, the Autoscraping tool provided by Scrapinghub:

Autoscraping is a tool to scrape web sites without any programming knowledge. You just annotate web pages visually (with a point and click tool) to indicate where each field is on the page and Autoscraping will scrape any similar page from the site.

And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of unique web-scraping tools.


I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (an annotator); the field annotations were kept as JSON:

{
    "price": "//div[@class='price']/text()",  
    "description": "//div[@class='title']/span[2]/text()"
}

The generic spider received a target id as a parameter, read the configuration, and crawled the web-site.
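
A simplified sketch of what such a spider can look like (the in-line annotation dict stands in for the database table, and the target id, URL and XPaths are illustrative):

import scrapy

# target id -> annotation (in the real project this came from a database
# table filled by the browser-extension annotator)
ANNOTATIONS = {
    "42": {
        "start_url": "https://example.com/products",
        "fields": {
            "price": "//div[@class='price']/text()",
            "description": "//div[@class='title']/span[2]/text()",
        },
    },
}

class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, target=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # started as: scrapy crawl generic -a target=42
        self.config = ANNOTATIONS[target]
        self.start_urls = [self.config["start_url"]]

    def parse(self, response):
        # Apply the annotated XPaths; the spider itself knows nothing
        # about the specific website.
        item = {}
        for field, xpath in self.config["fields"].items():
            item[field] = response.xpath(xpath).get()
        yield item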

We had a lot of problems staying on the generic side. As soon as a website involved JavaScript and AJAX, we started writing site-specific logic to get to the desired data.

See also:

  • Creating a generic scrapy spider
  • Using one Scrapy spider for several websites
  • What is the best practice for writing maintainable web scrapers?
answered by alecxe


Many price-comparison scrapers run the product search on the vendor's site when a user indicates they want to track the price of something. Once the user selects what they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than having to trawl the whole site on a frequent basis.
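
A bare-bones sketch of that idea (the cache and function names here are hypothetical): products a user chose to track go into a small cache, and only those URLs are re-scraped on a schedule instead of crawling the whole store.

import time

tracked_products = {}  # url -> last known price

def track(url):
    """Called when a user picks a product to watch."""
    tracked_products.setdefault(url, None)

def refresh_prices(scrape_price, interval_seconds=3600):
    """Periodically re-scrape only the tracked URLs."""
    while True:
        for url in list(tracked_products):
            tracked_products[url] = scrape_price(url)  # site-specific scraper
        time.sleep(interval_seconds)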

answered by Mark Ruse