I'm new to scraping. I've written a scraper that scrapes the Maplin store. I used Python as the language and BeautifulSoup to scrape the store.
I want to ask: if I need to scrape some other eCommerce store (say Amazon or Flipkart), do I need to customize my code, since they have a different HTML schema (the id and class names are different, plus other things as well)? So the scraper I wrote will not work for another eCommerce store.
I want to know how price-comparison sites scrape data from all the online stores. Do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?
do I need to customize my code
Yes, sure. It is not only because the websites have different HTML schemas. It is also about the mechanisms involved in loading/rendering the page: some sites use AJAX to load partial content of a page, others let JavaScript fill out the placeholders on the page, which makes it harder to scrape - there can be lots and lots of differences. Others use anti-web-scraping techniques: checking your headers and behavior, banning you after hitting a site too often, etc.
I've also seen cases where prices were kept as images, or obfuscated with "noise" - different tags nested inside one another and hidden using various techniques, like CSS rules, classes, JS code, "display: none", etc. For an end user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
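As a small illustration of the header-checking point: with the requests library you can at least send a browser-like User-Agent. The URL and header value below are just examples, and this alone won't defeat serious anti-scraping measures.

# Minimal sketch: sending a browser-like User-Agent with requests.
# The URL and header string are placeholders, not a real target.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}
response = requests.get("https://example.com/product/123", headers=headers)
print(response.status_code)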
I want to know how price-comparison sites scrape data from all the online stores?
Usually, they use APIs whenever possible. But if not, web scraping and HTML parsing are always an option.
The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads the parameters or configuration passed to it. The dynamic part - an annotator, or website-specific configuration - usually consists of field-specific XPath expressions or CSS selectors.
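For example, here is a minimal sketch of that split using BeautifulSoup (the library from the question). The store names and selectors are made up; the point is that the extraction logic stays the same and only the per-site configuration changes.

# Minimal sketch of the static/dynamic split with BeautifulSoup.
# SITE_CONFIGS holds made-up, per-site selectors (the "dynamic" part);
# scrape_product() is the generic logic (the "static" part).
import requests
from bs4 import BeautifulSoup

SITE_CONFIGS = {
    "storeA": {
        "price": {"name": "span", "attrs": {"class": "price"}},
        "title": {"name": "h1", "attrs": {"id": "product-title"}},
    },
    "storeB": {
        "price": {"name": "div", "attrs": {"class": "product-price"}},
        "title": {"name": "h2", "attrs": {"class": "name"}},
    },
}

def scrape_product(url, site):
    """Generic extraction logic; only the configuration differs per site."""
    config = SITE_CONFIGS[site]
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    item = {}
    for field, selector in config.items():
        tag = soup.find(selector["name"], attrs=selector["attrs"])
        item[field] = tag.get_text(strip=True) if tag else None
    return item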
See, as an example, the Autoscraping tool provided by Scrapinghub:
Autoscraping is a tool to scrape web sites without any programming knowledge. You just annotate web pages visually (with a point and click tool) to indicate where each field is on the page and Autoscraping will scrape any similar page from the site.
And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of different unique web-scraping tools.
I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (annotator); field annotations were kept in JSON:
{
    "price": "//div[@class='price']/text()",
    "description": "//div[@class='title']/span[2]/text()"
}
The generic spider received a target id as a parameter, read the configuration, and crawled the website.
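For illustration, a minimal sketch of such a configuration-driven Scrapy spider might look like the following. This is not the actual project code: here the start URL and the JSON annotations are assumed to be passed in as spider arguments rather than read from a database table.

# Sketch of a generic, annotation-driven Scrapy spider.
# Run with (example arguments):
#   scrapy runspider generic_spider.py \
#       -a start_url="https://example.com/product/1" \
#       -a annotations='{"price": "//div[@class=\'price\']/text()"}'
import json

import scrapy

class GenericAnnotatedSpider(scrapy.Spider):
    name = "generic_annotated"

    def __init__(self, start_url=None, annotations=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        # annotations is a JSON string mapping field names to XPath expressions
        self.fields = json.loads(annotations)

    def parse(self, response):
        # Apply each annotated XPath to the page and yield one item per page
        item = {}
        for field, xpath in self.fields.items():
            item[field] = response.xpath(xpath).get()
        yield item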
We had a lot of problems staying on the generic side. Once a website involved JavaScript and AJAX, we started to write site-specific logic to get to the desired data.
Many price-comparison scrapers will do the product search on the vendor site when a user indicates they wish to track the price of something. Once the user selects what they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than always trawling the whole site on a frequent basis.
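A rough sketch of that "track, then poll" approach might look like this. All names here (tracked_products, track, poll_prices) are made up for illustration, BeautifulSoup stands in for whatever extraction code the site actually uses, and a real system would use a database and a job scheduler instead of a loop.

# Hypothetical global cache of products users asked to track,
# re-scraped on a schedule instead of crawling whole sites.
import time

import requests
from bs4 import BeautifulSoup

tracked_products = []  # global cache of tracked products

def track(url, price_selector):
    """Called when a user picks a product from the vendor-site search results."""
    tracked_products.append({"url": url, "price_selector": price_selector})

def poll_prices(interval_seconds=3600):
    """Periodically re-scrape only the tracked products, not the whole site."""
    while True:
        for product in tracked_products:
            soup = BeautifulSoup(requests.get(product["url"]).text, "html.parser")
            tag = soup.select_one(product["price_selector"])
            print(product["url"], tag.get_text(strip=True) if tag else None)
        time.sleep(interval_seconds)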