I'm new to scraping. I've written a scraper that scrapes the Maplin store. I used Python as the language and BeautifulSoup to scrape the store.
I want to ask: if I need to scrape some other eCommerce store (say Amazon or Flipkart), do I need to customize my code, since they have a different HTML schema (the id and class names are different, plus other things as well)? So the scraper I wrote will not work for another eCommerce store.
I want to know how price-comparison sites scrape data from all the online stores. Do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?
do I need to customize my code
Yes, sure. It is not only because the websites have different HTML schemas. It is also about the mechanisms involved in loading/rendering the page: some sites use AJAX to load partial content of a page, others let JavaScript fill out the placeholders on the page, which makes it harder to scrape - there can be lots and lots of differences. Others use anti-web-scraping techniques: checking your headers and behavior, banning you after hitting a site too often, etc.
I've also seen cases where prices were kept as images, or obfuscated with "noise" - different tags nested inside one another and hidden using various techniques, like CSS rules, classes, JS code, "display: none", etc. For an end user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
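As a small illustration of the header-checking point: with the requests library you can at least send a browser-like User-Agent. The URL and header value below are just examples, and this alone won't defeat serious anti-scraping measures.

# Minimal sketch: sending a browser-like User-Agent with requests.
# The URL and header string are placeholders, not a real target.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}
response = requests.get("https://example.com/product/123", headers=headers)
print(response.status_code)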
I want to know how price-comparison sites scrape data from all the online stores?
Usually, they use APIs whenever possible. But if not, web scraping and HTML parsing are always an option.
The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads the parameters or configuration passed to it. The dynamic part - an annotator, or website-specific configuration - usually consists of field-specific XPath expressions or CSS selectors.
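For example, here is a minimal sketch of that split using BeautifulSoup (the library from the question). The store names and selectors are made up; the point is that the extraction logic stays the same and only the per-site configuration changes.

# Minimal sketch of the static/dynamic split with BeautifulSoup.
# SITE_CONFIGS holds made-up, per-site selectors (the "dynamic" part);
# scrape_product() is the generic logic (the "static" part).
import requests
from bs4 import BeautifulSoup

SITE_CONFIGS = {
    "storeA": {
        "price": {"name": "span", "attrs": {"class": "price"}},
        "title": {"name": "h1", "attrs": {"id": "product-title"}},
    },
    "storeB": {
        "price": {"name": "div", "attrs": {"class": "product-price"}},
        "title": {"name": "h2", "attrs": {"class": "name"}},
    },
}

def scrape_product(url, site):
    """Generic extraction logic; only the configuration differs per site."""
    config = SITE_CONFIGS[site]
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    item = {}
    for field, selector in config.items():
        tag = soup.find(selector["name"], attrs=selector["attrs"])
        item[field] = tag.get_text(strip=True) if tag else None
    return item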
See, as an example, the Autoscraping tool provided by Scrapinghub:
Autoscraping is a tool to scrape web sites without any programming knowledge. You just annotate web pages visually (with a point and click tool) to indicate where each field is on the page and Autoscraping will scrape any similar page from the site.
And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of different unique web-scraping tools.
I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (annotator); field annotations were kept in JSON:
{
    "price": "//div[@class='price']/text()",
    "description": "//div[@class='title']/span[2]/text()"
}
The generic spider received a target id as a parameter, read the configuration, and crawled the website.
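For illustration, a minimal sketch of such a configuration-driven Scrapy spider might look like the following. This is not the actual project code: here the start URL and the JSON annotations are assumed to be passed in as spider arguments rather than read from a database table.

# Sketch of a generic, annotation-driven Scrapy spider.
# Run with (example arguments):
#   scrapy runspider generic_spider.py \
#       -a start_url="https://example.com/product/1" \
#       -a annotations='{"price": "//div[@class=\'price\']/text()"}'
import json

import scrapy

class GenericAnnotatedSpider(scrapy.Spider):
    name = "generic_annotated"

    def __init__(self, start_url=None, annotations=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        # annotations is a JSON string mapping field names to XPath expressions
        self.fields = json.loads(annotations)

    def parse(self, response):
        # Apply each annotated XPath to the page and yield one item per page
        item = {}
        for field, xpath in self.fields.items():
            item[field] = response.xpath(xpath).get()
        yield item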
We had a lot of problems staying on the generic side. Once a website involved JavaScript and AJAX, we started to write site-specific logic to get to the desired data.
Many price-comparison scrapers will do the product search on the vendor site when a user indicates they wish to track the price of something. Once the user selects what they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than always trawling the whole site on a frequent basis.
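A rough sketch of that "track, then poll" approach might look like this. All names here (tracked_products, track, poll_prices) are made up for illustration, BeautifulSoup stands in for whatever extraction code the site actually uses, and a real system would use a database and a job scheduler instead of a loop.

# Hypothetical global cache of products users asked to track,
# re-scraped on a schedule instead of crawling whole sites.
import time

import requests
from bs4 import BeautifulSoup

tracked_products = []  # global cache of tracked products

def track(url, price_selector):
    """Called when a user picks a product from the vendor-site search results."""
    tracked_products.append({"url": url, "price_selector": price_selector})

def poll_prices(interval_seconds=3600):
    """Periodically re-scrape only the tracked products, not the whole site."""
    while True:
        for product in tracked_products:
            soup = BeautifulSoup(requests.get(product["url"]).text, "html.parser")
            tag = soup.select_one(product["price_selector"])
            print(product["url"], tag.get_text(strip=True) if tag else None)
        time.sleep(interval_seconds)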