
How to find whether a URL is of an ecommerce or non-ecommerce website, programmatically?

In a project, there is a module that takes a URL and determines whether it belongs to an "Ecommerce" or "NON-Ecommerce" website.

I have tried the following approaches:

  1. Classification with Apache Mahout: URL ---> take HTML dump ---> pre-process the HTML dump by

    a) removing all HTML tags

    b) removing stop words (a.k.a. common words) such as CDATA, href, value, and, of, between, etc.

    c) training a model and then testing it.
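The pre-processing in steps a) and b) can be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not the Mahout pipeline itself; the names `TextExtractor`, `preprocess`, and `STOP_WORDS`, as well as the tiny stop-word list, are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "and", "of", "between", "to", "in",
              "cdata", "href", "value"}


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html):
    """Step a) strip HTML tags, step b) drop stop words; returns a token list."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list is what would then be written out as training or test data for the classifier in step c).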

I used the following parameters for training:

bin/mahout trainclassifier \
  -i training-data \
  -o bayes-model \
  -type bayes -ng 1

Testing:

bin/mahout testclassifier \
  -d test-data \
  -m bayes-model \
  -type bayes -source hdfs -ng 1 -method sequential

I am getting 73% accuracy, and 52% with the cbayes algorithm.

I am thinking of improving the pre-processing stage by extracting information that is typically found on ecommerce websites, such as a "Checkout" button, a PayPal link, prices / dollar symbols, and text like "Cash on delivery" or "30 day guarantee".

Any suggestions on how to extract this info, or any other ways to predict whether a site is Ecommerce or Non-Ecommerce?

asked Jan 22 '12 14:01 by geek


1 Answer

I am quite surprised that you get such good accuracy with just plain HTML extraction and a Bayes classifier.

But you seem to be on the right track with the features like a checkout button and prices.
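To make that concrete, here is a rough sketch of how such signals could be pulled out of the raw HTML before the tags are thrown away. It is plain Python with the standard library only; the function name `extract_shop_signals`, the phrase patterns, and the idea of matching `paypal.` in link targets are all assumptions for illustration, and real shops will need a broader and better tuned set of patterns.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns (assumptions); extend them for real data.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
PHRASES = {
    "checkout": r"\bcheck\s?out\b",
    "cart": r"\b(?:add to cart|shopping cart|add to basket)\b",
    "cash_on_delivery": r"\bcash on delivery\b",
    "guarantee": r"\b\d+[- ]day (?:money[- ]back )?guarantee\b",
}


class TextAndLinks(HTMLParser):
    """Collects visible text, button labels (value=...), and link targets."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.hrefs = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if not value:
                continue
            if name == "href":
                self.hrefs.append(value.lower())
            elif name == "value":
                # e.g. <input type="submit" value="Checkout">
                self.text_parts.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def extract_shop_signals(html):
    """Return a dict of counts for a few shop-like signals in the page."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    signals = {name: len(re.findall(pattern, text))
               for name, pattern in PHRASES.items()}
    signals["prices"] = len(PRICE_RE.findall(text))
    signals["paypal_links"] = sum("paypal." in href for href in parser.hrefs)
    return signals
```

The counts can be appended to the bag-of-words features, or used on their own as a small numeric feature vector. Note that this has to run on the raw HTML, since the `href` and `value` attributes are lost once the tags are stripped.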

Here is a paper I found yesterday while reading about Yandex:

"To find out or to buy? Product review vs. Web shop classifier"

It is about how to distinguish these two kinds of sites and some of the techniques they used. They also used an SVM instead of naive Bayes.
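If you want to get a feel for the SVM side without pulling in a library, below is a toy linear SVM trained by subgradient descent on the regularised hinge loss. It is not the setup from the paper; the function names, the hyperparameters, and the three binary features (has checkout, has price, has cart) are made up for illustration. For real work, an existing SVM implementation is the better choice.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation shrinks the weights a little on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Inside the margin or misclassified: step towards the point.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (shop) or -1 (not a shop) for feature vector x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy data: [has_checkout, has_price, has_cart]
X_train = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
y_train = [1, 1, -1, -1]  # +1 = shop, -1 = e.g. a review page that mentions a price

w, b = train_linear_svm(X_train, y_train)
```

The fourth training row is the interesting one: a page that shows a price but has no checkout, which is exactly the review-versus-shop case the paper is about.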

answered Sep 23 '22 06:09 by Thomas Jungblut