How to determine programmatically whether a URL belongs to an ecommerce or non-ecommerce website?

Question

My project has a module that takes a URL and determines whether it belongs to an "Ecommerce" or "Non-Ecommerce" website.

I have tried the following approach.

Using Apache Mahout classification: URL ---> take the HTML dump ---> preprocess the HTML dump by

a) removing all HTML tags

b) removing stop words (a.k.a. common words) like CDATA, href, value, and, of, between, etc.

c) training the model and then testing it.
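For illustration, steps a) and b) could look roughly like the sketch below. It is written in Python with only the standard library to keep it short; it is not the code used in the project. The `TextExtractor` and `preprocess` names, the tiny stop-word list, and the decision to skip `<script>` and `<style>` bodies are all assumptions made for this example.

```python
from html.parser import HTMLParser
import re

# Illustrative stop-word list only (assumption); a real list would be much longer.
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between", "the", "a"}


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def preprocess(html):
    """Step a) strip the markup, step b) drop stop words; returns a token list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list would then be written out in whatever input format the classifier expects.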

I used the following parameters for training:

bin/mahout trainclassifier \
  -i training-data \
  -o bayes-model \
  -type bayes -ng 1

Testing:

bin/mahout testclassifier \
  -d test-data \
  -m bayes-model \
  -type bayes -source hdfs -ng 1 -method sequential

I am getting 73% accuracy with the bayes algorithm and 52% with the cbayes algorithm.

I am thinking of improving the preprocessing stage by extracting signals that are typically found on ecommerce websites, such as a "Checkout" button, a PayPal link, prices or a dollar symbol, and text like "Cash on delivery" or "30 day guarantee".

Any suggestions on how to extract this info, or any other ways to predict whether a site is Ecommerce or Non-Ecommerce?

Thomas Jungblut · Accepted Answer

I am quite astonished that you get such good accuracy with just plain HTML extraction and a Bayes classifier.

But you seem to be on the right track with the features like a checkout button and prices.
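As a rough sketch of how such features could be pulled out of the raw HTML, the snippet below turns the cues named in the question into a small feature dictionary using regular expressions. It is in Python for brevity and uses only the standard library. The patterns, the feature names, and the `extract_features` function are assumptions for illustration; real pages would need more robust patterns (other currencies, localized text, buttons rendered as images, and so on).

```python
import re

# Hypothetical cue patterns, one per signal mentioned in the question.
CUE_PATTERNS = {
    "has_checkout": re.compile(r"check\s?out", re.I),
    "has_paypal": re.compile(r"paypal", re.I),
    "has_cash_on_delivery": re.compile(r"cash\s+on\s+delivery", re.I),
    "has_guarantee": re.compile(r"\d+[\s-]*day\s+guarantee", re.I),
}

# A currency symbol followed by an amount, e.g. "$19.99" or "$5".
PRICE_PATTERN = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")


def extract_features(html):
    """Returns binary cue flags plus a count of price-like strings."""
    features = {
        name: int(bool(pattern.search(html)))
        for name, pattern in CUE_PATTERNS.items()
    }
    features["price_count"] = len(PRICE_PATTERN.findall(html))
    return features
```

These values can be appended to the bag-of-words features or fed to a classifier on their own. Searching the raw HTML rather than the stripped text is a deliberate choice here, because cues such as a PayPal link often live in attributes like `href`.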

Here is a paper I found yesterday while reading about Yandex:

"To find out or to buy? Product review vs. Web shop classifier"

It is about how to distinguish these two kinds of sites and describes some of the techniques they used. They also used an SVM instead of naive Bayes.
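To give a feel for what swapping in an SVM means, here is a minimal linear SVM trained with per-sample sub-gradient steps on the L2-regularized hinge loss. This is not the method from the paper, and in practice you would use an existing SVM library. The function names, the hyperparameters, and the four toy feature vectors are made up for this example.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss.

    Labels must be +1 (shop) or -1 (not a shop).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step ...
            w = [wj * (1.0 - eta * lam) for wj in w]
            # ... and the hinge term only acts when the sample is inside the margin.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w, b, x):
    """Returns +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Made-up toy vectors: [has_checkout, has_paypal, price_count / 10]
X = [[1, 1, 1.0], [1, 0, 0.5], [0, 0, 0.0], [0, 1, 0.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

The last toy row stands for a non-shop page that still carries a PayPal link (a donation button, say), which is the kind of case where a learned weight per feature helps more than a simple keyword rule.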

Tags:

java

machine-learning

classification

mahout
