Monitor brands with common words

Say you need to monitor the brand "ONE" online. Which algorithms can be used to separate pages about the brand ONE from pages that merely contain the common word "one"?

I'm thinking maybe Bayes could work, but are there other ways to do this?

Christian Davén asked Feb 15 '10 12:02



1 Answer

If the brand name is not a unique word, I would suggest the following approach.

Let's imagine that our keyword is Java. Then there are at least two categories of pages: those about programming and those about tourism in Indonesia. We are interested in the first one.

Take a short reference text about Java the programming language (from a book or from Wikipedia, for example) and choose a similarity threshold (for example, 0.7). Then compare the reference text with each candidate page. One of the fastest ways is the classic Vector Space Model algorithm: you can implement it yourself or find an existing implementation online. Finally, compare each similarity score with the threshold and filter out the pages that score below it.
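
A minimal sketch of that comparison, assuming raw term counts as vector weights and a tiny hand-picked stop-word list. The names `STOPWORDS`, `cosine_similarity` and `is_about_topic` are made up for this illustration, and a real setup would more likely use TF-IDF weights and a threshold tuned on sample pages:

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "in", "and", "with", "for", "into", "of", "to"}

def tokenize(text):
    """Lowercase the text, split it into word tokens and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

def is_about_topic(reference, page, threshold=0.7):
    """Keep a page only if it is similar enough to the reference text."""
    return cosine_similarity(reference, page) >= threshold

reference = "Java is a programming language with classes objects and a compiler"
pages = [
    "The Java compiler turns classes into bytecode for the programming language runtime",
    "Java is an island in Indonesia with volcanoes beaches and temples",
]
for page in pages:
    print(round(cosine_similarity(reference, page), 3), is_about_topic(reference, page))
```

Both pages contain the word "Java", but only the first shares enough of the surrounding vocabulary with the reference text to pass the threshold.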


About using a Bayes algorithm: it's not a bad approach, in my opinion. But you should 'teach' your algorithm very carefully, because a few bad training inputs can spoil the whole thing.

Let me explain. The input for your Bayes algorithm is a text containing your brand word. The output is a probability in [0 .. 1] that the text is about your brand and not about something else. In practice this algorithm very often gives results near 0 or near 1, and it rarely returns values between 0.2 and 0.8. This means the algorithm is very sensitive to small variations: one or two words in a text of 100 words can seriously affect the result.
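
To make the input and output concrete, here is a small sketch of a multinomial naive Bayes classifier with add-one smoothing. It is an illustration only: the class name `BrandClassifier`, the labels `"brand"` and `"other"`, and the four training sentences are all made up, and a real classifier would need far more (and carefully checked) training data:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BrandClassifier:
    """Two-class multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"brand": Counter(), "other": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labelled text ("brand" or "other") to the training data."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def prob_brand(self, text):
        """Return the estimated probability, in [0, 1], that the text is about the brand."""
        if self.doc_counts["brand"] == 0 or self.doc_counts["other"] == 0:
            raise ValueError("train at least one text for each label first")
        vocab = set(self.word_counts["brand"]) | set(self.word_counts["other"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for token in tokens:
                # Smoothed likelihood of this token under this class.
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert log scores to a normalised probability without underflow.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["brand"] / sum(weights.values())

clf = BrandClassifier()
clf.train("ONE launches a new smartphone", "brand")
clf.train("review of the ONE smartphone camera", "brand")
clf.train("no one knows the answer", "other")
clf.train("one of the best days of the year", "other")

print(clf.prob_brand("the new ONE smartphone review"))
print(clf.prob_brand("no one knows the best answer"))
```

Because the per-word likelihoods are multiplied together, every extra word that favours one class pushes the result further towards 0 or 1, which is why longer texts rarely land in the middle of the range and why a few mislabelled training texts can flip the outcome.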

Roman answered Oct 06 '22 01:10
