
python data mining

I am not very familiar with data mining, but I need some ideas on clustering. Let me first describe my problem.

I have around 100 data sheets that contain user reviews. I am trying to find, for instance, the words that describe quality. One person might say "amazing quality" and another "great quality". I need to cluster the documents that contain such similar sentences and get the frequency of those sentences. What concept should I apply here?

I guess I have to specify some stop words and synonyms, but I am not too familiar with this concept.

Can someone give me some detailed links or an explanation? And what tool should I use? I am basically a Python programmer, so any Python module would be appreciated.

Thank You

Rkz Avatar asked Mar 06 '26 22:03



1 Answer

There is http://www.nltk.org/ (NLTK) for language processing. With this library you can split text into sentences, calculate term frequencies, find synonyms, and more.

Carrot^2 is a nice open-source project for clustering text snippets; unfortunately, it is written in Java. The idea behind its clustering is the frequencies of terms and phrases (bigrams and trigrams). After preprocessing, each document (snippet, review) is represented as a vector of term/phrase frequencies. To calculate clusters, it uses some linear algebra to find the principal components in that term space. These components are then used to form the clusters and their labels.
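To make the vector representation concrete, here is a minimal Python sketch using only the standard library. It is not Carrot^2's algorithm: it builds the term/bigram frequency vectors described above, but replaces the principal-component step with a simple greedy grouping by cosine similarity. The stop-word list, the 0.3 threshold, the sample reviews, and all function names are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "and", "a", "after", "it"}

def features(text):
    """Return unigram and bigram frequencies for one review."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(words + bigrams)

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_clusters(docs, threshold=0.3):
    """Put each document into the first cluster whose first member is
    similar enough; otherwise start a new cluster. Returns lists of
    document indices."""
    vectors = [features(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

reviews = [
    "amazing quality and fast delivery",
    "the quality is amazing",
    "battery died after two days",
    "poor battery, it died quickly",
]
clusters = greedy_clusters(reviews)
print(clusters)
```

With real data you would probably want a proper clustering method on these vectors; the greedy pass here is only meant to show how reviews become comparable once they are turned into frequency vectors.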

In your case it is worth treating the reviews as documents, clustering them, and getting labels for the clusters. The labels may then serve as a rough evaluation of the reviews.

In your specific case it is worth keeping only the words of interest and eliminating the rest, which dramatically decreases dimensionality; that is critical in such tasks.
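As a rough sketch of that idea in Python, the snippet below ignores everything except a small vocabulary of interest: opinion words mapped to a shared label (so "amazing" and "great" count as the same thing) followed directly by an aspect word such as "quality". The synonym table, the aspect set, the sample reviews, and the function name are all hypothetical; in practice the synonyms might come from a resource such as WordNet.

```python
import re
from collections import Counter

# Hypothetical synonym table: opinion word -> canonical label.
SYNONYMS = {
    "amazing": "good",
    "great": "good",
    "excellent": "good",
    "poor": "bad",
    "terrible": "bad",
}

# Hypothetical set of aspects we care about.
ASPECTS = {"quality"}

def aspect_opinions(text):
    """Return (label, aspect) pairs for every known opinion word that is
    immediately followed by a known aspect word."""
    words = re.findall(r"[a-z']+", text.lower())
    return [
        (SYNONYMS[prev], word)
        for prev, word in zip(words, words[1:])
        if word in ASPECTS and prev in SYNONYMS
    ]

reviews = [
    "Amazing quality, would buy again.",
    "Great quality for the price.",
    "Terrible quality, broke in a week.",
]

# Frequency of each normalized phrase across all reviews.
counts = Counter(pair for review in reviews for pair in aspect_opinions(review))
print(counts)
```

This only catches an opinion word sitting right before the aspect word, so a phrase like "the quality is great" would be missed; it is meant to show the dimensionality reduction, not a complete extractor.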

Another useful project is MontyLingua.

Andrey Sboev Avatar answered Mar 08 '26 12:03



