
What does the Brown clustering algorithm output mean?

I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. Both output a binary string and an integer for each unique token. For example:

0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2

What do the binary string and the integer mean?

From the first link, the binary is known as a bit-string, see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/

But how do I tell from the output that dog, mouse, and cat are in one cluster, and that the and chased are not in that cluster?

alvas asked Jan 08 '14 14:01



2 Answers

If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. With these bit strings, you just take the first L characters of each one.

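For example, cutting at the second character splits the words under the 1 branch into two clusters (the, whose bit string 0 is shorter than the cut, stays in a cluster of its own):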

10           chased     

11           dog        
11           mouse      
11           cat        

At the third character, the 11 cluster splits further:

110           dog        

111           mouse      
111           cat        

How to choose the cut level is a separate question, though.
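The prefix cut described above can be sketched in a few lines of Python. This is only an illustration, not part of either implementation: the function name clusters_at_depth is made up here, and the input is the sample output from the question, assumed to be whitespace-separated as <bit string> <word> <count>.

```python
from collections import defaultdict

# Sample output copied from the question: <bit string> <word> <count>
SAMPLE = """\
0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2
"""

def clusters_at_depth(lines, depth):
    """Group words by the first `depth` characters of their bit string.

    Hypothetical helper: bit strings shorter than `depth` are kept whole,
    so such words end up in a cluster of their own.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        bits, word, _count = parts
        clusters[bits[:depth]].append(word)
    return dict(clusters)

print(clusters_at_depth(SAMPLE.splitlines(), 2))
# {'0': ['the'], '10': ['chased'], '11': ['dog', 'mouse', 'cat']}
print(clusters_at_depth(SAMPLE.splitlines(), 3))
# {'0': ['the'], '10': ['chased'], '110': ['dog'], '111': ['mouse', 'cat']}
```

A shorter prefix gives fewer, coarser clusters; a longer prefix gives more, finer ones, which is why the choice of depth matters.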

Łukasz Kidziński answered Oct 04 '22 15:10


In Percy Liang's implementation (https://github.com/percyliang/brown-cluster), the -C parameter lets you specify the number of word clusters. The output lists every word in the corpus, together with a bit string identifying its cluster and the word's frequency, in the following format: <bit string> <word> <word frequency>. The number of distinct bit strings in the output equals the number of clusters requested, and words with the same bit string belong to the same cluster.
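As a rough sketch of reading such output, the Python below groups words by their full bit string. The helper name read_clusters and the sample lines are hypothetical, made up to match the <bit string> <word> <word frequency> format described above; they are not real output from the tool.

```python
from collections import defaultdict

def read_clusters(lines):
    """Map each full bit string to its (word, frequency) members.

    Hypothetical helper; assumes whitespace-separated
    <bit string> <word> <word frequency> lines.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        bits, word, freq = parts
        clusters[bits].append((word, int(freq)))
    return dict(clusters)

# Made-up lines in the documented format (illustration only).
sample = [
    "0\tthe\t6",
    "10\tchased\t3",
    "11\tdog\t2",
    "11\tmouse\t2",
    "11\tcat\t2",
]

clusters = read_clusters(sample)
for bits, members in sorted(clusters.items()):
    words = ", ".join(word for word, _ in members)
    total = sum(freq for _, freq in members)
    print(f"{bits}: {words} (total frequency {total})")
```

To use it on a real run, pass the lines of the output file instead of the sample list, for example by opening the file and handing the file object to read_clusters.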

Paul Baltescu answered Oct 04 '22 14:10