What does the Brown clustering algorithm output mean?

Tags:

python

algorithm

machine-learning

nlp

cluster-analysis

I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. Both give a binary string and an integer for each unique token. For example:

0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2

What do the binary string and the integer mean?

According to the first link, the binary string is known as a bit-string; see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/

But how do I tell from the output that dog, mouse and cat are in one cluster, and that the and chased are not in that cluster?

asked Jan 08 '14 14:01

alvas

2 Answers

If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. With these bit strings, that means keeping only the first L characters of each one; words that share the same prefix fall into the same cluster.

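For example, cutting at the second character gives you the two clusters below (the, whose bit string 0 is shorter than the cut, stays in a cluster of its own and is not listed):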

10           chased     

11           dog        
11           mouse      
11           cat

Cutting at the third character splits dog off from mouse and cat (only the words under prefix 11 are listed):

110           dog        

111           mouse      
111           cat

The cutting strategy is a different subject though.
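The truncation described above can be sketched in a few lines of Python. This is only an illustration, not part of either implementation: the function name `clusters_at_depth` is made up here, and the sample lines are the ones from the question, assumed to be whitespace-separated as `<bit string> <word> <frequency>`.

```python
from collections import defaultdict

# Sample output copied from the question: <bit string> <word> <frequency>
SAMPLE = """\
0\tthe\t6
10\tchased\t3
110\tdog\t2
1110\tmouse\t2
1111\tcat\t2
"""


def clusters_at_depth(lines, depth):
    """Group words by the first `depth` characters of their bit string.

    Hypothetical helper: bit strings shorter than `depth` are kept whole,
    so those words end up in clusters of their own.
    """
    clusters = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        bits, word, _freq = line.split()
        clusters[bits[:depth]].append(word)
    return dict(clusters)


by_prefix_2 = clusters_at_depth(SAMPLE.splitlines(), 2)
by_prefix_3 = clusters_at_depth(SAMPLE.splitlines(), 3)
print(by_prefix_2)
print(by_prefix_3)
```

With a cut at 2, dog, mouse and cat share the prefix 11; with a cut at 3, dog (110) separates from mouse and cat (111). The words the and chased keep their shorter strings in both cases.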

answered Oct 04 '22 15:10

Łukasz Kidziński

In Percy Liang's implementation (https://github.com/percyliang/brown-cluster), the -C parameter allows you to specify the number of word clusters. The output contains all the words in the corpus, together with a bit-string annotating the cluster and the word frequency in the following format: <bit string> <word> <word frequency>. The number of distinct bit strings in the output equals the number of desired clusters and the words with the same bit string belong to the same cluster.
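As a rough sketch of reading that format, the snippet below groups words by their full bit string and builds a word-to-cluster lookup. The helper name `read_clusters` and the in-memory sample are assumptions for illustration; the real tool writes its output to a file whose name and location are not covered here.

```python
from collections import defaultdict


def read_clusters(lines):
    """Map each full bit string to its list of (word, frequency) pairs.

    Hypothetical helper: assumes every line is
    <bit string> <word> <word frequency>, separated by whitespace.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        bits, word, freq = parts
        clusters[bits].append((word, int(freq)))
    return dict(clusters)


# The five lines from the question, used here instead of a real output file.
sample = ["0\tthe\t6", "10\tchased\t3", "110\tdog\t2", "1110\tmouse\t2", "1111\tcat\t2"]

clusters = read_clusters(sample)

# Reverse lookup: which bit string (cluster) does each word belong to?
word_to_cluster = {
    word: bits for bits, members in clusters.items() for word, _freq in members
}

print(len(clusters))           # number of distinct bit strings
print(word_to_cluster["dog"])  # bit string of the cluster containing "dog"
```

In the question's sample every word has a distinct bit string, so each word is its own cluster; two words are in the same cluster only when their bit strings are identical.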

answered Oct 04 '22 14:10

Paul Baltescu
