The context is : I already have clusters of words (phrases actually) resulting from kmeans applied to internet search queries and using common urls in the results of the search engine as a distance (co-occurrence of urls rather than words if I simplify a lot). I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together. For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer'] My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example. I am new to NPL but I have to precise that I don't want to extract words using POS tagging (or at least this is not the expected final outcome but maybe a necessary preliminary step). I read about Wordnet for sense desambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input) nor obtain the definition of one selected word thanks to the context provided by the whole bunch of words (which word to select in this case ?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the xml structure of the wordnet) and then summarize the context in one or few words. Any ideas ? I can use R or python, I read a little about nltk but I don't find a way to use it in my context.

When we talk about semantics in this area we mean Statistical Semantics. The statistical or distributional semantics is very different from other definitions of semantics which has logic and reasoning behind it. Statistical semantics is based on Distributional Hypothesis, which considers context as meaning aspect of words and phrases. Meaning in very abstract and general sense in different litterers is called topics. There are several unsupervised methods for modelling topics, such as LDA or even word2vec, which basically provide word similarity metric or suggest a list of similar words for a document as another context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster. However, for several reasons you might accept low accuracy assignment of a word as the general topic (or as in your words "global semantic") to a list of phrases. If this is the case, I would suggest to take a look at Word Sense Disambiguation tasks which look for coarse grained word senses. For WordNet, it might be called supersense tagging task. This paper worth to take a look: More or less supervised supersense tagging of Twitter And about your question about choosing words from current phrases, there is also an active question about "converting phrase to vectors", my answer to that question in <code>word2vec</code> fashion might be useful: How can a sentence or a document be converted to a vector? I can add more related papers later if it comes to my mind.

How to automatically label a cluster of words using semantics?

Tags:

python

r

nlp

nltk

wordnet

The context is : I already have clusters of words (phrases actually) resulting from kmeans applied to internet search queries and using common urls in the results of the search engine as a distance (co-occurrence of urls rather than words if I simplify a lot).

I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.

For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer'] My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.

I am new to NPL but I have to precise that I don't want to extract words using POS tagging (or at least this is not the expected final outcome but maybe a necessary preliminary step).

I read about Wordnet for sense desambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input) nor obtain the definition of one selected word thanks to the context provided by the whole bunch of words (which word to select in this case ?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the xml structure of the wordnet) and then summarize the context in one or few words.

Any ideas ? I can use R or python, I read a little about nltk but I don't find a way to use it in my context.

749

asked Jun 30 '15 21:06

Stéphanie C

2 Answers

Your best bet is probably is to label the clusters manually, especially if there are few of them. This a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.

Also, going through the clusters yourself will have benefits. 1) you may discover you had the wrong number of clusters (k parameter) or that there was too much junk in the input to begin with. 2) you will gain qualitative insight into what is being talked about and what topic there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...

169

answered Sep 19 '22 06:09

mbatchkarov

When we talk about semantics in this area we mean Statistical Semantics. The statistical or distributional semantics is very different from other definitions of semantics which has logic and reasoning behind it. Statistical semantics is based on Distributional Hypothesis, which considers context as meaning aspect of words and phrases. Meaning in very abstract and general sense in different litterers is called topics. There are several unsupervised methods for modelling topics, such as LDA or even word2vec, which basically provide word similarity metric or suggest a list of similar words for a document as another context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.

However, for several reasons you might accept low accuracy assignment of a word as the general topic (or as in your words "global semantic") to a list of phrases. If this is the case, I would suggest to take a look at Word Sense Disambiguation tasks which look for coarse grained word senses. For WordNet, it might be called supersense tagging task.

This paper worth to take a look: More or less supervised supersense tagging of Twitter

And about your question about choosing words from current phrases, there is also an active question about "converting phrase to vectors", my answer to that question in word2vec fashion might be useful: How can a sentence or a document be converted to a vector?

I can add more related papers later if it comes to my mind.

answered Sep 22 '22 06:09

Mehdi

Related questions
                            
                                Cython Metaclass .pxd: How should I implement `__eq__()`?
                            
                                How can I prevent the inheritance of python loggers and handlers during multiprocessing based on fork?
                            
                                imported modules becomes None when replacing current module in sys.modules using a class object
                            
                                python matplotlib save graph without showing
                            
                                Django data migration fails when running manage.py test, but not when running manage.py migrate
                            
                                Upgrading transaction.commit_manually() to Django > 1.6
                            
                                Download files over SSH using Python
                            
                                Long Polling in Python with Flask
                            
                                How to find out if the elements in one list are in another? [duplicate]
                            
                                Why am I getting a ConnectionResetError here?
                            
                                Is there any smart way to combine overlapping paths in python?
                            
                                python password rules validation
                            
                                How to get a padded slice of a multidimensional array?
                            
                                Image Gradient Vector Field in Python
                            
                                Python regex pattern max length in re.compile?
                            
                                Pandas read csv out of memory
                            
                                ANTLR4 and the Python target
                            
                                Python: Size of Reference?
                            
                                Why are there redundant ways to import in Python?
                            
                                Python beginner Class variable Error

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With