
Classify a noun into abstract or concrete using NLTK or similar

Tags: python, nlp, nltk

How can I categorize a list of nouns into abstract or concrete in Python?

For example:

"Have a seat in that chair."

In the sentence above, "chair" is a noun and can be categorized as concrete.

singhalc asked Feb 18 '15 02:02



2 Answers

I would suggest training a classifier using pretrained word vectors.

You need two libraries: spacy for tokenizing text and extracting word vectors, and scikit-learn for machine learning:

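import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

# Medium English model that ships with word vectors; install it first with:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")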

With pretrained word vectors, distinguishing concrete from abstract nouns is a fairly easy task, so you can train a first model with very few examples:

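classes = ['concrete', 'abstract']
# todo: add more examples
train_set = [
    ['apple', 'owl', 'house'],          # index 0: concrete
    ['agony', 'knowledge', 'process'],  # index 1: abstract
]
# One row per training word: the vector of its first (and only) token.
X = np.stack([nlp(w)[0].vector for part in train_set for w in part])
# Labels are indices into `classes` (0 = concrete, 1 = abstract).
y = [label for label, part in enumerate(train_set) for _ in part]
# A small C means strong regularization, which helps with so few examples.
classifier = LogisticRegression(C=0.1, class_weight='balanced').fit(X, y)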

When you have a trained model, you can apply it to any text:

for token in nlp("Have a seat in that chair with comfort and drink some juice to soothe your thirst."):
    if token.pos_ == 'NOUN':
        print(token, classes[classifier.predict([token.vector])[0]])

The result looks satisfying:

# seat concrete
# chair concrete
# comfort abstract
# juice concrete
# thirst abstract

You can improve the model by applying it to different nouns, spotting the errors and adding them to the training set under the correct label.
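
The correction loop described above could be sketched roughly like this. It is only an illustration under stated assumptions: the helper names `add_correction` and `build_xy` and the `vectorize` callback are hypothetical and not part of spaCy or scikit-learn, and the toy vectorizer stands in for real word vectors so the sketch runs without a model download.

```python
from typing import Callable, List, Sequence, Tuple

CLASSES = ["concrete", "abstract"]


def add_correction(train_set: List[List[str]], word: str, label: str) -> None:
    """Put `word` under `label`, removing it from any class it was in before."""
    target = CLASSES.index(label)  # raises ValueError for an unknown label
    for part in train_set:
        if word in part:
            part.remove(word)
    train_set[target].append(word)


def build_xy(
    train_set: Sequence[Sequence[str]],
    vectorize: Callable[[str], Sequence[float]],
) -> Tuple[List[Sequence[float]], List[int]]:
    """Flatten the per-class word lists into feature rows and integer labels."""
    X = [vectorize(w) for part in train_set for w in part]
    y = [label for label, part in enumerate(train_set) for _ in part]
    return X, y


# Stand-in vectorizer (word length and vowel count), an assumption made only
# so this runs anywhere; with spaCy it would be: lambda w: nlp(w)[0].vector
def toy_vectorize(word: str) -> List[float]:
    return [float(len(word)), float(sum(c in "aeiou" for c in word))]


train_set = [["apple", "owl", "house"], ["agony", "knowledge", "process"]]
add_correction(train_set, "chair", "concrete")    # a noun the model got wrong
add_correction(train_set, "process", "abstract")  # already present: no duplicate
X, y = build_xy(train_set, toy_vectorize)
```

In the real setup you would refit after each batch of corrections, for example with `LogisticRegression(C=0.1, class_weight='balanced').fit(np.stack(X), y)`.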

David Dale answered Oct 31 '22 07:10



Try using WordNet via NLTK and explore the hypernym tree of the words you are interested in. WordNet is a lexical database that groups words into sense sets (synsets) and links each one to more general synsets, its hypernyms, in a tree-like hierarchy. You can follow these links to get progressively more abstract versions of your target word.

For example, the following code shows that "chair" has the hypernym "seat" and ultimately sits under the root category "entity", while "anger" has the hypernym "emotion".

from nltk.corpus import wordnet as wn
# Requires the WordNet corpus: run nltk.download('wordnet') once beforehand.

wn.synsets('chair')  # all senses of "chair"; pick the one you mean
wn.synset('chair.n.01').hypernyms()
# [Synset('seat.n.03')]
wn.synset('chair.n.01').root_hypernyms()
# [Synset('entity.n.01')]

wn.synsets('anger')
wn.synset('anger.n.01').hypernyms()
# [Synset('emotion.n.01')]

Look at the NLTK WordNet documentation and play around with the hypernym trees to sort words into abstract or concrete categories. You will have to define for yourself what exactly you mean by "abstract" and "concrete", and which WordNet categories belong in each of the two buckets.
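
One possible way to turn the hypernym tree into a two-way decision is to check which top-level noun branch a sense falls under. The sketch below assumes that "concrete" means having `physical_entity.n.01` among the hypernyms and "abstract" means having `abstraction.n.06` among them. That mapping is an assumption for illustration, not something WordNet defines, and the function names are hypothetical. The decision logic works on plain synset names, so it can be tried without the corpus installed.

```python
from typing import Iterable, List, Sequence

# Assumed bucket definitions; adjust them to your own notion of the two classes.
CONCRETE_ROOT = "physical_entity.n.01"
ABSTRACT_ROOT = "abstraction.n.06"


def classify_paths(paths: Iterable[Sequence[str]]) -> str:
    """Classify one sense from its hypernym paths, given as lists of synset names."""
    names = {name for path in paths for name in path}
    concrete = CONCRETE_ROOT in names
    abstract = ABSTRACT_ROOT in names
    if concrete and not abstract:
        return "concrete"
    if abstract and not concrete:
        return "abstract"
    return "unknown"  # neither branch, or both via multiple inheritance


def wordnet_paths(synset_name: str) -> List[List[str]]:
    """Fetch hypernym paths from NLTK; needs nltk and the WordNet corpus."""
    # Imported lazily so classify_paths can be used without nltk installed.
    from nltk.corpus import wordnet as wn

    synset = wn.synset(synset_name)
    return [[s.name() for s in path] for path in synset.hypernym_paths()]


# Hand-written, shortened paths for illustration only (not full WordNet output).
chair_paths = [["entity.n.01", "physical_entity.n.01", "seat.n.03", "chair.n.01"]]
anger_paths = [["entity.n.01", "abstraction.n.06", "emotion.n.01", "anger.n.01"]]
chair_label = classify_paths(chair_paths)
anger_label = classify_paths(anger_paths)
print(chair_label, anger_label)
```

With the corpus installed, the call would look like `classify_paths(wordnet_paths('chair.n.01'))`. A word can have several senses, so you still need to decide which synset to classify, for instance the first noun sense returned by `wn.synsets(word, pos=wn.NOUN)`.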

Moritz answered Oct 31 '22 06:10
