How can I categorize a list of nouns into abstract or concrete in Python?
For example:
"Have a seat in that chair."
In the sentence above, chair is a noun and can be categorized as concrete.
I would suggest training a classifier using pretrained word vectors.
You need two libraries: spacy for tokenizing text and extracting word vectors, and scikit-learn for machine learning:
import spacy
from sklearn.linear_model import LogisticRegression
import numpy as np
# Requires the model to be installed first: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
Distinguishing concrete and abstract nouns is a simple task, so you can train a model with very few examples:
classes = ['concrete', 'abstract']
# todo: add more examples
train_set = [
    ['apple', 'owl', 'house'],
    ['agony', 'knowledge', 'process'],
]
X = np.stack([list(nlp(w))[0].vector for part in train_set for w in part])
y = [label for label, part in enumerate(train_set) for _ in part]
classifier = LogisticRegression(C=0.1, class_weight='balanced').fit(X, y)
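To make the label construction explicit, here is a minimal sketch of what the comprehension for y produces: each word gets the index of the list it sits in, and that index maps back into classes. The variable name words is introduced here only for illustration.

```python
classes = ['concrete', 'abstract']
train_set = [
    ['apple', 'owl', 'house'],
    ['agony', 'knowledge', 'process'],
]

# Flatten the words in the same order the feature matrix X is built.
words = [w for part in train_set for w in part]
# Each word is labelled with the index of the list it came from.
y = [label for label, part in enumerate(train_set) for _ in part]

for word, label in zip(words, y):
    print(word, label, classes[label])
```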
When you have a trained model, you can apply it to any text:
for token in nlp("Have a seat in that chair with comfort and drink some juice to soothe your thirst."):
    if token.pos_ == 'NOUN':
        print(token, classes[classifier.predict([token.vector])[0]])
The result looks satisfying:
# seat concrete
# chair concrete
# comfort abstract
# juice concrete
# thirst abstract
You can improve the model by applying it to different nouns, spotting the errors, adding the misclassified nouns to the training set under the correct label, and fitting the classifier again.
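The correction loop described above could be sketched roughly as follows. The helper names find_errors and add_corrections, the hand-labelled dictionary checked, and the stand-in predictor are all hypothetical and only illustrate the bookkeeping; they are not part of spacy or scikit-learn.

```python
def find_errors(predict, labelled_nouns):
    """Return (word, predicted, expected) for every noun predict gets wrong."""
    errors = []
    for word, expected in labelled_nouns.items():
        predicted = predict(word)
        if predicted != expected:
            errors.append((word, predicted, expected))
    return errors


def add_corrections(train_set, classes, errors):
    """Append each misclassified word to the list of its correct class."""
    for word, _predicted, expected in errors:
        part = train_set[classes.index(expected)]
        if word not in part:
            part.append(word)
    return train_set


# Toy run with a stand-in predictor that calls every noun concrete.
classes = ['concrete', 'abstract']
train_set = [['apple', 'owl', 'house'], ['agony', 'knowledge', 'process']]
checked = {'table': 'concrete', 'freedom': 'abstract'}  # hand-labelled nouns

errors = find_errors(lambda word: 'concrete', checked)
add_corrections(train_set, classes, errors)
print(errors)        # [('freedom', 'concrete', 'abstract')]
print(train_set[1])  # ['agony', 'knowledge', 'process', 'freedom']
```

With the real model, predict could be something like lambda w: classes[classifier.predict([nlp(w)[0].vector])[0]]. After updating train_set, X and y have to be rebuilt and the classifier fitted again.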
Try using WordNet via NLTK and explore the hypernym tree of the words you are interested in. WordNet is a lexical database that organises words in a tree-like structure based on their abstraction level. You can use this to find more general terms (hypernyms) for your target word.
For example, the following code tells you that the word "chair" belongs to the category "seat", which ultimately belongs to the overarching category "entity". The word "anger", on the other hand, belongs to the category "emotion".
from nltk.corpus import wordnet as wn
wn.synsets('chair')
wn.synset('chair.n.01').hypernyms()
# [Synset('seat.n.03')]
wn.synset('chair.n.01').root_hypernyms()
# [Synset('entity.n.01')]
wn.synsets('anger')
wn.synset('anger.n.01').hypernyms()
# [Synset('emotion.n.01')]
Look at the NLTK WordNet documentation and experiment with the hypernym trees to categorise words as abstract or concrete. You will have to define for yourself what exactly you mean by "abstract" and "concrete", and which categories of words you want to put into these two buckets.
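One possible way to turn the hypernym tree into two buckets is sketched below. It assumes that a noun counts as concrete when its hypernym path passes through WordNet's physical_entity.n.01 and as abstract when it passes through abstraction.n.06. That rule, and the function names label_from_path and label_noun, are choices made for this sketch, not a definition taken from WordNet or NLTK.

```python
# Marker synsets for the two buckets (an assumption of this sketch).
CONCRETE_MARKER = "physical_entity.n.01"
ABSTRACT_MARKER = "abstraction.n.06"


def label_from_path(path_names):
    """Label one hypernym path, given as synset names from the root down."""
    if CONCRETE_MARKER in path_names:
        return "concrete"
    if ABSTRACT_MARKER in path_names:
        return "abstract"
    return "unknown"


def label_noun(word):
    """Label a word using the first hypernym path of its first noun synset."""
    # Imported here so label_from_path can be used without NLTK installed.
    from nltk.corpus import wordnet as wn

    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return "unknown"
    path = synsets[0].hypernym_paths()[0]
    return label_from_path([s.name() for s in path])
```

Calling label_noun("chair") or label_noun("anger") needs the WordNet corpus, which can be fetched with nltk.download("wordnet"). Only the first sense and its first path are inspected here, so words with several meanings may land in the wrong bucket; checking all senses is a natural refinement.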