
How do I remove verbs, prepositions, conjunctions etc from my text? [closed]

Basically, in my text I just want to keep the nouns and remove all other parts of speech.

I do not think there is any automated way to do this. If there is, please suggest one.

If there is no automated way, I can also do it manually, but for that I would need lists of all possible verbs, prepositions, conjunctions, adjectives, etc. Can somebody please suggest a source where I can get these lists?

user3710832 asked Jun 25 '14 10:06




2 Answers

You can use the NLTK part-of-speech tagger to tag each word, then only keep the nouns. Here's an example of the NLTK tagger, taken from the NLTK homepage:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

In your case, you'd keep every element of the tagged list that has a tag starting with N (NN, NNS, NNP, NNPS), i.e. all the nouns, and throw the rest away. Check out the complete list of tags; you might also want to include foreign words (FW), for example.
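The filtering step could look something like the sketch below. It works on a list of (word, tag) pairs shaped like the output of nltk.pos_tag, so it runs without NLTK installed; the function name keep_nouns and its extra_tags parameter are made up for illustration and are not part of NLTK.

```python
def keep_nouns(tagged, extra_tags=()):
    """Return the words whose tag starts with N (NN, NNS, NNP, NNPS).

    tagged:     list of (word, tag) pairs, e.g. the output of nltk.pos_tag
    extra_tags: optional extra tags to keep as well, e.g. ('FW',)
    """
    return [word for word, tag in tagged
            if tag.startswith("N") or tag in extra_tags]


# The six tagged tokens from the tagger example.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN')]

nouns = keep_nouns(tagged)
print(nouns)  # ['Thursday', 'morning']
```

With real text you would pass nltk.pos_tag(nltk.word_tokenize(text)) as the tagged argument, and keep_nouns(tagged, extra_tags=('FW',)) would keep foreign words too.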

NLTK is free to use, and it comes with its own data sets that are also free. You won't have to build lists of prepositions and so on yourself.

Wander Nauta answered Sep 20 '22 14:09



On the manual end, there is the Wiktionary dump:

https://dumps.wikimedia.org/enwiktionary/20140609/

I would skip the full-articles dump in any flavor and just go with the abstracts, which contain the word class. Good luck, the formatting is a beast.

To get started in Python:

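import xml.etree.ElementTree as ET

wiktionary_path = '/path/to/wiktionary.xml'
your_target_tag = 'abstract'  # placeholder: set this to the tag you want to match

# iterparse streams the file, yielding each element once its closing tag is read
for event, elem in ET.iterparse(wiktionary_path):
    if elem.tag == your_target_tag:
        print(elem.text)  # do magic: replace with your own processing
        elem.clear()      # release the element's contents once handled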

Should get you started.
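As a rough illustration of turning such a stream into per-class word lists, here is a sketch that runs on a tiny made-up XML sample. The entry, word and pos tags are invented for the example and are not the layout of the real dump, which is a lot messier; the function name words_by_class is also hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample data: the real dump uses different tags and messier text.
SAMPLE = b"""<entries>
  <entry><word>run</word><pos>verb</pos></entry>
  <entry><word>under</word><pos>preposition</pos></entry>
  <entry><word>table</word><pos>noun</pos></entry>
  <entry><word>and</word><pos>conjunction</pos></entry>
</entries>"""


def words_by_class(source):
    """Stream an XML source and group each entry's word under its word class."""
    groups = defaultdict(set)
    # By default iterparse only reports "end" events, so an entry's
    # children are fully parsed by the time the entry itself shows up.
    for event, elem in ET.iterparse(source):
        if elem.tag == "entry":
            groups[elem.findtext("pos")].add(elem.findtext("word"))
            elem.clear()  # drop the entry's children once it has been read
    return groups


groups = words_by_class(io.BytesIO(SAMPLE))
print(sorted(groups))   # ['conjunction', 'noun', 'preposition', 'verb']
print(groups["noun"])   # {'table'}
```

For the real file you would pass the path or an open file object instead of the BytesIO sample, and adapt the tag names and the word-class extraction to whatever the dump actually contains.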

It's more work than most other word lists, but it is far richer than anything else I've used for NLP. Best of luck to you, and watch out for the Unicode!

blanket_cat answered Sep 19 '22 14:09
