How do I remove verbs, prepositions, conjunctions etc from my text? [closed]

Tags: python, r, text-mining

Basically, I just want to keep the nouns in my text and remove all other parts of speech.

I do not think there is an automated way to do this. If there is, please suggest one.

If there is no automated way, I can also do it manually, but for that I would need lists of all possible verbs, prepositions, conjunctions, adjectives and so on. Can somebody please suggest a source where I can get these lists?

asked Jun 25 '14 10:06

user3710832

2 Answers

You can use the NLTK part-of-speech tagger to tag each word, then only keep the nouns. Here's an example of the NLTK tagger, taken from the NLTK homepage:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

In your case, you'd keep every element of the tagged list that has a tag starting with N, i.e. all the nouns (NN, NNS, NNP, NNPS), and throw the rest away. Check out the complete list of tags; you might also want to include foreign words (FW), for example.
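A minimal sketch of that filtering step, assuming `tagged` is a list of (word, tag) pairs like the one `nltk.pos_tag` returned above. The helper name `keep_nouns` and its `extra_tags` parameter are made up for illustration and are not part of NLTK:

```python
def keep_nouns(tagged, extra_tags=("FW",)):
    """Return the words whose tag starts with 'N' or is listed in extra_tags."""
    return [word for word, tag in tagged
            if tag.startswith("N") or tag in extra_tags]

# Sample input copied from the tagger output shown above.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN')]

nouns = keep_nouns(tagged)
print(nouns)  # ['Thursday', 'morning']
```

Pass `extra_tags=()` if you want strictly nouns and nothing else.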

NLTK is free to use, and it comes with its own data sets that are also free. You won't have to build lists of prepositions and so on yourself.

answered Sep 20 '22 14:09

Wander Nauta

On the manual end: the Wiktionary dump.

https://dumps.wikimedia.org/enwiktionary/20140609/

I would skip the full-articles dump in any flavor and go with the abstracts. They contain the word class. Good luck, the formatting is a beast.

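To get started in Python:

import xml.etree.ElementTree as ET

your_target_tag = 'abstract'  # placeholder; use whatever tag holds the data you need

# iterparse streams the file, so the whole dump is never parsed in one go
for event, elem in ET.iterparse('/path/to/wiktionary.xml'):
    if elem.tag == your_target_tag:
        print(elem.text)  # do your magic here
        elem.clear()      # release the element once it has been handled

That should get you started.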

It's more work than a lot of other lists, but it is far richer than anything else I've used for NLP. Best of luck to you, and watch out for the unicode!
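If you want to try the iterparse loop before downloading anything, here is a small self-contained sketch. The XML sample, its tag names (`doc`, `title`, `abstract`) and the helper `words_with_class` are all made up for illustration; the real dump's layout may well differ, so treat this as a pattern rather than a recipe:

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; the real dump's tag names and layout may differ.
SAMPLE = b"""<feed>
  <doc><title>cat</title><abstract>Noun: a small domesticated feline.</abstract></doc>
  <doc><title>run</title><abstract>Verb: to move quickly on foot.</abstract></doc>
  <doc><title>dog</title><abstract>Noun: a domesticated canine.</abstract></doc>
</feed>"""

def words_with_class(source, word_class):
    """Yield titles of <doc> entries whose abstract starts with word_class."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            title = elem.findtext("title", default="")
            abstract = elem.findtext("abstract", default="")
            if abstract.startswith(word_class):
                yield title
            elem.clear()  # release the processed entry's children and text

nouns = list(words_with_class(io.BytesIO(SAMPLE), "Noun"))
print(nouns)  # ['cat', 'dog']
```

For the real file, pass a path or an open binary file instead of the `io.BytesIO` wrapper.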

answered Sep 19 '22 14:09

blanket_cat
