python nltk keyword extraction from sentence

"First thing we do, let's kill all the lawyers." - William Shakespeare

Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:

[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.

*Note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags).

asked Jul 10 '12 04:07

waigani

1 Answer

I don't think there's any perfect answer to this question, because there is no gold-standard set of input/output mappings that everybody will agree on. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you can describe precisely and unambiguously what you want your system to do, your problem will be more than half solved.

Until then, I can suggest some heuristics to help you get what you want.

Build an idf (inverse document frequency) dictionary from your data, i.e. a mapping from every word to a number that reflects how rare that word is. Bonus points for doing it for larger n-grams as well.
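
As a minimal sketch of such a dictionary, assuming your data is already split into documents and tokens: the `build_idf` helper and the toy corpus below are made up for illustration, and the plain log(N / df) weighting is just one common choice among several.

```python
import math
from collections import Counter

def build_idf(documents):
    """Map each word to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Use a set so a word is counted at most once per document.
        doc_freq.update({t.lower() for t in tokens})
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Toy corpus, invented for this example; each document is a list of tokens.
corpus = [
    ["we", "do", "the", "first", "thing"],
    ["they", "do", "one", "thing", "well"],
    ["lawyers", "do", "the", "paperwork"],
    ["do", "not", "kill", "the", "messenger"],
]
idf = build_idf(corpus)
```

Here 'do' appears in every document and gets an idf of 0, while 'kill' and 'lawyers' each appear in only one and get the highest value.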

By combining the idf value of each word in your input sentence with its POS tag, you can answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence?', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so finding the rarest noun and the rarest verb in a sentence and returning just those two may do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see whether that does the job better.
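
A rough sketch of the rarest-noun / rarest-verb heuristic, using the tagged words from the question. The idf numbers are invented for illustration (a real table would come from your corpus), `rarest_by_pos` is a hypothetical helper, and treating unseen words as idf 0 is an assumption you may well want to reverse.

```python
def rarest_by_pos(tagged, idf, tag_prefix, default_idf=0.0):
    """Return the highest-idf word whose POS tag starts with tag_prefix, or None."""
    candidates = [word for word, tag in tagged if tag.startswith(tag_prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda w: idf.get(w.lower(), default_idf))

# POS-tagged words as given in the question (Penn Treebank tags).
tagged = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
          ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

# Hypothetical idf values, made up for this example.
idf = {"first": 1.2, "thing": 1.0, "do": 0.3,
       "lets": 1.5, "kill": 4.0, "lawyers": 4.5}

noun = rarest_by_pos(tagged, idf, "NN")  # NN, NNS, NNP, NNPS
verb = rarest_by_pos(tagged, idf, "VB")  # VB, VBP, VBD, ...
print(verb, noun)  # kill lawyers
```

Matching on the tag prefix keeps all noun and verb subtypes in play; if proper nouns like 'First' start winning, you could filter on exact tags instead.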

Ways to expand this include identifying larger phrases using n-gram idfs, or building a full parse tree of the sentence (using, say, the Stanford Parser) and looking for patterns in these trees that tell you where the important words tend to be located.
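
For the n-gram variant, something along these lines might be a starting point. It is only a sketch: the helper names and the toy corpus are invented, and scoring unseen bigrams as if they occurred in a single document is an assumption, not an established rule. The parse-tree idea needs an external parser and is not shown.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_idf(documents, n):
    """Map each n-gram to log(N / df), with df counted per document."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(ngrams([t.lower() for t in tokens], n)))
    return {gram: math.log(len(documents) / df) for gram, df in doc_freq.items()}

def rank_phrases(tokens, ngram_idf, n, default_idf):
    """Order a sentence's distinct n-grams from rarest to most common."""
    grams = list(dict.fromkeys(ngrams([t.lower() for t in tokens], n)))
    return sorted(grams, key=lambda g: ngram_idf.get(g, default_idf), reverse=True)

# Toy corpus, invented for this example.
corpus = [
    ["let", "us", "do", "the", "first", "thing"],
    ["the", "first", "thing", "we", "do"],
    ["first", "thing", "tomorrow", "we", "kill", "the", "lights"],
    ["all", "the", "lawyers", "left"],
]
bigram_idf = build_ngram_idf(corpus, 2)

sentence = ["first", "thing", "we", "do", "kill", "all", "the", "lawyers"]
# Assumption: an unseen bigram is scored like one seen in a single document.
ranked = rank_phrases(sentence, bigram_idf, 2, default_idf=math.log(len(corpus)))
```

On this corpus the common bigram ('first', 'thing') ends up last in `ranked`. Note that a rare bigram is not necessarily a meaningful phrase, so you would probably still want to combine this with POS tags.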

answered Oct 06 '22 00:10

Aditya Mukherji

Tags: python, nlp, nltk
