Parsing Meaning from Text

Tags: python, parsing, nlp, lexical-analysis

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a lightweight, easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything in Title Case is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.


asked Jul 17 '09 00:07

Tom

2 Answers

You need to look at the Natural Language Toolkit, which is for exactly this sort of thing.

This section of the manual looks very relevant: Categorizing and Tagging Words - here's an extract:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Here we see that "and" is CC, a coordinating conjunction; "now" and "completely" are RB, or adverbs; "for" is IN, a preposition; "something" is NN, a noun; and "different" is JJ, an adjective.
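To get from tagged tokens to the nouns the question asks about, one option is to filter on the Penn Treebank noun tags (NN, NNS, NNP, NNPS) and to join runs of adjacent proper nouns into phrases. This is only a rough sketch: the helper names `nouns` and `proper_noun_phrases` are made up for illustration, and the tags in `sample` were assigned by hand in the shape that `nltk.pos_tag` returns, so a real tagger may label some words differently.

```python
from itertools import groupby

PROPER_TAGS = ("NNP", "NNPS")  # Penn Treebank tags for proper nouns


def nouns(tagged):
    """Return every word whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged if tag.startswith("NN")]


def proper_noun_phrases(tagged):
    """Join each run of adjacent proper-noun tokens into one phrase."""
    phrases = []
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1] in PROPER_TAGS):
        if is_proper:
            phrases.append(" ".join(word for word, _tag in group))
    return phrases


# Hand-assigned tags (an assumption, not real tagger output), in the
# (word, tag) format produced by nltk.pos_tag(nltk.word_tokenize(...)).
sample = [
    ("Manny", "NNP"), ("Ramirez", "NNP"), ("makes", "VBZ"), ("his", "PRP$"),
    ("return", "NN"), ("for", "IN"), ("the", "DT"), ("Dodgers", "NNPS"),
    ("today", "NN"), ("against", "IN"), ("the", "DT"),
    ("Houston", "NNP"), ("Astros", "NNP"),
]

print(nouns(sample))
print(proper_noun_phrases(sample))
```

With real text you would pass in the result of `nltk.pos_tag(nltk.word_tokenize(sentence))` instead of `sample`. Because the tagger looks at context as well as capitalisation, this avoids the "anything capitalised is a proper noun" regex, though it will still make mistakes.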

answered Oct 26 '22 23:10

RichieHindle

Use the NLTK, in particular chapter 7 on Information Extraction.

You say you want to extract meaning, and there are modules for semantic analysis, but I think information extraction (IE) is all you need, and honestly it is one of the few areas of NLP that computers can handle well right now.

See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramirez as a person, Dodgers as a sports organization, and Houston Astros as another sports organization, or whatever suits your domain) and Relation Extraction. There is an NER chunker that you can plug in once you have the NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:

>>> import nltk
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.
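Putting those steps together, here is a rough sketch of going from a raw sentence to a list of (label, text) entities. The helper names `named_entities` and `entities_from_text` are made up for illustration. It assumes an NLTK 3 style tree, where entity chunks are subtrees with `label()` and `leaves()` and ordinary tokens are plain (word, tag) tuples. It also assumes the tokenizer, tagger and NE chunker data have already been fetched with `nltk.download`.

```python
def named_entities(chunked):
    """Collect (label, text) pairs from the top-level chunks of an ne_chunk result."""
    entities = []
    for node in chunked:
        # Entity chunks are subtrees with a label(); ordinary tokens are
        # plain (word, tag) tuples, which have no such attribute.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


def entities_from_text(text):
    """Tokenize, tag and chunk raw text, then list its named entities."""
    import nltk  # imported here so named_entities() works without NLTK installed

    tokens = nltk.word_tokenize(text)   # split the sentence into words
    tagged = nltk.pos_tag(tokens)       # attach a part-of-speech tag to each word
    chunked = nltk.ne_chunk(tagged)     # group tagged words into entity chunks
    return named_entities(chunked)
```

Calling `entities_from_text("Manny Ramirez makes his return for the Dodgers today against the Houston Astros")` should give a list of pairs such as a PERSON entry for the player. The exact labels and boundaries depend on the chunker model, so treat the output as a starting point rather than ground truth.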

answered Oct 26 '22 23:10

Bluu
