I am trying to figure out how to use NLTK's cascading chunker as per Chapter 7 of the NLTK book. Unfortunately, I'm running into a few issues when performing non-trivial chunking measures.
Let's start with this phrase:
"adventure movies between 2000 and 2015 featuring performances by daniel craig"
I am able to find all the relevant NPs when I use the following grammar:
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
However, I am not sure how to build nested structures with NLTK. The book gives the following format, but a few things are clearly missing (e.g., how does one actually specify multiple rules?):
grammar = r"""
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
CLAUSE: {<NP><VP>} # Chunk NP, VP
"""
In my case, I'd like to do something like the following:
grammar = r"""
MEDIA: {<DT>?<JJ>*<NN.*>+}
RELATION: {<V.*>}{<DT>?<JJ>*<NN.*>+}
ENTITY: {<NN.*>}
"""
Assuming that I'd like to use a cascaded chunker for my task, what syntax would I need to use? Additionally, is it possible for me to specify specific words (e.g. "directed" or "acted") when using a chunker?
The NLTK chunk package provides classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This task is called “chunk parsing” or “chunking”, and the identified groups are called “chunks”.
Chunking is the process of extracting phrases from unstructured text: a sentence is analyzed to identify its constituents (noun groups, verbs, verb groups, etc.). However, it does not specify their internal structure or their role in the main sentence. It works on top of POS tagging.
Chunking extracts the required phrases from a given sentence, whereas POS tagging on its own only identifies the part of speech of each word in the sentence.
Chunking groups related words together based on their part-of-speech tags. A chunk grammar defines which sequences of tags, such as adjectives followed by nouns, form a chunk.
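As a minimal sketch of the idea, the snippet below applies the one-rule NP grammar from the question to the example phrase. The part-of-speech tags are assigned by hand here (an assumption, so that no tagger model has to be downloaded); a real tagger such as nltk.pos_tag may label some of these words differently.

```python
import nltk

# Hand-assigned POS tags (an assumption for illustration, not tagger output).
tagged = [
    ("adventure", "NN"), ("movies", "NNS"), ("between", "IN"),
    ("2000", "CD"), ("and", "CC"), ("2015", "CD"),
    ("featuring", "VBG"), ("performances", "NNS"), ("by", "IN"),
    ("daniel", "NN"), ("craig", "NN"),
]

# Optional determiner, any number of adjectives, then one or more nouns.
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = parser.parse(tagged)

# Collect the text of every NP chunk, skipping the root S node.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

With these hand-assigned tags, the chunker groups "adventure movies", "performances" and "daniel craig" as NPs and leaves every other token outside any chunk.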
I can't comment on the relationship extraction part, not least because you don't give any details on what you want to do and what kind of data you have. So this is a rather partial answer.
a.) How does cascading chunking work in NLTK?
b.) Is it possible to treat the chunker like a context-free grammar, and if so, how?
As I understand the section "Building nested structure with cascaded chunkers" in the NLTK book, you can use it with a context-free grammar, but you have to apply it repeatedly to get the recursive structure. Chunkers are flat, but you can add chunks on top of chunks.
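Here is a sketch of what "applying it repeatedly" can look like, using the four-stage grammar quoted in the question and the example sentence from that section of the book. It relies on the loop argument of RegexpParser, which sets how many times the parser runs through all the stages. The sentence is tagged by hand, and count_label is a helper name made up for this illustration.

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}                # chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # chunk verbs and their arguments
  CLAUSE: {<NP><VP>}            # chunk NP, VP
"""

# Hand-tagged sentence, so no tagger model is required.
sentence = [
    ("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"), ("saw", "VBD"),
    ("the", "DT"), ("cat", "NN"), ("sit", "VB"), ("on", "IN"),
    ("the", "DT"), ("mat", "NN"),
]

def count_label(tree, label):
    """Count the subtrees of `tree` that carry the given chunk label."""
    return sum(1 for subtree in tree.subtrees() if subtree.label() == label)

# One pass over the stages: later stages can use chunks built by earlier ones,
# but a VP cannot contain a CLAUSE that is only built after the VP stage ran.
flat = nltk.RegexpParser(grammar).parse(sentence)

# Two passes: the second pass can wrap the CLAUSE from the first pass
# inside a new VP, which in turn becomes part of a new CLAUSE.
deep = nltk.RegexpParser(grammar, loop=2).parse(sentence)

print(deep)
print(count_label(flat, "CLAUSE"), count_label(deep, "CLAUSE"))
```

Each stage is still a flat chunker; the nesting comes only from later stages (and later passes) treating earlier chunks as single units.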
c.) How can I use chunking to perform relation extraction?
I can't really speak to that, and anyway, as I said, you don't give any specifics; but if you're dealing with real text, my understanding is that hand-written rulesets for any task are useless unless you have a large team and a lot of time. Look into the probabilistic tools that come with the NLTK. It'll be a whole lot easier if you have an annotated training corpus.
Anyway, a couple more comments about the RegexpParser.
You'll find many more usage examples at http://www.nltk.org/howto/chunk.html. (Unfortunately, it's not a real how-to but a test suite.)
According to this, you can specify multiple expansion rules like this:
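patterns = r"""NP: {<DT|PP\$>?<JJ>*<NN>}  # optional determiner or possessive, adjectives, then a noun
    {<NNP>+}  # one or more proper nouns
    {<NN>+}  # one or more common nouns
"""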
I should add that grammars can have multiple rules with the same left side. That should add some flexibility with grouping related rules, etc.
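To make the multiple-rule form concrete, here is a small sketch that runs the NP patterns above on a hand-tagged sentence (the Rapunzel sentence and its tags are my assumption for illustration). The rules listed under one label are tried in order within a single stage.

```python
import nltk

# Raw string so that \$ reaches the chunk grammar unchanged.
patterns = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""

# Hand-tagged sentence; PP$ marks the possessive pronoun.
sentence = [
    ("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
    ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN"),
]

tree = nltk.RegexpParser(patterns).parse(sentence)

# Text of every NP chunk, in sentence order.
np_chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(np_chunks)
```

The first rule picks up "her long golden hair", the second picks up "Rapunzel", and the third finds nothing left to chunk in this sentence.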