I have this simple example of chunking in nltk.
My data:
data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'
...pre-processing ...
data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging
CHUNKING:
cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)
This returns, among other things, (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP), so it did what I wanted it to do.
Now my question: I want to switch to spacy for my projects. How would I do this in spacy?
I got as far as tagging it (the coarser .pos_ attribute will do for me):
import spacy
nlp = spacy.load('en_core_web_sm')
parsed_sent = nlp('The little yellow dog will then walk to the Starbucks, where')
def print_coarse_pos(token):
    print(token, token.pos_)

for sentence in parsed_sent.sents:
    for token in sentence:
        print_coarse_pos(token)
...which prints the tokens and their tags:
The DET
little ADJ
yellow ADJ
dog NOUN
will VERB
then ADV
walk VERB
...
How could I extract chunks with my own grammar?
spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).
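For instance, a rough Matcher analogue of the NLTK grammar above might look like the following sketch (assuming spaCy v3 and the en_core_web_sm model; note that, unlike the non-greedy <.*>*? in the regexp grammar, the Matcher wildcard is greedy and returns every matching span, so filter_spans is used to keep only the longest ones):

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Rough analogue of <VB><.*>*?<NNP>: a verb, any tokens, then a proper noun
pattern = [{'POS': 'VERB'}, {'OP': '*'}, {'POS': 'PROPN'}]
matcher.add('CUSTOMCHUNK', [pattern])

doc = nlp('The little yellow dog will then walk to the Starbucks, '
          'where he will introduce them to Michael.')
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in filter_spans(spans):  # drop overlapping sub-matches
    print(span.text)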
While NLTK exposes many alternative algorithms for each task, spaCy is opinionated and ships one well-tuned implementation per task. Its syntactic analysis is among the fastest available, and it also offers access to large word vectors that are easy to customize.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
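A quick way to see that pipeline in practice (a sketch assuming the small English model en_core_web_sm is installed):

import spacy

nlp = spacy.load('en_core_web_sm')  # a trained pipeline
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

doc = nlp('The little yellow dog will then walk to the Starbucks.')
print(doc[6].text, doc[6].pos_, doc[6].dep_)  # annotations added by the pipeline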
Copied verbatim from https://github.com/spacy-io/spaCy/issues/342
There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:
doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
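On the example sentence this should print noun phrases such as 'The little yellow dog', 'the Starbucks', 'he', 'them' and 'Michael' (the exact spans depend on the model).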
The basic way that this works is something like this:
for token in doc:
    if is_head_of_chunk(token):
        chunk_start = token.left_edge.i
        chunk_end = token.right_edge.i + 1
        yield doc[chunk_start : chunk_end]
You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out which labels to use: http://spacy.io/demos/displacy
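For completeness, here is a minimal runnable version of that sketch. The is_head_of_chunk predicate is hypothetical; this version simply treats every preposition as a chunk head, which pulls out spans like 'to the Starbucks':

import spacy

nlp = spacy.load('en_core_web_sm')

def is_head_of_chunk(token):
    # Hypothetical heuristic: treat every preposition as a chunk head.
    # Swap in whatever POS tags or dependency labels suit your grammar.
    return token.dep_ == 'prep'

def custom_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            # left_edge/right_edge cover the token's whole subtree
            yield doc[token.left_edge.i : token.right_edge.i + 1]

doc = nlp('The little yellow dog will then walk to the Starbucks, '
          'where he will introduce them to Michael.')
for chunk in custom_chunks(doc):
    print(chunk.text)  # e.g. 'to the Starbucks', 'to Michael'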