Chunking with rule-based grammar in spacy

I have this simple example of chunking in nltk.

My data:

data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'

...pre-processing ...

data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging

CHUNKING:

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)

This returns (among other stuff): (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP), so it did what I wanted it to do.

Now my question: I want to switch to spacy for my projects. How would I do this in spacy?

I've gotten as far as tagging it (the coarser .pos_ attribute will do for me):

from spacy.en import English    # spaCy 1.x API; newer versions use spacy.load()
parser = English()
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')

def print_coarse_pos(token):
  print(token, token.pos_)

for sentence in parsed_sent.sents:
  for token in sentence:
    print_coarse_pos(token)

... which returns the tokens with their coarse tags: The DET little ADJ yellow ADJ dog NOUN will VERB then ADV walk VERB ...

How could I extract chunks with my own grammar?

asked Apr 18 '16 by ben_aaron

1 Answer

Copied verbatim from https://github.com/spacy-io/spaCy/issues/342

There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:

doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
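If you do want something closer to the original rule-based grammar, spaCy's Matcher (mentioned above) can express a similar pattern. The sketch below is an assumption-laden illustration, not part of the original answer: it uses the current spaCy 3.x API (spacy.load plus spacy.matcher.Matcher) rather than the spacy.en import from the question, the en_core_web_sm model and the CUSTOMCHUNK label are placeholders, and the pattern only roughly mirrors <VB><.*>*?<NNP>, since the Matcher has no non-greedy operator.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline with a tagger works
matcher = Matcher(nlp.vocab)

# Roughly <VB><.*>*?<NNP>: a verb, then any tokens that are neither verbs
# nor proper nouns, ending at the first proper noun.
pattern = [
    {"POS": "VERB"},
    {"POS": {"NOT_IN": ["VERB", "PROPN"]}, "OP": "*"},
    {"POS": "PROPN"},
]
matcher.add("CUSTOMCHUNK", [pattern])

doc = nlp("The little yellow dog will then walk to the Starbucks, "
          "where he will introduce them to Michael.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "walk to the Starbucks"

That said, for syntactic chunking the dependency-based approach is usually more robust, which is what the rest of the answer sketches.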

The basic way doc.noun_chunks works is something like this:

for token in doc:
    if is_head_of_chunk(token):
        chunk_start = token.left_edge.i
        chunk_end = token.right_edge.i + 1
        yield doc[chunk_start : chunk_end]

You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out which labels to use: http://spacy.io/demos/displacy
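As a concrete (and purely illustrative) example of that idea, the sketch below treats every verb as a chunk head; the heuristic and the custom_chunks wrapper are assumptions for demonstration, not part of the original answer, and again use the current spaCy API.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with a dependency parser works

def is_head_of_chunk(token):
    # Illustrative heuristic: every verb heads a chunk. In practice you would
    # combine token.pos_, token.tag_ and token.dep_ to encode your grammar.
    return token.pos_ == "VERB"

def custom_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            # left_edge/right_edge cover the token's whole subtree, so the
            # chunk for a root verb can span the entire sentence.
            yield doc[token.left_edge.i : token.right_edge.i + 1]

doc = nlp("The little yellow dog will then walk to the Starbucks.")
for chunk in custom_chunks(doc):
    print(chunk.text)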

answered Nov 27 '22 by fenceop