I have this simple example of chunking in nltk.
My data:
data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'
...pre-processing ...
data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging
CHUNKING:
cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)
This returns, among other things, (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP), so it did what I wanted it to do.
Now my question: I want to switch to spacy for my projects. How would I do this in spacy?
I got as far as tagging it (the coarser .pos_ attribute will do for me):
import spacy
nlp = spacy.load('en_core_web_sm')
parsed_sent = nlp('The little yellow dog will then walk to the Starbucks, where')
def print_coarse_pos(token):
    print(token, token.pos_)

for sentence in parsed_sent.sents:
    for token in sentence:
        print_coarse_pos(token)
...which prints the tokens and their tags:
The DET
little ADJ
yellow ADJ
dog NOUN
will VERB
then ADV
walk VERB
...
How could I extract chunks with my own grammar?
spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).
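For instance, a rough Matcher analogue of the NLTK grammar above might look like the following sketch (assuming spaCy v3 and the en_core_web_sm model; note that, unlike the non-greedy <.*>*? in the regexp grammar, the Matcher wildcard is greedy and returns every matching span, so filter_spans is used to keep only the longest ones):

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Rough analogue of <VB><.*>*?<NNP>: a verb, any tokens, then a proper noun
pattern = [{'POS': 'VERB'}, {'OP': '*'}, {'POS': 'PROPN'}]
matcher.add('CUSTOMCHUNK', [pattern])

doc = nlp('The little yellow dog will then walk to the Starbucks, '
          'where he will introduce them to Michael.')
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in filter_spans(spans):  # drop overlapping sub-matches
    print(span.text)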
While NLTK exposes many alternative algorithms for each task, spaCy is opinionated and ships one well-tuned implementation per task. Its syntactic analysis is among the fastest available, and it also offers access to large word vectors that are easy to customize.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
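A quick way to see that pipeline in practice (a sketch assuming the small English model en_core_web_sm is installed):

import spacy

nlp = spacy.load('en_core_web_sm')  # a trained pipeline
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

doc = nlp('The little yellow dog will then walk to the Starbucks.')
print(doc[6].text, doc[6].pos_, doc[6].dep_)  # annotations added by the pipeline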
Copied verbatim from https://github.com/spacy-io/spaCy/issues/342
There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:
doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
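On the example sentence this should print noun phrases such as 'The little yellow dog', 'the Starbucks', 'he', 'them' and 'Michael' (the exact spans depend on the model).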
The basic way that this works is something like this:
for token in doc:
    if is_head_of_chunk(token):
        chunk_start = token.left_edge.i
        chunk_end = token.right_edge.i + 1
        yield doc[chunk_start : chunk_end]
You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out which labels to use: http://spacy.io/demos/displacy
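For completeness, here is a minimal runnable version of that sketch. The is_head_of_chunk predicate is hypothetical; this version simply treats every preposition as a chunk head, which pulls out spans like 'to the Starbucks':

import spacy

nlp = spacy.load('en_core_web_sm')

def is_head_of_chunk(token):
    # Hypothetical heuristic: treat every preposition as a chunk head.
    # Swap in whatever POS tags or dependency labels suit your grammar.
    return token.dep_ == 'prep'

def custom_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            # left_edge/right_edge cover the token's whole subtree
            yield doc[token.left_edge.i : token.right_edge.i + 1]

doc = nlp('The little yellow dog will then walk to the Starbucks, '
          'where he will introduce them to Michael.')
for chunk in custom_chunks(doc):
    print(chunk.text)  # e.g. 'to the Starbucks', 'to Michael'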