
Spacy Pipeline?

Tags:

python

nlp

spacy

So lately I've been playing around with a Wikipedia dump. I preprocessed it and trained a Word2Vec model on it with Gensim.

Does anyone know if there is a single script within spaCy that would generate tokenization, sentence recognition, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition all at once?

I have not been able to find clear documentation on this. Thank you!

Asked by Silas on Aug 17 '16


People also ask

What is spaCy pipeline?

Fundamentally, a spaCy pipeline package consists of three components: the weights, i.e. binary data loaded in from a directory, a pipeline of functions called in order, and language data like the tokenization rules and language-specific settings.
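For illustration, a minimal sketch of inspecting those pieces (this assumes spaCy v2 or later and that the en_core_web_sm package has been downloaded):

import spacy

# A minimal sketch, assuming spaCy v2+ and that en_core_web_sm is installed
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

# The pipeline of functions called in order after tokenization:
print(nlp.pipe_names)   # e.g. ['tagger', 'parser', 'ner'] (varies by version)
# The (name, component) pairs, i.e. the actual callables:
print(nlp.pipeline)
# Language data such as tokenization rules lives on nlp.Defaults and nlp.vocab.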

How do you load a spaCy pipeline?

To load a pipeline from a data directory, you can use spacy.load() with the local path. This will look for a config.cfg in the directory and use the lang and pipeline settings to initialize a Language class with a processing pipeline and load in the model data.
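A short sketch of loading from a local directory (the path below is hypothetical; this assumes spaCy v3+, where the directory contains a config.cfg, for example one written out earlier with nlp.to_disk()):

import spacy

# A minimal sketch, assuming spaCy v3+. The path below is hypothetical and
# must point at a directory containing config.cfg plus the model data
# (e.g. a directory created earlier with nlp.to_disk("/path/to/my_pipeline")).
nlp = spacy.load("/path/to/my_pipeline")

doc = nlp("Loading from a local directory works just like loading a package.")
print([(tok.text, tok.pos_) for tok in doc])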

Is spaCy or NLTK better?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.

What is NLP pipeline?

An NLP pipeline is the set of steps followed to build end-to-end NLP software. Before we start, we have to remember a few things: the pipeline is not universal, deep learning pipelines are slightly different, and the pipeline is non-linear.
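As a rough illustration (not tied to any particular library), such a pipeline can be thought of as a chain of steps, each feeding the next:

# A minimal, generic sketch of an NLP pipeline as chained steps.
# The function names are illustrative only, not from any specific library.
def clean(text):
    # text acquisition / cleanup
    return text.strip().lower()

def tokenize(text):
    # text preparation
    return text.split()

def extract_features(tokens):
    # feature engineering, e.g. simple term counts
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def run_pipeline(text):
    # steps are applied in order here, though in practice the flow is often non-linear
    return extract_features(tokenize(clean(text)))

print(run_pipeline("Hello, world. Here are two sentences."))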


1 Answer

spaCy gives you all of that just by using en_nlp = spacy.load('en'); doc = en_nlp(sentence). The documentation gives you details about how to access each of the elements.

An example is given below:

In [1]: import spacy
   ...: en_nlp = spacy.load('en')

In [2]: en_doc = en_nlp(u'Hello, world. Here are two sentences.')

Sentences can be obtained by using doc.sents:

In [4]: list(en_doc.sents)
Out[4]: [Hello, world., Here are two sentences.]

Noun chunks are given by doc.noun_chunks:

In [6]: list(en_doc.noun_chunks)
Out[6]: [two sentences]

Named entities are given by doc.ents:

In [11]: [(ent, ent.label_) for ent in en_doc.ents]
Out[11]: [(two, u'CARDINAL')]

Tokenization: you can iterate over the doc to get tokens; token.orth_ gives the string form of the token.

In [12]: [tok.orth_ for tok in en_doc]
Out[12]: [u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

The POS tag is given by token.tag_:

In [13]: [tok.tag_ for tok in en_doc]
Out[13]: [u'UH', u',', u'NN', u'.', u'RB', u'VBP', u'CD', u'NNS', u'.']

Lemmatization:

In [15]: [tok.lemma_ for tok in en_doc]
Out[15]: [u'hello', u',', u'world', u'.', u'here', u'be', u'two', u'sentence', u'.']

Dependency parsing: you can traverse the parse tree by using token.dep_, token.rights, or token.lefts. For example, you can print the dependencies:

In [19]: for token in en_doc:
    ...:     print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])
    ...:     
(u'Hello', u'ROOT', u'Hello', [], [u',', u'world', u'.'])
(u',', u'punct', u'Hello', [], [])
(u'world', u'npadvmod', u'Hello', [], [])
...

For more details, please consult the spaCy documentation.
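For reference, here is the same walkthrough condensed into a single script with current model names (a sketch, assuming en_core_web_sm is installed; in recent spaCy versions the 'en' shortcut used above has been replaced by full package names):

import spacy

# A consolidated sketch, assuming a modern spaCy model such as en_core_web_sm
# is installed; in spaCy v1.x the equivalent was spacy.load('en').
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world. Here are two sentences.")

print(list(doc.sents))                               # sentence segmentation
print(list(doc.noun_chunks))                         # noun chunks
print([(ent.text, ent.label_) for ent in doc.ents])  # named entity recognition
print([(t.text, t.tag_, t.lemma_, t.dep_, t.head.text) for t in doc])  # POS, lemma, dependencies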

Answered by CentAu on Oct 23 '22