
How to use spaCy efficiently on a large dataset of short sentences?

Tags:

python

nlp

spacy

I chose spaCy to process various kinds of text because of the performance of its lemmatization compared with NLTK. But when I process millions of short texts, it always consumes all of my memory (32 GB) and crashes. Without spaCy, the same job takes only a few minutes and uses less than 10 GB of memory.

Is something wrong with my usage of this method? Is there a better solution to improve the performance? Thanks!

import re
from string import punctuation

import spacy
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nlp = spacy.load('en')
stop_words = set(stopwords.words('english'))

def tokenizer(text):
    # sentence-split, then word-split, with NLTK
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    # drop stopwords, punctuation, and short tokens
    tokens = [t for t in tokens if t.lower() not in stop_words]
    tokens = [t for t in tokens if t not in punctuation]
    tokens = [t for t in tokens if len(t) > 4]
    # keep only tokens that contain at least one letter
    filtered_tokens = [t for t in tokens if re.search('[a-zA-Z]', t)]
    # run the full spaCy pipeline just to get the lemmas
    spacy_parsed = nlp(' '.join(filtered_tokens))
    return [token.lemma_ for token in spacy_parsed]

Dask parallel computing

import dask.dataframe as dd
from dask.multiprocessing import get

ddata = dd.from_pandas(res, npartitions=50)

def dask_tokenizer(df):
    # apply the tokenizer to every row of the partition
    df['text_token'] = df['text'].map(tokenizer)
    return df

# note: recent Dask versions use .compute(scheduler='processes') instead of get=get
%time res_final = ddata.map_partitions(dask_tokenizer).compute(get=get)

Info about spaCy

spaCy version      2.0.5          
Location           /opt/conda/lib/python3.6/site-packages/spacy
Platform           Linux-4.4.0-103-generic-x86_64-with-debian-stretch-sid
Python version     3.6.3          
Models             en, en_default 
asked Jan 11 '18 by Tony Wang

People also ask

How can I make my spaCy faster?

Disable components you aren't using. If you're using spaCy just for NER predictions, then time spent running the parser is wasted. Be sure to disable or avoid loading components you won't use; if you aren't using a component at all, you can avoid ever running it, as in the sketch below.
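
A minimal sketch of loading a model with unused components disabled (the model name en_core_web_sm is an assumption; substitute whichever English model you have installed):

import spacy

# skip the parser and tagger entirely if you only need NER
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger'])

doc = nlp('Apple is looking at buying a U.K. startup.')
print([(ent.text, ent.label_) for ent in doc.ents])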

Is spaCy or NLTK better?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.

How do you tokenize sentences with spaCy?

In the example below, we tokenize text with spaCy: import the library, load the English language model, and iterate over the tokens of the doc object, printing each one. [Out]: You only live once , but if you do it right , once is enough .
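
A minimal sketch of that example (assuming the en_core_web_sm English model is installed):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('You only live once, but if you do it right, once is enough.')

# each token, including punctuation, is a separate item in the doc
print(' '.join(token.text for token in doc))
# [Out]: You only live once , but if you do it right , once is enough .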

Is spaCy good for NLP?

spaCy is a free and open-source library for Natural Language Processing (NLP) in Python with a lot of in-built capabilities. It's becoming increasingly popular for processing and analyzing data in NLP.


2 Answers

You can use multithreading in spaCy to create a fast tokenization and data-ingestion pipeline.

Rewriting your code block and functionality using the nlp.pipe method would look something like this:

import spacy
nlp = spacy.load('en')

docs = df['text'].tolist()

def token_filter(token):
    # keep tokens that are not punctuation, whitespace, or stopwords,
    # and that are longer than 4 characters
    return not (token.is_punct or token.is_space or token.is_stop or len(token.text) <= 4)

filtered_tokens = []
for doc in nlp.pipe(docs):
    # lemmatize only the tokens that survive the filter
    tokens = [token.lemma_ for token in doc if token_filter(token)]
    filtered_tokens.append(tokens)

This puts all of your filtering into the token_filter function, which takes a spaCy token and returns True only if the token is not punctuation, not whitespace, not a stopword, and longer than 4 characters. You then apply this function to each token of each document, keeping the lemma only when the token meets all of those conditions. filtered_tokens ends up as a list of your tokenized documents.
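
If memory is still tight, nlp.pipe also accepts a batch_size argument that limits how many texts are buffered at once, and the result can be written straight back to the DataFrame (a usage sketch; the batch size shown is an arbitrary assumption):

filtered_tokens = []
# stream the documents through the pipeline in smaller batches
for doc in nlp.pipe(docs, batch_size=500):
    filtered_tokens.append([token.lemma_ for token in doc if token_filter(token)])

# mirror the column name used in the question
df['text_token'] = filtered_tokens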

Some helpful references for customizing this pipeline would be:

  • Token attributes: https://spacy.io/api/token#attributes
  • Language.pipe: https://spacy.io/api/language#pipe
answered Oct 25 '22 by pmbaumgartner


You should filter out tokens after parsing. That way the trained model will give better tagging (unless it was trained on text filtered in a similar way, which is unlikely). Filtering afterwards also makes it possible to use nlp.pipe, which is said to be fast. See the nlp.pipe example at http://spacy.io/usage/spacy-101#lightning-tour-multi-threaded, and the sketch below.
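
A minimal sketch of that order of operations, parsing first and filtering afterwards (assuming nlp is a loaded spaCy model and texts is a list of raw, unfiltered strings):

lemmas = []
# give the model natural, unfiltered sentences to tag
for doc in nlp.pipe(texts):
    # filter only after tagging and lemmatization have run
    lemmas.append([t.lemma_ for t in doc
                   if not (t.is_stop or t.is_punct or t.is_space)])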

answered Oct 25 '22 by adam.ra