Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NLTK tokenize - faster way?

I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dictionary of the frequency of each word.

import nltk
from collections import Counter

def freq(string):
    f = Counter()
    sentence_list = nltk.tokenize.sent_tokenize(string)
    for sentence in sentence_list:
        words = nltk.word_tokenize(sentence)
        words = [word.lower() for word in words]
        for word in words:
            f[word] += 1
    return f

I'm supposed to optimize the above code further to result in faster preprocessing time, and am unsure how to do so. The return value should obviously be exactly the same as the above, so I'm expected to use nltk though not explicitly required to do so.

Any way to speed up the above code? Thanks.

like image 569
user3280193 Avatar asked Jan 28 '17 16:01

user3280193


People also ask

How do you tokenize in NLTK?

NLTK contains a module called tokenize() which further classifies into two sub-categories: Word tokenize: We use the word_tokenize() method to split a sentence into tokens or words. Sentence tokenize: We use the sent_tokenize() method to split a document or paragraph into sentences.

What does word_tokenize () function in NLTK do?

word_tokenize() method, we are able to extract the tokens from string of characters by using tokenize. word_tokenize() method. It actually returns the syllables from a single word. A single word can contain one or two syllables.

How long does it take to tokenize?

How long does it take to tokenize assets. The technical aspect of tokenizing an asset is actually very quick, it can be done in minutes by a professional partner. However, the total process takes ~3 weeks to ~3 months, depending on many factors.


2 Answers

If you just want a flat list of tokens, note that word_tokenize would call sent_tokenize implicitly, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/init.py#L98

_treebank_word_tokenize = TreebankWordTokenizer().tokenize
def word_tokenize(text, language='english'):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    return [token for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]

Using brown corpus as an example, with Counter(word_tokenize(string_corpus)):

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> string_corpus = brown.raw() # Plaintext, str type.
>>> start = time.time(); fdist = Counter(word_tokenize(string_corpus)); end = time.time() - start
>>> end
12.662328958511353
>>> fdist.most_common(5)
[(u',', 116672), (u'/', 89031), (u'the/at', 62288), (u'.', 60646), (u'./', 48812)]
>>> sum(fdist.values())
1423314

~1.4 million words took 12 secs (without saving the tokenized corpus) on my machine with specs:

alvas@ubi:~$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 69
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
stepping    : 1
microcode   : 0x17
cpu MHz     : 1600.027
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2

$ cat /proc/meminfo
MemTotal:       12004468 kB

Saving the tokenized corpus first tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)], then using Counter(chain*(tokenized_corpus)):

>>> from itertools import chain
>>> start = time.time(); tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start
>>> end
16.421464920043945

Using ToktokTokenizer()

>>> from collections import Counter
>>> import time
>>> from itertools import chain
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> from nltk.tokenize import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> string_corpus = brown.raw()

>>> start = time.time(); tokenized_corpus = [toktok.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
10.00472116470337

Using MosesTokenizer():

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> start = time.time(); tokenized_corpus = [moses.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
30.783339023590088
>>> start = time.time(); tokenized_corpus = [moses.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
30.559681177139282

Why use MosesTokenizer

It was implemented in such a way that there is a way to reverse the tokens back to string, i.e. "detokenize".

>>> from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
>>> t, d = MosesTokenizer(), MosesDetokenizer()
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = [u'This', u'ain', u'&apos;t', u'funny.', u'It', u'&apos;s', u'actually', u'hillarious', u',', u'yet', u'double', u'Ls.', u'&#124;', u'&#91;', u'&#93;', u'&lt;', u'&gt;', u'&#91;', u'&#93;', u'&amp;', u'You', u'&apos;re', u'gonna', u'shake', u'it', u'off', u'?', u'Don', u'&apos;t', u'?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = t.tokenize(sent)
>>> tokens == expected_tokens
True
>>> detokens = d.detokenize(tokens)
>>> " ".join(detokens) == expected_detokens
True

Using ReppTokenizer():

>>> repp = ReppTokenizer('/home/alvas/repp')
>>> start = time.time(); sentences = sent_tokenize(string_corpus); tokenized_corpus = repp.tokenize_sents(sentences); fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start
>>> end
76.44129395484924

Why use ReppTokenizer?

It returns offset of the tokens from in the original string.

>>> sents = ['Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve.' ,
... 'But rule-based tokenizers are hard to maintain and their rules language specific.' ,
... 'We evaluated our method on three languages and obtained error rates of 0.27% (English), 0.35% (Dutch) and 0.76% (Italian) for our best models.'
... ]
>>> tokenizer = ReppTokenizer('/home/alvas/repp/') # doctest: +SKIP
>>> for sent in sents:                             # doctest: +SKIP
...     tokenizer.tokenize(sent)                   # doctest: +SKIP
... 
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents): 
...     print sent                               
... 
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents, keep_token_positions=True): 
...     print sent
... 
[(u'Tokenization', 0, 12), (u'is', 13, 15), (u'widely', 16, 22), (u'regarded', 23, 31), (u'as', 32, 34), (u'a', 35, 36), (u'solved', 37, 43), (u'problem', 44, 51), (u'due', 52, 55), (u'to', 56, 58), (u'the', 59, 62), (u'high', 63, 67), (u'accuracy', 68, 76), (u'that', 77, 81), (u'rulebased', 82, 91), (u'tokenizers', 92, 102), (u'achieve', 103, 110), (u'.', 110, 111)]
[(u'But', 0, 3), (u'rule-based', 4, 14), (u'tokenizers', 15, 25), (u'are', 26, 29), (u'hard', 30, 34), (u'to', 35, 37), (u'maintain', 38, 46), (u'and', 47, 50), (u'their', 51, 56), (u'rules', 57, 62), (u'language', 63, 71), (u'specific', 72, 80), (u'.', 80, 81)]
[(u'We', 0, 2), (u'evaluated', 3, 12), (u'our', 13, 16), (u'method', 17, 23), (u'on', 24, 26), (u'three', 27, 32), (u'languages', 33, 42), (u'and', 43, 46), (u'obtained', 47, 55), (u'error', 56, 61), (u'rates', 62, 67), (u'of', 68, 70), (u'0.27', 71, 75), (u'%', 75, 76), (u'(', 77, 78), (u'English', 78, 85), (u')', 85, 86), (u',', 86, 87), (u'0.35', 88, 92), (u'%', 92, 93), (u'(', 94, 95), (u'Dutch', 95, 100), (u')', 100, 101), (u'and', 102, 105), (u'0.76', 106, 110), (u'%', 110, 111), (u'(', 112, 113), (u'Italian', 113, 120), (u')', 120, 121), (u'for', 122, 125), (u'our', 126, 129), (u'best', 130, 134), (u'models', 135, 141), (u'.', 141, 142)]

TL;DR

Advantages of different tokenizers

  • word_tokenize() implicitly calls sent_tokenize()
  • ToktokTokenizer() is fastest
  • MosesTokenizer() is able to detokenize text
  • ReppTokenizer() is able to provide token offsets

Q: Is there a fast tokenizer that can detokenizer and also provides me with offsets and also do sentence tokenization in NLTK ?

A: I don't think so, try gensim or spacy.

like image 133
alvas Avatar answered Oct 07 '22 22:10

alvas


Unnecessary list creation is evil

Your code is implicitly creating a lot of potentially very long list instances which don't need to be there, for example:

words = [word.lower() for word in words]

Using the [...] syntax for list comprehension creates a list of length n for n tokens found in your input, but all you want to do is get the frequency of each token, not actually store them:

f[word] += 1

Therefore, you should use a generator instead:

words = (word.lower() for word in words)

Similarly, nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize both seem to produce lists as output, which is again unnecessary; Try to use a more low-level function, e.g. nltk.tokenize.api.StringTokenizer.span_tokenize, which merely generates an iterator that yields token offsets for your input stream, i.e. pairs of indices of your input string representing each token.

A better solution

Here is an example using no intermediate lists:

def freq(string):
    '''
    @param string: The string to get token counts for. Note that this should already have been normalized if you wish it to be so.
    @return: A new Counter instance representing the frequency of each token found in the input string.
    '''
    spans = nltk.tokenize.WhitespaceTokenizer().span_tokenize(string)   
    # Yield the relevant slice of the input string representing each individual token in the sequence
    tokens = (string[begin : end] for (begin, end) in spans)
    return Counter(tokens)

Disclaimer: I've not profiled this, so it's possible that e.g. the NLTK people have made word_tokenize blazingly fast but neglected span_tokenize; Always profile your application to be sure.

TL;DR

Don't use lists when generators will suffice: Every time you create a list just to throw it away after using it once, God kills a kitten.

like image 23
errantlinguist Avatar answered Oct 08 '22 00:10

errantlinguist