Consequences of abusing nltk's word_tokenize(sent)

Tags: python, nltk

I'm attempting to split a paragraph into words. I've got the lovely nltk.tokenize.word_tokenize(sent) on hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what could happen if you use it on a paragraph instead, i.e., at most 5 sentences? I've tried it on a few short paragraphs myself and it seems to work, but that's hardly conclusive proof.

asked Oct 15 '13 by Garrett Disco


2 Answers

nltk.tokenize.word_tokenize(text) is simply a thin wrapper that calls the tokenize method of a TreebankWordTokenizer instance, which uses a handful of simple regexes to parse a sentence.
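
You can verify the "thin wrapper" part yourself by calling the tokenizer class directly (assuming NLTK is installed):

>>> from nltk.tokenize import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize("Hello, world.")
['Hello', ',', 'world', '.']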

The documentation for that class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods -- apart from those at the end of a string -- are assumed to be part of the word they are attached to (e.g. for abbreviations, etc), and are not separately tokenized.

The underlying tokenize method itself is very simple:

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()
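
For instance, assuming the standard Treebank contraction list, the CONTRACTIONS2 patterns are what split fused forms like "cannot" into two tokens:

>>> nltk.tokenize.word_tokenize("No, I cannot.")
['No', ',', 'I', 'can', 'not', '.']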

Basically, what the method normally does is tokenize the period as a separate token if it falls at the end of the string:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']

Any periods that fall inside the string are kept as part of the word they're attached to, under the assumption that the word is an abbreviation:

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as that behavior is acceptable, you should be fine.
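
If it isn't, the usual workaround (a sketch, assuming the punkt sentence model has been downloaded with nltk.download('punkt')) is to segment the text with sent_tokenize first and then run word_tokenize on each sentence:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "Hello, world. How are you?"
>>> [w for s in sent_tokenize(text) for w in word_tokenize(s)]
['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']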

answered Oct 10 '22 by Michael0x2a


Try this sort of hack:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
# Add spaces around punctuation characters.
>>> for ch in sent:
...     if ch in punct:
...             sent = sent.replace(ch, " "+ch+" ")
# Remove the double spaces introduced by the padding above.
>>> sent = " ".join(sent.split())
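
Note that this pads every punctuation character, including hyphens, so hyphenated words get split as a side effect:

>>> sent.split()[:9]
['Mr', 'President', ',', 'Mr', 'President', '-', 'in', '-', 'Office']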

Then the following code is most probably what you need to count word frequencies too =)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print(i, fdist[i])
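
If you need this in more than one place, the same hack packs into a small helper (a sketch; the name pad_punct is made up):

>>> from string import punctuation
>>> def pad_punct(text):
...     # Pad each punctuation character that actually occurs in the text,
...     # then collapse the doubled spaces in one pass.
...     for ch in set(text) & set(punctuation):
...         text = text.replace(ch, " " + ch + " ")
...     return " ".join(text.split())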
answered Oct 10 '22 by alvas