I'm attempting to split a paragraph into words. I've got the lovely nltk.tokenize.word_tokenize(sent) on hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."
Does anyone know what could happen if you use it on a paragraph (at most five sentences) instead? I've tried it on a few short paragraphs myself and it seems to work, but that's hardly conclusive proof.
nltk.tokenize.word_tokenize(text) is simply a thin wrapper function that calls the tokenize method of an instance of the TreebankWordTokenizer class, which apparently uses simple regular expressions to tokenize a sentence.
The documentation for that class states that:
This tokenizer assumes that the text has already been segmented into sentences. Any periods -- apart from those at the end of a string -- are assumed to be part of the word they are attached to (e.g. for abbreviations, etc), and are not separately tokenized.
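For reference, you can also call that class directly instead of going through the wrapper; a minimal sketch, with a sample string of my own:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize("Hello, world.")
['Hello', ',', 'world', '.']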
The underlying tokenize method itself is very simple:
def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub(r'\. *(\n|$)', ' . ', text)

    return text.split()
Basically, the method only splits a period off as its own token when it falls at the end of the string:
>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']
Any periods that fall inside the string are kept as part of the word they are attached to, on the assumption that they mark an abbreviation:
>>> nltk.tokenize.word_tokenize("Hello, world. How are you?")
['Hello', ',', 'world.', 'How', 'are', 'you', '?']
As long as that behavior is acceptable, you should be fine.
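If it isn't, a common workaround (not something the docs prescribe, just a standard pattern) is to segment the text with nltk.sent_tokenize first and then tokenize each sentence separately; the sample paragraph here is my own:

>>> import nltk
>>> para = "Hello, world. How are you?"
>>> [w for s in nltk.sent_tokenize(para) for w in nltk.word_tokenize(s)]
['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']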
Try this sort of hack:
>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
# Pad every punctuation character with a space on each side.
>>> for ch in sent:
...     if ch in punct:
...         sent = sent.replace(ch, " " + ch + " ")
# Collapse the double spaces introduced by the padding above.
>>> sent = " ".join(sent.split())
Then most probably the following code is what you need to count word frequencies too =)
>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print(i, fdist[i])
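In recent NLTK versions, FreqDist is a subclass of collections.Counter, so if you only want the top entries you can also do:

>>> fdist.most_common(5)  # five most frequent (token, count) pairs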