Tokenize() in nltk.TweetTokenizer
returning the 32-bit integers by dividing them into digits. It is only happening to some certain numbers, and I don't see any reason why?
>>> from nltk.tokenize import TweetTokenizer
>>> tw = TweetTokenizer()
>>> tw.tokenize('the 23135851162 of 3151942776...')
[u'the', u'2313585116', u'2', u'of', u'3151942776', u'...']
The input 23135851162
has been split into [u'2313585116', u'2']
Interestingly, it seems to segment all numbers into 10 digits
>>> tw.tokenize('the 231358511621231245 of 3151942776...')
[u'the', u'2313585116', u'2123124', u'5', u'of', u'3151942776', u'...']
>>> tw.tokenize('the 231123123358511621231245 of 3151942776...')
[u'the', u'2311231233', u'5851162123', u'1245', u'of', u'3151942776', u'...']
The length of number token affects the tokenization:
>>> s = 'the 1234567890 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'90', u'of']
>>> s = 'the 123456789 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'9', u'of']
>>> s = 'the 12345678 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'of']
>>> s = 'the 1234567 of'
>>> tw.tokenize(s)
[u'the', u'1234567', u'of']
>>> s = 'the 123456 of'
>>> tw.tokenize(s)
[u'the', u'123456', u'of']
>>> s = 'the 12345 of'
>>> tw.tokenize(s)
[u'the', u'12345', u'of']
>>> s = 'the 1234 of'
>>> tw.tokenize(s)
[u'the', u'1234', u'of']
>>> s = 'the 123 of'
>>> tw.tokenize(s)
[u'the', u'123', u'of']
>>> s = 'the 12 of'
>>> tw.tokenize(s)
[u'the', u'12', u'of']
>>> s = 'the 1 of'
>>> tw.tokenize(s)
[u'the', u'1', u'of']
If contiguous digits + whitespace goes beyond length 10:
>>> s = 'the 123 456 78901234 of'
>>> tw.tokenize(s)
[u'the', u'123 456 7890', u'1234', u'of']
It seems to be a bug/feature of the TweetTokenizer()
which we're unsure what motivates this.
Read on to find out where the bug/feature occurs...
Looking at the tokenize()
function in TweetTokenizer, before the actual tokenizing, the tokenizer does some preprocessing:
First, it remove entities from text by converting them to their corresponding unicode character through the _replace_html_entities()
function
Optionally, it removes username handles using the remove_handles()
function.
Optionally, it normalize the word length through the reduce_lengthening function
Then, shortens the problematic sequences of characters using the HANG_RE
regex
Lastly, the actual tokenization takes place through the WORD_RE
regex
After the WORD_RE
regex, it
In code:
def tokenize(self, text):
"""
:param text: str
:rtype: list(str)
:return: a tokenized list of strings; concatenating this list returns\
the original string if `preserve_case=False`
"""
# Fix HTML character entities:
text = _replace_html_entities(text)
# Remove username handles
if self.strip_handles:
text = remove_handles(text)
# Normalize word lengthening
if self.reduce_len:
text = reduce_lengthening(text)
# Shorten problematic sequences of characters
safe_text = HANG_RE.sub(r'\1\1\1', text)
# Tokenize:
words = WORD_RE.findall(safe_text)
# Possibly alter the case, but avoid changing emoticons like :D into :d:
if not self.preserve_case:
words = list(map((lambda x : x if EMOTICON_RE.search(x) else
x.lower()), words))
return words
By default, the handle stripping and length reduction doesn't kick in, unless specified by user.
class TweetTokenizer:
r"""
Tokenizer for tweets.
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
Examples using `strip_handles` and `reduce_len parameters`:
>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
"""
def __init__(self, preserve_case=True, reduce_len=False, strip_handles=False):
self.preserve_case = preserve_case
self.reduce_len = reduce_len
self.strip_handles = strip_handles
Let's go through the steps and regexes:
>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'
Checked, _replace_html_entities()
isn't the culprit.
By default, remove_handles()
and reduce_lengthening()
is skipped but for sanity sake, let's see:
>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'
>>> from nltk.tokenize.casual import remove_handles, reduce_lengthening
>>> remove_handles(_replace_html_entities(s))
u'the 231358523423423421162 of 3151942776...'
>>> reduce_lengthening(remove_handles(_replace_html_entities(s)))
u'the 231358523423423421162 of 3151942776...'
Checked too, neither of the optional functions are behaving badly
>>> import re
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> HANG_RE = re.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', s)
'the 231358523423423421162 of 3151942776...'
Klar! The HANG_RE
is cleared of its name too
>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> WORD_RE = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> WORD_RE.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']
Achso! That's where the splits appear!
Now let's look deeper into the WORD_RE
, it's a tuple of regexes.
The first is a massive URL pattern regex from https://gist.github.com/winzig/8894715
Let's go through them one by one:
>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> patt.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:1]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
[]
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']
Ah ha! It seems like the 2nd regex from REGEXPS
is causing the problem!!
If we look at https://github.com/alvations/nltk/blob/develop/nltk/tokenize/casual.py#L122:
# The components of the tokenizer:
REGEXPS = (
URLS,
# Phone numbers:
r"""
(?:
(?: # (international)
\+?[01]
[\-\s.]*
)?
(?: # (area code)
[\(]?
\d{3}
[\-\s.\)]*
)?
\d{3} # exchange
[\-\s.]*
\d{4} # base
)"""
,
# ASCII Emoticons
EMOTICONS
,
# HTML tags:
r"""<[^>\s]+>"""
,
# ASCII Arrows
r"""[\-]+>|<[\-]+"""
,
# Twitter username:
r"""(?:@[\w_]+)"""
,
# Twitter hashtags:
r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# email addresses
r"""[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]"""
,
# Remaining word types:
r"""
(?:[^\W\d_](?:[^\W\d_]|['\-_])+[^\W\d_]) # Words with apostrophes or dashes.
|
(?:[+\-]?\d+[,/.:-]\d+[+\-]?) # Numbers, including fractions, decimals.
|
(?:[\w_]+) # Words without apostrophes or dashes.
|
(?:\.(?:\s*\.){1,}) # Ellipsis dots.
|
(?:\S) # Everything else that isn't whitespace.
"""
)
The second regex from REGEXP tries to parse numbers as phone-numbers:
# Phone numbers:
r"""
(?:
(?: # (international)
\+?[01]
[\-\s.]*
)?
(?: # (area code)
[\(]?
\d{3}
[\-\s.\)]*
)?
\d{3} # exchange
[\-\s.]*
\d{4} # base
)"""
The pattern tries to recognize
See https://regex101.com/r/BQpnsg/1 for a detailed explanation.
That's why it's trying to split contiguous digits up into 10 digits block!!
But note the quirks, since the phone number regex is hard coded, it is possible to catch real phone numbers in the \d{3}-d{3}-\d{4}
or \d{10}
patterns, but if the dashes are in other order, it won't work:
>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> s = '231-358-523423423421162'
>>> patt.findall(s)
['231-358-5234', '2342342116']
>>> s = '2313-58-523423423421162'
>>> patt.findall(s)
['5234234234']
Can we fix it?
See https://github.com/nltk/nltk/issues/1799
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With