tokenize() in nltk.TweetTokenizer splitting integers into chunks of digits

tokenize() in nltk.TweetTokenizer is splitting long integers into chunks of digits. It only happens to certain numbers, and I don't see any reason why.

>>> from nltk.tokenize import TweetTokenizer 
>>> tw = TweetTokenizer()
>>> tw.tokenize('the 23135851162 of 3151942776...')
[u'the', u'2313585116', u'2', u'of', u'3151942776', u'...']

The input 23135851162 has been split into [u'2313585116', u'2']

Interestingly, it seems to segment long numbers into chunks of at most 10 digits:

>>> tw.tokenize('the 231358511621231245 of 3151942776...')
[u'the', u'2313585116', u'2123124', u'5', u'of', u'3151942776', u'...']
>>> tw.tokenize('the 231123123358511621231245 of 3151942776...')
[u'the', u'2311231233', u'5851162123', u'1245', u'of', u'3151942776', u'...']

The length of the number token affects the tokenization:

>>> s = 'the 1234567890 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'90', u'of']
>>> s = 'the 123456789 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'9', u'of']
>>> s = 'the 12345678 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'of']
>>> s = 'the 1234567 of'
>>> tw.tokenize(s)
[u'the', u'1234567', u'of']
>>> s = 'the 123456 of'
>>> tw.tokenize(s)
[u'the', u'123456', u'of']
>>> s = 'the 12345 of'
>>> tw.tokenize(s)
[u'the', u'12345', u'of']
>>> s = 'the 1234 of'
>>> tw.tokenize(s)
[u'the', u'1234', u'of']
>>> s = 'the 123 of'
>>> tw.tokenize(s)
[u'the', u'123', u'of']
>>> s = 'the 12 of'
>>> tw.tokenize(s)
[u'the', u'12', u'of']
>>> s = 'the 1 of'
>>> tw.tokenize(s)
[u'the', u'1', u'of']

If a run of contiguous digits plus whitespace goes beyond length 10:

>>> s = 'the 123 456 78901234  of'
>>> tw.tokenize(s)
[u'the', u'123 456 7890', u'1234', u'of']

1 Answer

TL;DR

It seems to be a bug/feature of TweetTokenizer(), and it is unclear what motivates it.

Read on to find out where the bug/feature occurs...


In Long

Looking at the tokenize() function in TweetTokenizer, before the actual tokenizing, the tokenizer does some preprocessing:

  • First, it removes HTML entities from the text by converting them to their corresponding Unicode characters through the _replace_html_entities() function

  • Optionally, it removes username handles using the remove_handles() function.

  • Optionally, it normalizes word lengthening through the reduce_lengthening() function

  • Then, it shortens problematic sequences of repeated characters using the HANG_RE regex

  • Lastly, the actual tokenization takes place through the WORD_RE regex

After the WORD_RE regex, it

  • optionally preserves the case of emoticons before lowercasing the tokenized output

In code:

def tokenize(self, text):
    """
    :param text: str
    :rtype: list(str)
    :return: a tokenized list of strings; concatenating this list returns\
    the original string if `preserve_case=False`
    """
    # Fix HTML character entities:
    text = _replace_html_entities(text)
    # Remove username handles
    if self.strip_handles:
        text = remove_handles(text)
    # Normalize word lengthening
    if self.reduce_len:
        text = reduce_lengthening(text)
    # Shorten problematic sequences of characters
    safe_text = HANG_RE.sub(r'\1\1\1', text)
    # Tokenize:
    words = WORD_RE.findall(safe_text)
    # Possibly alter the case, but avoid changing emoticons like :D into :d:
    if not self.preserve_case:
        words = list(map((lambda x : x if EMOTICON_RE.search(x) else
                          x.lower()), words))
    return words

By default, handle stripping and length reduction don't kick in unless specified by the user.

class TweetTokenizer:
    r"""
    Tokenizer for tweets.

        >>> from nltk.tokenize import TweetTokenizer
        >>> tknzr = TweetTokenizer()
        >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
        >>> tknzr.tokenize(s0)
        ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

    Examples using `strip_handles` and `reduce_len parameters`:

        >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
        >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
        >>> tknzr.tokenize(s1)
        [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
    """

    def __init__(self, preserve_case=True, reduce_len=False, strip_handles=False):
        self.preserve_case = preserve_case
        self.reduce_len = reduce_len
        self.strip_handles = strip_handles

Let's go through the steps and regexes:

>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'

Checked, _replace_html_entities() isn't the culprit.

By default, remove_handles() and reduce_lengthening() are skipped, but for sanity's sake, let's see:

>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'
>>> from nltk.tokenize.casual import remove_handles, reduce_lengthening
>>> remove_handles(_replace_html_entities(s))
u'the 231358523423423421162 of 3151942776...'
>>> reduce_lengthening(remove_handles(_replace_html_entities(s)))
u'the 231358523423423421162 of 3151942776...'

Checked too; neither of the optional functions is behaving badly.

>>> import re
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> HANG_RE = re.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', s)
'the 231358523423423421162 of 3151942776...'

Klar! HANG_RE has cleared its name too.

>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> WORD_RE = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> WORD_RE.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']

Achso! That's where the splits appear!

Now let's look deeper into WORD_RE; it is built by joining REGEXPS, a tuple of regexes.

The first is a massive URL pattern regex from https://gist.github.com/winzig/8894715

Let's go through them one by one:

>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> patt.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:1]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
[]
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']

Ah ha! It seems like the 2nd regex from REGEXPS is causing the problem!!

If we look at https://github.com/alvations/nltk/blob/develop/nltk/tokenize/casual.py#L122:

# The components of the tokenizer:
REGEXPS = (
    URLS,
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )"""
    ,
    # ASCII Emoticons
    EMOTICONS
    ,
    # HTML tags:
    r"""<[^>\s]+>"""
    ,
    # ASCII Arrows
    r"""[\-]+>|<[\-]+"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # email addresses
    r"""[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]"""
    ,
    # Remaining word types:
    r"""
    (?:[^\W\d_](?:[^\W\d_]|['\-_])+[^\W\d_]) # Words with apostrophes or dashes.
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
    )

The second regex in REGEXPS tries to parse numbers as phone numbers:

# Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )"""

The pattern tries to recognize:

  • optionally, a leading 0 or 1 (possibly preceded by +) as the international code,
  • then 3 digits as the area code (possibly parenthesized),
  • optionally followed by a separator (dash, space, or dot),
  • then 3 more digits as the (telecom) exchange code,
  • another optional separator,
  • and lastly a 4-digit base number.

See https://regex101.com/r/BQpnsg/1 for a detailed explanation.
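
For instance, here is a minimal sketch of the phone-number regex in isolation (the same REGEXPS[1:2] slice as above; the formatted phone number is just a made-up example):

>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> phone_re = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> phone_re.findall('call +1 (231) 358-5234 now')  # a formatted number matches as one token
['+1 (231) 358-5234']
>>> phone_re.findall('23135851162')  # a bare run of digits is eaten 10 digits at a time
['2313585116']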

That's why it tries to split contiguous digits up into 10-digit blocks!!

But note the quirks: since the phone-number regex is hard-coded, it can catch real phone numbers in the \d{3}-\d{3}-\d{4} or \d{10} patterns, but if the dashes fall in other positions, it won't work:

>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> s = '231-358-523423423421162'
>>> patt.findall(s)
['231-358-5234', '2342342116']
>>> s = '2313-58-523423423421162'
>>> patt.findall(s)
['5234234234']

Can we fix it?

See https://github.com/nltk/nltk/issues/1799
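
Until that issue is resolved, one possible workaround is to rebuild the master regex without the phone-number pattern and monkey-patch it onto the casual module. This is a sketch, not an official API: it relies on the module-level WORD_RE used by the tokenize() code shown above, so it may break on NLTK versions that compile the regex differently.

import re
from nltk.tokenize import casual
from nltk.tokenize import TweetTokenizer

# Drop the phone-number pattern (REGEXPS[1]) and recompile the master regex
no_phone = casual.REGEXPS[:1] + casual.REGEXPS[2:]
casual.WORD_RE = re.compile(r"""(%s)""" % "|".join(no_phone),
                            re.VERBOSE | re.I | re.UNICODE)

tw = TweetTokenizer()
print(tw.tokenize('the 23135851162 of 3151942776...'))
# expected: ['the', '23135851162', 'of', '3151942776', '...']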
