
Modify python nltk.word_tokenize to exclude "#" as delimiter

I am using Python's NLTK library to tokenize my sentences.

If my code is

import nltk

text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, . and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + is not a delimiter, so that C++ comes out as a single token?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be considered as one token.

Poorva Rane asked Dec 16 '25 14:12



2 Answers

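Since this is really a multi-word tokenization problem, another approach is to retokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer. The expression is matched case-sensitively, so it has to be registered as ('C', '#') rather than ('c', '#'):

import nltk

tokens = nltk.word_tokenize("C# billion dollars; we don't own an ounce C++")
mwtokenizer = nltk.MWETokenizer(separator='')  # empty separator joins 'C' and '#' into 'C#'
mwtokenizer.add_mwe(('C', '#'))  # must match the token's case exactly
mwtokenizer.tokenize(tokens)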
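To make the merging step concrete, here is a minimal plain-Python sketch of the general idea behind multi-word-expression merging. It is not NLTK's implementation; the function name merge_expressions and its parameters are made up for illustration, and the token list is written out by hand, so nothing beyond the standard library is needed.

```python
def merge_expressions(tokens, expressions, separator=""):
    """Merge any run of tokens that exactly matches one of `expressions`.

    Hypothetical helper for illustration only; matching is exact and
    case-sensitive.
    """
    # Try longer expressions first so the longest match wins.
    ordered = sorted((tuple(e) for e in expressions), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expr in ordered:
            if expr and tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at this position; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Tokens as shown in the question's output.
tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't",
          'own', 'an', 'ounce', 'C++']
result = merge_expressions(tokens, [('C', '#'), ('F', '#')])
print(result)
```

Because matching is exact, a lowercase 'c' followed by '#' would be left as two tokens unless ('c', '#') is registered as well.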
AidinZadeh answered Dec 19 '25 05:12



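Another idea: instead of altering how the text is tokenized, just loop over the tokens and join every '#' with the token that precedes it.

from nltk import word_tokenize

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

# enumerate() keeps walking the original list, so i_offset tracks how many
# tokens have been merged away and shifts the index into the rebuilt list.
i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset
    if t == '#' and i > 0:
        left = tokens[:i-1]
        joined = [tokens[i - 1] + t]  # e.g. 'C' + '#' -> 'C#'
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1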

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
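The same joining step can also be sketched as a single pass that builds a new list, which avoids the index bookkeeping. This is only an illustrative variant; the function name join_hash and its symbol parameter are assumptions, not part of NLTK, and the token list is written out by hand.

```python
def join_hash(tokens, symbol="#"):
    """Attach every `symbol` token to the token before it (hypothetical helper)."""
    joined = []
    for tok in tokens:
        if tok == symbol and joined:
            # Glue the symbol onto the previously kept token.
            joined[-1] += tok
        else:
            # Ordinary token, or a leading symbol with nothing before it.
            joined.append(tok)
    return joined


# Tokens as shown in the question's output.
tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't",
          'own', 'an', 'ounce', 'C++']
merged = join_hash(tokens)
print(merged)
```

A '#' at the very start of the list has nothing to attach to, so it is left as its own token, which matches the `i > 0` check in the loop above.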
Alex answered Dec 19 '25 07:12



